Data Extracts

Data extracts consist of target table rows containing at least one cell that was found to be a failing test unit. Many of the validation methods check values down a column according to some rule (e.g., values are not null/None, values are greater than zero, etc.). If any of those test units (which are really cells) failed during a validation step, the row is marked as failing for the purposes of data extract collection.

Data extracts can be useful after interrogation since they reveal which rows resulted in failures during interrogation. It is hoped that having quick access to entire rows of data with failing elements can be useful in uncovering the root causes of data quality issues.
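The row-marking logic described above can be sketched in plain Python. This is a conceptual illustration only, not Pointblank's implementation: the helper name `collect_failing_rows` is hypothetical, and the rows and the `d < 3000` rule mirror the small_table example used later in this article.

```python
# Conceptual sketch: any row with at least one failing cell for the
# step's rule is kept for the data extract, along with its 1-indexed
# row number. `collect_failing_rows` is a hypothetical helper.
def collect_failing_rows(rows, rule):
    """Return (1-indexed row number, row) pairs where the rule fails."""
    return [(i, row) for i, row in enumerate(rows, start=1) if not rule(row)]

rows = [
    {"d": 3423.29},  # fails: not < 3000
    {"d": 9999.99},  # fails
    {"d": 2343.23},  # passes
    {"d": 3892.40},  # fails
]

failing = collect_failing_rows(rows, rule=lambda r: r["d"] < 3000)
print([i for i, _ in failing])  # → [1, 2, 4]
```

The extract keeps whole rows (not just the offending cells), which is what makes root-cause analysis possible.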

The Validation Methods that Work with Data Extracts

Row-based validation methods (e.g., the col_vals_*() family of column-value checks) will have rows extracted when there are failing test units.

An extracted row means that a test unit failed for that row in the validation step. The extracted rows are a subset of the original table and are useful for further analysis or understanding the nature of the failing test units.

Data Extracts as CSV Data in the Validation Report

Data extracts are embedded within validation report tables. Let’s look at an example, using the small_table dataset, where data extracts are collected in a single validation step due to failing test units:

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="d", value=3000)
    .interrogate()
)

validation
STEP              COLUMNS  VALUES  UNITS  PASS      FAIL      EXT
1  col_vals_lt()  d        3000    13     9 (0.69)  4 (0.31)  CSV

The single validation step checks whether values in d are less than 3000. Within that column, values range from 108.34 to 9999.99, so it makes sense that we see 4 failing test units in the FAIL column.
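The PASS and FAIL fractions shown in the report are simply the counts divided by the total number of test units. A quick check with the numbers from this example:

```python
# The report's PASS/FAIL fractions are counts over total test units
# (13 rows checked in column `d`, per the example above).
units, passed, failed = 13, 9, 4

assert passed + failed == units
print(round(passed / units, 2))  # → 0.69
print(round(failed / units, 2))  # → 0.31
```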

If you look at the far right of the validation report you’ll find there’s a CSV button. Pressing it initiates the download of a CSV, and that CSV contains the data extract for this validation step. The CSV button only appears when:

  1. there is a non-zero number of failing test units
  2. the validation step is based on the use of a row-based validation method (the methods outlined above)

Access to CSV data for the row-based failures is useful when the validation report is shared with other data quality stakeholders, since the data is easily accessible and doesn't require further use of Pointblank.
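If you want the same shareable artifact programmatically (rather than via the report's CSV button), extract rows can be written out with the standard library's csv module. This is a hedged sketch: the two illustrative rows below echo the small_table example, and only two columns are shown for brevity.

```python
import csv
import io

# Illustrative extract rows in the same shape as the small_table
# example (only _row_num_ and d shown here for brevity).
extract_rows = [
    {"_row_num_": 1, "d": 3423.29},
    {"_row_num_": 2, "d": 9999.99},
]

# Write the failing rows to CSV so they can be shared with
# stakeholders who don't use Pointblank.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["_row_num_", "d"])
writer.writeheader()
writer.writerows(extract_rows)

print(buf.getvalue().splitlines()[0])  # → _row_num_,d
```

In practice you would write to a file (e.g., `open("extract.csv", "w", newline="")`) instead of an in-memory buffer.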

Collecting Data Extracts with the get_data_extracts() Method

Aside from the low-tech CSV buttons in validation report tables, we can more directly pull out the data extracts from the validation object created above. We do that with the get_data_extracts() method, supplying the step number (1) to the i= parameter:

extract_1 = validation.get_data_extracts(i=1, frame=True)

extract_1
shape: (4, 9)

_row_num_  date_time            date        a    b            c     d        e      f
u32        datetime[μs]         date        i64  str          i64   f64      bool   str
1          2016-01-04 11:00:00  2016-01-04  2    "1-bcd-345"  3     3423.29  true   "high"
2          2016-01-04 00:32:00  2016-01-04  3    "5-egh-163"  8     9999.99  true   "low"
4          2016-01-06 17:23:00  2016-01-06  2    "5-jdo-903"  null  3892.4   false  "mid"
6          2016-01-11 06:15:00  2016-01-11  4    "2-dhe-923"  4     3291.03  true   "mid"

The extracted table is of the same type (a Polars DataFrame) as the target table. We used load_dataset() with the tbl_type="polars" option to fetch the dataset in that form.

Notice that the frame=True option was used above. This makes the method return the table itself; normally the return type is a dictionary of extracts keyed by step number. This only works when what's provided to i= is a scalar integer (which is the case here).
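The relationship between the default dictionary return and frame=True can be sketched as follows. This is an assumption-laden illustration, not Pointblank's internals: `extracts` and `get_extract` are hypothetical stand-ins.

```python
# Sketch of the frame=True behavior: extracts live in a dict keyed by
# step number; frame=True unwraps the single table when i= is a scalar
# integer. `extracts` and `get_extract` are hypothetical names.
extracts = {1: "<table for step 1>"}

def get_extract(i, frame=False):
    if frame:
        if not isinstance(i, int):
            raise ValueError("frame=True requires a scalar integer for i=")
        return extracts[i]           # the table itself
    return {i: extracts[i]}         # default: dictionary keyed by step

print(get_extract(1))               # → {1: '<table for step 1>'}
print(get_extract(1, frame=True))   # → <table for step 1>
```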

Also notice that the returned DataFrame contains all the columns of the original dataset (i.e., not just the column being checked in the validation step) plus an additional column: _row_num_. That column provides the 1-indexed row numbers from the original dataset. Having entire rows along with their row numbers provides more context on where the data failures occurred.

Viewing Data Extracts with preview()

To get a consistent HTML representation of any data extract (regardless of the table type), we can use the preview() function:

pb.preview(data=extract_1)
Polars (4 rows, 9 columns)

   date_time            date        a      b          c      d        e        f
   Datetime             Date        Int64  String     Int64  Float64  Boolean  String
1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3      3423.29  True     high
2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8      9999.99  True     low
4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None   3892.4   False    mid
6  2016-01-11 06:15:00  2016-01-11  4      2-dhe-923  4      3291.03  True     mid

The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the _row_num_ column is now part of the table stub and doesn’t steal focus from the table’s original columns.

The preview() function is designed to provide the head and tail (5 rows each) of the table so very large extracts won’t overflow the display.
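The head-and-tail trimming that preview() performs can be sketched in a few lines. The helper name `head_tail` is hypothetical; the 5-row defaults match the behavior described above.

```python
# Sketch of head-and-tail trimming like preview() describes: keep the
# first and last 5 rows so very large extracts stay readable.
def head_tail(rows, n_head=5, n_tail=5):
    if len(rows) <= n_head + n_tail:
        return rows                       # small tables shown in full
    return rows[:n_head] + rows[-n_tail:]

rows = list(range(1, 21))                 # 20 stand-in rows
print(head_tail(rows))  # → [1, 2, 3, 4, 5, 16, 17, 18, 19, 20]
```

A four-row extract like `extract_1` falls under the 10-row threshold, so it is displayed in full.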