Data Extracts

To follow along, here is a validation of the `small_table` dataset that produces failing test units in a single validation step (it will be referenced throughout this article):

```python
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="d", value=3000)
    .interrogate()
)

validation
```
Data extracts consist of target table rows containing at least one cell that was found to be a failing test unit. Many of the validation methods check values down a column according to some rule (e.g., values are not null/None, values are greater than zero, etc.). So if any of those test units (which are really cells) failed during a validation step, the row is marked as failing for the purposes of data extract collection. This article will:
- describe which validation methods collect data extracts
- provide information on how you can access data extracts
Data extracts are useful after interrogation since they reveal which rows produced failures. Quick access to the entire rows containing failing values can help uncover the root causes of data quality issues.
The Validation Methods that Work with Data Extracts
The following validation methods are row-based and will have rows extracted when there are failing test units:
- `col_vals_gt()`
- `col_vals_lt()`
- `col_vals_ge()`
- `col_vals_le()`
- `col_vals_eq()`
- `col_vals_ne()`
- `col_vals_between()`
- `col_vals_outside()`
- `col_vals_in_set()`
- `col_vals_not_in_set()`
- `col_vals_null()`
- `col_vals_not_null()`
- `col_vals_regex()`
An extracted row means that a test unit failed for that row in the validation step. The extracted rows are a subset of the original table and are useful for further analysis or understanding the nature of the failing test units.
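Conceptually, this row-based collection can be sketched in plain Python. The helper below (`extract_failing_rows()` is a hypothetical name, not part of Pointblank) keeps each row whose checked cell fails the rule, tagging it with its 1-indexed position in the original table:

```python
def extract_failing_rows(rows, column, predicate):
    """Keep each row whose value in `column` fails `predicate`,
    tagging it with its 1-indexed position in the original table."""
    return [
        {"_row_num_": i, **row}
        for i, row in enumerate(rows, start=1)
        if not predicate(row[column])
    ]

# Tiny stand-in table; the rule mirrors col_vals_lt(columns="d", value=3000)
table = [
    {"a": 2, "d": 3423.29},
    {"a": 3, "d": 108.34},
    {"a": 6, "d": 9999.99},
]

# Rows 1 and 3 fail the "d < 3000" rule, so only they are extracted
extract = extract_failing_rows(table, "d", lambda v: v < 3000)
```

Note that the whole row is kept, not just the failing cell: that is what makes the extract useful for diagnosing why a value went wrong.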
Data Extracts as CSV Data in the Validation Report

Data extracts are embedded within validation report tables. Let's look at an example using the `small_table` dataset: in the validation created at the start of this article, data extracts are collected in a single validation step due to failing test units.

The single validation step checks whether values in `d` are less than `3000`. Values in that column range from `108.34` to `9999.99`, so it makes sense that we see 4 failing test units in the `FAIL` column.
If you look at the far right of the validation report you'll find there's a CSV button. Pressing it initiates the download of a CSV file containing the data extract for that validation step. The CSV button only appears when:

- there is a non-zero number of failing test units
- the validation step is based on a row-based validation method (one of the methods outlined above)

Access to CSV data for the row-based errors is useful when the validation report is shared with other data quality stakeholders, since the data is easily accessible and doesn't require further use of Pointblank.
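The same CSV can also be produced programmatically. As a minimal sketch using only the standard library, an extract represented as a list of dicts can be serialized like this (the helper name `extract_to_csv()` and the sample rows are illustrative, not part of Pointblank):

```python
import csv
import io

# A data extract represented as plain dicts (stand-in for the real extract)
extract = [
    {"_row_num_": 1, "d": 3423.29},
    {"_row_num_": 2, "d": 9999.99},
]

def extract_to_csv(rows):
    """Serialize an extract to CSV text, header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = extract_to_csv(extract)
```

If you have the extract as a Polars DataFrame (as later sections show), Polars' own `write_csv()` method achieves the same end.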
Using the `get_data_extracts()` Method

Aside from the low-tech CSV buttons in validation report tables, we can more directly pull out the data extracts from the validation object created above. We do that with the `get_data_extracts()` method, supplying the step number (`1`) to the `i=` parameter:
```python
extract_1 = validation.get_data_extracts(i=1, frame=True)
extract_1
```
| _row_num_ | date_time | date | a | b | c | d | e | f |
|---|---|---|---|---|---|---|---|---|
| u32 | datetime[μs] | date | i64 | str | i64 | f64 | bool | str |
| 1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | "1-bcd-345" | 3 | 3423.29 | true | "high" |
| 2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | "5-egh-163" | 8 | 9999.99 | true | "low" |
| 4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | "5-jdo-903" | null | 3892.4 | false | "mid" |
| 6 | 2016-01-11 06:15:00 | 2016-01-11 | 4 | "2-dhe-923" | 4 | 3291.03 | true | "mid" |
The extracted table is of the same type (a Polars DataFrame) as the target table. We used `load_dataset()` with the `tbl_type="polars"` option to fetch the dataset in that form.

Notice that the `frame=True` option was used above. It makes the method return the table itself; normally the return type is a dictionary of extracts keyed by step number. This only works when the value provided to `i=` is a scalar integer (which is the case here).

Also notice that the returned DataFrame contains all the columns of the original dataset (i.e., not just the column being checked in the validation step) plus an additional column: `_row_num_`. That column provides the 1-indexed row numbers from the original dataset. Having the rows in their entirety along with their row numbers provides more context on where the data failures occurred.
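The `i=`/`frame=` semantics described above can be sketched in plain Python. This is a simplified stand-in, not Pointblank's actual implementation; `extracts` plays the role of the extracts stored on a validation object:

```python
def get_extracts_sketch(extracts, i=None, frame=False):
    """Mimic the described semantics: return a dict of extracts by default,
    or a single table when frame=True and i= is a scalar integer step number."""
    if frame:
        if not isinstance(i, int):
            raise ValueError("frame=True requires a scalar integer for i=")
        return extracts[i]
    if i is None:
        return dict(extracts)  # all steps
    steps = [i] if isinstance(i, int) else list(i)
    return {step: extracts[step] for step in steps}

# Stand-in extracts for a two-step validation
stored = {1: "extract for step 1", 2: "extract for step 2"}

get_extracts_sketch(stored, i=1, frame=True)  # the table itself
get_extracts_sketch(stored, i=1)              # still a dict: {1: ...}
```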
Viewing Data Extracts with `preview()`

To get a consistent HTML representation of any data extract (regardless of the table type), we can use the `preview()` function:

```python
pb.preview(data=extract_1)
```
(Preview of the Polars extract: 4 rows, 9 columns.)
The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the `_row_num_` column is now part of the table stub and doesn't steal focus from the table's original columns.

The `preview()` function is designed to provide the head and tail (5 rows each) of the table, so very large extracts won't overflow the display.