Working with Data Extracts
When validating data, identifying exactly which rows failed is critical for diagnosing and resolving data quality issues. This is where data extracts come in. Data extracts consist of target table rows containing at least one cell that failed validation. While the validation report provides an overview of pass/fail statistics, data extracts give you the actual problematic records for deeper investigation.
This article will cover:
- which validation methods collect data extracts
- multiple ways to access and work with data extracts
- practical examples of using extracts for data quality improvement
- advanced techniques for analyzing extract patterns
Understanding Data Extracts
When Pointblank validates data, it evaluates individual values (test units) according to specific rules. For example, when using col_vals_gt(columns="value", value=0), each value in the value column is a test unit that either passes or fails the "greater than zero" test.
When any test unit fails in a row, that entire row is marked for extraction. This approach gives you complete context around failing values rather than just the problematic cells in isolation.
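As a minimal sketch of this behavior, consider a small made-up table where only the value column is validated; the extract still carries every column of the failing row (the table and values here are purely illustrative):

import pointblank as pb
import polars as pl

# A tiny table where only the `value` column is validated
tbl = pl.DataFrame({"id": [1, 2, 3], "value": [10, -5, 7]})

demo = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="value", value=0)
    .interrogate()
)

# The extract holds the complete failing row (id=2), not just the failing cell
print(demo.get_data_extracts(i=1, frame=True))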
The Validation Methods that Work with Data Extracts
The following validation methods are row-based and will have rows extracted when there are failing test units:
- col_vals_gt()
- col_vals_lt()
- col_vals_ge()
- col_vals_le()
- col_vals_eq()
- col_vals_ne()
- col_vals_between()
- col_vals_outside()
- col_vals_in_set()
- col_vals_not_in_set()
- col_vals_null()
- col_vals_not_null()
- col_vals_regex()
- rows_distinct()
- rows_complete()
- conjointly()
Note that some validation methods like col_exists() or col_schema_match() don't generate row-based data extracts because they validate structural aspects of the data rather than row values.
Accessing Data Extracts
There are three primary ways to access data extracts in Pointblank:
- CSV buttons in validation reports
- the get_data_extracts() method
- direct access via step reports
Let’s explore each approach using examples.
Data Extracts as CSV Data in the Validation Report
Data extracts are embedded within validation report tables. Let's look at an example, using the small_table dataset, where data extracts are collected in a single validation step due to failing test units:
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(
        columns="d",
        value=3000
    )
    .interrogate()
)

validation
The single validation step checks whether values in d are less than 3000. Within that column, values range from 108.34 to 9999.99, so it makes sense that we can see 4 failing test units in the FAIL column.
If you look at the far right of the validation report you'll find there's a CSV button. Pressing it initiates the download of a CSV file, and that file contains the data extract for this validation step. The CSV button only appears when:
- there is a non-zero number of failing test units
- the validation step is based on the use of a row-based validation method (the methods outlined above)
Access to CSV data for the row-based errors is useful when the validation report is shared with other data quality stakeholders, since it is easily accessible and doesn’t require further use of Pointblank. The stakeholder can simply open the downloaded CSV in their preferred spreadsheet software, import it into a different analysis environment like R or Julia, or process it with any tool that supports CSV files. This cross-platform compatibility makes the CSV export particularly valuable in mixed-language data teams where different members might be working with different tools.
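For instance, a downloaded extract can be read straight back into an analysis session. Here is a minimal sketch in Polars, assuming a hypothetical filename for the downloaded file:

import polars as pl

# "extract_0001.csv" is a placeholder: use the actual downloaded filename
extract_from_csv = pl.read_csv("extract_0001.csv")
print(extract_from_csv.head())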
Using the get_data_extracts() Method
For programmatic access to data extracts, Pointblank provides the get_data_extracts() method. This allows you to work with extract data directly in your Python workflow:
# Get data extracts from step 1
extract_1 = validation.get_data_extracts(i=1, frame=True)

extract_1
_row_num_ | date_time | date | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|
u32 | datetime[μs] | date | i64 | str | i64 | f64 | bool | str |
1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | "1-bcd-345" | 3 | 3423.29 | true | "high" |
2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | "5-egh-163" | 8 | 9999.99 | true | "low" |
4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | "5-jdo-903" | null | 3892.4 | false | "mid" |
6 | 2016-01-11 06:15:00 | 2016-01-11 | 4 | "2-dhe-923" | 4 | 3291.03 | true | "mid" |
The extracted table is of the same type (a Polars DataFrame) as the target table. Previously we used load_dataset() with the tbl_type="polars" option to fetch the dataset in that form.
Note these important details about using get_data_extracts():
- the parameter i=1 corresponds to the step number shown in the validation report (1-indexed, not 0-indexed)
- setting frame=True returns the data as a DataFrame rather than a dictionary (only works when i is a single integer; see the sketch after this list)
- the extract includes all columns from the original data, not just the column being validated
- an additional _row_num_ column is added to identify the original row positions
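As a quick sketch of the dictionary form mentioned above (the assumption here, not confirmed by the report itself, is that the dictionary is keyed by step number):

# Omitting `frame=True` returns a dictionary instead of a DataFrame
extracts_dict = validation.get_data_extracts(i=1)
print(type(extracts_dict))  # dict
print(extracts_dict[1])     # assumed: the step 1 extract, keyed by step number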
Direct Access via Step Reports
Step reports provide another way to access and visualize failing data. When you generate a step report for a validation step that has failing rows, those failing rows are displayed directly in the report:
# Get a step report for the first validation step
step_report = validation.get_step_report(i=1)

step_report
[Step report for Validation Step 1: 4 / 13 test unit failures in column 6; extract of all 4 rows, with failing test units shown in red]
Step reports offer several advantages for working with data extracts as they:
- provide immediate visual context by highlighting the specific column being validated
- format the data for better readability, especially useful when sharing results with colleagues
- include additional metadata about the validation step and failure statistics
For steps with many failures, you can customize how many rows to display:
# Limit to just 2 rows of failing data
limited_report = validation.get_step_report(i=1, limit=2)

limited_report
[Step report for Validation Step 1: 4 / 13 test unit failures in column 6; extract of the first 2 rows, with failing test units shown in red]
Step reports are particularly valuable when you want to quickly inspect the failing data without extracting it into a separate DataFrame. They provide a bridge between the high-level validation report and the detailed data extracts.
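One possible way to pass a step report along on its own is to write it out as an HTML file. This is a sketch that assumes the report object exposes Great Tables' as_raw_html() method:

# Assumption: step reports are Great Tables objects with `as_raw_html()`
html = step_report.as_raw_html()

with open("step_1_report.html", "w") as f:
    f.write(html)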
Viewing Data Extracts with preview()
To get a consistent HTML representation of any data extract (regardless of the table type), we can use the preview() function:
pb.preview(data=extract_1)
[Preview of the extract: a Polars table with 4 rows and 9 columns]
The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the _row_num_ column is now part of the table stub and doesn't steal focus from the table's original columns.
The preview() function is designed to provide the head and tail (5 rows each) of the table, so very large extracts won't overflow the display.
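If you need more context from each end of a large extract, the head and tail sizes can likely be adjusted. A sketch, assuming n_head= and n_tail= options on preview():

# Assumption: `n_head=` and `n_tail=` control how many rows show from each end
pb.preview(data=extract_1, n_head=3, n_tail=3)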
Working with Multiple Validation Steps
When validating data with multiple steps, you can extract failing rows from any step:
# Create a validation with multiple steps
multi_validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=3)                                 # Step 1
    .col_vals_lt(columns="d", value=3000)                              # Step 2
    .col_vals_regex(columns="b", pattern="^[0-9]-[a-z]{3}-[0-9]{3}$")  # Step 3
    .interrogate()
)

multi_validation
You can then access extracts from any specific step:
# Get extracts from step 2 (`d < 3000` validation)
less_than_failures = multi_validation.get_data_extracts(i=2, frame=True)

less_than_failures
_row_num_ | date_time | date | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|
u32 | datetime[μs] | date | i64 | str | i64 | f64 | bool | str |
1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | "1-bcd-345" | 3 | 3423.29 | true | "high" |
2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | "5-egh-163" | 8 | 9999.99 | true | "low" |
4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | "5-jdo-903" | null | 3892.4 | false | "mid" |
6 | 2016-01-11 06:15:00 | 2016-01-11 | 4 | "2-dhe-923" | 4 | 3291.03 | true | "mid" |
If a step has no failing rows, an empty DataFrame will be returned:
# Get extracts from step 3 (regex check)
regex_failures = multi_validation.get_data_extracts(i=3, frame=True)

regex_failures
_row_num_ | date_time | date | a | b | c | d | e | f |
---|---|---|---|---|---|---|---|---|
u32 | datetime[μs] | date | i64 | str | i64 | f64 | bool | str |
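Since extracts can be empty, it's worth guarding any follow-up analysis. A minimal sketch using Polars' is_empty():

# Skip downstream analysis when a step produced no failing rows
if regex_failures.is_empty():
    print("Step 3 produced no failing rows: nothing to investigate")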
Analyzing Patterns in Failing Data
Once you have data extracts, you can perform additional analysis to identify patterns in the failing data:
# Get extracts from step 2
d_value_failures = multi_validation.get_data_extracts(i=2, frame=True)

# Basic statistical analysis of the failing values
if len(d_value_failures) > 0:
    print(f"Min failing value: {d_value_failures['d'].min()}")
    print(f"Max failing value: {d_value_failures['d'].max()}")
    print(f"Mean failing value: {d_value_failures['d'].mean()}")
Min failing value: 3291.03
Max failing value: 9999.99
Mean failing value: 5151.6775
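For a fuller summary in a single call, Polars' describe() can be applied to the failing column; a small optional addition to the snippet above:

# One-call summary statistics for the failing `d` values
print(d_value_failures["d"].describe())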
Combining Extracts from Multiple Steps
You might want to find rows that failed multiple validation checks:
# Get row numbers from each extract
step1_rows = set(multi_validation.get_data_extracts(i=1, frame=True)["_row_num_"])
step2_rows = set(multi_validation.get_data_extracts(i=2, frame=True)["_row_num_"])

# Find rows that failed both validations
common_failures = step1_rows.intersection(step2_rows)
print(f"Rows failing both step 1 and step 2: {common_failures}")
Rows failing both step 1 and step 2: {1, 2, 4}
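From there, the overlapping row numbers can be used to pull the complete rows back out of an extract. A sketch using Polars filtering:

import polars as pl

# Recover the full rows for the overlapping failures from the step 2 extract
step2_extract = multi_validation.get_data_extracts(i=2, frame=True)
overlap_rows = step2_extract.filter(pl.col("_row_num_").is_in(list(common_failures)))
print(overlap_rows)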
Using Extracts for Data Quality Improvement
Data extracts are especially valuable for:
- Root Cause Analysis: examining the full context of failing rows to understand why they failed
- Data Cleaning: creating targeted cleanup scripts that focus only on problematic records
- Feedback Loops: sharing specific examples with data providers to improve upstream quality
- Pattern Recognition: identifying systemic issues by analyzing groups of failing records
Here’s an example of using extracts to create a corrective action plan:
import polars as pl
# Create a new sample of an extract DF
sample_extract = pl.DataFrame({
    "id": range(1, 11),
    "value": [3500, 4200, 3800, 9800, 5500, 7200, 8300, 4100, 7600, 3200],
    "category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "B"],
    "region": ["South", "South", "North", "East", "South", "South", "East", "South", "West", "South"]
})

# Identify which regions have the most failures
region_counts = (
    sample_extract
    .group_by("region")
    .agg(pl.len().alias("failure_count"))
    .sort("failure_count", descending=True)
)
region_counts
region | failure_count |
---|---|
str | u32 |
"South" | 6 |
"East" | 2 |
"West" | 1 |
"North" | 1 |
Analysis shows that 6 out of 10 failing records (60%) are from the "South" region, making it the highest priority area for data quality investigation. This suggests a potential systemic issue with data collection or processing in that specific region.
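If you'd rather compute those shares than read them off the counts, a small follow-on sketch (the pct_of_failures column name is illustrative):

# Express each region's failures as a share of all failing records
region_shares = region_counts.with_columns(
    (pl.col("failure_count") / pl.col("failure_count").sum() * 100).alias("pct_of_failures")
)
region_shares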
Best Practices for Working with Data Extracts
When incorporating data extracts into your data quality workflow:
- Use extracts for investigation, not just reporting: the real value is in the insights you gain from analyzing the problematic data
- Combine with other Pointblank features: data extracts work well with step reports and can inform threshold settings for future validations
- Consider sampling for very large datasets: if your extracts contain thousands of rows, focus your investigation on a representative sample (see the sketch after this list)
- Look beyond individual validation steps: cross-reference extracts from different steps to identify complex issues that span multiple validation rules
- Document patterns in failing data: record and share insights about common failure modes to build organizational knowledge about data quality issues
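For the sampling suggestion above, a minimal sketch using Polars' sample() (the cap of 100 rows is an arbitrary choice):

# Investigate a representative sample rather than the full extract
big_extract = multi_validation.get_data_extracts(i=2, frame=True)
extract_sample = big_extract.sample(n=min(100, len(big_extract)), seed=42)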
By integrating these practices into your data validation workflow, you’ll transform data extracts from simple error lists into powerful diagnostic tools. The most successful data quality initiatives treat extracts as the starting point for investigation rather than the end result of validation. When systematically analyzed and documented, patterns in failing data can reveal underlying issues in data systems, collection methods, or business processes that might otherwise remain hidden. Remember that the ultimate goal isn’t just to identify problematic records, but to use that information to implement targeted improvements that prevent similar issues from occurring in the future.
Conclusion
Data extracts bridge the gap between high-level validation statistics and the detailed context needed to fix data quality issues. By providing access to the actual failing records, Pointblank enables you to:
- pinpoint exactly which data points caused validation failures
- understand the full context around problematic values
- develop targeted strategies for data cleanup and quality improvement
- communicate specific examples to stakeholders
Whether you're accessing extracts through CSV downloads, the get_data_extracts() method, or step reports, this feature provides the detail needed to move from identifying problems to implementing solutions.