Working with Data Extracts

When validating data, identifying exactly which rows failed is critical for diagnosing and resolving data quality issues. This is where data extracts come in. Data extracts consist of target table rows containing at least one cell that failed validation. While the validation report provides an overview of pass/fail statistics, data extracts give you the actual problematic records for deeper investigation.

This article will cover:

  • what data extracts are and which validation methods generate them
  • three ways to access extracts: CSV buttons, the get_data_extracts() method, and step reports
  • viewing extracts with preview()
  • working with extracts across multiple validation steps
  • analyzing failure patterns and using extracts for data quality improvement

Understanding Data Extracts

When Pointblank validates data, it evaluates individual values (test units) according to specific rules. For example, when using col_vals_gt(columns="value", value=0), each value in the value column is a test unit that either passes or fails the ‘greater than zero’ test.

When any test unit fails in a row, that entire row is marked for extraction. This approach gives you complete context around failing values rather than just the problematic cells in isolation.

The Validation Methods that Work with Data Extracts

The following validation methods are row-based and will have rows extracted when there are failing test units:

  • col_vals_gt() and col_vals_ge()
  • col_vals_lt() and col_vals_le()
  • col_vals_eq() and col_vals_ne()
  • col_vals_between() and col_vals_outside()
  • col_vals_in_set() and col_vals_not_in_set()
  • col_vals_null() and col_vals_not_null()
  • col_vals_regex()
  • rows_distinct()
  • rows_complete()

Note that some validation methods like col_exists() or col_schema_match() don’t generate row-based data extracts because they validate structural aspects of the data rather than row values.

Accessing Data Extracts

There are three primary ways to access data extracts in Pointblank:

  1. CSV buttons in validation reports
  2. the get_data_extracts() method
  3. direct access via step reports

Let’s explore each approach using examples.

Data Extracts as CSV Data in the Validation Report

Data extracts are embedded within validation report tables. Let’s look at an example, using the small_table dataset, where data extracts are collected in a single validation step due to failing test units:

import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(
        columns="d",
        value=3000
    )
    .interrogate()
)

validation
STEP               COLUMNS   VALUES   UNITS   PASS       FAIL       EXT
1  col_vals_lt()   d         3000     13      9 (0.69)   4 (0.31)   CSV

The single validation step checks whether values in d are less than 3000. Within that column, values range from 108.34 to 9999.99, so the 4 failing test units shown in the FAIL column are expected.

If you look at the far right of the validation report you’ll find a CSV button. Pressing it downloads a CSV file containing the data extract for this validation step. The CSV button only appears when:

  1. there is a non-zero number of failing test units
  2. the validation step is based on the use of a row-based validation method (the methods outlined above)

Access to CSV data for the row-based errors is useful when the validation report is shared with other data quality stakeholders, since it is easily accessible and doesn’t require further use of Pointblank. The stakeholder can simply open the downloaded CSV in their preferred spreadsheet software, import it into a different analysis environment like R or Julia, or process it with any tool that supports CSV files. This cross-platform compatibility makes the CSV export particularly valuable in mixed-language data teams where different members might be working with different tools.
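
For example, a recipient could pull the downloaded file straight back into Python. Here’s a minimal sketch; the filename is a placeholder, since the actual name is assigned at download time:

import polars as pl

# Read a downloaded extract back in (the filename here is hypothetical)
extract_from_csv = pl.read_csv("validation_step_1_extract.csv")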

Using the get_data_extracts() Method

For programmatic access to data extracts, Pointblank provides the get_data_extracts() method. This allows you to work with extract data directly in your Python workflow:

# Get data extracts from step 1
extract_1 = validation.get_data_extracts(i=1, frame=True)

extract_1
shape: (4, 9)
_row_num_   date_time             date         a     b             c      d         e       f
u32         datetime[μs]          date         i64   str           i64    f64       bool    str
1           2016-01-04 11:00:00   2016-01-04   2     "1-bcd-345"   3      3423.29   true    "high"
2           2016-01-04 00:32:00   2016-01-04   3     "5-egh-163"   8      9999.99   true    "low"
4           2016-01-06 17:23:00   2016-01-06   2     "5-jdo-903"   null   3892.4    false   "mid"
6           2016-01-11 06:15:00   2016-01-11   4     "2-dhe-923"   4      3291.03   true    "mid"

The extracted table is of the same type (a Polars DataFrame) as the target table. Previously we used load_dataset() with the tbl_type="polars" option to fetch the dataset in that form.

Note these important details about using get_data_extracts():

  • the parameter i=1 corresponds to the step number shown in the validation report (1-indexed, not 0-indexed)
  • setting frame=True returns the data as a DataFrame rather than a dictionary (only works when i is a single integer)
  • the extract includes all columns from the original data, not just the column being validated
  • an additional _row_num_ column is added to identify the original row positions
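
If you skip frame=True, you get the dictionary form instead. Here’s a quick sketch, assuming the dictionary is keyed by step number (consistent with the report’s numbering):

# Without `frame=True`, extracts come back as a dictionary keyed by step number
extracts = validation.get_data_extracts(i=1)

# The step-1 extract is then looked up by its key
extract_1_alt = extracts[1]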

Direct Access via Step Reports

Step reports provide another way to access and visualize failing data. When you generate a step report for a validation step that has failing rows, those failing rows are displayed directly in the report:

# Get a step report for the first validation step
step_report = validation.get_step_report(i=1)

step_report
Report for Validation Step 1
ASSERTION d < 3000
4 / 13 TEST UNIT FAILURES IN COLUMN 6
EXTRACT OF ALL 4 ROWS (WITH TEST UNIT FAILURES IN RED):
    date_time             date         a       b           c       d         e         f
    Datetime              Date         Int64   String      Int64   Float64   Boolean   String
1   2016-01-04 11:00:00   2016-01-04   2       1-bcd-345   3       3423.29   True      high
2   2016-01-04 00:32:00   2016-01-04   3       5-egh-163   8       9999.99   True      low
4   2016-01-06 17:23:00   2016-01-06   2       5-jdo-903   None    3892.4    False     mid
6   2016-01-11 06:15:00   2016-01-11   4       2-dhe-923   4       3291.03   True      mid

Step reports offer several advantages for working with data extracts as they:

  1. provide immediate visual context by highlighting the specific column being validated
  2. format the data for better readability, especially useful when sharing results with colleagues
  3. include additional metadata about the validation step and failure statistics

For steps with many failures, you can customize how many rows to display:

# Limit to just 2 rows of failing data
limited_report = validation.get_step_report(i=1, limit=2)

limited_report
Report for Validation Step 1
ASSERTION d < 3000
4 / 13 TEST UNIT FAILURES IN COLUMN 6
EXTRACT OF FIRST 2 ROWS (WITH TEST UNIT FAILURES IN RED):
    date_time             date         a       b           c       d         e         f
    Datetime              Date         Int64   String      Int64   Float64   Boolean   String
1   2016-01-04 11:00:00   2016-01-04   2       1-bcd-345   3       3423.29   True      high
2   2016-01-04 00:32:00   2016-01-04   3       5-egh-163   8       9999.99   True      low

Step reports are particularly valuable when you want to quickly inspect the failing data without extracting it into a separate DataFrame. They provide a bridge between the high-level validation report and the detailed data extracts.

Viewing Data Extracts with preview()

To get a consistent HTML representation of any data extract (regardless of the table type), we can use the preview() function:

pb.preview(data=extract_1)
Polars   Rows: 4   Columns: 9

    date_time             date         a       b           c       d         e         f
    Datetime              Date         Int64   String      Int64   Float64   Boolean   String
1   2016-01-04 11:00:00   2016-01-04   2       1-bcd-345   3       3423.29   True      high
2   2016-01-04 00:32:00   2016-01-04   3       5-egh-163   8       9999.99   True      low
4   2016-01-06 17:23:00   2016-01-06   2       5-jdo-903   None    3892.4    False     mid
6   2016-01-11 06:15:00   2016-01-11   4       2-dhe-923   4       3291.03   True      mid

The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the _row_num_ column is now part of the table stub and doesn’t steal focus from the table’s original columns.

The preview() function is designed to provide the head and tail (5 rows each) of the table so very large extracts won’t overflow the display.
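
If the default head/tail split doesn’t suit a given extract, preview() can be told how many rows to show at each end. A small sketch, assuming the n_head= and n_tail= parameters are available in your installed version:

# Show only the first 2 and last 2 rows of the extract
pb.preview(data=extract_1, n_head=2, n_tail=2)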

Working with Multiple Validation Steps

When validating data with multiple steps, you can extract failing rows from any step:

# Create a validation with multiple steps
multi_validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=3)                                  # Step 1
    .col_vals_lt(columns="d", value=3000)                               # Step 2
    .col_vals_regex(columns="b", pattern="^[0-9]-[a-z]{3}-[0-9]{3}$")   # Step 3
    .interrogate()
)

multi_validation
STEP                  COLUMNS   VALUES                       UNITS   PASS        FAIL       EXT
1  col_vals_gt()      a         3                            13      6 (0.46)    7 (0.54)   CSV
2  col_vals_lt()      d         3000                         13      9 (0.69)    4 (0.31)   CSV
3  col_vals_regex()   b         ^[0-9]-[a-z]{3}-[0-9]{3}$    13      13 (1.00)   0 (0.00)   —

You can then access extracts from any specific step:

# Get extracts from step 2 (`d < 3000` validation)
less_than_failures = multi_validation.get_data_extracts(i=2, frame=True)

less_than_failures
shape: (4, 9)
_row_num_   date_time             date         a     b             c      d         e       f
u32         datetime[μs]          date         i64   str           i64    f64       bool    str
1           2016-01-04 11:00:00   2016-01-04   2     "1-bcd-345"   3      3423.29   true    "high"
2           2016-01-04 00:32:00   2016-01-04   3     "5-egh-163"   8      9999.99   true    "low"
4           2016-01-06 17:23:00   2016-01-06   2     "5-jdo-903"   null   3892.4    false   "mid"
6           2016-01-11 06:15:00   2016-01-11   4     "2-dhe-923"   4      3291.03   true    "mid"

If a step has no failing rows, an empty DataFrame will be returned:

# Get extracts from step 3 (regex check)
regex_failures = multi_validation.get_data_extracts(i=3, frame=True)

regex_failures
shape: (0, 9)
_row_num_   date_time      date   a     b     c     d     e      f
u32         datetime[μs]   date   i64   str   i64   f64   bool   str
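
Because an empty extract is still a valid DataFrame, downstream code only needs a simple guard. A minimal sketch:

# Skip follow-up analysis when a step produced no failing rows
if regex_failures.height == 0:
    print("No failing rows extracted for step 3")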

Analyzing Patterns in Failing Data

Once you have data extracts, you can perform additional analysis to identify patterns in the failing data:

# Get extracts from step 2
d_value_failures = multi_validation.get_data_extracts(i=2, frame=True)

# Basic statistical analysis of the failing values
if len(d_value_failures) > 0:
    print(f"Min failing value: {d_value_failures['d'].min()}")
    print(f"Max failing value: {d_value_failures['d'].max()}")
    print(f"Mean failing value: {d_value_failures['d'].mean()}")
Min failing value: 3291.03
Max failing value: 9999.99
Mean failing value: 5151.6775
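
To gauge severity, you might also measure how far the failing values sit above the threshold. A small sketch using Polars (the excess column name is just illustrative):

import polars as pl

# Distance of each failing `d` value from the 3000 threshold
excess = d_value_failures.select((pl.col("d") - 3000).alias("excess"))

print(excess.describe())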

Combining Extracts from Multiple Steps

You might want to find rows that failed multiple validation checks:

# Get row numbers from each extract
step1_rows = set(multi_validation.get_data_extracts(i=1, frame=True)["_row_num_"])
step2_rows = set(multi_validation.get_data_extracts(i=2, frame=True)["_row_num_"])

# Find rows that failed both validations
common_failures = step1_rows.intersection(step2_rows)
print(f"Rows failing both step 1 and step 2: {common_failures}")
Rows failing both step 1 and step 2: {1, 2, 4}
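
To see those overlapping rows in full, you can filter the original table by row number. A sketch, assuming _row_num_ is 1-based (which the extracts above suggest):

import polars as pl

small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

# Attach 1-based row numbers, then keep only rows that failed both steps
common_rows = (
    small_table
    .with_row_index(name="_row_num_", offset=1)
    .filter(pl.col("_row_num_").is_in(list(common_failures)))
)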

Using Extracts for Data Quality Improvement

Data extracts are especially valuable for:

  1. Root Cause Analysis: examining the full context of failing rows to understand why they failed
  2. Data Cleaning: creating targeted cleanup scripts that focus only on problematic records
  3. Feedback Loops: sharing specific examples with data providers to improve upstream quality
  4. Pattern Recognition: identifying systemic issues by analyzing groups of failing records

Here’s an example of using extracts to create a corrective action plan:

import polars as pl

# Create a sample DataFrame standing in for a data extract
sample_extract = pl.DataFrame({
    "id": range(1, 11),
    "value": [3500, 4200, 3800, 9800, 5500, 7200, 8300, 4100, 7600, 3200],
    "category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "B"],
    "region": ["South", "South", "North", "East", "South", "South", "East", "South", "West", "South"]
})

# Identify which regions have the most failures
region_counts = (
    sample_extract
    .group_by("region")
    .agg(pl.len().alias("failure_count"))
    .sort("failure_count", descending=True)
)

region_counts
shape: (4, 2)
region    failure_count
str       u32
"South"   6
"East"    2
"West"    1
"North"   1

Analysis shows that 6 out of 10 failing records (60%) are from the "South" region, making it the highest priority area for data quality investigation. This suggests a potential systemic issue with data collection or processing in that specific region.
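
From here, a corrective action plan might start with the worst-affected region. A brief sketch (the top_region and to_review names are illustrative):

# Pull the records from the highest-failure region for targeted review
top_region = region_counts["region"][0]
to_review = sample_extract.filter(pl.col("region") == top_region)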

Best Practices for Working with Data Extracts

When incorporating data extracts into your data quality workflow:

  1. Use extracts for investigation, not just reporting: the real value is in the insights you gain from analyzing the problematic data

  2. Combine with other Pointblank features: data extracts work well with step reports and can inform threshold settings for future validations

  3. Consider sampling for very large datasets: if your extracts contain thousands of rows, focus your investigation on a representative sample (see the sketch after this list)

  4. Look beyond individual validation steps: cross-reference extracts from different steps to identify complex issues that span multiple validation rules

  5. Document patterns in failing data: record and share insights about common failure modes to build organizational knowledge about data quality issues.
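
As a concrete illustration of point 3, here is one way to take a reproducible sample from a large extract with Polars (the cap of 100 rows and the seed are arbitrary choices):

# Sample up to 100 rows from a potentially large extract
sampled = extract_1.sample(n=min(100, extract_1.height), seed=42)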

By integrating these practices into your data validation workflow, you’ll transform data extracts from simple error lists into powerful diagnostic tools. The most successful data quality initiatives treat extracts as the starting point for investigation rather than the end result of validation. When systematically analyzed and documented, patterns in failing data can reveal underlying issues in data systems, collection methods, or business processes that might otherwise remain hidden. Remember that the ultimate goal isn’t just to identify problematic records, but to use that information to implement targeted improvements that prevent similar issues from occurring in the future.

Conclusion

Data extracts bridge the gap between high-level validation statistics and the detailed context needed to fix data quality issues. By providing access to the actual failing records, Pointblank enables you to:

  • pinpoint exactly which data points caused validation failures
  • understand the full context around problematic values
  • develop targeted strategies for data cleanup and quality improvement
  • communicate specific examples to stakeholders

Whether you’re accessing extracts through CSV downloads, the get_data_extracts() method, or step reports, this feature provides the detail needed to move from identifying problems to implementing solutions.