Data Extracts

Data extracts consist of target table rows containing at least one cell that was found to be a failing test unit. Many of the validation methods check values down a column according to some rule (e.g., values are not null/None, values are greater than zero, etc.). If any of those test units (which are really cells) failed during a validation step, the row is marked as failing for the purposes of data extract collection.

Data extracts can be useful after interrogation since they reveal which rows resulted in failures during interrogation. It is hoped that having quick access to entire rows of data with failing elements can be useful in uncovering the root causes of data quality issues.
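The row-marking logic described above can be sketched in plain Python. This is a conceptual illustration only, not Pointblank's implementation: the helper name `collect_failing_rows` is hypothetical, and the rows and the `d < 3000` rule mirror the small_table example used later in this article.

```python
# Conceptual sketch: any row with at least one failing cell for the
# step's rule is kept for the data extract, along with its 1-indexed
# row number. `collect_failing_rows` is a hypothetical helper.
def collect_failing_rows(rows, rule):
    """Return (1-indexed row number, row) pairs where the rule fails."""
    return [(i, row) for i, row in enumerate(rows, start=1) if not rule(row)]

rows = [
    {"d": 3423.29},  # fails: not < 3000
    {"d": 9999.99},  # fails
    {"d": 2343.23},  # passes
    {"d": 3892.40},  # fails
]

failing = collect_failing_rows(rows, rule=lambda r: r["d"] < 3000)
print([i for i, _ in failing])  # → [1, 2, 4]
```

The extract keeps whole rows (not just the offending cells), which is what makes root-cause analysis possible.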

The Validation Methods that Work with Data Extracts

Row-based validation methods (e.g., the col_vals_*() family of column-value checks) will have rows extracted when there are failing test units.

An extracted row means that a test unit failed for that row in the validation step. The extracted rows are a subset of the original table and are useful for further analysis or understanding the nature of the failing test units.

Data Extracts as CSV Data in the Validation Report

Data extracts are embedded within validation report tables. Let’s look at an example, using the small_table dataset, where data extracts are collected in a single validation step due to failing test units:

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="d", value=3000)
    .interrogate()
)

validation
STEP              COLUMNS  VALUES  UNITS  PASS      FAIL      EXT
1  col_vals_lt()  d        3000    13     9 (0.69)  4 (0.31)  CSV

The single validation step checks whether values in d are less than 3000. Within that column, values range from 108.34 to 9999.99, so it makes sense that we see 4 failing test units in the FAIL column.
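The PASS and FAIL fractions shown in the report are simply the counts divided by the total number of test units. A quick check with the numbers from this example:

```python
# The report's PASS/FAIL fractions are counts over total test units
# (13 rows checked in column `d`, per the example above).
units, passed, failed = 13, 9, 4

assert passed + failed == units
print(round(passed / units, 2))  # → 0.69
print(round(failed / units, 2))  # → 0.31
```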

If you look at the far right of the validation report you’ll find there’s a CSV button. Pressing it initiates the download of a CSV, and that CSV contains the data extract for this validation step. The CSV button only appears when:

  1. there is a non-zero number of failing test units
  2. the validation step is based on the use of a row-based validation method (the methods outlined above)

Access to CSV data for the row-based failures is useful when the validation report is shared with other data quality stakeholders, since the data is easily accessible and doesn't require further use of Pointblank.
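If you want the same shareable artifact programmatically (rather than via the report's CSV button), extract rows can be written out with the standard library's csv module. This is a hedged sketch: the two illustrative rows below echo the small_table example, and only two columns are shown for brevity.

```python
import csv
import io

# Illustrative extract rows in the same shape as the small_table
# example (only _row_num_ and d shown here for brevity).
extract_rows = [
    {"_row_num_": 1, "d": 3423.29},
    {"_row_num_": 2, "d": 9999.99},
]

# Write the failing rows to CSV so they can be shared with
# stakeholders who don't use Pointblank.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["_row_num_", "d"])
writer.writeheader()
writer.writerows(extract_rows)

print(buf.getvalue().splitlines()[0])  # → _row_num_,d
```

In practice you would write to a file (e.g., `open("extract.csv", "w", newline="")`) instead of an in-memory buffer.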

Collecting Data Extracts with the get_data_extracts() Method

Aside from the low-tech CSV buttons in validation report tables, we can more directly pull out the data extracts from the validation object created above. We do that with the get_data_extracts() method, supplying the step number (1) to the i= parameter:

extract_1 = validation.get_data_extracts(i=1, frame=True)

extract_1
shape: (4, 9)

_row_num_  date_time            date        a    b            c     d        e      f
u32        datetime[μs]         date        i64  str          i64   f64      bool   str
1          2016-01-04 11:00:00  2016-01-04  2    "1-bcd-345"  3     3423.29  true   "high"
2          2016-01-04 00:32:00  2016-01-04  3    "5-egh-163"  8     9999.99  true   "low"
4          2016-01-06 17:23:00  2016-01-06  2    "5-jdo-903"  null  3892.4   false  "mid"
6          2016-01-11 06:15:00  2016-01-11  4    "2-dhe-923"  4     3291.03  true   "mid"

The extracted table is of the same type (a Polars DataFrame) as the target table. We used load_dataset() with the tbl_type="polars" option to fetch the dataset in that form.

Notice that the frame=True option was used above. This makes the method return the table itself; normally the return type is a dictionary of extracts keyed by step number. This only works when what's provided to i= is a scalar integer (which is the case here).
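The relationship between the default dictionary return and frame=True can be sketched as follows. This is an assumption-laden illustration, not Pointblank's internals: `extracts` and `get_extract` are hypothetical stand-ins.

```python
# Sketch of the frame=True behavior: extracts live in a dict keyed by
# step number; frame=True unwraps the single table when i= is a scalar
# integer. `extracts` and `get_extract` are hypothetical names.
extracts = {1: "<table for step 1>"}

def get_extract(i, frame=False):
    if frame:
        if not isinstance(i, int):
            raise ValueError("frame=True requires a scalar integer for i=")
        return extracts[i]           # the table itself
    return {i: extracts[i]}         # default: dictionary keyed by step

print(get_extract(1))               # → {1: '<table for step 1>'}
print(get_extract(1, frame=True))   # → <table for step 1>
```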

Also notice that the returned DataFrame contains all the columns of the original dataset (i.e., not just the column being checked in the validation step) plus an additional column: _row_num_. That column provides the 1-indexed row numbers from the original dataset. Having entire rows along with their row numbers provides more context on where the data failures occurred.

Viewing Data Extracts with preview()

To get a consistent HTML representation of any data extract (regardless of the table type), we can use the preview() function:

pb.preview(data=extract_1)
Polars (4 rows, 9 columns)

   date_time            date        a      b          c      d        e        f
   Datetime             Date        Int64  String     Int64  Float64  Boolean  String
1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3      3423.29  True     high
2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8      9999.99  True     low
4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None   3892.4   False    mid
6  2016-01-11 06:15:00  2016-01-11  4      2-dhe-923  4      3291.03  True     mid

The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the _row_num_ column is now part of the table stub and doesn’t steal focus from the table’s original columns.

The preview() function is designed to provide the head and tail (5 rows each) of the table so very large extracts won’t overflow the display.
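The head-and-tail trimming that preview() performs can be sketched in a few lines. The helper name `head_tail` is hypothetical; the 5-row defaults match the behavior described above.

```python
# Sketch of head-and-tail trimming like preview() describes: keep the
# first and last 5 rows so very large extracts stay readable.
def head_tail(rows, n_head=5, n_tail=5):
    if len(rows) <= n_head + n_tail:
        return rows                       # small tables shown in full
    return rows[:n_head] + rows[-n_tail:]

rows = list(range(1, 21))                 # 20 stand-in rows
print(head_tail(rows))  # → [1, 2, 3, 4, 5, 16, 17, 18, 19, 20]
```

A four-row extract like `extract_1` falls under the 10-row threshold, so it is displayed in full.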