While validation reports provide a comprehensive overview of all validation steps, sometimes you need to focus on a specific validation step in greater detail. This is where step reports come in. A step report is a detailed examination of a single validation step, providing in-depth information about the test units that were validated and their pass/fail status.
Step reports are especially useful when debugging validation failures, investigating problematic data, or communicating detailed findings to colleagues who are responsible for specific data quality issues.
Creating a Step Report
To create a step report, you first need to run a validation and then use the Validate.get_step_report() method, specifying which validation step you want to examine:
import pointblank as pb
import polars as pl

# Sample data as a Polars DataFrame
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 35, 50, 2, 70, 8, 20, 4],
    "category": ["A", "B", "C", "A", "D", "F", "A", "E", "H", "G"],
    "ratio": [0.5, 0.7, 0.3, 1.2, 0.8, 0.9, 0.4, 1.5, 0.6, 0.2],
    "status": ["active", "active", "inactive", "active", "inactive",
               "active", "inactive", "active", "active", "inactive"]
})

# Create a validation
validation = (
    pb.Validate(data=data, tbl_name="example_data")
    .col_vals_gt(
        columns="value",
        value=10
    )
    .col_vals_in_set(
        columns="category",
        set=["A", "B", "C"]
    )
    .interrogate()
)

# Get step report for the second validation step (i=2)
step_report = validation.get_step_report(i=2)
step_report
Report for Validation Step 2
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):

|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
In this example, we first create and interrogate a validation object with two steps. We then generate a step report for the second validation step (i=2), which checks if the values in the category column are in the set ["A", "B", "C"].
Note that step numbers in Pointblank start at 1, matching what you see in the validation report’s STEP column (i.e., not 0-based indexing). So the first step is referred to with i=1, the second step with i=2, and so on.
Understanding Step Report Components
A step report consists of several key components that provide detailed information about the validation step:
- Header: displays the validation step number, type of validation, and a brief description
- Table Body: presents either the failing rows, a sample of completely passing data, or an expected/actual comparison (for a Validate.col_schema_match() step)
The step report table highlights passing and failing rows, making it easy to identify problematic data points. This is especially useful for diagnosing issues when dealing with large datasets.
Different Types of Step Reports
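It’s important to note that step reports vary in appearance and structure depending on the type of validation method used. The sections that follow cover three kinds: value-based validations, uniqueness checks, and schema validations.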
Additionally, step reports for value-based validations and uniqueness checks operate in two distinct modes:
- When errors are present: The report shows only the failing rows and, for value-based validations, clearly highlights the column under study
- When no errors exist: The report header clearly indicates success, and a sample of the data is shown (along with the studied column highlighted, for value-based validations)
This variation in reporting style allows step reports to effectively communicate the specific type of validation being performed and display relevant information in the most appropriate format. When you’re working with different validation types, expect to see different step report layouts optimized for each context.
Value-Based Validation Step Reports
Value-based step reports focus on showing individual rows where values in the target column failed the validation check. These reports highlight the specific column being validated and clearly display which values violated the condition.
# Create sample data with some validation failures
data = pl.DataFrame({
    "id": range(1, 8),
    "value": [120, 85, 47, 210, 30, 10, 5],
    "category": ["A", "B", "C", "A", "D", "B", "E"]
})

# Create a validation with a value-based check
validation_values = (
    pb.Validate(data=data, tbl_name="sales_data")
    .col_vals_gt(
        columns="value",
        value=50,
        brief="Sales values should exceed $50"
    )
    .interrogate()
)

# Display the step report for the value-based validation
validation_values.get_step_report(i=1)
Report for Validation Step 1
ASSERTION value > 50
4 / 7 TEST UNIT FAILURES IN COLUMN 2
EXTRACT OF ALL 4 ROWS (WITH TEST UNIT FAILURES IN RED):

|   | id | value | category |
|---|----|-------|----------|
| 3 | 3  | 47    | C        |
| 5 | 5  | 30    | D        |
| 6 | 6  | 10    | B        |
| 7 | 7  | 5     | E        |
This report clearly identifies which rows contain values that don’t meet our threshold, making it easy to investigate these specific data points.
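To make concrete which rows such a report extracts, the same selection can be sketched in plain Python. This is only an illustration of the rule `value > 50` applied to the sample data above, not how Pointblank evaluates the step internally; the variable names are made up for the example.

```python
# Same sample data as above, as plain Python lists
ids = list(range(1, 8))
values = [120, 85, 47, 210, 30, 10, 5]
categories = ["A", "B", "C", "A", "D", "B", "E"]

threshold = 50

# Pair each row with its 1-based row number (the report's first column)
# and keep only the rows where the assertion `value > threshold` is false
failing_rows = [
    (row_num, id_, value, category)
    for row_num, (id_, value, category) in enumerate(
        zip(ids, values, categories), start=1
    )
    if not value > threshold
]

for row in failing_rows:
    print(row)

print(f"{len(failing_rows)} / {len(values)} test unit failures")
```

The four rows selected here (row numbers 3, 5, 6, and 7) are the same ones listed in the step report.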
Uniqueness Validation Step Reports
Uniqueness checks produce a different type of step report that groups duplicate records together. This format makes it easy to identify patterns in duplicate data.
# Create sample data with some duplicate rows based on the combination of columns
data = pl.DataFrame({
    "customer_id": [101, 102, 103, 101, 104, 105, 102],
    "order_date": ["2023-01-15", "2023-01-16", "2023-01-16",
                   "2023-01-15", "2023-01-17", "2023-01-18", "2023-01-19"],
    "product": ["Laptop", "Phone", "Tablet", "Laptop",
                "Monitor", "Keyboard", "Headphones"]
})

# Create a validation checking for unique customer-product combinations
validation_duplicates = (
    pb.Validate(data=data, tbl_name="order_data")
    .rows_distinct(
        columns_subset=["customer_id", "product"],
        brief="Customer should not order the same product twice"
    )
    .interrogate()
)

# Display the step report for the uniqueness validation
validation_duplicates.get_step_report(i=1)
Report for Validation Step 1
Rows are distinct across a subset of columns
2 / 7 TEST UNIT FAILURES
EXTRACT OF ALL 2 ROWS:

|   | customer_id | product |
|---|-------------|---------|
| 1 | 101         | Laptop  |
| 4 | 101         | Laptop  |
The report organizes duplicate records together, making it easy to see which combinations are repeated and how many times they appear.
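The grouping of duplicates can be approximated in plain Python by keying each row on the subset columns. The sketch below assumes the same sample data and is meant only to show why rows 1 and 4 are the ones reported; it is not Pointblank's implementation.

```python
from collections import defaultdict

customer_ids = [101, 102, 103, 101, 104, 105, 102]
products = ["Laptop", "Phone", "Tablet", "Laptop",
            "Monitor", "Keyboard", "Headphones"]

# Group 1-based row numbers by the (customer_id, product) combination
groups = defaultdict(list)
for row_num, key in enumerate(zip(customer_ids, products), start=1):
    groups[key].append(row_num)

# A combination is a duplicate when it occurs in more than one row
duplicates = {key: nums for key, nums in groups.items() if len(nums) > 1}

print(duplicates)  # {(101, 'Laptop'): [1, 4]}
```

Customer 102 also appears twice, but with different products, so those rows are distinct on the chosen subset and are not reported.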
Schema Validation Step Reports
Schema validation step reports have a completely different structure, comparing expected versus actual column data types and presence.
schema = pb.Schema(
    columns=[
        ("date_time", "timestamp"),
        ("dates", "date"),
        ("a", "int64"),
        ("b",),
        ("c",),
        ("d", "float64"),
        ("e", ["bool", "boolean"]),
        ("f", "str"),
    ]
)

validation_schema = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"),
        tbl_name="small_table",
        label="Step report for a schema check"
    )
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the step report for the schema validation
validation_schema.get_step_report(i=1)
Report for Validation Step 1 ✗
COLUMN SCHEMA MATCH
COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

|   | TARGET COLUMN | TARGET DATA TYPE |   | EXPECTED COLUMN |   | EXPECTED DATA TYPE |   |
|---|---------------|------------------|---|-----------------|---|--------------------|---|
| 1 | date_time     | timestamp(6)     | 1 | date_time       | ✓ | timestamp          | ✗ |
| 2 | date          | date             | 2 | dates           | ✗ | date               | — |
| 3 | a             | int64            | 3 | a               | ✓ | int64              | ✓ |
| 4 | b             | string           | 4 | b               | ✓ | —                  |   |
| 5 | c             | int64            | 5 | c               | ✓ | —                  |   |
| 6 | d             | float64          | 6 | d               | ✓ | float64            | ✓ |
| 7 | e             | boolean          | 7 | e               | ✓ | bool \| boolean    | ✓ |
| 8 | f             | string           | 8 | f               | ✓ | str                | ✗ |

Supplied Column Schema: [('date_time', 'timestamp'), ('dates', 'date'), ('a', 'int64'), ('b',), ('c',), ('d', 'float64'), ('e', ['bool', 'boolean']), ('f', 'str')]
This report style focuses on comparing the expected schema against the actual table structure, highlighting mismatches in data types or missing/extra columns. The table format makes it easy to see exactly where the schema expectations differ from reality.
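The marks in the report can be reproduced with a small comparison in plain Python. This sketch assumes the strictest reading of the settings shown in the report header: columns are compared position by position, and names and data types must match exactly and case-sensitively. The `compare` helper is hypothetical and only mirrors the report's outcome; it is not part of Pointblank.

```python
# Actual (target) columns and data types, as listed in the report
target = [
    ("date_time", "timestamp(6)"), ("date", "date"), ("a", "int64"),
    ("b", "string"), ("c", "int64"), ("d", "float64"),
    ("e", "boolean"), ("f", "string"),
]

# Expected schema: a 1-tuple means "check the column name only",
# a list of dtypes means "any of these is acceptable"
expected = [
    ("date_time", "timestamp"), ("dates", "date"), ("a", "int64"),
    ("b",), ("c",), ("d", "float64"),
    ("e", ["bool", "boolean"]), ("f", "str"),
]

def compare(target, expected):
    """Return (expected name, name matches?, dtype matches?) per position."""
    results = []
    for (t_name, t_dtype), exp in zip(target, expected):
        name_ok = t_name == exp[0]
        if not name_ok or len(exp) == 1:
            dtype_ok = None  # dtype not evaluated
        else:
            allowed = exp[1] if isinstance(exp[1], list) else [exp[1]]
            dtype_ok = t_dtype in allowed
        results.append((exp[0], name_ok, dtype_ok))
    return results

for name, name_ok, dtype_ok in compare(target, expected):
    print(f"{name:10} name={name_ok} dtype={dtype_ok}")
```

Under these assumptions, `timestamp(6)` does not equal `timestamp` and `string` does not equal `str`, which is why the `date_time` and `f` rows carry a ✗ in the data type column even though their names match.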
Customizing Step Reports
Step reports can be customized with several parameters to better focus your analysis and tailor the output to your specific needs. The Validate.get_step_report() method offers multiple customization options to help you create more effective reports.
When a dataset has many columns, you might want to focus on just those relevant to your analysis. You can create a step report containing only a subset of the columns in the target table:
validation.get_step_report(
    i=2,
    # Only show these columns ---
    columns_subset=["id", "category", "status"]
)
Report for Validation Step 2
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):

|    | id | category | status   |
|----|----|----------|----------|
| 5  | 5  | D        | inactive |
| 6  | 6  | F        | active   |
| 8  | 8  | E        | active   |
| 9  | 9  | H        | active   |
| 10 | 10 | G        | inactive |
This approach makes step reports much easier to interpret by highlighting just the essential columns that help understand the validation failures.
For large datasets with many failing rows, you might want to use limit= to set a cap on the number of rows shown in the report:
validation.get_step_report(
    i=2,
    # Only show up to 2 failing rows ---
    limit=2
)
Report for Validation Step 2
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF FIRST 2 ROWS (WITH TEST UNIT FAILURES IN RED):

|   | id | value | category | ratio | status   |
|---|----|-------|----------|-------|----------|
| 5 | 5  | 50    | D        | 0.8   | inactive |
| 6 | 6  | 2     | F        | 0.9   | active   |
The report header can also be extensively customized to provide more specific context. You can replace the default header with plain text or Markdown formatting:
validation.get_step_report(
    i=2,
    header="Category Values Validation: *Critical Analysis*"
)
Category Values Validation: Critical Analysis

|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
For more advanced header customization, you can use the templating system with the {title} and {details} elements to retain parts of the default header while adding your own content. The {title} template is the default title whereas {details} provides information on the assertion, number of failures, etc. Let’s move away from the default template of {title}{details} and provide a custom title to go with the details text:
validation.get_step_report(
    i=2,
    header="Custom Category Validation Report {details}"
)
Custom Category Validation Report
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):

|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
We can keep {title} and {details} and add some more context in between the two:
validation.get_step_report(
    i=2,
    header=(
        "{title}<br>"
        "<span style='font-size: 0.75em;'>"
        "This validation is critical for our data quality standards."
        "</span><br>"
        "{details}"
    )
)
Report for Validation Step 2
This validation is critical for our data quality standards.
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):

|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
You can go further and use additional HTML and CSS for more elaborate customization:
validation.get_step_report(
    i=2,
    header=(
        "VALIDATION SUMMARY\n\n{details}\n\n"
        "<hr style='color: lightblue;'>"
        "<div style='font-size: smaller; padding-bottom: 5px; text-transform: uppercase'>"
        "{title}"
        "</div>"
    )
)
VALIDATION SUMMARY
ASSERTION category ∈ {A, B, C}
5 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):
Report for Validation Step 2

|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
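The role of the {title} and {details} placeholders can be pictured with a small stand-in function. The `render_header` helper below is hypothetical and only models the substitution described above; Pointblank's own rendering (including Markdown and HTML handling) is more involved.

```python
from typing import Optional

# The default template described above: title followed by details
DEFAULT_TEMPLATE = "{title}{details}"

def render_header(template: Optional[str], title: str, details: str) -> Optional[str]:
    """Fill the {title} and {details} placeholders in a header template."""
    if template is None:
        return None  # no header at all
    # str.replace is used instead of str.format so that literal braces in
    # the details text (e.g. the set "{A, B, C}") are left untouched
    return template.replace("{title}", title).replace("{details}", details)

title = "Report for Validation Step 2"
details = "ASSERTION category ∈ {A, B, C}"

print(render_header(DEFAULT_TEMPLATE, title, details))
print(render_header("Custom Category Validation Report {details}", title, details))
print(render_header("Category Values Validation", title, details))
```

A template without any placeholder simply replaces the whole header, while a template containing only {details} swaps out the title and keeps the assertion summary.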
If you prefer no header at all, simply set header=None:
validation.get_step_report(
    i=2,
    header=None
)
|    | id | value | category | ratio | status   |
|----|----|-------|----------|-------|----------|
| 5  | 5  | 50    | D        | 0.8   | inactive |
| 6  | 6  | 2     | F        | 0.9   | active   |
| 8  | 8  | 8     | E        | 1.5   | active   |
| 9  | 9  | 20    | H        | 0.6   | active   |
| 10 | 10 | 4     | G        | 0.2   | inactive |
These customization options can be combined to create highly focused reports tailored to specific needs:
validation.get_step_report(
    i=2,
    columns_subset=["id", "category"],
    header="*Category Validation:* Top Issues",
    limit=2
)
Category Validation: Top Issues

|   | id | category |
|---|----|----------|
| 5 | 5  | D        |
| 6 | 6  | F        |
Through these customization options, you can craft step reports that effectively communicate the most important information to different audiences. Technical teams might benefit from seeing all columns but with a limited number of examples. Business stakeholders might prefer a focused view with only the most relevant columns. For documentation purposes, custom headers provide important context about what’s being validated.
Remember that customizing your step reports is about more than aesthetics: it’s about making complex validation information more accessible and actionable for all stakeholders involved in data quality.
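As a rough illustration of what combining columns_subset= and limit= selects, the following plain-Python sketch applies the same two restrictions to the failing rows of the category check. It is an approximation for intuition only and uses the sample data from the start of this guide; it does not call Pointblank.

```python
data = {
    "id": list(range(1, 11)),
    "category": ["A", "B", "C", "A", "D", "F", "A", "E", "H", "G"],
    "status": ["active", "active", "inactive", "active", "inactive",
               "active", "inactive", "active", "active", "inactive"],
}

allowed = {"A", "B", "C"}
columns_subset = ["id", "category"]  # columns to keep in the extract
limit = 2                            # maximum number of failing rows to show

# 0-based positions of rows whose category is outside the allowed set
failing = [i for i, c in enumerate(data["category"]) if c not in allowed]

# Apply the row cap first, then keep only the requested columns;
# row numbers are reported 1-based, as in the step report
extract = [
    (i + 1, {col: data[col][i] for col in columns_subset})
    for i in failing[:limit]
]

for row_num, row in extract:
    print(row_num, row)
```

Five rows fail in total, but only the first two (row numbers 5 and 6) survive the cap, each reduced to the `id` and `category` columns.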
Using Step Reports for Data Investigation
Step reports can be powerful tools for investigating data quality issues. Let’s look at a more complex example:
# Create a more complex dataset with multiple issues
complex_data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 40, 50, 2, 70, 80, 90, 7],
    "ratio": [0.1, 0.2, 0.3, 1.4, 0.5, 0.6, 0.7, 0.8, 1.2, 0.9],
    "category": ["A", "B", "C", "A", "D", "B", "A", "C", "B", "E"]
})

# Create a validation with multiple steps
validation_complex = (
    pb.Validate(data=complex_data, tbl_name="complex_data")
    .col_vals_gt(columns="value", value=10)
    .col_vals_le(columns="ratio", value=1.0)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)

# Get step report for the ratio validation (step 2)
ratio_report = validation_complex.get_step_report(i=2)
ratio_report
Report for Validation Step 2
ASSERTION ratio ≤ 1.0
2 / 10 TEST UNIT FAILURES IN COLUMN 3
EXTRACT OF ALL 2 ROWS (WITH TEST UNIT FAILURES IN RED):

|   | id | value | ratio | category |
|---|----|-------|-------|----------|
| 4 | 4  | 40    | 1.4   | A        |
| 9 | 9  | 90    | 1.2   | B        |
In this example, we’re investigating issues with the ratio column by generating a step report specifically for that validation step. The step report shows exactly which rows have values that exceed our maximum threshold of 1.0.
Step Reports with Segmented Data
When working with segmented validation, step reports become even more valuable as they allow you to investigate issues within specific segments:
# Create data with different regions
segmented_data = pl.DataFrame({
    "id": range(1, 10),
    "value": [10, 20, 3, 40, 50, 2, 6, 8, 60],
    "region": ["North", "North", "South", "South", "East", "East", "West", "West", "West"]
})

# Create a validation with segments
segmented_validation = (
    pb.Validate(data=segmented_data, tbl_name="regional_data")
    .col_vals_gt(
        columns="value",
        value=10,
        segments="region"  # Segment by region
    )
    .interrogate()
)

# Get step report for a specific segment (the 'West' region)
# For segmented validations, each segment gets its own step number
west_report = segmented_validation.get_step_report(i=4)
west_report
Report for Validation Step 4
ASSERTION value > 10
2 / 3 TEST UNIT FAILURES IN COLUMN 2
EXTRACT OF ALL 2 ROWS (WITH TEST UNIT FAILURES IN RED):

|   | id | value | region |
|---|----|-------|--------|
| 1 | 7  | 6     | West   |
| 2 | 8  | 8     | West   |
For segmented validations, each segment is treated as a separate validation step with its own step number. This allows you to investigate issues specific to each data segment using the appropriate step number from the validation report.
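The way one segmented check fans out into several numbered steps can be sketched in plain Python. The sketch assumes segments are numbered in order of first appearance in the data; that ordering is an assumption made for illustration, so always take the actual step numbers from the validation report's STEP column.

```python
values = [10, 20, 3, 40, 50, 2, 6, 8, 60]
regions = ["North", "North", "South", "South", "East", "East",
           "West", "West", "West"]

# Collect the values belonging to each segment
# (dicts preserve insertion order, so segments keep first-appearance order)
segments = {}
for value, region in zip(values, regions):
    segments.setdefault(region, []).append(value)

# Each segment becomes its own step, numbered from 1, and the
# `value > 10` check is evaluated only on that segment's rows
steps = {}
for step_num, (region, seg_values) in enumerate(segments.items(), start=1):
    n_failed = sum(1 for v in seg_values if not v > 10)
    steps[step_num] = (region, n_failed, len(seg_values))

for step_num, (region, n_failed, n_units) in steps.items():
    print(f"step {step_num}: {region}: {n_failed} / {n_units} failures")
```

Under this assumption the West segment lands at step 4 with 2 of its 3 test units failing, which agrees with the step report shown above.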
Best Practices for Using Step Reports
Here are some guidelines for effectively using step reports in your data validation workflow:
- Generate step reports selectively: create reports only for steps that require detailed investigation rather than for all steps
- Use the limit= parameter for large datasets: when working with large datasets, focus only on a subset of failing rows to avoid information overload
- Share specific step reports with stakeholders: when collaborating with domain experts, share relevant step reports to help them understand and address specific data quality issues (and customize the header to improve clarity)
- Combine with extracts for deeper analysis: use the Validate.get_data_extracts() method to extract the failing rows for further analysis or correction
- Document findings from step reports: when you discover patterns or insights from step reports, document them to inform future data quality improvements
Remember that step reports are most valuable when used strategically as part of a broader data quality framework. By following these best practices, you can use step reports not just for troubleshooting, but to develop a deeper understanding of your data’s characteristics and quality patterns over time. This approach transforms step reports from simple debugging tools into strategic assets for continuous data quality improvement.
Conclusion
Step reports provide a focused lens into specific validation steps, allowing you to investigate data quality issues in detail. By generating targeted reports for specific validation steps, you can:
- pinpoint exactly which data points are causing validation failures
- communicate specific issues to relevant stakeholders
- gather insights that might be missed in the aggregate validation report
- track improvements in specific aspects of data quality over time
Whether you’re debugging validation failures, investigating edge cases, or communicating specific data quality issues to colleagues, step reports can give you the detailed information you need to understand and resolve data quality problems effectively.