Step Reports

While validation reports provide a comprehensive overview of all validation steps, sometimes you need to focus on a specific validation step in greater detail. This is where step reports come in. A step report is a detailed examination of a single validation step, providing in-depth information about the test units that were validated and their pass/fail status.

Step reports are especially useful when debugging validation failures, investigating problematic data, or communicating detailed findings to colleagues who are responsible for specific data quality issues.

Creating a Step Report

To create a step report, you first need to run a validation and then use the Validate.get_step_report() method, specifying which validation step you want to examine:

import pointblank as pb
import polars as pl

# Sample data as a Polars DataFrame
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 35, 50, 2, 70, 8, 20, 4],
    "category": ["A", "B", "C", "A", "D", "F", "A", "E", "H", "G"],
    "ratio": [0.5, 0.7, 0.3, 1.2, 0.8, 0.9, 0.4, 1.5, 0.6, 0.2],
    "status": ["active", "active", "inactive", "active", "inactive",
               "active", "inactive", "active", "active", "inactive"]
})

# Create a validation
validation = (
    pb.Validate(data=data, tbl_name="example_data")
    .col_vals_gt(
        columns="value",
        value=10
    )
    .col_vals_in_set(
        columns="category",
        set=["A", "B", "C"]
    )
    .interrogate()
)

# Get step report for the second validation step (i=2)
step_report = validation.get_step_report(i=2)

step_report

	id Int64	value Int64	category String	ratio Float64	status String
Report for Validation Step 2 ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

In this example, we first create and interrogate a validation object with two steps. We then generate a step report for the second validation step (i=2), which checks if the values in the category column are in the set ["A", "B", "C"].

Note that step numbers in Pointblank start at 1, matching what you see in the validation report’s STEP column (i.e., not 0-based indexing). So the first step is referred to with i=1, the second step with i=2, and so on.

Understanding Step Report Components

A step report consists of several key components that provide detailed information about the validation step:

Header: displays the validation step number, type of validation, and a brief description
Table Body: presents either the failing rows, a sample of completely passing data, or an expected/actual comparison (for a Validate.col_schema_match() step)

The step report table highlights passing and failing rows, making it easy to identify problematic data points. This is especially useful for diagnosing issues when dealing with large datasets.

Different Types of Step Reports

It’s important to note that step reports vary in appearance and structure depending on the type of validation method used:

Value-based validations (like Validate.col_vals_gt(), Validate.col_vals_in_set()): show individual rows that failed validation
Uniqueness checks (Validate.rows_distinct()): group together the duplicate records in order of appearance
Schema validations (Validate.col_schema_match()): display column-level information about expected vs. actual data types

Additionally, step reports for value-based validations and uniqueness checks operate in two distinct modes:

When errors are present: The report shows only the failing rows and, for value-based validations, clearly highlights the column under study
When no errors exist: The report header clearly indicates success, and a sample of the data is shown (along with the studied column highlighted, for value-based validations)

This variation in reporting style allows step reports to effectively communicate the specific type of validation being performed and display relevant information in the most appropriate format. When you’re working with different validation types, expect to see different step report layouts optimized for each context.

Value-Based Validation Step Reports

Value-based step reports focus on showing individual rows where values in the target column failed the validation check. These reports highlight the specific column being validated and clearly display which values violated the condition.

# Create sample data with some validation failures
data = pl.DataFrame({
    "id": range(1, 8),
    "value": [120, 85, 47, 210, 30, 10, 5],
    "category": ["A", "B", "C", "A", "D", "B", "E"]
})

# Create a validation with a value-based check
validation_values = (
    pb.Validate(data=data, tbl_name="sales_data")
    .col_vals_gt(
        columns="value",
        value=50,
        brief="Sales values should exceed $50"
    )
    .interrogate()
)

# Display the step report for the value-based validation
validation_values.get_step_report(i=1)

	id Int64	value Int64	category String
Report for Validation Step 1 ASSERTION `value > 50` 4 / 7 TEST UNIT FAILURES IN COLUMN 2 EXTRACT OF ALL 4 ROWS (WITH TEST UNIT FAILURES IN RED):
3	3	47	C
5	5	30	D
6	6	10	B
7	7	5	E

This report clearly identifies which rows contain values that don’t meet our threshold, making it easy to investigate these specific data points.

Uniqueness Validation Step Reports

Uniqueness checks produce a different type of step report that groups duplicate records together. This format makes it easy to identify patterns in duplicate data.

# Create sample data with some duplicate rows based on the combination of columns
data = pl.DataFrame({
    "customer_id": [101, 102, 103, 101, 104, 105, 102],
    "order_date": ["2023-01-15", "2023-01-16", "2023-01-16",
                   "2023-01-15", "2023-01-17", "2023-01-18", "2023-01-19"],
    "product": ["Laptop", "Phone", "Tablet", "Laptop",
                "Monitor", "Keyboard", "Headphones"]
})

# Create a validation checking for unique customer-product combinations
validation_duplicates = (
    pb.Validate(data=data, tbl_name="order_data")
    .rows_distinct(
        columns_subset=["customer_id", "product"],
        brief="Customer should not order the same product twice"
    )
    .interrogate()
)

# Display the step report for the uniqueness validation
validation_duplicates.get_step_report(i=1)

	customer_id Int64	product String
Report for Validation Step 1 Rows are distinct across a subset of columns 2 / 7 TEST UNIT FAILURES EXTRACT OF ALL 2 ROWS:
1	101	Laptop
4	101	Laptop

The report organizes duplicate records together, making it easy to see which combinations are repeated and how many times they appear.

Schema Validation Step Reports

Schema validation step reports have a completely different structure, comparing expected versus actual column data types and presence.

schema = pb.Schema(
    columns=[
        ("date_time", "timestamp"),
        ("dates", "date"),
        ("a", "int64"),
        ("b",),
        ("c",),
        ("d", "float64"),
        ("e", ["bool", "boolean"]),
        ("f", "str"),
    ]
)

validation_schema = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"),
        tbl_name="small_table",
        label="Step report for a schema check"
    )
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the step report for the schema validation
validation_schema.get_step_report(i=1)

TARGET			EXPECTED
Report for Validation Step 1 ✗ COLUMN SCHEMA MATCH COMPLETE IN ORDER COLUMN ≠ column DTYPE ≠ dtype float ≠ float64
	COLUMN	DATA TYPE		COLUMN		DATA TYPE
1	date_time	timestamp(6)	1	date_time	✓	timestamp	✗
2	date	date	2	dates	✗	date	—
3	a	int64	3	a	✓	int64	✓
4	b	string	4	b	✓	—
5	c	int64	5	c	✓	—
6	d	float64	6	d	✓	float64	✓
7	e	boolean	7	e	✓	bool \| boolean	✓
8	f	string	8	f	✓	str	✗
Supplied Column Schema: `[('date_time', 'timestamp'), ('dates', 'date'), ('a', 'int64'), ('b',), ('c',), ('d', 'float64'), ('e', ['bool', 'boolean']), ('f', 'str')]`

This report style focuses on comparing the expected schema against the actual table structure, highlighting mismatches in data types or missing/extra columns. The table format makes it easy to see exactly where the schema expectations differ from reality.

Customizing Step Reports

Step reports can be customized with several parameters to better focus your analysis and tailor the output to your specific needs. The Validate.get_step_report() method offers multiple customization options to help you create more effective reports.

When a dataset has many columns, you might want to focus on just those relevant to your analysis. You can create a step report containing only a subset of the columns in the target table:

validation.get_step_report(
    i=2,

    # Only show these columns ---
    columns_subset=["id", "category", "status"]
)

	id Int64	category String	status String
Report for Validation Step 2 ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):
5	5	D	inactive
6	6	F	active
8	8	E	active
9	9	H	active
10	10	G	inactive

This approach makes step reports much easier to interpret by highlighting just the essential columns that help understand the validation failures.

For large datasets with many failing rows, you might want to use limit= to set a cap on the number of rows shown in the report:

validation.get_step_report(
    i=2,

    # Only show up to 2 failing rows ---
    limit=2
)

	id Int64	value Int64	category String	ratio Float64	status String
Report for Validation Step 2 ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF FIRST 2 ROWS (WITH TEST UNIT FAILURES IN RED):
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active

The report header can also be extensively customized to provide more specific context. You can replace the default header with plain text or Markdown formatting:

validation.get_step_report(
    i=2,
    header="Category Values Validation: *Critical Analysis*"
)

	id Int64	value Int64	category String	ratio Float64	status String
Category Values Validation: Critical Analysis
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

For more advanced header customization, you can use the templating system with the {title} and {details} elements to retain parts of the default header while adding your own content. The {title} template is the default title whereas {details} provides information on the assertion, number of failures, etc. Let’s move away from the default template of {title}{details} and provide a custom title to go with the details text:

validation.get_step_report(
    i=2,
    header="Custom Category Validation Report {details}"
)

	id Int64	value Int64	category String	ratio Float64	status String
Custom Category Validation Report ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

We can keep {title} and {details} and add some more context in between the two:

validation.get_step_report(
    i=2,
    header=(
        "{title}<br>"
        "<span style='font-size: 0.75em;'>"
        "This validation is critical for our data quality standards."
        "</span><br>"
        "{details}"
    )
)

	id Int64	value Int64	category String	ratio Float64	status String
Report for Validation Step 2 This validation is critical for our data quality standards. ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED):
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

You could always use more HTML and CSS to do a lot of customization:

validation.get_step_report(
    i=2,
    header=(
        "VALIDATION SUMMARY\n\n{details}\n\n"
        "<hr style='color: lightblue;'>"
        "<div style='font-size: smaller; padding-bottom: 5px; text-transform: uppercase'>"
        "{title}"
        "</div>"
    )
)

	id Int64	value Int64	category String	ratio Float64	status String
VALIDATION SUMMARY ASSERTION `category ∈ {A, B, C}` 5 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 5 ROWS (WITH TEST UNIT FAILURES IN RED): Report for Validation Step 2
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

If you prefer no header at all, simply set header=None:

validation.get_step_report(
    i=2,
    header=None
)

	id Int64	value Int64	category String	ratio Float64	status String
5	5	50	D	0.8	inactive
6	6	2	F	0.9	active
8	8	8	E	1.5	active
9	9	20	H	0.6	active
10	10	4	G	0.2	inactive

These customization options can be combined to create highly focused reports tailored to specific needs:

validation.get_step_report(
    i=2,
    columns_subset=["id", "category"],
    header="*Category Validation:* Top Issues",
    limit=2
)

	id Int64	category String
Category Validation: Top Issues
5	5	D
6	6	F

Through these customization options, you can craft step reports that effectively communicate the most important information to different audiences. Technical teams might benefit from seeing all columns but with a limited number of examples. Business stakeholders might prefer a focused view with only the most relevant columns. For documentation purposes, custom headers provide important context about what’s being validated.

Remember that customizing your step reports is about more than aesthetics: it’s about making complex validation information more accessible and actionable for all stakeholders involved in data quality.

Using Step Reports for Data Investigation

Step reports can be powerful tools for investigating data quality issues. Let’s look at a more complex example:

# Create a more complex dataset with multiple issues
complex_data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 40, 50, 2, 70, 80, 90, 7],
    "ratio": [0.1, 0.2, 0.3, 1.4, 0.5, 0.6, 0.7, 0.8, 1.2, 0.9],
    "category": ["A", "B", "C", "A", "D", "B", "A", "C", "B", "E"]
})

# Create a validation with multiple steps
validation_complex = (
    pb.Validate(data=complex_data, tbl_name="complex_data")
    .col_vals_gt(columns="value", value=10)
    .col_vals_le(columns="ratio", value=1.0)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)

# Get step report for the ratio validation (step 2)
ratio_report = validation_complex.get_step_report(i=2)

ratio_report

	id Int64	value Int64	ratio Float64	category String
Report for Validation Step 2 ASSERTION `ratio ≤ 1.0` 2 / 10 TEST UNIT FAILURES IN COLUMN 3 EXTRACT OF ALL 2 ROWS (WITH TEST UNIT FAILURES IN RED):
4	4	40	1.4	A
9	9	90	1.2	B

In this example, we’re investigating issues with the ratio column by generating a step report specifically for that validation step. The step report shows exactly which rows have values that exceed our maximum threshold of 1.0.

Combining Step Reports with Extracts

For more advanced analysis, you can extract the actual data from a step report into a DataFrame:

# Extract the data from the step report
failing_ratios = validation_complex.get_data_extracts(i=2)

failing_ratios

{2: shape: (2, 5)
 ┌───────────┬─────┬───────┬───────┬──────────┐
 │ _row_num_ ┆ id  ┆ value ┆ ratio ┆ category │
 │ ---       ┆ --- ┆ ---   ┆ ---   ┆ ---      │
 │ u32       ┆ i64 ┆ i64   ┆ f64   ┆ str      │
 ╞═══════════╪═════╪═══════╪═══════╪══════════╡
 │ 4         ┆ 4   ┆ 40    ┆ 1.4   ┆ A        │
 │ 9         ┆ 9   ┆ 90    ┆ 1.2   ┆ B        │
 └───────────┴─────┴───────┴───────┴──────────┘}

This extracts the failing rows from the validation step, which you can then further analyze or fix as needed. Note that the parameter i=2 corresponds directly to the step number shown in the validation report; it’s the same numbering system used for Validate.get_step_report().

These extracts are particularly valuable for analysts who need to:

perform additional calculations on problematic data
feed failing records into correction pipelines
create visualizations of data patterns that led to validation failures
export problem records to share with data owners

It’s worth noting that the validation report itself includes export buttons on the far right of each row that allow you to download CSV files of the failing data directly. This serves as a convenient delivery mechanism for sharing extracts with colleagues who may not be working in Python, making the validation report not just a visual tool but also a practical means of distributing problematic data for further investigation.

Step Reports with Segmented Data

When working with segmented validation, step reports become even more valuable as they allow you to investigate issues within specific segments:

# Create data with different regions
segmented_data = pl.DataFrame({
    "id": range(1, 10),
    "value": [10, 20, 3, 40, 50, 2, 6, 8, 60],
    "region": ["North", "North", "South", "South", "East", "East", "West", "West", "West"]
})

# Create a validation with segments
segmented_validation = (
    pb.Validate(data=segmented_data, tbl_name="regional_data")
    .col_vals_gt(
        columns="value",
        value=10,
        segments="region"  # Segment by region
    )
    .interrogate()
)

# Get step report for a specific segment (the 'West' region)
# For segmented validations, each segment gets its own step number
north_report = segmented_validation.get_step_report(i=4)

north_report

	id Int64	value Int64	region String
Report for Validation Step 4 ASSERTION `value > 10` 2 / 3 TEST UNIT FAILURES IN COLUMN 2 EXTRACT OF ALL 2 ROWS (WITH TEST UNIT FAILURES IN RED):
1	7	6	West
2	8	8	West

For segmented validations, each segment is treated as a separate validation step with its own step number. This allows you to investigate issues specific to each data segment using the appropriate step number from the validation report.

Best Practices for Using Step Reports

Here are some guidelines for effectively using step reports in your data validation workflow:

Generate step reports selectively: create reports only for steps that require detailed investigation rather than for all steps
Use the limit= parameter for large datasets: when working with large datasets, focus only on a subset of failing rows to avoid information overload
Share specific step reports with stakeholders: when collaborating with domain experts, share relevant step reports to help them understand and address specific data quality issues (and customize the header to improve clarity)
Combine with extracts for deeper analysis: use the Validate.get_data_extracts() method to extract the failing rows for further analysis or correction
Document findings from step reports: when you discover patterns or insights from step reports, document them to inform future data quality improvements

Remember that step reports are most valuable when used strategically as part of a broader data quality framework. By following these best practices, you can use step reports not just for troubleshooting, but to develop a deeper understanding of your data’s characteristics and quality patterns over time. This approach transforms step reports from simple debugging tools into strategic assets for continuous data quality improvement.

Conclusion

Step reports provide a focused lens into specific validation steps, allowing you to investigate data quality issues in detail. By generating targeted reports for specific validation steps, you can:

pinpoint exactly which data points are causing validation failures
communicate specific issues to relevant stakeholders
gather insights that might be missed in the aggregate validation report
track improvements in specific aspects of data quality over time

Whether you’re debugging validation failures, investigating edge cases, or communicating specific data quality issues to colleagues, step reports can give you the detailed information you need to understand and resolve data quality problems effectively.