Overview
This article provides a quick overview of the data validation features in Pointblank. It introduces the key concepts and shows examples of the main functionality, giving you a foundation for using the library effectively.
Later articles in the User Guide will expand on each section covered here, providing more explanations and examples.
Validation Methods
Pointblank’s core functionality revolves around validation steps, which are individual checks that verify different aspects of your data. These steps are created by calling validation methods from the Validate class. When combined, they create a comprehensive validation plan for your data.
Here’s an example of a validation that incorporates three different validation methods:
import pointblank as pb

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Three different validation methods."
    )
    .col_vals_gt(columns="a", value=0)
    .rows_distinct()
    .col_exists(columns="date")
    .interrogate()
)
This example showcases how you can combine different types of validations in a single validation plan:
- a column value validation with col_vals_gt()
- a row-based validation with rows_distinct()
- a table structure validation with col_exists()
Most validation methods share common parameters that enhance their flexibility and power. These shared parameters (overviewed in the next few sections) create a consistent interface across all validation steps while allowing you to customize validation behavior for specific needs.
Column Selection Patterns
You can apply the same validation logic to multiple columns at once through the use of column selection patterns (used in the columns= parameter). This reduces repetitive code and makes your validation plans more maintainable:
import narwhals.selectors as nws

# Map validations across multiple columns
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Applying column mapping in `columns`."
    )
    # Apply validation rules to multiple columns ---
    .col_vals_not_null(
        columns=["a", "b", "c"]
    )
    # Apply to numeric columns only with a Narwhals selector ---
    .col_vals_gt(
        columns=nws.numeric(),
        value=0
    )
    .interrogate()
)
This technique is particularly valuable when working with wide datasets containing many similarly-structured columns or when applying standard quality checks across an entire table. It also ensures consistency in how validation rules are applied across related data columns.
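Pointblank also ships its own column selector helpers, so you can target columns by name pattern without importing Narwhals. Here is a minimal sketch, assuming the pb.starts_with() selector helper; the column names mentioned are those of small_table:

# Check for missing values in every column whose name starts with "d" ---
# (in `small_table` this would select `date_time`, `date`, and `d`)
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Selecting columns by name prefix."
    )
    .col_vals_not_null(columns=pb.starts_with("d"))
    .interrogate()
)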
Preprocessing
Preprocessing (with the pre= parameter) allows you to transform or modify your data before applying validation checks, enabling you to validate derived or modified data without altering the original dataset:
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Preprocessing validation steps via `pre=`."
    )
    .col_vals_gt(
        columns="a", value=5,
        # Apply transformation before validation ---
        pre=lambda df: df.with_columns(
            pl.col("a") * 2  # Double values before checking
        )
    )
    .col_vals_lt(
        columns="c", value=100,
        # Apply more complex transformation ---
        pre=lambda df: df.with_columns(
            pl.col("c").pow(2)  # Square values before checking
        )
    )
    .interrogate()
)
Preprocessing enables validation of transformed data without modifying your original dataset, making it ideal for checking derived metrics or validating normalized values. This approach keeps your validation code clean while allowing for sophisticated data quality checks on calculated results.
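Because pre= accepts any callable that takes the table and returns a transformed table, a named function works just as well as a lambda and keeps longer transformations readable. A minimal sketch; the helper name and derived column below are illustrative, not part of the library:

import polars as pl

def add_d_per_a(df: pl.DataFrame) -> pl.DataFrame:
    # Derive an illustrative `d_per_a` column; it exists only for the step that uses it
    return df.with_columns((pl.col("d") / pl.col("a")).alias("d_per_a"))

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Preprocessing with a named function."
    )
    .col_vals_gt(
        columns="d_per_a", value=0,
        pre=add_d_per_a  # Transformation applied before this step's check
    )
    .interrogate()
)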
Segmentation
Segmentation (through the segments= parameter) allows you to validate data across different groups, enabling you to identify segment-specific quality issues that might be hidden in aggregate analyses:
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Segmenting validation steps via `segments=`."
    )
    .col_vals_gt(
        columns="c", value=3,
        # Split into steps by categorical values in column 'f' ---
        segments="f"
    )
    .interrogate()
)
Segmentation is powerful for detecting patterns of quality issues that may exist only in specific data subsets, such as certain time periods, categories, or geographical regions. It helps ensure that all significant segments of your data meet quality standards, not just the data as a whole.
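Segmentation doesn't have to split on every value of a column. As a sketch, under the assumption that segments= also accepts a tuple of a column name and the specific values to segment on, you could restrict the steps to selected groups:

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Segmenting on selected values of column 'f'."
    )
    .col_vals_gt(
        columns="c", value=3,
        # Create steps only for the "high" and "low" groups in column 'f' ---
        segments=("f", ["high", "low"])
    )
    .interrogate()
)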
Thresholds
Thresholds (set through the thresholds= parameter) let you set acceptable levels of failure before triggering warnings, errors, or critical notifications for individual validation steps:
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Using thresholds."
    )
    # Add validation steps with different thresholds ---
    .col_vals_gt(
        columns="a", value=1,
        thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3)
    )
    # Add another step with stricter thresholds ---
    .col_vals_lt(
        columns="c", value=10,
        thresholds=pb.Thresholds(warning=0.05, error=0.1)
    )
    .interrogate()
)
Thresholds provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization’s tolerance for specific types of data issues.
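Thresholds also don't have to be repeated on every step. A minimal sketch, assuming a threshold set passed to Validate() acts as the default for the whole plan, with any per-step thresholds= overriding it:

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Table-level default thresholds.",
        # Defaults inherited by every step below ---
        thresholds=pb.Thresholds(warning=0.1, error=0.25, critical=0.35)
    )
    .col_vals_gt(columns="a", value=1)   # Uses the table-level thresholds
    .col_vals_lt(columns="c", value=10)  # Same here: no per-step override
    .interrogate()
)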
Actions
Actions (which can be configured in the actions= parameter) allow you to define specific responses when validation thresholds are crossed. You can use simple string messages or custom functions for more complex behavior:
# Example 1: Action with a string message ---
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Using actions with a string message."
    )
    .col_vals_gt(
        columns="c", value=2,
        thresholds=pb.Thresholds(warning=0.1, error=0.2),
        # Add a print-to-console action for the 'warning' threshold ---
        actions=pb.Actions(
            warning="WARNING: Values below `{value}` detected in column 'c'."
        )
    )
    .interrogate()
)
WARNING: Values below `2` detected in column 'c'.
# Example 2: Action with a callable function ---
def custom_action():
    from datetime import datetime
    print(f"Data quality issue found ({datetime.now()}).")

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Using actions with a callable function."
    )
    .col_vals_gt(
        columns="a", value=5,
        thresholds=pb.Thresholds(warning=0.1, error=0.2),
        # Apply the function to the 'error' threshold ---
        actions=pb.Actions(error=custom_action)
    )
    .interrogate()
)
Data quality issue found (2025-05-19 17:22:23.695340).
With custom action functions, you can implement sophisticated responses like sending notifications or logging to external systems.
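For example, an action callable can hand the event to Python's standard logging module rather than printing to the console. The logger name and message below are illustrative:

import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("data_quality")  # Illustrative logger name

def log_error():
    # Route the failure to the application's log instead of stdout
    logger.error("Validation threshold exceeded; see the Pointblank report for details.")

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Logging as an action."
    )
    .col_vals_gt(
        columns="a", value=5,
        thresholds=pb.Thresholds(error=0.2),
        actions=pb.Actions(error=log_error)
    )
    .interrogate()
)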
Briefs
Briefs (which can be set through the brief= parameter) allow you to customize descriptions associated with validation steps, making validation results more understandable to stakeholders. Briefs can either be automatically generated by setting brief=True or defined as custom messages for more specific explanations:
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Using `brief=` for displaying brief messages."
    )
    .col_vals_gt(
        columns="a", value=0,
        # Use `True` for automatic generation of briefs ---
        brief=True
    )
    .col_exists(
        columns=["date", "date_time"],
        # Add a custom brief for this validation step ---
        brief="Verify required date columns exist for time-series analysis"
    )
    .interrogate()
)
Pointblank Validation
Using `brief=` for displaying brief messages. (Polars)

| STEP | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|
| 1 col_vals_gt(): Expect that values in … | ✓ | 13 | 13 (1.00) | 0 (0.00) | — | — | — | — |
| 2 col_exists(): Verify required date columns exist for time-series analysis | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | — |
| 3 col_exists(): Verify required date columns exist for time-series analysis | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | — |
Briefs make validation results more meaningful by providing context about why each check matters. They’re particularly valuable in shared reports where stakeholders from various disciplines need to understand validation results in domain-specific terms.
Getting More Information
Each validation step can be further customized and has additional options. See these pages for more information:
- Validation Methods: A closer look at the more common validation methods
- Column Selection Patterns: Techniques for targeting specific columns
- Preprocessing: Transform data before validation
- Segmentation: Apply validations to specific segments of your data
- Thresholds: Set quality standards and trigger severity levels
- Actions: Respond to threshold exceedances with notifications or custom functions
- Briefs: Add context to validation steps
Conclusion
Validation steps are the building blocks of data validation in Pointblank. By combining steps from different categories and leveraging common features like thresholds, actions, and preprocessing, you can create comprehensive data quality checks tailored to your specific needs.
The next sections of this guide will dive deeper into each of these topics, providing detailed explanations and examples.