import pointblank as pb
(=pb.load_dataset(dataset="small_table", tbl_type="polars"))
pb.Validate(data
.col_vals_not_null(="c",
columns
# Set thresholds for the validation step ---
=pb.Thresholds(warning=1, error=0.2)
thresholds
)
.interrogate() )
Thresholds
Thresholds are a key concept in Pointblank that allow you to define acceptable limits for failing validation tests. Rather than a simple pass/fail model, thresholds enable you to signal failure at different severity levels (‘warning’, ‘error’, and ‘critical’), giving you fine-grained control over how data quality issues are reported and handled.
When used with Actions (covered in the next section), thresholds create a robust system for responding to data quality issues based on their severity. This approach allows you to:
- set different tolerance levels for different types of validation checks
- escalate responses based on the severity of data quality issues
- configure different notification strategies for different threshold levels
- create a more nuanced data validation workflow than simple pass/fail tests
A Simple Example
Let’s start with a basic example that demonstrates how thresholds work in practice:
In this example, we’re validating that column c
contains no Null values. We’ve set:
- A ‘warning’ threshold of
1
(triggers when 1 or more values are Null) - An ‘error’ threshold of
0.2
(triggers when 20% or more values are Null)
Looking at the results:
- the
FAIL
column shows that 2 test units have failed - the
W
column (for ‘warning’) shows a filled gray circle, indicating the warning threshold has been exceeded - the
E
column (for ‘error’) shows an open yellow circle, indicating the error threshold has not been exceeded - the
C
column (for ‘critical’) shows a dash since we didn’t set a critical threshold
Types of Threshold Values
Thresholds in Pointblank can be specified in two different ways:
Absolute Thresholds
Absolute thresholds are specified as integers and represent a fixed number of failing test units:
# Warning threshold of exactly 5 failing test units
= pb.Thresholds(warning=5) thresholds_absolute
With this configuration, the ‘warning’ threshold would be triggered if 5 or more test units fail.
Proportional Thresholds
Proportional thresholds are specified as decimals between 0 and 1, representing a percentage of the total test units:
# Error threshold of 10% of test units failing
= pb.Thresholds(error=0.1) thresholds_proportional
With this configuration, the ‘error’ threshold would be triggered if 10% or more of the test units fail.
Understanding Severity Levels
The three threshold levels in Pointblank (‘warning’, ‘error’, and ‘critical’) are inspired by traditional logging levels used in software development. These names suggest a progression of severity:
- ‘warning’ (level
30
): indicates potential issues that don’t necessarily prevent normal operation - ‘error’ (level
40
): suggests more serious problems that might impact data quality - ‘critical’ (level
50
): represents the most severe issues that likely require immediate attention
These numerical values (30
, 40
, 50
) are used internally by Pointblank when determining threshold hierarchy and can be accessed through the {level_num}
field in action metadata (covered in the next User Guide article).
While these names imply certain severity levels, they’re ultimately just convenient labels for different thresholds. You have complete flexibility in how you use them:
- you could use ‘warning’ for issues that should block a pipeline
- you might configure ‘critical’ for minor issues that just need documentation
- the ‘error’ level could trigger informational emails rather than actual error handling
The naming is primarily a suggestion to help organize your validation strategy. What matters most is how you configure actions for each threshold level to suit your specific data quality requirements.
Threshold Behavior
It’s important to understand a few key behaviors of thresholds:
- thresholds are inclusive: a value equal to or exceeding the threshold will trigger the associated level
- thresholds can be mixed: you can use absolute values for some levels and proportional for others
- threshold levels are hierarchical: ‘critical’ is more severe than ‘error’, which is more severe than ‘warning’
- when a test fails, all applicable threshold levels are marked in the report (though actions may only execute for the highest level by default)
Setting Global Thresholds
You can set thresholds globally for all validation steps in a workflow using the thresholds=
parameter in Validate
:
(
pb.Validate(=pb.load_dataset(dataset="small_table", tbl_type="polars"),
data
# Setting thresholds for all validation steps ---
=pb.Thresholds(warning=1, error=0.1)
thresholds
)="a")
.col_vals_not_null(columns="a", value=2)
.col_vals_gt(columns
.interrogate() )
With this approach, the same thresholds are applied to every validation step in the workflow.
Overriding Thresholds for Specific Steps
You can override global thresholds for specific validation steps by providing the thresholds=
parameter in individual validation methods:
(
pb.Validate(=pb.load_dataset(dataset="small_table", tbl_type="polars"),
data
# Setting global thresholds ---
=pb.Thresholds(warning=1, error=0.1)
thresholds
)="a")
.col_vals_not_null(columns
.col_vals_gt(="a", value=2,
columns
# Step-specific threshold that overrides global ---
=pb.Thresholds(warning=3)
thresholds
)
.interrogate() )
In this example, the second validation step uses its own ‘warning’ threshold of 3
, overriding the global setting of 1
.
Ways to Define Thresholds
Pointblank offers multiple ways to define thresholds to accommodate different coding styles and requirements.
1. Using the Thresholds
Class (Recommended)
The most explicit and flexible approach is using the Thresholds
class:
# Set individual thresholds for different levels
= pb.Thresholds(warning=0.05, error=0.1, critical=0.25)
thresholds_all_levels
# Set only specific levels
= pb.Thresholds(error=0.15) thresholds_error_only
This approach allows you to:
- set any combination of threshold levels
- use descriptive parameter names for clarity
- skip levels you don’t need to set
2. Using a Tuple
For concise code, you can use a tuple where positions represent ‘warning’, ‘error’, and ‘critical’ levels in that order:
# (warning, error, critical)
= (1, 0.1, 0.25)
thresholds_tuple
# Shorter tuples are also allowed
= (3,) # Only the 'warning' threshold
thresholds_tuple_warning = (3, 0.2) # Both 'warning' and 'error' thresholds thresholds_tuple_warning_error
While concise, this approach requires you to start with the ‘warning’ level and add levels in order.
3. Using a Dictionary
You can also use a dictionary with keys that match the threshold level names:
# Can use any combination of threshold levels
= {"warning": 1, "critical": 0.15} thresholds_dict
The dictionary must use the exact keys "warning"
, "error"
, and/or "critical"
.
4. Using a Single Value
The simplest approach is using a single numeric value, which sets just the ‘warning’ threshold:
# Sets 'warning' threshold to `5`
= 5 thresholds_single
This is equivalent to pb.Thresholds(warning=5)
.
Thresholds and Validation Steps
Let’s look at a more complete validation workflow that demonstrates different threshold configurations:
# Create a validation workflow with global and step-specific thresholds
(
pb.Validate(=pb.load_dataset(dataset="small_table", tbl_type="polars"),
data
# Global thresholds applied to all steps unless overridden ---
=pb.Thresholds(warning=0.05, error=0.1, critical=0.2)
thresholds
)
# Step 1: Uses global thresholds ---
="b")
.col_vals_not_null(columns
# Step 2: Overrides with step-specific thresholds ---
.col_vals_gt(="a", value=2,
columns=pb.Thresholds(warning=1, critical=0.3) # No 'error' threshold
thresholds
)
# Step 3: Uses a simplified tuple notation ---
="c", thresholds=(2, 0.15))
.col_vals_not_null(columns
.interrogate() )
Thresholds and Actions
While thresholds by themselves provide visual indicators of validation severity in reports, their real power emerges when combined with Actions. The Actions system (covered in the next article) allows you to specify what happens when a threshold is exceeded.
For example, you might configure:
- A ‘warning’ threshold that logs a message
- An ‘error’ threshold that sends an email notification
- A ‘critical’ threshold that blocks a data pipeline
Here’s a simple preview of how thresholds and actions work together:
(
pb.Validate(=pb.load_dataset(dataset="small_table", tbl_type="polars"),
data
# Define thresholds for all three severity levels ---
=pb.Thresholds(warning=1, error=2, critical=3),
thresholds
# Define actions for different threshold levels ---
=pb.Actions(
actions="Warning: {step} has {FAIL} failing values",
warning="ERROR: Step {step} exceeded the 'error' threshold",
error="CRITICAL: Data quality issue in column {col}"
critical
)
)="c")
.col_vals_not_null(columns
.interrogate() )
ERROR: Step 1 exceeded the 'error' threshold
Conclusion
Thresholds are a powerful feature that transform Pointblank from a simple validation tool into a sophisticated data quality monitoring system. By setting appropriate thresholds, you can:
- Define different severity levels for data quality issues
- Customize tolerance levels for different types of validation checks
- Create a more nuanced approach to data validation than binary pass/fail
- Enable targeted actions based on the severity of issues detected
In the next article, we’ll explore the Actions system in depth, showing you how to define automatic responses when thresholds are exceeded.