import pointblank as pb
= (
validation_1 =pb.load_dataset(dataset="small_table"))
pb.Validate(data
.col_vals_not_null(="c", thresholds=(1, 0.2)
columns
)
.interrogate()
)
validation_1
Thresholds
Thresholds enable you to signal failure at different severity levels. They also allow for the triggering of custom actions, a topic which is covered in the next section. For instance you might be testing a column for null/missing values. When doing so, you’d want to know when there are at least 10% missing values in the column. Alternatively, it could be the case that even a single missing value is critical to your work. Threshold settings in Pointblank give you the flexibility to devise data-failure signaling to whatever tolerances are important to you.
Let’s start with the basics though. Here’s an example of a simple validation where threshold values are set in the col_vals_not_null()
validation step (this type of validation expects that there are no null/missing values in a particular column):
The code uses thresholds=(1, 0.2)
to set a ‘warning’ threshold of 1
and an ‘error’ threshold of 0.2
(which is 20%) failing test units. You might notice the following in the validation table:
- The
FAIL
column shows that 2 tests units have failed - The
W
column (short for ‘warning’) shows a filled gray circle indicating it’s reached its threshold level - The
E
(‘error’) column shows an open yellow circle indicating it’s below the threshold level
The one final threshold, C
(‘critical’), wasn’t set so appears on the validation table as a dash.
Two Types of Threshold Values: Proportional and Absolute
Threshold values can be specified in two ways:
- proportional: a decimal value like 0.1 is taken to mean 10% of all test units failed
- absolute: a whole number represents a fixed number of test units failed
Threshold values act as cutoffs and are inclusive. So, any value of failing test units greater than or equal to the threshold value will result in exceeding the threshold. So if a threshold is defined with a value of 5
, then 5 failing test units will result in an exceedence.
Using the Validation(thresholds=)
Argument
We can also define thresholds globally. This means that every validation step will re-use the same set of threshold values.
import pointblank as pb
= (
validation_2 =pb.load_dataset(dataset="small_table"), thresholds=(1, 0.1))
pb.Validate(data="a")
.col_vals_not_null(columns="b", value=2)
.col_vals_gt(columns
.interrogate()
)
validation_2
In this, both the col_vals_not_null()
and col_vals_gt()
steps will use the thresholds=
value set in the Validate
call. Now, if you want to override these global threshold values for a given validation step, you can always use the threshold=
argument when calling a validation method (this argument is present within every validation method).
Ways to Define Thresholds
There are more than a few ways to set threshold levels. We provide this flexibility because it’s often useful to have simple shorthand methods for such a common task.
Using a Tuple or a Single Value
The fastest way to define a threshold is to use a tuple with positional entries for the ‘warning’, ‘error’, and ‘critical’ levels.
= (1, 0.1, 0.25) # (warning, error, critical) thresholds_tuple
Note that a shorter tuple is also allowed:
(1, )
: ‘warning’ state at 1 failing test unit(1, 0.1)
: ‘warning’ state at 1 failing test unit,error
state at 10% failing test units
You can even use a scalar value (float between 0
and 1
or an integer). That single value represents the threshold value for the ‘warning’ level:
= 1 thresholds_single
While using a tuple or a scalar can be very succinct, a problem that arises is that the ordering of values always begins at the ‘warning’ level. This means you cannot define a threshold level for just the ‘error’ level, for example. This is fine for many cases, however, there are other ways to express thresholds without these constraints.
Using the Thresholds
Class
Using the Thresholds
class lets you define the threshold levels using the warning=
, error=
, and critical=
arguments. And unlike the method of setting thresholds with a tuple, any of the threshold levels can be unset. Here’s an example where you might want to set the ‘error’ and ‘critical’ levels (leaving the ‘warning’ level unset):
= pb.Thresholds(error=0.3, critical=0.5) thresholds_class
Using a Dictionary to Set Thresholds
A specially-crafted dictionary is acceptable as input to any thresholds=
argument. You need to ensure that the keys are named using either "warning"
, "error"
, or "critical"
. Any combination of keys is fine, but be careful to use only the aforementioned names (otherwise, you’ll receive a ValueError
). Here’s an example that sets the ‘warning’ and ‘critical’ levels:
= {"warning": 1, "critical": 0.1} thresholds_dict