PolarsRows13Columns8 |
||||||||
Across Columns
The previous section demonstrated the use of the many validation methods. In all of the examples where column values were compared against another value, that value was always one that was fixed. These sorts of comparisons (e.g., are values in column x
greater than or equal to zero?) are useful but sometimes you need more.
For a more dynamic type of comparison check, you can also compare a column’s values against those of another column. The name of the comparison column is provided as the value=
parameter (or left=
and right=
as needed in the col_vals_between()
and col_vals_outside()
validation methods).
Let’s see how this type of validation is made possible through a few examples using the small_table
dataset shown below:
Using col()
to Specify a Value Column
We know to use the columns=
to supply the target column. Just to recap, if we wanted to check that every value in column a
is greater than 2
then the validation step is written something like this:
...="a", value=2, ...)
.col_vals_gt(columns ...
What if you have two numerical columns and have good reason to compare values across those columns? This could be a check that expects every value in a
is greater than every value in x
(the comparison column). That would take the form:
...="a", value=pb.col("x"), ...)
.col_vals_gt(columns ...
Using the col()
helper function here in value=
is key. It lets Pointblank know that you’re not using a literal, fixed value for the comparison (you’re specifying a column). So long as you do this, the validation will involve checking that every value a
is greater than every adjacent value in x
. Here’s an example of this:
(=pb.load_dataset("small_table"))
pb.Validate(data="d", value=pb.col("c"))
.col_vals_gt(columns
.interrogate() )
Notice that in the COLUMNS
and VALUES
columns of the validation report, the two column names are shown: d
and c
. A thing that could be surprising is that there are two failing test units, even though values in d
are consistently larger than values in column c
. The reason is that there are missing values (i.e., None
values) in column c
and any missing value in a comparison check will result in a failing test unit.
When doing a comparison against a fixed value, we only had to worry about missing values in the target column. When comparing across columns, there is potential for missing values in both columns and that could result in correspondingly more failing test units. The corrective here is to use na_pass=True
. If you feel missing values (in either column) should be disregarded, this setting is a reasonable choice (and you could always use col_vals_not_null()
to perform missing value checks on these columns anyway).
Let’s take a quick look at the results when na_pass=True
is used:
(=pb.load_dataset("small_table"))
pb.Validate(data="d", value=pb.col("c"), na_pass=True)
.col_vals_gt(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
With that change, every test unit in that single step passes validation.
The validation methods that accept a col()
expression in their value=
parameter include:
Using col()
in Range Checks
Two validation methods deal with checking values within and outside a range:
These validation methods both have left=
and right=
arguments. You can use a mix of literal values and col()
-based expressions for these parameters. Here’s an example where we check values in d
to be in the range of lower-bound values in column c
and a fixed upper-bound value of 10000
:
(=pb.load_dataset("small_table"))
pb.Validate(data="d", left=pb.col("c"), right=10_000, na_pass=True)
.col_vals_between(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
Observe that the range reported in the VALUES
column is [c, 10000]
. This is our assurance that the left bound is dependent on values in column c
and that the right bound is fixed to a value of 10000
. All test units passed here as we were careful about missing values (using na_pass=True
) as in the previous example.