Across Columns

The previous section demonstrated the use of the many validation methods. In all of the examples where column values were compared against another value, that value was always one that was fixed. These sorts of comparisons (e.g., are values in column x greater than or equal to zero?) are useful but sometimes you need more.

For a more dynamic type of comparison check, you can also compare a column’s values against those of another column. The name of the comparison column is provided as the value= parameter (or left= and right= as needed in the col_vals_between() and col_vals_outside() validation methods).

Let’s see how this type of validation is made possible through a few examples using the small_table dataset shown below:

PolarsRows13Columns8
date_time
Datetime
date
Date
a
Int64
b
String
c
Int64
d
Float64
e
Boolean
f
String
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423.29 True high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 9999.99 True low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343.23 True high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 None 3892.4 False mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 283.94 True low
6 2016-01-11 06:15:00 2016-01-11 4 2-dhe-923 4 3291.03 True mid
7 2016-01-15 18:46:00 2016-01-15 7 1-knw-093 3 843.34 True high
8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1035.64 False low
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 833.98 True low
12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108.34 False low
13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 None 2230.09 True high

Using col() to Specify a Value Column

We know to use the columns= to supply the target column. Just to recap, if we wanted to check that every value in column a is greater than 2 then the validation step is written something like this:

...
.col_vals_gt(columns="a", value=2, ...)
...

What if you have two numerical columns and have good reason to compare values across those columns? This could be a check that expects every value in a is greater than every value in x (the comparison column). That would take the form:

...
.col_vals_gt(columns="a", value=pb.col("x"), ...)
...

Using the col() helper function here in value= is key. It lets Pointblank know that you’re not using a literal, fixed value for the comparison (you’re specifying a column). So long as you do this, the validation will involve checking that every value a is greater than every adjacent value in x. Here’s an example of this:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="d", value=pb.col("c"))
    .interrogate()
)
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C66 1
col_vals_gt
col_vals_gt()
d c 13 11
0.85
2
0.15

Notice that in the COLUMNS and VALUES columns of the validation report, the two column names are shown: d and c. A thing that could be surprising is that there are two failing test units, even though values in d are consistently larger than values in column c. The reason is that there are missing values (i.e., None values) in column c and any missing value in a comparison check will result in a failing test unit.

When doing a comparison against a fixed value, we only had to worry about missing values in the target column. When comparing across columns, there is potential for missing values in both columns and that could result in correspondingly more failing test units. The corrective here is to use na_pass=True. If you feel missing values (in either column) should be disregarded, this setting is a reasonable choice (and you could always use col_vals_not_null() to perform missing value checks on these columns anyway).

Let’s take a quick look at the results when na_pass=True is used:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="d", value=pb.col("c"), na_pass=True)
    .interrogate()
)
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C 1
col_vals_gt
col_vals_gt()
d c 13 13
1.00
0
0.00

With that change, every test unit in that single step passes validation.

The validation methods that accept a col() expression in their value= parameter include:

Using col() in Range Checks

Two validation methods deal with checking values within and outside a range:

These validation methods both have left= and right= arguments. You can use a mix of literal values and col()-based expressions for these parameters. Here’s an example where we check values in d to be in the range of lower-bound values in column c and a fixed upper-bound value of 10000:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="d", left=pb.col("c"), right=10_000, na_pass=True)
    .interrogate()
)
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C 1
col_vals_between
col_vals_between()
d [c, 10000] 13 13
1.00
0
0.00

Observe that the range reported in the VALUES column is [c, 10000]. This is our assurance that the left bound is dependent on values in column c and that the right bound is fixed to a value of 10000. All test units passed here as we were careful about missing values (using na_pass=True) as in the previous example.