Comparing Values Across Columns

The previous section demonstrated the use of the many validation methods. In all of the examples where column values were compared against another value, that value was always one that was fixed. These sorts of comparisons (e.g., are values in column x greater than or equal to zero?) are useful but sometimes you need more.

For a more dynamic type of comparison check, you can also compare a column’s values against those of another column. The name of the comparison column is provided as the value= parameter (or left= and right= as needed in the col_vals_between() and col_vals_outside() validation methods).

Let’s see how this type of validation is made possible through a few examples using the small_table dataset shown below:

	date_time Datetime	date Date	a Int64	b String	c Int64	d Float64	e Boolean	f String
PolarsRows13Columns8
1	2016-01-04 11:00:00	2016-01-04	2	1-bcd-345	3	3423.29	True	high
2	2016-01-04 00:32:00	2016-01-04	3	5-egh-163	8	9999.99	True	low
3	2016-01-05 13:32:00	2016-01-05	6	8-kdg-938	3	2343.23	True	high
4	2016-01-06 17:23:00	2016-01-06	2	5-jdo-903	None	3892.4	False	mid
5	2016-01-09 12:36:00	2016-01-09	8	3-ldm-038	7	283.94	True	low
6	2016-01-11 06:15:00	2016-01-11	4	2-dhe-923	4	3291.03	True	mid
7	2016-01-15 18:46:00	2016-01-15	7	1-knw-093	3	843.34	True	high
8	2016-01-17 11:27:00	2016-01-17	4	5-boe-639	2	1035.64	False	low
9	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	False	high
10	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	False	high
11	2016-01-26 20:07:00	2016-01-26	4	2-dmx-010	7	833.98	True	low
12	2016-01-28 02:51:00	2016-01-28	2	7-dmx-010	8	108.34	False	low
13	2016-01-30 11:23:00	2016-01-30	1	3-dka-303	None	2230.09	True	high

Using `col()` to Specify a Value Column

We know to use the columns= to supply the target column. Just to recap, if we wanted to check that every value in column a is greater than 2 then the validation step is written something like this:

...
.col_vals_gt(columns="a", value=2, ...)
...

What if you have two numerical columns and have good reason to compare values across those columns? This could be a check that expects every value in a is greater than every value in x (the comparison column). That would take the form:

...
.col_vals_gt(columns="a", value=pb.col("x"), ...)
...

Using the col() helper function here in value= is key. It lets Pointblank know that you’re not using a literal, fixed value for the comparison (you’re specifying a column). So long as you do this, the validation will involve checking that every value a is greater than every adjacent value in x. Here’s an example of this:

import pointblank as pb

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="d", value=pb.col("c"))
    .interrogate()
)

		STEP	COLUMNS	VALUES	TBL	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
#4CA64C66	1	col_vals_gt()	d	c		✓	13	11 0.85	2 0.15	—	—	—

Notice that in the COLUMNS and VALUES columns of the validation report, the two column names are shown: d and c. A thing that could be surprising is that there are two failing test units, even though values in d are consistently larger than values in column c. The reason is that there are missing values (i.e., None values) in column c and any missing value in a comparison check will result in a failing test unit.

When doing a comparison against a fixed value, we only had to worry about missing values in the target column. When comparing across columns, there is potential for missing values in both columns and that could result in correspondingly more failing test units. The corrective here is to use na_pass=True. If you feel missing values (in either column) should be disregarded, this setting is a reasonable choice (and you could always use col_vals_not_null() to perform missing value checks on these columns anyway).

Let’s take a quick look at the results when na_pass=True is used:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="d", value=pb.col("c"), na_pass=True)
    .interrogate()
)

		STEP	COLUMNS	VALUES	TBL	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
#4CA64C	1	col_vals_gt()	d	c		✓	13	13 1.00	0 0.00	—	—	—	—

With that change, every test unit in that single step passes validation.

The validation methods that accept a col() expression in their value= parameter include:

Using `col()` in Range Checks

Two validation methods deal with checking values within and outside a range:

These validation methods both have left= and right= arguments. You can use a mix of literal values and col()-based expressions for these parameters. Here’s an example where we check values in d to be in the range of lower-bound values in column c and a fixed upper-bound value of 10000:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="d", left=pb.col("c"), right=10_000, na_pass=True)
    .interrogate()
)

		STEP	COLUMNS	VALUES	TBL	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
#4CA64C	1	col_vals_between()	d	[c, 10000]		✓	13	13 1.00	0 0.00	—	—	—	—

Observe that the range reported in the VALUES column is [c, 10000]. This is our assurance that the left bound is dependent on values in column c and that the right bound is fixed to a value of 10000. All test units passed here as we were careful about missing values (using na_pass=True) as in the previous example.

Using col() to Specify a Value Column

Using col() in Range Checks

Using `col()` to Specify a Value Column

Using `col()` in Range Checks