Defining Columns

Most of the validation methods included in Pointblank perform column-level checks. As such, they share the common columns= argument. The pluralization in the name indicates that multiple columns can be provided, and it goes further than that: column selector functions can also be used in columns= to resolve columns.

Why do this? You'll often want to perform a validation check of a certain common type (e.g., checking that numerical values are all positive) across a number of columns. Rather than defining the same rules across multiple invocations of the same validation method (one for each column), we can simply map the validation rules across those columns.

We’ll run through several examples below, and all of them will use the game_revenue dataset. Here’s a preview of it:

Polars table: 2,000 rows, 11 columns

  player_id          String
  session_id         String
  session_start      Datetime
  time               Datetime
  item_type          String
  item_name          String
  item_revenue       Float64
  session_duration   Float64
  start_day          Date
  acquisition        String
  country            String

(The preview shows the first five and last five rows of the table.)

Let’s start with the simpler case of providing a list of columns to columns= before getting into resolving columns with column selector functions.

Using a List of Column Names

The columns= parameter (in every validation method that has that argument) can accept a list of column names. In the game_revenue dataset, there are two columns that contain numerical data: item_revenue and session_duration. Suppose we expect the data in both columns to be greater than 0; providing a list of those two columns will create two validation steps from a single invocation of col_vals_gt():

import pointblank as pb

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns=["item_revenue", "session_duration"], value=0)
    .interrogate()
)
STEP  ASSERTION      COLUMNS           VALUES  UNITS  PASS         FAIL
1     col_vals_gt()  item_revenue      0       2000   2000 (1.00)  0 (0.00)
2     col_vals_gt()  session_duration  0       2000   2000 (1.00)  0 (0.00)

The validation report table indeed shows that two validation steps were created! Further to this, the interrogation results show that all values in the two columns are greater than 0.

It’s important to note that all validation parameters are shared across all generated steps. So if thresholds= were to be set, those threshold values would be cloned and used in each and every step generated (one step per column provided):

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0,
        thresholds=(0.1, 0.2, 0.3)
    )
    .interrogate()
)
STEP  ASSERTION      COLUMNS           VALUES  UNITS  PASS         FAIL
1     col_vals_gt()  item_revenue      0       2000   2000 (1.00)  0 (0.00)
2     col_vals_gt()  session_duration  0       2000   2000 (1.00)  0 (0.00)

This is all to say that if you wanted the same validation rules but different threshold settings, you would have to define two individual validation steps with only the thresholds= values differing.
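Here's a minimal sketch of that pattern (the specific threshold values are just for illustration):

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Stricter thresholds for the revenue column
    .col_vals_gt(columns="item_revenue", value=0, thresholds=(0.01, 0.05, 0.1))
    # Looser thresholds for the duration column
    .col_vals_gt(columns="session_duration", value=0, thresholds=(0.1, 0.2, 0.3))
    .interrogate()
)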

Using Pointblank’s Column Selectors

Pointblank includes a few column selector functions for use in columns=. If you’re new to selectors, what they do is resolve column names, typically based on:

  • text patterns
  • column position
  • column type

Two common ones, starts_with() and ends_with(), resolve columns based on starting and ending text in column names.

The game_revenue dataset has three columns starting with the text ‘item’: item_type, item_name, and item_revenue. Let’s look at an example where we can succinctly express a validation plan checking that these columns contain no missing values:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)
STEP  ASSERTION            COLUMNS       VALUES  UNITS  PASS         FAIL
1     col_vals_not_null()  item_type     —       2000   2000 (1.00)  0 (0.00)
2     col_vals_not_null()  item_name     —       2000   2000 (1.00)  0 (0.00)
3     col_vals_not_null()  item_revenue  —       2000   2000 (1.00)  0 (0.00)

As can be seen, three validation steps were created from the use of columns=pb.starts_with("item") because those three columns were found in the table.
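ends_with() works the same way but matches on trailing text. As a quick sketch (output not shown), the two ID columns could be checked for missing values like this:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Resolves player_id and session_id
    .col_vals_not_null(columns=pb.ends_with("_id"))
    .interrogate()
)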

Pointblank provides several more column selectors beyond these two, and in the next section we’ll learn how they can be combined in interesting ways for more advanced column selection tasks.

Combining Column Selectors

Column selector functions can be combined for extra column selection power. We need to use two things to make this possible: (1) the col() helper function, and (2) one or more logical operators for composing column selectors. Those operators are:

  • & (and)
  • | (or)
  • - (difference)
  • ~ (not)

Taken together, if you wanted to select all columns except the first four, you would use this col() expression:

col_selection = pb.col(pb.everything() - pb.first_n(4))

The logic here is that everything() selects every column and the - pb.first_n(4) part then removes the first four columns from that selection.

Let’s try using the col_selection object in a validation of the game_revenue table, checking multiple columns for missing values:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=col_selection, thresholds=(1, 0.05, 0.1))
    .interrogate()
)
STEP  ASSERTION            COLUMNS           VALUES  UNITS  PASS         FAIL
1     col_vals_not_null()  item_type         —       2000   2000 (1.00)  0 (0.00)
2     col_vals_not_null()  item_name         —       2000   2000 (1.00)  0 (0.00)
3     col_vals_not_null()  item_revenue      —       2000   2000 (1.00)  0 (0.00)
4     col_vals_not_null()  session_duration  —       2000   2000 (1.00)  0 (0.00)
5     col_vals_not_null()  start_day         —       2000   2000 (1.00)  0 (0.00)
6     col_vals_not_null()  acquisition       —       2000   2000 (1.00)  0 (0.00)
7     col_vals_not_null()  country           —       2000   2000 (1.00)  0 (0.00)

This column selection worked to select every column except for the first four, resulting in seven separate validation steps.
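The other operators compose in much the same way. As a brief sketch (based on the column naming in game_revenue; output not shown):

# `&` keeps only the columns matched by both selectors: just item_revenue
both = pb.col(pb.starts_with("item") & pb.ends_with("revenue"))

# `~` negates a selection: every column whose name doesn't start with "session"
negated = pb.col(~pb.starts_with("session"))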

Even More Selectors: Narwhals Selectors

While Pointblank offers a good selection of selectors, we’ve also built in support for the column selectors available in the Narwhals library. If you don’t know Narwhals yet, it’s a lightweight compatibility layer between DataFrame libraries. It’s so good that we use it under the hood in Pointblank so that you can easily validate Pandas, Polars, PyArrow, cuDF, and Modin tables. The narwhals.selectors module contains the following column selector functions:

  • matches()
  • by_dtype()
  • boolean()
  • categorical()
  • datetime()
  • numeric()
  • string()

We support the use of these selectors for any input DataFrame that is supported by Narwhals (i.e., is in the aforementioned list). Here’s an example using the numeric() selector to select all numeric columns in game_revenue:

import narwhals.selectors as ncs

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns=ncs.numeric(), value=0)
    .interrogate()
)
STEP  ASSERTION      COLUMNS           VALUES  UNITS  PASS         FAIL
1     col_vals_gt()  item_revenue      0       2000   2000 (1.00)  0 (0.00)
2     col_vals_gt()  session_duration  0       2000   2000 (1.00)  0 (0.00)

Narwhals selectors can also use operators to combine multiple selectors. Here is an example of using the string() and matches() selectors together to select all string columns matching the text ‘item_’:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.col(ncs.string() & ncs.matches("item_")))
    .interrogate()
)
STEP  ASSERTION            COLUMNS    VALUES  UNITS  PASS         FAIL
1     col_vals_not_null()  item_type  —       2000   2000 (1.00)  0 (0.00)
2     col_vals_not_null()  item_name  —       2000   2000 (1.00)  0 (0.00)

The combination of Narwhals column selectors used above serves to constrain the set of columns resolved. The same set of logical operators (&, |, -, ~) can be used to compose Narwhals selectors within a col() expression.
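For instance, here's a small sketch using the - operator to resolve the numeric columns other than those matching the text 'revenue' (which leaves just session_duration; output not shown):

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Numeric columns, minus those matching "revenue"
    .col_vals_gt(columns=pb.col(ncs.numeric() - ncs.matches("revenue")), value=0)
    .interrogate()
)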

Caveats for Using Column Selectors

Provided there is systematic column naming already in place, using column selectors like starts_with() can be very convenient. This is especially true as column counts become larger.

A slight disadvantage of this approach is some uncertainty about whether the columns being checked actually exist. You might resolve fewer columns than anticipated, or none at all, due to errors in using the column selectors or a misunderstanding of the columns’ naming conventions.

Should the use of a column selector yield no columns, the interrogation process itself won’t fail; however, the validation report table will strongly signal that there was an evaluation issue:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("items"))
    .interrogate()
)
STEP  ASSERTION            COLUMNS                                          EVAL
1     col_vals_not_null()  StartsWith(text='items', case_sensitive=False)   💥

Aside from the validation step being tinted in red, the EVAL column will display an explosion (and there won’t be any results). In practice, you would either correct the string supplied to starts_with() or take a different approach.

Given the slight uncertainty you take on when using column selectors (rather than explicit column names), it’s good to also include validation steps that check for the existence of key columns with col_exists() (and checking the table schema itself with col_schema_match() is also worthwhile here).
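As a rough sketch of that combined approach (output not shown):

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # First, confirm that the key columns exist at all
    .col_exists(columns=["item_type", "item_name", "item_revenue"])
    # Then run the selector-based check across them
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)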