Defining Columns
Most of the validation methods included in Pointblank perform column-level checks. As such, they provide the common argument `columns=`. The pluralization in the name indicates that multiple columns can be provided. And it goes further than that: column selectors can be used in `columns=` to resolve columns.
Why do this? It can often be the case that you’d want to perform a validation check of a certain common type (e.g., checking that numerical values are all positive) across a number of columns. Rather than define the same rules across multiple invocations of the same validation method (one for each column), we can simply map the validation rules across those columns.
We'll run through several examples below, and all of them will use the `game_revenue` dataset, a Polars table with 2,000 rows and 11 columns.
Let's start with the simpler case of providing a list of columns to `columns=` before getting into resolving columns with column selector functions.
Using a List of Column Names
The `columns=` parameter (in every validation method that has that argument) can accept a list of column names. In the `game_revenue` dataset, there are two columns that contain numerical data: `item_revenue` and `session_duration`. Suppose we expect that data in both columns should be greater than `0`; providing a list of those two columns will create two validation steps from a single invocation of `col_vals_gt()`:
```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns=["item_revenue", "session_duration"], value=0)
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_revenue | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | session_duration | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
The validation report table indeed shows that two validation steps were created! Further to this, the interrogation results show that all values in the two columns are greater than `0`.
It's important to note that all validation parameters are shared across all generated steps. So if `thresholds=` were set, those threshold values would be cloned and used in each and every step generated (one step per column provided):
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0,
        thresholds=(0.1, 0.2, 0.3)
    )
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_revenue | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 2 | session_duration | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
This is all to say that if you wanted the same validation rules but different threshold settings, you would have to define two individual validation steps, with only the `thresholds=` values differing.
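Here's a minimal sketch of that pattern; the second set of threshold values is chosen purely for illustration:

```python
# One step per column, each with its own thresholds
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns="item_revenue", value=0, thresholds=(0.1, 0.2, 0.3))
    .col_vals_gt(columns="session_duration", value=0, thresholds=(0.05, 0.1, 0.15))
    .interrogate()
)
```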
Using Pointblank’s Column Selectors
Pointblank includes a few column selector functions for use in `columns=`. If you're new to selectors, what they do is resolve column names, typically based on the following (a brief sketch of each category appears after this list):

- text patterns
- column position
- column type
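The specific selectors shown here are just representative of each category:

```python
import pointblank as pb
import narwhals.selectors as ncs

pb.starts_with("item")  # text pattern: names beginning with "item"
pb.first_n(2)           # column position: the first two columns
ncs.numeric()           # column type: numeric columns (Narwhals selectors, covered below)
```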
Two common ones, `starts_with()` and `ends_with()`, resolve columns based on the starting and ending text in column names.
The `game_revenue` dataset has three columns starting with the text 'item': `item_type`, `item_name`, and `item_revenue`. Let's look at an example where we can succinctly express a validation plan checking that these columns contain no missing values:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_type | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | item_name | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
| 3 | item_revenue | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
As can be seen, three validation steps were created from the use of `columns=pb.starts_with("item")` because those three columns were found in the table.
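As a quick variation, `ends_with()` works the same way. Assuming the dataset's ID columns are named `player_id` and `session_id` (an assumption here, since the preview isn't shown), this sketch would generate two steps:

```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.ends_with("_id"))  # would resolve player_id, session_id
    .interrogate()
)
```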
The complete list of column selectors includes:

- `starts_with()`
- `ends_with()`
- `contains()`
- `matches()`
- `everything()`
- `first_n()`
- `last_n()`

In the next section, we'll learn how they can be combined in interesting ways for more advanced column selection tasks.
Combining Column Selectors
Column selector functions can be combined for extra column selection power. We need two things to make this possible: (1) the `col()` helper function, and (2) one or more logical operators for composing column selectors. Those operators are:
- `&` (and)
- `|` (or)
- `-` (difference)
- `~` (not)
Taken together, if you wanted to select all columns except the first four, you would use this `col()` expression:
```python
col_selection = pb.col(pb.everything() - pb.first_n(4))
```
The logic here is that `everything()` selects every column and the `- pb.first_n(4)` part then removes the first four columns from that selection.
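Given the `~` (not) operator listed above, a hypothetically equivalent expression would negate the first-four selection instead of subtracting it:

```python
# Hypothetically equivalent under the operator semantics listed above
col_selection_alt = pb.col(~pb.first_n(4))
```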
Let's try using the `col_selection` object in a validation of the `game_revenue` table, checking multiple columns for missing values:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=col_selection, thresholds=(1, 0.05, 0.1))
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_type | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 2 | item_name | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 3 | item_revenue | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 4 | session_duration | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 5 | start_day | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 6 | acquisition | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
| 7 | country | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | ○ | ○ | ○ | — |
This column selection worked to select every column except for the first four, resulting in seven separate validation steps.
Even More Selectors: Narwhals Selectors
While Pointblank offers a good selection of selectors, we also built in support for the column selectors available in the Narwhals library. If you don't know Narwhals yet, it's a lightweight compatibility layer between DataFrame libraries. The library is so good that we use it under the hood in Pointblank so that you can easily validate Pandas, Polars, PyArrow, cuDF, and Modin tables. The `narwhals.selectors` module contains the following column selector functions:
- `matches()`
- `by_dtype()`
- `boolean()`
- `categorical()`
- `datetime()`
- `numeric()`
- `string()`
We support the use of these selectors for any input DataFrame that is supported by Narwhals (i.e., is in the aforementioned list). Here's an example using the `numeric()` selector to select all numeric columns in `game_revenue`:
```python
import narwhals.selectors as ncs

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns=ncs.numeric(), value=0)
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_revenue | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | session_duration | > 0 | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
Narwhals selectors can also use operators to combine multiple selectors. Here is an example of using the `string()` and `matches()` selectors together to select all string columns matching the text 'item_':
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.col(ncs.string() & ncs.matches("item_")))
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | item_type | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | item_name | — | ✓ | 2000 | 2000 (1.00) | 0 (0.00) | — | — | — | — |
The combination of Narwhals column selectors as used above serves to constrain the set of columns resolved. And the same set of logical operators (`&`, `|`, `-`, `~`) can be used to compose Narwhals selectors within a `col()` expression.
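For example, the difference operator could narrow a type-based selection. This sketch (where the matched text 'duration' is chosen purely for illustration) would resolve the numeric columns whose names don't match 'duration', leaving just `item_revenue`:

```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(columns=pb.col(ncs.numeric() - ncs.matches("duration")), value=0)
    .interrogate()
)
```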
Caveats for Using Column Selectors
Provided there is systematic column naming already in place, using column selectors like `starts_with()` can be very convenient. This is especially true as column counts become larger.
A slight disadvantage to this approach is some uncertainty over whether the columns being checked actually exist. You might resolve fewer columns than anticipated, or no columns at all, due to errors in using the column selectors or through a misunderstanding of the columns' naming conventions.
Should the use of a column selector yield no columns, the interrogation process itself won't fail; however, the validation report table will strongly signal that there was an evaluation issue:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("items"))
    .interrogate()
)
```
| STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | — | — | 💥 | — | — | — | — | — | — | — |
Aside from the validation step being tinted in red, the `EVAL` column will display an explosion (and there won't be any results). In practice, you would either correct the string supplied to `starts_with()` or take a different approach.
Given the slight bit of uncertainty you get when using column selectors (rather than the explicit use of column names), it's good to also include validation steps that check for the existence of key columns with `col_exists()` (and checking the table schema itself with `col_schema_match()` is also worthwhile here).
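A minimal sketch of that defensive pattern, pairing an explicit `col_exists()` check with the selector-based step (the schema check is omitted for brevity):

```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_exists(columns=["item_type", "item_name", "item_revenue"])  # are the key columns present?
    .col_vals_not_null(columns=pb.starts_with("item"))               # then run the selector-based check
    .interrogate()
)
```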