(Table preview: the `game_revenue` dataset — a Polars table with 2,000 rows and 11 columns.)
Working with Columns
Data validation often requires working with columns in flexible ways. Pointblank offers two powerful approaches:
- Applying validation rules across multiple columns: validate many columns with a single rule
- Comparing values between columns: create validations that compare values across different columns
This guide covers both approaches in detail with practical examples.
Part 1: Applying Rules Across Multiple Columns
Many of Pointblank’s validation methods perform column-level checks. These methods provide the `columns=` parameter, which accepts not just a single column name but multiple columns through various selection methods.
Why is this useful? Often you’ll want to perform the same validation check (e.g., checking that numerical values are all positive) across multiple columns. Rather than defining the same rules multiple times, you can map the validation across those columns in a single step.
Let’s explore this using the `game_revenue` dataset:
Using a List of Column Names
The simplest way to validate multiple columns is to provide a list to the `columns=` parameter. In the `game_revenue` dataset, we have two columns with numerical data: `item_revenue` and `session_duration`. If we expect all values in both columns to be greater than `0`, we can write:
```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0
    )
    .interrogate()
)
```
(Validation report: two `col_vals_gt()` steps, 2,000 test units each — all passing, 0 failing.)
The validation report shows two validation steps were created from a single method call! All validation parameters are shared across all generated steps, including thresholds and briefs:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0,
        thresholds=(0.1, 0.2, 0.3),
        brief="`{col}` must be greater than zero."
    )
    .interrogate()
)
```
(Validation report: two `col_vals_gt()` steps, 2,000 test units each — all passing, 0 failing; the W/E/C threshold indicators are now displayed since thresholds were set.)
In this example, you can see that the validation report displays customized briefs for each column (“`item_revenue` must be greater than zero.” and “`session_duration` must be greater than zero.”), automatically substituting the column name using the `{col}` placeholder in the brief template. This feature is particularly helpful when reviewing reports, as it provides clear, human-readable descriptions of what each validation step is checking. When working with multiple columns through a single validation call, these dynamically generated briefs make your validation reports more understandable for both technical and non-technical stakeholders.
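The substitution itself amounts to simple string templating. As a toy illustration (plain Python, not Pointblank's internal code), expanding a brief template per generated step might look like:

```python
# Toy illustration: expand a brief template once per targeted column.
# This is NOT Pointblank's implementation, just the underlying idea.
template = "`{col}` must be greater than zero."

columns = ["item_revenue", "session_duration"]
briefs = [template.format(col=c) for c in columns]

for b in briefs:
    print(b)
```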
Using Pointblank’s Column Selectors
For more advanced column selection, Pointblank provides selector functions that resolve columns based on:
- text patterns in column names
- column position
- column data type
Two common selectors, `starts_with()` and `ends_with()`, resolve columns based on text patterns in column names.
The `game_revenue` dataset has three columns starting with “item”: `item_type`, `item_name`, and `item_revenue`. Let’s check that these columns contain no missing values:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)
```
(Validation report: three `col_vals_not_null()` steps, one per matched column, 2,000 test units each — all passing, 0 failing.)
Three validation steps were automatically created because three columns matched the pattern.
Pointblank offers a full family of column selectors along these lines; besides `starts_with()` and `ends_with()`, this guide also makes use of `everything()` and `first_n()` below.
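Under the hood, a pattern-based selector simply filters the table's column names. A minimal sketch of the idea (toy code, not Pointblank's implementation):

```python
# Toy sketch: resolve columns by name prefix, the way starts_with("item") does.
def resolve_starts_with(prefix, column_names):
    return [name for name in column_names if name.startswith(prefix)]

# Column names taken from the game_revenue examples in this guide.
cols = ["item_type", "item_name", "item_revenue", "session_duration"]
print(resolve_starts_with("item", cols))
```

A prefix that matches nothing resolves to an empty list, which is exactly the "no columns matched" situation discussed in the caveats section below.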
Combining Column Selectors
Column selectors can be combined for more powerful selection. To do this, use the `col()` helper function with logical operators:
- `&` (and)
- `|` (or)
- `-` (difference)
- `~` (not)
For example, to select all columns except the first four:
```python
col_selection = pb.col(pb.everything() - pb.first_n(4))

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(
        columns=col_selection,
        thresholds=(1, 0.05, 0.1)
    )
    .interrogate()
)
```
(Validation report: seven `col_vals_not_null()` steps, 2,000 test units each — all passing, 0 failing, with W/E/C threshold indicators shown.)
This selects every column except the first four, resulting in seven validation steps.
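Conceptually, combining selectors is just set logic over the table's column names. A toy sketch of the four operators (plain Python sets, not Pointblank's implementation; column names taken from earlier examples):

```python
# Toy sketch: selector combination as set operations over column names.
cols = ["item_type", "item_name", "item_revenue", "session_duration"]

starts_item = {c for c in cols if c.startswith("item")}
ends_revenue = {c for c in cols if c.endswith("revenue")}

both = starts_item & ends_revenue        # & (and): in both selections
either = starts_item | ends_revenue      # | (or): in at least one selection
diff = starts_item - ends_revenue        # - (difference): in the first but not the second
complement = set(cols) - starts_item     # ~ (not), expressed here as a set complement

print(both, either, diff, complement)
```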
Narwhals Selectors
Pointblank also supports column selectors from the Narwhals library, which include:
- `matches()`
- `by_dtype()`
- `boolean()`
- `categorical()`
- `datetime()`
- `numeric()`
- `string()`
Here’s an example selecting all numeric columns:
```python
import narwhals.selectors as ncs

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=ncs.numeric(),
        value=0
    )
    .interrogate()
)
```
(Validation report: two `col_vals_gt()` steps, one per numeric column, 2,000 test units each — all passing, 0 failing.)
And selecting all string columns matching “item_”:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.col(ncs.string() & ncs.matches("item_")))
    .interrogate()
)
```
(Validation report: two `col_vals_not_null()` steps, 2,000 test units each — all passing, 0 failing.)
This example demonstrates the power of combining Narwhals selectors with logical operators. By using `ncs.string()` to select string columns and then filtering with `ncs.matches("item_")`, we can precisely target text columns with specific naming patterns. This type of targeted selection is particularly valuable when working with wide datasets that have consistent column naming conventions, allowing you to apply appropriate validation rules to logically grouped columns without explicitly listing each one.
Caveats for Using Column Selectors
While column selectors are powerful, there are some caveats. If a selector doesn’t match any columns, the validation won’t fail but will show an ‘explosion’ in the report:
```python
(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("items"))
    .col_vals_gt(
        columns="item_revenue",
        value=0
    )
    .interrogate()
)
```
(Validation report: Step 1, `col_vals_not_null()`, could not be evaluated — shown as 💥 — because no columns matched the `"items"` pattern; Step 2, `col_vals_gt()`, ran normally with 2,000 test units, all passing.)
Notice that although there was a problem with Step 1 (that should be addressed), the interrogation did move on to Step 2 without complication.
To mitigate uncertainty, include validation steps that check for the existence of key columns with `col_exists()` or verify the schema with `col_schema_match()`.
Part 2: Comparing Values Between Columns
Sometimes you need to compare values across different columns rather than against fixed values. Pointblank enables this through the `col()` helper function.
Let’s look at examples using the `small_table` dataset:
(Table preview: the `small_table` dataset — a Polars table with 13 rows and 8 columns.)
Using col() to Specify a Comparison Column
While we typically use validation methods to compare column values against fixed values:
```python
...
.col_vals_gt(columns="a", value=2, ...)
...
```
We can also compare values between columns by using `col()` in the `value=` parameter:
```python
...
.col_vals_gt(columns="a", value=pb.col("x"), ...)
...
```
This checks that each value in column `a` is greater than the corresponding value in column `x`. Here’s a concrete example:
```python
(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns="d",
        value=pb.col("c")
    )
    .interrogate()
)
```
Notice that the validation report shows both column names (`d` and `c`). There are two failing test units because of missing values in column `c`. When comparing across columns, missing values in either column can cause failures.
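To see why, consider how each row becomes a test unit. A toy sketch of a greater-than check across two columns (plain Python, not Pointblank's implementation, with `None` standing in for a missing value):

```python
# Toy sketch: row-by-row cross-column comparison with missing values.
def gt_units(left, right, na_pass=False):
    units = []
    for l, r in zip(left, right):
        if l is None or r is None:
            # A missing value on either side fails the unit unless na_pass=True.
            units.append(na_pass)
        else:
            units.append(l > r)
    return units

d = [5.0, 2.0, 9.0]
c = [1.0, None, 3.0]
print(gt_units(d, c))               # the row with the missing value fails
print(gt_units(d, c, na_pass=True)) # ...unless missing values are allowed to pass
```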
To handle missing values, use `na_pass=True`:
```python
(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns="d",
        value=pb.col("c"),
        na_pass=True
    )
    .interrogate()
)
```
(Validation report: one `col_vals_gt()` step, 13 test units — all passing, 0 failing.)
Now all tests pass.
Comparison methods such as `col_vals_gt()` accept a `col()` expression in their `value=` parameter; the range checks `col_vals_between()` and `col_vals_outside()` accept one in their `left=` and `right=` parameters as well.
Using col() in Range Checks
For range validations via `col_vals_between()` and `col_vals_outside()`, you can use a mix of column references and fixed values:
```python
(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(
        columns="d",
        left=pb.col("c"),
        right=10_000,
        na_pass=True
    )
    .interrogate()
)
```
(Validation report: one `col_vals_between()` step, 13 test units — all passing, 0 failing.)
The validation report shows the range as `[c, 10000]`, indicating that the lower bound comes from column `c` while the upper bound is fixed at `10000`.
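The same per-row logic applies to range checks, with one bound read from a column and the other fixed. A toy sketch (plain Python with illustrative values, not Pointblank's implementation):

```python
# Toy sketch: between-check with a per-row lower bound and a fixed upper bound.
def between_units(values, lower, upper, na_pass=False):
    out = []
    for v, lo in zip(values, lower):
        if v is None or lo is None:
            # Missing values fail the unit unless na_pass=True.
            out.append(na_pass)
        else:
            out.append(lo <= v <= upper)
    return out

d = [100.0, 2500.0, None]   # hypothetical values, not the real small_table data
c = [50.0, None, 20.0]
print(between_units(d, c, 10_000))
print(between_units(d, c, 10_000, na_pass=True))
```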
Advanced Examples: Combining Both Approaches
The true power comes from combining both approaches: validating multiple columns and using cross-column comparisons:
```python
validation = (
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns=["c", "d"],
        value=pb.col("a"),
        na_pass=True
    )
    .interrogate()
)

validation
```
This creates validation steps checking that values in both columns `c` and `d` are greater than their corresponding values in column `a`.
Conclusion
Pointblank provides flexible approaches to working with columns:
- Column selection: validate multiple columns with a single validation rule
- Cross-column comparison: compare values between columns
These capabilities allow you to:
- write more concise validation code
- apply consistent validation rules across similar columns
- create dynamic validations that check relationships between columns
- build comprehensive data quality checks with minimal code
By mastering these techniques, you can create more elegant and powerful validation plans while reducing repetition in your code.