Working with Columns

Data validation often requires working with columns in flexible ways. Pointblank offers two powerful approaches:

  1. Applying validation rules across multiple columns: validate many columns with a single rule
  2. Comparing values between columns: create validations that compare values across different columns

This guide covers both approaches in detail with practical examples.

Part 1: Applying Rules Across Multiple Columns

Many of Pointblank’s validation methods perform column-level checks. These methods provide the columns= parameter, which accepts not just a single column name but multiple columns through various selection methods.

Why is this useful? Often you’ll want to perform the same validation check (e.g., checking that numerical values are all positive) across multiple columns. Rather than defining the same rules multiple times, you can map the validation across those columns in a single step.

Let’s explore this using the game_revenue dataset:

game_revenue is a Polars DataFrame with 2,000 rows and 11 columns:

  • player_id (String)
  • session_id (String)
  • session_start (Datetime)
  • time (Datetime)
  • item_type (String)
  • item_name (String)
  • item_revenue (Float64)
  • session_duration (Float64)
  • start_day (Date)
  • acquisition (String)
  • country (String)

Using a List of Column Names

The simplest way to validate multiple columns is to provide a list to the columns= parameter. In the game_revenue dataset, we have two columns with numerical data: item_revenue and session_duration. If we expect all values in both columns to be greater than 0, we can write:

import pointblank as pb

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0
    )
    .interrogate()
)
Validation report: two col_vals_gt() steps, one for item_revenue and one for session_duration, each with 2,000 test units and no failures.

The validation report shows two validation steps were created from a single method call! All validation parameters are shared across all generated steps, including thresholds and briefs:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=["item_revenue", "session_duration"],
        value=0,
        thresholds=(0.1, 0.2, 0.3),
        brief="`{col}` must be greater than zero."
    )
    .interrogate()
)
Validation report: two col_vals_gt() steps with the briefs "item_revenue must be greater than zero." and "session_duration must be greater than zero.", each with 2,000 test units and no failures.

In this example, the validation report displays a customized brief for each column (“item_revenue must be greater than zero.” and “session_duration must be greater than zero.”), with the column name substituted automatically via the {col} placeholder in the brief template. These dynamically generated briefs are particularly helpful when reviewing reports: they give each validation step a clear, human-readable description, making reports produced from a single multi-column call easier to understand for both technical and non-technical stakeholders.

Using Pointblank’s Column Selectors

For more advanced column selection, Pointblank provides selector functions that resolve columns based on:

  • text patterns in column names
  • column position
  • column data type

Two common selectors, starts_with() and ends_with(), resolve columns based on text patterns in column names.

The game_revenue dataset has three columns starting with “item”: item_type, item_name, and item_revenue. Let’s check that these columns contain no missing values:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)
Validation report: three col_vals_not_null() steps, one each for item_type, item_name, and item_revenue, each with 2,000 test units and no failures.

Three validation steps were automatically created because three columns matched the pattern.

Pointblank’s column selectors include starts_with(), ends_with(), contains(), matches(), everything(), first_n(), and last_n().
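
For instance, here’s a minimal sketch using ends_with(): in game_revenue, the selector ends_with("_id") should resolve to player_id and session_id, so this single call would create two col_vals_not_null() steps:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Check the two identifier columns for missing values in one call
    .col_vals_not_null(columns=pb.ends_with("_id"))
    .interrogate()
)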

Combining Column Selectors

Column selectors can be combined for more powerful selection. To do this, use the col() helper function with logical operators:

  • & (and)
  • | (or)
  • - (difference)
  • ~ (not)

For example, to select all columns except the first four:

col_selection = pb.col(pb.everything() - pb.first_n(4))

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(
        columns=col_selection,
        thresholds=(1, 0.05, 0.1)
    )
    .interrogate()
)
Validation report: seven col_vals_not_null() steps covering item_type, item_name, item_revenue, session_duration, start_day, acquisition, and country, each with 2,000 test units and no failures.

This selects every column except the first four, resulting in seven validation steps.
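
The other operators follow the same pattern. As another minimal sketch, the | (or) operator unions two selectors; in game_revenue this should resolve to the three “item” columns plus the two identifier columns, producing five validation steps:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Columns that either start with "item" or end with "_id"
    .col_vals_not_null(columns=pb.col(pb.starts_with("item") | pb.ends_with("_id")))
    .interrogate()
)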

Narwhals Selectors

Pointblank also supports column selectors from the Narwhals library, which include:

  • matches()
  • by_dtype()
  • boolean()
  • categorical()
  • datetime()
  • numeric()
  • string()

Here’s an example selecting all numeric columns:

import narwhals.selectors as ncs

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_gt(
        columns=ncs.numeric(),
        value=0
    )
    .interrogate()
)
Validation report: two col_vals_gt() steps, one for item_revenue and one for session_duration, each with 2,000 test units and no failures.

And selecting all string columns matching “item_”:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.col(ncs.string() & ncs.matches("item_")))
    .interrogate()
)
Validation report: two col_vals_not_null() steps, one for item_type and one for item_name, each with 2,000 test units and no failures.

This example demonstrates the power of combining Narwhals selectors with logical operators: ncs.string() selects the string columns and ncs.matches("item_") narrows the selection to those whose names contain “item_”. This kind of targeted selection is especially valuable in wide datasets with consistent column naming conventions, letting you apply the appropriate validation rules to logically grouped columns without listing each one explicitly.

Caveats for Using Column Selectors

While column selectors are powerful, there are some caveats. If a selector doesn’t match any columns, the validation won’t fail but will show an ‘explosion’ in the report:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=pb.starts_with("items"))
    .col_vals_gt(
        columns="item_revenue",
        value=0
    )
    .interrogate()
)
Validation report: Step 1 (col_vals_not_null()) shows StartsWith(text='items', case_sensitive=False) in the COLUMNS field with an explosion marker (💥) because no columns matched; Step 2 (col_vals_gt() on item_revenue) ran normally with 2,000 test units and no failures.

Notice that although Step 1 had a problem (one that should be addressed), the interrogation still moved on to Step 2 without complication.

To mitigate uncertainty, include validation steps that check for the existence of key columns with col_exists() or verify the schema with col_schema_match().
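
For instance, here’s a sketch that places a col_exists() check ahead of the selector-based step, so a missing or renamed column is reported explicitly rather than silently matching nothing:

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    # Fail visibly if any of the key columns is missing or renamed
    .col_exists(columns=["item_type", "item_name", "item_revenue"])
    # The selector-based step then maps over whatever columns match
    .col_vals_not_null(columns=pb.starts_with("item"))
    .interrogate()
)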

Part 2: Comparing Values Between Columns

Sometimes you need to compare values across different columns rather than against fixed values. Pointblank enables this through the col() helper function.

Let’s look at examples using the small_table dataset:

small_table is a Polars DataFrame with 13 rows and 8 columns:

  • date_time (Datetime)
  • date (Date)
  • a (Int64)
  • b (String)
  • c (Int64)
  • d (Float64)
  • e (Boolean)
  • f (String)

Note that column c contains a couple of missing values; these will come up in the comparisons below.

Using col() to Specify a Comparison Column

Validation methods are typically used to compare column values against fixed values:

...
.col_vals_gt(columns="a", value=2, ...)
...

We can also compare values between columns by using col() in the value= parameter:

...
.col_vals_gt(columns="a", value=pb.col("x"), ...)
...

This checks that each value in column a is greater than the corresponding value in column x. Here’s a concrete example:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns="d",
        value=pb.col("c")
    )
    .interrogate()
)
Validation report: one col_vals_gt() step comparing d against c: 13 test units, 11 passing (0.85), 2 failing (0.15).

Notice that the validation report shows both column names (d and c). There are two failing test units because of missing values in column c. When comparing across columns, missing values in either column can cause failures.

To handle missing values, use na_pass=True:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns="d",
        value=pb.col("c"),
        na_pass=True
    )
    .interrogate()
)
Validation report: one col_vals_gt() step comparing d against c: 13 test units, all passing.

Now all test units pass.

Validation methods that accept a col() expression in their value= parameter include col_vals_gt(), col_vals_ge(), col_vals_lt(), col_vals_le(), col_vals_eq(), and col_vals_ne().
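
Each of these works just like the col_vals_gt() example above. As a quick sketch, here’s col_vals_le() checking that every value in column a is less than or equal to the corresponding value in column d:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_le(
        columns="a",
        value=pb.col("d"),
        na_pass=True
    )
    .interrogate()
)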

Using col() in Range Checks

For range validations via col_vals_between() and col_vals_outside(), you can use a mix of column references and fixed values:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(
        columns="d",
        left=pb.col("c"),
        right=10_000,
        na_pass=True
    )
    .interrogate()
)
Validation report: one col_vals_between() step on d with VALUES shown as [c, 10000]: 13 test units, all passing.

The validation report shows the range as [c, 10000], indicating that the lower bound comes from column c while the upper bound is fixed at 10000.
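
Both bounds can also come from columns. As a purely illustrative sketch (no claims about how many rows pass), this checks whether each value in c falls between the corresponding values in a and d:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(
        columns="c",
        left=pb.col("a"),
        right=pb.col("d"),
        na_pass=True
    )
    .interrogate()
)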

Advanced Examples: Combining Both Approaches

The true power comes from combining both approaches: validating multiple columns and using cross-column comparisons:

validation = (
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns=["c", "d"],
        value=pb.col("a"),
        na_pass=True
    )
    .interrogate()
)

validation
Validation report: Step 1 (c compared against a): 13 test units, 8 passing (0.62), 5 failing (0.38). Step 2 (d compared against a): 13 test units, all passing.

This creates two validation steps, checking that values in columns c and d are each greater than their corresponding values in column a. As the report shows, every d > a comparison passes, while the c > a comparison has five failing test units.

Conclusion

Pointblank provides flexible approaches to working with columns:

  1. Column selection: validate multiple columns with a single validation rule
  2. Cross-column comparison: compare values between columns

These capabilities allow you to:

  • write more concise validation code
  • apply consistent validation rules across similar columns
  • create dynamic validations that check relationships between columns
  • build comprehensive data quality checks with minimal code

By mastering these techniques, you can create more elegant and powerful validation plans while reducing repetition in your code.