Expression-Based Validation

While Pointblank offers many specialized validation methods for common data quality checks, sometimes you need more flexibility for complex validation requirements. This is where expression-based validation with col_vals_expr() comes in.

The col_vals_expr() method allows you to:

  • write custom expressions that evaluate to a boolean result for each row
  • combine conditions across multiple columns in a single validation step
  • apply transformations to column values as part of a check

Now let’s explore how to use these capabilities through a collection of examples!

Basic Usage

At its core, col_vals_expr() validates whether an expression evaluates to True for each row in your data. Here’s a simple example:

import pointblank as pb
import polars as pl

# Load small_table dataset as a Polars DataFrame
small_table_pl = pb.load_dataset(dataset="small_table", tbl_type="polars")

(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(

        # Use Polars expression syntax ---
        expr=pl.col("d") > pl.col("a") * 50,
        brief="Column `d` should be at least 50 times larger than `a`."
    )
    .interrogate()
)
[Validation report: step 1, col_vals_expr(), 13 test units; 12 passed (0.92), 1 failed (0.08).]

In this example, we're validating that for each row, the value in column d is more than 50 times the value in column a.

Notes on Expression Syntax

The expression syntax depends on your table type:

  • Polars: uses Polars expression syntax with pl.col("column_name")
  • Pandas: uses standard Python/NumPy syntax (see the sketch below)

The expression should:

  • evaluate to a boolean result for each row
  • reference columns using the appropriate syntax for your table type
  • use standard operators (+, -, *, /, >, <, ==, etc.)
  • not include assignments
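
To make the Pandas case concrete, here is a minimal sketch of what "standard Python/NumPy syntax" means (it assumes load_dataset() also accepts tbl_type="pandas"): comparing columns of a pandas DataFrame produces a boolean Series with one True/False value per row, which is exactly the row-wise boolean result described above. Check the col_vals_expr() reference for the precise expression form expected for pandas tables.

# Load small_table as a pandas DataFrame (assumes tbl_type="pandas" is supported)
small_table_pd = pb.load_dataset(dataset="small_table", tbl_type="pandas")

# Standard pandas/NumPy syntax: comparing columns yields a boolean Series
# with one True/False value per row (operators only, no assignments)
row_check = small_table_pd["d"] > small_table_pd["a"] * 50
print(row_check.head())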

Complex Expressions

The real power of col_vals_expr() comes with complex expressions that would be difficult to represent using the standard validation methods:

# Load game_revenue dataset as a Polars DataFrame
game_revenue_pl = pb.load_dataset(dataset="game_revenue", tbl_type="polars")

(
    pb.Validate(data=game_revenue_pl)
    .col_vals_expr(

        # Use Polars expression syntax ---
        expr=(pl.col("session_duration") > 20) | (pl.col("item_revenue") > 10),
        brief="Sessions should be either long (>20 min) or high-value (>$10)."
    )
    .interrogate()
)
[Validation report: step 1, col_vals_expr(), 2,000 test units; 1,518 passed (0.76), 482 failed (0.24).]

This validates that either the session duration is longer than 20 minutes OR the item revenue is greater than $10.

Example: Multiple Conditions

You can create sophisticated validations with multiple conditions:

# Create a simple Polars DataFrame
employee_df = pl.DataFrame({
    "age": [25, 30, 15, 40, 35],
    "income": [50000, 75000, 0, 100000, 60000],
    "years_experience": [3, 8, 0, 15, 7]
})

(
    pb.Validate(data=employee_df, tbl_name="employee_data")
    .col_vals_expr(
        # Complex condition with multiple comparisons
        expr=(
            (pl.col("age") >= 18) &
            (pl.col("income") / (pl.col("years_experience") + 1) <= 25000)
        ),
        brief="Adults should have reasonable income-to-experience ratios."
    )
    .interrogate()
)
[Validation report for employee_data: step 1, col_vals_expr(), 5 test units; 4 passed (0.80), 1 failed (0.20).]

Example: Handling Null Values

When working with expressions, consider how to handle null/missing values:

(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        # Check for nulls before division
        expr=(pl.col("c").is_not_null()) & ((pl.col("c") / pl.col("a")) > 1.5),
        brief="Ratio of `c`/`a` should exceed 1.5 (when `c` is not null)."
    )
    .interrogate()
)
[Validation report: step 1, col_vals_expr(), 13 test units; 5 passed (0.38), 8 failed (0.62).]

Best Practices

Here are some tips and tricks for effectively using expression-based validation with col_vals_expr().

Document Your Expressions

Always provide clear documentation in the brief= parameter:

(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        expr=pl.col("d") > pl.col("a") * 1.5,
        # Document which columns are being compared
        brief="Column `d` should be at least 1.5 times larger than column `a`."
    )
    .interrogate()
)
[Validation report: step 1, col_vals_expr(), 13 test units; 13 passed (1.00), 0 failed (0.00).]

Handle Edge Cases

Consider potential edge cases like division by zero or nulls:

(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        # Check denominator before division
        expr=(pl.col("a") != 0) & (pl.col("d") / pl.col("a") > 1.5),
        brief="Ratio of `d`/`a` should exceed 1.5 (avoiding division by zero)."
    )
    .interrogate()
)
[Validation report: step 1, col_vals_expr(), 13 test units; 13 passed (1.00), 0 failed (0.00).]

Test on Small Datasets First

When developing complex expressions, test on a small sample of your data first to ensure your logic is correct before applying it to large datasets.
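
For example, you could prototype the earlier game_revenue check on just the first 100 rows using Polars' head() before interrogating the full 2,000-row table (a minimal sketch reusing the objects defined above):

# Prototype the expression on a small sample before running it at scale
sample_pl = game_revenue_pl.head(100)

(
    pb.Validate(data=sample_pl)
    .col_vals_expr(
        expr=(pl.col("session_duration") > 20) | (pl.col("item_revenue") > 10),
        brief="Prototype: sessions should be long (>20 min) or high-value (>$10)."
    )
    .interrogate()
)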

Conclusion

The col_vals_expr() method provides a powerful way to implement complex validation logic in Pointblank when standard validation methods aren’t sufficient. By leveraging expressions, you can create sophisticated data quality checks tailored to your specific requirements, combining conditions across multiple columns and applying transformations as needed.

This flexibility makes expression-based validation an essential tool for addressing complex data quality scenarios in your validation workflows.