YAML Reference

This reference provides a comprehensive guide to all YAML keys and parameters supported by Pointblank’s YAML validation workflows. Use this document as a quick lookup when building validation configurations.

Global Configuration Keys

Top-level Structure

tbl: data_source                       # REQUIRED: Data source specification
df_library: "polars"                   # OPTIONAL: DataFrame library ("polars", "pandas", "duckdb")
tbl_name: "Custom Table Name"          # OPTIONAL: Human-readable table name
label: "Validation Description"        # OPTIONAL: Description for the validation workflow
lang: "en"                             # OPTIONAL: Language code (default: "en")
locale: "en"                           # OPTIONAL: Locale setting (default: "en")
brief: "Global brief: {auto}"          # OPTIONAL: Global brief template
thresholds:                            # OPTIONAL: Global failure thresholds
  warning: 0.1
  error: 0.2
  critical: 0.3
actions:                               # OPTIONAL: Global failure actions
  warning: "Warning message template"
  error: "Error message template"
  critical: "Critical message template"
  highest_only: false
steps:                                 # REQUIRED: List of validation steps
  - validation_method_name
  - validation_method_name:
      parameter: value
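
Putting these keys together, a minimal configuration might look like the sketch below; it uses only keys documented above and the built-in small_table dataset (whose columns a and d also appear in later examples), so adjust names and thresholds for your own data:

tbl: small_table
tbl_name: "Small Table Checks"
label: "Minimal example workflow"
thresholds:
  warning: 0.1
  error: 0.2
steps:
  - col_vals_not_null:
      columns: [a]
  - col_vals_gt:
      columns: [d]
      value: 0

Passed to pb.yaml_interrogate() as a YAML string (as in the template example later in this section), this produces an interrogation result like any other workflow.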

Data Source (tbl)

The tbl key specifies the data source and supports multiple formats:

# File paths
tbl: "data/file.csv"
tbl: "data/file.parquet"

# Built-in datasets
tbl: small_table
tbl: game_revenue
tbl: nycflights

# Python expressions for complex data loading
tbl:
  python: |
    pl.scan_csv("data.csv").filter(pl.col("date") >= "2024-01-01")

Using Templates with set_tbl=

For reusable validation templates that always receive their data through the set_tbl= parameter of yaml_interrogate(), the tbl field is still required, but its value does not matter because it is overridden at run time. Recommended approaches:

# Option 1: Use a valid dataset name (gets overridden anyway)
tbl: small_table  # Will be ignored when `set_tbl=` is used

# Option 2: Use YAML null (clearest semantic intent)
tbl: null  # Indicates table will be provided via `set_tbl=`

When using yaml_interrogate() with set_tbl=, the validation template becomes fully reusable:

# Define reusable template
template = """
tbl: null  # Will be overridden
tbl_name: "Sales Validation"
steps:
  - col_exists:
      columns: [customer_id, revenue, region]
  - col_vals_gt:
      columns: [revenue]
      value: 0
"""

# Apply to different datasets
q1_result = pb.yaml_interrogate(template, set_tbl=q1_data)
q2_result = pb.yaml_interrogate(template, set_tbl=q2_data)

DataFrame Library (df_library)

The df_library key controls which DataFrame library is used to load data sources. This parameter affects both built-in datasets and file loading:

# Use Polars DataFrames (default)
df_library: polars

# Use Pandas DataFrames
df_library: pandas

# Use DuckDB tables (via Ibis)
df_library: duckdb

Examples with different libraries:

# Load built-in dataset as Pandas DataFrame
tbl: small_table
df_library: pandas
steps:
  - specially:
      expr: "lambda df: df.assign(validation_result=df['a'] > 0)"

# Load CSV file as Polars DataFrame
tbl: "data/sales.csv"
df_library: polars
steps:
  - col_vals_gt:
      columns: amount
      value: 0

# Load dataset as DuckDB table
tbl: nycflights
df_library: duckdb
steps:
  - row_count_match:
      count: 336776

The df_library parameter is particularly useful when:

  • using validation expressions that require specific DataFrame APIs (e.g., Pandas .assign(), Polars .select())
  • integrating with existing pipelines that use a specific DataFrame library
  • optimizing performance for different data sizes and operations
  • ensuring compatibility with downstream processing steps

Global Thresholds

Thresholds define when validation failures trigger different severity levels:

thresholds:
  warning: 0.05    # 5% failure rate triggers warning
  error: 0.10      # 10% failure rate triggers error
  critical: 0.15   # 15% failure rate triggers critical

  • values: numbers between 0 and 1 are read as failure proportions (e.g., 0.05 = 5%), while integers of 1 or greater are read as absolute counts of failing test units (a count-based example follows this list)
  • levels: warning, error, critical
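
Thresholds can also be given as absolute counts rather than proportions. A sketch mixing both forms:

thresholds:
  warning: 10      # 10 failing test units trigger a warning
  error: 0.10      # 10% failure rate triggers an error
  critical: 0.25   # 25% failure rate triggers a critical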

Global Actions

Actions define responses when thresholds are exceeded. When supplying a string to a severity level (‘warning’, ‘error’, ‘critical’), you can use template variables that will be automatically substituted with contextual information:

actions:
  warning: "Warning: {n_failed} failures in step {step}"
  error:
    python: |
      lambda: print("Error detected!")
  critical: "Critical failure at {time}"
  highest_only: false        # Execute all applicable actions vs. only highest severity

Template variables available for action strings:

  • {step}: current validation step number
  • {col}: column name(s) being validated
  • {val}: validation value or threshold
  • {n_failed}: number of failing records
  • {n}: total number of records
  • {type}: validation method type
  • {level}: severity level (‘warning’/‘error’/‘critical’)
  • {time}: timestamp of validation
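
Any of these variables can be combined in a single message string. A sketch:

actions:
  error: "Step {step} ({type}) on column {col}: {n_failed} of {n} test units failed [{level} at {time}]"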

Validation Methods Reference

Column Value Validations

Comparison Methods

col_vals_gt: are column data greater than a fixed value or data in another column?

- col_vals_gt:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    value: 100                         # REQUIRED: Comparison value
    na_pass: true                      # OPTIONAL: Pass NULL values (default: false)
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must be > 100"      # OPTIONAL: Step description

col_vals_lt: are column data less than a fixed value or data in another column?

- col_vals_lt:
    columns: [column_name]
    value: 100
    na_pass: true
    # ... (same parameters as col_vals_gt)

col_vals_ge: are column data greater than or equal to a fixed value or data in another column?

- col_vals_ge:
    columns: [column_name]
    value: 100
    na_pass: true
    # ... (same parameters as col_vals_gt)

col_vals_le: are column data less than or equal to a fixed value or data in another column?

- col_vals_le:
    columns: [column_name]
    value: 100
    na_pass: true
    # ... (same parameters as col_vals_gt)

col_vals_eq: are column data equal to a fixed value or data in another column?

- col_vals_eq:
    columns: [column_name]
    value: "expected_value"
    na_pass: true
    # ... (same parameters as col_vals_gt)

col_vals_ne: are column data not equal to a fixed value or data in another column?

- col_vals_ne:
    columns: [column_name]
    value: "forbidden_value"
    na_pass: true
    # ... (same parameters as col_vals_gt)

Range Methods

col_vals_between: are column data between two specified values (inclusive)?

- col_vals_between:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    left: 0                            # REQUIRED: Lower bound
    right: 100                         # REQUIRED: Upper bound
    inclusive: [true, true]            # OPTIONAL: Include bounds [left, right]
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values between 0 and 100"  # OPTIONAL: Step description

col_vals_outside: are column data outside of two specified values?

- col_vals_outside:
    columns: [column_name]
    left: 0
    right: 100
    inclusive: [false, false]          # OPTIONAL: Exclude bounds [left, right]
    na_pass: false
    # ... (same parameters as col_vals_between)

Set Membership Methods

col_vals_in_set: are column data part of a specified set of values?

- col_vals_in_set:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    set: [value1, value2, value3]      # REQUIRED: Allowed values
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values in allowed set"     # OPTIONAL: Step description

col_vals_not_in_set: are column data not part of a specified set of values?

- col_vals_not_in_set:
    columns: [column_name]
    set: [forbidden1, forbidden2]      # REQUIRED: Forbidden values
    na_pass: false
    # ... (same parameters as col_vals_in_set)
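
A concrete sketch of both methods (the status column and its values here are hypothetical):

- col_vals_in_set:
    columns: [status]
    set: [active, inactive, pending]
    brief: "Status must be one of the known states"

- col_vals_not_in_set:
    columns: [status]
    set: [deleted, banned]
    brief: "Retired states must not appear"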

NULL Value Methods

col_vals_null: are column data null (missing)?

- col_vals_null:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must be NULL"       # OPTIONAL: Step description

col_vals_not_null: are column data not null (not missing)?

- col_vals_not_null:
    columns: [column_name]
    # ... (same parameters as col_vals_null)

Pattern Matching Methods

col_vals_regex: do string-based column data match a regular expression?

- col_vals_regex:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    pattern: "^[A-Z]{2,3}$"            # REQUIRED: Regular expression pattern
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values match pattern"      # OPTIONAL: Step description

col_vals_within_spec: do column data conform to a specification (email, URL, postal codes, etc.)?

- col_vals_within_spec:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    spec: "email"                      # REQUIRED: Specification type
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values match spec"         # OPTIONAL: Step description

Available specification types:

  • "email" - Email addresses
  • "url" - Internet URLs
  • "phone" - Phone numbers
  • "ipv4" - IPv4 addresses
  • "ipv6" - IPv6 addresses
  • "mac" - MAC addresses
  • "isbn" - International Standard Book Numbers (10 or 13 digit)
  • "vin" - Vehicle Identification Numbers
  • "credit_card" - Credit card numbers (uses Luhn algorithm)
  • "swift" - Business Identifier Codes (SWIFT-BIC)
  • "postal_code[<country_code>]" - Postal codes for specific countries (e.g., "postal_code[US]", "postal_code[CA]")
  • "zip" - Alias for US ZIP codes ("postal_code[US]")
  • "iban[<country_code>]" - International Bank Account Numbers (e.g., "iban[DE]", "iban[FR]")

Examples:

# Email validation
- col_vals_within_spec:
    columns: user_email
    spec: "email"

# US postal codes
- col_vals_within_spec:
    columns: zip_code
    spec: "postal_code[US]"

# German IBAN
- col_vals_within_spec:
    columns: account_number
    spec: "iban[DE]"

Custom Expression Methods

col_vals_expr: do column data agree with a predicate expression?

- col_vals_expr:
    expr:                              # REQUIRED: Custom validation expression
      python: |
        pl.when(pl.col("status") == "active")
        .then(pl.col("value") > 0)
        .otherwise(pl.lit(True))
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Custom validation rule"    # OPTIONAL: Step description

Trend Validation Methods

col_vals_increasing: are column data increasing row-by-row?

- col_vals_increasing:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    allow_stationary: false            # OPTIONAL: Allow consecutive equal values (default: false)
    decreasing_tol: 0.5                # OPTIONAL: Tolerance for negative movement (default: null)
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must increase"      # OPTIONAL: Step description

This validation checks whether values in a column increase as you move down the rows. Useful for validating time-series data, sequence numbers, or any monotonically increasing values.

Parameters:

  • allow_stationary: If true, allows consecutive values to be equal (stationary phases). For example, [1, 2, 2, 3] would pass when true but fail at the third value when false.
  • decreasing_tol: Absolute tolerance for negative movement. Setting this to 0.5 means values can decrease by up to 0.5 units and still pass. Setting any value also sets allow_stationary to true.

Examples:

# Strict increasing validation
- col_vals_increasing:
    columns: timestamp_seconds
    brief: "Timestamps must strictly increase"

# Allow stationary values
- col_vals_increasing:
    columns: version_number
    allow_stationary: true
    brief: "Version numbers should increase (ties allowed)"

# With tolerance for small decreases
- col_vals_increasing:
    columns: temperature
    decreasing_tol: 0.1
    brief: "Temperature trend (small drops allowed)"

col_vals_decreasing: are column data decreasing row-by-row?

- col_vals_decreasing:
    columns: [column_name]             # REQUIRED: Column(s) to validate
    allow_stationary: false            # OPTIONAL: Allow consecutive equal values (default: false)
    increasing_tol: 0.5                # OPTIONAL: Tolerance for positive movement (default: null)
    na_pass: false                     # OPTIONAL: Pass NULL values
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must decrease"      # OPTIONAL: Step description

This validation checks whether values in a column decrease as you move down the rows. Useful for countdown timers, inventory depletion, or any monotonically decreasing values.

Parameters:

  • allow_stationary: If true, allows consecutive values to be equal (stationary phases). For example, [10, 8, 8, 5] would pass when true but fail at the third value when false.
  • increasing_tol: Absolute tolerance for positive movement. Setting this to 0.5 means values can increase by up to 0.5 units and still pass. Setting any value also sets allow_stationary to true.

Examples:

# Strict decreasing validation
- col_vals_decreasing:
    columns: countdown_timer
    brief: "Timer must strictly decrease"

# Allow stationary values
- col_vals_decreasing:
    columns: priority_score
    allow_stationary: true
    brief: "Priority scores should decrease (ties allowed)"

# With tolerance for small increases
- col_vals_decreasing:
    columns: stock_level
    increasing_tol: 5
    brief: "Stock levels decrease (small restocks allowed)"

Row-based Validations

rows_distinct: are row data distinct?

- rows_distinct                        # Simple form

- rows_distinct:                       # With parameters
    columns_subset: [col1, col2]       # OPTIONAL: Check subset of columns
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "No duplicate rows"         # OPTIONAL: Step description

rows_complete: are row data complete?

- rows_complete                        # Simple form

- rows_complete:                       # With parameters
    columns_subset: [col1, col2]       # OPTIONAL: Check subset of columns
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Complete rows only"        # OPTIONAL: Step description

Structure Validations

col_exists: does column exist in the table?

- col_exists:
    columns: [col1, col2, col3]        # REQUIRED: Column(s) that must exist
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Required columns exist"    # OPTIONAL: Step description

col_schema_match: does the table have expected column names and data types?

- col_schema_match:
    schema:                            # REQUIRED: Expected schema
      columns:
        - [column_name, "data_type"]   # Column with type validation
        - column_name                  # Column name only (no type check)
        - [column_name]                # Alternative syntax
    complete: true                     # OPTIONAL: Require exact column set
    in_order: true                     # OPTIONAL: Require exact column order
    case_sensitive_colnames: true      # OPTIONAL: Case-sensitive column names
    case_sensitive_dtypes: true        # OPTIONAL: Case-sensitive data types
    full_match_dtypes: true            # OPTIONAL: Exact type matching
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Schema validation"         # OPTIONAL: Step description

row_count_match: does the table have n rows?

- row_count_match:
    count: 1000                        # REQUIRED: Expected row count
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Expected row count"        # OPTIONAL: Step description

col_count_match: does the table have n columns?

- col_count_match:
    count: 10                          # REQUIRED: Expected column count
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Expected column count"     # OPTIONAL: Step description

tbl_match: does the table match a comparison table?

- tbl_match:
    tbl_compare:                       # REQUIRED: Comparison table
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.0
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Table structure matches"   # OPTIONAL: Step description

This validation performs a comprehensive comparison between the target table and a comparison table, using progressively stricter checks:

  1. Column count match: both tables have the same number of columns
  2. Row count match: both tables have the same number of rows
  3. Schema match (loose): column names and dtypes match (case-insensitive, any order)
  4. Schema match (order): columns in correct order (case-insensitive names)
  5. Schema match (exact): column names match exactly (case-sensitive, correct order)
  6. Data match: values in corresponding cells are identical

The validation fails at the first check that doesn’t pass, making it easy to diagnose mismatches. This operates over a single test unit (pass/fail for complete table match).

Cross-backend validation: tbl_match() supports automatic backend coercion when comparing tables from different backends (e.g., Polars vs. Pandas, DuckDB vs. SQLite). The comparison table is automatically converted to match the target table’s backend.

Examples:

# Compare against reference dataset
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("expected_output", tbl_type="polars")
    brief: "Output matches expected results"

# Compare against CSV file
- tbl_match:
    tbl_compare:
      python: |
        pl.read_csv("reference_data.csv")
    brief: "Matches reference CSV"

# Compare with preprocessing on target table only
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |
      lambda df: df.select(["id", "name", "value"])
    brief: "Selected columns match reference"

Special Validation Methods

conjointly: are multiple validations having a joint dependency?

- conjointly:
    expressions:                       # REQUIRED: List of lambda expressions
      - "lambda df: df['d'] > df['a']"
      - "lambda df: df['a'] > 0"
      - "lambda df: df['a'] + df['d'] < 12000"
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "All conditions must pass"  # OPTIONAL: Step description

specially: do table data pass a custom validation function?

- specially:
    expr:                              # REQUIRED: Custom validation function
      "lambda df: df.select(pl.col('a') + pl.col('d') > 0)"
    thresholds:                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Custom validation"         # OPTIONAL: Step description

Alternative syntax with Python expressions:

- specially:
    expr:
      python: |
        lambda df: df.select(pl.col('amount') > 0)

For Pandas DataFrames (when using df_library: pandas):

- specially:
    expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"

AI-Powered Validation

prompt: validate rows using AI/LLM-powered analysis

- prompt:
    prompt: "Values should be positive and realistic"  # REQUIRED: Natural language criteria
    model: "anthropic:claude-sonnet-4"                 # REQUIRED: Model identifier
    columns_subset: [column1, column2]                 # OPTIONAL: Columns to validate
    batch_size: 1000                                   # OPTIONAL: Rows per batch (default: 1000)
    max_concurrent: 3                                  # OPTIONAL: Concurrent API requests (default: 3)
    pre: |                                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "AI validation"                             # OPTIONAL: Step description

This validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Each row becomes a test unit that either passes or fails the validation criteria, producing binary True/False results that integrate with standard Pointblank reporting.

Supported models:

  • Anthropic: "anthropic:claude-sonnet-4", "anthropic:claude-opus-4"
  • OpenAI: "openai:gpt-4", "openai:gpt-4-turbo", "openai:gpt-3.5-turbo"
  • Ollama: "ollama:<model-name>" (e.g., "ollama:llama3")
  • Bedrock: "bedrock:<model-name>"

Authentication: API keys are automatically loaded from environment variables or .env files:

  • OpenAI: Set OPENAI_API_KEY environment variable or add to .env file
  • Anthropic: Set ANTHROPIC_API_KEY environment variable or add to .env file
  • Ollama: No API key required (runs locally)
  • Bedrock: Configure AWS credentials through standard AWS methods

Example .env file:

ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"

Performance optimization: The validation process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This dramatically reduces API costs and processing time for datasets with repetitive patterns.

Examples:

# Basic AI validation
- prompt:
    prompt: "Email addresses should look realistic and professional"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [email]

# Complex semantic validation
- prompt:
    prompt: "Product descriptions should mention the product category and include at least one benefit"
    model: "openai:gpt-4"
    columns_subset: [product_name, description, category]
    batch_size: 500
    max_concurrent: 5

# Sentiment analysis
- prompt:
    prompt: "Customer feedback should express positive sentiment"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [feedback_text, rating]

# Context-dependent validation
- prompt:
    prompt: "For high-value transactions (amount > 1000), a detailed justification should be provided"
    model: "openai:gpt-4"
    columns_subset: [amount, justification, approver]
    thresholds:
      warning: 0.05
      error: 0.15

# Local model with Ollama
- prompt:
    prompt: "Transaction descriptions should be clear and professional"
    model: "ollama:llama3"
    columns_subset: [description]

Best practices for AI validation:

  • Be specific and clear in your prompt criteria
  • Include only necessary columns in columns_subset to reduce API costs
  • Start with smaller batch_size for testing, increase for production
  • Adjust max_concurrent based on API rate limits
  • Use thresholds appropriate for probabilistic validation results
  • Consider cost implications for large datasets
  • Test prompts on sample data before full deployment

When to use AI validation:

  • Semantic checks (e.g., “does the description match the category?”)
  • Context-dependent validation (e.g., “is the justification appropriate for the amount?”)
  • Subjective quality assessment (e.g., “is the text professional?”)
  • Pattern recognition that’s hard to express programmatically
  • Natural language understanding tasks

When NOT to use AI validation:

  • Simple numeric comparisons (use col_vals_gt, col_vals_lt, etc.)
  • Exact pattern matching (use col_vals_regex)
  • Schema validation (use col_schema_match)
  • Performance-critical validations with large datasets
  • When deterministic results are required

Column Selection Patterns

All validation methods that accept a columns parameter support these selection patterns:

# Single column
columns: column_name

# Multiple columns as list
columns: [col1, col2, col3]

# Column selector functions (when used in Python expressions)
columns:
  python: |
    starts_with("prefix_")

# Examples of common patterns
columns: [customer_id, order_id]     # Specific columns
columns: user_email                  # Single column
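
Other Pointblank column selector functions can be used the same way inside a python: block; the sketch below assumes ends_with() and contains() are exposed in the restricted namespace just as starts_with() is above:

# Columns ending with a suffix (assumes ends_with() is available like starts_with())
columns:
  python: |
    ends_with("_amount")

# Columns containing a substring (assumes contains() is available)
columns:
  python: |
    contains("revenue")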

Parameter Details

Common Parameters

These parameters are available for most validation methods:

  • columns: column selection (string, list, or selector expression)
  • na_pass: whether to pass NULL/missing values (boolean, default: false)
  • pre: data preprocessing function (Python lambda expression)
  • thresholds: step-level failure thresholds (dict)
  • actions: step-level failure actions (dict)
  • brief: step description (string, boolean, or template)
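
A single step can combine most of these parameters. A sketch with hypothetical column names:

- col_vals_gt:
    columns: [amount]
    value: 0
    na_pass: true                      # missing amounts are tolerated
    pre: |
      lambda df: df.filter(pl.col("currency") == "USD")
    thresholds:
      warning: 0.05
      error: 0.10
    actions:
      warning: "Step {step}: {n_failed} non-positive USD amounts"
    brief: "USD amounts must be positive"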

Brief Parameter Options

The brief parameter supports several formats:

brief: "Custom description"          # Custom text
brief: true                         # Auto-generated description
brief: false                        # No description
brief: "Step {step}: {auto}"        # Template with auto-generated text
brief: "Column '{col}' validation"  # Template with variables

Template variables: {step}, {col}, {value}, {set}, {pattern}, {auto}

Python Expressions

Several parameters support Python expressions using the python: block syntax:

# Data source loading
tbl:
  python: |
    pl.scan_csv("data.csv").filter(pl.col("active") == True)

# Preprocessing
pre:
  python: |
    lambda df: df.filter(pl.col("date") >= "2024-01-01")

# Custom expressions
expr:
  python: |
    pl.col("value").is_between(0, 100)

# Callable actions
actions:
  error:
    python: |
      lambda: print("VALIDATION ERROR: Critical data quality issue detected!")

Note: The Python environment in YAML is restricted for security. Only built-in functions (print, len, str, etc.), Path from pathlib, pb (Pointblank), and available DataFrame libraries (pl, pd) are accessible. You cannot import additional modules such as requests, logging, or custom libraries.

You can also use the shortcut syntax for lambda expressions:

# Shortcut syntax (equivalent to python: block)
pre: |
  lambda df: df.filter(pl.col("status") == "active")

Restricted Python Environment

For security reasons, the Python environment in YAML configurations is restricted to a safe subset of functionality. The available namespace includes:

Built-in functions:

  • basic types: str, int, float, bool, list, dict, tuple, set
  • math functions: sum, min, max, abs, round, len
  • iteration: range, enumerate, zip
  • output: print

Available modules:

  • Path from pathlib for file path operations
  • pb (pointblank) for dataset loading and validation functions
  • pl (polars) if available on the system
  • pd (pandas) if available on the system

Restrictions:

  • cannot import external libraries (requests, logging, os, sys, etc.)
  • cannot use __import__, exec, eval, or other dynamic execution functions
  • file operations are limited to Path functionality

Examples of valid callable actions:

# Simple output with built-in functions
actions:
  warning:
    python: |
      lambda: print(f"WARNING: {sum([1, 2, 3])} validation issues detected")

# Using available variables and string formatting
actions:
  error:
    python: |
      lambda: print("ERROR: Data validation failed at " + str(len("validation")))

# Multiple statements in lambda (using parentheses)
actions:
  critical:
    python: |
      lambda: (
          print("CRITICAL ALERT:"),
          print("Immediate attention required"),
          print("Contact data team")
      )[-1]  # Return the last value

For complex alerting, logging, or external system integration, use string template actions instead of callable actions, and handle the external communication in your application code after validation completes.
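
A minimal sketch of that pattern, assuming the interrogation result exposes all_passed() as a summary check and using a hypothetical notify() helper from your own application code:

import pointblank as pb

result = pb.yaml_interrogate(yaml_config)   # yaml_config: your YAML configuration string

# External alerting happens here, outside the restricted YAML environment
# (notify() is a hypothetical helper in your own code)
if not result.all_passed():
    notify("Validation failures detected; see the Pointblank report for details")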

Best Practices

Organization

  • use descriptive tbl_name and label values
  • add brief descriptions for complex validations
  • group related validations logically
  • use consistent indentation and formatting

Performance

  • apply pre filters early to reduce data volume
  • order validations from fast to slow
  • use columns_subset for row-based validations when appropriate
  • consider data source location (local vs. remote)
  • choose df_library based on data size and operations:
    • polars: fastest for large datasets and analytical operations
    • pandas: best for complex transformations and data science workflows
    • duckdb: optimal for analytical queries on very large datasets

Maintainability

  • store YAML files in version control
  • use template variables in actions and briefs
  • document expected failures with comments
  • test configurations with validate_yaml() before deployment (see the sketch after this list)
  • specify df_library explicitly when using library-specific validation expressions
  • keep DataFrame library choice consistent within related validation workflows
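
For the validate_yaml() bullet, a minimal sketch (assuming it accepts the same YAML string that yaml_interrogate() does and reports an error for an invalid configuration):

import pointblank as pb

# Catch malformed configurations before any data is touched
pb.validate_yaml(yaml_config)   # yaml_config: your YAML configuration string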

Error Handling

  • set appropriate thresholds based on data patterns
  • use actions for monitoring and alerting
  • start with conservative thresholds and adjust
  • consider using highest_only: false for comprehensive reporting