YAML Validation Workflows

Pointblank supports defining validation workflows using YAML configuration files, providing a declarative, readable, and maintainable approach to data validation. YAML workflows are particularly useful for teams, version control, automation pipelines, and scenarios where you want to separate validation logic from application code.

YAML validation workflows offer several advantages: they’re easy to read and write, can be version controlled alongside your data processing code, enable non-programmers to contribute to data quality definitions, and provide a clear separation between validation logic and execution code.

The YAML approach complements Pointblank’s Python API, giving you flexibility to choose the right tool for each situation. Simple, repetitive validations work well in YAML, while complex logic with custom functions might be better suited for the Python API.

Basic YAML Validation Structure

A YAML validation workflow consists of a few key components:

  • tbl: specifies the data source (file path, dataset name, or Python expression)
  • steps: defines the validation checks to perform
  • Optional metadata: table name, label, thresholds, actions, and other configuration

Here’s a simple example validating the built-in small_table dataset:

tbl: small_table
df_library: polars                     # Optional: specify DataFrame library
tbl_name: "Small Table Validation"
label: "Basic data quality checks"
steps:
  - rows_distinct
  - col_exists:
      columns: [a, b, c, d]
  - col_vals_not_null:
      columns: [a, b]

You can save this configuration to a .yaml file and execute it using the yaml_interrogate() function:

import pointblank as pb
from pathlib import Path

# Save the YAML configuration to a file
yaml_content = """
tbl: small_table
df_library: polars
tbl_name: "Small Table Validation"
label: "Basic data quality checks"
steps:
  - rows_distinct
  - col_exists:
      columns: [a, b, c, d]
  - col_vals_not_null:
      columns: [a, b]
"""

yaml_file = Path("basic_validation.yaml")
yaml_file.write_text(yaml_content)

# Execute the validation from the file
result = pb.yaml_interrogate(yaml_file)
result
(Validation report: "Small Table Validation" on the Polars small_table, seven steps. Step 1, rows_distinct, finds 2 of 13 rows failing (0.85 passing); the four col_exists steps for columns a, b, c, d and the two col_vals_not_null steps for a and b all pass.)

The validation table shows the results of each step, just as if you had written the equivalent Python code. You can also pass YAML content directly as a string for quick testing, but working with files is the recommended approach for production workflows.
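
For quick tests, the same configuration can be handed to yaml_interrogate() directly as a string instead of a file path; a minimal sketch:

import pointblank as pb

# Define the validation inline as a YAML string (handy for quick experiments)
yaml_content = """
tbl: small_table
steps:
  - rows_distinct
  - col_vals_not_null:
      columns: [a, b]
"""

# yaml_interrogate() also accepts YAML content passed directly as a string
result = pb.yaml_interrogate(yaml_content)
result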

Data Sources in YAML

The tbl field supports various data source types, making it easy to work with different kinds of data. You can also control the DataFrame library used for loading data with the df_library parameter.

DataFrame Library Selection

By default, Pointblank loads data as Polars DataFrames, but you can specify alternative libraries:

# Load as Polars DataFrame (default)
tbl: small_table
df_library: polars

# Load as Pandas DataFrame
tbl: small_table
df_library: pandas

# Load as DuckDB table (via Ibis)
tbl: small_table
df_library: duckdb

This is particularly useful when using validation expressions that require specific DataFrame APIs:

# Using Pandas-specific operations
tbl: small_table
df_library: pandas
steps:
  - specially:
      expr: "lambda df: df.assign(total=df['a'] + df['d'])"

# Using Polars-specific operations
tbl: small_table
df_library: polars
steps:
  - specially:
      expr: "lambda df: df.select(pl.col('a') + pl.col('d') > 0)"

File-based Sources

# CSV files (respects df_library setting)
tbl: "data/customers.csv"
df_library: pandas

# Parquet files
tbl: "warehouse/sales.parquet"
df_library: polars

# Multiple files with patterns
tbl: "logs/*.parquet"

Built-in Datasets

# Use Pointblank's built-in datasets
tbl: small_table
tbl: game_revenue
tbl: nycflights
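
Before writing a configuration against one of these datasets, it can help to inspect the data first. A small sketch; load_dataset() is shown elsewhere in this guide, and preview() is assumed here as a convenient way to glance at the table:

import pointblank as pb

# Load a built-in dataset to see its columns and types before writing the YAML
game_revenue = pb.load_dataset("game_revenue", tbl_type="polars")

# Quick look at a few rows (any inspection method works here)
pb.preview(game_revenue)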

Python Expressions for Complex Sources

For more complex data loading, use the python: block syntax. This syntax can be used with several parameters throughout your YAML configuration:

  • tbl: For complex data source loading (as shown below)
  • expr: For custom validation expressions in col_vals_expr
  • pre: For data preprocessing before validation steps
  • actions: For callable action functions (warning, error, critical, and default)

# Load data with custom Polars operations
tbl:
  python: |
    (
      pl.scan_csv("sales_data.csv")
      .filter(pl.col("date") >= "2024-01-01")
      .head(1000)
    )

# Load from a database connection
tbl:
  python: |
    pl.read_database_uri(
        query="SELECT * FROM customers WHERE active = true",
        uri="postgresql://user:pass@localhost/db"
    )

Validation Steps

YAML supports all of Pointblank’s validation methods. Here are some common patterns:

Column-based Validations

tbl: worldcities.csv
steps:
  # Check for missing values
  - col_vals_not_null:
      columns: [city_name, country]

  # Validate value ranges
  - col_vals_between:
      columns: latitude
      left: -90
      right: 90

  # Check set membership
  - col_vals_in_set:
      columns: country_code
      set: [US, CA, MX, UK, DE, FR]

  # Regular expression validation
  - col_vals_regex:
      columns: postal_code
      pattern: "^[0-9]{5}(-[0-9]{4})?$"

Row-based Validations

tbl: sales_data.csv
steps:
  # Check for duplicate rows
  - rows_distinct

  # Ensure complete rows (no missing values)
  - rows_complete

  # Check row count
  - row_count_match:
      count: 1000

Schema Validations

Schema validation ensures your data has the expected structure and column types. The col_schema_match validation method uses a schema key that contains a columns list, where each item in the list can specify a column name alone or a column name with its expected data type.

Each column entry can be specified as:

  • column_name: column name as a scalar string (structure validation, no type checking)
  • [column_name, "data_type"]: column name with type validation (as a list with two elements)
  • [column_name]: column name in a single-item list (equivalent to scalar, for consistency)

tbl: customer_data.csv
steps:
  # Complete schema validation (structure and types)
  - col_schema_match:
      schema:
        columns:
          - [customer_id, "int64"]
          - [name, "object"]
          - [email, "object"]
          - [signup_date, "datetime64[ns]"]

  # Structure-only validation (column names without types)
  - col_schema_match:
      schema:
        columns:
          - customer_id
          - name
          - email
      complete: false
      brief: "Check that core columns exist"

Schema Validation Options

Schema validations support the full range of validation options:

tbl: data_file.csv
steps:
  - col_schema_match:
      schema:
        columns:
          - [id, "int64"]
          - name
      complete: false                  # Allow extra columns
      in_order: false                  # Column order doesn't matter
      case_sensitive_colnames: false   # Case-insensitive column names
      case_sensitive_dtypes: false     # Case-insensitive type names
      full_match_dtypes: false         # Allow partial type matching
      brief: "Flexible schema validation"

Other Structure Validations

tbl: customer_data.csv
steps:
  # Check column count
  - col_count_match:
      count: 4

Thresholds and Severity Levels

Thresholds determine when validation failures trigger different severity levels. You can set global thresholds for the entire workflow:

tbl: sales_data.csv
tbl_name: "Sales Data Quality Check"
thresholds:
  warning: 0.05    # 5% failure rate triggers warning
  error: 0.10      # 10% failure rate triggers error
  critical: 0.15   # 15% failure rate triggers critical
steps:
  - col_vals_not_null:
      columns: [customer_id, amount]
  - col_vals_gt:
      columns: amount
      value: 0

You can also set thresholds for individual validation steps:

tbl: user_data.csv
steps:
  - col_vals_not_null:
      columns: email
      thresholds:
        warning: 1      # Any missing email is a warning
        error: 0.01     # 1% missing emails is an error

  - col_vals_regex:
      columns: email
      pattern: "^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$"
      thresholds:
        error: 1        # Any invalid email format is an error
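
In the Python API, the same step-level thresholds are supplied through each method's thresholds= argument. A rough equivalent of the email checks above, assuming the CSV is loaded into a DataFrame first:

import polars as pl
import pointblank as pb

# Load the data up front (file name taken from the YAML example above)
user_data = pl.read_csv("user_data.csv")

validation = (
    pb.Validate(data=user_data)
    .col_vals_not_null(
        columns="email",
        thresholds=pb.Thresholds(warning=1, error=0.01),
    )
    .col_vals_regex(
        columns="email",
        pattern=r"^[\w\.-]+@[\w\.-]+\.[a-zA-Z]{2,}$",
        thresholds=pb.Thresholds(error=1),
    )
    .interrogate()
)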

Actions: Responding to Validation Failures

Actions define what happens when validation thresholds are exceeded. You can use string templates with placeholder variables or callable functions.

String Template Actions

tbl: orders.csv
thresholds:
  warning: 0.02
  error: 0.05
actions:
  warning: "Warning: Step {step} found {n_failed} failures in {col} column"
  error: "Error in {TYPE} validation: {n_failed}/{n} rows failed (Step {step})"
  critical: "Critical failure detected at {time}"
steps:
  - col_vals_not_null:
      columns: [order_id, customer_id]

Available template variables include:

  • {step}: validation step number
  • {col}: column name being validated
  • {val}: specific failing value (when applicable)
  • {n_failed}: number of failing rows
  • {n}: total number of rows checked
  • {TYPE}: validation method name (e.g., “COL_VALS_NOT_NULL”)
  • {LEVEL}: severity level (“WARNING”, “ERROR”, “CRITICAL”)
  • {time}: timestamp of the validation

Callable Actions

For more complex responses, use Python callable functions:

tbl: critical_data.csv
thresholds:
  error: 1
actions:
  error:
    python: |
      lambda: print("ALERT: Critical data validation failed!")
  critical:
    python: |
      lambda: print("CRITICAL: Validation failure - manual intervention required!")
steps:
  - col_vals_not_null:
      columns: [transaction_id, amount]

Note: The Python environment in YAML actions is restricted for security. You can use built-in functions like print(), basic operations, and available DataFrame libraries, but cannot import external modules like requests or logging. For external notifications, consider using string template actions or handling alerts in your application code after the validation completes.
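
One practical pattern, then, is to keep the YAML actions simple and handle external notifications in the surrounding Python code once interrogation finishes. A sketch; the file name is hypothetical and the all_passed() check is one assumed way to test the outcome:

import pointblank as pb

# Run the YAML-defined validation (hypothetical file name)
result = pb.yaml_interrogate("critical_data_validation.yaml")

# Back in ordinary Python, any library is available for alerting
if not result.all_passed():
    # Replace with your own notification: email, chat webhook, logging, etc.
    print("Validation failures detected: sending alert from application code")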

Step-level Actions

You can also define actions for individual validation steps:

tbl: financial_data.csv
steps:
  - col_vals_not_null:
      columns: account_balance
      thresholds:
        error: 1
      actions:
        error: "Missing account balance detected in step {step}."

  - col_vals_gt:
      columns: account_balance
      value: 0
      actions:
        warning:
          python: |
            lambda: print("Negative balance warning triggered.")

Advanced Features

Pre-processing with the pre Parameter

You can apply data transformations before validation using the pre parameter:

tbl: transactions.csv
steps:
  # Validate only recent transactions
  - col_vals_gt:
      columns: amount
      value: 0
      pre:
        python: |
          lambda df: df.filter(
              pl.col("transaction_date") >= "2024-01-01"
          )

  # Check completeness for active customers only
  - col_vals_not_null:
      columns: [email, phone]
      pre: |
        lambda df: df.filter(pl.col("status") == "active")

Note that you can use either the explicit python: block syntax or the shortcut syntax (just pre: |) for the lambda expressions.
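
The Python API exposes the same capability through the pre= argument, which takes a callable directly. A minimal sketch against a hypothetical transactions file:

import polars as pl
import pointblank as pb

transactions = pl.read_csv("transactions.csv")  # hypothetical file, as in the YAML above

validation = (
    pb.Validate(data=transactions)
    .col_vals_gt(
        columns="amount",
        value=0,
        # Narrow to recent transactions before this step is evaluated
        pre=lambda df: df.filter(pl.col("transaction_date") >= "2024-01-01"),
    )
    .interrogate()
)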

Complex Expressions

For advanced validation logic, use a col_vals_expr step with custom expressions:

tbl: sales_data.csv
steps:
  # Custom business logic validation
  - col_vals_expr:
      expr:
        python: |
          (
            pl.when(pl.col("product_type") == "premium")
            .then(pl.col("price") >= 100)
            .when(pl.col("product_type") == "standard")
            .then(pl.col("price").is_between(20, 99))
            .otherwise(pl.col("price") <= 19)
          )
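
The equivalent Python API step passes the Polars expression straight to col_vals_expr(); roughly, and again assuming the CSV is loaded into a DataFrame first:

import polars as pl
import pointblank as pb

sales_data = pl.read_csv("sales_data.csv")  # hypothetical file, as in the YAML above

validation = (
    pb.Validate(data=sales_data)
    .col_vals_expr(
        expr=(
            pl.when(pl.col("product_type") == "premium")
            .then(pl.col("price") >= 100)
            .when(pl.col("product_type") == "standard")
            .then(pl.col("price").is_between(20, 99))
            .otherwise(pl.col("price") <= 19)
        )
    )
    .interrogate()
)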

Brief Descriptions

Add human-readable descriptions to validation steps. The brief parameter supports string templating and automatic generation:

tbl: customer_data.csv
brief: "Customer data quality validation for {auto}"
steps:
  - col_vals_not_null:
      columns: customer_id
      brief: "Ensure all customers have valid IDs"

  - col_vals_regex:
      columns: email
      pattern: "^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$"
      brief: "Validate email format compliance"

  - col_vals_between:
      columns: age
      left: 13
      right: 120
      brief: "Check reasonable age ranges"

  # Use automatic brief generation
  - col_vals_not_null:
      columns: phone_number
      brief: true

  # Template variables in briefs
  - col_vals_in_set:
      columns: status
      set: [active, inactive, pending]
      brief: "Column '{col}' must be one of: {set}"

Brief Templating Options:

  • custom strings: Write your own descriptive text
  • true: Automatically generates a brief based on the validation method and parameters
  • {auto}: Placeholder for auto-generated text within custom strings
  • template variables: Use the same variables available in actions:
    • {col}: column name(s) being validated
    • {step}: the step number in the validation plan
    • {value}: the comparison value used in the validation (for single-value comparisons)
    • {pattern}: for regex validations, the pattern being matched

Working with YAML Files

Loading from Files

You can save your YAML configuration to files and load them:

# Create a YAML file
yaml_content = """
tbl: small_table
tbl_name: "File-based Validation"
steps:
  - col_vals_between:
      columns: c
      left: 1
      right: 10
  - col_vals_in_set:
      columns: f
      set: [low, mid, high]
"""

# Save to file
from pathlib import Path
yaml_file = Path("validation_config.yaml")
yaml_file.write_text(yaml_content)

# Load and execute
result = pb.yaml_interrogate(yaml_file)
result
(Validation report: "File-based Validation", two steps. col_vals_between on column c finds 2 of 13 values outside the range 1 to 10 (0.85 passing); col_vals_in_set on column f passes for all 13 rows.)

Converting YAML to Python

Use yaml_to_python() to generate equivalent Python code from your YAML configuration:

yaml_config = """
tbl: small_table
tbl_name: "Example Validation"
thresholds:
  warning: 0.1
  error: 0.2
actions:
  warning: "Warning: {TYPE} validation failed"
steps:
  - col_vals_gt:
      columns: a
      value: 0
  - col_vals_in_set:
      columns: f
      set: [low, mid, high]
"""

# Generate Python code
python_code = pb.yaml_to_python(yaml_config)
print(python_code)
```python
import pointblank as pb

(
    pb.Validate(
        data=pb.load_dataset("small_table", tbl_type="polars"),
        tbl_name="Example Validation",
        thresholds=pb.Thresholds(warning=0.1, error=0.2),
        actions=pb.Actions(warning="Warning: {TYPE} validation failed"),
    )
    .col_vals_gt(columns="a", value=0)
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)
```

This is useful for:

  • learning how YAML maps to Python API calls
  • transitioning from YAML to code-based workflows
  • generating documentation that shows both approaches
  • debugging YAML configurations
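
For the documentation and transition use cases above, the generated snippet can simply be written out alongside the YAML; a small sketch using the yaml_config and python_code variables from the previous example (file names are illustrative):

from pathlib import Path

# Keep the YAML config and its generated Python equivalent side by side
Path("example_validation.yaml").write_text(yaml_config)
Path("example_validation_python.md").write_text(python_code)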

Practical Examples

Data Pipeline Validation

Here’s a comprehensive example for validating data in a processing pipeline:

tbl:
  python: |
    (
      pl.scan_csv("raw_data/customer_events.csv")
      .filter(pl.col("event_date") >= "2024-01-01")
    )

tbl_name: "Customer Events Pipeline Validation"
label: "Daily data quality check for customer events"

thresholds:
  warning: 0.01   # 1% failure rate
  error: 0.05     # 5% failure rate

actions:
  warning: "Pipeline warning: {TYPE} validation found {n_failed} issues"
  error:
    python: |
      lambda: print("ERROR: Pipeline validation failed - manual review required")

steps:
  # Schema validation
  - col_schema_match:
      schema:
        columns:
          - [customer_id, "int64"]
          - [event_type, "object"]
          - [event_date, "object"]
          - [revenue, "float64"]
      brief: "Validate table structure matches expected schema"

  # Data completeness
  - col_vals_not_null:
      columns: [customer_id, event_type, event_date]
      brief: "Critical fields must be complete"

  # Business logic validation
  - col_vals_in_set:
      columns: event_type
      set: [signup, purchase, cancellation, upgrade]
      brief: "Event types must be from approved list"

  # Data quality checks
  - col_vals_gt:
      columns: revenue
      value: 0
      na_pass: true
      brief: "Revenue values must be positive when present"

  # Temporal validation
  - col_vals_expr:
      expr:
        python: |
          pl.col("event_date").str.strptime(pl.Date, "%Y-%m-%d").is_not_null()
      brief: "Event dates must be valid YYYY-MM-DD format"

Quality Monitoring Dashboard

For ongoing data quality monitoring:

tbl: warehouse/daily_metrics.parquet
tbl_name: "Daily Metrics Quality Check"

thresholds:
  warning: 5      # 5 failing rows
  error: 50       # 50 failing rows
  critical: 100   # 100 failing rows

actions:
  warning: "Quality check warning: {n_failed} rows failed {TYPE} validation"
  error: "Quality degradation detected: Step {step} failed for {n_failed}/{n} rows"
  critical:
    python: |
      lambda: print("CRITICAL: Data quality failure detected - immediate attention required")
  highest_only: false

steps:
  - row_count_match:
      count: 10000
      brief: "Verify expected daily record count"

  - col_vals_not_null:
      columns: [date, metric_value, source_system]
      brief: "Core fields must be complete"

  - col_vals_between:
      columns: metric_value
      left: 0
      right: 1000000
      brief: "Metric values within reasonable range"

  - rows_distinct:
      columns_subset: [date, metric_name, source_system]
      brief: "No duplicate metric records per day"

Best Practices

Organization and Structure

  1. use descriptive names: give your validations clear tbl_name and label values
  2. add brief descriptions: document what each validation step checks
  3. group related validations: organize steps logically (schema, completeness, business rules)
  4. version control: store YAML files in git alongside your data processing code

Error Handling and Monitoring

  1. set appropriate thresholds: start conservative and adjust based on your data patterns
  2. use actions for alerting: set up notifications for critical failures
  3. document expected failures: some data quality issues might be acceptable
  4. monitor validation results: track validation performance over time

Performance Considerations

  1. use the pre parameter efficiently: apply filters early to reduce data volume
  2. order validations strategically: put fast, likely-to-fail checks first
  3. consider data source location: local files are faster than remote sources
  4. use appropriate column selections: only validate the columns you need

Wrapping Up

YAML validation workflows provide a powerful, declarative approach to data validation in Pointblank. They excel at expressing common validation patterns in a readable format that can be shared, version controlled, and maintained by a team.

Key advantages of YAML workflows:

  • readable: non-programmers can understand and contribute to validation logic
  • maintainable: easy to modify validation rules without changing application code
  • portable: YAML files can be shared between projects and teams
  • version controlled: track changes to validation logic over time
  • flexible: support for simple checks and complex custom logic

Use YAML workflows when you want declarative, maintainable validation definitions, and fall back to the Python API when you need complex programmatic logic or tight integration with application code. The two approaches complement each other well and can be used together as your validation needs evolve.