YAML Reference
This reference provides a comprehensive guide to all YAML keys and parameters supported by Pointblank’s YAML validation workflows. Use this document as a quick lookup when building validation configurations.
Global Configuration Keys
Top-level Structure
tbl: data_source # REQUIRED: Data source specification
df_library: "polars" # OPTIONAL: DataFrame library ("polars", "pandas", "duckdb")
tbl_name: "Custom Table Name" # OPTIONAL: Human-readable table name
label: "Validation Description" # OPTIONAL: Description for the validation workflow
lang: "en" # OPTIONAL: Language code (default: "en")
locale: "en" # OPTIONAL: Locale setting (default: "en")
brief: "Global brief: {auto}" # OPTIONAL: Global brief template
thresholds: # OPTIONAL: Global failure thresholds
warning: 0.1
error: 0.2
critical: 0.3
actions: # OPTIONAL: Global failure actions
warning: "Warning message template"
error: "Error message template"
critical: "Critical message template"
highest_only: false
steps: # REQUIRED: List of validation steps
- validation_method_name
- validation_method_name:
parameter: value
Data Source (tbl)
The tbl key specifies the data source and supports multiple formats:
# File paths
tbl: "data/file.csv"
tbl: "data/file.parquet"
# Built-in datasets
tbl: small_table
tbl: game_revenue
tbl: nycflights
# Python expressions for complex data loading
tbl:
python: |
pl.scan_csv("data.csv").filter(pl.col("date") >= "2024-01-01")
Using Templates with set_tbl=
For reusable validation templates that will always use a custom data source via the set_tbl= parameter in yaml_interrogate(), the tbl field is still required but its value doesn’t matter since it will be overridden. Recommended approaches:
# Option 1: Use a valid dataset name (gets overridden anyway)
tbl: small_table # Will be ignored when `set_tbl=` is used
# Option 2: Use YAML null (clearest semantic intent)
tbl: null # Indicates table will be provided via `set_tbl=`
When using yaml_interrogate() with set_tbl=, the validation template becomes fully reusable:
# Define reusable template
template = """
tbl: null # Will be overridden
tbl_name: "Sales Validation"
steps:
- col_exists:
columns: [customer_id, revenue, region]
- col_vals_gt:
columns: [revenue]
value: 0
"""
# Apply to different datasets
q1_result = pb.yaml_interrogate(template, set_tbl=q1_data)
q2_result = pb.yaml_interrogate(template, set_tbl=q2_data)
DataFrame Library (df_library)
The df_library key controls which DataFrame library is used to load data sources. This parameter affects both built-in datasets and file loading:
# Use Polars DataFrames (default)
df_library: polars
# Use Pandas DataFrames
df_library: pandas
# Use DuckDB tables (via Ibis)
df_library: duckdb
Examples with different libraries:
# Load built-in dataset as Pandas DataFrame
tbl: small_table
df_library: pandas
steps:
- specially:
expr: "lambda df: df.assign(validation_result=df['a'] > 0)"
# Load CSV file as Polars DataFrame
tbl: "data/sales.csv"
df_library: polars
steps:
- col_vals_gt:
columns: amount
value: 0
# Load dataset as DuckDB table
tbl: nycflights
df_library: duckdb
steps:
- row_count_match:
count: 336776
The df_library parameter is particularly useful when:
- using validation expressions that require specific DataFrame APIs (e.g., Pandas .assign(), Polars .select())
- integrating with existing pipelines that use a specific DataFrame library
- optimizing performance for different data sizes and operations
- ensuring compatibility with downstream processing steps
Global Thresholds
Thresholds define when validation failures trigger different severity levels:
thresholds:
warning: 0.05 # 5% failure rate triggers warning
error: 0.10 # 10% failure rate triggers error
critical: 0.15 # 15% failure rate triggers critical
- values: numbers between 0 and 1 (fractional failure rates) or integers (row counts)
- levels: warning, error, critical
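Integer thresholds count failing rows directly rather than a failure rate; a brief sketch (the specific counts are illustrative):
thresholds:
  warning: 10      # 10 failing rows trigger warning
  error: 100       # 100 failing rows trigger error
  critical: 1000   # 1000 failing rows trigger critical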
Global Actions
Actions define responses when thresholds are exceeded. When supplying a string to a severity level (‘warning’, ‘error’, ‘critical’), you can use template variables that will be automatically substituted with contextual information:
actions:
warning: "Warning: {n_failed} failures in step {step}"
error:
python: |
lambda: print("Error detected!")
critical: "Critical failure at {time}"
highest_only: false # Execute all applicable actions vs. only highest severity
Template variables available for action strings:
- {step}: current validation step number
- {col}: column name(s) being validated
- {val}: validation value or threshold
- {n_failed}: number of failing records
- {n}: total number of records
- {type}: validation method type
- {level}: severity level ('warning'/'error'/'critical')
- {time}: timestamp of validation
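For illustration, several of these variables can be combined in a single action string (the message wording is arbitrary):
actions:
  error: "Step {step} ({type}) on column {col}: {n_failed} of {n} records failed at {time}"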
Validation Methods Reference
Column Value Validations
Comparison Methods
col_vals_gt: are column data greater than a fixed value or data in another column?
- col_vals_gt:
columns: [column_name] # REQUIRED: Column(s) to validate
value: 100 # REQUIRED: Comparison value
na_pass: true # OPTIONAL: Pass NULL values (default: false)
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must be > 100" # OPTIONAL: Step description
col_vals_lt: are column data less than a fixed value or data in another column?
- col_vals_lt:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_ge: are column data greater than or equal to a fixed value or data in another column?
- col_vals_ge:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_le: are column data less than or equal to a fixed value or data in another column?
- col_vals_le:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_eq: are column data equal to a fixed value or data in another column?
- col_vals_eq:
columns: [column_name]
value: "expected_value"
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_ne: are column data not equal to a fixed value or data in another column?
- col_vals_ne:
columns: [column_name]
value: "forbidden_value"
na_pass: true
# ... (same parameters as col_vals_gt)
Range Methods
col_vals_between: are column data between two specified values (inclusive)?
- col_vals_between:
columns: [column_name] # REQUIRED: Column(s) to validate
left: 0 # REQUIRED: Lower bound
right: 100 # REQUIRED: Upper bound
inclusive: [true, true] # OPTIONAL: Include bounds [left, right]
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values between 0 and 100" # OPTIONAL: Step description
col_vals_outside: are column data outside of two specified values?
- col_vals_outside:
columns: [column_name]
left: 0
right: 100
inclusive: [false, false] # OPTIONAL: Exclude bounds [left, right]
na_pass: false
# ... (same parameters as col_vals_between)
Set Membership Methods
col_vals_in_set: are column data part of a specified set of values?
- col_vals_in_set:
columns: [column_name] # REQUIRED: Column(s) to validate
set: [value1, value2, value3] # REQUIRED: Allowed values
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values in allowed set" # OPTIONAL: Step description
col_vals_not_in_set: are column data not part of a specified set of values?
- col_vals_not_in_set:
columns: [column_name]
set: [forbidden1, forbidden2] # REQUIRED: Forbidden values
na_pass: false
# ... (same parameters as col_vals_in_set)
NULL Value Methods
col_vals_null: are column data null (missing)?
- col_vals_null:
columns: [column_name] # REQUIRED: Column(s) to validate
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must be NULL" # OPTIONAL: Step description
col_vals_not_null: are column data not null (not missing)?
- col_vals_not_null:
columns: [column_name]
# ... (same parameters as col_vals_null)
Pattern Matching Methods
col_vals_regex: do string-based column data match a regular expression?
- col_vals_regex:
columns: [column_name] # REQUIRED: Column(s) to validate
pattern: "^[A-Z]{2,3}$" # REQUIRED: Regular expression pattern
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values match pattern" # OPTIONAL: Step description
col_vals_within_spec: do column data conform to a specification (email, URL, postal codes, etc.)?
- col_vals_within_spec:
columns: [column_name] # REQUIRED: Column(s) to validate
spec: "email" # REQUIRED: Specification type
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values match spec" # OPTIONAL: Step description
Available specification types:
- "email" - Email addresses
- "url" - Internet URLs
- "phone" - Phone numbers
- "ipv4" - IPv4 addresses
- "ipv6" - IPv6 addresses
- "mac" - MAC addresses
- "isbn" - International Standard Book Numbers (10 or 13 digit)
- "vin" - Vehicle Identification Numbers
- "credit_card" - Credit card numbers (uses Luhn algorithm)
- "swift" - Business Identifier Codes (SWIFT-BIC)
- "postal_code[<country_code>]" - Postal codes for specific countries (e.g., "postal_code[US]", "postal_code[CA]")
- "zip" - Alias for US ZIP codes ("postal_code[US]")
- "iban[<country_code>]" - International Bank Account Numbers (e.g., "iban[DE]", "iban[FR]")
Examples:
# Email validation
- col_vals_within_spec:
columns: user_email
spec: "email"
# US postal codes
- col_vals_within_spec:
columns: zip_code
spec: "postal_code[US]"
# German IBAN
- col_vals_within_spec:
columns: account_number
spec: "iban[DE]"
Custom Expression Methods
col_vals_expr: do column data agree with a predicate expression?
- col_vals_expr:
expr: # REQUIRED: Custom validation expression
python: |
pl.when(pl.col("status") == "active")
.then(pl.col("value") > 0)
.otherwise(pl.lit(True))
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Custom validation rule" # OPTIONAL: Step description
Trend Validation Methods
col_vals_increasing: are column data increasing row-by-row?
- col_vals_increasing:
columns: [column_name] # REQUIRED: Column(s) to validate
allow_stationary: false # OPTIONAL: Allow consecutive equal values (default: false)
decreasing_tol: 0.5 # OPTIONAL: Tolerance for negative movement (default: null)
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must increase" # OPTIONAL: Step description
This validation checks whether values in a column increase as you move down the rows. Useful for validating time-series data, sequence numbers, or any monotonically increasing values.
Parameters:
- allow_stationary: If true, allows consecutive values to be equal (stationary phases). For example, [1, 2, 2, 3] would pass when true but fail at the third value when false.
- decreasing_tol: Absolute tolerance for negative movement. Setting this to 0.5 means values can decrease by up to 0.5 units and still pass. Setting any value also sets allow_stationary to true.
Examples:
# Strict increasing validation
- col_vals_increasing:
columns: timestamp_seconds
brief: "Timestamps must strictly increase"
# Allow stationary values
- col_vals_increasing:
columns: version_number
allow_stationary: true
brief: "Version numbers should increase (ties allowed)"
# With tolerance for small decreases
- col_vals_increasing:
columns: temperature
decreasing_tol: 0.1
brief: "Temperature trend (small drops allowed)"
col_vals_decreasing: are column data decreasing row-by-row?
- col_vals_decreasing:
columns: [column_name] # REQUIRED: Column(s) to validate
allow_stationary: false # OPTIONAL: Allow consecutive equal values (default: false)
increasing_tol: 0.5 # OPTIONAL: Tolerance for positive movement (default: null)
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must decrease" # OPTIONAL: Step description
This validation checks whether values in a column decrease as you move down the rows. Useful for countdown timers, inventory depletion, or any monotonically decreasing values.
Parameters:
- allow_stationary: If true, allows consecutive values to be equal (stationary phases). For example, [10, 8, 8, 5] would pass when true but fail at the third value when false.
- increasing_tol: Absolute tolerance for positive movement. Setting this to 0.5 means values can increase by up to 0.5 units and still pass. Setting any value also sets allow_stationary to true.
Examples:
# Strict decreasing validation
- col_vals_decreasing:
columns: countdown_timer
brief: "Timer must strictly decrease"
# Allow stationary values
- col_vals_decreasing:
columns: priority_score
allow_stationary: true
brief: "Priority scores should decrease (ties allowed)"
# With tolerance for small increases
- col_vals_decreasing:
columns: stock_level
increasing_tol: 5
brief: "Stock levels decrease (small restocks allowed)"
Row-based Validations
rows_distinct: are row data distinct?
- rows_distinct # Simple form
- rows_distinct: # With parameters
columns_subset: [col1, col2] # OPTIONAL: Check subset of columns
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "No duplicate rows" # OPTIONAL: Step description
rows_complete: are row data complete?
- rows_complete # Simple form
- rows_complete: # With parameters
columns_subset: [col1, col2] # OPTIONAL: Check subset of columns
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Complete rows only" # OPTIONAL: Step description
Structure Validations
col_exists: does column exist in the table?
- col_exists:
columns: [col1, col2, col3] # REQUIRED: Column(s) that must exist
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Required columns exist" # OPTIONAL: Step description
col_schema_match: does the table have expected column names and data types?
- col_schema_match:
schema: # REQUIRED: Expected schema
columns:
- [column_name, "data_type"] # Column with type validation
- column_name # Column name only (no type check)
- [column_name] # Alternative syntax
complete: true # OPTIONAL: Require exact column set
in_order: true # OPTIONAL: Require exact column order
case_sensitive_colnames: true # OPTIONAL: Case-sensitive column names
case_sensitive_dtypes: true # OPTIONAL: Case-sensitive data types
full_match_dtypes: true # OPTIONAL: Exact type matching
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Schema validation" # OPTIONAL: Step description
row_count_match: does the table have n rows?
- row_count_match:
count: 1000 # REQUIRED: Expected row count
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Expected row count" # OPTIONAL: Step description
col_count_match: does the table have n columns?
- col_count_match:
count: 10 # REQUIRED: Expected column count
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Expected column count" # OPTIONAL: Step description
tbl_match: does the table match a comparison table?
- tbl_match:
tbl_compare: # REQUIRED: Comparison table
python: |
pb.load_dataset("reference_table", tbl_type="polars")
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.0
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Table structure matches" # OPTIONAL: Step description
This validation performs a comprehensive comparison between the target table and a comparison table, using progressively stricter checks:
- Column count match: both tables have the same number of columns
- Row count match: both tables have the same number of rows
- Schema match (loose): column names and dtypes match (case-insensitive, any order)
- Schema match (order): columns in correct order (case-insensitive names)
- Schema match (exact): column names match exactly (case-sensitive, correct order)
- Data match: values in corresponding cells are identical
The validation fails at the first check that doesn’t pass, making it easy to diagnose mismatches. This operates over a single test unit (pass/fail for complete table match).
Cross-backend validation: tbl_match() supports automatic backend coercion when comparing tables from different backends (e.g., Polars vs. Pandas, DuckDB vs. SQLite). The comparison table is automatically converted to match the target table’s backend.
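As an illustrative sketch of a cross-backend comparison (assuming the target table is Polars and that a Pandas reference table can be read from a made-up CSV file):
# Pandas reference compared against a Polars target (coerced automatically)
- tbl_match:
    tbl_compare:
      python: |
        pd.read_csv("reference_snapshot.csv")
    brief: "Matches Pandas reference snapshot"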
Examples:
# Compare against reference dataset
- tbl_match:
tbl_compare:
python: |
pb.load_dataset("expected_output", tbl_type="polars")
brief: "Output matches expected results"
# Compare against CSV file
- tbl_match:
tbl_compare:
python: |
pl.read_csv("reference_data.csv")
brief: "Matches reference CSV"
# Compare with preprocessing on target table only
- tbl_match:
tbl_compare:
python: |
pb.load_dataset("reference_table", tbl_type="polars")
pre: |
lambda df: df.select(["id", "name", "value"])
brief: "Selected columns match reference"
Special Validation Methods
conjointly: do multiple validation expressions pass jointly for each row?
- conjointly:
expressions: # REQUIRED: List of lambda expressions
- "lambda df: df['d'] > df['a']"
- "lambda df: df['a'] > 0"
- "lambda df: df['a'] + df['d'] < 12000"
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "All conditions must pass" # OPTIONAL: Step description
specially: do table data pass a custom validation function?
- specially:
expr: # REQUIRED: Custom validation function
"lambda df: df.select(pl.col('a') + pl.col('d') > 0)"
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Custom validation" # OPTIONAL: Step description
Alternative syntax with Python expressions:
- specially:
expr:
python: |
lambda df: df.select(pl.col('amount') > 0)
For Pandas DataFrames (when using df_library: pandas):
- specially:
expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"
AI-Powered Validation
prompt: validate rows using AI/LLM-powered analysis
- prompt:
prompt: "Values should be positive and realistic" # REQUIRED: Natural language criteria
model: "anthropic:claude-sonnet-4" # REQUIRED: Model identifier
columns_subset: [column1, column2] # OPTIONAL: Columns to validate
batch_size: 1000 # OPTIONAL: Rows per batch (default: 1000)
max_concurrent: 3 # OPTIONAL: Concurrent API requests (default: 3)
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "AI validation" # OPTIONAL: Step description
This validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Each row becomes a test unit that either passes or fails the validation criteria, producing binary True/False results that integrate with standard Pointblank reporting.
Supported models:
- Anthropic: "anthropic:claude-sonnet-4", "anthropic:claude-opus-4"
- OpenAI: "openai:gpt-4", "openai:gpt-4-turbo", "openai:gpt-3.5-turbo"
- Ollama: "ollama:<model-name>" (e.g., "ollama:llama3")
- Bedrock: "bedrock:<model-name>"
Authentication: API keys are automatically loaded from environment variables or .env files:
- OpenAI: Set the OPENAI_API_KEY environment variable or add it to a .env file
- Anthropic: Set the ANTHROPIC_API_KEY environment variable or add it to a .env file
- Ollama: No API key required (runs locally)
- Bedrock: Configure AWS credentials through standard AWS methods
Example .env file:
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
Performance optimization: The validation process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This dramatically reduces API costs and processing time for datasets with repetitive patterns.
Examples:
# Basic AI validation
- prompt:
prompt: "Email addresses should look realistic and professional"
model: "anthropic:claude-sonnet-4"
columns_subset: [email]
# Complex semantic validation
- prompt:
prompt: "Product descriptions should mention the product category and include at least one benefit"
model: "openai:gpt-4"
columns_subset: [product_name, description, category]
batch_size: 500
max_concurrent: 5
# Sentiment analysis
- prompt:
prompt: "Customer feedback should express positive sentiment"
model: "anthropic:claude-sonnet-4"
columns_subset: [feedback_text, rating]
# Context-dependent validation
- prompt:
prompt: "For high-value transactions (amount > 1000), a detailed justification should be provided"
model: "openai:gpt-4"
columns_subset: [amount, justification, approver]
thresholds:
warning: 0.05
error: 0.15
# Local model with Ollama
- prompt:
prompt: "Transaction descriptions should be clear and professional"
model: "ollama:llama3"
columns_subset: [description]
Best practices for AI validation:
- Be specific and clear in your prompt criteria
- Include only necessary columns in columns_subset to reduce API costs
- Start with a smaller batch_size for testing, increase for production
- Adjust max_concurrent based on API rate limits
- Use thresholds appropriate for probabilistic validation results
- Consider cost implications for large datasets
- Test prompts on sample data before full deployment
When to use AI validation:
- Semantic checks (e.g., “does the description match the category?”)
- Context-dependent validation (e.g., “is the justification appropriate for the amount?”)
- Subjective quality assessment (e.g., “is the text professional?”)
- Pattern recognition that’s hard to express programmatically
- Natural language understanding tasks
When NOT to use AI validation:
- Simple numeric comparisons (use col_vals_gt, col_vals_lt, etc.)
- Exact pattern matching (use col_vals_regex)
- Schema validation (use col_schema_match)
- Performance-critical validations with large datasets
- When deterministic results are required
Column Selection Patterns
All validation methods that accept a columns parameter support these selection patterns:
# Single column
columns: column_name
# Multiple columns as list
columns: [col1, col2, col3]
# Column selector functions (when used in Python expressions)
columns:
python: |
starts_with("prefix_")
# Examples of common patterns
columns: [customer_id, order_id] # Specific columns
columns: user_email # Single column
Parameter Details
Common Parameters
These parameters are available for most validation methods:
- columns: column selection (string, list, or selector expression)
- na_pass: whether to pass NULL/missing values (boolean, default: false)
- pre: data preprocessing function (Python lambda expression)
- thresholds: step-level failure thresholds (dict)
- actions: step-level failure actions (dict)
- brief: step description (string, boolean, or template)
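Put together, a single step often uses several of these parameters at once; a sketch in which the column and status names are illustrative:
- col_vals_between:
    columns: [order_total]
    left: 0
    right: 10000
    na_pass: false
    pre: |
      lambda df: df.filter(pl.col("status") == "active")
    thresholds:
      warning: 0.05
      error: 0.10
    actions:
      warning: "Step {step}: {n_failed} out-of-range totals"
    brief: "Order totals within expected range"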
Brief Parameter Options
The brief parameter supports several formats:
brief: "Custom description" # Custom text
brief: true # Auto-generated description
brief: false # No description
brief: "Step {step}: {auto}" # Template with auto-generated text
brief: "Column '{col}' validation" # Template with variables
Template variables: {step}, {col}, {value}, {set}, {pattern}, {auto}
Python Expressions
Several parameters support Python expressions using the python: block syntax:
# Data source loading
tbl:
python: |
pl.scan_csv("data.csv").filter(pl.col("active") == True)
# Preprocessing
pre:
python: |
lambda df: df.filter(pl.col("date") >= "2024-01-01")
# Custom expressions
expr:
python: |
pl.col("value").is_between(0, 100)
# Callable actions
actions:
error:
python: |
lambda: print("VALIDATION ERROR: Critical data quality issue detected!")
Note: The Python environment in YAML is restricted for security. Only built-in functions (print, len, str, etc.), Path from pathlib, pb (pointblank), and available DataFrame libraries (pl, pd) are accessible. You cannot import additional modules like requests, logging, or custom libraries.
You can also use the shortcut syntax for lambda expressions:
# Shortcut syntax (equivalent to python: block)
pre: |
lambda df: df.filter(pl.col("status") == "active")
Restricted Python Environment
For security reasons, the Python environment in YAML configurations is restricted to a safe subset of functionality. The available namespace includes:
Built-in functions:
- basic types: str, int, float, bool, list, dict, tuple, set
- math functions: sum, min, max, abs, round, len
- iteration: range, enumerate, zip
- output: print
Available modules:
- Path from pathlib for file path operations
- pb (pointblank) for dataset loading and validation functions
- pl (polars) if available on the system
- pd (pandas) if available on the system
Restrictions:
- cannot import external libraries (requests, logging, os, sys, etc.)
- cannot use __import__, exec, eval, or other dynamic execution functions
- file operations are limited to Path functionality
Examples of valid callable actions:
# Simple output with built-in functions
actions:
warning:
python: |
lambda: print(f"WARNING: {sum([1, 2, 3])} validation issues detected")
# Using available variables and string formatting
actions:
error:
python: |
lambda: print("ERROR: Data validation failed at " + str(len("validation")))
# Multiple statements in lambda (using parentheses)
actions:
critical:
python: |
lambda: (
print("CRITICAL ALERT:"),
print("Immediate attention required"),
print("Contact data team")
)[-1] # Return the last value
For complex alerting, logging, or external system integration, use string template actions instead of callable actions, and handle the external communication in your application code after validation completes.
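A minimal sketch of that pattern, assuming the object returned by yaml_interrogate() exposes an all_passed() check (adapt to the reporting methods you actually use) and that notify_data_team() is your own, hypothetical alerting helper:
import pointblank as pb

yaml_config = """
tbl: small_table
steps:
  - col_vals_gt:
      columns: [d]
      value: 0
      actions:
        error: "Error: {n_failed} failures in step {step}"   # string template action
"""

result = pb.yaml_interrogate(yaml_config)

# External communication happens here, in application code,
# outside the restricted YAML Python environment.
if not result.all_passed():   # assumed accessor; adapt as needed
    notify_data_team()        # hypothetical alerting helper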
Best Practices
Organization
- use descriptive tbl_name and label values
- add brief descriptions for complex validations
- group related validations logically
- use consistent indentation and formatting
Performance
- apply pre filters early to reduce data volume
- order validations from fast to slow
- use columns_subset for row-based validations when appropriate
- consider data source location (local vs. remote)
- choose df_library based on data size and operations:
  - polars: fastest for large datasets and analytical operations
  - pandas: best for complex transformations and data science workflows
  - duckdb: optimal for analytical queries on very large datasets
Maintainability
- store YAML files in version control
- use template variables in actions and briefs
- document expected failures with comments
- test configurations with validate_yaml() before deployment (see the sketch after this list)
- specify df_library explicitly when using library-specific validation expressions
- keep DataFrame library choice consistent within related validation workflows
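A minimal sketch of that pre-deployment check, assuming validate_yaml() accepts the same YAML string that yaml_interrogate() does and raises an error for an invalid configuration (the file name is illustrative):
import pointblank as pb
from pathlib import Path

yaml_config = Path("validation.yaml").read_text()

# Fail fast on configuration mistakes before any data is touched
pb.validate_yaml(yaml_config)

# Interrogate only after the configuration has been validated
result = pb.yaml_interrogate(yaml_config)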
Error Handling
- set appropriate thresholds based on data patterns
- use actions for monitoring and alerting
- start with conservative thresholds and adjust
- consider using highest_only: false for comprehensive reporting (see the sketch below)
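For example, a conservative starting configuration along these lines (the threshold values are illustrative and should be adjusted to your data):
thresholds:
  warning: 0.01
  error: 0.05
  critical: 0.10
actions:
  warning: "Step {step}: {n_failed} failures detected"
  error: "Error at {level} level in step {step}"
  highest_only: false   # report every applicable severity level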