YAML Reference
This reference provides a comprehensive guide to all YAML keys and parameters supported by Pointblank’s YAML validation workflows. Use this document as a quick lookup when building validation configurations.
Global Configuration Keys
Top-level Structure
tbl: data_source # REQUIRED: Data source specification
df_library: "polars" # OPTIONAL: DataFrame library ("polars", "pandas", "duckdb")
tbl_name: "Custom Table Name" # OPTIONAL: Human-readable table name
label: "Validation Description" # OPTIONAL: Description for the validation workflow
lang: "en" # OPTIONAL: Language code (default: "en")
locale: "en" # OPTIONAL: Locale setting (default: "en")
brief: "Global brief: {auto}" # OPTIONAL: Global brief template
thresholds: # OPTIONAL: Global failure thresholds
warning: 0.1
error: 0.2
critical: 0.3
actions: # OPTIONAL: Global failure actions
warning: "Warning message template"
error: "Error message template"
critical: "Critical message template"
highest_only: false
steps: # REQUIRED: List of validation steps
- validation_method_name
- validation_method_name:
parameter: value
Data Source (tbl)
The tbl key specifies the data source and supports multiple formats:
# File paths
tbl: "data/file.csv"
tbl: "data/file.parquet"
# Built-in datasets
tbl: small_table
tbl: game_revenue
tbl: nycflights
# Python expressions for complex data loading
tbl:
python: |
pl.scan_csv("data.csv").filter(pl.col("date") >= "2024-01-01")
DataFrame Library (df_library)
The df_library key controls which DataFrame library is used to load data sources. This parameter affects both built-in datasets and file loading:
# Use Polars DataFrames (default)
df_library: polars
# Use Pandas DataFrames
df_library: pandas
# Use DuckDB tables (via Ibis)
df_library: duckdb
Examples with different libraries:
# Load built-in dataset as Pandas DataFrame
tbl: small_table
df_library: pandas
steps:
- specially:
expr: "lambda df: df.assign(validation_result=df['a'] > 0)"
# Load CSV file as Polars DataFrame
tbl: "data/sales.csv"
df_library: polars
steps:
- col_vals_gt:
columns: amount
value: 0
# Load dataset as DuckDB table
tbl: nycflights
df_library: duckdb
steps:
- row_count_match:
count: 336776
The df_library parameter is particularly useful when (see the sketch after this list):
- using validation expressions that require specific DataFrame APIs (e.g., Pandas .assign(), Polars .select())
- integrating with existing pipelines that use a specific DataFrame library
- optimizing performance for different data sizes and operations
- ensuring compatibility with downstream processing steps
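As a sketch (the file path and the status and amount columns are hypothetical), the same preprocessing step has to be written against whichever API df_library selects:
# With df_library: polars, pre uses the Polars expression API
tbl: "data/sales.csv"
df_library: polars
steps:
  - col_vals_gt:
      columns: amount
      value: 0
      pre: |
        lambda df: df.filter(pl.col("status") == "active")
# With df_library: pandas, the equivalent pre uses Pandas boolean indexing
tbl: "data/sales.csv"
df_library: pandas
steps:
  - col_vals_gt:
      columns: amount
      value: 0
      pre: |
        lambda df: df[df["status"] == "active"]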
Global Thresholds
Thresholds define when validation failures trigger different severity levels:
thresholds:
warning: 0.05 # 5% failure rate triggers warning
error: 0.10 # 10% failure rate triggers error
critical: 0.15 # 15% failure rate triggers critical
- values: numbers between 0 and 1 are read as failure proportions; integers of 1 or more are read as absolute failing-row counts (see the sketch after this list)
- levels: warning, error, critical
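As a sketch, and assuming each level is interpreted on its own (values below 1 as proportions, values of 1 or more as absolute counts, per the note above), proportions and counts can be mixed; the numbers are illustrative:
thresholds:
  warning: 1       # any single failing row triggers a warning
  error: 25        # 25 or more failing rows trigger an error
  critical: 0.10   # a 10% failure rate triggers critical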
Global Actions
Actions define responses when thresholds are exceeded. When supplying a string to a severity level (‘warning’, ‘error’, ‘critical’), you can use template variables that will be automatically substituted with contextual information:
actions:
warning: "Warning: {n_failed} failures in step {step}"
error:
python: |
lambda: print("Error detected!")
critical: "Critical failure at {time}"
highest_only: false # Execute all applicable actions vs. only highest severity
Template variables available for action strings (several are combined in the sketch after this list):
- {step}: current validation step number
- {col}: column name(s) being validated
- {val}: validation value or threshold
- {n_failed}: number of failing records
- {n}: total number of records
- {type}: validation method type
- {level}: severity level (‘warning’/‘error’/‘critical’)
- {time}: timestamp of validation
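A sketch combining several of these variables in string actions (the message wording is illustrative):
actions:
  warning: "[{level}] step {step} ({type}) on column {col}: {n_failed} of {n} rows failed"
  error: "[{level}] step {step} crossed the error threshold at {time}"
  highest_only: true   # run only the action for the highest severity reached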
Validation Methods Reference
Column Value Validations
Comparison Methods
col_vals_gt: are column data greater than a fixed value or data in another column?
- col_vals_gt:
columns: [column_name] # REQUIRED: Column(s) to validate
value: 100 # REQUIRED: Comparison value
na_pass: true # OPTIONAL: Pass NULL values (default: false)
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must be > 100" # OPTIONAL: Step description
col_vals_lt: are column data less than a fixed value or data in another column?
- col_vals_lt:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_ge: are column data greater than or equal to a fixed value or data in another column?
- col_vals_ge:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_le: are column data less than or equal to a fixed value or data in another column?
- col_vals_le:
columns: [column_name]
value: 100
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_eq: are column data equal to a fixed value or data in another column?
- col_vals_eq:
columns: [column_name]
value: "expected_value"
na_pass: true
# ... (same parameters as col_vals_gt)
col_vals_ne: are column data not equal to a fixed value or data in another column?
- col_vals_ne:
columns: [column_name]
value: "forbidden_value"
na_pass: true
# ... (same parameters as col_vals_gt)
Range Methods
col_vals_between: are column data between two specified values (inclusive by default)?
- col_vals_between:
columns: [column_name] # REQUIRED: Column(s) to validate
left: 0 # REQUIRED: Lower bound
right: 100 # REQUIRED: Upper bound
inclusive: [true, true] # OPTIONAL: Include bounds [left, right]
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values between 0 and 100" # OPTIONAL: Step description
col_vals_outside: are column data outside of two specified values?
- col_vals_outside:
columns: [column_name]
left: 0
right: 100
inclusive: [false, false] # OPTIONAL: Exclude bounds [left, right]
na_pass: false
# ... (same parameters as col_vals_between)
Set Membership Methods
col_vals_in_set: are column data part of a specified set of values?
- col_vals_in_set:
columns: [column_name] # REQUIRED: Column(s) to validate
set: [value1, value2, value3] # REQUIRED: Allowed values
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values in allowed set" # OPTIONAL: Step description
col_vals_not_in_set: are column data not part of a specified set of values?
- col_vals_not_in_set:
columns: [column_name]
set: [forbidden1, forbidden2] # REQUIRED: Forbidden values
na_pass: false
# ... (same parameters as col_vals_in_set)
NULL Value Methods
col_vals_null: are column data null (missing)?
- col_vals_null:
columns: [column_name] # REQUIRED: Column(s) to validate
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values must be NULL" # OPTIONAL: Step description
col_vals_not_null: are column data not null (not missing)?
- col_vals_not_null:
columns: [column_name]
# ... (same parameters as col_vals_null)
Pattern Matching Methods
col_vals_regex: do string-based column data match a regular expression?
- col_vals_regex:
columns: [column_name] # REQUIRED: Column(s) to validate
pattern: "^[A-Z]{2,3}$" # REQUIRED: Regular expression pattern
na_pass: false # OPTIONAL: Pass NULL values
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Values match pattern" # OPTIONAL: Step description
Custom Expression Methods
col_vals_expr: do column data agree with a predicate expression?
- col_vals_expr:
expr: # REQUIRED: Custom validation expression
python: |
pl.when(pl.col("status") == "active")
.then(pl.col("value") > 0)
.otherwise(pl.lit(True))
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Custom validation rule" # OPTIONAL: Step description
Row-based Validations
rows_distinct: are row data distinct?
- rows_distinct # Simple form
- rows_distinct: # With parameters
columns_subset: [col1, col2] # OPTIONAL: Check subset of columns
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "No duplicate rows" # OPTIONAL: Step description
rows_complete: are row data complete?
- rows_complete # Simple form
- rows_complete: # With parameters
columns_subset: [col1, col2] # OPTIONAL: Check subset of columns
pre: | # OPTIONAL: Data preprocessing
lambda df: df.filter(condition)
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Complete rows only" # OPTIONAL: Step description
Structure Validations
col_exists: do the specified column(s) exist in the table?
- col_exists:
columns: [col1, col2, col3] # REQUIRED: Column(s) that must exist
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Required columns exist" # OPTIONAL: Step description
col_schema_match: does the table have expected column names and data types?
- col_schema_match:
schema: # REQUIRED: Expected schema
columns:
- [column_name, "data_type"] # Column with type validation
- column_name # Column name only (no type check)
- [column_name] # Alternative syntax
complete: true # OPTIONAL: Require exact column set
in_order: true # OPTIONAL: Require exact column order
case_sensitive_colnames: true # OPTIONAL: Case-sensitive column names
case_sensitive_dtypes: true # OPTIONAL: Case-sensitive data types
full_match_dtypes: true # OPTIONAL: Exact type matching
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Schema validation" # OPTIONAL: Step description
row_count_match: does the table have n rows?
- row_count_match:
count: 1000 # REQUIRED: Expected row count
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Expected row count" # OPTIONAL: Step description
col_count_match: does the table have n columns?
- col_count_match:
count: 10 # REQUIRED: Expected column count
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Expected column count" # OPTIONAL: Step description
Special Validation Methods
conjointly: do multiple validation expressions hold jointly for each row?
- conjointly:
expressions: # REQUIRED: List of lambda expressions
- "lambda df: df['d'] > df['a']"
- "lambda df: df['a'] > 0"
- "lambda df: df['a'] + df['d'] < 12000"
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "All conditions must pass" # OPTIONAL: Step description
specially: do table data pass a custom validation function?
- specially:
expr: # REQUIRED: Custom validation function
"lambda df: df.select(pl.col('a') + pl.col('d') > 0)"
thresholds: # OPTIONAL: Step-level thresholds
warning: 0.1
actions: # OPTIONAL: Step-level actions
warning: "Custom message"
brief: "Custom validation" # OPTIONAL: Step description
Alternative syntax with Python expressions:
- specially:
expr:
python: |
lambda df: df.select(pl.col('amount') > 0)
For Pandas DataFrames (when using df_library: pandas):
- specially:
expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"
Column Selection Patterns
All validation methods that accept a columns parameter support these selection patterns:
# Single column
columns: column_name
# Multiple columns as list
columns: [col1, col2, col3]
# Column selector functions (when used in Python expressions)
columns:
python: |
starts_with("prefix_")
# Examples of common patterns
columns: [customer_id, order_id] # Specific columns
columns: user_email # Single column
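A sketch putting the starts_with() selector shown above into a full step; the sales_ prefix is hypothetical:
steps:
  - col_vals_not_null:
      columns:
        python: |
          starts_with("sales_")
      brief: "No missing values in any sales_* column"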
Parameter Details
Common Parameters
These parameters are available for most validation methods and can be combined in a single step, as in the sketch after this list:
- columns: column selection (string, list, or selector expression)
- na_pass: whether to pass NULL/missing values (boolean, default: false)
- pre: data preprocessing function (Python lambda expression)
- thresholds: step-level failure thresholds (dict)
- actions: step-level failure actions (dict)
- brief: step description (string, boolean, or template)
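A sketch of a single step using all of these parameters together (the column names and limits are illustrative, and the pre expression assumes the default Polars library):
- col_vals_le:
    columns: [discount_rate]
    value: 0.5
    na_pass: true                      # rows with a missing discount still pass
    pre: |
      lambda df: df.filter(pl.col("order_status") == "completed")
    thresholds:
      warning: 0.05
      error: 0.10
    actions:
      warning: "{n_failed} completed orders exceed a discount of {val}"
    brief: "Discounts on completed orders stay at or below 50%"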
Brief Parameter Options
The brief parameter supports several formats:
brief: "Custom description" # Custom text
brief: true # Auto-generated description
brief: false # No description
brief: "Step {step}: {auto}" # Template with auto-generated text
brief: "Column '{col}' validation" # Template with variables
Template variables: {step}, {col}, {value}, {set}, {pattern}, {auto}
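A sketch of a templated brief in context (the column and set values are illustrative):
- col_vals_in_set:
    columns: status
    set: [active, inactive, pending]
    brief: "Step {step}: {auto}"   # prefixes the auto-generated description with the step number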
Python Expressions
Several parameters support Python expressions using the python: block syntax:
# Data source loading
tbl:
python: |
pl.scan_csv("data.csv").filter(pl.col("active") == True)
# Preprocessing
pre:
python: |
lambda df: df.filter(pl.col("date") >= "2024-01-01")
# Custom expressions
expr:
python: |
pl.col("value").is_between(0, 100)
# Callable actions
actions:
error:
python: |
lambda: print("VALIDATION ERROR: Critical data quality issue detected!")
Note: The Python environment in YAML is restricted for security. Only built-in functions (print, len, str, etc.), Path from pathlib, and available DataFrame libraries (pl, pd) are accessible. You cannot import additional modules such as requests, logging, or custom libraries.
You can also use the shortcut syntax for lambda expressions:
# Shortcut syntax (equivalent to python: block)
pre: |
lambda df: df.filter(pl.col("status") == "active")
Restricted Python Environment
For security reasons, the Python environment in YAML configurations is restricted to a safe subset of functionality. The available namespace includes:
Built-in functions:
- basic types: str, int, float, bool, list, dict, tuple, set
- math functions: sum, min, max, abs, round, len
- iteration: range, enumerate, zip
- output: print
Available modules:
- Path from pathlib for file path operations
- pb (pointblank) for dataset loading and validation functions
- pl (polars) if available on the system
- pd (pandas) if available on the system
Restrictions:
- cannot import external libraries (requests, logging, os, sys, etc.)
- cannot use __import__, exec, eval, or other dynamic execution functions
- file operations are limited to Path functionality
Examples of valid callable actions:
# Simple output with built-in functions
actions:
warning:
python: |
lambda: print(f"WARNING: {sum([1, 2, 3])} validation issues detected")
# Using available variables and string formatting
actions:
error:
python: |
lambda: print("ERROR: Data validation failed at " + str(len("validation")))
# Multiple statements in lambda (using parentheses)
actions:
critical:
python: |
lambda: (
print("CRITICAL ALERT:"),
print("Immediate attention required"),
print("Contact data team") )[-1] # Return the last value
For complex alerting, logging, or external system integration, use string template actions instead of callable actions, and handle the external communication in your application code after validation completes.
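For example, string template actions can emit log-friendly messages that downstream monitoring parses after the run; the message format is illustrative:
actions:
  warning: "WARN {time} step={step} type={type} col={col} failed={n_failed}/{n}"
  error: "ERROR {time} step={step} type={type} col={col} failed={n_failed}/{n}"
  critical: "CRITICAL {time} step={step} type={type} col={col} failed={n_failed}/{n}"
  highest_only: true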
Best Practices
Organization
- use descriptive tbl_name and label values
- add brief descriptions for complex validations
- group related validations logically
- use consistent indentation and formatting
Performance
- apply pre filters early to reduce data volume (see the sketch after this list)
- order validations from fast to slow
- use columns_subset for row-based validations when appropriate
- consider data source location (local vs. remote)
- choose df_library based on data size and operations:
  - polars: fastest for large datasets and analytical operations
  - pandas: best for complex transformations and data science workflows
  - duckdb: optimal for analytical queries on very large datasets
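A sketch combining an early pre filter with a narrow columns_subset (the file path, column names, and date are illustrative):
tbl: "data/transactions.parquet"
df_library: polars
steps:
  - rows_distinct:
      columns_subset: [transaction_id]
      pre: |
        lambda df: df.filter(pl.col("posted_date") >= "2024-01-01")
      brief: "No duplicate transaction IDs among recent rows"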
Maintainability
- store YAML files in version control
- use template variables in actions and briefs
- document expected failures with comments
- test configurations with validate_yaml() before deployment
- specify df_library explicitly when using library-specific validation expressions
- keep DataFrame library choice consistent within related validation workflows
Error Handling
- set appropriate thresholds based on data patterns
- use actions for monitoring and alerting
- start with conservative thresholds and adjust
- consider using highest_only: false for comprehensive reporting (see the sketch below)
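A sketch of these points together: conservative global thresholds, template-variable actions for monitoring, and highest_only: false for comprehensive reporting (the numbers and wording are illustrative):
thresholds:
  warning: 0.01    # start conservative and adjust as data patterns become clear
  error: 0.05
  critical: 0.10
actions:
  warning: "Step {step}: {n_failed} of {n} rows failed ({type} on {col})"
  error: "Step {step} crossed the error threshold at {time}"
  critical: "Step {step} crossed the critical threshold at {time}"
  highest_only: false   # report every severity level that was crossed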