import pointblank as pb

# Define a basic YAML validation workflow
yaml_config = '''
tbl: small_table
steps:
- rows_distinct
- col_exists:
    columns: [date, a, b]
'''

# Execute the validation workflow
result = pb.yaml_interrogate(yaml_config)
result
yaml_interrogate() function

Execute a YAML-based validation workflow.

USAGE

yaml_interrogate(yaml, set_tbl=None)
This is the main entry point for YAML-based validation workflows. It takes YAML configuration (as a string or file path) and returns a validated Validate
object with interrogation results.
The YAML configuration defines the data source, validation steps, and optional settings like thresholds and labels. This function automatically loads the data, builds the validation plan, executes all validation steps, and returns the interrogated results.
Parameters
yaml : Union[str, Path]
    YAML configuration as a string or a file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file.
set_tbl : Union[FrameT, Any, None] = None
    An optional table to override the table specified in the YAML configuration. This allows you to apply a YAML-defined validation workflow to a different table than the one named in the configuration. If provided, this table will replace the table defined in the YAML's tbl field before executing the validation workflow. This can be any supported table type, including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub URLs, or database connection strings.
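For instance, since plain file paths are accepted, you can point a YAML-defined workflow at a CSV file directly. A minimal sketch (the data/sales.csv path is a hypothetical placeholder):

# Apply the YAML workflow to a CSV file instead of the configured table
# (the "data/sales.csv" path is an assumed placeholder)
result = pb.yaml_interrogate(yaml_config, set_tbl="data/sales.csv")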
Returns

Validate
    The validated Validate object with interrogation results.

Raises

YAMLValidationError
    If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing required fields, unknown validation methods, or data loading failures.
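As a sketch of how this surfaces in practice (assuming YAMLValidationError is importable from the top-level pointblank namespace; the exact import location may differ by version):

# Handling a malformed workflow; the import location of
# YAMLValidationError is an assumption and may vary by version
from pointblank import YAMLValidationError

bad_config = '''
tbl: small_table
steps:
- not_a_real_validation_method
'''

try:
    pb.yaml_interrogate(bad_config)
except YAMLValidationError as e:
    print(f"Workflow rejected: {e}")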
Examples
For the examples here, we'll use YAML configurations to define validation workflows. Let's start with the basic YAML workflow shown at the top of this page, which validates the built-in small_table dataset.
The validation table shows the results of our YAML-defined workflow. We can see that the rows_distinct()
validation failed (because there are duplicate rows in the table), while the column existence checks passed.
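Beyond the rendered report, the returned Validate object can be queried programmatically. A brief sketch (assuming the all_passed() and n_failed() methods behave as in current pointblank releases):

# Inspect the interrogation outcome programmatically
# (assumes all_passed() and n_failed() exist on the Validate object)
print(result.all_passed())   # False, since the rows_distinct step failed
print(result.n_failed(i=1))  # failing test units for step 1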
Now let’s create a more comprehensive validation workflow with thresholds and metadata:
# Advanced YAML configuration with thresholds and metadata
yaml_config = '''
tbl: small_table
tbl_name: small_table_demo
label: Comprehensive data validation
thresholds:
  warning: 0.1
  error: 0.25
  critical: 0.35
steps:
- col_vals_gt:
    columns: [d]
    value: 100
- col_vals_regex:
    columns: [b]
    pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
- col_vals_not_null:
    columns: [date, a]
'''

# Execute the validation workflow
result = pb.yaml_interrogate(yaml_config)
print(f"Table name: {result.tbl_name}")
print(f"Label: {result.label}")
print(f"Total validation steps: {len(result.validation_info)}")
Table name: small_table_demo
Label: Comprehensive data validation
Total validation steps: 4
The validation results now include our custom table name and label. Note that the three YAML entries expand to four validation steps because col_vals_not_null is applied to two columns, generating one step per column. The thresholds we defined will determine when validation steps are marked as warnings, errors, or critical failures.
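With thresholds set, you can also ask whether any step crossed them. A hedged sketch (assuming Validate exposes warning() and error() threshold-status methods, as in current pointblank releases):

# Check per-step threshold status
# (assumes warning()/error() report whether each step crossed a threshold)
print(result.warning())  # per-step warning-threshold status
print(result.error())    # per-step error-threshold status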
You can also load YAML configurations from files. Here’s how you would work with a YAML file:
from pathlib import Path
import tempfile
# Create a temporary YAML file for demonstration
yaml_content = '''
tbl: small_table
tbl_name: File-based Validation
steps:
- col_vals_between:
    columns: [c]
    left: 1
    right: 10
- col_vals_in_set:
    columns: [f]
    set: [low, mid, high]
'''

with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
    f.write(yaml_content)
    yaml_file_path = Path(f.name)

# Load and execute validation from file
result = pb.yaml_interrogate(yaml_file_path)
result
This approach is particularly useful for storing validation configurations as part of your data pipeline or version control system, allowing you to maintain validation rules alongside your code.
Using set_tbl= to Override the Table

The set_tbl= parameter allows you to override the table specified in the YAML configuration. This is useful when you have a template validation workflow but want to apply it to different tables:
import polars as pl
# Create a test table with a similar structure to small_table
test_table = pl.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "a": [1, 2, 3],
    "b": ["1-abc-123", "2-def-456", "3-ghi-789"],
    "d": [150, 200, 250]
})

# Use the same YAML config but apply it to our test table
yaml_config = '''
tbl: small_table  # This will be overridden
tbl_name: Test Table  # This name will be used
steps:
- col_exists:
    columns: [date, a, b, d]
- col_vals_gt:
    columns: [d]
    value: 100
'''

# Execute with table override
result = pb.yaml_interrogate(yaml_config, set_tbl=test_table)
print(f"Validation applied to: {result.tbl_name}")
result
Validation applied to: Test Table
STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT
1 col_exists() | date | — | ✓ | 1 | 1 1.00 | 0 0.00 | — | — | — | —
2 col_exists() | a | — | ✓ | 1 | 1 1.00 | 0 0.00 | — | — | — | —
3 col_exists() | b | — | ✓ | 1 | 1 1.00 | 0 0.00 | — | — | — | —
4 col_exists() | d | — | ✓ | 1 | 1 1.00 | 0 0.00 | — | — | — | —
5 col_vals_gt() | d | 100 | ✓ | 3 | 3 1.00 | 0 0.00 | — | — | — | —
This feature makes YAML configurations more reusable and flexible, allowing you to define validation logic once and apply it to multiple similar tables.
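To make that reuse concrete, here is a minimal sketch that applies the same YAML workflow to several similarly shaped tables in turn (other_table is a hypothetical stand-in built from test_table):

# Reuse one YAML workflow across multiple similar tables
# (other_table is a hypothetical second table with the same columns)
other_table = test_table.with_columns(pl.col("d") * 2)

for tbl in [test_table, other_table]:
    result = pb.yaml_interrogate(yaml_config, set_tbl=tbl)
    print(result.tbl_name, result.all_passed())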