Contracts and pipelines are designed to be data, not code. They serialize cleanly to YAML, making them useful for version control, team collaboration, and configuration-driven workflows. This guide covers how to write, load, and use YAML-based contracts and pipelines.
When you define a contract in Python, it lives inside your application code. That’s fine for quick validation tasks, but in production data platforms you often want the specification of what data should look like to live separately from the code that processes it. YAML contracts give you exactly that separation: a human-readable, machine-parseable definition of data expectations that anyone on your team can read, review, and modify.
Why YAML?
YAML contracts offer several advantages over defining contracts purely in Python:
Version-controllable: track contract changes alongside your code in Git, with clear diffs when expectations evolve
Language-agnostic: contracts can be understood (and even authored) by non-Python team members such as data analysts, product managers, or compliance officers
Separates concerns: validation rules live apart from transform logic, meaning you can update expectations without touching application code
CI/CD friendly: load and enforce contracts in automated pipelines and gate deployments on contract compliance
Reviewable: pull requests that change data contracts are easy to review because the YAML format highlights exactly what changed in a readable way
The YAML approach doesn’t replace the Python API; it complements it. Use YAML for stable, well-defined contracts that change infrequently, and Python for exploratory validation or complex logic that benefits from programmatic construction.
Contract YAML Format
Every contract YAML file uses a contract: top-level key as its root element. Inside this key you define all the metadata, schema, and validation steps that make up your contract. The structure is designed to be self-documenting so someone reading the file should immediately understand what data this contract governs, who owns it, and what rules are enforced.
Here’s a complete example showing all the available fields:
contract:
  name: customer_records
  direction: source
  version: "1.2.0"
  owner: data-platform-team
  consumers:
    - analytics-team
    - ml-team
  description: "Raw customer data from the CRM system"
  on_violation: warn
  schema:
    customer_id: String
    email: String
    name: String
    signup_date: String
    plan: String
  steps:
    - col_vals_not_null:
        columns:
          - customer_id
          - email
    - col_vals_regex:
        columns: email
        pattern: "^[^@]+@[^@]+\\.[^@]+$"
    - rows_distinct:
        columns_subset:
          - customer_id
    - col_vals_in_set:
        columns: plan
        set:
          - free
          - pro
          - enterprise
  thresholds:
    warning: 0.01
    error: 0.05
    critical: 0.10
Structure Reference
The table below summarizes every key available inside the contract: block. Only name is required; every other key either has a sensible default or can be omitted when it doesn’t apply to your use case.
Steps are the heart of a contract. They define the actual validation rules that will be checked against your data. In YAML, each step is a list item whose key is the name of a Pointblank validation method (like col_vals_gt or rows_distinct), and whose value is a dictionary of the arguments to pass to that method.
This mapping is direct: the YAML step col_vals_gt: {columns: amount, value: 0} is exactly equivalent to calling .col_vals_gt(columns="amount", value=0) in Python. Here are examples covering the most common patterns:
steps:
  # Step with keyword arguments
  - col_vals_gt:
      columns: amount
      value: 0

  # Step with a list of columns
  - col_vals_not_null:
      columns:
        - id
        - name
        - email

  # Step with multiple arguments
  - col_vals_between:
      columns: age
      left: 0
      right: 150

  # Step with a set/list of values
  - col_vals_in_set:
      columns: status
      set:
        - active
        - inactive
        - suspended

  # Step with no arguments
  - rows_complete: {}
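The direct mapping between a YAML step and a method call can be sketched in plain Python. This is a minimal illustration, not Pointblank's actual loader: `RecordingValidator` and `apply_steps` are hypothetical stand-ins, and the step list is written as the structure a YAML parser would produce for three of the steps above.

```python
from typing import Any


class RecordingValidator:
    """Hypothetical stand-in for a validation object; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def col_vals_gt(self, columns: str, value: float) -> "RecordingValidator":
        self.calls.append(("col_vals_gt", {"columns": columns, "value": value}))
        return self

    def col_vals_not_null(self, columns: list[str]) -> "RecordingValidator":
        self.calls.append(("col_vals_not_null", {"columns": columns}))
        return self

    def rows_complete(self) -> "RecordingValidator":
        self.calls.append(("rows_complete", {}))
        return self


def apply_steps(
    validator: RecordingValidator, steps: list[dict[str, Any]]
) -> RecordingValidator:
    """Call one method per step: the key names the method, the value is its kwargs."""
    for step in steps:
        if len(step) != 1:
            raise ValueError(f"each step needs exactly one key, got {list(step)}")
        method_name, kwargs = next(iter(step.items()))
        method = getattr(validator, method_name, None)
        if not callable(method):
            raise ValueError(f"unknown validation method: {method_name}")
        validator = method(**(kwargs or {}))
    return validator


# What a YAML parser would return for three of the steps shown above.
parsed_steps = [
    {"col_vals_gt": {"columns": "amount", "value": 0}},
    {"col_vals_not_null": {"columns": ["id", "name", "email"]}},
    {"rows_complete": {}},
]

validator = apply_steps(RecordingValidator(), parsed_steps)
for name, kwargs in validator.calls:
    print(name, kwargs)
```

Because the step key is looked up by name, a misspelled method in the YAML file fails immediately with a clear error rather than being silently skipped.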
Writing and Loading Contracts
Pointblank supports a complete round-trip workflow: define contracts in Python, save them to YAML for version control and sharing, then load them back in any environment. This section walks through each step of that workflow.
Saving to YAML
The .to_yaml() method serializes a Contract object to a YAML string. You can optionally pass a file path to write directly to disk. The resulting YAML is clean, readable, and contains all the information needed to reconstruct the contract exactly.
import pointblank as pb
from pathlib import Path

# Create a contract in Python
contract = pb.Contract(
    name="product_catalog",
    direction="source",
    version="1.0.0",
    owner="catalog-team",
    schema=pb.Schema(
        product_id="String",
        name="String",
        price="Float64",
        category="String",
        in_stock="Boolean",
    ),
    steps=[
        pb.Step("col_vals_not_null", columns=["product_id", "name", "price"]),
        pb.Step("col_vals_gt", columns="price", value=0),
        pb.Step(
            "col_vals_in_set",
            columns="category",
            set=["electronics", "clothing", "food", "home"],
        ),
        pb.Step("rows_distinct", columns_subset=["product_id"]),
    ],
    thresholds=pb.Thresholds(warning=0.01, error=0.05),
)

# Save to YAML
yaml_output = contract.to_yaml("product_contract.yaml")
print(yaml_output)
The Contract.from_yaml() class method reads a YAML file and reconstructs the full Contract object, including schema, steps, thresholds, and all metadata. The loaded contract behaves identically to one created directly in Python.
# Load the contract back
loaded_contract = pb.Contract.from_yaml("product_contract.yaml")
print(loaded_contract)
print(f"Steps: {len(loaded_contract.steps)}")
print(f"Version: {loaded_contract.version}")
Once loaded, a contract works exactly like one defined in Python. Call .validate(data) to compile it into a Validate object, execute all checks, and get the standard Pointblank validation report. This is the core value proposition of YAML contracts: define once, validate anywhere.
While individual contracts validate data at a single point, pipelines enforce quality at both the input and output boundaries of a transformation. A pipeline YAML file packages the source contract, target contract, and pipeline-level settings (like global thresholds and short-circuit behavior) into a single, self-contained document.
This is especially powerful for team workflows: a data engineer defines the pipeline specification in YAML, commits it to version control, and any downstream system can load and execute it without knowing anything about the implementation details.
Let’s build a pipeline in Python and export it to see what the YAML looks like:
The generated YAML has three top-level sections that correspond to the three components of boundary enforcement: the pipeline: section for global settings, the source: section for the inbound contract, and the target: section for the outbound contract. Each section can be understood independently, which makes reviewing changes straightforward.
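To make the three-section layout concrete, the sketch below checks that a parsed pipeline document has the expected top-level keys. It assumes the file has already been parsed into a dictionary (for example by a YAML parser). `check_pipeline_sections` and the sample values are hypothetical, and the keys inside each section are illustrative assumptions rather than the documented format.

```python
REQUIRED_SECTIONS = ("pipeline", "source", "target")


def check_pipeline_sections(document: dict) -> list[str]:
    """Return the top-level sections that are missing or are not mappings."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(document.get(section), dict):
            problems.append(section)
    return problems


# Illustrative stand-in for a parsed pipeline file; inner keys are assumptions.
parsed = {
    "pipeline": {"label": "Event Pipeline"},
    "source": {"name": "raw_events"},
    "target": {"name": "clean_events"},
}

print(check_pipeline_sections(parsed))            # []
print(check_pipeline_sections({"pipeline": {}}))  # ['source', 'target']
```

A check like this is cheap to run in a pre-commit hook, before any data is loaded.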
Loading a pipeline from YAML works the same way as loading a contract: call the from_yaml() class method with the file path. The returned Pipeline object is fully functional and ready to validate data or run a complete boundary enforcement workflow.
# Load the pipeline from YAML
loaded_pipeline = pb.Pipeline.from_yaml("event_pipeline.yaml")
print(loaded_pipeline)
With the pipeline loaded, you provide data and a transform function just as you would with a pipeline defined in Python. The YAML only stores the contract specifications whereas the transform logic always lives in your Python code (since transforms are arbitrary functions that can’t be meaningfully serialized to YAML).
import polars as pl

# Create test data and run the pipeline
events = pl.DataFrame(
    {
        "event_id": ["EVT-001", "EVT-002", "EVT-003", "EVT-004"],
        "user_id": ["U100", "U200", "U100", "U300"],
        "event_type": ["purchase", "click", "view", "signup"],
        "value_cents": [2500, 0, 0, 0],
    }
)


def enrich_events(df: pl.DataFrame) -> pl.DataFrame:
    """Convert cents to dollars, defaulting zero to a minimum value."""
    return df.with_columns(
        pl.when(pl.col("value_cents") > 0)
        .then(pl.col("value_cents") / 100)
        .otherwise(0.01)  # Minimum trackable value
        .alias("value")
    ).drop("value_cents")


result = loaded_pipeline.run(data=events, transform=enrich_events)
print(f"Pipeline passed: {result.passed}")
The following patterns illustrate common ways teams use YAML contracts and pipelines in real-world data platforms. These aren’t exhaustive: they’re starting points that you can adapt to your own workflow.
Pattern 1: Environment-Specific Pipelines
A common need is to apply different tolerance levels in different environments. In development, you might accept 20% failure rates while iterating on a transform. In production, a failure rate above 1% should already raise a warning. YAML contracts make this easy: store the same validation logic in a shared contract, but load different pipeline configurations that set environment-appropriate thresholds.
# Same contracts, different thresholds
# (`source` and `target` are existing Contract objects)
dev_pipeline = pb.Pipeline(
    source=source,
    target=target,
    thresholds=pb.Thresholds(warning=0.20, error=0.50),  # Lenient for dev
    label="DEV: Event Pipeline",
)

prod_pipeline = pb.Pipeline(
    source=source,
    target=target,
    thresholds=pb.Thresholds(warning=0.01, error=0.05, critical=0.10),  # Strict
    label="PROD: Event Pipeline",
)

# In practice, you'd select based on an env var:
# import os
# pipeline = pb.Pipeline.from_yaml(f"contracts/{os.environ['ENV']}_pipeline.yaml")
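To see how the same failure rate lands differently under each configuration, here is a small standard-library sketch. `severity` is a hypothetical helper, and it assumes a level is reached when the failing fraction meets or exceeds its threshold; check Pointblank's documentation for the exact comparison it applies.

```python
from typing import Optional

# Threshold values copied from the dev and prod pipelines above.
DEV = {"warning": 0.20, "error": 0.50}
PROD = {"warning": 0.01, "error": 0.05, "critical": 0.10}

# Ordered from most to least severe so the first match wins.
LEVELS = ("critical", "error", "warning")


def severity(failing_fraction: float, thresholds: dict) -> Optional[str]:
    """Return the most severe level reached, or None if no threshold is met.

    Assumption: a level is reached when failing_fraction >= its threshold.
    """
    for level in LEVELS:
        limit = thresholds.get(level)
        if limit is not None and failing_fraction >= limit:
            return level
    return None


for fraction in (0.005, 0.03, 0.25):
    dev_level = severity(fraction, DEV)
    prod_level = severity(fraction, PROD)
    print(f"{fraction:.3f}  dev={dev_level}  prod={prod_level}")
```

Under these assumptions, a 3% failure rate is silent in development but raises a warning in production, and a 25% failure rate is a warning in development but critical in production.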
Pattern 2: Contract Library
As your data platform grows, you’ll accumulate many contracts. A recommended practice is to organize them into a directory structure that mirrors your data architecture. This creates a discoverable, browsable “library” of data quality expectations that serves as living documentation of your system’s data contracts.
# Load any contract from the library
events_source = pb.Contract.from_yaml("contracts/sources/raw_events.yaml")
events_target = pb.Contract.from_yaml("contracts/targets/clean_events.yaml")

# Build pipeline dynamically
pipeline = pb.Pipeline(source=events_source, target=events_target)
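A library laid out this way is easy to index. The sketch below walks a contracts directory and groups YAML files by their parent folder. The `contracts/sources` and `contracts/targets` layout follows the paths used above, while `index_contracts` is a hypothetical helper; the example builds a throwaway directory so it can run anywhere.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def index_contracts(root: Path) -> dict[str, list[str]]:
    """Map each subdirectory (relative to root) to the contract names it holds."""
    index: dict[str, list[str]] = defaultdict(list)
    for path in sorted(root.rglob("*.yaml")):
        group = path.parent.relative_to(root).as_posix()
        index[group].append(path.stem)
    return dict(index)


# Build a temporary library that mirrors the layout used above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "contracts"
    for relative in ("sources/raw_events.yaml", "targets/clean_events.yaml"):
        file_path = root / relative
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text("contract:\n  name: placeholder\n")

    library = index_contracts(root)
    print(library)
```

An index like this could feed a documentation page or a CI job that loads every contract in the library.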
Pattern 3: CI/CD Integration
YAML contracts integrate naturally into CI/CD pipelines. Because contracts are plain files, you can load them in test suites, gate deployments on contract compliance, and fail builds when data quality regresses. This turns data quality from a manual spot-check into an automated, enforceable standard.
Here’s what a pytest-based contract test might look like:
# In your test suite or CI script:
import pointblank as pb


def test_pipeline_contract():
    pipeline = pb.Pipeline.from_yaml("contracts/pipelines/event_pipeline.yaml")
    # load_test_fixture and my_transform are placeholders for your project's own helpers
    test_data = load_test_fixture("events_sample.parquet")
    result = pipeline.run(data=test_data, transform=my_transform)
    assert result.passed, result.get_report()
Pattern 4: Composing Steps from Multiple Sources
Sometimes you want to define a set of “base” validation rules that apply to all tables (like “every table must have a non-null id column that’s unique”), and then layer on table-specific rules. YAML makes this easy because you can parse step definitions from YAML fragments and combine them programmatically.
This pattern is especially useful when you have organization-wide data quality standards that every contract should inherit.
import yaml

# Base steps that apply to all tables
base_steps_yaml = """
- col_vals_not_null:
    columns: [id]
- rows_distinct:
    columns_subset: [id]
"""

base_steps = [pb.Step.from_dict(s) for s in yaml.safe_load(base_steps_yaml)]

# Compose with table-specific steps
full_contract = pb.Contract(
    name="composed_contract",
    steps=base_steps
    + [
        pb.Step("col_vals_gt", columns="score", value=0),
        pb.Step("col_vals_le", columns="score", value=100),
    ],
)

print(f"Total steps: {len(full_contract.steps)}")
for step in full_contract.steps:
    print(f"  - {step}")
A critical property of Pointblank’s YAML serialization is round-trip fidelity: when you save a contract to YAML and load it back, you get an identical object. No information is lost, no defaults are silently changed, and no steps are reordered. This guarantee means you can confidently use YAML as your source of truth without worrying about subtle differences between the YAML representation and the in-memory object.
Let’s verify this with a fully-specified contract:
Name matches: True
Direction matches: True
Version matches: True
Owner matches: True
Steps match: True
Thresholds match: True
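The comparison behind output like this can be sketched without Pointblank. The example below uses a hypothetical `MiniContract` dataclass and JSON from the standard library in place of `pb.Contract` and YAML, so it demonstrates the round-trip property itself rather than Pointblank's serializer.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MiniContract:
    """Hypothetical stand-in for a contract; not Pointblank's Contract class."""

    name: str
    direction: str = "source"
    version: str = "1.0.0"
    owner: str = ""
    steps: list = field(default_factory=list)
    thresholds: dict = field(default_factory=dict)

    def dumps(self) -> str:
        # JSON is used here only to stay within the standard library.
        return json.dumps(asdict(self))

    @classmethod
    def loads(cls, text: str) -> "MiniContract":
        return cls(**json.loads(text))


original = MiniContract(
    name="customer_records",
    version="1.2.0",
    owner="data-platform-team",
    steps=[
        {"col_vals_not_null": {"columns": ["customer_id", "email"]}},
        {"rows_distinct": {"columns_subset": ["customer_id"]}},
    ],
    thresholds={"warning": 0.01, "error": 0.05, "critical": 0.10},
)
restored = MiniContract.loads(original.dumps())

# Compare field by field, then as a whole; step order must be preserved.
for field_name in ("name", "direction", "version", "owner", "steps", "thresholds"):
    matches = getattr(original, field_name) == getattr(restored, field_name)
    print(f"{field_name} matches: {matches}")
assert restored == original
```

Comparing the whole object after the per-field checks catches any attribute the field list forgot to mention.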
Cleanup
The examples in this guide created several YAML files on disk to demonstrate the serialization workflow. The following hidden cell removes them so they don’t clutter your working directory. In your own projects, these files would live in version control and persist across sessions. You’d never need to clean them up like this.
Conclusion
The table below summarizes the key operations and how they map between the Python API and YAML:
| Operation | Python API | YAML |
| --- | --- | --- |
| Define contract | pb.Contract(...) | contract: block |
| Define pipeline | pb.Pipeline(...) | pipeline: + source: + target: |
| Save | .to_yaml("path.yaml") | n/a |
| Load contract | pb.Contract.from_yaml("path.yaml") | n/a |
| Load pipeline | pb.Pipeline.from_yaml("path.yaml") | n/a |
| Validate | .validate(data) | Load -> validate in Python |
| Run pipeline | .run(data, transform) | Load -> run in Python |
YAML contracts and pipelines give you the best of both worlds: declarative, reviewable specifications that integrate seamlessly with Pointblank’s Python execution engine. The separation of what (YAML contracts) from how (Python transforms) creates a clean architecture where data quality expectations can evolve independently from processing logic: reviewed by different people, on different timelines, with clear accountability at each boundary.