Data Contracts

A data contract is a declarative specification of what data should look like at a particular point in your system. Instead of writing imperative validation code every time you need to check data, you define expectations as data once and then enforce them anywhere.

Pointblank’s Contract class combines three things into a single, portable unit:

  1. Schema: what columns exist and what types they are
  2. Validation steps: semantic rules the data must satisfy
  3. Metadata: who owns it, what version it is, and what to do on failure

Contracts can be serialized to YAML, version-controlled, and shared across teams. They can serve as the single source of truth for data quality expectations at a given boundary.

Creating Your First Contract

The simplest contract just names a set of validation steps:

import pointblank as pb
import polars as pl

# Define a contract for customer data
customer_contract = pb.Contract(
    name="customer_records",
    steps=[
        pb.Step("col_vals_not_null", columns=["customer_id", "email"]),
        pb.Step("col_vals_regex", columns="email", pattern=r"^[^@]+@[^@]+\.[^@]+$"),
        pb.Step("rows_distinct", columns_subset=["customer_id"]),
    ],
)

This contract essentially says: “Any data calling itself customer_records must have non-null IDs and emails, valid email formats, and unique customer IDs.”

Now validate some data against it:

customers = pl.DataFrame(
    {
        "customer_id": ["C001", "C002", "C003", "C004"],
        "email": ["alice@example.com", "bob@corp.io", "charlie@mail.org", "dave@startup.co"],
        "name": ["Alice", "Bob", "Charlie", "Dave"],
        "signup_date": ["2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05"],
    }
)

customer_contract.validate(customers)
Pointblank Validation
Contract: customer_records
Polars | customer_records

STEP  METHOD               COLUMNS      VALUES                UNITS  PASS      FAIL
1     col_vals_not_null()  customer_id  -                     4      4 (1.00)  0 (0.00)
2     col_vals_not_null()  email        -                     4      4 (1.00)  0 (0.00)
3     col_vals_regex()     email        ^[^@]+@[^@]+\.[^@]+$  4      4 (1.00)  0 (0.00)
4     rows_distinct()      customer_id  -                     4      4 (1.00)  0 (0.00)

The .validate() method compiles the contract into a Validate object, runs all the checks, and returns the interrogated result. You get the same rich validation report you’re used to, but the contract itself is a reusable, declarative artifact.
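How a list of declarative steps might be turned into method calls can be sketched in plain Python. This is a minimal illustration, not Pointblank's implementation: `StepSketch`, `MiniValidate`, and `compile_steps` are hypothetical stand-ins, and the only idea carried over from the text is that each step holds a method name plus keyword arguments.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepSketch:
    """Hypothetical stand-in for a declarative step: a method name plus kwargs."""
    method: str
    kwargs: dict[str, Any] = field(default_factory=dict)


class MiniValidate:
    """Toy validator over a list of row dicts (NOT the real Validate class)."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows
        self.results: list[tuple[str, int, int]] = []  # (label, passed, total)

    def col_vals_not_null(self, columns: str) -> "MiniValidate":
        passed = sum(row[columns] is not None for row in self.rows)
        self.results.append((f"col_vals_not_null({columns})", passed, len(self.rows)))
        return self

    def col_vals_gt(self, columns: str, value: float) -> "MiniValidate":
        passed = sum(
            row[columns] is not None and row[columns] > value for row in self.rows
        )
        self.results.append((f"col_vals_gt({columns}, {value})", passed, len(self.rows)))
        return self


def compile_steps(rows: list[dict[str, Any]], steps: list[StepSketch]) -> MiniValidate:
    """Replay each step by looking up its method name and passing kwargs through."""
    validator = MiniValidate(rows)
    for step in steps:
        getattr(validator, step.method)(**step.kwargs)
    return validator


rows = [{"id": "A", "amount": 10.0}, {"id": None, "amount": -1.0}]
steps = [
    StepSketch("col_vals_not_null", {"columns": "id"}),
    StepSketch("col_vals_gt", {"columns": "amount", "value": 0}),
]
report = compile_steps(rows, steps).results
print(report)
```

Because the steps are looked up by name at run time, the same step list can be replayed against any table without rewriting the checks.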

Adding a Schema

Contracts can include a Schema to enforce structural expectations (column names and data types):

order_contract = pb.Contract(
    name="order_data",
    direction="source",
    schema=pb.Schema(
        order_id="String",
        customer_id="String",
        amount="Float64",
        quantity="Int64",
        status="String",
    ),
    steps=[
        pb.Step("col_vals_not_null", columns=["order_id", "customer_id", "amount"]),
        pb.Step("col_vals_gt", columns="amount", value=0),
        pb.Step("col_vals_ge", columns="quantity", value=1),
        pb.Step("col_vals_in_set", columns="status", set=["pending", "shipped", "delivered"]),
        pb.Step("rows_distinct", columns_subset=["order_id"]),
    ],
    version="1.0.0",
    owner="data-platform-team",
)

When a schema is defined, the contract automatically adds a col_schema_match() step before all other validation steps. This means the schema check runs first, and if the table doesn’t have the right columns and types, you’ll know immediately.
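The ordering described here can be pictured with a small plain-Python sketch. It is an illustration under stated assumptions rather than Pointblank code: `plan_steps` and `schema_matches` are hypothetical helpers, and the strict comparison (same columns, same order, same dtype strings) is an assumption about what a complete, in-order match means.

```python
from typing import Optional

ColumnSpec = list[tuple[str, str]]  # ordered (column name, dtype name) pairs


def schema_matches(actual: ColumnSpec, expected: ColumnSpec) -> bool:
    """Strict match: same columns, same order, same dtype strings."""
    return actual == expected


def plan_steps(schema: Optional[ColumnSpec], steps: list[str]) -> list[str]:
    """Put the schema check ahead of every other step when a schema is given."""
    return (["col_schema_match"] if schema is not None else []) + list(steps)


expected = [("order_id", "String"), ("amount", "Float64"), ("quantity", "Int64")]
good = [("order_id", "String"), ("amount", "Float64"), ("quantity", "Int64")]
drifted = [("order_id", "String"), ("amount", "Int64"), ("quantity", "Int64")]

plan = plan_steps(expected, ["col_vals_not_null", "col_vals_gt"])
print(plan)                               # schema check comes first
print(schema_matches(good, expected))     # True
print(schema_matches(drifted, expected))  # False: amount has the wrong dtype
```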

orders = pl.DataFrame(
    {
        "order_id": ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-005"],
        "customer_id": ["C001", "C002", "C001", "C003", "C002"],
        "amount": [29.99, 149.50, 9.99, 75.00, 220.00],
        "quantity": [1, 3, 1, 2, 5],
        "status": ["shipped", "pending", "delivered", "pending", "shipped"],
    }
)

order_contract.validate(orders)
Pointblank Validation
Contract: order_data v1.0.0
Polars | order_data

STEP  METHOD               COLUMNS      VALUES                       UNITS  PASS      FAIL
1     col_schema_match()   -            SCHEMA                       1      1 (1.00)  0 (0.00)
2     col_vals_not_null()  order_id     -                            5      5 (1.00)  0 (0.00)
3     col_vals_not_null()  customer_id  -                            5      5 (1.00)  0 (0.00)
4     col_vals_not_null()  amount       -                            5      5 (1.00)  0 (0.00)
5     col_vals_gt()        amount       0                            5      5 (1.00)  0 (0.00)
6     col_vals_ge()        quantity     1                            5      5 (1.00)  0 (0.00)
7     col_vals_in_set()    status       pending, shipped, delivered  5      5 (1.00)  0 (0.00)
8     rows_distinct()      order_id     -                            5      5 (1.00)  0 (0.00)

Owner: data-platform-team | Version: 1.0.0

Notes

Step 1 (schema_check) Schema validation passed.

Schema Comparison

TARGET                        EXPECTED
   COLUMN       DATA TYPE        COLUMN       DATA TYPE
1  order_id     String        1  order_id     String
2  customer_id  String        2  customer_id  String
3  amount       Float64       3  amount       Float64
4  quantity     Int64         4  quantity     Int64
5  status       String        5  status       String

Supplied Column Schema:
[('order_id', 'String'), ('customer_id', 'String'), ('amount', 'Float64'), ('quantity', 'Int64'), ('status', 'String')]

Schema Match Settings: COMPLETE, IN ORDER, COLUMN ≠ column, DTYPE ≠ dtype, float ≠ float64

The Step Class

Each validation rule in a contract is represented by a Step, which is a declarative description of a single validation method call. Steps store the method name and its arguments as plain data:

# Each Step describes a single validation method call:
step1 = pb.Step("col_vals_gt", columns="revenue", value=0)
step2 = pb.Step("col_vals_between", columns="age", left=0, right=150)
step3 = pb.Step("col_vals_in_set", columns="country", set=["US", "UK", "CA", "AU"])
step4 = pb.Step("col_vals_regex", columns="phone", pattern=r"^\+?[0-9\-\(\) ]+$")

# Steps are data, so you can inspect them
print(step1)
print(step1.method)
print(step1.kwargs)
Step('col_vals_gt', columns='revenue', value=0)
col_vals_gt
{'columns': 'revenue', 'value': 0}

Because steps are data (not function calls), they serialize cleanly to YAML/JSON and can be reconstructed anywhere. The method field corresponds directly to a validation method on the Validate class, and the kwargs are passed through verbatim.
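A round trip through JSON shows why storing steps as plain data matters. The sketch below uses only the standard library and a hypothetical `StepSketch` dataclass in place of `pb.Step`; it makes no claim about the serialization methods Pointblank itself provides.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class StepSketch:
    """Hypothetical stand-in for a step: a method name plus keyword arguments."""
    method: str
    kwargs: dict[str, Any] = field(default_factory=dict)


original = [
    StepSketch("col_vals_gt", {"columns": "revenue", "value": 0}),
    StepSketch("col_vals_in_set", {"columns": "country", "set": ["US", "UK"]}),
]

# Serialize: dataclasses become plain dicts, which json can write directly.
payload = json.dumps([asdict(step) for step in original], indent=2)

# Reconstruct: feed each decoded dict back into the constructor.
restored = [StepSketch(**item) for item in json.loads(payload)]

print(payload)
print(restored == original)
```

Nothing in the payload is executable code, so it can be stored in version control and rebuilt in another process or on another machine.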

Available Methods

Any validation method from the Validate class can be used in a Step. Here are the most common ones for contract definitions:

Method                                                        Purpose
col_vals_gt(), col_vals_lt(), col_vals_ge(), col_vals_le()    Numeric bounds
col_vals_between(), col_vals_outside()                        Range checks
col_vals_eq(), col_vals_ne()                                  Equality checks
col_vals_in_set(), col_vals_not_in_set()                      Categorical membership
col_vals_not_null(), col_vals_null()                          Null checks
col_vals_regex()                                              Pattern matching
col_exists()                                                  Column presence
rows_distinct()                                               Uniqueness (use columns_subset=)
rows_complete()                                               No nulls in any column
row_count_match()                                             Expected row count
col_count_match()                                             Expected column count

Contract Metadata

Contracts support rich metadata that makes them self-documenting and suitable for team workflows:

production_contract = pb.Contract(
    name="clean_sales_report",
    direction="target",              # "source" or "target" boundary
    version="2.1.0",                 # Semantic versioning for evolution
    owner="data-platform-team",      # Who maintains this contract
    consumers=["analytics-team", "ml-team"],  # Who depends on it
    description="Validated, deduplicated sales data ready for downstream consumption.",
    on_violation="warn",             # "warn", "raise", or "log"
    schema=pb.Schema(
        sale_id="String",
        revenue="Float64",
        region="String",
    ),
    steps=[
        pb.Step("col_vals_not_null", columns=["sale_id", "revenue", "region"]),
        pb.Step("col_vals_gt", columns="revenue", value=0),
        pb.Step("rows_distinct", columns_subset=["sale_id"]),
    ],
    thresholds=pb.Thresholds(warning=0.01, error=0.05, critical=0.10),
)

print(production_contract)
Contract(name='clean_sales_report', direction='target', version='2.1.0', schema=<defined>, steps=3)
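The thresholds= argument in this contract is given as fractions of failing test units. A rough sketch of how such fractions could map to severity levels follows. `levels_reached` is a hypothetical helper, and the "at or above" comparison is an assumption that should be checked against the Thresholds documentation.

```python
def levels_reached(n_failed: int, n_units: int, thresholds: dict[str, float]) -> list[str]:
    """Return the severity levels whose failure-fraction threshold is met."""
    fraction_failed = n_failed / n_units
    return [level for level, limit in thresholds.items() if fraction_failed >= limit]


# Same fractions as the contract above: 1%, 5%, and 10% failing test units.
thresholds = {"warning": 0.01, "error": 0.05, "critical": 0.10}

print(levels_reached(0, 1000, thresholds))    # no level reached
print(levels_reached(20, 1000, thresholds))   # 2% failing
print(levels_reached(70, 1000, thresholds))   # 7% failing
print(levels_reached(400, 1000, thresholds))  # 40% failing
```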

Direction

The direction parameter is metadata that signals where in a pipeline this contract applies:

  • "source": for inbound/raw data arriving from upstream
  • "target": for outbound data leaving your transform

Direction doesn’t change validation behavior, but it’s used in pipeline reports and helps teams understand the contract’s role in the system.

Violation Handling

The on_violation parameter controls what happens when validation fails. It is used by the Pipeline class, which is covered on the next guide page:

  • "warn" (default): issue a Python UserWarning
  • "raise": raise a RuntimeError (halts execution)
  • "log": log via the pointblank.contract logger
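The three modes map onto standard Python mechanisms, which the following sketch illustrates. `handle_violation` is a hypothetical function, not part of Pointblank; the logger name comes from the list above, while the log level (WARNING) and the message text are assumptions.

```python
import logging
import warnings

logger = logging.getLogger("pointblank.contract")  # logger name taken from the text


def handle_violation(contract_name: str, on_violation: str = "warn") -> None:
    """Hypothetical dispatcher for the three on_violation modes."""
    message = f"Contract '{contract_name}' was violated."
    if on_violation == "warn":
        warnings.warn(message, UserWarning)
    elif on_violation == "raise":
        raise RuntimeError(message)
    elif on_violation == "log":
        logger.warning(message)  # assumed level; the text does not specify one
    else:
        raise ValueError(f"Unknown on_violation mode: {on_violation!r}")


# "warn" lets execution continue; the warning can be captured for inspection.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    handle_violation("order_data", "warn")
print(caught[0].category.__name__)

# "raise" stops the run unless the caller handles the error.
try:
    handle_violation("order_data", "raise")
except RuntimeError as exc:
    print(f"halted: {exc}")
```

In practice, "raise" suits boundaries where bad data must never pass through, while "warn" and "log" suit monitoring without interruption.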

Using to_validate() for Custom Workflows

If you need more control, to_validate() gives you back an un-interrogated Validate object that you can extend with additional checks:

# Start from the contract, then add ad-hoc checks
validation = (
    order_contract
    .to_validate(orders)
    .col_vals_lt(columns="amount", value=500)  # Additional check not in the contract
    .interrogate()
)

validation
Pointblank Validation
Contract: order_data v1.0.0
Polars | order_data

STEP  METHOD               COLUMNS      VALUES                       UNITS  PASS      FAIL
1     col_schema_match()   -            SCHEMA                       1      1 (1.00)  0 (0.00)
2     col_vals_not_null()  order_id     -                            5      5 (1.00)  0 (0.00)
3     col_vals_not_null()  customer_id  -                            5      5 (1.00)  0 (0.00)
4     col_vals_not_null()  amount       -                            5      5 (1.00)  0 (0.00)
5     col_vals_gt()        amount       0                            5      5 (1.00)  0 (0.00)
6     col_vals_ge()        quantity     1                            5      5 (1.00)  0 (0.00)
7     col_vals_in_set()    status       pending, shipped, delivered  5      5 (1.00)  0 (0.00)
8     rows_distinct()      order_id     -                            5      5 (1.00)  0 (0.00)
9     col_vals_lt()        amount       500                          5      5 (1.00)  0 (0.00)

Owner: data-platform-team | Version: 1.0.0

Notes

Step 1 (schema_check) Schema validation passed.

Schema Comparison

TARGET                        EXPECTED
   COLUMN       DATA TYPE        COLUMN       DATA TYPE
1  order_id     String        1  order_id     String
2  customer_id  String        2  customer_id  String
3  amount       Float64       3  amount       Float64
4  quantity     Int64         4  quantity     Int64
5  status       String        5  status       String

Supplied Column Schema:
[('order_id', 'String'), ('customer_id', 'String'), ('amount', 'Float64'), ('quantity', 'Int64'), ('status', 'String')]

Schema Match Settings: COMPLETE, IN ORDER, COLUMN ≠ column, DTYPE ≠ dtype, float ≠ float64

This is useful when a contract captures your baseline expectations, but a specific workflow needs extra checks on top.

Contract Equality and Composition

Steps support equality comparison, which makes it easy to check whether two contracts define the same rules:

# Steps are equal if they have the same method and kwargs
s1 = pb.Step("col_vals_gt", columns="x", value=0)
s2 = pb.Step("col_vals_gt", columns="x", value=0)
s3 = pb.Step("col_vals_gt", columns="x", value=10)

print(f"s1 == s2: {s1 == s2}")
print(f"s1 == s3: {s1 == s3}")
s1 == s2: True
s1 == s3: False
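Equality also makes it straightforward to diff two versions of a step list. The sketch below uses a hypothetical `StepSketch` dataclass in place of `pb.Step`, assuming only that two steps are equal when they share the same method and kwargs; `diff_steps` is likewise a made-up helper.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepSketch:
    """Hypothetical stand-in for a step; dataclass equality compares both fields."""
    method: str
    kwargs: dict[str, Any] = field(default_factory=dict)


def diff_steps(
    old: list[StepSketch], new: list[StepSketch]
) -> tuple[list[StepSketch], list[StepSketch]]:
    """Return (added, removed) steps between two versions of a step list."""
    added = [step for step in new if step not in old]
    removed = [step for step in old if step not in new]
    return added, removed


v1 = [
    StepSketch("col_vals_not_null", {"columns": ["id"]}),
    StepSketch("col_vals_gt", {"columns": "x", "value": 0}),
]
v2 = [
    StepSketch("col_vals_not_null", {"columns": ["id"]}),
    StepSketch("col_vals_gt", {"columns": "x", "value": 10}),  # bound tightened
]

added, removed = diff_steps(v1, v2)
print("added:", added)
print("removed:", removed)
```

A diff like this can back a review step that flags when a new contract version loosens or drops a rule.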

You can compose contracts by combining steps from multiple sources:

# Common checks shared across all tables
common_steps = [
    pb.Step("col_vals_not_null", columns=["id"]),
    pb.Step("rows_distinct", columns_subset=["id"]),
]

# Table-specific checks
sales_steps = [
    pb.Step("col_vals_gt", columns="revenue", value=0),
    pb.Step("col_vals_in_set", columns="region", set=["NA", "EU", "APAC"]),
]

# Compose them
sales_contract = pb.Contract(
    name="sales_data",
    steps=common_steps + sales_steps,
    version="1.0.0",
)

print(f"Total steps: {len(sales_contract.steps)}")
Total steps: 4

Example: Validating with Failures

Let’s see what happens when data doesn’t meet the contract:

# Data with quality issues
bad_orders = pl.DataFrame(
    {
        "order_id": ["ORD-001", "ORD-002", "ORD-001", "ORD-004", None],   # duplicates + null
        "customer_id": ["C001", None, "C001", "C003", "C002"],            # null
        "amount": [29.99, -5.00, 9.99, 0.0, 220.00],                      # negative + zero
        "quantity": [1, 3, 1, 0, 5],                                      # zero (should be >= 1)
        "status": ["shipped", "pending", "invalid", "pending", "shipped"],# invalid value
    }
)

order_contract.validate(bad_orders)
Pointblank Validation
Contract: order_data v1.0.0
Polars | order_data

STEP  METHOD               COLUMNS      VALUES                       UNITS  PASS      FAIL
1     col_schema_match()   -            SCHEMA                       1      1 (1.00)  0 (0.00)
2     col_vals_not_null()  order_id     -                            5      4 (0.80)  1 (0.20)
3     col_vals_not_null()  customer_id  -                            5      4 (0.80)  1 (0.20)
4     col_vals_not_null()  amount       -                            5      5 (1.00)  0 (0.00)
5     col_vals_gt()        amount       0                            5      3 (0.60)  2 (0.40)
6     col_vals_ge()        quantity     1                            5      4 (0.80)  1 (0.20)
7     col_vals_in_set()    status       pending, shipped, delivered  5      4 (0.80)  1 (0.20)
8     rows_distinct()      order_id     -                            5      3 (0.60)  2 (0.40)

Owner: data-platform-team | Version: 1.0.0

Notes

Step 1 (schema_check) Schema validation passed.

Schema Comparison

TARGET                        EXPECTED
   COLUMN       DATA TYPE        COLUMN       DATA TYPE
1  order_id     String        1  order_id     String
2  customer_id  String        2  customer_id  String
3  amount       Float64       3  amount       Float64
4  quantity     Int64         4  quantity     Int64
5  status       String        5  status       String

Supplied Column Schema:
[('order_id', 'String'), ('customer_id', 'String'), ('amount', 'Float64'), ('quantity', 'Int64'), ('status', 'String')]

Schema Match Settings: COMPLETE, IN ORDER, COLUMN ≠ column, DTYPE ≠ dtype, float ≠ float64

The validation report clearly shows which steps failed and how many test units were affected. This makes it easy to diagnose data quality issues and communicate them to data producers.
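The pass and fail counts in the report can be reproduced by hand. The plain-Python sketch below recounts two of the steps for bad_orders. It assumes, consistent with the counts shown in the report, that every row sharing a duplicated order_id counts as a failing test unit.

```python
from collections import Counter

# Columns copied from the bad_orders example.
amount = [29.99, -5.00, 9.99, 0.0, 220.00]
order_id = ["ORD-001", "ORD-002", "ORD-001", "ORD-004", None]

# col_vals_gt(columns="amount", value=0): one test unit per row.
gt_pass = sum(value > 0 for value in amount)
gt_fail = len(amount) - gt_pass

# rows_distinct(columns_subset=["order_id"]): rows whose key appears
# more than once are counted as failing (assumption, see lead-in).
key_counts = Counter(order_id)
distinct_fail = sum(key_counts[key] > 1 for key in order_id)
distinct_pass = len(order_id) - distinct_fail

print(gt_pass, gt_fail, gt_pass / len(amount))
print(distinct_pass, distinct_fail, distinct_pass / len(order_id))
```

Both steps come out at 3 passing and 2 failing units, matching the 0.60 and 0.40 fractions for steps 5 and 8.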

What’s Next

Now that you understand contracts, the next guide page covers Pipelines. They combine a source contract, a transform, and a target contract into a complete boundary enforcement workflow. This lets you validate data at both the input and output of your data transformations.