Draft Validation

Draft Validation in Pointblank leverages large language models (LLMs) to automatically generate validation plans for your data. This feature is especially useful when starting validation on a new dataset or when you need to quickly establish baseline validation coverage.

The DraftValidation class connects to various LLM providers to analyze your data’s characteristics and generate a complete validation plan tailored to its structure and content.

How `DraftValidation` Works

When you use DraftValidation, the process works through these steps:

a statistical summary of your data is generated using the DataScan class
this summary is converted to JSON format and sent to your selected LLM provider
the LLM uses the summary along with knowledge about Pointblank’s validation capabilities to generate a validation plan
the result is returned as executable Python code that you can use directly or modify as needed

The entire process happens without sending all of the data to the LLM provider, but only a summary that includes column names, data types, basic statistics, and a small sample of values.

Requirements and Setup

To use the DraftValidation feature, you’ll need:

an API key from a supported LLM provider
the required Python packages installed

You can install all necessary dependencies with:

pip install pointblank[generate]

This will install the chatlas package and other dependencies required for DraftValidation.

Supported LLM Providers

The DraftValidation class supports multiple LLM providers:

Anthropic (Claude models)
OpenAI (GPT models)
Ollama (local LLMs)
Amazon Bedrock (AWS-hosted models)

Each provider has different capabilities and performance characteristics, but all can be used to generate validation plans through a consistent interface.

Basic Usage

The simplest way to use DraftValidation is to provide your data and specify an LLM model. Let’s try it out with the global_sales dataset.

import pointblank as pb

# Load a dataset
data = pb.load_dataset(dataset="global_sales", tbl_type="polars")

# Generate a validation plan
pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest",
    api_key="your_api_key_here"  # Replace with your actual API key
)

```python
import pointblank as pb

# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
    ("product_id", "String"),
    ("product_category", "String"),
    ("customer_id", "String"),
    ("customer_segment", "String"),
    ("region", "String"),
    ("country", "String"),
    ("city", "String"),
    ("timestamp", "Datetime(time_unit='us', time_zone=None)"),
    ("quarter", "String"),
    ("month", "Int64"),
    ("year", "Int64"),
    ("price", "Float64"),
    ("quantity", "Int64"),
    ("status", "String"),
    ("email", "String"),
    ("revenue", "Float64"),
    ("tax", "Float64"),
    ("total", "Float64"),
    ("payment_method", "String"),
    ("sales_channel", "String")
])

# The validation plan
validation = (
    pb.Validate(
        data=your_data,  # Replace your_data with the actual data variable
        label="Draft Validation",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
    )
    .col_schema_match(schema=schema)
    .col_vals_not_null(columns=[
        "product_category", "customer_segment", "region", "country",
        "price", "quantity", "status", "email", "revenue", "tax",
        "total", "payment_method", "sales_channel"
    ])
    .col_vals_between(columns="month", left=1, right=12, na_pass=True)
    .col_vals_between(columns="year", left=2021, right=2023, na_pass=True)
    .col_vals_gt(columns="price", value=0)
    .col_vals_gt(columns="quantity", value=0)
    .col_vals_gt(columns="revenue", value=0)
    .col_vals_gt(columns="tax", value=0)
    .col_vals_gt(columns="total", value=0)
    .col_vals_in_set(columns="product_category", set=[
        "Manufacturing", "Retail", "Healthcare"
    ])
    .col_vals_in_set(columns="customer_segment", set=[
        "Government", "Consumer", "SMB"
    ])
    .col_vals_in_set(columns="region", set=[
        "Asia Pacific", "Europe", "North America"
    ])
    .col_vals_in_set(columns="status", set=[
        "returned", "shipped", "delivered"
    ])
    .col_vals_in_set(columns="payment_method", set=[
        "Apple Pay", "PayPal", "Bank Transfer", "Credit Card"
    ])
    .col_vals_in_set(columns="sales_channel", set=[
        "Partner", "Distributor", "Phone"
    ])
    .row_count_match(count=50000)
    .col_count_match(count=20)
    .rows_distinct()
    .interrogate()
)

validation
```

Managing API Keys

While you can directly provide API keys as shown above, there are more secure approaches:

import os

# Get API key from environment variable
api_key = os.getenv("ANTHROPIC_API_KEY")

draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest",
    api_key=api_key
)

You can also store API keys in a .env file in your project’s root directory:

# Contents of .env file
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

If your API keys have standard names (like ANTHROPIC_API_KEY or OPENAI_API_KEY), DraftValidation will automatically find and use them:

# No API key needed if stored in .env with standard names
draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest"
)

Example Output for `nycflights`

Here’s an example of a validation plan that might be generated by DraftValidation for the nycflights dataset:

pb.DraftValidation(
    pb.load_dataset(dataset="nycflights", tbl_type="duckdb",
    model="anthropic:claude-3-7-sonnet-latest"
)

```python
import pointblank as pb

# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
    ("year", "int64"),
    ("month", "int64"),
    ("day", "int64"),
    ("dep_time", "int64"),
    ("sched_dep_time", "int64"),
    ("dep_delay", "int64"),
    ("arr_time", "int64"),
    ("sched_arr_time", "int64"),
    ("arr_delay", "int64"),
    ("carrier", "string"),
    ("flight", "int64"),
    ("tailnum", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("air_time", "int64"),
    ("distance", "int64"),
    ("hour", "int64"),
    ("minute", "int64")
])

# The validation plan
validation = (
    pb.Validate(
        data=your_data,  # Replace your_data with the actual data variable
        label="Draft Validation",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
    )
    .col_schema_match(schema=schema)
    .col_vals_not_null(columns=[
        "year", "month", "day", "sched_dep_time", "carrier", "flight",
        "origin", "dest", "distance", "hour", "minute"
    ])
    .col_vals_between(columns="month", left=1, right=12)
    .col_vals_between(columns="day", left=1, right=31)
    .col_vals_between(columns="sched_dep_time", left=106, right=2359)
    .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
    .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
    .col_vals_between(columns="distance", left=17, right=4983)
    .col_vals_between(columns="hour", left=1, right=23)
    .col_vals_between(columns="minute", left=0, right=59)
    .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
    .col_count_match(count=18)
    .row_count_match(count=336776)
    .rows_distinct()
    .interrogate()
)

validation
```

Notice how the generated plan includes:

A schema validation with appropriate data types
Not-null checks for required columns
Range validations for numerical data
Set membership checks for categorical data
Row and column count validations
Appropriate handling of missing values with na_pass=True

Working with Model Providers

Specifying Models

When using DraftValidation, you specify the model in the format "provider:model_name":

# Using Anthropic's Claude model
pb.DraftValidation(data=data, model="anthropic:claude-3-7-sonnet-latest")

# Using OpenAI's GPT model
pb.DraftValidation(data=data, model="openai:gpt-4-turbo")

# Using a local model with Ollama
pb.DraftValidation(data=data, model="ollama:llama3:latest")

# Using Amazon Bedrock
pb.DraftValidation(data=data, model="bedrock:anthropic.claude-3-sonnet-20240229-v1:0")

Model Performance and Privacy

Different models have different capabilities when it comes to generating validation plans:

Anthropic Claude 3.7 Sonnet generally provides the most comprehensive and accurate validation plans
OpenAI GPT-4 models also perform well
Local models through Ollama can be useful for private data but they currently have reduced capabilities here

A key advantage of DraftValidation is that your actual dataset is not sent to the LLM provider. Instead, only a summary is transmitted, which includes:

the number of rows and columns
column names and data types
basic statistics (min, max, mean, median, missing values count)
a small sample of values from each column (usually 5-10 values)

This approach protects your data while still providing enough context for the LLM to generate relevant validation rules.

Customizing Generated Plans

The validation plan generated by DraftValidation is just a starting point. You’ll typically want to:

review the generated code for correctness
replace your_data with your actual data variable name that exists in your workspace
ensure the data object referenced is actually present in your workspace
adjust thresholds and validation parameters
add domain-specific validation rules
remove any unnecessary checks

For example, you might start by capturing the text representation of your draft validation. This will give you the raw Python code that you can copy into a new code cell in your notebook or script. From there, you can customize it by modifying thresholds to match your organization’s data quality standards, adding business-specific validation rules that require domain knowledge, or removing checks that aren’t relevant to your use case. Once you’ve made your modifications, you can execute the customized validation plan as you would any other Pointblank validation.

Under the Hood

The Generated Data Summary

To understand what the LLM works with, here’s an example of the data summary format that’s sent:

{
  "table_info": {
    "rows": 336776,
    "columns": 18,
    "table_type": "duckdb"
  },
  "column_info": [
    {
      "column_name": "year",
      "column_type": "int64",
      "missing_values": 0,
      "min": 2013,
      "max": 2013,
      "mean": 2013.0,
      "median": 2013,
      "sample_values": [2013, 2013, 2013, 2013, 2013]
    },
    {
      "column_name": "month",
      "column_type": "int64",
      "missing_values": 0,
      "min": 1,
      "max": 12,
      "mean": 6.548819,
      "median": 7,
      "sample_values": [1, 1, 1, 1, 1]
    },
    // Additional columns...
  ]
}

The Prompt Strategy

The DraftValidation class uses a carefully crafted prompt that instructs the LLM to:

use the schema information to create a Schema object
include col_vals_not_null() for columns with no missing values
add appropriate range validations based on min/max values
include row and column count validations
format the output as clean, executable Python code

The prompt also contains constraints to ensure consistent, high-quality results, such as using line breaks in long lists for readability, applying na_pass=True for columns with missing values, and avoiding duplicate validations.

Best Practices and Troubleshooting

When to Use `DraftValidation`

Drafting a validation is most useful when:

working with a new dataset for the first time
needing to quickly establish baseline validation
exploring potential validation rules before formalizing them
validating columns with consistent patterns (numeric ranges, categories, etc.)

Consider writing validation plans manually when you need very specific business rules, are working with sensitive data, need complex validation logic, or need to validate relationships between columns.

Recommended Workflow and Common Issues

Here’s a recommended workflow incorporating DraftValidation:

generate an initial plan with DraftValidation
review the generated validations for relevance
adjust thresholds and parameters as needed
add specific business logic and cross-column validations
store the final validation plan in version control

It’s possible that you might bump up against some issues. Here are some common ones and solutions you might try:

Authentication Errors: ensure your API key is valid and correctly passed to DraftValidation
Package Not Found: make sure you’ve installed the required packages with pip install pointblank[generate]
Unsupported Model: verify you’re using the correct provider:model format
Poor Quality Plans: try a more capable model

Conclusion

DraftValidation provides a powerful way to jumpstart your data validation process by leveraging LLMs to generate context-aware validation plans. By analyzing your data’s structure and content, DraftValidation can create comprehensive validation rules that would otherwise take significant time to develop manually.

The feature balances privacy (by sending only data summaries) with utility (by generating executable validation code). While the generated plans should always be reviewed and refined, they provide an excellent starting point for ensuring your data meets your quality requirements.

By understanding how DraftValidation works and how to customize its output, you can significantly accelerate your data validation workflows and improve the quality of your data throughout your projects.

How DraftValidation Works