Draft Validation
Draft Validation in Pointblank leverages large language models (LLMs) to automatically generate validation plans for your data. This feature is especially useful when starting validation on a new dataset or when you need to quickly establish baseline validation coverage.
The `DraftValidation` class connects to various LLM providers to analyze your data's characteristics and generate a complete validation plan tailored to its structure and content.
How `DraftValidation` Works
When you use `DraftValidation`, the process works through these steps:
- a statistical summary of your data is generated using the `DataScan` class
- this summary is converted to JSON format and sent to your selected LLM provider
- the LLM uses the summary along with knowledge about Pointblank's validation capabilities to generate a validation plan
- the result is returned as executable Python code that you can use directly or modify as needed
The entire process happens without sending your actual data to the LLM provider: only a summary is transmitted, which includes column names, data types, basic statistics, and a small sample of values.
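To make the first two steps concrete, here is a minimal sketch. The `DataScan` class is the one named above, but treating `to_json()` as the serialization step is an assumption about the internals; the remaining steps happen inside `DraftValidation` itself.

```python
import pointblank as pb

# Step 1: generate a statistical summary of the table
data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
scan = pb.DataScan(data=data)

# Step 2: serialize the summary to JSON for the LLM
# (to_json() as the mechanism is an assumption for illustration)
summary_json = scan.to_json()

# Steps 3 and 4: DraftValidation sends this summary, along with a
# prompt describing Pointblank's validation methods, to the LLM,
# which returns executable Python code for a pb.Validate(...) plan
```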
Requirements and Setup
To use the `DraftValidation` feature, you'll need:
- an API key from a supported LLM provider
- the required Python packages installed
You can install all necessary dependencies with:
```bash
pip install pointblank[generate]
```
This will install the `chatlas` package and the other dependencies required for `DraftValidation`.
Supported LLM Providers
The `DraftValidation` class supports multiple LLM providers:
- Anthropic (Claude models)
- OpenAI (GPT models)
- Ollama (local LLMs)
- Amazon Bedrock (AWS-hosted models)
Each provider has different capabilities and performance characteristics, but all can be used to generate validation plans through a consistent interface.
Basic Usage
The simplest way to use `DraftValidation` is to provide your data and specify an LLM model. Let's try it out with the `global_sales` dataset.
```python
import pointblank as pb

# Load a dataset
data = pb.load_dataset(dataset="global_sales", tbl_type="polars")

# Generate a validation plan
pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest",
    api_key="your_api_key_here"  # Replace with your actual API key
)
```
The result is a complete validation plan, returned as executable Python code:
```python
import pointblank as pb
# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
("product_id", "String"),
("product_category", "String"),
("customer_id", "String"),
("customer_segment", "String"),
("region", "String"),
("country", "String"),
("city", "String"),
("timestamp", "Datetime(time_unit='us', time_zone=None)"),
("quarter", "String"),
("month", "Int64"),
("year", "Int64"),
("price", "Float64"),
("quantity", "Int64"),
("status", "String"),
("email", "String"),
("revenue", "Float64"),
("tax", "Float64"),
("total", "Float64"),
("payment_method", "String"),
("sales_channel", "String")
])
# The validation plan
validation = (
pb.Validate(
data=your_data, # Replace your_data with the actual data variable
label="Draft Validation",
thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
)
.col_schema_match(schema=schema)
.col_vals_not_null(columns=[
"product_category", "customer_segment", "region", "country",
"price", "quantity", "status", "email", "revenue", "tax",
"total", "payment_method", "sales_channel"
])
.col_vals_between(columns="month", left=1, right=12, na_pass=True)
.col_vals_between(columns="year", left=2021, right=2023, na_pass=True)
.col_vals_gt(columns="price", value=0)
.col_vals_gt(columns="quantity", value=0)
.col_vals_gt(columns="revenue", value=0)
.col_vals_gt(columns="tax", value=0)
.col_vals_gt(columns="total", value=0)
.col_vals_in_set(columns="product_category", set=[
"Manufacturing", "Retail", "Healthcare"
])
.col_vals_in_set(columns="customer_segment", set=[
"Government", "Consumer", "SMB"
])
.col_vals_in_set(columns="region", set=[
"Asia Pacific", "Europe", "North America"
])
.col_vals_in_set(columns="status", set=[
"returned", "shipped", "delivered"
])
.col_vals_in_set(columns="payment_method", set=[
"Apple Pay", "PayPal", "Bank Transfer", "Credit Card"
])
.col_vals_in_set(columns="sales_channel", set=[
"Partner", "Distributor", "Phone"
])
.row_count_match(count=50000)
.col_count_match(count=20)
.rows_distinct()
.interrogate()
)
validation
```
Managing API Keys
While you can directly provide API keys as shown above, there are more secure approaches:
```python
import os

# Get API key from environment variable
api_key = os.getenv("ANTHROPIC_API_KEY")

draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest",
    api_key=api_key
)
```
You can also store API keys in a `.env` file in your project's root directory:
```
# Contents of .env file
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```
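This typically works by loading the `.env` file into the process environment. If you ever need to do that explicitly, the `python-dotenv` package (a separate install, assumed here; it is not stated to ship with `pointblank[generate]`) provides a one-liner:

```python
from dotenv import load_dotenv

# Read key=value pairs from .env into os.environ so that
# DraftValidation (and os.getenv) can see them
load_dotenv()
```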
If your API keys have standard names (like `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`), `DraftValidation` will automatically find and use them:
```python
# No API key needed if stored in .env with standard names
draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest"
)
```
Example Output for `nycflights`
Here's an example of a validation plan that might be generated by `DraftValidation` for the `nycflights` dataset:
```python
pb.DraftValidation(
    data=pb.load_dataset(dataset="nycflights", tbl_type="duckdb"),
    model="anthropic:claude-3-7-sonnet-latest"
)
```
```python
import pointblank as pb
# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
("year", "int64"),
("month", "int64"),
("day", "int64"),
("dep_time", "int64"),
("sched_dep_time", "int64"),
("dep_delay", "int64"),
("arr_time", "int64"),
("sched_arr_time", "int64"),
("arr_delay", "int64"),
("carrier", "string"),
("flight", "int64"),
("tailnum", "string"),
("origin", "string"),
("dest", "string"),
("air_time", "int64"),
("distance", "int64"),
("hour", "int64"),
("minute", "int64")
])
# The validation plan
validation = (
pb.Validate(
data=your_data, # Replace your_data with the actual data variable
label="Draft Validation",
thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
)
.col_schema_match(schema=schema)
.col_vals_not_null(columns=[
"year", "month", "day", "sched_dep_time", "carrier", "flight",
"origin", "dest", "distance", "hour", "minute"
])
.col_vals_between(columns="month", left=1, right=12)
.col_vals_between(columns="day", left=1, right=31)
.col_vals_between(columns="sched_dep_time", left=106, right=2359)
.col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
.col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
.col_vals_between(columns="distance", left=17, right=4983)
.col_vals_between(columns="hour", left=1, right=23)
.col_vals_between(columns="minute", left=0, right=59)
.col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
.col_count_match(count=18)
.row_count_match(count=336776)
.rows_distinct()
.interrogate()
)
validation
```
Notice how the generated plan includes:
- A schema validation with appropriate data types
- Not-null checks for required columns
- Range validations for numerical data
- Set membership checks for categorical data
- Row and column count validations
- Appropriate handling of missing values with `na_pass=True`
Working with Model Providers
Specifying Models
When using `DraftValidation`, you specify the model in the format `"provider:model_name"`:
```python
# Using Anthropic's Claude model
pb.DraftValidation(data=data, model="anthropic:claude-3-7-sonnet-latest")

# Using OpenAI's GPT model
pb.DraftValidation(data=data, model="openai:gpt-4-turbo")

# Using a local model with Ollama
pb.DraftValidation(data=data, model="ollama:llama3:latest")

# Using Amazon Bedrock
pb.DraftValidation(data=data, model="bedrock:anthropic.claude-3-sonnet-20240229-v1:0")
```
Model Performance and Privacy
Different models have different capabilities when it comes to generating validation plans:
- Anthropic Claude 3.7 Sonnet generally provides the most comprehensive and accurate validation plans
- OpenAI GPT-4 models also perform well
- Local models through Ollama can be useful for private data but may have reduced capabilities
A key advantage of `DraftValidation` is that your actual dataset is not sent to the LLM provider. Instead, only a summary is transmitted, which includes:
- the number of rows and columns
- column names and data types
- basic statistics (min, max, mean, median, missing values count)
- a small sample of values from each column (usually 5-10 values)
This approach protects your data while still providing enough context for the LLM to generate relevant validation rules.
Customizing Generated Plans
The validation plan generated by `DraftValidation` is just a starting point. You'll typically want to:
- review the generated code for correctness
- replace `your_data` with the name of a data object that actually exists in your workspace
- adjust thresholds and validation parameters
- add domain-specific validation rules
- remove any unnecessary checks
For example, you might start by capturing the text representation of your draft validation. This will give you the raw Python code that you can copy into a new code cell in your notebook or script. From there, you can customize it by modifying thresholds to match your organization’s data quality standards, adding business-specific validation rules that require domain knowledge, or removing checks that aren’t relevant to your use case. Once you’ve made your modifications, you can execute the customized validation plan as you would any other Pointblank validation.
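As a minimal sketch of that first step, assuming (as the paragraph above suggests) that the draft object's text representation is the generated code:

```python
draft = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest"
)

# Capture the generated plan as plain text, ready to paste into a
# notebook cell or script for editing
draft_code = str(draft)
print(draft_code)
```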
Under the Hood
The Generated Data Summary
To understand what the LLM works with, here’s an example of the data summary format that’s sent:
```json
{
  "table_info": {
    "rows": 336776,
    "columns": 18,
    "table_type": "duckdb"
  },
  "column_info": [
    {
      "column_name": "year",
      "column_type": "int64",
      "missing_values": 0,
      "min": 2013,
      "max": 2013,
      "mean": 2013.0,
      "median": 2013,
      "sample_values": [2013, 2013, 2013, 2013, 2013]
    },
    {
      "column_name": "month",
      "column_type": "int64",
      "missing_values": 0,
      "min": 1,
      "max": 12,
      "mean": 6.548819,
      "median": 7,
      "sample_values": [1, 1, 1, 1, 1]
    }
    // Additional columns...
  ]
}
```
The Prompt Strategy
The `DraftValidation` class uses a carefully crafted prompt that instructs the LLM to:
- use the schema information to create a `Schema` object
- include `col_vals_not_null()` for columns with no missing values
- add appropriate range validations based on min/max values
- include row and column count validations
- format the output as clean, executable Python code
The prompt also contains constraints to ensure consistent, high-quality results, such as using line breaks in long lists for readability, applying `na_pass=True` for columns with missing values, and avoiding duplicate validations.
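To illustrate the general shape of such a prompt, here is a loose sketch. It is purely illustrative, not Pointblank's actual prompt text, and it reuses the `summary_json` value from the earlier `DataScan` sketch:

```python
# Purely illustrative; not Pointblank's actual prompt
PROMPT_TEMPLATE = """
You are generating a Pointblank validation plan in Python.
Using the JSON data summary below:
- build a pb.Schema from the column names and dtypes
- add col_vals_not_null() for columns with zero missing values
- add range checks from the min/max statistics, using na_pass=True
  for columns that have missing values
- add row_count_match() and col_count_match() checks
- avoid duplicate validations; break long lists across lines
Return only executable Python code.

Data summary:
{summary_json}
"""

prompt = PROMPT_TEMPLATE.format(summary_json=summary_json)
```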
Best Practices and Troubleshooting
When to Use `DraftValidation`
Drafting a validation is most useful when:
- working with a new dataset for the first time
- needing to quickly establish baseline validation
- exploring potential validation rules before formalizing them
- validating columns with consistent patterns (numeric ranges, categories, etc.)
Consider writing validation plans manually when you need very specific business rules, are working with sensitive data, need complex validation logic, or need to validate relationships between columns.
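For instance, a cross-column rule for the `global_sales` data might assert that `total` equals `revenue` plus `tax`. Here is a sketch using `col_vals_expr()` with a Polars expression; the column relationship and the 0.01 tolerance are assumptions for illustration, so check the API reference for the exact signature:

```python
import polars as pl
import pointblank as pb

data = pb.load_dataset(dataset="global_sales", tbl_type="polars")

# A domain-specific, cross-column rule that a draft won't infer:
# each row's total should equal revenue plus tax (within a small
# floating-point tolerance)
validation = (
    pb.Validate(data=data)
    .col_vals_expr(
        expr=(pl.col("total") - (pl.col("revenue") + pl.col("tax"))).abs() <= 0.01
    )
    .interrogate()
)
```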
Recommended Workflow
A recommended workflow incorporating `DraftValidation` (sketched below):
- generate an initial plan with `DraftValidation`
- review the generated validations for relevance
- adjust thresholds and parameters as needed
- add specific business logic and cross-column validations
- store the final validation plan in version control
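A rough sketch of the first and last steps, again assuming that `str(draft)` yields the generated code and using a hypothetical file name:

```python
from pathlib import Path

import pointblank as pb

# Step 1: generate the initial draft
draft = pb.DraftValidation(
    data=data,
    model="anthropic:claude-3-7-sonnet-latest"
)

# Steps 2-4 happen by hand: review and edit the generated code.
# Step 5: save the edited plan to a file tracked in version control
Path("validate_global_sales.py").write_text(str(draft))
```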
Common Issues
- Authentication Errors: ensure your API key is valid and correctly passed to `DraftValidation`
- Package Not Found: make sure you've installed the required packages with `pip install pointblank[generate]`
- Unsupported Model: verify you're using the correct `provider:model` format
- Poor Quality Plans: try a more capable model or provide better context
Be aware that LLM providers typically have rate limits. If you’re generating many validation plans, consider using local models through Ollama, implementing rate limiting in your code, or caching generated plans for datasets that don’t change frequently.
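One simple caching pattern, a sketch under the same `str(draft)` assumption as above with a hypothetical cache file name:

```python
from pathlib import Path

import pointblank as pb

cache_file = Path("draft_global_sales.py")

if cache_file.exists():
    # Reuse the previously generated draft and skip the API call
    draft_code = cache_file.read_text()
else:
    draft = pb.DraftValidation(
        data=data,
        model="anthropic:claude-3-7-sonnet-latest"
    )
    draft_code = str(draft)
    cache_file.write_text(draft_code)
```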
Conclusion
`DraftValidation` provides a powerful way to jumpstart your data validation process by leveraging LLMs to generate context-aware validation plans. By analyzing your data's structure and content, `DraftValidation` can create comprehensive validation rules that would otherwise take significant time to develop manually.
The feature balances privacy (by sending only data summaries) with utility (by generating executable validation code). While the generated plans should always be reviewed and refined, they provide an excellent starting point for ensuring your data meets your quality requirements.
By understanding how `DraftValidation` works and how to customize its output, you can significantly accelerate your data validation workflows and improve the quality of your data throughout your projects.