DraftValidation

DraftValidation(self, data, model, api_key=None)

Draft a validation plan for a given table using an LLM.

By using a large language model (LLM) to draft a validation plan, you can quickly generate a starting point for validating a table. This can be useful when you have a new table and you want to get a sense of how to validate it (and adjustments could always be made later). The DraftValidation class uses the chatlas package to draft a validation plan for a given table using an LLM from either the "anthropic", "openai", "ollama" or "bedrock" provider. You can install all requirements for the class through an optional ‘generate’ install of Pointblank via pip install pointblank[generate].

Warning

The DraftValidation() class is still experimental. Please report any issues you encounter in the Pointblank issue tracker.

Parameters

data : FrameT | Any

The data to be used for drafting a validation plan.

model : str

The model to be used. This should be in the form of provider:model (e.g., "anthropic:claude-3-5-sonnet-latest"). Supported providers are "anthropic", "openai", "ollama", and "bedrock".

api_key : str | None = None

The API key to be used for the model.

Returns

: str

The drafted validation plan.

Constructing the model Argument

The model= argument should be constructed using the provider and model name separated by a colon (provider:model). The provider text can any of:

  • "anthropic" (Anthropic)
  • "openai" (OpenAI)
  • "ollama" (Ollama)
  • "bedrock" (Amazon Bedrock)

The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider’s documentation for the most up-to-date model names.

Notes on Authentication

Providing a valid API key as a string in the api_key argument is adequate for getting started but you should consider using a more secure method for handling API keys.

One way to do this is to load the API key from an environent variable and retrieve it using the os module (specifically the os.getenv() function). Places to store the API key might include .bashrc, .bash_profile, .zshrc, or .zsh_profile.

Another solution is to store one or more model provider API keys in an .env file (in the root of your project). If the API keys have correct names (e.g., ANTHROPIC_API_KEY or OPENAI_API_KEY) then DraftValidation will automatically load the API key from the .env file and there’s no need to provide the api_key argument. An .env file might look like this:

ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"

There’s no need to have the python-dotenv package installed when using .env files in this way.

Notes on Data Sent to the Model Provider

The data sent to the model provider is a JSON summary of the table. This data summary is generated internally by DraftValidation using the DataScan class. The summary includes the following information:

  • the number of rows and columns in the table
  • the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
  • the column names and their types
  • column level statistics such as the number of missing values, min, max, mean, and median, etc.
  • a short list of data values in each column

The JSON summary is used to provide the model with the necessary information to draft a validation plan. As such, even very large tables can be used with the DraftValidation class since the contents of the table are not sent to the model provider.

The Amazon Bedrock is a special case since it is a self-hosted model and security controls are in place to ensure that data is kept within the user’s AWS environment. If using an Ollama model all data is handled locally, though only a few models are capable enough to perform the task of drafting a validation plan.

Examples

Let’s look at how the DraftValidation class can be used to draft a validation plan for a table. The table to be used is "nycflights", which is available here via the load_dataset() function. The model to be used is "anthropic:claude-3-5-sonnet-latest" (which performs very well compared to other LLMs). The example assumes that the API key is stored in an .env file as ANTHROPIC_API_KEY.

import pointblank as pb

# Load the "nycflights" dataset as a DuckDB table
data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

# Draft a validation plan for the "nycflights" table
pb.DraftValidation(data=nycflights, model="anthropic:claude-3-5-sonnet-latest")

The output will be a drafted validation plan for the "nycflights" table and this will appear in the console.

```python
import pointblank as pb

# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
    ("year", "int64"),
    ("month", "int64"),
    ("day", "int64"),
    ("dep_time", "int64"),
    ("sched_dep_time", "int64"),
    ("dep_delay", "int64"),
    ("arr_time", "int64"),
    ("sched_arr_time", "int64"),
    ("arr_delay", "int64"),
    ("carrier", "string"),
    ("flight", "int64"),
    ("tailnum", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("air_time", "int64"),
    ("distance", "int64"),
    ("hour", "int64"),
    ("minute", "int64")
])

# The validation plan
validation = (
    pb.Validate(
        data=your_data,  # Replace your_data with the actual data variable
        label="Draft Validation",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
    )
    .col_schema_match(schema=schema)
    .col_vals_not_null(columns=[
        "year", "month", "day", "sched_dep_time", "carrier", "flight",
        "origin", "dest", "distance", "hour", "minute"
    ])
    .col_vals_between(columns="month", left=1, right=12)
    .col_vals_between(columns="day", left=1, right=31)
    .col_vals_between(columns="sched_dep_time", left=106, right=2359)
    .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
    .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
    .col_vals_between(columns="distance", left=17, right=4983)
    .col_vals_between(columns="hour", left=1, right=23)
    .col_vals_between(columns="minute", left=0, right=59)
    .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
    .col_count_match(count=18)
    .row_count_match(count=336776)
    .rows_distinct()
    .interrogate()
)

validation
```

The drafted validation plan can be copied and pasted into a Python script or notebook for further use. In other words, the generated plan can be adjusted as needed to suit the specific requirements of the table being validated.

Note that the output does not know how the data was obtained, so it uses the placeholder your_data in the data= argument of the Validate class. When adapted for use, this should be replaced with the actual data variable.