DraftValidation
DraftValidation(self, data, model, api_key=None)
Draft a validation plan for a given table using an LLM.
By using a large language model (LLM) to draft a validation plan, you can quickly generate a starting point for validating a table. This can be useful when you have a new table and you want to get a sense of how to validate it (and adjustments can always be made later). The DraftValidation class uses the chatlas package to draft a validation plan for a given table using an LLM from either the "anthropic", "openai", "ollama", or "bedrock" provider. You can install all requirements for the class through an optional 'generate' install of Pointblank via pip install pointblank[generate].
The DraftValidation() class is still experimental. Please report any issues you encounter in the Pointblank issue tracker.
Parameters

data : FrameT | Any
    The data to be used for drafting a validation plan.

model : str
    The model to be used. This should be in the form of provider:model (e.g., "anthropic:claude-3-5-sonnet-latest"). Supported providers are "anthropic", "openai", "ollama", and "bedrock".

api_key : str | None = None
    The API key to be used for the model.
Returns

str
    The drafted validation plan.
Constructing the model Argument

The model= argument should be constructed using the provider and model name separated by a colon (provider:model). The provider text can be any of:

- "anthropic" (Anthropic)
- "openai" (OpenAI)
- "ollama" (Ollama)
- "bedrock" (Amazon Bedrock)
The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider’s documentation for the most up-to-date model names.
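As a quick illustration of the expected format, a provider:model string can be checked and split with a few lines of plain Python. Note that this helper is hypothetical, shown only to clarify the format; it is not part of the Pointblank API:

```python
# Hypothetical helper for splitting a "provider:model" string;
# not part of the Pointblank API, shown only to illustrate the format
def parse_model(model: str) -> tuple[str, str]:
    provider, sep, name = model.partition(":")
    if not sep or provider not in {"anthropic", "openai", "ollama", "bedrock"}:
        raise ValueError(
            f"model must be 'provider:model' with a supported provider, got {model!r}"
        )
    return provider, name

print(parse_model("anthropic:claude-3-5-sonnet-latest"))
# → ('anthropic', 'claude-3-5-sonnet-latest')
```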
Notes on Authentication

Providing a valid API key as a string in the api_key argument is adequate for getting started, but you should consider using a more secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the os module (specifically the os.getenv() function). Places to store the API key might include .bashrc, .bash_profile, .zshrc, or .zsh_profile.
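For example, with the key exported in your shell profile, it can be retrieved and passed along like this (a minimal sketch; the environment variable name shown is just a common convention):

```python
import os

# Read the API key from an environment variable set in .bashrc, .zshrc, etc.;
# the variable name here is an example, use whatever name you exported
api_key = os.getenv("ANTHROPIC_API_KEY")

# The retrieved key can then be passed to DraftValidation via api_key=api_key
# instead of hard-coding the key string in your script
print(api_key is not None)
```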
Another solution is to store one or more model provider API keys in an .env file (in the root of your project). If the API keys have the correct names (e.g., ANTHROPIC_API_KEY or OPENAI_API_KEY) then DraftValidation will automatically load the API key from the .env file and there's no need to provide the api_key argument. An .env file might look like this:
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
There's no need to have the python-dotenv package installed when using .env files in this way.
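For illustration, the way such a file is consumed can be sketched with a minimal hand-rolled parser. Pointblank handles this internally; this sketch is not its actual implementation:

```python
import os

# Minimal .env loader (illustrative sketch; Pointblank's internal loading may differ)
def load_env(path: str = ".env") -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            # don't overwrite variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```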
Notes on Data Sent to the Model Provider

The data sent to the model provider is a JSON summary of the table. This summary is generated internally by DraftValidation using the DataScan class. The summary includes the following information:
- the number of rows and columns in the table
- the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
- the column names and their types
- column-level statistics (e.g., the number of missing values, min, max, mean, and median)
- a short list of data values in each column
The JSON summary is used to provide the model with the necessary information to draft a validation plan. As such, even very large tables can be used with the DraftValidation class since the contents of the table are not sent to the model provider.
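To give a sense of what such a summary contains, the entry for a single column might look roughly like the following. The field names and values here are hypothetical and for illustration only; the actual structure produced by DataScan may differ:

```python
import json

# Hypothetical per-column summary; the real DataScan output may use different fields
column_summary = {
    "name": "dep_delay",
    "type": "int64",
    "n_missing": 8255,
    "min": -43,
    "max": 1301,
    "sample_values": [2, -4, 11, -5, 19],
}

# Only a compact JSON description like this is sent, never the table's rows
print(json.dumps(column_summary))
```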
Amazon Bedrock is a special case: security controls are in place to ensure that data is kept within the user's AWS environment. If using an Ollama model, all data is handled locally, though only a few models are capable enough to perform the task of drafting a validation plan.
Examples

Let's look at how the DraftValidation class can be used to draft a validation plan for a table. The table to be used is "nycflights", which is available via the load_dataset() function. The model to be used is "anthropic:claude-3-5-sonnet-latest" (which performs very well compared to other LLMs). The example assumes that the API key is stored in an .env file as ANTHROPIC_API_KEY.
```python
import pointblank as pb

# Load the "nycflights" dataset as a DuckDB table
nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

# Draft a validation plan for the "nycflights" table
pb.DraftValidation(data=nycflights, model="anthropic:claude-3-5-sonnet-latest")
```
The output will be a drafted validation plan for the "nycflights"
table and this will appear in the console.
```python
import pointblank as pb
# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
("year", "int64"),
("month", "int64"),
("day", "int64"),
("dep_time", "int64"),
("sched_dep_time", "int64"),
("dep_delay", "int64"),
("arr_time", "int64"),
("sched_arr_time", "int64"),
("arr_delay", "int64"),
("carrier", "string"),
("flight", "int64"),
("tailnum", "string"),
("origin", "string"),
("dest", "string"),
("air_time", "int64"),
("distance", "int64"),
("hour", "int64"),
("minute", "int64")
])
# The validation plan
validation = (
pb.Validate(
data=your_data, # Replace your_data with the actual data variable
label="Draft Validation",
thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
)
.col_schema_match(schema=schema)
.col_vals_not_null(columns=[
"year", "month", "day", "sched_dep_time", "carrier", "flight",
"origin", "dest", "distance", "hour", "minute"
])
.col_vals_between(columns="month", left=1, right=12)
.col_vals_between(columns="day", left=1, right=31)
.col_vals_between(columns="sched_dep_time", left=106, right=2359)
.col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
.col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
.col_vals_between(columns="distance", left=17, right=4983)
.col_vals_between(columns="hour", left=1, right=23)
.col_vals_between(columns="minute", left=0, right=59)
.col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
.col_count_match(count=18)
.row_count_match(count=336776)
.rows_distinct()
.interrogate()
)
validation
```
The drafted validation plan can be copied and pasted into a Python script or notebook for further use, and the generated plan can be adjusted as needed to suit the specific requirements of the table being validated. Note that the output does not know how the data was obtained, so it uses the placeholder your_data in the data= argument of the Validate class. When adapting the plan, replace this with the actual data variable.