Schema validation in Pointblank allows you to verify that your data conforms to an expected structure and type specification. This is particularly useful when ensuring data consistency across systems or validating incoming data against predefined requirements.
Let’s first look at the dataset we’ll use for the first example:
import pointblank as pb
# Preview the small_table dataset we'll use throughout this guide
pb.preview(pb.load_dataset(dataset="small_table", tbl_type="polars"))
Polars · Rows: 13 · Columns: 8

|    | date_time (Datetime) | date (Date) | a (Int64) | b (String) | c (Int64) | d (Float64) | e (Boolean) | f (String) |
|----|----------------------|-------------|-----------|------------|-----------|-------------|-------------|------------|
| 1  | 2016-01-04 11:00:00  | 2016-01-04  | 2         | 1-bcd-345  | 3         | 3423.29     | True        | high       |
| 2  | 2016-01-04 00:32:00  | 2016-01-04  | 3         | 5-egh-163  | 8         | 9999.99     | True        | low        |
| 3  | 2016-01-05 13:32:00  | 2016-01-05  | 6         | 8-kdg-938  | 3         | 2343.23     | True        | high       |
| 4  | 2016-01-06 17:23:00  | 2016-01-06  | 2         | 5-jdo-903  | None      | 3892.4      | False       | mid        |
| 5  | 2016-01-09 12:36:00  | 2016-01-09  | 8         | 3-ldm-038  | 7         | 283.94      | True        | low        |
| 9  | 2016-01-20 04:30:00  | 2016-01-20  | 3         | 5-bce-642  | 9         | 837.93      | False       | high       |
| 10 | 2016-01-20 04:30:00  | 2016-01-20  | 3         | 5-bce-642  | 9         | 837.93      | False       | high       |
| 11 | 2016-01-26 20:07:00  | 2016-01-26  | 4         | 2-dmx-010  | 7         | 833.98      | True        | low        |
| 12 | 2016-01-28 02:51:00  | 2016-01-28  | 2         | 7-dmx-010  | 8         | 108.34      | False       | low        |
| 13 | 2016-01-30 11:23:00  | 2016-01-30  | 1         | 3-dka-303  | None      | 2230.09     | True        | high       |

A schema in Pointblank is created using the `Schema` class, which defines the expected structure of a table. Once created, you apply schema validation through the `col_schema_match()` validation step.
# Create a schema definition matching small_table structure
schema = pb.Schema(
columns=[
("date_time",), # Only check column name
("date",), # Only check column name
("a", "Int64"), # Check name and type
("b", "String"), # Check name and type
("c", "Int64"), # Check name and type
("d", "Float64"), # Check name and type
("e", "Boolean"), # Check name and type
("f",), # Only check column name
]
)
# Validate the small_table against the schema
small_table_validation = (
pb.Validate(
data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
label="Schema validation of `small_table`.",
)
.col_schema_match(schema=schema)
.interrogate()
)
small_table_validation
Pointblank Validation
Schema validation of `small_table` · Polars

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|------|---------|--------|-----|------|-------|------|------|---|---|---|-----|
| 1 `col_schema_match()` | — | SCHEMA | | ✓ | 1 | 1 (1.00) | 0 (0.00) | — | — | — | — |

The output shows that the validation passed. When all columns have the correct names and types specified in the schema, the validation passes with a single passing test unit. If there were discrepancies, the step would fail, but this summary output wouldn't pinpoint which columns or types caused the problem.
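For instance, a schema that deliberately declares the wrong type for one column produces a failing step. A minimal sketch, not part of the example above; here `a` is declared as `"String"` even though it is `Int64`:

# A schema with one deliberately wrong type: `a` is actually Int64
bad_schema = pb.Schema(
    columns=[
        ("date_time",),
        ("date",),
        ("a", "String"),   # wrong on purpose
        ("b", "String"),
        ("c", "Int64"),
        ("d", "Float64"),
        ("e", "Boolean"),
        ("f",),
    ]
)

bad_validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_schema_match(schema=bad_schema)
    .interrogate()
)

bad_validation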
For detailed information about validation results, use `get_step_report()`.
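For example, calling it on the validation object created above for step 1:

small_table_validation.get_step_report(i=1)
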
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | date_time | Datetime(time_unit='us', time_zone=None) | 1 | date_time | ✓ | — | |
| 2 | date | Date | 2 | date | ✓ | — | |
| 3 | a | Int64 | 3 | a | ✓ | Int64 | ✓ |
| 4 | b | String | 4 | b | ✓ | String | ✓ |
| 5 | c | Int64 | 5 | c | ✓ | Int64 | ✓ |
| 6 | d | Float64 | 6 | d | ✓ | Float64 | ✓ |
| 7 | e | Boolean | 7 | e | ✓ | Boolean | ✓ |
| 8 | f | String | 8 | f | ✓ | — | |

Supplied Column Schema: [('date_time',), ('date',), ('a', 'Int64'), ('b', 'String'), ('c', 'Int64'), ('d', 'Float64'), ('e', 'Boolean'), ('f',)]

The step report provides specific details about which columns were checked and whether they matched the schema, helping diagnose issues when validation fails.
When defining a schema, you need to specify column names and optionally their data types. By default, Pointblank enforces strict validation where:

- all schema columns must be present in the table and no extra columns are allowed
- columns must appear in the order given in the schema
- column names and data type names are matched case-sensitively
- data type names must match exactly rather than partially

The schema definition accepts column types as string representations, which vary depending on your data source:
- `"string"`: Character data (may also be `"String"`, `"varchar"`, `"character"`, etc.)
- `"integer"`: Integer values (may also be `"Int64"`, `"int"`, `"bigint"`, etc.)
- `"numeric"`: Numeric values including integers and floating-point numbers (may also be `"Float64"`, `"double"`, `"decimal"`, etc.)
- `"boolean"`: Logical values (`True`/`False`) (may also be `"Boolean"`, `"bool"`, etc.)
- `"datetime"`: Date and time values (may also be `"Datetime"`, `"timestamp"`, etc.)
- `"date"`: Date values (may also be `"Date"`, etc.)
- `"time"`: Time values

For specific database engines or DataFrame libraries, you may need to use their exact type names (like `"VARCHAR(255)"` for SQL databases or `"Int64"` for Polars integers).
To easily determine the correct type string for columns in your data, Pointblank provides two helpful functions:
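The first is `preview()`, which shows each column's name and data type in its header row. A minimal sketch, assuming a small Polars DataFrame named `sample_df` with the values previewed below:

import polars as pl
from datetime import date

# Method 1: Using `preview()` which shows column names and types in the header
sample_df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "join_date": [date(2020, 1, 1), date(2021, 3, 15), date(2022, 7, 10)]
})

pb.preview(sample_df)
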
Polars · Rows: 3 · Columns: 3

|   | id (Int64) | name (String) | join_date (Date) |
|---|------------|---------------|------------------|
| 1 | 1          | Alice         | 2020-01-01       |
| 2 | 2          | Bob           | 2021-03-15       |
| 3 | 3          | Charlie       | 2022-07-10       |

# Method 2: Using `col_summary_tbl()` which shows column types and other details
pb.col_summary_tbl(sample_df)
Polars · Rows: 3 · Columns: 3

|   | Column           | NA       | UQ       | Mean    | SD      | Min        | P5   | Q1   | Med  | Q3   | P95  | Max        | IQR  |
|---|------------------|----------|----------|---------|---------|------------|------|------|------|------|------|------------|------|
| 1 | id (Int64)       | 0 (0.00) | 3 (1.00) | 2.00    | 1.00    | 1.00       | 1.10 | 1.50 | 2.00 | 2.50 | 2.90 | 3.00       | 1.00 |
| 2 | name (String)    | 0 (0.00) | 3 (1.00) | 5.00 SL | 2.00 SL | 3 SL       | —    | —    | 5 SL | —    | —    | 7 SL       | —    |
| 3 | join_date (Date) | 0 (0.00) | 3 (1.00) | —       | —       | 2020-01-01 | —    | —    | —    | —    | —    | 2022-07-10 | —    |

These functions help you identify the exact type strings to use in your schema definitions, eliminating guesswork and ensuring compatibility with your data source.
You can create a schema in four different ways, each with its own advantages. All schema objects can be printed to display their column names and data types.
The first approach supplies `columns=` as a list of tuples. It allows for mixed validation, where some columns are checked for both name and type and others only for name:
schema_tuples = pb.Schema(
# List of tuples approach: flexible for mixed type/name checking ---
columns=[
("name", "String"), # Check name and type
("age", "Int64"), # Check name and type
("height",) # Check name only
]
)
print(schema_tuples)
Pointblank Schema
name: String
age: Int64
height: <ANY>
This is the only method that allows checking just column names for some columns while checking both names and types for others.
The second approach supplies `columns=` as a dictionary, with column names as keys and data types as values. It is often the most readable when defining a schema manually, especially for larger schemas:
schema_dict = pb.Schema(
# Dictionary approach (keys are column names, values are data types) ---
columns={
"name": "String",
"age": "Int64",
"height": "Float64",
"created_at": "Datetime"
}
)
print(schema_dict)
Pointblank Schema
name: String
age: Int64
height: Float64
created_at: Datetime
With this method, you must always provide both column names (as keys) and their types (as values).
The third approach uses keyword arguments, which gives more readable code when there are only a few columns:
schema_kwargs = pb.Schema(
# Keyword arguments approach (more readable for simple schemas) ---
name="String",
age="Int64",
height="Float64"
)
print(schema_kwargs)
Pointblank Schema
name: String
age: Int64
height: Float64
Like the dictionary method, this approach requires both column names and types.
The fourth approach extracts a schema automatically from an existing table, by passing the table to `tbl=`:
import polars as pl
# Create a sample dataframe
df = pl.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"height": [5.6, 6.0, 5.8]
})
# Extract schema from table
schema_from_table = pb.Schema(tbl=df)
print(schema_from_table)
Pointblank Schema
name: String
age: Int64
height: Float64
This is especially useful when you want to validate that future data matches the structure of a reference dataset.
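For instance, the extracted schema can be used directly in a `col_schema_match()` step against new data. A brief sketch; `new_df` is an illustrative table, not part of the example above:

# Hypothetical new data that should match the reference structure
new_df = pl.DataFrame({
    "name": ["Dana", "Eve"],
    "age": [41, 37],
    "height": [5.5, 5.9]
})

validation = (
    pb.Validate(data=new_df)
    .col_schema_match(schema=schema_from_table)
    .interrogate()
)

validation.get_step_report(i=1)
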
You can specify multiple acceptable types for a column by providing a list of types:
# Schema with multiple possible types for a column
schema_multi_types = pb.Schema(
columns={
"name": "String",
"age": ["Int64", "Float64"], # Accept either integer or float
"active": "Boolean"
}
)
print(schema_multi_types)
Pointblank Schema
name: String
age: ['Int64', 'Float64']
active: Boolean
This is useful when working with data sources that might represent the same information in different ways (e.g., integers sometimes stored as floats).
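To illustrate, the same schema accepts a table where `age` arrives as floating-point values. A small sketch; `mixed_age_df` is an illustrative table:

# Here `age` is stored as floats; the schema accepts either Int64 or Float64
mixed_age_df = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25.0, 30.0],
    "active": [True, False]
})

validation = (
    pb.Validate(data=mixed_age_df)
    .col_schema_match(schema=schema_multi_types)
    .interrogate()
)

validation.get_step_report(i=1)
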
When using `col_schema_match()`, you can customize validation behavior with several important options:
| Option | Default | Description |
|--------|---------|-------------|
| `complete` | `True` | Require exact column presence (no extra columns allowed) |
| `in_order` | `True` | Enforce column order |
| `case_sensitive_colnames` | `True` | Make column name matching case-sensitive |
| `case_sensitive_dtypes` | `True` | Make data type matching case-sensitive |
| `full_match_dtypes` | `True` | Require exact (not partial) type name matches |

By default, `col_schema_match()` requires a complete match between the schema's columns and the table's columns. You can make this more flexible:
# Create a sample table
users_table_extra = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"extra_col": ["a", "b", "c"] # Extra column not in schema
})
# Create a schema
schema = pb.Schema(
columns={"id": "Int64", "name": "String", "age": "Int64"}
)
# Validate without requiring all columns to be present
validation = (
pb.Validate(data=users_table_extra)
.col_schema_match(
schema=schema,
# Allow schema columns to be a subset ---
complete=False
)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | id | ✓ | Int64 | ✓ |
| 2 | name | String | 2 | name | ✓ | String | ✓ |
| 3 | age | Int64 | 3 | age | ✓ | Int64 | ✓ |
| 4 | extra_col | String | | | | | |

Supplied Column Schema: [('id', 'Int64'), ('name', 'String'), ('age', 'Int64')]

You can control whether column order matters in your validation:
# Create a sample table
users_table = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
})
# Create a schema
schema = pb.Schema(
columns={"name": "String", "age": "Int64", "id": "Int64"}
)
# Validate without enforcing column order
validation = (
pb.Validate(data=users_table)
.col_schema_match(
schema=schema,
# Don't enforce column order ---
in_order=False
)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 3 | id | ✓ | Int64 | ✓ |
| 2 | name | String | 1 | name | ✓ | String | ✓ |
| 3 | age | Int64 | 2 | age | ✓ | Int64 | ✓ |

Supplied Column Schema: [('name', 'String'), ('age', 'Int64'), ('id', 'Int64')]

Control whether column names and data types are case-sensitive:
# Create a schema with different case characteristics
case_schema = pb.Schema(
columns={"ID": "int64", "NAME": "string", "AGE": "int64"}
)
# Create validation with case-insensitive column names and types
validation = (
pb.Validate(data=users_table)
.col_schema_match(
schema=case_schema,
# Ignore case in column names ---
case_sensitive_colnames=False,
# Ignore case in data type names ---
case_sensitive_dtypes=False
)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN = column · DTYPE = dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | ID | ✓ | int64 | ✓ |
| 2 | name | String | 2 | NAME | ✓ | string | ✓ |
| 3 | age | Int64 | 3 | AGE | ✓ | int64 | ✓ |

Supplied Column Schema: [('ID', 'int64'), ('NAME', 'string'), ('AGE', 'int64')]

Control how strictly data types must match:
# Create schema with simplified type names
type_schema = pb.Schema(
# Using simplified type names ---
columns={"id": "int", "name": "str", "age": "int"}
)
# Allow partial type matches
validation = (
pb.Validate(data=users_table)
.col_schema_match(
schema=type_schema,
# Ignore case in data type names ---
case_sensitive_dtypes=False,
# Allow partial type name matches ---
full_match_dtypes=False
)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE = dtype · float = float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | id | ✓ | int | ✓ |
| 2 | name | String | 2 | name | ✓ | str | ✓ |
| 3 | age | Int64 | 3 | age | ✓ | int | ✓ |

Supplied Column Schema: [('id', 'int'), ('name', 'str'), ('age', 'int')]

This section explores common patterns for applying schema validation to different scenarios. Each pattern addresses specific validation needs you might encounter when working with real-world data. We'll examine the step reports (via `get_step_report()`) for these validations since they provide more detailed information about what was checked and how the validation performed, offering an intuitive way to understand the results beyond simple pass/fail indicators.
When you only care about column names but not their types:
# Create a schema with only column names
structure_schema = pb.Schema(
columns=["id", "name", "age", "extra_col"]
)
# Validate structure only
validation = (
pb.Validate(data=users_table_extra)
.col_schema_match(schema=structure_schema)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | id | ✓ | — | |
| 2 | name | String | 2 | name | ✓ | — | |
| 3 | age | Int64 | 3 | age | ✓ | — | |
| 4 | extra_col | String | 4 | extra_col | ✓ | — | |

Supplied Column Schema: [('id',), ('name',), ('age',), ('extra_col',)]

Validate types for critical columns but just presence for others:
# Mixed validation for different columns
mixed_schema = pb.Schema(
columns=[
("id", "Int64"), # Check name and type
("name", "String"), # Check name and type
("age",), # Check name only
("extra_col",) # Check name only
]
)
# Validate with mixed approach
validation = (
pb.Validate(data=users_table_extra)
.col_schema_match(schema=mixed_schema)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | id | ✓ | Int64 | ✓ |
| 2 | name | String | 2 | name | ✓ | String | ✓ |
| 3 | age | Int64 | 3 | age | ✓ | — | |
| 4 | extra_col | String | 4 | extra_col | ✓ | — | |

Supplied Column Schema: [('id', 'Int64'), ('name', 'String'), ('age',), ('extra_col',)]

As your data evolves, you might need to adapt your validation approach:
# Original schema
original_schema = pb.Schema(
columns={
"id": "Int64",
"name": "String"
}
)
# New data with additional columns
evolved_data = pl.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35], # New column
"active": [True, False, True] # New column
})
# Validate with flexible parameters
validation = (
pb.Validate(evolved_data)
.col_schema_match(
schema=original_schema,
complete=False, # Allow extra columns
in_order=False # Don't enforce order
)
.interrogate()
)
validation.get_step_report(i=1)
Report for Validation Step 1 · ✓ COLUMN SCHEMA MATCH · COMPLETE · IN ORDER · COLUMN ≠ column · DTYPE ≠ dtype · float ≠ float64

| # | TARGET COLUMN | TARGET DATA TYPE | # | EXPECTED COLUMN | MATCH | EXPECTED DATA TYPE | MATCH |
|---|---------------|------------------|---|-----------------|-------|--------------------|-------|
| 1 | id | Int64 | 1 | id | ✓ | Int64 | ✓ |
| 2 | name | String | 2 | name | ✓ | String | ✓ |
| 3 | age | Int64 | | | | | |
| 4 | active | Boolean | | | | | |

Supplied Column Schema: [('id', 'Int64'), ('name', 'String')]

Schema validation is often just one part of a comprehensive data validation strategy. You can combine schema checks with other validation steps:
# Define a schema
schema = pb.Schema(
columns={
"id": "Int64",
"name": "String",
"age": "Int64"
}
)
# Create a validation plan
validation = (
pb.Validate(
users_table,
label="User data validation",
thresholds=pb.Thresholds(warning=0.05, error=0.1)
)
# Add schema validation
.col_schema_match(schema=schema)
# Add other validation steps
.col_vals_not_null(columns="id")
.col_vals_gt(columns="age", value=26)
.interrogate()
)
validation
This approach allows you to first validate the structure of your data and then check specific business rules or constraints.
- Define schemas early: document and define expected data structures early in your data workflow.
- Choose the right creation method:
  - `columns=<dict>` for readability with many columns
  - `columns=<list of tuples>` for mixed name/type validation
  - keyword arguments (`kwargs`) for small schemas with simple column names
  - `tbl=` to extract schemas from reference datasets
- Be deliberate about strictness: choose validation parameters based on your specific needs, such as strict matching (`complete=True`) for critical data interfaces and more flexible matching (`complete=False`, `in_order=False`) for evolving datasets.
- Reuse schemas: create schema definitions that can be reused across multiple validation contexts (see the sketch after this list).
- Version control schemas: as your data evolves, maintain versions of your schemas to track changes.
- Extract schemas from reference data: when you have a 'golden' dataset that represents your ideal structure, use `Schema(tbl=reference_data)` to extract its schema.
- Consider type flexibility: use multiple types per column (e.g., `["Int64", "Float64"]`) when working with data from diverse sources.
- Combine with targeted validation: use schema validation for structural checks and add specific validation steps for business rules.

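As a brief sketch of the schema-reuse practice above, one schema object can drive validation in several contexts (reusing the `users_table` and `users_table_extra` tables defined earlier in this guide):

# One schema definition reused across two validation contexts
shared_schema = pb.Schema(
    columns={"id": "Int64", "name": "String", "age": "Int64"}
)

validation_current = (
    pb.Validate(data=users_table, label="Current users")
    .col_schema_match(schema=shared_schema)
    .interrogate()
)

validation_extended = (
    pb.Validate(data=users_table_extra, label="Users with extra columns")
    .col_schema_match(schema=shared_schema, complete=False)
    .interrogate()
)

validation_current.get_step_report(i=1)
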
Schema validation provides a powerful mechanism for ensuring your data adheres to expected structural requirements. It serves as an excellent first line of defense in your data validation strategy, verifying that the data you’re working with has the expected shape before applying more detailed business rule validations.
The `Schema` class offers multiple ways to define schemas, from manual specification with dictionaries or keyword arguments to automatic extraction from reference tables. When combined with the flexible options of `col_schema_match()`, you can implement validation approaches ranging from strict structural enforcement to more flexible, evolution-friendly checks.
By understanding the different schema creation methods and validation options, you can efficiently validate the structure of your data tables and ensure they meet your requirements before processing.