Validate.col_schema_match

Validate.col_schema_match(
    schema,
    complete=True,
    in_order=True,
    case_sensitive_colnames=True,
    case_sensitive_dtypes=True,
    full_match_dtypes=True,
    pre=None,
    thresholds=None,
    active=True,
)

Do columns in the table (and their types) match a predefined schema?

The col_schema_match() method works in conjunction with an object generated by the Schema class. That class object is the expectation for the actual schema of the target table. The validation step operates over a single test unit, which is whether the schema matches that of the table (within the constraints enforced by the complete=, and in_order= options).

Parameters

schema : Schema

A Schema object that represents the expected schema of the table. This object is generated by the Schema class.

complete : bool = True

Should the schema match be complete? If True, then the target table must have all columns specified in the schema. If False, then the table can have additional columns not in the schema (i.e., the schema is a subset of the target table’s columns).

in_order : bool = True

Should the schema match be in order? If True, then the columns in the schema must appear in the same order as they do in the target table. If False, then the order of columns in the schema and the target table can differ.

case_sensitive_colnames : bool = True

Should the schema match be case-sensitive with regard to column names? If True, then the column names in the schema and the target table must match exactly. If False, then the column names are compared in a case-insensitive manner.

case_sensitive_dtypes : bool = True

Should the schema match be case-sensitive with regard to column data types? If True, then the column data types in the schema and the target table must match exactly. If False, then the column data types are compared in a case-insensitive manner.

full_match_dtypes : bool = True

Should the schema match require a full match of data types? If True, then the column data types in the schema and the target table must match exactly. If False then substring matches are allowed, so a schema data type of Int would match a target table data type of Int64.

pre : Callable | None = None

A pre-processing function or lambda to apply to the data table for the validation step.

thresholds : int | float | bool | tuple | dict | Thresholds = None

Failure threshold levels so that the validation step can react accordingly when exceeding the set levels for different states (warn, stop, and notify). This can be created simply as an integer or float denoting the absolute number or fraction of failing test units for the ‘warn’ level. Otherwise, you can use a tuple of 1-3 values, a dictionary of 1-3 entries, or a Thresholds object.

active : bool = True

A boolean value indicating whether the validation step should be active. Using False will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).

Returns

: Validate

The Validate object with the added validation step.

Examples

For the examples here, we’ll use a simple Polars DataFrame with three columns (string, integer, and float). The table is shown below:

import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["apple", "banana", "cherry", "date"],
        "b": [1, 6, 3, 5],
        "c": [1.1, 2.2, 3.3, 4.4],
    }
)

pb.preview(tbl)
a
String
b
Int64
c
Float64
1 apple 1 1.1
2 banana 6 2.2
3 cherry 3 3.3
4 date 5 4.4

Let’s validate that the columns in the table match a predefined schema. A schema can be defined using the Schema class.

schema = pb.Schema(
    columns=[("a", "String"), ("b", "Int64"), ("c", "Float64")]
)

You can print the schema object to verify that the expected schema is as intended.

print(schema)
Pointblank Schema
  a: String
  b: Int64
  c: Float64

Now, we’ll use the col_schema_match() method to validate the table against the expected schema object. There is a single test unit for this validation step (whether the schema matches the table or not).

validation = (
    pb.Validate(data=tbl)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W S N EXT
#4CA64C 1
col_schema_match
col_schema_match()
SCHEMA 1 1
1.00
0
0.00

The validation table shows that the schema matches the table. The single test unit passed since the table columns and their types match the schema.