Validate.col_schema_match

Validate.col_schema_match(
    schema,
    complete=True,
    in_order=True,
    case_sensitive_colnames=True,
    case_sensitive_dtypes=True,
    full_match_dtypes=True,
    pre=None,
    thresholds=None,
    actions=None,
    brief=None,
    active=True,
)

Do columns in the table (and their types) match a predefined schema?

The col_schema_match() method works in conjunction with an object generated by the Schema class. That class object is the expectation for the actual schema of the target table. The validation step operates over a single test unit, which is whether the schema matches that of the table (within the constraints enforced by the complete=, and in_order= options).

Parameters

schema : Schema

A Schema object that represents the expected schema of the table. This object is generated by the Schema class.

complete : bool = True

Should the schema match be complete? If True, then the target table must have all columns specified in the schema. If False, then the table can have additional columns not in the schema (i.e., the schema is a subset of the target table’s columns).

in_order : bool = True

Should the schema match be in order? If True, then the columns in the schema must appear in the same order as they do in the target table. If False, then the order of columns in the schema and the target table can differ.

case_sensitive_colnames : bool = True

Should the schema match be case-sensitive with regard to column names? If True, then the column names in the schema and the target table must match exactly. If False, then the column names are compared in a case-insensitive manner.

case_sensitive_dtypes : bool = True

Should the schema match be case-sensitive with regard to column data types? If True, then the column data types in the schema and the target table must match exactly. If False, then the column data types are compared in a case-insensitive manner.

full_match_dtypes : bool = True

Should the schema match require a full match of data types? If True, then the column data types in the schema and the target table must match exactly. If False then substring matches are allowed, so a schema data type of Int would match a target table data type of Int64.

pre : Callable | None = None

An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the Preprocessing section for more information on how to use this argument.

thresholds : int | float | bool | tuple | dict | Thresholds = None

Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in Validate(thresholds=...). The default is None, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the Thresholds section for information on how to set threshold levels.

actions : Actions | None = None

Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the Actions class should be used to define the actions.

brief : str | bool | None = None

An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like "{step}" to insert the step number, or "{auto}" to include an automatically generated brief. If True the entire brief will be automatically generated. If None (the default) then there won’t be a brief.

active : bool = True

A boolean value indicating whether the validation step should be active. Using False will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).

Returns

: Validate

The Validate object with the added validation step.

Preprocessing

The pre= argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the Validate object or used in subsequent validation steps.

Thresholds

The thresholds= parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in Validate(thresholds=...).

There are three threshold levels: ‘warning’, ‘error’, and ‘critical’. The threshold values can either be set as a proportion failing of all test units (a value between 0 to 1), or, the absolute number of failing test units (as integer that’s 1 or greater).

Thresholds can be defined using one of these input schemes:

  1. use the Thresholds class (the most direct way to create thresholds)
  2. provide a tuple of 1-3 values, where position 0 is the ‘warning’ level, position 1 is the ‘error’ level, and position 2 is the ‘critical’ level
  3. create a dictionary of 1-3 value entries; the valid keys: are ‘warning’, ‘error’, and ‘critical’
  4. a single integer/float value denoting absolute number or fraction of failing test units for the ‘warning’ level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as ‘warning’, ‘error’, or ‘critical’. All of the threshold levels don’t need to be set, you’re free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the actions= parameter).

Examples

For the examples here, we’ll use a simple Polars DataFrame with three columns (string, integer, and float). The table is shown below:

import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["apple", "banana", "cherry", "date"],
        "b": [1, 6, 3, 5],
        "c": [1.1, 2.2, 3.3, 4.4],
    }
)

pb.preview(tbl)
a
String
b
Int64
c
Float64
1 apple 1 1.1
2 banana 6 2.2
3 cherry 3 3.3
4 date 5 4.4

Let’s validate that the columns in the table match a predefined schema. A schema can be defined using the Schema class.

schema = pb.Schema(
    columns=[("a", "String"), ("b", "Int64"), ("c", "Float64")]
)

You can print the schema object to verify that the expected schema is as intended.

print(schema)
Pointblank Schema
  a: String
  b: Int64
  c: Float64

Now, we’ll use the col_schema_match() method to validate the table against the expected schema object. There is a single test unit for this validation step (whether the schema matches the table or not).

validation = (
    pb.Validate(data=tbl)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C 1
col_schema_match
col_schema_match()
SCHEMA 1 1
1.00
0
0.00

The validation table shows that the schema matches the table. The single test unit passed since the table columns and their types match the schema.