Preprocessing

While the validation methods available can do a lot for you, there’s likewise a lot you can’t easily do with them directly. What if you wanted to validate:

- that string lengths in a text column stay below some limit?
- that the median of one column is greater than the median of another?
- that every categorical value in a column appears at least a minimum number of times?

These are more complicated types of validations, yet checks of this type are very commonplace. We don’t need an enormous library of validation methods to tackle each and every case; the number of possible combinations is far too large. Instead, let’s transform the table in a preprocessing step so that the key values are exposed. In conjunction with that sort of table mutation, we can then use the existing library of validation methods.

Central to this approach is the idea of composability. Pointblank makes it easy to safely transform the input table in a given validation step via the pre= argument. Any computed columns are available for the (short) lifetime of the validation step during interrogation.
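To make the shape of this concrete before we get to the real examples, here’s a minimal sketch (the tiny table, the x_sqrt column name, and the transformation are all placeholders we’ve made up for illustration; they aren’t part of the examples that follow):

import pointblank as pb
import polars as pl

# hypothetical input: a single numeric column `x`
tbl = pl.DataFrame({"x": [1.0, 4.0, 9.0]})

(
    pb.Validate(data=tbl)
    .col_vals_lt(
        columns="x_sqrt",                                            # exists only after `pre=` runs
        value=5,
        pre=lambda df: df.with_columns(x_sqrt=pl.col("x").sqrt())    # step-local transformation
    )
    .interrogate()
)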

Now, through a series of examples, let’s look at the process of performing the validations mentioned above. We’ll use the small_table dataset for all of the examples. Here it is in its entirety:

Polars DataFrame: 13 rows, 8 columns. Column types: date_time (Datetime), date (Date), a (Int64), b (String), c (Int64), d (Float64), e (Boolean), f (String).

     date_time            date        a  b          c     d        e      f
  1  2016-01-04 11:00:00  2016-01-04  2  1-bcd-345  3     3423.29  True   high
  2  2016-01-04 00:32:00  2016-01-04  3  5-egh-163  8     9999.99  True   low
  3  2016-01-05 13:32:00  2016-01-05  6  8-kdg-938  3     2343.23  True   high
  4  2016-01-06 17:23:00  2016-01-06  2  5-jdo-903  None  3892.4   False  mid
  5  2016-01-09 12:36:00  2016-01-09  8  3-ldm-038  7     283.94   True   low
  6  2016-01-11 06:15:00  2016-01-11  4  2-dhe-923  4     3291.03  True   mid
  7  2016-01-15 18:46:00  2016-01-15  7  1-knw-093  3     843.34   True   high
  8  2016-01-17 11:27:00  2016-01-17  4  5-boe-639  2     1035.64  False  low
  9  2016-01-20 04:30:00  2016-01-20  3  5-bce-642  9     837.93   False  high
 10  2016-01-20 04:30:00  2016-01-20  3  5-bce-642  9     837.93   False  high
 11  2016-01-26 20:07:00  2016-01-26  4  2-dmx-010  7     833.98   True   low
 12  2016-01-28 02:51:00  2016-01-28  2  7-dmx-010  8     108.34   False  low
 13  2016-01-30 11:23:00  2016-01-30  1  3-dka-303  None  2230.09  True   high

The Basics of Preprocessing the Input Table

To get to grips with the basics, we’ll try to validate that string lengths in the b column are less than 10 characters. We can’t use the col_vals_lt() validation method directly with that column because it is meant to be used with a column of numeric values. So let’s give that method what it needs and create a column of string lengths! The target table is a Polars DataFrame, so we’ll provide a lambda function that uses the Polars API to add that numeric column:

import pointblank as pb
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset("small_table"),
        tbl_name="small_table",
        label="String lengths"
    )
    .col_vals_lt(
        columns="string_lengths",  # the generated column through `pre=`
        value=10,                  # the string length value to be less than
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars())
    )
    .interrogate()
)
Validation report (“String lengths”, Polars small_table):
STEP 1, col_vals_lt(), COLUMNS string_lengths, VALUES 10, UNITS 13, PASS 13 (1.00), FAIL 0 (0.00)

The validation was successfully constructed and we can see from the validation report table that all strings in b had lengths less than 10 characters. Also note that the icon under the TBL column is no longer a rightward-facing arrow, but one that is indicative of transformation.

Looking at the code, we’re not directly testing the b column. Instead, the test is on the string_lengths column generated by the lambda provided to pre=. Polars does the transformation work here, producing the numeric string-length values in the computed column.

That transformation occurs only during interrogation and only for that validation step. Any prior or subsequent steps use small_table as provided. Because data transformation is isolated at the step level, you don’t have to build separate validation plans for each form of the data; you’re free to fluidly transform the target table as necessary to perform validations on different representations of it.
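As a small sketch of that isolation (the second step, checking that d is greater than 0, is just an illustrative addition of ours):

import pointblank as pb
import polars as pl

(
    pb.Validate(data=pb.load_dataset("small_table"))
    # step 1: validates a computed column; the transformation lives only within this step
    .col_vals_lt(
        columns="string_lengths",
        value=10,
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars())
    )
    # step 2: sees the original small_table as provided (no `string_lengths` column here)
    .col_vals_gt(columns="d", value=0)
    .interrogate()
)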

Aside from using a lambda, you can pass in a custom function. Just make sure not to call it when supplying it to pre= (pass the function itself, not its result; everything is stored lazily until interrogation time). Here’s an example of that approach:

def add_string_lengths(df):
    # add a numeric column holding the character count of each value in `b`
    return df.with_columns(string_lengths=pl.col("b").str.len_chars())

(
    pb.Validate(
        data=pb.load_dataset("small_table"),
        tbl_name="small_table",
        label="String lengths"
    )
    .col_vals_lt(columns=pb.last_n(1), value=10, pre=add_string_lengths)
    .interrogate()
)
Validation report (“String lengths”, Polars small_table):
STEP 1, col_vals_lt(), COLUMNS string_lengths, VALUES 10, UNITS 13, PASS 13 (1.00), FAIL 0 (0.00)

The column-generating logic was placed in the add_string_lengths() function, which is then passed to pre=. We also know that the generated column will be the last one in the table, so the last_n(1) column selector obviates the need to provide "string_lengths" here (you’ll still find the target column name echoed in the validation report table).

Using Narwhals to Preprocess Many Types of DataFrames

In the previous example we used a Polars table (load_dataset() returns a Polars DataFrame by default). You might have a situation where you perform data validation variously on Pandas and Polars DataFrames. This is where Narwhals becomes handy.

Let’s obtain small_table as a Pandas DataFrame. We’ll construct a validation step to verify that the median of column c is greater than the median in column a.

import narwhals as nw

(
    pb.Validate(
        data=pb.load_dataset("small_table", tbl_type="pandas"),
        tbl_name="small_table",
        label="Median comparison",
    )
    .col_vals_gt(
        columns="c",
        value=pb.col("a"),
        pre=lambda df: nw.from_native(df).select(nw.median("c"), nw.median("a"))
    )
    .interrogate()
)
Validation report (“Median comparison”, Pandas small_table):
STEP 1, col_vals_gt(), COLUMNS c, VALUES a, UNITS 1, PASS 1 (1.00), FAIL 0 (0.00)

There’s a bit to unpack here, so let’s look at the lambda function first. Narwhals can translate a Pandas DataFrame to a Narwhals DataFrame with its from_native() function. After that initial step, you’re free to use the Narwhals API (which is modeled on a subset of the Polars API) to do the necessary data transformation. In this case, we take the medians of the c and a columns, ending up with a one-row, two-column table.
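If you’d like to see that intermediate table for yourself, the same transformation can be run standalone, outside of Pointblank (shown purely for inspection):

import narwhals as nw
import pointblank as pb

pandas_tbl = pb.load_dataset("small_table", tbl_type="pandas")

# the same transformation used in `pre=` above, run on its own
medians = nw.from_native(pandas_tbl).select(nw.median("c"), nw.median("a"))

print(medians.to_native())  # a single row holding the medians of `c` and `a`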

The goal is to check that the median value of c is greater than the corresponding median of column a, so we set up columns= and value= parameters in that way within the col_vals_gt() validation method call.
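Because the Narwhals code doesn’t care which backend it is handed, you could factor that lambda into a function and reuse it whether the target table happens to be Pandas or Polars. A sketch of that idea (the function name medians_c_a is our own):

import narwhals as nw
import pointblank as pb

def medians_c_a(df):
    # works for any DataFrame type Narwhals understands (Pandas, Polars, etc.)
    return nw.from_native(df).select(nw.median("c"), nw.median("a"))

for tbl_type in ("pandas", "polars"):
    (
        pb.Validate(data=pb.load_dataset("small_table", tbl_type=tbl_type))
        .col_vals_gt(columns="c", value=pb.col("a"), pre=medians_c_a)
        .interrogate()
    )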

Swapping in Different DataFrames Just in Time

Let’s now prepare the final validation scenario: checking that there are at least three instances of every categorical value in column f (which contains string values in the set of "low", "mid", and "high"). This time, we’ll prepare the transformed table (built with Polars expressions) outside of the Pointblank code. Then, we’ll plug the data_transformed DataFrame in through a lambda expression given to pre=:

data_original = pb.load_dataset("small_table")
data_transformed = data_original.group_by("f").len(name="n")

v = (
    pb.Validate(
        data=pb.load_dataset("small_table"),
        tbl_name="small_table",
        label="Category counts",
    )
    .col_vals_ge(columns="n", value=3, pre=lambda x: data_transformed)
    .interrogate()
)

We can see from the validation report table that there are three test units, one for each row of categorical value counts. From the report, we find that two of the three test units pass (it turns out there are only two instances of "mid" in column f).
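To see which category falls short, you can simply inspect the transformed table that was swapped in (the counts in the comment are what we’d expect from the 13 rows shown earlier):

print(data_transformed.sort("n"))
# expected counts: mid = 2, low = 5, high = 6;
# only "mid" falls below the threshold of 3, giving the single failing test unit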