Transforming the Target Table
While the available validation methods can do a lot for you, there are likewise plenty of things you can't easily do with them directly. What if you wanted to validate that
- string lengths in a column are less than 10 characters?
- the median of values in a column is less than the median of values in another column?
- there are at least three instances of every categorical value in a column?
These are more sophisticated validation requirements, yet such checks come up all the time in practice. Rather than expanding the library to cover every conceivable validation scenario (an effort that would produce an unwieldy and effectively endless collection of methods), we take a more elegant approach: transform the table under examination with a little preprocessing so the metric of interest is exposed, then apply the existing validation methods to the result. This compositional strategy gives us considerable power while keeping the API small and the intent of each check clear.
Central to this approach is the idea of composability. Pointblank makes it easy to safely transform the target table for a given validation via the pre=
argument. Any computed columns are available for the (short) lifetime of the validation step during interrogation. This composability means:
- we can validate on different forms of the initial dataset (e.g., validating on aggregate forms, validating on calculated columns, etc.)
- there’s no need to start an entirely new validation process for each transformed version of the data (i.e., one tabular report could be produced instead of several)
This compositional paradigm allows us to use data transformation effectively within our validation workflows, maintaining both flexibility and clarity in our data quality assessments.
Transforming Data with Lambda Functions
Now, through examples, let’s look at the process of performing the validations mentioned above. We’ll use the small_table
dataset for all of the examples. Here it is in its entirety:
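To display it yourself, loading it is a one-liner with pb.load_dataset() (this assumes Pointblank is imported as pb, which the later examples rely on as well):

```python
import pointblank as pb

# Load the `small_table` dataset as a Polars DataFrame and display it
small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")
small_table
```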
To get to grips with the basics, we'll try to validate that string lengths in the b
column are less than 10 characters. We can’t directly use the col_vals_lt()
validation method with that column because it is meant to be used with a column of numeric values. Let’s just give that method what it needs and create a column with string lengths! The target table is a Polars DataFrame so we’ll provide a lambda function that uses the Polars API to add in that numeric column:
```python
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths"
    )
    .col_vals_lt(
        columns="string_lengths",  # the generated column through `pre=`
        value=10,                  # the string length value to be less than
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars())
    )
    .interrogate()
)
```
Pointblank Validation: "String lengths" — Polars `small_table`

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 `col_vals_lt()` | `string_lengths` | `10` | (transform icon) | ✓ | 13 | 13 / 1.00 | 0 / 0.00 | — | — | — | — |
The validation was successfully constructed and we can see from the validation report table that all strings in b
had lengths less than 10 characters. Also note that the icon under the TBL
column is no longer a rightward-facing arrow, but one that is indicative of a transformation taking place.
Let’s examine the transformation approach more closely. In the previous example, we’re not directly testing the b
column itself. Instead, we’re validating the string_lengths
column that was generated by the lambda function provided to pre=
. The Polars API’s with_columns()
method does the heavy lifting, creating numerical values that represent each string’s length in the original column.
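To see that transformation on its own, here's a small sketch (outside of any validation) of what the lambda computes; the column names match those used above:

```python
# Standalone sketch of the `pre=` transformation used above
small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

small_table.with_columns(
    string_lengths=pl.col("b").str.len_chars()  # character count of each string in `b`
).select("b", "string_lengths")
```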
That transformation occurs only during interrogation and only for that validation step. Any prior or subsequent steps would normally use the as-provided small_table
. Because data transformation is isolated at the step level, you don't have to generate a separate validation plan for each form of the data: you're free to fluidly transform the target table as needed to perform validations on different representations of it.
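As a sketch of that step-level isolation, a plan can mix untransformed and transformed steps freely (the particular columns and threshold in the first step here are purely illustrative):

```python
# A minimal sketch: only the second step sees the transformed table
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_not_null(columns="b")  # runs against the original `small_table`
    .col_vals_lt(
        columns="string_lengths",
        value=10,
        # the computed column exists only for this step's interrogation
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars()),
    )
    .interrogate()
)
```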
Using Custom Functions for Preprocessing
While lambda functions work well for simple transformations, custom named functions can make your validation code more organized and reusable, especially for complex preprocessing logic. Let’s implement the same string length validation using a dedicated function:
```python
def add_string_lengths(df):
    return df.with_columns(string_lengths=pl.col("b").str.len_chars())

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(
        columns=pb.last_n(1),
        value=10,
        pre=add_string_lengths
    )
    .interrogate()
)
```
Pointblank Validation: "String lengths for column `b`." — Polars `small_table`

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 `col_vals_lt()` | `string_lengths` | `10` | (transform icon) | ✓ | 13 | 13 / 1.00 | 0 / 0.00 | — | — | — | — |
The column-generating logic was placed in the add_string_lengths()
function, which is then passed to pre=
. Notice we’re using pb.last_n(1)
in the columns
parameter. This is a convenient column selector that targets the last column in the DataFrame, which in our case is the newly created string_lengths
column. This saves us from having to explicitly write out the column name, making our code more adaptable if column names change. Despite not specifying the name directly, you’ll still see the actual column name (string_lengths
) displayed in the validation report.
Creating Parameterized Preprocessing Functions
So far we’ve used simple functions and lambdas, but sometimes you may want to create more flexible preprocessing functions that can be configured with parameters. Let’s create a reusable function that can calculate string lengths for any column:
```python
def string_length_calculator(column_name):
    """Returns a preprocessing function that calculates string lengths for the specified column."""
    def preprocessor(df):
        return df.with_columns(string_lengths=pl.col(column_name).str.len_chars())
    return preprocessor

# Validate string lengths in column b
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(
        columns=pb.last_n(1),
        value=10,
        pre=string_length_calculator(column_name="b")
    )
    .interrogate()
)
```
Pointblank Validation: "String lengths for column `b`." — Polars `small_table`

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 `col_vals_lt()` | `string_lengths` | `10` | (transform icon) | ✓ | 13 | 13 / 1.00 | 0 / 0.00 | — | — | — | — |
This pattern is called a function factory, which is a function that creates and returns another function. The outer function (string_length_calculator()
) accepts parameters that customize the behavior of the returned preprocessing function. The inner function (preprocessor()
) is what actually gets called during validation.
This approach offers several benefits as it:
- creates reusable, configurable preprocessing functions
- keeps your validation code DRY
- allows you to separate configuration from implementation
- enables easy application of the same transformation to different columns
You could extend this pattern to create even more sophisticated preprocessing functions with multiple parameters, default values, and complex logic.
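For instance, a factory with several parameters and a default might look like the following sketch (the function name, parameters, and defaults here are illustrative, not part of Pointblank's API):

```python
def column_length_calculator(column_name, new_column="string_lengths", strip_whitespace=False):
    """Returns a preprocessor that adds a string-length column, optionally trimming first."""
    def preprocessor(df):
        expr = pl.col(column_name)
        if strip_whitespace:
            expr = expr.str.strip_chars()  # ignore surrounding whitespace when counting
        return df.with_columns(**{new_column: expr.str.len_chars()})
    return preprocessor

# e.g., pre=column_length_calculator("b", strip_whitespace=True)
```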
Using Narwhals to Preprocess Many Types of DataFrames
In the previous examples we used a Polars table. You might have a situation where you perform data validation variously on Pandas and Polars DataFrames. This is where Narwhals comes in handy: it provides a single, consistent API that works across multiple DataFrame types, eliminating the need to learn and switch between different APIs depending on your data source.
Let’s obtain small_table
as a Pandas DataFrame. We’ll construct a validation step to verify that the median of column c
is greater than the median in column a
.
```python
import narwhals as nw

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="pandas"),
        tbl_name="small_table",
        label="Median comparison.",
    )
    .col_vals_gt(
        columns="c",
        value=pb.col("a"),
        pre=lambda df: nw.from_native(df).select(nw.median("c"), nw.median("a"))
    )
    .interrogate()
)
```
Pointblank Validation: "Median comparison." — Pandas `small_table`

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 `col_vals_gt()` | `c` | `a` | (transform icon) | ✓ | 1 | 1 / 1.00 | 0 / 0.00 | — | — | — | — |
The goal is to check that the median value of c
is greater than the corresponding median of column a
, which we set up through the columns=
and value=
parameters in the col_vals_gt()
method.
There's a bit to unpack here, so let's look at the lambda function first. Narwhals can translate a Pandas DataFrame to a Narwhals DataFrame with its from_native()
function. After that initial conversion, you're free to use the Narwhals API (which is modeled on a subset of the Polars API) to do the necessary data transformation. In this case, we are getting the medians of the c
and a
columns and ending up with a one-row, two-column table.
We should note that the transformed table is, perhaps surprisingly, a Narwhals DataFrame (we didn’t have to end with .to_native()
). Pointblank is able to work directly with the Narwhals DataFrame for validation purposes, which makes the workflow more concise.
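As a quick sketch of that (reusing the same dataset loader as above), the transformation run on its own yields a one-row, two-column Narwhals DataFrame without any call to to_native():

```python
# What the `pre=` lambda produces on its own: a one-row, two-column aggregate
pd_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

nw.from_native(pd_table).select(nw.median("c"), nw.median("a"))
```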
One more thing to note: Pointblank provides a convenient syntactic sugar for working with Narwhals. If you name the lambda parameter dfn
instead of df
, the system automatically applies nw.from_native()
to the input DataFrame first. This lets you write more concise code without having to explicitly convert the DataFrame to a Narwhals format.
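Under that convention, the earlier median check could be written as in this sketch (same data and methods as above; only the lambda's parameter name changes):

```python
# Naming the lambda parameter `dfn` means a Narwhals DataFrame is supplied directly
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="pandas"))
    .col_vals_gt(
        columns="c",
        value=pb.col("a"),
        pre=lambda dfn: dfn.select(nw.median("c"), nw.median("a")),
    )
    .interrogate()
)
```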
Swapping in a Totally Different DataFrame
Sometimes data validation requires looking at completely transformed versions of your data (such as aggregated summaries, pivoted views, or even reference tables). While this approach goes against the typical paradigm of validating a single target table, there are legitimate use cases where you might need to validate properties that only emerge after significant transformations.
Let's now turn to the final validation scenario: checking that there are at least three instances of every categorical value in column f
(which contains string values in the set of "low"
, "mid"
, and "high"
). This time, we’ll prepare the transformed table (transformed by Polars expressions) outside of the Pointblank code.
```python
data_original = pb.load_dataset(dataset="small_table", tbl_type="polars")
data_transformed = data_original.group_by("f").len(name="n")

data_transformed
```
| f (str) | n (u32) |
|---|---|
| "mid" | 2 |
| "high" | 6 |
| "low" | 5 |
Then, we’ll plug in the data_transformed
DataFrame with a lambda expression in pre=
:
```python
(
    pb.Validate(
        data=data_original,
        tbl_name="small_table",
        label="Category counts.",
    )
    .col_vals_ge(
        columns="n",
        value=3,
        pre=lambda x: data_transformed
    )
    .interrogate()
)
```
We can see from the validation report table that there are three test units. This corresponds to a row for each of the categorical value counts. From the report, we find that two of the three test units are passing test units (turns out there are only two instances of "mid"
in column f
).
Note that the swapped-in table can be any table type that Pointblank supports, like a Polars DataFrame (as shown here), a Pandas DataFrame, a Narwhals DataFrame, or any other compatible format. This flexibility allows you to validate properties of your data that might only be apparent after significant reshaping or aggregation.
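As a sketch of the same count check built on a Pandas aggregate instead (the groupby call is ordinary Pandas, not part of Pointblank's API):

```python
# Same idea with Pandas: build the aggregate outside, then swap it in via `pre=`
data_original_pd = pb.load_dataset(dataset="small_table", tbl_type="pandas")
counts_pd = data_original_pd.groupby("f").size().reset_index(name="n")

(
    pb.Validate(data=data_original_pd, tbl_name="small_table", label="Category counts (Pandas).")
    .col_vals_ge(columns="n", value=3, pre=lambda x: counts_pd)
    .interrogate()
)
```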
Conclusion
The preprocessing capabilities in Pointblank provide the power and flexibility for validating complex data properties beyond what’s directly possible with the standard validation methods. Through the pre=
parameter, you can:
- transform your data on-the-fly with computed columns
- generate aggregated metrics to validate statistical properties
- work seamlessly across different DataFrame types using Narwhals
- swap in completely different tables when validating properties that emerge only after transformation
By combining these preprocessing techniques with Pointblank’s validation methods, you can create comprehensive data quality checks that address virtually any validation scenario without needing an endless library of specialized validation functions. This composable approach keeps your validation code concise while allowing you to verify even the most complex data quality requirements.
Remember that preprocessing happens just for the specific validation step, keeping your validation plan organized and maintaining the integrity of your original data throughout the rest of the validation process.