import pointblank as pb
import polars as pl
tbl = pl.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": [2, 2, 2, 2, 2],
}
)
pb.preview(tbl)col_sd_lt()method
Does the column standard deviation satisfy a less than comparison?
USAGE
Validate.col_sd_lt(
columns,
value=None,
tol=0,
thresholds=None,
brief=False,
actions=None,
active=True,
)The col_sd_lt() validation method checks whether the standard deviation of values in a column is less than a specified value=. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is standard deviation(column) < value.
Unlike row-level validations (e.g., col_vals_gt()), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.
Parameters
columns :_PBUnresolvedColumn-
A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed.
value :float|Column|ReferenceColumn| None = None-
The value to compare the column standard deviation against. This can be: (1) a numeric literal (
intorfloat), (2) acol()object referencing another column whose standard deviation will be used for comparison, (3) aref()object referencing a column in reference data (whenValidate(reference=)has been set), or (4)Noneto automatically compare against the same column in reference data (shorthand forref(column_name)when reference data is set). tol :Tolerance= 0-
A tolerance value for the comparison. The default is
0, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, withtol=0.5, a standard deviation that differs from the target by up to0.5will still pass. Thetol=parameter expands the acceptable range for the comparison. Forcol_sd_lt(), a tolerance oftol=0.5would mean the standard deviation can be within0.5of the target value and still pass validation. thresholds :float|bool|tuple|dict| Thresholds | None = None-
Failure threshold levels so that the validation step can react accordingly when failing test units are level. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g.,
1) to indicate pass/fail, or as proportions where any value less than1.0means failure is acceptable. brief :str|bool= False-
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like
"{step}"to insert the step number, or"{auto}"to include an automatically generated brief. IfTruethe entire brief will be automatically generated. IfNone(the default) then there won’t be a brief. actions : Actions | None = None-
Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the
Actionsclass should be used to define the actions. active :bool= True-
A boolean value indicating whether the validation step should be active. Using
Falsewill make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).
Returns
Using Reference Data
The col_sd_lt() method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines.
To use reference data, set the reference= parameter when creating the Validate object:
validation = (
pb.Validate(data=current_data, reference=baseline_data)
.col_sd_lt(columns="revenue") # Compares sum(current.revenue) vs sum(baseline.revenue)
.interrogate()
)When value=None and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the ref() helper:
.col_sd_lt(columns="revenue", value=pb.ref("baseline_revenue"))Understanding Tolerance
The tol= parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable.
The tol= parameter expands the acceptable range for the comparison. For col_sd_lt(), a tolerance of tol=0.5 would mean the standard deviation can be within 0.5 of the target value and still pass validation.
For equality comparisons (col_*_eq), the tolerance creates a range [value - tol, value + tol] within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.
Thresholds
The thresholds= parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in Validate(thresholds=...).
There are three threshold levels: ‘warning’, ‘error’, and ‘critical’. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:
thresholds=1means any failure triggers a ‘warning’thresholds=(1, 1, 1)means any failure triggers all three levels
Thresholds can be defined using one of these input schemes:
- use the
Thresholdsclass (the most direct way to create thresholds) - provide a tuple of 1-3 values, where position
0is the ‘warning’ level, position1is the ‘error’ level, and position2is the ‘critical’ level - create a dictionary of 1-3 value entries; the valid keys: are ‘warning’, ‘error’, and ‘critical’
- a single integer/float value denoting absolute number or fraction of failing test units for the ‘warning’ level only
Examples
For the examples, we’ll use a simple Polars DataFrame with numeric columns. The table is shown below:
Let’s validate that the standard deviation of column a is less than 2:
validation = (
pb.Validate(data=tbl)
.col_sd_lt(columns="a", value=2)
.interrogate()
)
validation| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #4CA64C | 1 |
col_sd_lt()
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column.
When validating multiple columns, each column gets its own validation step:
validation = (
pb.Validate(data=tbl)
.col_sd_lt(columns=["a", "b"], value=2)
.interrogate()
)
validation| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #4CA64C | 1 |
col_sd_lt()
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — | |||
| #4CA64C | 2 |
col_sd_lt()
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
Using tolerance for flexible comparisons:
validation = (
pb.Validate(data=tbl)
.col_sd_lt(columns="a", value=2, tol=1.0)
.interrogate()
)
validation| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #4CA64C | 1 |
col_sd_lt()
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |