CDISC Clinical Data Standards

Clinical trial data follows strict organizational standards defined by CDISC (Clinical Data Interchange Standards Consortium). These standards specify exactly which variables must appear in each dataset, what values are permitted, how dates should be formatted, and how analysis datasets trace back to their source observations. Regulatory agencies like the FDA and PMDA require CDISC-compliant data for drug submissions, making adherence to these standards mandatory for pharmaceutical organizations.

Pointblank provides native support for three major CDISC standards: SDTM (Study Data Tabulation Model) for raw collected data, ADaM (Analysis Data Model) for analysis-ready datasets, and Define-XML for the metadata documents that describe both. Whether you are preparing a regulatory submission, running quality checks on incoming CRO data, or building automated validation pipelines for clinical data warehouses, Pointblank can generate the appropriate checks directly from the standard specifications.

Prerequisites

CDISC XML parsing (Define-XML and Controlled Terminology files) requires the lxml library:

pip install lxml

Or install Pointblank with the CDISC extra:

pip install "pointblank[cdisc]"

Reading metadata from SAS Transport (.xpt) files (used in the Exporting Metadata example below) additionally requires the pyreadstat library:

pip install pyreadstat

The SDTM and ADaM domain templates are built into Pointblank and require no additional dependencies. They encode the structural requirements from the SDTM Implementation Guide 3.4 and the ADaM Implementation Guide 1.1 directly in Python, so you can validate clinical datasets without needing the original XML specification documents.

Define-XML Import

Define-XML is the CDISC standard for documenting dataset structure. It describes every variable in a submission package: its name, label, data type, length, origin, and associated controlled terminology. Pointblank can parse Define-XML 2.0 and 2.1 documents and extract this metadata into a form suitable for validation.

Importing a Define-XML File

The import_metadata() function with format="cdisc_define" reads a Define-XML file and returns a MetadataPackage containing metadata for all datasets defined in the document. So that you can run these examples without supplying your own files, Pointblank bundles a small set of sample metadata documents that you can locate with load_metadata_example():

import pointblank as pb

# Locate the bundled Define-XML example and import all of its datasets
define_path = pb.load_metadata_example("define.xml")
package = pb.import_metadata(define_path, format="cdisc_define")

# List the datasets defined in the document
for name in package.keys():
    meta = package[name]
    print(f"{name}: {meta.dataset_label} ({len(meta.variables)} variables)")

DM: Demographics (13 variables)
AE: Adverse Events (10 variables)

Each dataset in the package is a MetadataImport object with full variable-level metadata. You can access individual datasets by name and generate validation from them. Here we validate a small Demographics table against the metadata extracted from the Define-XML:

import polars as pl

# Get metadata for the Demographics domain
dm_meta = package["DM"]

# A small Demographics dataset to validate
dm_data = pl.DataFrame({
    "STUDYID": ["XYZ789"] * 4,
    "DOMAIN": ["DM"] * 4,
    "USUBJID": ["XYZ789-001", "XYZ789-002", "XYZ789-003", "XYZ789-004"],
    "SUBJID": ["001", "002", "003", "004"],
    "RFSTDTC": ["2024-01-15", "2024-01-20", "2024-02-01", "2024-02-10"],
    "RFENDTC": ["2024-06-15", "2024-06-20", "2024-07-01", "2024-07-10"],
    "SITEID": ["SITE01", "SITE01", "SITE02", "SITE02"],
    "AGE": [45, 62, 38, 55],
    "AGEU": ["YEARS"] * 4,
    "SEX": ["M", "F", "M", "F"],
    "RACE": ["WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "WHITE"],
    "ARMCD": ["DRUG", "PLACEBO", "DRUG", "PLACEBO"],
    "ARM": ["Active Drug 10mg", "Placebo", "Active Drug 10mg", "Placebo"],
})

# Generate validation for the Demographics data from the Define-XML metadata
validation = dm_meta.to_validate(data=dm_data).interrogate()
validation

Pointblank Validation
Validation: DM (cdisc_define), Polars
#	STEP	COLUMNS	VALUES	EVAL	PASS	FAIL
1	col_schema_match()	—	SCHEMA	✓	1 1.00	0 0.00
2	col_vals_not_null()	STUDYID	—	✓	4 1.00	0 0.00
3	col_vals_not_null()	DOMAIN	—	✓	4 1.00	0 0.00
4	col_vals_not_null()	USUBJID	—	✓	4 1.00	0 0.00
5	col_vals_not_null()	SUBJID	—	✓	4 1.00	0 0.00
6	col_vals_in_set()	SEX	M, F, U	✓	4 1.00	0 0.00
7	col_vals_in_set()	RACE	WHITE, BLACK OR AFRICAN AMERICAN, ASIAN, AMERICAN INDIAN OR ALASKA NATIVE, NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER	✓	4 1.00	0 0.00

Notes

Step 1 (schema_check) ✓ Schema validation passed.

Schema Comparison

TARGET			EXPECTED
#	COLUMN	DATA TYPE	#	COLUMN	MATCH	DATA TYPE	MATCH
1	STUDYID	String	1	STUDYID	✓	String	✓
2	DOMAIN	String	2	DOMAIN	✓	String	✓
3	USUBJID	String	3	USUBJID	✓	String	✓
4	SUBJID	String	4	SUBJID	✓	String	✓
5	RFSTDTC	String	5	RFSTDTC	✓	String	✓
6	RFENDTC	String	6	RFENDTC	✓	String	✓
7	SITEID	String	7	SITEID	✓	String	✓
8	AGE	Int64	8	AGE	✓	Int64	✓
9	AGEU	String	9	AGEU	✓	String	✓
10	SEX	String	10	SEX	✓	String	✓
11	RACE	String	11	RACE	✓	String	✓
12	ARMCD	String	12	ARMCD	✓	String	✓
13	ARM	String	13	ARM	✓	String	✓
Supplied Column Schema: `[('STUDYID', 'String'), ('DOMAIN', 'String'), ('USUBJID', 'String'), ('SUBJID', 'String'), ('RFSTDTC', 'String'), ('RFENDTC', 'String'), ('SITEID', 'String'), ('AGE', 'Int64'), ('AGEU', 'String'), ('SEX', 'String'), ('RACE', 'String'), ('ARMCD', 'String'), ('ARM', 'String')]`
Schema Match Settings: COMPLETE, IN ORDER, COLUMN ≠ column, DTYPE ≠ dtype, float ≠ float64

What Gets Extracted

Define-XML documents contain rich structural metadata. Pointblank extracts the following elements:

Define-XML Element	Pointblank Mapping
ItemGroupDef (dataset)	MetadataImport per dataset
ItemDef (variable)	VariableMetadata with name, label, dtype
DataType attribute	Mapped to Pointblank dtype (String, Int64, Float64, etc.)
Length attribute	`max_length` constraint on VariableMetadata
SignificantDigits	`significant_digits` on VariableMetadata
Origin (CRF, Derived, etc.)	`origin` field
CodeListRef	`codelist_ref` linking to the associated codelist
ComputationalMethod	`computational_method` for derived variables
Role/RoleCodeListOID	`cdisc_role` (Identifier, Topic, etc.)
CodeList	Codelist object with all permitted values
Mandatory="Yes"	`required=True` on VariableMetadata
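
To make the mapping above concrete, here is a minimal sketch that pulls the same kinds of attributes out of a Define-XML-style fragment using only the standard library. It is not Pointblank's parser: the fragment is hand-written and heavily trimmed, and the helper name extract_variables() and the dictionary keys it returns are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed fragment in the ODM namespace (illustrative only;
# a real Define-XML document carries many more elements and attributes).
DEFINE_FRAGMENT = """\
<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study OID="S1">
    <MetaDataVersion OID="MDV1" Name="Example">
      <ItemGroupDef OID="IG.DM" Name="DM">
        <ItemRef ItemOID="IT.DM.STUDYID" Mandatory="Yes"/>
        <ItemRef ItemOID="IT.DM.SEX" Mandatory="Yes"/>
        <ItemRef ItemOID="IT.DM.AGE" Mandatory="No"/>
      </ItemGroupDef>
      <ItemDef OID="IT.DM.STUDYID" Name="STUDYID" DataType="text" Length="12"/>
      <ItemDef OID="IT.DM.SEX" Name="SEX" DataType="text" Length="1">
        <CodeListRef CodeListOID="CL.SEX"/>
      </ItemDef>
      <ItemDef OID="IT.DM.AGE" Name="AGE" DataType="integer" Length="3"/>
      <CodeList OID="CL.SEX" Name="Sex" DataType="text">
        <CodeListItem CodedValue="M"/>
        <CodeListItem CodedValue="F"/>
        <CodeListItem CodedValue="U"/>
      </CodeList>
    </MetaDataVersion>
  </Study>
</ODM>
"""

NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}


def extract_variables(xml_text, dataset):
    """Hypothetical helper: collect variable-level metadata for one dataset."""
    root = ET.fromstring(xml_text)
    # Index ItemDef elements and codelists by OID so ItemRef/CodeListRef can be resolved
    item_defs = {d.get("OID"): d for d in root.iterfind(".//odm:ItemDef", NS)}
    codelists = {
        cl.get("OID"): [i.get("CodedValue") for i in cl.iterfind("odm:CodeListItem", NS)]
        for cl in root.iterfind(".//odm:CodeList", NS)
    }
    group = root.find(f".//odm:ItemGroupDef[@Name='{dataset}']", NS)
    variables = []
    for ref in group.iterfind("odm:ItemRef", NS):
        item = item_defs[ref.get("ItemOID")]
        cl_ref = item.find("odm:CodeListRef", NS)
        variables.append({
            "name": item.get("Name"),
            "data_type": item.get("DataType"),
            "max_length": int(item.get("Length")),
            "required": ref.get("Mandatory") == "Yes",
            "allowed": codelists[cl_ref.get("CodeListOID")] if cl_ref is not None else None,
        })
    return variables


variables = extract_variables(DEFINE_FRAGMENT, "DM")
for v in variables:
    print(v)
```

The sketch resolves each ItemRef to its ItemDef and each CodeListRef to its CodeList, which is the same linkage the table describes between variables, their required status, and their permitted values.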

Controlled Terminology from Define-XML

Define-XML documents embed the codelists that constrain variable values. When Pointblank parses a Define-XML, all codelists are extracted and linked to their respective variables. The to_validate() method then generates col_vals_in_set() checks for each variable that references a codelist:

# Inspect codelists referenced by the Demographics domain
for cl_name, codelist in dm_meta.codelists.items():
    print(f"{cl_name}: {codelist.to_set()[:5]}...")  # first 5 values
    print(f"  Extensible: {codelist.extensible}")

Sex: ['M', 'F', 'U']...
  Extensible: False
Race: ['WHITE', 'BLACK OR AFRICAN AMERICAN', 'ASIAN', 'AMERICAN INDIAN OR ALASKA NATIVE', 'NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER']...
  Extensible: False

The extensible flag records an important distinction. Non-extensible codelists require strict adherence: any value not in the codelist is a conformance issue. Extensible codelists permit sponsor-defined additions, so a value outside the published set is not necessarily an error.

Pointblank currently generates the same col_vals_in_set() check for both kinds of codelist. For extensible codelists you will typically want to relax that step to reflect that additions are allowed, for example by attaching a warning-level threshold to the step rather than treating every out-of-set value as a hard failure.
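
The distinction can be sketched in plain Python, independent of Pointblank. The function check_codelist() below is a hypothetical helper, and the route values are made up for illustration; the only idea taken from the text is that an out-of-set value is an error for a non-extensible codelist and a warning for an extensible one.

```python
def check_codelist(values, permitted, extensible):
    """Hypothetical helper: classify out-of-codelist values by severity."""
    permitted = set(permitted)
    # Nulls are ignored here; whether a value may be null is a separate check
    unexpected = sorted({v for v in values if v is not None and v not in permitted})
    if not unexpected:
        return {"status": "pass", "unexpected": []}
    # Extensible codelists allow sponsor-defined additions, so only warn
    severity = "warning" if extensible else "error"
    return {"status": severity, "unexpected": unexpected}


# Non-extensible codelist: an unknown value is a conformance issue
sex_result = check_codelist(["M", "F", "X"], ["M", "F", "U"], extensible=False)

# Extensible codelist (illustrative values): an unknown value is flagged for review
route_result = check_codelist(
    ["ORAL", "INTRANASAL SPRAY"], ["ORAL", "TOPICAL"], extensible=True
)

print(sex_result)
print(route_result)
```

In a real workflow the warning branch corresponds to reviewing the sponsor-defined term and documenting it, rather than rejecting the record.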

CDISC Controlled Terminology Import

Beyond the codelists embedded in Define-XML, CDISC publishes standalone Controlled Terminology packages as XML files. These contain the canonical value sets for concepts like SEX, RACE, ROUTE OF ADMINISTRATION, and hundreds of others. Pointblank can parse these directly.

Importing a CT package returns a MetadataPackage whose entries are keyed by codelist name. Each entry is a MetadataImport that holds the codelist itself:

# Import the bundled CDISC CT example
ct_path = pb.load_metadata_example("sdtm_ct.xml")
ct = pb.import_metadata(ct_path, format="cdisc_ct")

# List the codelists in the package
print(list(ct.keys()))

# Access a codelist by name
sex_codelist = ct["Sex"].codelists["Sex"]
print(f"SEX values: {sex_codelist.to_set()}")
print(f"Extensible: {sex_codelist.extensible}")

['Sex', 'Severity/Intensity Scale for Adverse Events', 'No Yes Response', 'Race', 'Route of Administration']
SEX values: ['F', 'M', 'U', 'UNDIFFERENTIATED']
Extensible: False

Once you have a codelist, its permitted values feed directly into a col_vals_in_set() check:

# Use the CT-derived value set in a validation
validation = (
    pb.Validate(data=dm_data)
    .col_vals_in_set(columns="SEX", set=sex_codelist.to_set())
    .interrogate()
)
validation

Pointblank Validation
Polars
#	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
1	col_vals_in_set()	SEX	F, M, U, UNDIFFERENTIATED	✓	4	4 1.00	0 0.00	—	—	—	—

Controlled Terminology packages are versioned quarterly (e.g., 2024-03-29, 2024-06-28). Referencing a specific version makes validation results reproducible. In production pipelines, pin the CT version to the one specified in your study’s Define-XML.

SDTM Domain Templates

The Study Data Tabulation Model organizes clinical trial data into domains: Demographics (DM), Adverse Events (AE), Laboratory Results (LB), Vital Signs (VS), and many others. Each domain has a defined set of required and expected variables, with specific roles, types, and length constraints.

Pointblank includes built-in templates for eight commonly used SDTM domains. These templates encode the structural requirements from the SDTM Implementation Guide 3.4 directly, so you can validate data against the standard without needing a Define-XML file.

Available Domains

import pointblank as pb
from pointblank.metadata import list_sdtm_domains, get_sdtm_domain

# List all available SDTM domain templates
domains = list_sdtm_domains()
for d in domains:
    template = get_sdtm_domain(d)
    req_count = sum(1 for v in template.variables if v.required)
    print(f"  {d}: {template.label} ({req_count} required, {len(template.variables)} total vars)")

  AE: Adverse Events (6 required, 28 total vars)
  CM: Concomitant Medications (5 required, 17 total vars)
  DM: Demographics (9 required, 26 total vars)
  DS: Disposition (6 required, 11 total vars)
  EX: Exposure (5 required, 15 total vars)
  LB: Laboratory Test Results (6 required, 26 total vars)
  MH: Medical History (5 required, 12 total vars)
  VS: Vital Signs (6 required, 19 total vars)

Each template provides the full variable specification for its domain, including which variables are required (core="Req"), expected (core="Exp"), or permissible (core="Perm") per the Implementation Guide.

Inspecting a Domain Template

You can examine the variable specifications for any domain to understand what Pointblank will check:

# Get the Demographics domain template
dm = get_sdtm_domain("DM")

print(f"Domain: {dm.domain} - {dm.label}")
print(f"Class: {dm.domain_class}")
print(f"Repeating: {dm.repeating}")
print()

# Show required variables
print("Required variables (core='Req'):")
for var in dm.variables:
    if var.required:
        ct_info = f" [CT: {var.controlled_term}]" if var.controlled_term else ""
        print(f"  {var.name:12s} {var.dtype:4s} {var.role:12s} {var.label}{ct_info}")

Domain: DM - Demographics
Class: Special Purpose
Repeating: False

Required variables (core='Req'):
  STUDYID      Char Identifier   Study Identifier
  DOMAIN       Char Identifier   Domain Abbreviation
  USUBJID      Char Identifier   Unique Subject Identifier
  SUBJID       Char Topic        Subject Identifier for the Study
  SITEID       Char Qualifier    Study Site Identifier
  SEX          Char Qualifier    Sex [CT: SEX]
  ARMCD        Char Qualifier    Planned Arm Code
  ARM          Char Qualifier    Description of Planned Arm
  COUNTRY      Char Qualifier    Country [CT: COUNTRY]

Structural Validation

The validate_sdtm_structure() function performs a quick check that a dataset contains all required variables for its domain. This is useful as a fast pre-check before running the full validation workflow:

import polars as pl
from pointblank.metadata import validate_sdtm_structure

# A minimal Demographics dataset
dm_data = pl.DataFrame({
    "STUDYID": ["STUDY01"] * 4,
    "DOMAIN": ["DM"] * 4,
    "USUBJID": ["STUDY01-001", "STUDY01-002", "STUDY01-003", "STUDY01-004"],
    "SUBJID": ["001", "002", "003", "004"],
    "RFSTDTC": ["2024-01-15", "2024-01-20", "2024-02-01", "2024-02-10"],
    "RFENDTC": ["2024-06-15", "2024-06-20", "2024-07-01", "2024-07-10"],
    "SITEID": ["SITE01", "SITE01", "SITE02", "SITE02"],
    "AGE": [45, 62, 38, 55],
    "AGEU": ["YEARS"] * 4,
    "SEX": ["M", "F", "M", "F"],
    "RACE": ["WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "WHITE"],
    "ARMCD": ["DRUG", "PLACEBO", "DRUG", "PLACEBO"],
    "ARM": ["Active Drug 10mg", "Placebo", "Active Drug 10mg", "Placebo"],
    "COUNTRY": ["USA", "USA", "GBR", "GBR"],
})

result = validate_sdtm_structure(dm_data, domain="DM")
print(f"Valid: {result['valid']}")
if result["missing_required"]:
    print(f"Missing required: {result['missing_required']}")
if result["unknown_variables"]:
    print(f"Unknown variables: {result['unknown_variables'][:5]}")

Valid: True
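
Conceptually, a structural pre-check is a set comparison between the columns present and the variables the domain requires. The following is a minimal plain-Python sketch of that idea, not Pointblank's implementation: check_required_columns() is a hypothetical helper, and the required list is copied from the DM template output shown earlier.

```python
# Required DM variables, as listed in the template output above
DM_REQUIRED = [
    "STUDYID", "DOMAIN", "USUBJID", "SUBJID", "SITEID",
    "SEX", "ARMCD", "ARM", "COUNTRY",
]


def check_required_columns(columns, required):
    """Hypothetical helper: report which required variables are absent."""
    present = set(columns)
    missing = [name for name in required if name not in present]
    return {"valid": not missing, "missing_required": missing}


# Column names of a DM table that lacks COUNTRY
columns = [
    "STUDYID", "DOMAIN", "USUBJID", "SUBJID", "RFSTDTC", "SITEID",
    "AGE", "AGEU", "SEX", "RACE", "ARMCD", "ARM",
]

result = check_required_columns(columns, DM_REQUIRED)
print(result)
```

Because the check looks only at column names, it runs in constant time with respect to the number of rows, which is what makes it suitable as a pre-check.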

Full SDTM Validation

The validate_sdtm() function generates a comprehensive validation workflow that checks far more than just structure. It produces a Validate object with checks for required variable non-nullness, DOMAIN value correctness, sequence number positivity, string length constraints, and ISO 8601 date formatting:

from pointblank.metadata import validate_sdtm

# Generate and run the full SDTM DM validation
validation = validate_sdtm(data=dm_data, domain="DM").interrogate()
validation

Pointblank Validation
SDTM DM Validation (Polars)
#	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
1	col_vals_not_null()	STUDYID	—	✓	4	4 1.00	0 0.00	—	—	—	—
2	col_vals_not_null()	DOMAIN	—	✓	4	4 1.00	0 0.00	—	—	—	—
3	col_vals_not_null()	USUBJID	—	✓	4	4 1.00	0 0.00	—	—	—	—
4	col_vals_not_null()	SUBJID	—	✓	4	4 1.00	0 0.00	—	—	—	—
5	col_vals_not_null()	SITEID	—	✓	4	4 1.00	0 0.00	—	—	—	—
6	col_vals_not_null()	SEX	—	✓	4	4 1.00	0 0.00	—	—	—	—
7	col_vals_not_null()	ARMCD	—	✓	4	4 1.00	0 0.00	—	—	—	—
8	col_vals_not_null()	ARM	—	✓	4	4 1.00	0 0.00	—	—	—	—
9	col_vals_not_null()	COUNTRY	—	✓	4	4 1.00	0 0.00	—	—	—	—
10	col_vals_in_set()	DOMAIN	DM	✓	4	4 1.00	0 0.00	—	—	—	—
11	col_vals_expr() STUDYID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
12	col_vals_expr() DOMAIN length <= 2	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
13	col_vals_expr() USUBJID length <= 40	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
14	col_vals_expr() SUBJID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
15	col_vals_expr() RFSTDTC length <= 64	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
16	col_vals_expr() RFENDTC length <= 64	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
17	col_vals_expr() SITEID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
18	col_vals_expr() AGEU length <= 10	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
19	col_vals_expr() SEX length <= 2	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
20	col_vals_expr() RACE length <= 60	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
21	col_vals_expr() ARMCD length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
22	col_vals_expr() ARM length <= 200	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
23	col_vals_expr() COUNTRY length <= 3	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
24	col_vals_regex()	RFSTDTC	^(\d{4})(-\d{2}(-\d{2}(T\d{2}(:\d{2}(:\d{2})?)?)?)?)?$	✓	4	4 1.00	0 0.00	—	—	—	—
25	col_vals_regex()	RFENDTC	^(\d{4})(-\d{2}(-\d{2}(T\d{2}(:\d{2}(:\d{2})?)?)?)?)?$	✓	4	4 1.00	0 0.00	—	—	—	—

The validation checks the following rules automatically:

Check	Description
Required variables non-null	Every variable with core="Req" must have no nulls
DOMAIN value	The DOMAIN column must contain only the expected domain code
Sequence numbers	`--SEQ` variables must be positive integers
String lengths	Character variables must not exceed their defined max length
ISO 8601 dates	All `--DTC` timing variables must match the CDISC date pattern

ISO 8601 Date Validation

CDISC uses a specific subset of ISO 8601 that allows partial dates. A date might be fully specified as 2024-03-15T10:30:00 or partially specified as just 2024-03 (year and month known, day unknown). The validation checks that all timing variables (--DTC columns like RFSTDTC, AESTDTC, LBDTC) conform to this pattern:

Valid:   2024-03-15T10:30:00  (full datetime)
Valid:   2024-03-15            (date only)
Valid:   2024-03               (year-month only)
Valid:   2024                   (year only)
Invalid: 03/15/2024            (wrong format)
Invalid: 15-Mar-2024           (wrong format)

This catches a common data quality issue where dates are entered in locale-specific formats rather than the required ISO 8601 pattern.
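
The pattern shown in the validation report can be tried directly with Python's re module. This sketch reuses that regular expression verbatim; the sample list and variable names are illustrative.

```python
import re

# The date pattern that appears in the col_vals_regex() steps of the report above
CDISC_DTC = re.compile(r"^(\d{4})(-\d{2}(-\d{2}(T\d{2}(:\d{2}(:\d{2})?)?)?)?)?$")

samples = [
    "2024-03-15T10:30:00",  # full datetime
    "2024-03-15",           # date only
    "2024-03",              # year-month only
    "2024",                 # year only
    "03/15/2024",           # locale-specific format
    "15-Mar-2024",          # locale-specific format
    "2024-13-45",           # right shape, impossible calendar date
]

results = {s: bool(CDISC_DTC.match(s)) for s in samples}
for s, ok in results.items():
    print(f"{'Valid:  ' if ok else 'Invalid:'} {s}")
```

Note the last sample: the pattern checks the shape of the string only, so a value such as 2024-13-45 still matches. Calendar validity (month 1 to 12, day within the month) needs a separate check if your pipeline requires it.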

Catching Conformance Problems

The examples so far have used clean data, so every check passes. The point of validation, of course, is to surface problems. Here is the same DM domain with three deliberate defects: a null USUBJID value (USUBJID is a required variable), a row where DOMAIN is "AE" instead of "DM", and an RFSTDTC value entered in the wrong date format:

# A Demographics dataset with three injected conformance issues
dm_data_with_issues = pl.DataFrame({
    "STUDYID": ["STUDY01"] * 4,
    "DOMAIN": ["DM", "DM", "AE", "DM"],                       # row 3: wrong domain code
    "USUBJID": ["STUDY01-001", "STUDY01-002", None, "STUDY01-004"],  # row 3: required but null
    "SUBJID": ["001", "002", "003", "004"],
    "RFSTDTC": ["2024-01-15", "03/15/2024", "2024-02-01", "2024-02-10"],  # row 2: not ISO 8601
    "SITEID": ["SITE01", "SITE01", "SITE02", "SITE02"],
    "AGE": [45, 62, 38, 55],
    "AGEU": ["YEARS"] * 4,
    "SEX": ["M", "F", "M", "F"],
    "RACE": ["WHITE", "ASIAN", "WHITE", "ASIAN"],
    "ARMCD": ["DRUG", "PLACEBO", "DRUG", "PLACEBO"],
    "ARM": ["Active Drug 10mg", "Placebo", "Active Drug 10mg", "Placebo"],
})

validation = validate_sdtm(data=dm_data_with_issues, domain="DM").interrogate()
validation

Pointblank Validation
SDTM DM Validation (Polars)
#	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
1	col_vals_not_null()	STUDYID	—	✓	4	4 1.00	0 0.00	—	—	—	—
2	col_vals_not_null()	DOMAIN	—	✓	4	4 1.00	0 0.00	—	—	—	—
3	col_vals_not_null()	USUBJID	—	✓	4	3 0.75	1 0.25	—	—	—
4	col_vals_not_null()	SUBJID	—	✓	4	4 1.00	0 0.00	—	—	—	—
5	col_vals_not_null()	SITEID	—	✓	4	4 1.00	0 0.00	—	—	—	—
6	col_vals_not_null()	SEX	—	✓	4	4 1.00	0 0.00	—	—	—	—
7	col_vals_not_null()	ARMCD	—	✓	4	4 1.00	0 0.00	—	—	—	—
8	col_vals_not_null()	ARM	—	✓	4	4 1.00	0 0.00	—	—	—	—
9	col_vals_in_set()	DOMAIN	DM	✓	4	3 0.75	1 0.25	—	—	—
10	col_vals_expr() STUDYID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
11	col_vals_expr() DOMAIN length <= 2	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
12	col_vals_expr() USUBJID length <= 40	—	COLUMN EXPR	✓	4	3 0.75	0 0.00	—	—	—	—
13	col_vals_expr() SUBJID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
14	col_vals_expr() RFSTDTC length <= 64	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
15	col_vals_expr() SITEID length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
16	col_vals_expr() AGEU length <= 10	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
17	col_vals_expr() SEX length <= 2	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
18	col_vals_expr() RACE length <= 60	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
19	col_vals_expr() ARMCD length <= 20	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
20	col_vals_expr() ARM length <= 200	—	COLUMN EXPR	✓	4	4 1.00	0 0.00	—	—	—	—
21	col_vals_regex()	RFSTDTC	^(\d{4})(-\d{2}(-\d{2}(T\d{2}(:\d{2}(:\d{2})?)?)?)?)?$	✓	4	3 0.75	1 0.25	—	—	—

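Three steps now report a failing unit: the col_vals_not_null() check on USUBJID, the col_vals_in_set() check on DOMAIN, and the ISO 8601 col_vals_regex() check on RFSTDTC. Each failing step links to the specific rows so you can trace a conformance issue back to the offending records. Note that this table also omits the RFENDTC and COUNTRY columns, and the report contains no steps for them (21 steps instead of 25). Judging from this output, validate_sdtm() appears to generate checks only for columns that are present, so a missing required variable such as COUNTRY is a problem for validate_sdtm_structure() to report rather than this workflow.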
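
To see which records are responsible, the three failing rules can be replayed by hand on the affected columns. This is a plain-Python sketch rather than Pointblank's row-extraction mechanism; failing_rows() is a hypothetical helper and the column lists are copied from the dataset above.

```python
import re

# The three affected columns from the defective DM dataset above
domain = ["DM", "DM", "AE", "DM"]
usubjid = ["STUDY01-001", "STUDY01-002", None, "STUDY01-004"]
rfstdtc = ["2024-01-15", "03/15/2024", "2024-02-01", "2024-02-10"]

ISO_DTC = re.compile(r"^(\d{4})(-\d{2}(-\d{2}(T\d{2}(:\d{2}(:\d{2})?)?)?)?)?$")


def failing_rows(values, predicate):
    """Hypothetical helper: 1-based row numbers whose value fails the predicate."""
    return [i for i, v in enumerate(values, start=1) if not predicate(v)]


failures = {
    "USUBJID not null": failing_rows(usubjid, lambda v: v is not None),
    "DOMAIN in {DM}": failing_rows(domain, lambda v: v == "DM"),
    "RFSTDTC ISO 8601": failing_rows(
        rfstdtc, lambda v: v is not None and bool(ISO_DTC.match(v))
    ),
}

for rule, rows in failures.items():
    print(f"{rule}: failing rows {rows}")
```

Each rule flags exactly one row, which matches the single failing unit reported for each of the three steps.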

Converting SDTM Templates to MetadataImport

If you prefer to work with the standard MetadataImport interface (for example, to use to_schema() or combine SDTM metadata with other sources), you can convert a domain template:

from pointblank.metadata import sdtm_to_metadata

# Convert the DM template to a MetadataImport
dm_meta = sdtm_to_metadata(domain="DM", study_id="STUDY01")

print(f"Format: {dm_meta.source_format}")
print(f"Dataset: {dm_meta.dataset_name}")
print(f"Variables: {len(dm_meta.variables)}")

# Generate a schema from it
schema = dm_meta.to_schema()
print(f"Schema columns: {len(schema.columns)}")

Format: cdisc_sdtm
Dataset: DM
Variables: 26
Schema columns: 26

ADaM Dataset Templates

The Analysis Data Model builds on top of SDTM by adding derived variables, population flags, and analysis-specific structures. ADaM datasets are the basis for statistical analyses in clinical trials, and their structure is tightly specified to ensure reproducibility and traceability back to the source data.

Pointblank includes templates for four ADaM dataset structures: ADSL (subject-level analysis), BDS (Basic Data Structure for repeated measures), ADAE (adverse events analysis), and ADTTE (time-to-event analysis).

Available ADaM Datasets

from pointblank.metadata import list_adam_datasets, get_adam_dataset

# List all available ADaM dataset templates
datasets = list_adam_datasets()
for d in datasets:
    template = get_adam_dataset(d)
    req_count = sum(1 for v in template.variables if v.required)
    flag_count = sum(1 for v in template.variables if v.is_population_flag)
    print(f"  {d}: {template.label}")
    print(f"      {req_count} required vars, {flag_count} population flags")

  ADAE: Adverse Event Analysis Dataset
      5 required vars, 0 population flags
  ADSL: Subject Level Analysis Dataset
      5 required vars, 7 population flags
  ADTTE: Time-to-Event Analysis Dataset
      8 required vars, 0 population flags
  BDS: Basic Data Structure
      5 required vars, 0 population flags

ADSL: Subject-Level Analysis

ADSL is the foundational ADaM dataset. It contains one row per subject with all the key demographic and treatment information needed for analysis. Every other ADaM dataset merges back to ADSL for population definitions.

adsl_template = get_adam_dataset("ADSL")
print(f"Dataset class: {adsl_template.dataset_class}")
print(f"\nPopulation flags:")
for var in adsl_template.variables:
    if var.is_population_flag:
        print(f"  {var.name}: {var.label}")

Dataset class: ADSL

Population flags:
  SAFFL: Safety Population Flag
  ITTFL: Intent-To-Treat Population Flag
  EFFFL: Efficacy Population Flag
  RANDFL: Randomized Population Flag
  ENRLFL: Enrolled Population Flag
  PPROTFL: Per-Protocol Population Flag
  COMPLFL: Completers Population Flag

Population flags (SAFFL, ITTFL, EFFFL, etc.) define which subjects belong to each analysis population. They must contain only the values “Y” or “N”, with no nulls. Pointblank’s ADaM validation checks this automatically.
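
The rule itself is simple enough to sketch in plain Python. The helper check_population_flags() and the sample rows below are illustrative assumptions, not part of Pointblank; the rule they encode is the one stated above (only "Y" or "N", no nulls).

```python
def check_population_flags(rows, flag_names):
    """Hypothetical helper: map each flag to the 1-based rows that violate Y/N."""
    problems = {}
    for flag in flag_names:
        # A missing key or None counts as a violation, as does any other value
        bad = [
            i for i, row in enumerate(rows, start=1)
            if row.get(flag) not in ("Y", "N")
        ]
        if bad:
            problems[flag] = bad
    return problems


# Illustrative subject-level rows with two deliberate defects
adsl_rows = [
    {"USUBJID": "STUDY01-001", "SAFFL": "Y", "ITTFL": "Y"},
    {"USUBJID": "STUDY01-002", "SAFFL": "N", "ITTFL": None},  # null flag
    {"USUBJID": "STUDY01-003", "SAFFL": "y", "ITTFL": "Y"},   # lowercase value
]

problems = check_population_flags(adsl_rows, ["SAFFL", "ITTFL"])
print(problems)
```

The comparison is case-sensitive on purpose: a lowercase "y" is a different value from "Y" and would silently drop a subject from a population filter written as SAFFL == "Y".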

Full ADaM Validation

The validate_adam() function generates comprehensive checks tailored to each dataset type:

import polars as pl
from pointblank.metadata import validate_adam

# Create a minimal ADSL dataset
adsl_data = pl.DataFrame({
    "STUDYID": ["STUDY01"] * 5,
    "USUBJID": [f"STUDY01-{i:03d}" for i in range(1, 6)],
    "SUBJID": [f"{i:03d}" for i in range(1, 6)],
    "SITEID": ["SITE01", "SITE01", "SITE02", "SITE02", "SITE01"],
    "TRT01P": ["Drug A", "Placebo", "Drug A", "Placebo", "Drug A"],
    "TRT01A": ["Drug A", "Placebo", "Drug A", "Placebo", "Drug A"],
    "AGE": [45, 62, 38, 55, 48],
    "AGEU": ["YEARS"] * 5,
    "SEX": ["M", "F", "M", "F", "M"],
    "RACE": ["WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "WHITE", "WHITE"],
    "SAFFL": ["Y", "Y", "Y", "Y", "N"],
    "ITTFL": ["Y", "Y", "Y", "Y", "Y"],
    "EFFFL": ["Y", "Y", "N", "Y", "N"],
    "TRTSDT": ["2024-01-15", "2024-01-20", "2024-02-01", "2024-02-10", "2024-02-15"],
    "TRTEDT": ["2024-06-15", "2024-06-20", "2024-07-01", "2024-07-10", "2024-07-15"],
})

# Run ADaM ADSL validation
validation = validate_adam(data=adsl_data, dataset="ADSL").interrogate()
validation

Pointblank Validation
ADaM ADSL Validation (Polars)
#	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
1	col_vals_not_null()	STUDYID	—	✓	5	5 1.00	0 0.00	—	—	—	—
2	col_vals_not_null()	USUBJID	—	✓	5	5 1.00	0 0.00	—	—	—	—
3	col_vals_not_null()	SUBJID	—	✓	5	5 1.00	0 0.00	—	—	—	—
4	col_vals_not_null()	SITEID	—	✓	5	5 1.00	0 0.00	—	—	—	—
5	col_vals_not_null()	TRT01P	—	✓	5	5 1.00	0 0.00	—	—	—	—
6	col_vals_in_set()	SAFFL	Y, N	✓	5	5 1.00	0 0.00	—	—	—	—
7	col_vals_in_set()	ITTFL	Y, N	✓	5	5 1.00	0 0.00	—	—	—	—
8	col_vals_in_set()	EFFFL	Y, N	✓	5	5 1.00	0 0.00	—	—	—	—
9	col_vals_not_null()	TRT01P	—	✓	5	5 1.00	0 0.00	—	—	—	—

ADaM Validation Checks by Dataset Type

The checks generated by validate_adam() vary depending on the dataset type. Each type has its own domain-specific rules in addition to the common required-variable and population-flag checks:

Dataset	Specific Checks
ADSL	TRT01P non-null, all population flags are Y/N
BDS	PARAMCD length at most 8 characters
ADAE	TRTEMFL is Y/N, AESEQ is positive
ADTTE	CNSR is 0 or 1, AVAL (time) is non-negative
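
As an illustration of the dataset-specific rules, the two ADTTE checks in the table can be written as a small plain-Python function. This is a sketch of the rules as stated in the table, not Pointblank's code; check_adtte_row() and the sample rows are hypothetical.

```python
def check_adtte_row(row):
    """Hypothetical helper: list the ADTTE rule violations for one record."""
    issues = []
    # CNSR is the censoring indicator and must be 0 or 1
    if row["CNSR"] not in (0, 1):
        issues.append("CNSR must be 0 or 1")
    # AVAL holds the time to event (or to censoring), so it cannot be negative
    if row["AVAL"] < 0:
        issues.append("AVAL must be non-negative")
    return issues


# Illustrative time-to-event records with two deliberate defects
adtte_rows = [
    {"USUBJID": "STUDY01-001", "PARAMCD": "OS", "AVAL": 182.0, "CNSR": 0},
    {"USUBJID": "STUDY01-002", "PARAMCD": "OS", "AVAL": -5.0, "CNSR": 1},
    {"USUBJID": "STUDY01-003", "PARAMCD": "OS", "AVAL": 90.0, "CNSR": 2},
]

report = {r["USUBJID"]: check_adtte_row(r) for r in adtte_rows}
for subject, issues in report.items():
    print(subject, issues or "ok")
```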

Here is an example validating a BDS (Basic Data Structure) dataset:

# Create a minimal BDS dataset (e.g., ADLB - laboratory analysis)
bds_data = pl.DataFrame({
    "STUDYID": ["STUDY01"] * 6,
    "USUBJID": ["STUDY01-001"] * 3 + ["STUDY01-002"] * 3,
    "PARAMCD": ["ALT", "AST", "BILI"] * 2,
    "PARAM": [
        "Alanine Aminotransferase (U/L)",
        "Aspartate Aminotransferase (U/L)",
        "Bilirubin (umol/L)",
    ] * 2,
    "AVAL": [25.0, 30.0, 12.0, 45.0, 38.0, 15.0],
    "ABLFL": ["Y", "Y", "Y", "N", "N", "N"],
    "ANL01FL": ["Y"] * 6,
    "TRTA": ["Drug A"] * 3 + ["Placebo"] * 3,
})

validation = validate_adam(data=bds_data, dataset="BDS").interrogate()
validation

Pointblank Validation
ADaM BDS Validation (Polars)
#	STEP	COLUMNS	VALUES	EVAL	UNITS	PASS	FAIL	W	E	C	EXT
1	col_vals_not_null()	STUDYID	—	✓	6	6 1.00	0 0.00	—	—	—	—
2	col_vals_not_null()	USUBJID	—	✓	6	6 1.00	0 0.00	—	—	—	—
3	col_vals_not_null()	PARAMCD	—	✓	6	6 1.00	0 0.00	—	—	—	—
4	col_vals_not_null()	PARAM	—	✓	6	6 1.00	0 0.00	—	—	—	—
5	col_vals_not_null()	AVAL	—	✓	6	6 1.00	0 0.00	—	—	—	—
6	col_vals_expr() PARAMCD length <= 8	—	COLUMN EXPR	✓	6	6 1.00	0 0.00	—	—	—	—

Structural Validation

As with SDTM, Pointblank provides a quick structural check for ADaM datasets via validate_adam_structure():

from pointblank.metadata import validate_adam_structure

result = validate_adam_structure(adsl_data, dataset="ADSL")
print(f"Valid: {result['valid']}")
print(f"Missing required: {result['missing_required']}")
print(f"Population flags present: {result.get('population_flags_present', [])}")

Valid: True
Missing required: []
Population flags present: []

Converting ADaM Templates to MetadataImport

The adam_to_metadata() function converts an ADaM template into the standard MetadataImport format, giving you access to to_schema() and to_validate():

from pointblank.metadata import adam_to_metadata

# Convert ADSL template to MetadataImport
adsl_meta = adam_to_metadata(dataset="ADSL", study_id="STUDY01")

print(f"Format: {adsl_meta.source_format}")
print(f"Version: {adsl_meta.source_version}")
print(f"Variables: {len(adsl_meta.variables)}")

# You can also use it through the import_metadata dispatcher
meta = pb.import_metadata("ADSL", format="cdisc_adam", dataset="ADSL")
print(f"Same result: {meta.dataset_name}")

Format: cdisc_adam
Version: IG 1.1
Variables: 30
Same result: ADSL

Frictionless Data Packages

While not a clinical standard, Frictionless Data Packages are widely used in open data and research contexts. They describe tabular data with JSON schemas that specify column types, constraints (minimum, maximum, enum, pattern), and primary keys. Pointblank imports them through the same import_metadata() function used for the CDISC formats.

Importing a Frictionless Schema

The bundled datapackage.json example describes a transactions table. Importing it yields a MetadataImport with one VariableMetadata per column:

# Import the bundled Frictionless Data Package example
datapackage_path = pb.load_metadata_example("datapackage.json")
meta = pb.import_metadata(datapackage_path, format="frictionless")

print(f"Dataset: {meta.dataset_name}")
for v in meta.variables:
    print(f"  {v.name:16s} {v.dtype}")

Dataset: transactions
  transaction_id   String
  customer_id      String
  amount           Float64
  quantity         Int64
  category         String
  sale_date        Date
  discount_pct     Float64
  email            String

Frictionless constraints map directly onto Pointblank validation steps:

Frictionless constraint	Pointblank step
`"required": true`	col_vals_not_null()
`"unique": true`	rows_distinct()
`"minimum": 0`	`col_vals_ge(value=0)`
`"maximum": 100`	`col_vals_le(value=100)`
`"pattern": "..."`	`col_vals_regex(pattern="...")`
`"enum": [...]`	`col_vals_in_set(set=[...])`

For the constraints listed above the mapping is direct: each one translates to a single Pointblank validation step, so these expectations carry over without loss.
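
The translation in the table can be sketched as a function that turns one field descriptor into a list of step names with arguments. This is a plain-Python illustration, not Pointblank's importer: steps_for_field() is hypothetical, the field descriptor is made up, and the argument name used for rows_distinct() is an assumption, since the table gives that step without arguments.

```python
def steps_for_field(field):
    """Hypothetical helper: map a Frictionless field's constraints to step specs."""
    name = field["name"]
    c = field.get("constraints", {})
    steps = []
    if c.get("required"):
        steps.append(("col_vals_not_null", {"columns": name}))
    if c.get("unique"):
        # Argument name assumed for illustration
        steps.append(("rows_distinct", {"columns_subset": name}))
    if "minimum" in c:
        steps.append(("col_vals_ge", {"columns": name, "value": c["minimum"]}))
    if "maximum" in c:
        steps.append(("col_vals_le", {"columns": name, "value": c["maximum"]}))
    if "pattern" in c:
        steps.append(("col_vals_regex", {"columns": name, "pattern": c["pattern"]}))
    if "enum" in c:
        steps.append(("col_vals_in_set", {"columns": name, "set": c["enum"]}))
    return steps


# A made-up field descriptor in Frictionless Table Schema form
rating_field = {
    "name": "rating",
    "type": "integer",
    "constraints": {"required": True, "minimum": 0, "maximum": 100},
}

steps = steps_for_field(rating_field)
for step_name, kwargs in steps:
    print(step_name, kwargs)
```

Note that "minimum" and "maximum" are tested with `in` rather than truthiness, so a bound of 0 is not mistaken for an absent constraint.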

A standalone Table Schema (without the surrounding data package wrapper) imports the same way with format="table_schema".

CSVW (CSV on the Web)

The W3C’s CSVW standard provides similar capabilities to Frictionless but uses JSON-LD and aligns with linked data principles. Pointblank imports CSVW metadata with the same interface:

# Import the bundled CSVW example
csvw_path = pb.load_metadata_example("weather_csvw.json")
meta = pb.import_metadata(csvw_path, format="csvw")

# CSVW column descriptors become VariableMetadata
print(f"Dataset: {meta.dataset_name} ({len(meta.variables)} variables)")
print("Columns:", ", ".join(v.name for v in meta.variables))

Dataset: weather_observations (7 variables)
Columns: station_id, timestamp, temperature_c, humidity_pct, wind_speed_kmh, precipitation_mm, condition

Exporting Metadata

Pointblank can also export validation metadata in Frictionless format. This is useful when you want to share data quality expectations with tools that understand the Frictionless ecosystem:

import json
import tempfile
from pathlib import Path

# Read metadata from the bundled SAS Transport (XPT) example
xpt_path = pb.load_metadata_example("dm.xpt")
meta = pb.import_metadata(xpt_path, format="xpt")

# Export it as a Frictionless Table Schema
out_path = Path(tempfile.gettempdir()) / "dm_table_schema.json"
pb.export_metadata(meta, out_path, format="frictionless")

# Inspect the exported schema: how many fields, and the first two definitions
schema = json.loads(out_path.read_text())
print(f"Exported {len(schema['fields'])} field definitions. First two:")
print(json.dumps(schema["fields"][:2], indent=2))

Exported 12 field definitions. First two:
[
  {
    "name": "STUDYID",
    "type": "string",
    "title": "Study Identifier",
    "constraints": {
      "maxLength": 6
    }
  },
  {
    "name": "DOMAIN",
    "type": "string",
    "title": "Domain Abbreviation",
    "constraints": {
      "maxLength": 2
    }
  }
]

The exported document contains the column definitions and constraints from the original metadata, formatted as a valid Frictionless Table Schema that other tools can consume.
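
The shape of each exported field is easy to reproduce by hand, which can help when reading or post-processing the output. The sketch below builds the same kind of field definition with the standard library; to_field() and the dtype mapping are illustrative assumptions, not Pointblank's exporter.

```python
import json

# Assumed mapping from the dtype names used in this article to Table Schema types
DTYPE_TO_FRICTIONLESS = {
    "String": "string",
    "Int64": "integer",
    "Float64": "number",
    "Date": "date",
}


def to_field(name, dtype, label=None, max_length=None):
    """Hypothetical helper: build one Table Schema field definition."""
    field = {"name": name, "type": DTYPE_TO_FRICTIONLESS[dtype]}
    if label:
        field["title"] = label
    if max_length is not None:
        field["constraints"] = {"maxLength": max_length}
    return field


table_schema = {
    "fields": [
        to_field("STUDYID", "String", label="Study Identifier", max_length=6),
        to_field("AGE", "Int64", label="Age"),
    ]
}

print(json.dumps(table_schema, indent=2))
```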

Combining Multiple Metadata Sources

In practice, clinical data validation often combines metadata from multiple sources. The Define-XML provides the authoritative variable definitions, but you might also want to check against SDTM domain rules and controlled terminology packages. Pointblank supports this by letting you compose validation workflows from different metadata sources:

from pointblank.metadata import validate_sdtm

# Load the Define-XML for variable-level constraints
package = pb.import_metadata(pb.load_metadata_example("define.xml"), format="cdisc_define")
dm_meta = package["DM"]

# Generate validation from Define-XML metadata
validation = dm_meta.to_validate(data=dm_data)

# The SDTM template adds domain-specific rules not in the Define-XML
# (ISO 8601 checks, sequence number rules, etc.)
sdtm_validation = validate_sdtm(data=dm_data, domain="DM")

# Run both and compare how many checks each contributes
define_results = validation.interrogate()
sdtm_results = sdtm_validation.interrogate()

print(f"Define-XML checks: {len(define_results.validation_info)}")
print(f"SDTM template checks: {len(sdtm_results.validation_info)}")

Define-XML checks: 7
SDTM template checks: 25

This layered approach gives you the flexibility to apply different levels of validation depending on your needs. The Define-XML checks enforce what was specifically documented for your study, while the SDTM template checks enforce the broader standard requirements that apply universally.

Null Flavors and Structured Missingness

Clinical data uses standardized HL7/CDISC null flavors to record why a value is absent (e.g., "NASK" = not asked, "UNK" = unknown, "NA" = not applicable). Pointblank ships a pre-built MissingSpec for these codes via MissingSpec.from_cdisc_null_flavors():

cdisc = pb.MissingSpec.from_cdisc_null_flavors()

print("NASK ->", cdisc.reason_for("NASK"))   # not_asked
print("UNK  ->", cdisc.reason_for("UNK"))     # unknown
print("boundary codes:", cdisc.values_for_category("boundary"))

NASK -> not_asked
UNK  -> unknown
boundary codes: ['PINF', 'NINF']

This spec can be passed to missing_vals_tbl() for a reason-by-reason breakdown, or to the col_vals_*() and dedicated missingness validation methods (col_pct_missing(), col_missing_coded(), col_missing_only_coded(), col_missing_consistent()) to validate data while accounting for the null flavor codes. See the Missing Values Reporting and Validation Methods articles for the full set of capabilities.
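
The idea behind a reason-by-reason breakdown can be shown with a few lines of plain Python. This sketch does not use MissingSpec: the code-to-reason dictionary is a hand-written subset based on the codes mentioned above (the "not_applicable" label is an assumption), and missing_breakdown() and the sample column are hypothetical.

```python
from collections import Counter

# Hand-written subset of null flavor codes; not Pointblank's built-in spec
NULL_FLAVOR_REASONS = {
    "NASK": "not_asked",
    "UNK": "unknown",
    "NA": "not_applicable",  # label assumed for illustration
}


def missing_breakdown(values):
    """Hypothetical helper: count true nulls and coded missing values by reason."""
    counts = Counter()
    for v in values:
        if v is None:
            counts["null"] += 1
        elif v in NULL_FLAVOR_REASONS:
            counts[NULL_FLAVOR_REASONS[v]] += 1
    return dict(counts)


# Illustrative column mixing real values, a true null, and coded missingness
smoking_status = ["NEVER", "UNK", None, "NASK", "FORMER", "UNK"]

breakdown = missing_breakdown(smoking_status)
print(breakdown)
```

Separating true nulls from coded values matters because a plain null count would report only one missing value here, while four of the six records carry no usable answer.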

Conclusion

CDISC data validation with Pointblank spans the main stages of clinical data quality work: parsing Define-XML documents and controlled terminology packages, then validating individual datasets against SDTM and ADaM structural rules. The built-in domain templates encode the structural requirements of the SDTM and ADaM Implementation Guides as ready-to-use validation workflows, so you can check a dataset with a single function call. For teams preparing regulatory submissions, this means catching structural issues, date format errors, and terminology violations early in the data pipeline, well before formal submission review begins.