Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.
Note
Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.
Quick Start
Generate test data using a schema with field constraints:
```python
import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```
Polars · Rows 100 · Columns 5

| | user_id (Int64) | name (String) | email (String) | age (Int64) | status (String) |
|---|---|---|---|---|---|
| 1 | 7188536481533917197 | Vivienne Rios | vivienne.rios@gmail.com | 77 | pending |
| 2 | 2674009078779859984 | William Schaefer | williamschaefer@aol.com | 67 | active |
| 3 | 7652102777077138151 | Lily Hansen | lilyhansen@hotmail.com | 78 | active |
| 4 | 157503859921753049 | Shirley Mays | shirley.mays27@aol.com | 36 | inactive |
| 5 | 2829213282471975080 | Sean Dawson | sean.dawson29@aol.com | 75 | pending |
| 96 | 7027508096731143831 | Kathryn Green | kathryn.green@hotmail.com | 55 | active |
| 97 | 6055996548456656575 | Daniel Morris | dmorris@yahoo.com | 39 | inactive |
| 98 | 3822709996092631588 | William Cooper | williamcooper@protonmail.com | 24 | inactive |
| 99 | 1522653102058131295 | Lane Sawyer | l_sawyer@zoho.com | 41 | active |
| 100 | 5690877051669225499 | Paisley Sandoval | paisley_sandoval@gmail.com | 75 | pending |
Field Types
Pointblank provides helper functions for defining typed columns with constraints:
Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
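To make the uniform-distribution behavior concrete, here is a minimal standard-library sketch of seeded uniform sampling over a range. It illustrates the idea only and is not Pointblank's implementation; the function name `sample_uniform_floats` is hypothetical.

```python
import random


def sample_uniform_floats(min_val, max_val, n, seed=None):
    """Draw n floats uniformly from [min_val, max_val] with a seeded RNG."""
    rng = random.Random(seed)  # a private RNG keeps results reproducible
    return [rng.uniform(min_val, max_val) for _ in range(n)]


# Hypothetical usage: five simulated prices between 0 and 1000
prices = sample_uniform_floats(0.0, 1000.0, n=5, seed=23)
same_prices = sample_uniform_floats(0.0, 1000.0, n=5, seed=23)
```

Because the generator is seeded, calling the function twice with the same arguments yields the same list, which is the property that makes seeded test fixtures repeatable.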
String Fields with Presets
Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:
This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
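The coherence idea can be pictured with a small standard-library sketch in which the email is built from the already-generated name. This is an assumption about how such coherence might work, not Pointblank's actual logic; `derive_email` and `EMAIL_DOMAINS` are hypothetical names, and the local-part styles simply mirror those visible in the Quick Start output.

```python
import random

# Hypothetical domain list; the values mirror domains seen in the sample output.
EMAIL_DOMAINS = ["gmail.com", "aol.com", "hotmail.com", "yahoo.com"]


def derive_email(full_name, rng):
    """Build an email whose local part is derived from the person's name."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    local_styles = [
        f"{first}.{last}",     # e.g. vivienne.rios
        f"{first}{last}",      # e.g. williamschaefer
        f"{first[0]}{last}",   # e.g. dmorris
        f"{first[0]}_{last}",  # e.g. l_sawyer
    ]
    return f"{rng.choice(local_styles)}@{rng.choice(EMAIL_DOMAINS)}"


rng = random.Random(23)
name = "Daniel Morris"
email = derive_email(name, rng)
```

Whatever style is drawn, the address always contains the person's last name, so a row never pairs one person's name with an unrelated email.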
String Fields with Patterns
Use regex patterns to generate strings matching specific formats:
Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
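As a rough illustration of pattern-driven generation, the sketch below produces strings for a deliberately tiny regex subset: literal characters, bracketed character classes with ranges, and exact `{n}` quantifiers. It is a standard-library toy with hypothetical helper names and makes no claim about which regex features Pointblank supports or how it implements them.

```python
import random
import re

TOKEN = re.compile(r"\[([^\]]+)\]|(.)")  # a character class or one literal char
QUANTIFIER = re.compile(r"\{(\d+)\}")    # exact repetition such as {3}


def expand_class(body):
    """Expand a class body such as 'A-Z0-9' into a list of characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range like A-Z: include every character between the endpoints
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def generate_from_pattern(pattern, rng):
    """Generate one string matching a pattern from the supported subset."""
    out = []
    pos = 0
    while pos < len(pattern):
        token = TOKEN.match(pattern, pos)
        choices = expand_class(token.group(1)) if token.group(1) else [token.group(2)]
        pos = token.end()
        count = 1
        quantifier = QUANTIFIER.match(pattern, pos)
        if quantifier:
            count = int(quantifier.group(1))
            pos = quantifier.end()
        out.extend(rng.choice(choices) for _ in range(count))
    return "".join(out)


rng = random.Random(23)
pattern = r"[A-Z]{3}-[0-9]{4}"
code = generate_from_pattern(pattern, rng)
```

A useful habit with any pattern-based generator is to check its output with `re.fullmatch(pattern, value)`, which confirms that the whole string, not just a prefix, satisfies the format.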
This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
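One way to picture this kind of probabilistic control is weighted sampling, where each state is drawn with its own probability. The sketch below uses only the standard library; `sample_weighted` and the weights are hypothetical and are not part of Pointblank's API.

```python
import random
from collections import Counter


def sample_weighted(values, weights, n, seed=None):
    """Draw n values where each value is chosen in proportion to its weight."""
    rng = random.Random(seed)
    return rng.choices(values, weights=weights, k=n)


# Hypothetical distribution: most accounts are active, few are inactive
statuses = sample_weighted(
    ["active", "pending", "inactive"],
    weights=[0.7, 0.2, 0.1],
    n=10_000,
    seed=23,
)
counts = Counter(statuses)
```

With a large enough sample, the observed frequencies in `counts` settle close to the requested weights.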
Date and Datetime Fields
Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:
The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
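For intuition, uniform sampling between two date boundaries can be sketched with the standard library as follows. This is an illustration of the concept under the assumption that both boundaries are inclusive; it is not Pointblank's implementation, and `sample_uniform_dates` is a hypothetical name.

```python
import random
from datetime import date, timedelta


def sample_uniform_dates(min_date, max_date, n, seed=None):
    """Draw n dates uniformly from the inclusive range [min_date, max_date]."""
    rng = random.Random(seed)
    span_days = (max_date - min_date).days
    # randint is inclusive on both ends, so either boundary can be drawn
    return [min_date + timedelta(days=rng.randint(0, span_days)) for _ in range(n)]


start, end = date(2024, 1, 1), date(2024, 12, 31)
dates = sample_uniform_dates(start, end, n=5, seed=23)
```

The same offset-from-a-lower-bound approach carries over to datetimes by drawing a random number of seconds instead of days.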
Available Presets
The preset= parameter in string_field() supports many data types:
Personal Data:
name: full name (first + last)
name_full: full name with potential prefix/suffix
first_name: first name only
last_name: last name only
email: email address
Location Data:
address: full street address
city: city name
state: state/province name
country: country name
postcode: postal/ZIP code
latitude: latitude coordinate
longitude: longitude coordinate
Business Data:
company: company name
job: job title
catch_phrase: business catch phrase
Internet Data:
url: website URL
domain_name: domain name
ipv4: IPv4 address
ipv6: IPv6 address
user_name: username
password: password
Financial Data:
credit_card_number: credit card number
iban: International Bank Account Number
currency_code: currency code (USD, EUR, etc.)
Identifiers:
uuid4: UUID version 4
ssn: Social Security Number (US format)
license_plate: vehicle license plate
Text:
word: single word
sentence: full sentence
paragraph: paragraph of text
text: multiple paragraphs
Miscellaneous:
color_name: color name
file_name: file name
file_extension: file extension
mime_type: MIME type
Country-Specific Data
Pointblank can also generate locale-aware data. Use the country= parameter to generate data specific to a country; it affects names, cities, addresses, and other locale-sensitive presets.
Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:
Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:
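| Row | Name | City | Address | Postcode | Latitude | Longitude |
|---|---|---|---|---|---|---|
| … | … | … | Fehrbelliner Straße 9073, Wohnung 623, 10481 Prenzlauer Berg | 10480 | 52.558487 | 13.418282 |
| 200 | Bianca Bollmann | Augsburg | Hochfeldstraße 8381, Wohnung 18, 86343 Augsburg | 86950 | 48.409668 | 10.856927 |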
Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:
Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:
This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.
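The location coherence described above can be approximated by drawing every location field from a single record rather than sampling each column independently. The sketch below is a standard-library illustration under that assumption; the lookup table, the `sample_location` helper, and the jitter value are hypothetical, and the coordinates and postcodes are approximate city-centre values.

```python
import random

# Hypothetical lookup table: each record keeps one city's fields together.
GERMAN_LOCATIONS = [
    {"city": "Berlin", "postcode": "10115", "latitude": 52.52, "longitude": 13.405},
    {"city": "Munich", "postcode": "80331", "latitude": 48.137, "longitude": 11.575},
    {"city": "Hamburg", "postcode": "20095", "latitude": 53.551, "longitude": 9.994},
]


def sample_location(rng, jitter=0.05):
    """Pick one city record and emit all location fields from that record."""
    base = rng.choice(GERMAN_LOCATIONS)
    return {
        "city": base["city"],
        "postcode": base["postcode"],
        # A small random offset keeps coordinates near the city centre
        "latitude": round(base["latitude"] + rng.uniform(-jitter, jitter), 6),
        "longitude": round(base["longitude"] + rng.uniform(-jitter, jitter), 6),
    }


rng = random.Random(23)
rows = [sample_location(rng) for _ in range(5)]
```

Sampling the record first and the fields second is what prevents mismatches such as a Hamburg postcode paired with Munich coordinates.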
Supported Countries
Pointblank currently supports 50 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").
Europe (32 countries):
Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)
Americas (7 countries):
Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Mexico (MX), United States (US)
Asia-Pacific (10 countries):
Australia (AU), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), New Zealand (NZ), Philippines (PH), South Korea (KR), Taiwan (TW)
Middle East (1 country):
Turkey (TR)
Additional countries and expanded coverage are planned for future releases.
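Since both alpha-2 and alpha-3 codes are accepted, it can help to normalize codes in your own test helpers before comparing or logging them. The following is a hypothetical standard-library sketch covering only a handful of countries; the mapping is a small subset of ISO 3166-1, and the uppercasing and error behavior are choices of this sketch rather than documented Pointblank behavior.

```python
# Hypothetical subset of the ISO 3166-1 alpha-3 to alpha-2 mapping
ALPHA3_TO_ALPHA2 = {"USA": "US", "DEU": "DE", "JPN": "JP", "BRA": "BR", "GBR": "GB"}
KNOWN_ALPHA2 = set(ALPHA3_TO_ALPHA2.values())


def normalize_country(code):
    """Return the alpha-2 form of an alpha-2 or alpha-3 country code."""
    code = code.strip().upper()
    alpha2 = ALPHA3_TO_ALPHA2.get(code, code)  # alpha-2 input passes through
    if alpha2 not in KNOWN_ALPHA2:
        raise ValueError(f"Unknown country code: {code!r}")
    return alpha2


us = normalize_country("USA")
de = normalize_country("de")
```

Here `us` is `"US"` and `de` is `"DE"`; a code outside the mapping raises `ValueError`.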
Output Formats
The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.
Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.
Using Generated Data for Validation Testing
A common use case is generating test data to check that your validation rules behave as expected:
```python
# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
```
Pointblank Validation
2026-02-09|16:00:05 · Polars

| STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 col_vals_gt() | user_id | 0 | | ✓ | 100 | 100 1.00 | 0 0.00 | — | — | — | — |
| 2 col_vals_regex() | email | .+@.+\..+ | | ✓ | 100 | 100 1.00 | 0 0.00 | — | — | — | — |
| 3 col_vals_between() | age | [18, 100] | | ✓ | 100 | 100 1.00 | 0 0.00 | — | — | — | — |
| 4 col_vals_in_set() | status | active, pending, inactive | | ✓ | 100 | 100 1.00 | 0 0.00 | — | — | — | — |
Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.
Conclusion
Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:
quickly prototype validation rules before working with production data
create reproducible test fixtures for automated testing and CI/CD pipelines
generate locale-specific data for internationalization testing across many countries
ensure coherent relationships between related fields like names, emails, and addresses
produce datasets of any size with consistent, realistic values
Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.