Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.
Note
Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.
Quick Start
Generate test data using a schema with field constraints:
```python
import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```
Polars · 100 rows × 5 columns

|  | user_id (Int64) | name (String) | email (String) | age (Int64) | status (String) |
|---|---|---|---|---|---|
| 1 | 7188536481533917197 | Doris Martin | d_martin@aol.com | 77 | pending |
| 2 | 2674009078779859984 | Nancy Gonzalez | nancygonzalez@icloud.com | 67 | active |
| 3 | 7652102777077138151 | Jessica Turner | jturner@aol.com | 78 | active |
| 4 | 157503859921753049 | George Evans | georgeevans@zoho.com | 36 | inactive |
| 5 | 2829213282471975080 | Patricia Williams | pwilliams@outlook.com | 75 | pending |
| … | … | … | … | … | … |
| 96 | 7027508096731143831 | Isaiah Murphy | isaiah.murphy@zoho.com | 55 | active |
| 97 | 6055996548456656575 | Brittany Rodriguez | brodriguez@yandex.com | 39 | inactive |
| 98 | 3822709996092631588 | Megan Stevens | mstevens26@aol.com | 24 | inactive |
| 99 | 1522653102058131295 | Pamela Jenkins | pjenkins29@yandex.com | 41 | active |
| 100 | 5690877051669225499 | Stephanie Santos | stephanie.santos40@gmail.com | 75 | pending |
Field Types
Pointblank provides helper functions for defining typed columns with constraints:
Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
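As a mental model (a standard-library sketch, not Pointblank's internals), a bounded numeric field samples uniformly within its `min_val`/`max_val` range:

```python
import random

def sample_floats(min_val: float, max_val: float, n: int, seed: int) -> list[float]:
    # Every value in [min_val, max_val] is equally likely
    rng = random.Random(seed)
    return [rng.uniform(min_val, max_val) for _ in range(n)]

# e.g., 100 prices between 0.99 and 49.99
prices = sample_floats(0.99, 49.99, n=100, seed=23)
assert all(0.99 <= p <= 49.99 for p in prices)
```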
String Fields with Presets
Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:
This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
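The name/email linkage can be pictured in plain Python. This is an illustration of the idea only, not Pointblank's actual derivation logic; real output also varies separators, initials, digit suffixes, and providers:

```python
import random

# Hypothetical local-part patterns and providers, for illustration
PATTERNS = ["{first}.{last}", "{f}{last}", "{first}_{last}"]
DOMAINS = ["gmail.com", "aol.com", "outlook.com", "icloud.com"]

def coherent_email(first: str, last: str, rng: random.Random) -> str:
    # The local part is always derived from the person's name
    pattern = rng.choice(PATTERNS)
    local = pattern.format(first=first.lower(), last=last.lower(), f=first[0].lower())
    return f"{local}@{rng.choice(DOMAINS)}"

rng = random.Random(23)
email = coherent_email("Doris", "Martin", rng)
assert "martin" in email and "@" in email
```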
String Fields with Patterns
Use regex patterns to generate strings matching specific formats:
Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
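To see how pattern-driven generation works in principle, here is a self-contained sketch (not Pointblank's generator) that samples strings from a small regex subset:

```python
import random
import re

def sample_from_pattern(pattern: str, rng: random.Random) -> str:
    # Supports a small regex subset: literals, \d, [A-Z]-style classes,
    # and {n} quantifiers -- enough to illustrate the technique.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and pattern[i + 1] == "d":
            choices, i = "0123456789", i + 2
        elif ch == "[":
            j = pattern.index("]", i)
            body, choices, k = pattern[i + 1 : j], "", 0
            while k < len(body):
                if k + 2 < len(body) and body[k + 1] == "-":  # range like A-Z
                    choices += "".join(chr(c) for c in range(ord(body[k]), ord(body[k + 2]) + 1))
                    k += 3
                else:
                    choices += body[k]
                    k += 1
            i = j + 1
        else:
            choices, i = ch, i + 1
        reps = 1
        if i < len(pattern) and pattern[i] == "{":  # {n} repeats the previous atom
            j = pattern.index("}", i)
            reps, i = int(pattern[i + 1 : j]), j + 1
        out.append("".join(rng.choice(choices) for _ in range(reps)))
    return "".join(out)

rng = random.Random(23)
sku = sample_from_pattern(r"[A-Z]{3}-\d{4}", rng)   # e.g., a SKU-like code
assert re.fullmatch(r"[A-Z]{3}-\d{4}", sku)
```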
This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
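Under the hood this is just weighted categorical sampling, which the standard library demonstrates directly (a sketch of the mechanism, not Pointblank's API):

```python
import random
from collections import Counter

rng = random.Random(23)

# Statuses with skewed probabilities: "active" dominates, "inactive" is rare
statuses = ["active", "pending", "inactive"]
weights = [0.7, 0.2, 0.1]

sample = rng.choices(statuses, weights=weights, k=10_000)
counts = Counter(sample)
assert counts["active"] > counts["pending"] > counts["inactive"]
```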
Date and Datetime Fields
Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:
The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
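Uniform sampling over a date range reduces to picking a day offset uniformly; a minimal standard-library sketch of that idea:

```python
import random
from datetime import date, timedelta

def sample_date(start: date, end: date, rng: random.Random) -> date:
    # Pick a day offset uniformly across the inclusive range
    span = (end - start).days
    return start + timedelta(days=rng.randint(0, span))

rng = random.Random(23)
d = sample_date(date(2020, 1, 1), date(2024, 12, 31), rng)
assert date(2020, 1, 1) <= d <= date(2024, 12, 31)
```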
Available Presets
The preset= parameter in string_field() supports many data types:
Personal Data:
name: full name (first + last)
name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
first_name: first name only
last_name: last name only
email: email address
phone_number: phone number in country-specific format
Location Data:
address: full street address
city: city name
state: state/province name
country: country name
country_code_2: ISO 3166-1 alpha-2 country code (e.g., "US")
country_code_3: ISO 3166-1 alpha-3 country code (e.g., "USA")
postcode: postal/ZIP code
latitude: latitude coordinate
longitude: longitude coordinate
Business Data:
company: company name
job: job title
catch_phrase: business catch phrase
Internet Data:
url: website URL
domain_name: domain name
ipv4: IPv4 address
ipv6: IPv6 address
user_name: username
password: password
Financial Data:
credit_card_number: credit card number
iban: International Bank Account Number
currency_code: currency code (USD, EUR, etc.)
Identifiers:
uuid4: UUID version 4
md5: MD5 hash (32 hex characters)
sha1: SHA-1 hash (40 hex characters)
sha256: SHA-256 hash (64 hex characters)
ssn: Social Security Number (country-specific format)
license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)
Barcodes:
ean8: EAN-8 barcode with valid check digit
ean13: EAN-13 barcode with valid check digit
Date/Time:
date_this_year: a date within the current year
date_this_decade: a date within the current decade
date_between: a random date between 2000 and 2025
date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
future_date: a date up to 1 year in the future
past_date: a date up to 10 years in the past
time: a time value
Text:
word: single word
sentence: full sentence
paragraph: paragraph of text
text: multiple paragraphs
Miscellaneous:
color_name: color name
file_name: file name
file_extension: file extension
mime_type: MIME type
user_agent: browser user agent string (country-weighted)
Profile Fields
When generating person-profile data, you often need several related presets together: a name, an email derived from that name, an address, a phone number, and so on. Rather than wiring up each column individually, the profile_fields() helper returns a ready-made dictionary of StringField objects that you can unpack directly into a Schema().
Basic Usage
With no arguments, profile_fields() returns the standard set of seven columns: first_name, last_name, email, city, state, postcode, and phone_number. All coherence rules apply automatically: emails are derived from names, and city/state/postcode/phone are internally consistent.
The ** operator unpacks the dictionary into keyword arguments, as if you had written each string_field(preset=...) call by hand.
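If `**` unpacking is unfamiliar, this minimal example (using a plain function as a stand-in for `Schema()`) shows that unpacking a dict is identical to spelling out the keyword arguments:

```python
def schema(**columns):
    # Stand-in for pb.Schema(): just records the keyword arguments it receives
    return columns

profile = {"first_name": "field-1", "email": "field-2"}

# Unpacking the dict is the same as writing each keyword argument by hand
assert schema(**profile) == schema(first_name="field-1", email="field-2")
```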
Choosing a Set
Three built-in sets control how many columns are generated:
| Set | Columns |
|---|---|
| `"minimal"` | first_name, last_name, email, phone_number |
| `"standard"` | first_name, last_name, email, city, state, postcode, phone_number |
| `"full"` | first_name, last_name, email, address, city, state, postcode, phone_number, company, job |
```python
# Minimal profile: just name, email, and phone
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="minimal")),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 4 columns

|  | first_name (String) | last_name (String) | email (String) | phone_number (String) |
|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | (214) 473-2777 |
| 2 | Hazel | Torres | hazel723@hotmail.com | (213) 862-8023 |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | (602) 517-0098 |
| 4 | Maria | Garcia | m_garcia@hotmail.com | (720) 949-4047 |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | (469) 191-2143 |
| … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | (346) 922-7116 |
| 97 | Helen | Simpson | hsimpson20@yandex.com | (346) 367-9252 |
| 98 | Mark | Graham | mark.graham65@mail.com | (252) 882-0135 |
| 99 | Brian | Moore | bmoore95@zoho.com | (310) 178-9819 |
| 100 | Michael | Ward | michael_ward@yahoo.com | (707) 873-4304 |
```python
# Full profile: includes address, company, and job title
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="full")),
        n=100,
        seed=23,
    )
)
```
(Output: a Polars DataFrame with 100 rows and 10 columns: first_name, last_name, email, address, city, state, postcode, phone_number, company, and job. Row 1, for example, pairs Weston Parker with the email wparker89@outlook.com and the address 3365 Richmond Avenue, Suite 422, Lubbock, Texas 79421.)
Use include= to add presets to the base set and exclude= to remove them. Both accept lists of preset names. The available profile presets are: first_name, last_name, name, email, address, city, state, postcode, phone_number, company, and job.
```python
# Standard set + company column
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(include=["company"])),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 8 columns

|  | first_name | last_name | email | city | state | postcode | phone_number | company |
|---|---|---|---|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | Lubbock | Texas | 79404 | (832) 760-5399 | Innovative Systems Solutions |
| 2 | Hazel | Torres | hazel723@hotmail.com | Anaheim | California | 92873 | (805) 788-7427 | Sterling Engineering |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | Phoenix | Arizona | 85027 | (928) 958-2589 | Goldman Sachs |
| 4 | Maria | Garcia | m_garcia@hotmail.com | Denver | Colorado | 80277 | (719) 064-6663 | Evans Group |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | San Antonio | Texas | 78208 | (210) 070-1000 | Goodwin and Garrett |
| … | … | … | … | … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | El Paso | Texas | 79944 | (214) 099-8902 | Henry Construction |
| 97 | Helen | Simpson | hsimpson20@yandex.com | El Paso | Texas | 79930 | (956) 223-4585 | Thompson Technologies |
| 98 | Mark | Graham | mark.graham65@mail.com | Charlotte | North Carolina | 28222 | (910) 859-9554 | Universal Consulting |
| 99 | Brian | Moore | bmoore95@zoho.com | Los Angeles | California | 90058 | (858) 861-0525 | Long Industries |
| 100 | Michael | Ward | michael_ward@yahoo.com | San Diego | California | 92147 | (626) 922-1048 | Pioneer Solutions |
```python
# Standard set without city and state
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(exclude=["city", "state"])),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 5 columns

|  | first_name | last_name | email | postcode | phone_number |
|---|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | 79404 | (832) 760-5399 |
| 2 | Hazel | Torres | hazel723@hotmail.com | 92873 | (805) 788-7427 |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | 85027 | (928) 958-2589 |
| 4 | Maria | Garcia | m_garcia@hotmail.com | 80277 | (719) 064-6663 |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | 78208 | (210) 070-1000 |
| … | … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | 79944 | (214) 099-8902 |
| 97 | Helen | Simpson | hsimpson20@yandex.com | 79930 | (956) 223-4585 |
| 98 | Mark | Graham | mark.graham65@mail.com | 28222 | (910) 859-9554 |
| 99 | Brian | Moore | bmoore95@zoho.com | 90058 | (858) 861-0525 |
| 100 | Michael | Ward | michael_ward@yahoo.com | 92147 | (626) 922-1048 |
You can combine include= and exclude= in the same call, as long as the same preset does not appear in both.
Column Prefixes
The prefix= parameter prepends a string to every column name. This is especially useful when a schema needs two independent profiles (e.g., sender and recipient):
Each prefixed group maintains its own coherence: the sender’s email is derived from the sender’s name, and the recipient’s email from the recipient’s name.
Combining with Other Field Types
Since profile_fields() returns a plain dictionary, it composes naturally with any other field types:
One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.
Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:
Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:
|  | name | city | address | postcode | latitude | longitude |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
| 4 | Juliane Münz | Leipzig | Lindenauer Markt 6249, Whg. 489, 04541 Leipzig | 04992 | 51.276862 | 12.458890 |
| 5 | Anton Baumann | Köln | Aachener Straße 7203, 50125 Köln | 50589 | 50.967264 | 6.795838 |
| … | … | … | … | … | … | … |
| 196 | Franziska Wendt | Ulm | Marktplatz 6251, Whg. 535, 89984 Ulm | 89226 | 48.395296 | 10.001962 |
| 197 | Lennart Berger | München | Brienner Straße 1390, Whg. 389, 80255 München | 80835 | 48.206882 | 11.674262 |
| 198 | Julia Knecht | Ludwigshafen am Rhein | Friedrichstraße 3204, 67944 Ludwigshafen am Rhein | 67305 | 49.473668 | 8.437782 |
| 199 | Sebastian Thiel | Gelsenkirchen | Husemannstraße 453, Whg. 273, 45732 Gelsenkirchen | 45992 | 51.568689 | 7.082531 |
| 200 | Trude Kaiser | Kassel | Königstraße 1394, 34406 Kassel | 34736 | 51.326544 | 9.494319 |
Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:
Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:
This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally consistent location data matters.
Data Coherence
Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:
Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.
Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.
Business coherence activates when both job and company are present. When active:
the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
integer columns whose name contains age (e.g., age, person_age) are automatically constrained to a working-age range (22–65).
Here’s an example showing all three coherence systems working together:
License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.
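The letter restriction is simple to reproduce. Here is a sketch (using the Ontario-style format mentioned above; not Pointblank's generator) of plate generation from a restricted alphabet:

```python
import random
import string

# Letters I, O, Q, U are excluded, mirroring real plate restrictions
PLATE_LETTERS = [c for c in string.ascii_uppercase if c not in "IOQU"]

def ontario_plate(rng: random.Random) -> str:
    # Ontario-style format: four letters, a space, three digits ("CABC 123")
    letters = "".join(rng.choice(PLATE_LETTERS) for _ in range(4))
    digits = "".join(rng.choice(string.digits) for _ in range(3))
    return f"{letters} {digits}"

rng = random.Random(23)
plate = ontario_plate(rng)
assert not any(c in "IOQU" for c in plate)
```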
Supported Countries
Pointblank currently supports 100 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").
Europe (38 countries):
Armenia (AM), Austria (AT), Azerbaijan (AZ), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Georgia (GE), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Moldova (MD), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Serbia (RS), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), Ukraine (UA), United Kingdom (GB)
Americas (19 countries):
Argentina (AR), Bolivia (BO), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Costa Rica (CR), Dominican Republic (DO), Ecuador (EC), El Salvador (SV), Guatemala (GT), Honduras (HN), Jamaica (JM), Mexico (MX), Panama (PA), Paraguay (PY), Peru (PE), United States (US), Uruguay (UY)
Asia-Pacific (22 countries):
Australia (AU), Bangladesh (BD), Cambodia (KH), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), Kazakhstan (KZ), Malaysia (MY), Myanmar (MM), Nepal (NP), New Zealand (NZ), Pakistan (PK), Philippines (PH), Singapore (SG), South Korea (KR), Sri Lanka (LK), Taiwan (TW), Thailand (TH), Uzbekistan (UZ), Vietnam (VN)
Middle East & Africa (21 countries):
Algeria (DZ), Cameroon (CM), Egypt (EG), Ethiopia (ET), Ghana (GH), Israel (IL), Jordan (JO), Kenya (KE), Lebanon (LB), Morocco (MA), Mozambique (MZ), Nigeria (NG), Rwanda (RW), Saudi Arabia (SA), Senegal (SN), South Africa (ZA), Tanzania (TZ), Tunisia (TN), Turkey (TR), Uganda (UG), United Arab Emirates (AE)
Additional countries and expanded coverage are planned for future releases.
Mixing Multiple Countries
When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.
Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):
To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
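Largest-remainder apportionment first gives each country the floor of its exact quota, then hands leftover rows to the largest fractional remainders. A standalone sketch of the method (illustrative, not Pointblank's internal code):

```python
from math import floor

def apportion(n: int, weights: dict[str, float]) -> dict[str, int]:
    # Exact (fractional) quota for each country after normalizing weights
    total = sum(weights.values())
    quotas = {k: n * w / total for k, w in weights.items()}
    counts = {k: floor(q) for k, q in quotas.items()}
    # Hand out the leftover rows to the largest fractional remainders
    leftover = n - sum(counts.values())
    for k in sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

assert apportion(200, {"US": 7, "DE": 2, "FR": 1}) == {"US": 140, "DE": 40, "FR": 20}
assert sum(apportion(97, {"US": 1, "DE": 1, "FR": 1}).values()) == 97
```

The second assertion shows why the remainder step matters: 97 rows over three equal weights cannot split evenly, yet the counts still sum to exactly `n`.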
By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:
All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.
Frequency-Weighted Sampling
By default, names and cities are sampled uniformly at random from the locale data, giving every entry the same probability of being selected. Real-world distributions are far from uniform though: “James” and “Maria” appear orders of magnitude more often than “Thaddeus” or “Xiomara”, and more people live in New York City than in Flagstaff. The weighted=True parameter makes generated data reflect this natural skew.
With weighting enabled you will see popular names like James, John, Mary, and Patricia appear more frequently, while unusual names surface only occasionally. Similarly, cities like New York, Los Angeles, and Chicago dominate the output while smaller cities appear less often.
The feature works by organizing locale data into four frequency tiers. Each tier has a sampling probability that determines how likely its members are to be selected:
Tier
Probability
Contents
very_common
45%
The top ~10% of entries by real-world frequency
common
30%
The next ~20% of entries
uncommon
20%
The next ~30% of entries
rare
5%
The remaining ~40% of entries
When a value is needed, a tier is first chosen according to these probabilities and then a single entry is picked uniformly at random within that tier. This two-step approach keeps sampling fast while producing a realistic long-tail distribution. Setting weighted=False pools all entries across every tier and samples them uniformly, which can be useful when you want an even spread rather than a realistic distribution.
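A minimal sketch of the two-step scheme, with toy tiers standing in for real locale data:

```python
import random

# Tier probabilities from the table above
TIER_PROBS = {"very_common": 0.45, "common": 0.30, "uncommon": 0.20, "rare": 0.05}

def sample_tiered(tiers: dict[str, list[str]], rng: random.Random) -> str:
    # Step 1: pick a tier according to the fixed tier probabilities
    names = [t for t in TIER_PROBS if tiers.get(t)]
    weights = [TIER_PROBS[t] for t in names]
    tier = rng.choices(names, weights=weights, k=1)[0]
    # Step 2: pick uniformly at random within the chosen tier
    return rng.choice(tiers[tier])

tiers = {
    "very_common": ["James", "Mary"],
    "common": ["Helen", "Mark"],
    "uncommon": ["Weston", "Hazel"],
    "rare": ["Thaddeus", "Xiomara"],
}
rng = random.Random(23)
draws = [sample_tiered(tiers, rng) for _ in range(10_000)]
common_hits = sum(d in {"James", "Mary"} for d in draws)
rare_hits = sum(d in {"Thaddeus", "Xiomara"} for d in draws)
assert common_hits > rare_hits
```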
Weighted sampling combines seamlessly with multi-country mixing. Each country’s batch uses its own tiered data independently, so a mixed dataset will have weighted US names alongside weighted German names:
All 100 supported country locales have tiered name and location data, so weighted=True produces realistic frequency distributions for every country.
Output Formats
The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.
Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.
Using Generated Data for Validation Testing
A common use case is generating test data to validate your validation rules:
```python
# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)
validation
```
Pointblank Validation
2026-02-26 | 14:33:03 · Polars

|  | STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_vals_gt() | user_id | 0 | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | col_vals_regex() | email | .+@.+\..+ | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 3 | col_vals_between() | age | [18, 100] | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 4 | col_vals_in_set() | status | active, pending, inactive | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.
Pytest Fixture
When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest's plugin system.
The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:
the same test always produces the same data: no manual seed management required.
different tests get different seeds, so they exercise different datasets.
you can still pass an explicit seed= to override the automatic seed when needed.
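The guide doesn't specify how the automatic seed is derived, but a scheme like the following (hashing the test's node ID; purely illustrative, not the fixture's actual implementation) has exactly the properties listed above:

```python
import hashlib

def seed_for_test(nodeid: str, call_index: int = 0) -> int:
    # Derive a stable 32-bit seed from a test's fully-qualified name
    digest = hashlib.sha256(f"{nodeid}:{call_index}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Same test name -> same seed; different tests -> different seeds
a = seed_for_test("tests/test_users.py::test_signup")
b = seed_for_test("tests/test_users.py::test_signup")
c = seed_for_test("tests/test_orders.py::test_checkout")
assert a == b
assert a != c
```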
Basic Usage
Use it by adding generate_dataset to your test function’s parameter list:
Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:
```python
def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0
```
Testing Across Locales
The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:
You can also use .default_seed to reproduce the exact dataset outside of pytest:
```python
# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb

df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)
```
Seed Stability
A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.
For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.
Conclusion
Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:
quickly prototype validation rules before working with production data
create reproducible test fixtures for automated testing and CI/CD pipelines
generate locale-specific data for internationalization testing across 100 countries
ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
produce datasets of any size with consistent, realistic values
Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.