Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.
Note
Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.
Quick Start
Generate test data using a schema with field constraints:
```python
import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```
Polars · 100 rows × 5 columns

|  | user_id (Int64) | name (String) | email (String) | age (Int64) | status (String) |
|---|---|---|---|---|---|
| 1 | 7188536481533917197 | Doris Martin | d_martin@aol.com | 77 | pending |
| 2 | 2674009078779859984 | Nancy Gonzalez | nancygonzalez@icloud.com | 67 | active |
| 3 | 7652102777077138151 | Jessica Turner | jturner@aol.com | 78 | active |
| 4 | 157503859921753049 | George Evans | georgeevans@zoho.com | 36 | inactive |
| 5 | 2829213282471975080 | Patricia Williams | pwilliams@outlook.com | 75 | pending |
| … | … | … | … | … | … |
| 96 | 7027508096731143831 | Isaiah Murphy | isaiah.murphy@zoho.com | 55 | active |
| 97 | 6055996548456656575 | Brittany Rodriguez | brodriguez@yandex.com | 39 | inactive |
| 98 | 3822709996092631588 | Megan Stevens | mstevens26@aol.com | 24 | inactive |
| 99 | 1522653102058131295 | Pamela Jenkins | pjenkins29@yandex.com | 41 | active |
| 100 | 5690877051669225499 | Stephanie Santos | stephanie.santos40@gmail.com | 75 | pending |
Field Types
Pointblank provides helper functions for defining typed columns with constraints:
Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
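As a mental model (a standard-library sketch, not Pointblank's internals), a bounded numeric field samples uniformly within its `min_val`/`max_val` range:

```python
import random

def sample_floats(min_val: float, max_val: float, n: int, seed: int) -> list[float]:
    # Every value in [min_val, max_val] is equally likely
    rng = random.Random(seed)
    return [rng.uniform(min_val, max_val) for _ in range(n)]

# e.g., 100 prices between 0.99 and 49.99
prices = sample_floats(0.99, 49.99, n=100, seed=23)
assert all(0.99 <= p <= 49.99 for p in prices)
```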
String Fields with Presets
Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:
This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
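The name/email linkage can be pictured in plain Python. This is an illustration of the idea only, not Pointblank's actual derivation logic; real output also varies separators, initials, digit suffixes, and providers:

```python
import random

# Hypothetical local-part patterns and providers, for illustration
PATTERNS = ["{first}.{last}", "{f}{last}", "{first}_{last}"]
DOMAINS = ["gmail.com", "aol.com", "outlook.com", "icloud.com"]

def coherent_email(first: str, last: str, rng: random.Random) -> str:
    # The local part is always derived from the person's name
    pattern = rng.choice(PATTERNS)
    local = pattern.format(first=first.lower(), last=last.lower(), f=first[0].lower())
    return f"{local}@{rng.choice(DOMAINS)}"

rng = random.Random(23)
email = coherent_email("Doris", "Martin", rng)
assert "martin" in email and "@" in email
```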
String Fields with Patterns
Use regex patterns to generate strings matching specific formats:
Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
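To see how pattern-driven generation works in principle, here is a self-contained sketch (not Pointblank's generator) that samples strings from a small regex subset:

```python
import random
import re

def sample_from_pattern(pattern: str, rng: random.Random) -> str:
    # Supports a small regex subset: literals, \d, [A-Z]-style classes,
    # and {n} quantifiers -- enough to illustrate the technique.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and pattern[i + 1] == "d":
            choices, i = "0123456789", i + 2
        elif ch == "[":
            j = pattern.index("]", i)
            body, choices, k = pattern[i + 1 : j], "", 0
            while k < len(body):
                if k + 2 < len(body) and body[k + 1] == "-":  # range like A-Z
                    choices += "".join(chr(c) for c in range(ord(body[k]), ord(body[k + 2]) + 1))
                    k += 3
                else:
                    choices += body[k]
                    k += 1
            i = j + 1
        else:
            choices, i = ch, i + 1
        reps = 1
        if i < len(pattern) and pattern[i] == "{":  # {n} repeats the previous atom
            j = pattern.index("}", i)
            reps, i = int(pattern[i + 1 : j]), j + 1
        out.append("".join(rng.choice(choices) for _ in range(reps)))
    return "".join(out)

rng = random.Random(23)
sku = sample_from_pattern(r"[A-Z]{3}-\d{4}", rng)   # e.g., a SKU-like code
assert re.fullmatch(r"[A-Z]{3}-\d{4}", sku)
```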
This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
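Under the hood this is just weighted categorical sampling, which the standard library demonstrates directly (a sketch of the mechanism, not Pointblank's API):

```python
import random
from collections import Counter

rng = random.Random(23)

# Statuses with skewed probabilities: "active" dominates, "inactive" is rare
statuses = ["active", "pending", "inactive"]
weights = [0.7, 0.2, 0.1]

sample = rng.choices(statuses, weights=weights, k=10_000)
counts = Counter(sample)
assert counts["active"] > counts["pending"] > counts["inactive"]
```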
Date and Datetime Fields
Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:
The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
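Uniform sampling over a date range reduces to picking a day offset uniformly; a minimal standard-library sketch of that idea:

```python
import random
from datetime import date, timedelta

def sample_date(start: date, end: date, rng: random.Random) -> date:
    # Pick a day offset uniformly across the inclusive range
    span = (end - start).days
    return start + timedelta(days=rng.randint(0, span))

rng = random.Random(23)
d = sample_date(date(2020, 1, 1), date(2024, 12, 31), rng)
assert date(2020, 1, 1) <= d <= date(2024, 12, 31)
```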
Available Presets
The preset= parameter in string_field() supports many data types:
Personal Data:
name: full name (first + last)
name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
first_name: first name only
last_name: last name only
email: email address
phone_number: phone number in country-specific format
Location Data:
address: full street address
city: city name
state: state/province name
country: country name
country_code_2: ISO 3166-1 alpha-2 country code (e.g., "US")
country_code_3: ISO 3166-1 alpha-3 country code (e.g., "USA")
postcode: postal/ZIP code
latitude: latitude coordinate
longitude: longitude coordinate
Business Data:
company: company name
job: job title
catch_phrase: business catch phrase
Internet Data:
url: website URL
domain_name: domain name
ipv4: IPv4 address
ipv6: IPv6 address
user_name: username
password: password
Financial Data:
credit_card_number: credit card number
iban: International Bank Account Number
currency_code: currency code (USD, EUR, etc.)
Identifiers:
uuid4: UUID version 4
md5: MD5 hash (32 hex characters)
sha1: SHA-1 hash (40 hex characters)
sha256: SHA-256 hash (64 hex characters)
ssn: Social Security Number (country-specific format)
license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)
Barcodes:
ean8: EAN-8 barcode with valid check digit
ean13: EAN-13 barcode with valid check digit
Date/Time:
date_this_year: a date within the current year
date_this_decade: a date within the current decade
date_between: a random date between 2000 and 2025
date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
future_date: a date up to 1 year in the future
past_date: a date up to 10 years in the past
time: a time value
Text:
word: single word
sentence: full sentence
paragraph: paragraph of text
text: multiple paragraphs
Miscellaneous:
color_name: color name
file_name: file name
file_extension: file extension
mime_type: MIME type
user_agent: browser user agent string (country-weighted)
Profile Fields
When generating person-profile data, you often need several related presets together: a name, an email derived from that name, an address, a phone number, and so on. Rather than wiring up each column individually, the profile_fields() helper returns a ready-made dictionary of StringField objects that you can unpack directly into a Schema().
Basic Usage
With no arguments, profile_fields() returns the standard set of seven columns: first_name, last_name, email, city, state, postcode, and phone_number. All coherence rules apply automatically: emails are derived from names, and city/state/postcode/phone are internally consistent.
The ** operator unpacks the dictionary into keyword arguments, as if you had written each string_field(preset=...) call by hand.
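If `**` unpacking is unfamiliar, this minimal example (using a plain function as a stand-in for `Schema()`) shows that unpacking a dict is identical to spelling out the keyword arguments:

```python
def schema(**columns):
    # Stand-in for pb.Schema(): just records the keyword arguments it receives
    return columns

profile = {"first_name": "field-1", "email": "field-2"}

# Unpacking the dict is the same as writing each keyword argument by hand
assert schema(**profile) == schema(first_name="field-1", email="field-2")
```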
Choosing a Set
Three built-in sets control how many columns are generated:
| Set | Columns |
|---|---|
| `"minimal"` | first_name, last_name, email, phone_number |
| `"standard"` | first_name, last_name, email, city, state, postcode, phone_number |
| `"full"` | first_name, last_name, email, address, city, state, postcode, phone_number, company, job |
```python
# Minimal profile: just name, email, and phone
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="minimal")),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 4 columns

|  | first_name (String) | last_name (String) | email (String) | phone_number (String) |
|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | (214) 473-2777 |
| 2 | Hazel | Torres | hazel723@hotmail.com | (213) 862-8023 |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | (602) 517-0098 |
| 4 | Maria | Garcia | m_garcia@hotmail.com | (720) 949-4047 |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | (469) 191-2143 |
| … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | (346) 922-7116 |
| 97 | Helen | Simpson | hsimpson20@yandex.com | (346) 367-9252 |
| 98 | Mark | Graham | mark.graham65@mail.com | (252) 882-0135 |
| 99 | Brian | Moore | bmoore95@zoho.com | (310) 178-9819 |
| 100 | Michael | Ward | michael_ward@yahoo.com | (707) 873-4304 |
```python
# Full profile: includes address, company, and job title
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="full")),
        n=100,
        seed=23,
    )
)
```
(Output: a Polars DataFrame with 100 rows and 10 columns: first_name, last_name, email, address, city, state, postcode, phone_number, company, and job. Row 1, for example, pairs Weston Parker with the email wparker89@outlook.com and the address 3365 Richmond Avenue, Suite 422, Lubbock, Texas 79421.)
Use include= to add presets to the base set and exclude= to remove them. Both accept lists of preset names. The available profile presets are: first_name, last_name, name, email, address, city, state, postcode, phone_number, company, and job.
```python
# Standard set + company column
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(include=["company"])),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 8 columns

|  | first_name | last_name | email | city | state | postcode | phone_number | company |
|---|---|---|---|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | Lubbock | Texas | 79404 | (832) 760-5399 | Innovative Systems Solutions |
| 2 | Hazel | Torres | hazel723@hotmail.com | Anaheim | California | 92873 | (805) 788-7427 | Sterling Engineering |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | Phoenix | Arizona | 85027 | (928) 958-2589 | Goldman Sachs |
| 4 | Maria | Garcia | m_garcia@hotmail.com | Denver | Colorado | 80277 | (719) 064-6663 | Evans Group |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | San Antonio | Texas | 78208 | (210) 070-1000 | Goodwin and Garrett |
| … | … | … | … | … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | El Paso | Texas | 79944 | (214) 099-8902 | Henry Construction |
| 97 | Helen | Simpson | hsimpson20@yandex.com | El Paso | Texas | 79930 | (956) 223-4585 | Thompson Technologies |
| 98 | Mark | Graham | mark.graham65@mail.com | Charlotte | North Carolina | 28222 | (910) 859-9554 | Universal Consulting |
| 99 | Brian | Moore | bmoore95@zoho.com | Los Angeles | California | 90058 | (858) 861-0525 | Long Industries |
| 100 | Michael | Ward | michael_ward@yahoo.com | San Diego | California | 92147 | (626) 922-1048 | Pioneer Solutions |
```python
# Standard set without city and state
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(exclude=["city", "state"])),
        n=100,
        seed=23,
    )
)
```
Polars · 100 rows × 5 columns

|  | first_name | last_name | email | postcode | phone_number |
|---|---|---|---|---|---|
| 1 | Weston | Parker | weston.parker23@gmail.com | 79404 | (832) 760-5399 |
| 2 | Hazel | Torres | hazel723@hotmail.com | 92873 | (805) 788-7427 |
| 3 | Lawrence | Mitchell | lawrence_mitchell@zoho.com | 85027 | (928) 958-2589 |
| 4 | Maria | Garcia | m_garcia@hotmail.com | 80277 | (719) 064-6663 |
| 5 | Michael | Hoffman | michael.hoffman@gmail.com | 78208 | (210) 070-1000 |
| … | … | … | … | … | … |
| 96 | Daniel | Torres | daniel_torres@icloud.com | 79944 | (214) 099-8902 |
| 97 | Helen | Simpson | hsimpson20@yandex.com | 79930 | (956) 223-4585 |
| 98 | Mark | Graham | mark.graham65@mail.com | 28222 | (910) 859-9554 |
| 99 | Brian | Moore | bmoore95@zoho.com | 90058 | (858) 861-0525 |
| 100 | Michael | Ward | michael_ward@yahoo.com | 92147 | (626) 922-1048 |
You can combine include= and exclude= in the same call, as long as the same preset does not appear in both.
Column Prefixes
The prefix= parameter prepends a string to every column name. This is especially useful when a schema needs two independent profiles (e.g., sender and recipient):
Each prefixed group maintains its own coherence: the sender’s email is derived from the sender’s name, and the recipient’s email from the recipient’s name.
Combining with Other Field Types
Since profile_fields() returns a plain dictionary, it composes naturally with any other field types:
One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.
Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:
Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:
|  | name | city | address | postcode | latitude | longitude |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
| 4 | Juliane Münz | Leipzig | Lindenauer Markt 6249, Whg. 489, 04541 Leipzig | 04992 | 51.276862 | 12.458890 |
| 5 | Anton Baumann | Köln | Aachener Straße 7203, 50125 Köln | 50589 | 50.967264 | 6.795838 |
| … | … | … | … | … | … | … |
| 196 | Franziska Wendt | Ulm | Marktplatz 6251, Whg. 535, 89984 Ulm | 89226 | 48.395296 | 10.001962 |
| 197 | Lennart Berger | München | Brienner Straße 1390, Whg. 389, 80255 München | 80835 | 48.206882 | 11.674262 |
| 198 | Julia Knecht | Ludwigshafen am Rhein | Friedrichstraße 3204, 67944 Ludwigshafen am Rhein | 67305 | 49.473668 | 8.437782 |
| 199 | Sebastian Thiel | Gelsenkirchen | Husemannstraße 453, Whg. 273, 45732 Gelsenkirchen | 45992 | 51.568689 | 7.082531 |
| 200 | Trude Kaiser | Kassel | Königstraße 1394, 34406 Kassel | 34736 | 51.326544 | 9.494319 |
Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:
Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:
This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally consistent location data matters.
Data Coherence
Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:
Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.
Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.
Business coherence activates when both job and company are present. When active:
the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
integer columns whose name contains age (e.g., age, person_age) are automatically constrained to a working-age range (22–65).
Here’s an example showing all three coherence systems working together:
License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.
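The letter restriction is simple to reproduce. Here is a sketch (using the Ontario-style format mentioned above; not Pointblank's generator) of plate generation from a restricted alphabet:

```python
import random
import string

# Letters I, O, Q, U are excluded, mirroring real plate restrictions
PLATE_LETTERS = [c for c in string.ascii_uppercase if c not in "IOQU"]

def ontario_plate(rng: random.Random) -> str:
    # Ontario-style format: four letters, a space, three digits ("CABC 123")
    letters = "".join(rng.choice(PLATE_LETTERS) for _ in range(4))
    digits = "".join(rng.choice(string.digits) for _ in range(3))
    return f"{letters} {digits}"

rng = random.Random(23)
plate = ontario_plate(rng)
assert not any(c in "IOQU" for c in plate)
```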
Supported Countries
Pointblank currently supports 100 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").
Europe (38 countries):
Armenia (AM), Austria (AT), Azerbaijan (AZ), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Georgia (GE), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Moldova (MD), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Serbia (RS), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), Ukraine (UA), United Kingdom (GB)
Americas (19 countries):
Argentina (AR), Bolivia (BO), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Costa Rica (CR), Dominican Republic (DO), Ecuador (EC), El Salvador (SV), Guatemala (GT), Honduras (HN), Jamaica (JM), Mexico (MX), Panama (PA), Paraguay (PY), Peru (PE), United States (US), Uruguay (UY)
Asia-Pacific (22 countries):
Australia (AU), Bangladesh (BD), Cambodia (KH), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), Kazakhstan (KZ), Malaysia (MY), Myanmar (MM), Nepal (NP), New Zealand (NZ), Pakistan (PK), Philippines (PH), Singapore (SG), South Korea (KR), Sri Lanka (LK), Taiwan (TW), Thailand (TH), Uzbekistan (UZ), Vietnam (VN)
Middle East & Africa (21 countries):
Algeria (DZ), Cameroon (CM), Egypt (EG), Ethiopia (ET), Ghana (GH), Israel (IL), Jordan (JO), Kenya (KE), Lebanon (LB), Morocco (MA), Mozambique (MZ), Nigeria (NG), Rwanda (RW), Saudi Arabia (SA), Senegal (SN), South Africa (ZA), Tanzania (TZ), Tunisia (TN), Turkey (TR), Uganda (UG), United Arab Emirates (AE)
Additional countries and expanded coverage are planned for future releases.
Mixing Multiple Countries
When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.
Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):
To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
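Largest-remainder apportionment first gives each country the floor of its exact quota, then hands leftover rows to the largest fractional remainders. A standalone sketch of the method (illustrative, not Pointblank's internal code):

```python
from math import floor

def apportion(n: int, weights: dict[str, float]) -> dict[str, int]:
    # Exact (fractional) quota for each country after normalizing weights
    total = sum(weights.values())
    quotas = {k: n * w / total for k, w in weights.items()}
    counts = {k: floor(q) for k, q in quotas.items()}
    # Hand out the leftover rows to the largest fractional remainders
    leftover = n - sum(counts.values())
    for k in sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

assert apportion(200, {"US": 7, "DE": 2, "FR": 1}) == {"US": 140, "DE": 40, "FR": 20}
assert sum(apportion(97, {"US": 1, "DE": 1, "FR": 1}).values()) == 97
```

The second assertion shows why the remainder step matters: 97 rows over three equal weights cannot split evenly, yet the counts still sum to exactly `n`.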
By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:
All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.
Frequency-Weighted Sampling
By default, names and cities are sampled uniformly at random from the locale data, giving every entry the same probability of being selected. Real-world distributions are far from uniform though: “James” and “Maria” appear orders of magnitude more often than “Thaddeus” or “Xiomara”, and more people live in New York City than in Flagstaff. The weighted=True parameter makes generated data reflect this natural skew.
With weighting enabled you will see popular names like James, John, Mary, and Patricia appear more frequently, while unusual names surface only occasionally. Similarly, cities like New York, Los Angeles, and Chicago dominate the output while smaller cities appear less often.
The feature works by organizing locale data into four frequency tiers. Each tier has a sampling probability that determines how likely its members are to be selected:
Tier
Probability
Contents
very_common
45%
The top ~10% of entries by real-world frequency
common
30%
The next ~20% of entries
uncommon
20%
The next ~30% of entries
rare
5%
The remaining ~40% of entries
When a value is needed, a tier is first chosen according to these probabilities and then a single entry is picked uniformly at random within that tier. This two-step approach keeps sampling fast while producing a realistic long-tail distribution. Setting weighted=False pools all entries across every tier and samples them uniformly, which can be useful when you want an even spread rather than a realistic distribution.
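A minimal sketch of the two-step scheme, with toy tiers standing in for real locale data:

```python
import random

# Tier probabilities from the table above
TIER_PROBS = {"very_common": 0.45, "common": 0.30, "uncommon": 0.20, "rare": 0.05}

def sample_tiered(tiers: dict[str, list[str]], rng: random.Random) -> str:
    # Step 1: pick a tier according to the fixed tier probabilities
    names = [t for t in TIER_PROBS if tiers.get(t)]
    weights = [TIER_PROBS[t] for t in names]
    tier = rng.choices(names, weights=weights, k=1)[0]
    # Step 2: pick uniformly at random within the chosen tier
    return rng.choice(tiers[tier])

tiers = {
    "very_common": ["James", "Mary"],
    "common": ["Helen", "Mark"],
    "uncommon": ["Weston", "Hazel"],
    "rare": ["Thaddeus", "Xiomara"],
}
rng = random.Random(23)
draws = [sample_tiered(tiers, rng) for _ in range(10_000)]
common_hits = sum(d in {"James", "Mary"} for d in draws)
rare_hits = sum(d in {"Thaddeus", "Xiomara"} for d in draws)
assert common_hits > rare_hits
```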
Weighted sampling combines seamlessly with multi-country mixing. Each country’s batch uses its own tiered data independently, so a mixed dataset will have weighted US names alongside weighted German names:
All 100 supported country locales have tiered name and location data, so weighted=True produces realistic frequency distributions for every country.
Output Formats
The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.
Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.
Using Generated Data for Validation Testing
A common use case is generating test data to validate your validation rules:
```python
# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)
validation
```
Pointblank Validation
2026-02-26 | 14:33:03 · Polars

|  | STEP | COLUMNS | VALUES | EVAL | UNITS | PASS | FAIL | W | E | C | EXT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | col_vals_gt() | user_id | 0 | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 2 | col_vals_regex() | email | .+@.+\..+ | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 3 | col_vals_between() | age | [18, 100] | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
| 4 | col_vals_in_set() | status | active, pending, inactive | ✓ | 100 | 100 (1.00) | 0 (0.00) | — | — | — | — |
Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.
Pytest Fixture
When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest's plugin system.
The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:
the same test always produces the same data: no manual seed management required.
different tests get different seeds, so they exercise different datasets.
you can still pass an explicit seed= to override the automatic seed when needed.
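The guide doesn't specify how the automatic seed is derived, but a scheme like the following (hashing the test's node ID; purely illustrative, not the fixture's actual implementation) has exactly the properties listed above:

```python
import hashlib

def seed_for_test(nodeid: str, call_index: int = 0) -> int:
    # Derive a stable 32-bit seed from a test's fully-qualified name
    digest = hashlib.sha256(f"{nodeid}:{call_index}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

# Same test name -> same seed; different tests -> different seeds
a = seed_for_test("tests/test_users.py::test_signup")
b = seed_for_test("tests/test_users.py::test_signup")
c = seed_for_test("tests/test_orders.py::test_checkout")
assert a == b
assert a != c
```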
Basic Usage
Use it by adding generate_dataset to your test function’s parameter list:
Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:
```python
def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0
```
Testing Across Locales
The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:
You can also use .default_seed to reproduce the exact dataset outside of pytest:
```python
# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb

df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)
```
Seed Stability
A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.
For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.
Conclusion
Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:
quickly prototype validation rules before working with production data
create reproducible test fixtures for automated testing and CI/CD pipelines
generate locale-specific data for internationalization testing across 100 countries
ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
produce datasets of any size with consistent, realistic values
Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.