Test Data Generation

Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.

Note

Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.

Quick Start

Generate test data using a schema with field constraints:

import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 5 columns

       user_id (Int64)      name (String)      email (String)                 age (Int64)  status (String)
    1  7188536481533917197  Vivienne Rios      vivienne.rios@gmail.com        77           pending
    2  2674009078779859984  William Schaefer   williamschaefer@aol.com        67           active
    3  7652102777077138151  Lily Hansen        lilyhansen@hotmail.com         78           active
    4  157503859921753049   Shirley Mays       shirley.mays27@aol.com         36           inactive
    5  2829213282471975080  Sean Dawson        sean.dawson29@aol.com          75           pending
    …
   96  7027508096731143831  Kathryn Green      kathryn.green@hotmail.com      55           active
   97  6055996548456656575  Daniel Morris      dmorris@yahoo.com              39           inactive
   98  3822709996092631588  William Cooper     williamcooper@protonmail.com   24           inactive
   99  1522653102058131295  Lane Sawyer        l_sawyer@zoho.com              41           active
  100  5690877051669225499  Paisley Sandoval   paisley_sandoval@gmail.com     75           pending

Field Types

Pointblank provides helper functions for defining typed columns with constraints:

  Function           Description        Key Parameters
  int_field()        Integer columns    min_val, max_val, allowed, unique
  float_field()      Float columns      min_val, max_val, allowed
  string_field()     String columns     preset, pattern, allowed, unique
  bool_field()       Boolean columns    p_true (probability of True)
  date_field()       Date columns       min_val, max_val
  datetime_field()   Datetime columns   min_val, max_val
  time_field()       Time columns       min_val, max_val
  duration_field()   Duration columns   min_val, max_val

Integer Fields

Integer fields support range constraints with min_val and max_val, discrete allowed values with allowed, and uniqueness enforcement with unique=True:

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    quantity=pb.int_field(min_val=1, max_val=100),
    rating=pb.int_field(allowed=[1, 2, 3, 4, 5]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns

       id (Int64)   quantity (Int64)   rating (Int64)
    1  5749         100                3
    2  2368         38                 1
    3  1279         11                 1
    4  6025         3                  5
    5  7942         76                 3
    …
   96  5330         64                 2
   97  8634         31                 1
   98  9982         43                 2
   99  4221         70                 1
  100  8520         19                 5

The unique=True constraint ensures no duplicate values appear in that column, which is useful for generating primary keys or identifiers.
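Conceptually, unique=True amounts to sampling without replacement: each value is drawn from the pool of values not yet used. A minimal pure-Python sketch of the idea (illustrative only, not Pointblank's actual implementation):

```python
import random

def unique_ints(n, min_val, max_val, seed=None):
    """Draw n distinct integers from [min_val, max_val]."""
    # Sampling without replacement guarantees no duplicates, but only
    # works when the range holds at least n distinct values.
    if max_val - min_val + 1 < n:
        raise ValueError("range too small for n unique values")
    rng = random.Random(seed)
    return rng.sample(range(min_val, max_val + 1), n)

ids = unique_ints(100, 1000, 9999, seed=23)
assert len(set(ids)) == 100                  # all values distinct
assert all(1000 <= i <= 9999 for i in ids)   # all values in range
```

The guard clause also explains why a unique constraint needs a wide enough range: asking for more distinct values than the range contains cannot succeed.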

Float Fields

Float fields work similarly to integers, with min_val and max_val defining the range of generated values:

schema = pb.Schema(
    price=pb.float_field(min_val=0.0, max_val=1000.0),
    discount=pb.float_field(min_val=0.0, max_val=0.5),
    temperature=pb.float_field(min_val=-40.0, max_val=50.0),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns

       price (Float64)     discount (Float64)     temperature (Float64)
    1  924.8652516259452   0.4624326258129726     43.23787264633508
    2  948.6057779931772   0.47430288899658857    45.37452001938594
    3  892.4333440485793   0.44621667202428966    40.31900096437214
    4  83.55067683068363   0.04177533841534181    -32.48043908523847
    5  592.0272268857353   0.29601361344286764    13.282450419716177
    …
   96  444.6925279641446   0.2223462639820723     0.022327516773010814
   97  342.7762214585577   0.17138811072927884    -9.150140068729808
   98  892.3288689140903   0.4461644344570452     40.309598202268134
   99  813.7559456012128   0.4068779728006064     33.238035104109144
  100  895.1816604808429   0.44759083024042146    40.56634944327587

Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.
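A uniform draw on [min_val, max_val) is just a unit-interval draw scaled and shifted, which is roughly what happens under the hood (a simplified sketch, not the library's actual code):

```python
import random

def uniform_floats(n, min_val, max_val, seed=None):
    """Draw n floats uniformly from [min_val, max_val)."""
    rng = random.Random(seed)
    # Scale a draw from [0, 1) into the target range.
    return [min_val + rng.random() * (max_val - min_val) for _ in range(n)]

temps = uniform_floats(100, -40.0, 50.0, seed=23)
assert all(-40.0 <= t < 50.0 for t in temps)
```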

String Fields with Presets

Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:

schema = pb.Schema(
    full_name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 4 columns

       full_name (String)   email (String)                 company (String)                 city (String)
    1  Kingston Miller      k_miller@zoho.com              Innovative Systems Enterprises   Hollywood
    2  Kaden Mosley         kaden.mosley9@protonmail.com   Prime Investments                Santa Ana
    3  Brooks Wilkerson     brooks703@yahoo.com            Creative Commerce Inc            Rochester
    4  Juliana Mitchell     jmitchell@zoho.com             Diaz LLC                         Bloomington
    5  Barbara Walters      barbara662@icloud.com          Warner Bros.                     Toledo
    …
   96  Cheryl Robinson      cheryl.robinson@zoho.com       Wood International               Henderson
   97  Elijah Cunningham    ecunningham22@hotmail.com      First Analytics                  Aurora
   98  Magnolia Mosley      magnolia_mosley@aol.com        Disney                           Vancouver
   99  Stella Gray          stella_gray@mail.com           Berry Associates                 Syracuse
  100  Harrison Allen       harrison.allen25@outlook.com   Hughes Solutions                 Plano

This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
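As a rough mental model, the email's local part is built from the already-generated name rather than drawn independently. A toy sketch of that idea (the local-part formats and domains below are assumptions for illustration; Pointblank's real derivation rules may differ):

```python
import random

def coherent_email(full_name, seed=None):
    """Derive an email address from a generated name, as a toy model of field coherence."""
    rng = random.Random(seed)
    first, last = full_name.lower().split()
    # Pick one of several plausible local-part formats seen in real addresses.
    local = rng.choice([
        f"{first}.{last}",                       # e.g. kathryn.green
        f"{first}_{last}",                       # e.g. paisley_sandoval
        f"{first[0]}{last}",                     # e.g. dmorris
        f"{first}{last}{rng.randint(1, 99)}",    # e.g. brooks703-style suffix
    ])
    domain = rng.choice(["gmail.com", "yahoo.com", "aol.com", "protonmail.com"])
    return f"{local}@{domain}"

print(coherent_email("Lily Hansen", seed=23))
```

Because the local part is a function of the name, the two columns stay consistent row by row, which independent draws could not guarantee.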

String Fields with Patterns

Use regex patterns to generate strings matching specific formats:

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    phone=pb.string_field(pattern=r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"),
    license_plate=pb.string_field(pattern=r"[A-Z]{2}[0-9]{2} [A-Z]{3}"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns

       product_code (String)   phone (String)   license_plate (String)
    1  CAS-6685                (109) 668-2347   CA66 EXG
    2  XGI-0397                (397) 117-0865   OA97 DCW
    3  DCW-6086                (309) 293-9594   NA50 ZBS
    4  YBG-9529                (917) 797-2285   VF59 SJV
    5  XLS-9459                (911) 609-9495   SC27 VUF
    …
   96  THG-2900                (993) 511-5415   WN20 ICH
   97  CHC-3681                (065) 802-0822   EU51 PID
   98  HKT-3552                (927) 701-4276   HZ77 GVO
   99  OEW-4157                (365) 419-1062   PC36 VCL
  100  FSX-8948                (897) 459-3038   YW40 TXG

Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
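One way to sanity-check pattern-based generation is to match every value back against the source regex with re.fullmatch(). A stdlib sketch that hand-rolls the [A-Z]{3}-[0-9]{4} product-code pattern and verifies it (an illustration of the check, not how Pointblank generates from patterns):

```python
import random
import re
import string

def product_code(rng):
    """Generate one string matching the pattern [A-Z]{3}-[0-9]{4}."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    return f"{letters}-{digits}"

rng = random.Random(23)
codes = [product_code(rng) for _ in range(100)]
# fullmatch (rather than match/search) ensures the whole string conforms.
assert all(re.fullmatch(r"[A-Z]{3}-[0-9]{4}", c) for c in codes)
```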

Boolean Fields

Control the probability of True values:

schema = pb.Schema(
    is_active=pb.bool_field(p_true=0.8),      # 80% True
    is_premium=pb.bool_field(p_true=0.2),     # 20% True
    is_verified=pb.bool_field(),              # 50% True (default)
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns

       is_active (Boolean)   is_premium (Boolean)   is_verified (Boolean)
    1  False                 False                  False
    2  False                 False                  False
    3  False                 False                  False
    4  True                  True                   True
    5  True                  False                  False
    …
   96  True                  False                  True
   97  True                  False                  True
   98  False                 False                  False
   99  False                 False                  False
  100  False                 False                  False

This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.
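Under the hood this can be thought of as one Bernoulli draw per row: a uniform sample compared against p_true. A quick sketch showing that the empirical rate tracks the requested probability (illustrative, not the library's code):

```python
import random

def bool_column(n, p_true=0.5, seed=None):
    """Each value is True with probability p_true (a Bernoulli draw per row)."""
    rng = random.Random(seed)
    return [rng.random() < p_true for _ in range(n)]

col = bool_column(10_000, p_true=0.8, seed=23)
rate = sum(col) / len(col)
assert 0.77 < rate < 0.83  # empirical rate hovers near p_true
```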

Date and Datetime Fields

Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:

from datetime import date, datetime

schema = pb.Schema(
    birth_date=pb.date_field(
        min_val=date(1960, 1, 1),
        max_val=date(2005, 12, 31)
    ),
    created_at=pb.datetime_field(
        min_val=datetime(2024, 1, 1),
        max_val=datetime(2024, 12, 31)
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 2 columns

       birth_date (Date)   created_at (Datetime)
    1  1986-01-03          2024-12-25 04:22:08
    2  1967-06-30          2024-10-29 16:22:23
    3  1961-07-13          2024-04-22 14:13:08
    4  1987-07-09          2024-12-12 14:04:53
    5  1998-01-06          2024-11-18 04:49:47
    …
   96  1969-04-14          2024-07-29 13:15:44
   97  1975-03-23          2024-04-28 08:49:29
   98  1981-05-29          2024-12-13 09:42:37
   99  1982-09-14          2024-10-28 23:35:39
  100  1968-12-21          2024-06-25 14:22:27

The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
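Uniform generation over a date range can be pictured as picking a random day offset within the span between the two boundaries. A stdlib sketch of the idea (not Pointblank's actual implementation):

```python
import random
from datetime import date, timedelta

def uniform_dates(n, min_val, max_val, seed=None):
    """Draw n dates uniformly (by day) from the inclusive range [min_val, max_val]."""
    rng = random.Random(seed)
    span = (max_val - min_val).days
    return [min_val + timedelta(days=rng.randint(0, span)) for _ in range(n)]

births = uniform_dates(100, date(1960, 1, 1), date(2005, 12, 31), seed=23)
assert all(date(1960, 1, 1) <= d <= date(2005, 12, 31) for d in births)
```

Datetimes work the same way with a second-level (or finer) offset instead of a day-level one.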

Available Presets

The preset= parameter in string_field() supports many data types:

Personal Data:

  • name: full name (first + last)
  • name_full: full name with potential prefix/suffix
  • first_name: first name only
  • last_name: last name only
  • email: email address

Location Data:

  • address: full street address
  • city: city name
  • state: state/province name
  • country: country name
  • postcode: postal/ZIP code
  • latitude: latitude coordinate
  • longitude: longitude coordinate

Business Data:

  • company: company name
  • job: job title
  • catch_phrase: business catch phrase

Internet Data:

  • url: website URL
  • domain_name: domain name
  • ipv4: IPv4 address
  • ipv6: IPv6 address
  • user_name: username
  • password: password

Financial Data:

  • credit_card_number: credit card number
  • iban: International Bank Account Number
  • currency_code: currency code (USD, EUR, etc.)

Identifiers:

  • uuid4: UUID version 4
  • ssn: Social Security Number (US format)
  • license_plate: vehicle license plate

Text:

  • word: single word
  • sentence: full sentence
  • paragraph: paragraph of text
  • text: multiple paragraphs

Miscellaneous:

  • color_name: color name
  • file_name: file name
  • file_extension: file extension
  • mime_type: MIME type

Country-Specific Data

One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.

Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:

# Schema with linked location fields
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    address=pb.string_field(preset="address"),
    postcode=pb.string_field(preset="postcode"),
    latitude=pb.string_field(preset="latitude"),
    longitude=pb.string_field(preset="longitude"),
)

Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="DE"))
Polars · 200 rows × 6 columns

       name (String)         city (String)     address (String)                                               postcode (String)  latitude (String)  longitude (String)
    1  Ines Flohr            Sachsenhausen     Paradiesgasse 8446, 60559 Sachsenhausen                        60546              50.103839          8.696611
    2  Joachim Pohlmann      St. Pauli         Talstraße 3672, 20302 St. Pauli                                20371              53.547930          9.963469
    3  Elfriede Sander       Kreuzberg         Mariannenstraße 990, 10911 Kreuzberg                           10976              52.498599          13.407795
    4  Wilhelm Opitz         Dessau-Roßlau     Bauhausstraße 1418, 06784 Dessau-Roßlau                        06690              51.800731          12.186808
    5  Ursula Westphal       Wiesbaden         Langgasse 8328, Wohnung 8, 65082 Wiesbaden                     65705              50.099696          8.274693
    …
  196  Hildegard Reinhardt   Berlin            Rosa-Luxemburg-Straße 1748, Wohnung 435, 10693 Berlin          10914              52.409726          13.555846
  197  Arnold Münz           Stuttgart         Lautenschlagerstraße 9997, Wohnung 810, 70523 Stuttgart        70136              48.727147          9.086857
  198  Dominik Bachmann      Ulm               Neue Straße 106, 89350 Ulm                                     89332              48.420703          9.974251
  199  Alexander Busch       Prenzlauer Berg   Fehrbelliner Straße 9073, Wohnung 623, 10481 Prenzlauer Berg   10480              52.558487          13.418282
  200  Bianca Bollmann       Augsburg          Hochfeldstraße 8381, Wohnung 18, 86343 Augsburg                86950              48.409668          10.856927

Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="JP"))
Polars · 200 rows × 6 columns

       name (String)       city (String)   address (String)                                           postcode (String)  latitude (String)  longitude (String)
    1  Tadayuki Hara       Fukuyama        720-8233 Hiroshima Fukuyama Higashisakura-cho 9502-435     720-1815           34.494863          133.394347
    2  Takafumi Kato       Kamakura        248-7907 Kanagawa Kamakura Yuigahama 8877-274              248-1448           35.322437          139.537618
    3  Gota Kashiwagi      Ikebukuro       171-8305 Tokyo Ikebukuro Ikebukuro 6360-827                171-5788           35.726229          139.725434
    4  Shinya Fujishima    Mihara          723-0842 Hiroshima Mihara Minato-cho 8740-210              723-7551           34.419333          133.075846
    5  Nodoka Kuwata       Fuji            416-8257 Shizuoka Fuji Yoshiwara 2536                      416-7451           35.185194          138.686510
    …
  196  Manami Inagawa      Nagaoka         940-6170 Niigata Nagaoka Omachi 5565                       940-3502           37.435046          138.841454
  197  Takuya Komori       Chiba           260-6887 Chiba Chiba Chiba-eki 553                         260-3254           35.604613          140.178219
  198  Hayato Sakamoto     Kure            737-3564 Hiroshima Kure Chuo 1183-587                      737-7583           34.262447          132.580660
  199  Mitsuko Tateno      Chigasaki       253-6710 Kanagawa Chigasaki Minami-koide 5131              253-2692           35.324888          139.419479
  200  Toshiko Tominaga    Hakodate        040-8228 Hokkaido Hakodate Omoricho 2970-334               040-5213           41.778725          140.774551

Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="BR"))
Polars · 200 rows × 6 columns

       name (String)      city (String)     address (String)                                                                       postcode (String)  latitude (String)  longitude (String)
    1  Alice Batista      Porto Velho       Rua Getúlio Vargas, 490, Apto 834, 76897-338 Porto Velho - RO                          76890-533          -8.731745          -63.916302
    2  Rodrigo Alves      Aracaju           Rua Capela, 8878, Apto 348, 49117-586 Aracaju - SE                                     49819-954          -10.910831         -37.067142
    3  Luiz Dias          Porto Alegre      Avenida Farrapos, 3749, 90245-305 Porto Alegre - RS                                    90445-930          -30.063047         -51.199116
    4  Letícia Costa      Ribeirão Preto    Rua Duque de Caxias, 4782, Apto 986, 14331-290 Ribeirão Preto - SP                     14395-739          -21.165961         -47.782629
    5  Roberto Araújo     Londrina          Rua Sergipe, 6118, 86318-136 Londrina - PR                                             86357-713          -23.333412         -51.108257
    …
  196  Matheus Cardoso    Curitiba          Rua XV de Novembro, 1400, Apto 798, 80327-342 Curitiba - PR                            80438-776          -25.402548         -49.271910
  197  André Santos       Duque de Caxias   Avenida Brigadeiro Lima e Silva, 6485, Apto 775, 25037-569 Duque de Caxias - RJ        25659-226          -22.767603         -43.325565
  198  Rafael Teixeira    Porto Alegre      Avenida Farrapos, 3353, Apto 354, 90184-367 Porto Alegre - RS                          90007-201          -30.058029         -51.156808
  199  Jéssica Correia    Campinas          Rua 13 de Maio, 3499, 13778-766 Campinas - SP                                          13205-817          -22.882406         -47.020113
  200  Ricardo Gomes      Campinas          Avenida Norte-Sul, 5677, Apto 862, 13272-067 Campinas - SP                             13121-589          -22.942932         -47.086221

This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.

Supported Countries

Pointblank currently supports 50 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").

Europe (32 countries):

  • Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)

Americas (7 countries):

  • Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Mexico (MX), United States (US)

Asia-Pacific (10 countries):

  • Australia (AU), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), New Zealand (NZ), Philippines (PH), South Korea (KR), Taiwan (TW)

Middle East (1 country):

  • Turkey (TR)

Additional countries and expanded coverage are planned for future releases.
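Accepting both code styles typically reduces to normalizing alpha-3 input through a lookup table before selecting locale data. A hypothetical sketch of such a lookup (normalize_country() is an illustrative helper, not part of Pointblank's API, and the mapping shown covers only a few of the 50 supported countries):

```python
# Hypothetical alpha-3 -> alpha-2 lookup for a handful of supported countries.
ALPHA3_TO_ALPHA2 = {"USA": "US", "DEU": "DE", "JPN": "JP", "BRA": "BR", "GBR": "GB"}

def normalize_country(code):
    """Accept an ISO 3166-1 alpha-2 or alpha-3 code; return the alpha-2 form."""
    code = code.upper()
    if len(code) == 2:
        return code                     # already alpha-2
    return ALPHA3_TO_ALPHA2[code]       # raises KeyError for unknown alpha-3 codes

assert normalize_country("usa") == "US"
assert normalize_country("DE") == "DE"
```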

Output Formats

The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.

schema = pb.Schema(
    id=pb.int_field(min_val=1),
    name=pb.string_field(preset="name"),
)

The default output is a Polars DataFrame, which offers excellent performance and a modern API for data manipulation:

# Polars DataFrame (default)
polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")
pb.preview(polars_df)
Polars · 100 rows × 2 columns

       id (Int64)           name (String)
    1  7188536481533917197  Vivienne Rios
    2  2674009078779859984  William Schaefer
    3  7652102777077138151  Lily Hansen
    4  157503859921753049   Shirley Mays
    5  2829213282471975080  Sean Dawson
    …
   96  7027508096731143831  Kathryn Green
   97  6055996548456656575  Daniel Morris
   98  3822709996092631588  William Cooper
   99  1522653102058131295  Lane Sawyer
  100  5690877051669225499  Paisley Sandoval

If your workflow uses Pandas, simply specify output="pandas" to get a Pandas DataFrame:

# Pandas DataFrame
pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")
pb.preview(pandas_df)
Pandas · 100 rows × 2 columns

       id (int64)           name (str)
    1  7188536481533917197  Vivienne Rios
    2  2674009078779859984  William Schaefer
    3  7652102777077138151  Lily Hansen
    4  157503859921753049   Shirley Mays
    5  2829213282471975080  Sean Dawson
    …
   96  7027508096731143831  Kathryn Green
   97  6055996548456656575  Daniel Morris
   98  3822709996092631588  William Cooper
   99  1522653102058131295  Lane Sawyer
  100  5690877051669225499  Paisley Sandoval

Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.

Using Generated Data for Validation Testing

A common use case is generating test data to validate your validation rules:

# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
Pointblank Validation
2026-02-09 | 16:00:05 · Polars

  STEP   VALIDATION           COLUMN    VALUES                      UNITS   PASS         FAIL
  1      col_vals_gt()        user_id   0                           100     100 (1.00)   0 (0.00)
  2      col_vals_regex()     email     .+@.+\..+                   100     100 (1.00)   0 (0.00)
  3      col_vals_between()   age       [18, 100]                   100     100 (1.00)   0 (0.00)
  4      col_vals_in_set()    status    active, pending, inactive   100     100 (1.00)   0 (0.00)

Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.

Conclusion

Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:

  • quickly prototype validation rules before working with production data
  • create reproducible test fixtures for automated testing and CI/CD pipelines
  • generate locale-specific data for internationalization testing across many countries
  • ensure coherent relationships between related fields like names, emails, and addresses
  • produce datasets of any size with consistent, realistic values

Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.