generate_dataset() function

Generate synthetic test data from a schema.

USAGE

generate_dataset(schema, n=100, seed=None, output='polars', country='US')

This function generates random data that conforms to a schema’s column definitions. When the schema is defined using Field objects with constraints (e.g., min_val, max_val, pattern, preset), the generated data will respect those constraints.

This is a convenience function that wraps Schema.generate() for a more functional style of usage, similar to how load_dataset() loads built-in datasets.

Parameters

schema : Schema

The schema object defining the structure and constraints of the data to generate.

n : int = 100

Number of rows to generate. Default is 100.

seed : int | None = None

Random seed for reproducibility. If provided, the same seed will produce the same data. Default is None (non-deterministic).
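The seeding contract can be illustrated without pointblank. The sketch below uses a hypothetical stand-in, toy_generate(), built on Python's random.Random. It is not how generate_dataset() is implemented. It only demonstrates the behaviour described above: the same seed yields the same data, while seed=None draws fresh entropy on each call.

```python
import random


def toy_generate(n, seed=None):
    """Hypothetical stand-in for a seeded generator (not pointblank code)."""
    rng = random.Random(seed)  # seed=None seeds from OS entropy or the clock
    return {
        # Integers constrained to [18, 100], analogous to min_val=18, max_val=100
        "age": [rng.randint(18, 100) for _ in range(n)],
        # Values drawn from a fixed set, analogous to allowed=[...]
        "status": [rng.choice(["active", "pending", "inactive"]) for _ in range(n)],
    }


first = toy_generate(5, seed=23)
second = toy_generate(5, seed=23)

# Same seed, same data
print(first == second)  # True
```

Passing a fixed seed is what makes test fixtures stable across runs; leaving it as None suits exploratory use where variety matters more than repeatability.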

output : Literal['polars', 'pandas', 'dict'] = 'polars'

Output format for the generated data. Options are: (1) "polars" (default) returns a Polars DataFrame, (2) "pandas" returns a Pandas DataFrame, and (3) "dict" returns a dictionary of lists.
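With output="dict" the result is column-oriented: one key per column, each mapping to a list of values. The sketch below shows one way to turn that shape into row records using only the standard library. The columns dictionary is a hand-written placeholder with the same shape, not real generator output, and to_rows() is a hypothetical helper.

```python
def to_rows(columns):
    """Convert a dict of equal-length lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Placeholder data shaped like a dict-of-lists result (values are made up)
columns = {
    "user_id": [1, 2, 3],
    "status": ["active", "pending", "inactive"],
}

rows = to_rows(columns)
print(rows[0])  # {'user_id': 1, 'status': 'active'}
```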

country : str = 'US'

Country code for realistic data generation when using presets (e.g., preset="email", preset="address"). Accepts ISO 3166-1 alpha-2 codes (e.g., "US", "DE", "FR") or alpha-3 codes (e.g., "USA", "DEU", "FRA"). Default is "US".

Returns

DataFrame or dict

Generated data in the requested format.

Raises

ValueError

If the schema has no columns or if constraints cannot be satisfied.

ImportError

If required optional dependencies are not installed.

Supported Countries

The country= parameter selects the country whose conventions are used when generating realistic data with presets (e.g., preset="email", preset="address"). It affects location-specific formats such as addresses, phone numbers, and postal codes. Currently, 50 countries are supported with full locale data:

Europe (32 countries): Austria ("AT"), Belgium ("BE"), Bulgaria ("BG"), Croatia ("HR"), Cyprus ("CY"), Czech Republic ("CZ"), Denmark ("DK"), Estonia ("EE"), Finland ("FI"), France ("FR"), Germany ("DE"), Greece ("GR"), Hungary ("HU"), Iceland ("IS"), Ireland ("IE"), Italy ("IT"), Latvia ("LV"), Lithuania ("LT"), Luxembourg ("LU"), Malta ("MT"), Netherlands ("NL"), Norway ("NO"), Poland ("PL"), Portugal ("PT"), Romania ("RO"), Russia ("RU"), Slovakia ("SK"), Slovenia ("SI"), Spain ("ES"), Sweden ("SE"), Switzerland ("CH"), United Kingdom ("GB")

Americas (7 countries): Argentina ("AR"), Brazil ("BR"), Canada ("CA"), Chile ("CL"), Colombia ("CO"), Mexico ("MX"), United States ("US")

Asia-Pacific (10 countries): Australia ("AU"), China ("CN"), Hong Kong ("HK"), India ("IN"), Indonesia ("ID"), Japan ("JP"), New Zealand ("NZ"), Philippines ("PH"), South Korea ("KR"), Taiwan ("TW")

Middle East (1 country): Turkey ("TR")
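The region lists above can be captured as data, which is handy for checking a country code before calling generate_dataset(). This is a sketch: SUPPORTED, ALL_CODES, and is_supported() are hypothetical names, the codes are copied from the lists above, and only alpha-2 codes are checked (alpha-3 lookup is left out).

```python
# Alpha-2 codes copied from the region lists in this section
SUPPORTED = {
    "Europe": (
        "AT BE BG HR CY CZ DK EE FI FR DE GR HU IS IE IT "
        "LV LT LU MT NL NO PL PT RO RU SK SI ES SE CH GB"
    ).split(),
    "Americas": "AR BR CA CL CO MX US".split(),
    "Asia-Pacific": "AU CN HK IN ID JP NZ PH KR TW".split(),
    "Middle East": ["TR"],
}

# Flatten the per-region lists into one set for fast membership checks
ALL_CODES = {code for codes in SUPPORTED.values() for code in codes}


def is_supported(code):
    """Return True if an alpha-2 code appears in the documented lists."""
    return code in ALL_CODES


print(len(ALL_CODES))      # 50
print(is_supported("DE"))  # True
print(is_supported("ZZ"))  # False
```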

Examples


Generate test data from a schema with field constraints:

import pointblank as pb

schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars, 100 rows, 4 columns

     user_id              email                         age    status
     Int64                String                        Int64  String
1    7188536481533917197  vivienne.rios@gmail.com       55     pending
2    2674009078779859984  williamschaefer@aol.com       28     active
3    7652102777077138151  lilyhansen@hotmail.com        20     active
4    157503859921753049   shirley.mays27@aol.com        93     inactive
5    2829213282471975080  sean.dawson29@aol.com         57     pending
...
96   7027508096731143831  kathryn.green@hotmail.com     68     active
97   6055996548456656575  dmorris@yahoo.com             20     inactive
98   3822709996092631588  williamcooper@protonmail.com  38     inactive
99   1522653102058131295  l_sawyer@zoho.com             46     active
100  5690877051669225499  paisley_sandoval@gmail.com    19     pending

Generate data from a simple dtype-only schema as a Pandas DataFrame:

schema = pb.Schema(name="String", age="Int64", active="Boolean")
pb.preview(pb.generate_dataset(schema, n=50, seed=23, output="pandas"))
Pandas, 50 rows, 3 columns

    name                 age                   active
    str                  int64                 bool
1   51fbLtByHw           -1406612057389349638  False
2   UmrCa                -2617964757147985650  False
3   ND5bgfTF             -5681649629593590626  False
4   bGOUBwXdnYcLxQ       -8963716282372353309  True
5   NnVxKW               -7269866261640175410  False
...
46  8VQTQ3rUkjMe         6777163490966252062   True
47  ZGDIWh7eBERjPZthNbW  4534912642422597042   False
48  MnIPm2wYtrTsBF6I8    -7714433421897454051  False
49  sv9VboYQKY5JjeSX8i   -4108772566563722234  True
50  S6tq                 -7629746523602015996  True

Generate data with German addresses by using country="DE":

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    address=pb.string_field(preset="address"),
    city=pb.string_field(preset="city"),
)
pb.preview(pb.generate_dataset(schema, n=20, seed=23, country="DE"))
Polars, 20 rows, 3 columns

    name               address                                                    city
    String             String                                                     String
1   Ottokar Wittmann   Brückenstraße 8995, Wohnung 110, 60542 Sachsenhausen       Sachsenhausen
2   Annette Feldmann   Hein-Hoyer-Straße 8078, 20358 St. Pauli                    St. Pauli
3   Martina Eisenberg  Falckensteinstraße 6276, 10970 Kreuzberg                   Kreuzberg
4   Klaus Fabian       Kavalierstraße 6446, Wohnung 998, 06230 Dessau-Roßlau      Dessau-Roßlau
5   Ludwig Fröhlich    Biebricher Allee 932, 65715 Wiesbaden                      Wiesbaden
...
16  Franz Eberhardt    Königsallee 2838, 44616 Bochum                             Bochum
17  Kilian Heinze      Schwanseestraße 7868, Wohnung 539, 99882 Weimar            Weimar
18  Margit Anders      Braunschweiger Straße 4349, Wohnung 885, 38130 Wolfsburg   Wolfsburg
19  Eleonore Witte     Prenzlauer Allee 6183, Wohnung 422, 10479 Prenzlauer Berg  Prenzlauer Berg
20  Ida Förster        Walddörferstraße 5238, Wohnung 281, 22054 Wandsbek         Wandsbek