string_field() function

Create a string column specification for use in a schema.

USAGE

string_field(
    min_length=None,
    max_length=None,
    pattern=None,
    preset=None,
    allowed=None,
    nullable=False,
    null_probability=0.0,
    unique=False,
    generator=None,
)

The string_field() function defines the constraints and behavior for a string column when generating synthetic data with generate_dataset(). It provides three main modes of string generation: (1) controlled random strings with min_length=/max_length=, (2) strings matching a regular expression via pattern=, or (3) realistic data using preset= (e.g., "email", "name", "address"). You can also restrict values to a fixed set with allowed=. Only one of preset=, pattern=, or allowed= can be specified at a time.

When no special mode is selected, random alphanumeric strings are generated with lengths between min_length= and max_length= (defaulting to 1–20 characters).
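The default mode can be pictured with a short standard-library sketch. The helper name random_alphanumeric and its sampling strategy are assumptions made for illustration, not pointblank's actual implementation; only the fallback bounds of 1 and 20 come from the description above.

```python
import random
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def random_alphanumeric(rng, min_length=None, max_length=None):
    """Hypothetical helper: draw one random alphanumeric string."""
    # Fall back to the documented defaults when a bound is left as None.
    lo = 1 if min_length is None else min_length
    hi = 20 if max_length is None else max_length
    if lo > hi:
        raise ValueError("min_length exceeds max_length")
    length = rng.randint(lo, hi)  # randint is inclusive on both ends
    return "".join(rng.choice(ALPHANUMERIC) for _ in range(length))


rng = random.Random(7)
samples = [random_alphanumeric(rng, min_length=3, max_length=5) for _ in range(5)]
print(samples)
```

Passing an explicit random.Random instance keeps the sketch reproducible, in the same spirit as the seed= argument used in the examples further down.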

Parameters

min_length : int | None = None: Minimum string length for random string generation. When None (the default), a minimum of 1 is used. Only applies when preset=, pattern=, and allowed= are all None.
max_length : int | None = None: Maximum string length for random string generation. When None (the default), a maximum of 20 is used. Only applies when preset=, pattern=, and allowed= are all None.
pattern : str | None = None: Regular expression pattern that generated strings must match. Supports character classes (e.g., [A-Z], [0-9]), quantifiers (e.g., {3}, {2,5}), alternation, and groups. Cannot be combined with preset= or allowed=.
preset : str | None = None: Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the country= parameter of generate_dataset() controls the locale. Cannot be combined with pattern= or allowed=. See the Available Presets section below for the full list.
allowed : list[str] | None = None: List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with preset= or pattern=.
nullable : bool = False: Whether the column can contain null values. Default is False.
null_probability : float = 0.0: Probability of generating a null value for each row when nullable=True. Must be between 0.0 and 1.0. Default is 0.0.
unique : bool = False: Whether all values must be unique. Default is False. When True, the generator retries until it has produced n distinct values, where n is the number of rows requested from generate_dataset().
generator : Callable[[], Any] | None = None: Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.
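As a rough illustration of the generator= parameter, a custom generator is just a zero-argument function that returns one string per call. The function make_ticket_id and its "TCK-" format are hypothetical; the commented line at the end shows how such a callable would presumably be passed, based only on the signature above.

```python
import random

# Module-level RNG so that repeated calls advance a single seeded stream.
_rng = random.Random(42)


def make_ticket_id() -> str:
    """Hypothetical zero-argument generator returning one string per call."""
    return f"TCK-{_rng.randint(0, 99999):05d}"


values = [make_ticket_id() for _ in range(3)]
print(values)

# With pointblank installed, the callable would be handed over like this
# (an assumption based on the documented signature):
#   pb.string_field(generator=make_ticket_id)
```

Because the callable overrides all other constraints, any length or format rules have to be enforced inside the function itself.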

Returns

StringField: A string field specification that can be passed to Schema().

Raises

ValueError: If more than one of preset=, pattern=, or allowed= is specified; if allowed= is an empty list; if min_length= or max_length= is negative; if min_length= exceeds max_length=; or if preset= is not a recognized preset name.
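The conditions listed above can be restated as a small standalone checker. This is a sketch of the documented rules only: the function name check_string_field_args, the error messages, and the tiny known_presets set are assumptions, and the library's real checks may differ in order and wording.

```python
def check_string_field_args(
    min_length=None,
    max_length=None,
    pattern=None,
    preset=None,
    allowed=None,
    known_presets=frozenset({"name", "email", "address"}),  # illustrative subset
):
    """Hypothetical restatement of the documented ValueError conditions."""
    # At most one of the three special modes may be selected.
    modes = [m for m in (preset, pattern, allowed) if m is not None]
    if len(modes) > 1:
        raise ValueError("only one of preset=, pattern=, allowed= may be given")
    if allowed is not None and len(allowed) == 0:
        raise ValueError("allowed= must not be an empty list")
    for name, value in (("min_length", min_length), ("max_length", max_length)):
        if value is not None and value < 0:
            raise ValueError(f"{name}= must not be negative")
    if min_length is not None and max_length is not None and min_length > max_length:
        raise ValueError("min_length= exceeds max_length=")
    if preset is not None and preset not in known_presets:
        raise ValueError(f"unknown preset: {preset!r}")


def raises_value_error(**kwargs):
    """Return True when the checker rejects the given arguments."""
    try:
        check_string_field_args(**kwargs)
    except ValueError:
        return True
    return False


print(raises_value_error(preset="email", pattern=r"[A-Z]{3}"))  # True
print(raises_value_error(allowed=[]))                           # True
print(raises_value_error(min_length=5, max_length=2))           # True
print(raises_value_error(preset="email"))                       # False
```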

Available Presets

The preset= parameter accepts one of the following preset names, organized by category. When a preset is used, the country= parameter of generate_dataset() controls the locale for region-specific formatting (e.g., address formats, phone number patterns).

Personal: "name" (first + last name), "name_full" (full name with possible prefix or suffix), "first_name", "last_name", "email" (realistic email address), "phone_number", "address" (full street address), "city", "state", "country", "country_code_2" (ISO 3166-1 alpha-2 code, e.g., "US"), "country_code_3" (ISO 3166-1 alpha-3 code, e.g., "USA"), "postcode", "latitude", "longitude"

Business: "company" (company name), "job" (job title), "catch_phrase"

Internet: "url", "domain_name", "ipv4", "ipv6", "user_name", "password"

Text: "text" (paragraph of text), "sentence", "paragraph", "word"

Financial: "credit_card_number", "iban", "currency_code"

Identifiers: "uuid4", "md5" (MD5 hash, 32 hex chars), "sha1" (SHA-1 hash, 40 hex chars), "sha256" (SHA-256 hash, 64 hex chars), "ssn" (social security number), "license_plate"

Barcodes: "ean8" (EAN-8 barcode with valid check digit), "ean13" (EAN-13 barcode with valid check digit)
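To make "valid check digit" concrete, the standard EAN checksum can be verified with a few lines of Python. This is a sketch of the standard EAN check-digit rule, not pointblank code; the helper names are made up for illustration, and the two sample barcodes are commonly cited valid examples rather than generated output.

```python
def ean_check_digit(data_digits: str) -> int:
    """Check digit for the 7 (EAN-8) or 12 (EAN-13) leading digits."""
    if len(data_digits) not in (7, 12) or not data_digits.isdigit():
        raise ValueError("expected 7 or 12 digits")
    # Weights alternate 3, 1, 3, ... starting from the rightmost data digit.
    total = sum(
        int(d) * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(data_digits))
    )
    return (10 - total % 10) % 10


def is_valid_ean(code: str) -> bool:
    """True when an 8- or 13-digit code ends in the correct check digit."""
    return (
        len(code) in (8, 13)
        and code.isdigit()
        and ean_check_digit(code[:-1]) == int(code[-1])
    )


print(is_valid_ean("4006381333931"))  # True  (EAN-13)
print(is_valid_ean("73513537"))       # True  (EAN-8)
print(is_valid_ean("4006381333932"))  # False (wrong check digit)
```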

Date/Time (as strings): "date_this_year", "date_this_decade", "date_between" (random date between 2000 and 2025), "date_range" (two dates joined with an en-dash, e.g., "2012-05-12 – 2015-11-22"), "future_date" (up to 1 year ahead), "past_date" (up to 10 years back), "time"
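Because these presets return strings, downstream code usually has to parse them. The sketch below splits a "date_range" value back into two dates; it assumes the separator is exactly space, en-dash, space and that both dates are ISO formatted, as in the example above. The helper parse_date_range is hypothetical.

```python
from datetime import date

EN_DASH_SEPARATOR = " \u2013 "  # space, en-dash (U+2013), space


def parse_date_range(value: str) -> tuple[date, date]:
    """Hypothetical helper: split a 'start – end' string into two dates."""
    start_text, end_text = value.split(EN_DASH_SEPARATOR)
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


start, end = parse_date_range("2012-05-12 \u2013 2015-11-22")
print(start, end, start < end)
```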

Miscellaneous: "color_name", "file_name", "file_extension", "mime_type", "user_agent" (browser user agent string with country-specific browser weighting)

Coherent Data Generation

When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row. Specifically:

Person-related presets ("name", "name_full", "first_name", "last_name", "email", "user_name"): the email and username will be derived from the person’s name.
Address-related presets ("address", "city", "state", "postcode", "phone_number", "latitude", "longitude"): the city, state, and postcode will correspond to the same location within the address.

This coherence is automatic and requires no additional configuration.
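One way to picture row-level coherence is to draw a single person per row and derive every related field from it. The sketch below only illustrates that idea: the name lists, the example.com domains, and the first-initial username rule are all assumptions, and pointblank's own derivation rules are not documented here.

```python
import random

FIRST_NAMES = ["Doris", "Nancy", "George"]
LAST_NAMES = ["Martin", "Gonzalez", "Evans"]
DOMAINS = ["example.com", "example.org"]  # reserved documentation domains


def make_person_row(rng):
    """Hypothetical: build one row whose person fields all agree."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Derive the username from the name, then the email from the username.
    user_name = f"{first[0]}{last}".lower()
    return {
        "first_name": first,
        "last_name": last,
        "name": f"{first} {last}",
        "user_name": user_name,
        "email": f"{user_name}@{rng.choice(DOMAINS)}",
    }


rng = random.Random(23)
rows = [make_person_row(rng) for _ in range(3)]
for row in rows:
    print(row["name"], row["user_name"], row["email"])
```

Generating the columns independently would instead pair each name with an unrelated email, which is the mismatch the coherent presets avoid.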

Examples

The preset= parameter generates realistic personal data, while allowed= restricts values to a categorical set:

import pointblank as pb

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email", unique=True),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))

	name String	email String	status String
Polars, 100 rows, 3 columns
1	Doris Martin	d_martin@aol.com	pending
2	Nancy Gonzalez	nancygonzalez@icloud.com	active
3	Jessica Turner	jturner@aol.com	active
4	George Evans	georgeevans@zoho.com	inactive
5	Patricia Williams	pwilliams@outlook.com	pending
96	Isaiah Murphy	isaiah.murphy@zoho.com	active
97	Brittany Rodriguez	brodriguez@yandex.com	inactive
98	Megan Stevens	mstevens26@aol.com	inactive
99	Pamela Jenkins	pjenkins29@yandex.com	active
100	Stephanie Santos	stephanie.santos40@gmail.com	pending
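The unique=True constraint on the email column is documented as retrying until n distinct values exist. A minimal sketch of that retry idea follows; the function draw_unique and its max_attempts safety cap are assumptions, and the source does not say whether the library applies such a cap.

```python
import random


def draw_unique(generate, n, max_attempts=10_000):
    """Hypothetical: call generate() until n distinct values are collected."""
    seen = set()
    values = []
    attempts = 0
    while len(values) < n:
        if attempts >= max_attempts:
            # Guard against value spaces too small to ever yield n distinct items.
            raise RuntimeError("could not produce enough distinct values")
        attempts += 1
        candidate = generate()
        if candidate not in seen:
            seen.add(candidate)
            values.append(candidate)
    return values


rng = random.Random(23)
codes = draw_unique(lambda: f"{rng.randint(0, 99):02d}", n=50)
print(len(codes), len(set(codes)))  # 50 50
```

The sketch also shows why uniqueness can fail outright: asking for more distinct values than the value space contains (for example, n=200 two-digit codes) can never succeed, however many retries are allowed.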

We can also generate strings that match a regular expression with pattern= (e.g., product codes, identifiers):

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    batch_id=pb.string_field(pattern=r"BATCH-[A-Z][0-9]{3}"),
    sku=pb.string_field(pattern=r"[A-Z]{2}[0-9]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=23))

	product_code String	batch_id String	sku String
Polars, 30 rows, 3 columns
1	CAS-6685	BATCH-Y109	CA668523
2	XGI-0397	BATCH-J685	OA970117
3	DCW-6086	BATCH-E470	AQ503095
4	YBG-9529	BATCH-H011	TG959459
5	XLS-9459	BATCH-W608	PF972228
26	IEQ-1971	BATCH-I620	XF292474
27	SYO-0413	BATCH-O629	BT502512
28	BNZ-4359	BATCH-W138	GN938965
29	TYC-8695	BATCH-J648	XR725640
30	CTW-0120	BATCH-T410	ML823566
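A quick way to confirm that values like those in the preview satisfy their patterns is Python's re.fullmatch(), which requires the whole string to match. The check below uses the first two rows of the output above and is a verification sketch, not part of pointblank.

```python
import re

PRODUCT_CODE = re.compile(r"[A-Z]{3}-[0-9]{4}")
BATCH_ID = re.compile(r"BATCH-[A-Z][0-9]{3}")
SKU = re.compile(r"[A-Z]{2}[0-9]{6}")

# First two rows of the preview shown above.
rows = [
    ("CAS-6685", "BATCH-Y109", "CA668523"),
    ("XGI-0397", "BATCH-J685", "OA970117"),
]

all_match = all(
    PRODUCT_CODE.fullmatch(product) and BATCH_ID.fullmatch(batch) and SKU.fullmatch(sku)
    for product, batch, sku in rows
)
print(all_match)  # True

# fullmatch() rejects trailing characters that match() would let through.
print(PRODUCT_CODE.fullmatch("CAS-66850") is None)   # True
print(PRODUCT_CODE.match("CAS-66850") is not None)   # True
```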

For random alphanumeric strings, min_length= and max_length= control the length. Adding nullable=True introduces missing values:

schema = pb.Schema(
    short_code=pb.string_field(min_length=3, max_length=5),
    notes=pb.string_field(
        min_length=10, max_length=50,
        nullable=True, null_probability=0.4,
    ),
)

pb.preview(pb.generate_dataset(schema, n=30, seed=7))

	short_code String	notes String
Polars, 30 rows, 2 columns
1	8jzP	None
2	e0I	OL8dKLzdocJ2isAjIhKtJ0RlgLKOmxgJTeKdNnFRIBXuDL7Dxt
3	xLd	None
4	ncfBA	Ac9QeWJKY40uvSwMFLZDe1f8rESQedUStPKR0CsTy
5	pfJ	None
26	8rE	tOofL9H2WjQ5TY4MyWuUFjsUNPjc0
27	QedUS	None
28	PKR0	IRpFqaDZeV7G5IfQHeVVEqZe2qpUWnoVPDF2yeE6RsXcNOPmeM
29	sTy4	None
30	wb8Dw	sTHsDDDXh5Jmtf7EbsDe0G9Cryn687neLfjVHq8xi
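The effect of null_probability= can be approximated by an independent coin flip per row. The simulation below is a sketch under that assumption (the helper with_nulls is hypothetical); it shows that the observed share of nulls only approaches 0.4 on average, so a small preview such as the one above can easily show a different share.

```python
import random


def with_nulls(values, null_probability, rng):
    """Hypothetical: replace each value by None with the given probability."""
    return [None if rng.random() < null_probability else v for v in values]


rng = random.Random(7)
column = with_nulls(["x"] * 10_000, null_probability=0.4, rng=rng)
null_fraction = column.count(None) / len(column)
print(f"observed null fraction: {null_fraction:.3f}")
```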

Business and internet presets can be combined with allowed= to build a company directory:

schema = pb.Schema(
    company=pb.string_field(preset="company"),
    domain=pb.string_field(preset="domain_name"),
    industry_tag=pb.string_field(allowed=["tech", "finance", "health", "retail"]),
)

pb.preview(pb.generate_dataset(schema, n=20, seed=55))

	company String	domain String	industry_tag String
Polars, 20 rows, 3 columns
1	Morgan Stanley	was.co	tech
2	Walmart	his.biz	finance
3	Thompson and Zuniga	program.net	finance
4	Adams and Ward	people.io	health
5	White Partners	very.net	tech
16	Silver Properties	program.us	tech
17	Dynamic Industries Enterprises	you.app	health
18	National Systems	who.net	tech
19	Adobe	then.cloud	finance
20	Apex Industries Enterprises	now.us	tech