load_dataset

load_dataset(dataset='small_table', tbl_type='polars')

Load a dataset hosted in the library as the specified table type.

The Pointblank library includes several datasets that can be loaded using the load_dataset() function. The datasets can be loaded as a Polars DataFrame, a Pandas DataFrame, or as a DuckDB table (which uses the Ibis library backend). These datasets are used throughout the documentation’s examples to demonstrate the functionality of the library. They’re also useful for experimenting with the library and trying out different validation scenarios.

Parameters

dataset : Literal['small_table', 'game_revenue', 'nycflights'] = 'small_table'

The name of the dataset to load. Current options are "small_table", "game_revenue", and "nycflights".

tbl_type : Literal['polars', 'pandas', 'duckdb'] = 'polars'

The type of table to generate from the dataset. The available options are "polars", "pandas", and "duckdb".

Returns

FrameT | Any

The dataset, ready to pass to a Validate object. This is a Polars DataFrame, a Pandas DataFrame, or an Ibis table backed by DuckDB.

Included Datasets

There are three included datasets that can be loaded using the load_dataset() function:

  • "small_table": A small dataset with 13 rows and 8 columns. This dataset is useful for testing and demonstration purposes.
  • "game_revenue": A dataset with 2,000 rows and 11 columns. It provides revenue data for a game development company: for one particular game, it records player sessions, the items purchased, the ads viewed, and the revenue generated.
  • "nycflights": A dataset with 336,776 rows and 18 columns. This dataset provides information about flights departing from New York City airports (JFK, LGA, or EWR) in 2013.
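The shapes listed above can double as a quick sanity check after loading. The following is a minimal sketch, not part of Pointblank: EXPECTED_SHAPES and shape_matches() are hypothetical names, and the helper assumes the table exposes a .shape tuple, as Polars and Pandas DataFrames do (an Ibis table does not, so the "duckdb" table type is out of scope here).

```python
# (rows, columns) for each dataset, as stated in the list above.
EXPECTED_SHAPES = {
    "small_table": (13, 8),
    "game_revenue": (2000, 11),
    "nycflights": (336776, 18),
}


def shape_matches(dataset: str, table) -> bool:
    """Return True if table.shape equals the documented shape for `dataset`.

    Hypothetical helper: accepts any object with a `.shape` tuple,
    which covers Polars and Pandas DataFrames.
    """
    return tuple(table.shape) == EXPECTED_SHAPES[dataset]
```

With Pointblank installed, shape_matches("small_table", pb.load_dataset()) would be expected to return True if the shapes documented above hold.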

Supported DataFrame Types

The tbl_type= parameter can be set to one of the following:

  • "polars": A Polars DataFrame.
  • "pandas": A Pandas DataFrame.
  • "duckdb": An Ibis table for a DuckDB database.
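Because dataset= and tbl_type= each accept a fixed set of names, it can be convenient to check them before loading. The sketch below is illustrative only: check_args() and load_checked() are hypothetical helpers, not Pointblank functions, and the option tuples simply mirror the values documented above. Pointblank is imported lazily so that the name check works even where the library is not installed.

```python
# Option names copied from the parameter documentation above.
DATASETS = ("small_table", "game_revenue", "nycflights")
TBL_TYPES = ("polars", "pandas", "duckdb")


def check_args(dataset: str = "small_table", tbl_type: str = "polars") -> None:
    """Raise ValueError if either name is not among the documented options."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    if tbl_type not in TBL_TYPES:
        raise ValueError(f"unknown tbl_type {tbl_type!r}; expected one of {TBL_TYPES}")


def load_checked(dataset: str = "small_table", tbl_type: str = "polars"):
    """Validate the names, then delegate to pb.load_dataset()."""
    check_args(dataset, tbl_type)
    import pointblank as pb  # lazy import: only needed when actually loading

    return pb.load_dataset(dataset=dataset, tbl_type=tbl_type)
```

Calling load_checked("nycflights", "duckdb") would then behave like the plain load_dataset() call, while a misspelled name fails early with a message listing the valid options.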

Examples

Load the "small_table" dataset as a Polars DataFrame by calling load_dataset() with its defaults:

import pointblank as pb

small_table = pb.load_dataset()

pb.preview(small_table)
Polars · Rows 13 · Columns 8

    date_time            date        a      b          c      d        e        f
    Datetime             Date        Int64  String     Int64  Float64  Boolean  String
 1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3      3423.29  True     high
 2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8      9999.99  True     low
 3  2016-01-05 13:32:00  2016-01-05  6      8-kdg-938  3      2343.23  True     high
 4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None   3892.4   False    mid
 5  2016-01-09 12:36:00  2016-01-09  8      3-ldm-038  7      283.94   True     low
 9  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9      837.93   False    high
10  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9      837.93   False    high
11  2016-01-26 20:07:00  2016-01-26  4      2-dmx-010  7      833.98   True     low
12  2016-01-28 02:51:00  2016-01-28  2      7-dmx-010  8      108.34   False    low
13  2016-01-30 11:23:00  2016-01-30  1      3-dka-303  None   2230.09  True     high

The "small_table" dataset is returned as a plain Polars DataFrame; preview() renders it as an HTML table when run in an environment that can display HTML (a notebook, for example).
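The row numbers in the preview output above jump from 5 to 9, which is consistent with a head-and-tail display: the first five and the last five rows of the 13-row table. A minimal sketch of that row selection follows, assuming five head rows and five tail rows; previewed_rows() and its parameters are illustrative and are not Pointblank's implementation.

```python
def previewed_rows(n_rows: int, n_head: int = 5, n_tail: int = 5) -> list[int]:
    """Return the 1-based row numbers shown by a head-and-tail preview.

    Rows that fall in both the head and the tail (small tables) appear once.
    """
    head = range(1, min(n_head, n_rows) + 1)
    tail = range(max(n_rows - n_tail, 0) + 1, n_rows + 1)
    return sorted(set(head) | set(tail))


print(previewed_rows(13))  # [1, 2, 3, 4, 5, 9, 10, 11, 12, 13]
```

The same selection applied to 2,000 rows gives rows 1 to 5 and 1996 to 2000.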

The "game_revenue" dataset can be loaded as a Pandas DataFrame by specifying the dataset name and setting tbl_type="pandas":

game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="pandas")

pb.preview(game_revenue)
Pandas · Rows 2,000 · Columns 11
Columns, in order after the leading row number: player_id (object), session_id (object), session_start (datetime64[ns, UTC]), time (datetime64[ns, UTC]), item_type (object), item_name (object), item_revenue (float64), session_duration (float64), start_day (datetime64[ns]), acquisition (object), country (object)
1 ECPANOIXLZHF896 ECPANOIXLZHF896-eol2j8bs 2015-01-01 01:31:03+00:00 2015-01-01 01:31:27+00:00 iap offer2 8.99 16.3 2015-01-01 00:00:00 google Germany
2 ECPANOIXLZHF896 ECPANOIXLZHF896-eol2j8bs 2015-01-01 01:31:03+00:00 2015-01-01 01:36:57+00:00 iap gems3 22.49 16.3 2015-01-01 00:00:00 google Germany
3 ECPANOIXLZHF896 ECPANOIXLZHF896-eol2j8bs 2015-01-01 01:31:03+00:00 2015-01-01 01:37:45+00:00 iap gold7 107.99 16.3 2015-01-01 00:00:00 google Germany
4 ECPANOIXLZHF896 ECPANOIXLZHF896-eol2j8bs 2015-01-01 01:31:03+00:00 2015-01-01 01:42:33+00:00 ad ad_20sec 0.76 16.3 2015-01-01 00:00:00 google Germany
5 ECPANOIXLZHF896 ECPANOIXLZHF896-hdu9jkls 2015-01-01 11:50:02+00:00 2015-01-01 11:55:20+00:00 ad ad_5sec 0.03 35.2 2015-01-01 00:00:00 google Germany
1996 NAOJRDMCSEBI281 NAOJRDMCSEBI281-j2vs9ilp 2015-01-21 01:57:50+00:00 2015-01-21 02:02:50+00:00 ad ad_survey 1.332 25.8 2015-01-11 00:00:00 organic Norway
1997 NAOJRDMCSEBI281 NAOJRDMCSEBI281-j2vs9ilp 2015-01-21 01:57:50+00:00 2015-01-21 02:22:14+00:00 ad ad_survey 1.35 25.8 2015-01-11 00:00:00 organic Norway
1998 RMOSWHJGELCI675 RMOSWHJGELCI675-vbhcsmtr 2015-01-21 02:39:48+00:00 2015-01-21 02:40:00+00:00 ad ad_5sec 0.03 8.4 2015-01-10 00:00:00 other_campaign France
1999 RMOSWHJGELCI675 RMOSWHJGELCI675-vbhcsmtr 2015-01-21 02:39:48+00:00 2015-01-21 02:47:12+00:00 iap offer5 26.09 8.4 2015-01-10 00:00:00 other_campaign France
2000 GJCXNTWEBIPQ369 GJCXNTWEBIPQ369-9elq67md 2015-01-21 03:59:23+00:00 2015-01-21 04:06:29+00:00 ad ad_5sec 0.12 18.5 2015-01-14 00:00:00 organic United States

The "game_revenue" dataset is more realistic, with a mix of data types, and at 2,000 rows and 11 columns it is considerably larger than "small_table".

The "nycflights" dataset can be loaded as a DuckDB table by specifying the dataset name and setting tbl_type="duckdb":

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

pb.preview(nycflights)
DuckDB · Rows 336,776 · Columns 18
Columns, in order after the leading row number: year (int64), month (int64), day (int64), dep_time (int64), sched_dep_time (int64), dep_delay (int64), arr_time (int64), sched_arr_time (int64), arr_delay (int64), carrier (string), flight (int64), tailnum (string), origin (string), dest (string), air_time (int64), distance (int64), hour (int64), minute (int64)
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
336772 2013 9 30 NULL 1455 NULL NULL 1634 NULL 9E 3393 NULL JFK DCA NULL 213 14 55
336773 2013 9 30 NULL 2200 NULL NULL 2312 NULL 9E 3525 NULL LGA SYR NULL 198 22 0
336774 2013 9 30 NULL 1210 NULL NULL 1330 NULL MQ 3461 N535MQ LGA BNA NULL 764 12 10
336775 2013 9 30 NULL 1159 NULL NULL 1344 NULL MQ 3572 N511MQ LGA CLE NULL 419 11 59
336776 2013 9 30 NULL 840 NULL NULL 1020 NULL MQ 3531 N839MQ LGA RDU NULL 431 8 40

The "nycflights" dataset is large, with 336,776 rows and 18 columns. It contains real-world data on flights that departed from New York City airports in 2013.