import pointblank as pb
small_table = pb.load_dataset()
pb.preview(small_table)
Load a dataset hosted in the library as the specified table type.
The Pointblank library includes several datasets that can be loaded using the load_dataset() function. The datasets can be loaded as a Polars DataFrame, a Pandas DataFrame, or a DuckDB table (which uses the Ibis library backend). These datasets are used throughout the documentation's examples to demonstrate the functionality of the library. They're also useful for experimenting with the library and trying out different validation scenarios.
dataset : Literal['small_table', 'game_revenue', 'nycflights'] = 'small_table'
The name of the dataset to load. Current options are "small_table", "game_revenue", and "nycflights".
tbl_type : Literal['polars', 'pandas', 'duckdb'] = 'polars'
The type of table to generate from the dataset. The named options are "polars", "pandas", and "duckdb".
: FrameT | Any
The dataset for the Validate object. This could be a Polars DataFrame, a Pandas DataFrame, or a DuckDB table as an Ibis table.
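As a minimal sketch of how the returned table might be handed to a Validate object, the snippet below loads the default dataset and checks that two of its columns exist. The helper name validate_small_table() is hypothetical, and the import guard is an assumption added only so the snippet runs cleanly where pointblank or polars is not installed.

```python
import importlib.util


def validate_small_table():
    """Load "small_table" with the defaults and run a trivial validation.

    Hypothetical helper: returns None when pointblank or polars is missing.
    """
    required = ("pointblank", "polars")
    if any(importlib.util.find_spec(name) is None for name in required):
        return None

    import pointblank as pb

    # Defaults: dataset="small_table", tbl_type="polars"
    small_table = pb.load_dataset()

    # Pass the loaded table to Validate and check that two columns exist
    return (
        pb.Validate(data=small_table)
        .col_exists(columns=["date_time", "d"])
        .interrogate()
    )


validation = validate_small_table()
if validation is not None:
    print(validation.all_passed())
```

Any of the three table types returned by load_dataset() could be substituted for the default here, provided the matching backend is installed.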
There are three included datasets that can be loaded using the load_dataset() function:
- "small_table": A small dataset with 13 rows and 8 columns. This dataset is useful for testing and demonstration purposes.
- "game_revenue": A dataset with 2,000 rows and 11 columns. Provides revenue data for a game development company. For the particular game, there are records of player sessions, the items they purchased, ads viewed, and the revenue generated.
- "nycflights": A dataset with 336,776 rows and 18 columns. This dataset provides information about flights departing from New York City airports (JFK, LGA, or EWR) in 2013.

The tbl_type= parameter can be set to one of the following:
- "polars": A Polars DataFrame.
- "pandas": A Pandas DataFrame.
- "duckdb": An Ibis table for a DuckDB database.

Load the "small_table" dataset as a Polars DataFrame by calling load_dataset() with its defaults:
Polars · Rows: 13 · Columns: 8

| | date_time (Datetime) | date (Date) | a (Int64) | b (String) | c (Int64) | d (Float64) | e (Boolean) | f (String) |
|---|---|---|---|---|---|---|---|---|
| 1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | 1-bcd-345 | 3 | 3423.29 | True | high |
| 2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | 5-egh-163 | 8 | 9999.99 | True | low |
| 3 | 2016-01-05 13:32:00 | 2016-01-05 | 6 | 8-kdg-938 | 3 | 2343.23 | True | high |
| 4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | 5-jdo-903 | None | 3892.4 | False | mid |
| 5 | 2016-01-09 12:36:00 | 2016-01-09 | 8 | 3-ldm-038 | 7 | 283.94 | True | low |
| 9 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
| 10 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
| 11 | 2016-01-26 20:07:00 | 2016-01-26 | 4 | 2-dmx-010 | 7 | 833.98 | True | low |
| 12 | 2016-01-28 02:51:00 | 2016-01-28 | 2 | 7-dmx-010 | 8 | 108.34 | False | low |
| 13 | 2016-01-30 11:23:00 | 2016-01-30 | 1 | 3-dka-303 | None | 2230.09 | True | high |
Note that the "small_table" dataset is a simple Polars DataFrame and using the preview() function will display the table in an HTML viewing environment.
The "game_revenue" dataset can be loaded as a Pandas DataFrame by specifying the dataset name and setting tbl_type="pandas":
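A minimal sketch of that call follows, using only the dataset= and tbl_type= parameters described above. The helper name load_game_revenue_as_pandas() is hypothetical, and the import guard is an assumption added only so the snippet runs cleanly where pointblank or pandas is not installed.

```python
import importlib.util


def load_game_revenue_as_pandas():
    """Load "game_revenue" as a Pandas DataFrame.

    Hypothetical helper: returns None when pointblank or pandas is missing.
    """
    required = ("pointblank", "pandas")
    if any(importlib.util.find_spec(name) is None for name in required):
        return None

    import pointblank as pb

    # Name the dataset explicitly and request the Pandas table type
    return pb.load_dataset(dataset="game_revenue", tbl_type="pandas")


game_revenue = load_game_revenue_as_pandas()
if game_revenue is not None:
    # shape is (number of rows, number of columns)
    print(game_revenue.shape)
```

Passing the result to preview() would render the table in an HTML viewing environment.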
Pandas · Rows: 2,000 · Columns: 11

| | player_id (object) | session_id (object) | session_start (datetime64[ns, UTC]) | time (datetime64[ns, UTC]) | item_type (object) | item_name (object) | item_revenue (float64) | session_duration (float64) | start_day (datetime64[ns]) | acquisition (object) | country (object) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ECPANOIXLZHF896 | ECPANOIXLZHF896-eol2j8bs | 2015-01-01 01:31:03+00:00 | 2015-01-01 01:31:27+00:00 | iap | offer2 | 8.99 | 16.3 | 2015-01-01 00:00:00 | | Germany |
| 2 | ECPANOIXLZHF896 | ECPANOIXLZHF896-eol2j8bs | 2015-01-01 01:31:03+00:00 | 2015-01-01 01:36:57+00:00 | iap | gems3 | 22.49 | 16.3 | 2015-01-01 00:00:00 | | Germany |
| 3 | ECPANOIXLZHF896 | ECPANOIXLZHF896-eol2j8bs | 2015-01-01 01:31:03+00:00 | 2015-01-01 01:37:45+00:00 | iap | gold7 | 107.99 | 16.3 | 2015-01-01 00:00:00 | | Germany |
| 4 | ECPANOIXLZHF896 | ECPANOIXLZHF896-eol2j8bs | 2015-01-01 01:31:03+00:00 | 2015-01-01 01:42:33+00:00 | ad | ad_20sec | 0.76 | 16.3 | 2015-01-01 00:00:00 | | Germany |
| 5 | ECPANOIXLZHF896 | ECPANOIXLZHF896-hdu9jkls | 2015-01-01 11:50:02+00:00 | 2015-01-01 11:55:20+00:00 | ad | ad_5sec | 0.03 | 35.2 | 2015-01-01 00:00:00 | | Germany |
| 1996 | NAOJRDMCSEBI281 | NAOJRDMCSEBI281-j2vs9ilp | 2015-01-21 01:57:50+00:00 | 2015-01-21 02:02:50+00:00 | ad | ad_survey | 1.332 | 25.8 | 2015-01-11 00:00:00 | organic | Norway |
| 1997 | NAOJRDMCSEBI281 | NAOJRDMCSEBI281-j2vs9ilp | 2015-01-21 01:57:50+00:00 | 2015-01-21 02:22:14+00:00 | ad | ad_survey | 1.35 | 25.8 | 2015-01-11 00:00:00 | organic | Norway |
| 1998 | RMOSWHJGELCI675 | RMOSWHJGELCI675-vbhcsmtr | 2015-01-21 02:39:48+00:00 | 2015-01-21 02:40:00+00:00 | ad | ad_5sec | 0.03 | 8.4 | 2015-01-10 00:00:00 | other_campaign | France |
| 1999 | RMOSWHJGELCI675 | RMOSWHJGELCI675-vbhcsmtr | 2015-01-21 02:39:48+00:00 | 2015-01-21 02:47:12+00:00 | iap | offer5 | 26.09 | 8.4 | 2015-01-10 00:00:00 | other_campaign | France |
| 2000 | GJCXNTWEBIPQ369 | GJCXNTWEBIPQ369-9elq67md | 2015-01-21 03:59:23+00:00 | 2015-01-21 04:06:29+00:00 | ad | ad_5sec | 0.12 | 18.5 | 2015-01-14 00:00:00 | organic | United States |
The "game_revenue" dataset is a more realistic dataset with a mix of data types, and at 2,000 rows and 11 columns it is significantly larger than the "small_table" dataset.
The "nycflights" dataset can be loaded as a DuckDB table by specifying the dataset name and setting tbl_type="duckdb":
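A minimal sketch of that call follows, again using only the documented dataset= and tbl_type= parameters. The helper name load_nycflights_as_duckdb() is hypothetical, and the import guard is an assumption added only so the snippet runs cleanly where pointblank, Ibis, or DuckDB is not installed.

```python
import importlib.util


def load_nycflights_as_duckdb():
    """Load "nycflights" as an Ibis table backed by DuckDB.

    Hypothetical helper: returns None when a required package is missing.
    """
    required = ("pointblank", "ibis", "duckdb")
    if any(importlib.util.find_spec(name) is None for name in required):
        return None

    import pointblank as pb

    # Name the dataset explicitly and request the DuckDB table type
    return pb.load_dataset(dataset="nycflights", tbl_type="duckdb")


nycflights = load_nycflights_as_duckdb()
if nycflights is not None:
    # An Ibis table exposes its column names through .columns
    print(list(nycflights.columns))
```

Because the result is an Ibis table rather than an in-memory DataFrame, passing it to preview() is a convenient way to inspect the first and last rows.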
DuckDB · Rows: 336,776 · Columns: 18

| | year (int64) | month (int64) | day (int64) | dep_time (int64) | sched_dep_time (int64) | dep_delay (int64) | arr_time (int64) | sched_arr_time (int64) | arr_delay (int64) | carrier (string) | flight (int64) | tailnum (string) | origin (string) | dest (string) | air_time (int64) | distance (int64) | hour (int64) | minute (int64) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 |
| 2 | 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 |
| 3 | 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 |
| 4 | 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 |
| 5 | 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 |
| 336772 | 2013 | 9 | 30 | NULL | 1455 | NULL | NULL | 1634 | NULL | 9E | 3393 | NULL | JFK | DCA | NULL | 213 | 14 | 55 |
| 336773 | 2013 | 9 | 30 | NULL | 2200 | NULL | NULL | 2312 | NULL | 9E | 3525 | NULL | LGA | SYR | NULL | 198 | 22 | 0 |
| 336774 | 2013 | 9 | 30 | NULL | 1210 | NULL | NULL | 1330 | NULL | MQ | 3461 | N535MQ | LGA | BNA | NULL | 764 | 12 | 10 |
| 336775 | 2013 | 9 | 30 | NULL | 1159 | NULL | NULL | 1344 | NULL | MQ | 3572 | N511MQ | LGA | CLE | NULL | 419 | 11 | 59 |
| 336776 | 2013 | 9 | 30 | NULL | 840 | NULL | NULL | 1020 | NULL | MQ | 3531 | N839MQ | LGA | RDU | NULL | 431 | 8 | 40 |
The "nycflights" dataset is a large dataset with 336,776 rows and 18 columns. This dataset is truly a real-world dataset and provides information about flights originating from New York City airports in 2013.