Previewing Data

In many cases, it’s good to look at your data tables. Before validating a table, you’ll likely want to inspect a portion of it before diving into the creation of data-quality rules. This is pretty easily done with Polars and Pandas DataFrames, however, it’s not as easy with database tables and each table backend displays things differently.

To make this common task a little better, you can use the preview() function in Pointblank. It has been designed to work with every table that the package supports (i.e., DataFrames and Ibis-backend tables, the latter of which are largely database tables). Plus, what’s shown in the output is consistent, no matter what type of data you’re looking at.

Viewing a Table with `preview()`

Let’s look at how preview() works. It requires only a table and, for this first example, let’s use the nycflights dataset:

import pointblank as pb

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars")

pb.preview(nycflights)

	year Int64	month Int64	day Int64	dep_time Int64	sched_dep_time Int64	dep_delay Int64	arr_time Int64	sched_arr_time Int64	arr_delay Int64	carrier String	flight Int64	tailnum String	origin String	dest String	air_time Int64	distance Int64	hour Int64	minute Int64
PolarsRows336,776Columns18
1	2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15
2	2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29
3	2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40
4	2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45
5	2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0
336772	2013	9	30	None	1455	None	None	1634	None	9E	3393	None	JFK	DCA	None	213	14	55
336773	2013	9	30	None	2200	None	None	2312	None	9E	3525	None	LGA	SYR	None	198	22	0
336774	2013	9	30	None	1210	None	None	1330	None	MQ	3461	N535MQ	LGA	BNA	None	764	12	10
336775	2013	9	30	None	1159	None	None	1344	None	MQ	3572	N511MQ	LGA	CLE	None	419	11	59
336776	2013	9	30	None	840	None	None	1020	None	MQ	3531	N839MQ	LGA	RDU	None	431	8	40

This is an HTML table using the style of the other reporting tables in the library. The header is more minimal here, only showing the type of table we’re looking at (POLARS in this case) along with the table dimensions. The column headers provide both the column names and the column data types.

By default, we’re getting the first five rows and the last five rows. Row numbers (from the original dataset) provide an indication of which rows are the head and tail rows. The blue lines provide additional demarcation of the column containing the row numbers and the head and tail row groups. Finally, any cells with missing values are prominently styled with red lettering and a lighter red background.

If you’d rather not see the row numbers in the table, you can use the show_row_numbers=False option. Let’s try that with the game_revenue dataset as a DuckDB table:

game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")

pb.preview(game_revenue, show_row_numbers=False)

player_id string	session_id string	session_start timestamp	time timestamp	item_type string	item_name string	item_revenue float64	session_duration float64	start_day date	acquisition string	country string
DuckDBRows2,000Columns11
ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:31:27+00:00	iap	offer2	8.99	16.3	2015-01-01	google	Germany
ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:36:57+00:00	iap	gems3	22.49	16.3	2015-01-01	google	Germany
ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:37:45+00:00	iap	gold7	107.99	16.3	2015-01-01	google	Germany
ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:42:33+00:00	ad	ad_20sec	0.76	16.3	2015-01-01	google	Germany
ECPANOIXLZHF896	ECPANOIXLZHF896-hdu9jkls	2015-01-01 11:50:02+00:00	2015-01-01 11:55:20+00:00	ad	ad_5sec	0.03	35.2	2015-01-01	google	Germany
NAOJRDMCSEBI281	NAOJRDMCSEBI281-j2vs9ilp	2015-01-21 01:57:50+00:00	2015-01-21 02:02:50+00:00	ad	ad_survey	1.332	25.8	2015-01-11	organic	Norway
NAOJRDMCSEBI281	NAOJRDMCSEBI281-j2vs9ilp	2015-01-21 01:57:50+00:00	2015-01-21 02:22:14+00:00	ad	ad_survey	1.35	25.8	2015-01-11	organic	Norway
RMOSWHJGELCI675	RMOSWHJGELCI675-vbhcsmtr	2015-01-21 02:39:48+00:00	2015-01-21 02:40:00+00:00	ad	ad_5sec	0.03	8.4	2015-01-10	other_campaign	France
RMOSWHJGELCI675	RMOSWHJGELCI675-vbhcsmtr	2015-01-21 02:39:48+00:00	2015-01-21 02:47:12+00:00	iap	offer5	26.09	8.4	2015-01-10	other_campaign	France
GJCXNTWEBIPQ369	GJCXNTWEBIPQ369-9elq67md	2015-01-21 03:59:23+00:00	2015-01-21 04:06:29+00:00	ad	ad_5sec	0.12	18.5	2015-01-14	organic	United States

With the above preview, the row numbers are gone. The horizontal blue line still serves to divide the top and bottom rows of the table, however.

Adjusting the Number of Rows Shown

It could be that displaying the five top and bottom rows is not preferred. This can be changed with the n_head= and n_tail=. Maybe, you want three from the top along with the last row? Let’s try that out with the small_table dataset as a Pandas DataFrame:

small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

pb.preview(small_table, n_head=3, n_tail=1)

	date_time datetime64[ns]	date datetime64[ns]	a int64	b object	c float64	d float64	e bool	f object
PandasRows13Columns8
1	2016-01-04 11:00:00	2016-01-04 00:00:00	2	1-bcd-345	3.0	3423.29	True	high
2	2016-01-04 00:32:00	2016-01-04 00:00:00	3	5-egh-163	8.0	9999.99	True	low
3	2016-01-05 13:32:00	2016-01-05 00:00:00	6	8-kdg-938	3.0	2343.23	True	high
13	2016-01-30 11:23:00	2016-01-30 00:00:00	1	3-dka-303	NA	2230.09	True	high

If you’re looking at a small table and want to see the entirety of it, you can enlarge the n_head= and n_tail= values:

small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

pb.preview(small_table, n_head=10, n_tail=10)

	date_time datetime64[ns]	date datetime64[ns]	a int64	b object	c float64	d float64	e bool	f object
PandasRows13Columns8
1	2016-01-04 11:00:00	2016-01-04 00:00:00	2	1-bcd-345	3.0	3423.29	True	high
2	2016-01-04 00:32:00	2016-01-04 00:00:00	3	5-egh-163	8.0	9999.99	True	low
3	2016-01-05 13:32:00	2016-01-05 00:00:00	6	8-kdg-938	3.0	2343.23	True	high
4	2016-01-06 17:23:00	2016-01-06 00:00:00	2	5-jdo-903	NA	3892.4	False	mid
5	2016-01-09 12:36:00	2016-01-09 00:00:00	8	3-ldm-038	7.0	283.94	True	low
6	2016-01-11 06:15:00	2016-01-11 00:00:00	4	2-dhe-923	4.0	3291.03	True	mid
7	2016-01-15 18:46:00	2016-01-15 00:00:00	7	1-knw-093	3.0	843.34	True	high
8	2016-01-17 11:27:00	2016-01-17 00:00:00	4	5-boe-639	2.0	1035.64	False	low
9	2016-01-20 04:30:00	2016-01-20 00:00:00	3	5-bce-642	9.0	837.93	False	high
10	2016-01-20 04:30:00	2016-01-20 00:00:00	3	5-bce-642	9.0	837.93	False	high
11	2016-01-26 20:07:00	2016-01-26 00:00:00	4	2-dmx-010	7.0	833.98	True	low
12	2016-01-28 02:51:00	2016-01-28 00:00:00	2	7-dmx-010	8.0	108.34	False	low
13	2016-01-30 11:23:00	2016-01-30 00:00:00	1	3-dka-303	NA	2230.09	True	high

Given that the table has 13 rows, asking for 20 rows to be displayed effectively shows the entire table.

Previewing a Subset of Columns

The preview scales well to tables that have many columns by allowing for a horizontal scroll. However, previewing data from all columns can be impractical if you’re only concerned with a key set of them. To preview only a subset of a table’s columns, we can use the columns_subset= argument. Let’s do this with the nycflights dataset and provide a list of six columns from that table.

pb.preview(
    nycflights,
    columns_subset=["hour", "minute", "sched_dep_time", "year", "month", "day"]
)

	hour Int64	minute Int64	sched_dep_time Int64	year Int64	month Int64	day Int64
PolarsRows336,776Columns18
1	5	15	515	2013	1	1
2	5	29	529	2013	1	1
3	5	40	540	2013	1	1
4	5	45	545	2013	1	1
5	6	0	600	2013	1	1
336772	14	55	1455	2013	9	30
336773	22	0	2200	2013	9	30
336774	12	10	1210	2013	9	30
336775	11	59	1159	2013	9	30
336776	8	40	840	2013	9	30

What we see are the six columns we specified from the nycflights dataset.

Note that the columns are displayed in the order provided in the columns_subset= list. This can be useful for making quick, side-by-side comparisons. In the example above, we placed hour and minute next to the sched_dep_time column. In the original dataset, sched_dep_time is far apart from the other two columns, but, it’s useful to have them next to each other in the preview since hour and minute are derived from sched_dep_time (and this lets us spot check any issues).

We can also use column selectors within columns_subset=. Suppose we want to only see those columns that have "dep_" or "arr_" in the name. To do that, we use the matches() column selector function:

pb.preview(nycflights, columns_subset=pb.matches("dep_|arr_"))

	dep_time Int64	sched_dep_time Int64	dep_delay Int64	arr_time Int64	sched_arr_time Int64	arr_delay Int64
PolarsRows336,776Columns18
1	517	515	2	830	819	11
2	533	529	4	850	830	20
3	542	540	2	923	850	33
4	544	545	-1	1004	1022	-18
5	554	600	-6	812	837	-25
336772	None	1455	None	None	1634	None
336773	None	2200	None	None	2312	None
336774	None	1210	None	None	1330	None
336775	None	1159	None	None	1344	None
336776	None	840	None	None	1020	None

Several selectors can be combined together through use of the col() function and operators such as & (and), | (or), - (difference), and ~ (not). Let’s look at a column selection case where:

the first three columns are selected
all columns containing "dep_" or "arr_" are selected
any columns beginning with "sched" are omitted

This is how we put that together within col():

pb.preview(
    nycflights,
    columns_subset=pb.col((pb.first_n(3) | pb.matches("dep_|arr_")) & ~ pb.starts_with("sched"))
)

	year Int64	month Int64	day Int64	dep_time Int64	dep_delay Int64	arr_time Int64	arr_delay Int64
PolarsRows336,776Columns18
1	2013	1	1	517	2	830	11
2	2013	1	1	533	4	850	20
3	2013	1	1	542	2	923	33
4	2013	1	1	544	-1	1004	-18
5	2013	1	1	554	-6	812	-25
336772	2013	9	30	None	None	None	None
336773	2013	9	30	None	None	None	None
336774	2013	9	30	None	None	None	None
336775	2013	9	30	None	None	None	None
336776	2013	9	30	None	None	None	None

This gives us a preview with only the columns that fit the specific selection rules. Incidentally, using selectors with a dataset through preview() is a good way to test out the use of selectors more generally. Since they are primarily used to select columns for validation, trying them beforehand with preview() can help verify that your selection logic is sound.

Viewing a Table with preview()

Adjusting the Number of Rows Shown

Previewing a Subset of Columns

Viewing a Table with `preview()`