[r] Quickly reading very large tables as dataframes

I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.

Specifically, I am using the parquet columnar format. By default the file is read back into a data.frame in R, but you can get even greater speedups if you skip that conversion. The format is also convenient because it can be used from Python as well.

My main use case for this is on a fairly constrained RShiny server. For this reason, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.

The article linked below provides benchmarking and a good overview. I have quoted some interesting points from it.

https://ursalabs.org/blog/2019-10-columnar-perf/

File Size

That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.

Read Speed

When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.


Independent Test

I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. I shuffled the values within each column to challenge the compression, and added a short text field of random words and two simulated factors.

Data

library(dplyr)
library(tibble)
library(OpenRepGrid)

n <- 1000000

set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))


test_data <- mtcars %>%
  rownames_to_column() %>%
  sample_n(n, replace = TRUE) %>%
  mutate_all(~ sample(., length(.))) %>%
  mutate(factor1 = sample(some_levels1, n, replace = TRUE),
         factor2 = sample(some_levels2, n, replace = TRUE),
         text = randomSentences(n, sample(3:8, n, replace = TRUE))
         )

Read and Write

Writing the data is easy.

library(arrow)

write_parquet(test_data, "test_data.parquet")

# you can also adjust the compression codec and level
write_parquet(test_data, "test_data2.parquet", compression = "gzip", compression_level = 9)
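
To see what the compression settings do to file size, the two kinds of write can be compared directly. This is a minimal sketch, not the code behind the numbers in this answer: the data frame `df` and the temporary paths are made-up stand-ins, and it assumes the gzip codec is available in your arrow build.

```r
library(arrow)

# Hypothetical stand-in data; test_data from above would work the same way
set.seed(42)
df <- data.frame(
  id    = 1:100000,
  group = sample(letters, 100000, replace = TRUE),
  value = rnorm(100000)
)

default_path <- tempfile(fileext = ".parquet")
gzip_path    <- tempfile(fileext = ".parquet")

# Default compression versus gzip at the highest level
write_parquet(df, default_path)
write_parquet(df, gzip_path, compression = "gzip", compression_level = 9)

# File sizes in MB
sizes <- c(default = file.size(default_path), gzip9 = file.size(gzip_path))
round(sizes / 1024^2, 2)
```

Higher compression levels usually cost more time when writing, so it is worth checking both size and speed on your own data.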

Reading the data is also easy.

read_parquet("test_data.parquet")

# this option gives lightning fast reads, but returns an Arrow Table instead of a data.frame
read_parquet("test_data2.parquet", as_data_frame = FALSE)
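
To make the "different format" concrete: with as_data_frame = FALSE the result is an Arrow Table, and the conversion to a data.frame can be postponed until after filtering. A minimal sketch, assuming a reasonably recent arrow release in which dplyr verbs work on Tables, with mtcars and a temporary file as stand-ins for the real data:

```r
library(arrow)
library(dplyr)

# Write a small example file (mtcars stands in for the real data)
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Read as an Arrow Table rather than a data.frame
tbl <- read_parquet(path, as_data_frame = FALSE)

# The dplyr verbs are evaluated by Arrow; collect() converts the result for R
result <- tbl %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt) %>%
  collect()

# Alternatively, read only the columns that are needed
subset_df <- read_parquet(path, col_select = c("mpg", "cyl"))
```

Only the rows and columns that survive the filter are converted, which is where the extra speed comes from.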

I tested reading this data against a few of the competing options, and got slightly different results from those in the article above, which is expected.

[Figure: benchmarking results]

This file is nowhere near as large as the files in the benchmark article, so maybe that is the difference.

Tests

  • rds: test_data.rds (20.3 MB)
  • parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
  • parquet2: test_data2.parquet (14.9 MB with higher compression)
  • parquet: test_data.parquet (40.7 MB)
  • fst2: test_data2.fst (27.9 MB with higher compression)
  • fst: test_data.fst (76.8 MB)
  • fread2: test_data.csv.gz (23.6 MB)
  • fread: test_data.csv (98.7 MB)
  • feather_arrow: test_data.feather (157.2 MB read with arrow)
  • feather: test_data.feather (157.2 MB read with feather)
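
The benchmark code itself is not shown here. A minimal sketch of how such a read-time comparison might be set up, assuming the arrow and data.table packages are installed, using base system.time rather than a dedicated benchmarking package, and a small made-up data frame in place of test_data:

```r
library(arrow)
library(data.table)

# Hypothetical stand-in data, much smaller than test_data
set.seed(1)
df <- data.frame(
  x = rnorm(100000),
  g = sample(letters, 100000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Write the same data in three formats
out_dir <- tempfile()
dir.create(out_dir)
paths <- c(
  rds     = file.path(out_dir, "d.rds"),
  parquet = file.path(out_dir, "d.parquet"),
  csv     = file.path(out_dir, "d.csv")
)
saveRDS(df, paths[["rds"]])
write_parquet(df, paths[["parquet"]])
fwrite(df, paths[["csv"]])

# One reader function per test
readers <- list(
  rds     = function() readRDS(paths[["rds"]]),
  parquet = function() read_parquet(paths[["parquet"]]),
  fread   = function() fread(paths[["csv"]])
)

# Median elapsed seconds over five repetitions of each reader
timings <- sapply(readers, function(f) {
  median(replicate(5, system.time(f())[["elapsed"]]))
})

data.frame(
  test       = names(readers),
  size_mb    = round(unname(file.size(paths)) / 1024^2, 2),
  median_sec = round(unname(timings), 3)
)
```

Timings from a sketch like this depend heavily on the machine, disk cache, and data, so they should be read as rough indications only.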

Observations

For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speedup.

Here fst is also a great choice. I would use either the highly compressed fst format or the highly compressed parquet format, depending on whether speed or file size mattered more.
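
For reference, the compression level in fst is set when writing. A minimal sketch, assuming the fst package is installed and using a made-up data frame and temporary paths rather than the files from the tests above:

```r
library(fst)

# Hypothetical stand-in data
set.seed(7)
df <- data.frame(id = 1:100000, value = rnorm(100000))

fast_path  <- tempfile(fileext = ".fst")
small_path <- tempfile(fileext = ".fst")

# compress runs from 0 (least compression) to 100 (most); the default is 50
write_fst(df, fast_path, compress = 0)
write_fst(df, small_path, compress = 100)

# File sizes in MB
round(file.size(c(fast_path, small_path)) / 1024^2, 2)

# Reading back returns a data.frame
roundtrip <- read_fst(small_path)
```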
