What are the pros and cons of parquet format compared to other formats

Question

Characteristics of Apache Parquet are     Self-describing  Columnar format Language-independent   In comparison to Avro  Sequence Files  RC File etc  I want an overview of the formats  I have already read   How Impala Works with Hadoop File Formats   it gives some insights on the formats but I would like to know how the access to data  amp  storage of data is done in each of these formats  How parquet has an advantage over the others

User · Answer

Choosing the right file format is important to building performant data applications. The concepts outlined in this post carry over to Pandas, Dask, Spark, and Presto / AWS Athena.

Column pruning

Column pruning is a big performance improvement that's possible for column-based file formats (Parquet, ORC) and not possible for row-based file formats (CSV, Avro).

Suppose you have a dataset with 100 columns and want to read two of them into a DataFrame. Here's how you can perform this with Pandas if the data is stored in a Parquet file.

import pandas as pd

pd.read_parquet('some_file.parquet', columns = ['id', 'firstname'])

Parquet is a columnar file format, so Pandas can grab the columns relevant for the query and can skip the other columns. This is a massive performance improvement.

If the data is stored in a CSV file, you can read it like this:

import pandas as pd

pd.read_csv('some_file.csv', usecols = ['id', 'firstname'])

usecols can't skip over entire columns because of the row nature of the CSV file format.

Spark doesn't require users to explicitly list the columns that'll be used in a query. Spark builds up an execution plan and will automatically leverage column pruning whenever possible. Of course, column pruning is only possible when the underlying file format is column oriented.

Popularity

Spark and Pandas have built-in readers writers for CSV, JSON, ORC, Parquet, and text files. They don't have built-in readers for Avro.

Avro is popular within the Hadoop ecosystem. Parquet has gained significant traction outside of the Hadoop ecosystem. For example, the Delta Lake project is being built on Parquet files.

Arrow is an important project that makes it easy to work with Parquet files with a variety of different languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust), but doesn't support Avro. Parquet files are easier to work with because they are supported by so many different projects.

Schema

Parquet stores the file schema in the file metadata. CSV files don't store file metadata, so readers need to either be supplied with the schema or the schema needs to be inferred. Supplying a schema is tedious and inferring a schema is error prone / expensive.

Avro also stores the data schema in the file itself. Having schema in the files is a huge advantage and is one of the reasons why a modern data project should not rely on JSON or CSV.

Column metadata

Parquet stores metadata statistics for each column and lets users add their own column metadata as well.

The min / max column value metadata allows for Parquet predicate pushdown filtering that's supported by the Dask & Spark cluster computing frameworks.

Here's how to fetch the column statistics with PyArrow.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)

<pyarrow._parquet.Statistics object at 0x11ac17eb0>
  has_min_max: True
  min: 1
  max: 9
  null_count: 0
  distinct_count: 0
  num_values: 3
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE

Complex column types

Parquet allows for complex column types like arrays, dictionaries, and nested schemas. There isn't a reliable method to store complex types in simple file formats like CSVs.

Compression

Columnar file formats store related types in rows, so they're easier to compress. This CSV file is relatively hard to compress.

first_name,age
ken,30
felicia,36
mia,2

This data is easier to compress when the related types are stored in the same row:

ken,felicia,mia
30,36,2

Parquet files are most commonly compressed with the Snappy compression algorithm. Snappy compressed files are splittable and quick to inflate. Big data systems want to reduce file size on disk, but also want to make it quick to inflate the flies and run analytical queries.

Mutable nature of file

Parquet files are immutable, as described here. CSV files are mutable.

Adding a row to a CSV file is easy. You can't easily add a row to a Parquet file.

Data lakes

In a big data environment, you'll be working with hundreds or thousands of Parquet files. Disk partitioning of the files, avoiding big files, and compacting small files is important. The optimal disk layout of data depends on your query patterns.

User · Answer

Avro is a row-based storage format for Hadoop   Parquet is a column-based storage format for Hadoop   If your use case typically scans or retrieves all of the fields in a row in each query  Avro is usually the best choice   If your dataset has many columns  and your use case typically involves working with a subset of those columns rather than entire records  Parquet is optimized for that kind of work   Source

User · Answer

I think the main difference I can describe relates to record oriented vs  column oriented formats   Record oriented formats are what we re all used to -- text files  delimited formats like CSV  TSV   AVRO is slightly cooler than those because it can change schema over time  e g  adding or removing columns from a record   Other tricks of various formats  especially including compression  involve whether a format can be split -- that is  can you read a block of records from anywhere in the dataset and still know it s schema   But here s more detail on columnar formats like Parquet   Parquet  and other columnar formats handle a common Hadoop situation very efficiently   It is common to have tables  datasets  having many more columns than you would expect in a well-designed relational database -- a hundred or two hundred columns is not unusual   This is so because we often use Hadoop as a place to denormalize data from relational formats -- yes  you get lots of repeated values and many tables all flattened into a single one   But it becomes much easier to query since all the joins are worked out   There are other advantages such as retaining state-in-time data   So anyway it s common to have a boatload of columns in a table     Let s say there are 132 columns  and some of them are really long text fields  each different column one following the other and use up maybe 10K per record   While querying these tables is easy with SQL standpoint  it s common that you ll want to get some range of records based on only a few of those hundred-plus columns   For example  you might want all of the records in February and March for customers with sales    500   To do this in a row format the query would need to scan every record of the dataset   Read the first row  parse the record into fields  columns  and get the date and sales columns  include it in your result if it satisfies the condition   Repeat   If you have 10 years  120 months  of history  you re reading every single record just to find 2 of those months   Of course this is a great opportunity to use a partition on year and month  but even so  you re reading and parsing 10K of each record row for those two months just to find whether the customer s sales are    500   In a columnar format  each column  field  of a record is stored with others of its kind  spread all over many different blocks on the disk -- columns for year together  columns for month together  columns for customer employee handbook  or other long text   and all the others that make those records so huge all in their own separate place on the disk  and of course columns for sales together   Well heck  date and months are numbers  and so are sales -- they are just a few bytes   Wouldn t it be great if we only had to read a few bytes for each record to determine which records matched our query   Columnar storage to the rescue   Even without partitions  scanning the small fields needed to satisfy our query is super-fast -- they are all in order by record  and all the same size  so the disk seeks over much less data checking for included records   No need to read through that employee handbook and other long text fields -- just ignore them   So  by grouping columns with each other  instead of rows  you can almost always scan less data   Win   But wait  it gets better   If your query only needed to know those values and a few more  let s say 10 of the 132 columns  and didn t care about that employee handbook column  once it had picked the right records to return  it would now only have to go back to the 10 columns it needed to render the results  ignoring the other 122 of the 132 in our dataset   Again  we skip a lot of reading    Note  for this reason  columnar formats are a lousy choice when doing straight transformations  for example  if you re joining all of two tables into one big ger  result set that you re saving as a new table  the sources are going to get scanned completely anyway  so there s not a lot of benefit in read performance  and because columnar formats need to remember more about the where stuff is  they use more memory than a similar row format    One more benefit of columnar  data is spread around   To get a single record  you can have 132 workers each read  and write  data from to 132 different places on 132 blocks of data   Yay for parallelization   And now for the clincher  compression algorithms work much better when it can find repeating patterns   You could compress AABBBBBBCCCCCCCCCCCCCCCC as 2A6B16C but ABCABCBCBCBCCCCCCCCCCCCCC wouldn t get as small  well  actually  in this case it would  but trust me  -      So once again  less reading   And writing too   So we read a lot less data to answer common queries  it s potentially faster to read and write in parallel  and compression tends to work much better     Columnar is great when your input side is large  and your output is a filtered subset  from big to little is great   Not as beneficial when the input and outputs are about the same   But in our case  Impala took our old Hive queries that ran in 5  10  20 or 30 minutes  and finished most in a few seconds or a minute   Hope this helps answer at least part of your question

User · Answer

Tom s answer is quite detailed and exhaustive but you may also be interested in this simple study about Parquet vs Avro done at Allstate Insurance  summarized here    Overall  Parquet showed either similar or better results on every test  than Avro   The query-performance differences on the larger datasets in Parquet   s favor are partly due to the compression results  when querying the wide dataset  Spark had to read 3 5x less data for Parquet than Avro  Avro did not perform well when processing the entire dataset  as suspected

[file] What are the pros and cons of parquet format compared to other formats?

Examples related to file

Examples related to hadoop

Examples related to hdfs

Examples related to avro

Examples related to parquet