What are the pros and cons of parquet format compared to other formats?

Characteristics of Apache Parquet are:

  • Self-describing
  • Columnar format
  • Language-independent

In comparison to Avro, Sequence Files, RC File, etc., I want an overview of these formats. I have already read How Impala Works with Hadoop File Formats; it gives some insight into the formats, but I would like to know how data access and data storage work in each of them. How does Parquet have an advantage over the others?

Tags: file, hadoop, hdfs, avro, parquet

Answers:


Tom's answer is quite detailed and exhaustive, but you may also be interested in this simple study comparing Parquet and Avro done at Allstate Insurance, summarized here:

"Overall, Parquet showed either similar or better results on every test [than Avro]. The query-performance differences on the larger datasets in Parquet’s favor are partly due to the compression results; when querying the wide dataset, Spark had to read 3.5x less data for Parquet than Avro. Avro did not perform well when processing the entire dataset, as suspected."


Avro is a row-based storage format for Hadoop.

Parquet is a column-based storage format for Hadoop.

If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.

If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.

Source


Choosing the right file format is important to building performant data applications. The concepts outlined in this post carry over to Pandas, Dask, Spark, and Presto / AWS Athena.

Column pruning

Column pruning is a big performance improvement that's possible for column-based file formats (Parquet, ORC) and not possible for row-based file formats (CSV, Avro).

Suppose you have a dataset with 100 columns and want to read two of them into a DataFrame. Here's how you can perform this with Pandas if the data is stored in a Parquet file.

import pandas as pd

pd.read_parquet('some_file.parquet', columns=['id', 'firstname'])

Parquet is a columnar file format, so Pandas can grab the columns relevant for the query and can skip the other columns. This is a massive performance improvement.

If the data is stored in a CSV file, you can read it like this:

import pandas as pd

pd.read_csv('some_file.csv', usecols=['id', 'firstname'])

usecols can't skip over entire columns because of the row nature of the CSV file format.

Spark doesn't require users to explicitly list the columns that'll be used in a query. Spark builds up an execution plan and will automatically leverage column pruning whenever possible. Of course, column pruning is only possible when the underlying file format is column oriented.
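Here's a minimal PySpark sketch (assuming a local SparkSession and the same hypothetical some_file.parquet) of column pruning showing up in the plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Selecting two columns is enough; Spark pushes the pruning into the scan.
df = spark.read.parquet('some_file.parquet').select('id', 'firstname')

# The physical plan's Parquet scan lists a ReadSchema containing only the
# selected columns, confirming the other columns are never read from disk.
df.explain()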

Popularity

Spark and Pandas have built-in readers and writers for CSV, JSON, ORC, Parquet, and text files. They don't have built-in readers for Avro.

Avro is popular within the Hadoop ecosystem. Parquet has gained significant traction outside of the Hadoop ecosystem. For example, the Delta Lake project is being built on Parquet files.

Arrow is an important project that makes it easy to work with Parquet files with a variety of different languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust), but doesn't support Avro. Parquet files are easier to work with because they are supported by so many different projects.

Schema

Parquet stores the file schema in the file metadata. CSV files don't store file metadata, so readers need to either be supplied with the schema or the schema needs to be inferred. Supplying a schema is tedious and inferring a schema is error prone / expensive.

Avro also stores the data schema in the file itself. Having schema in the files is a huge advantage and is one of the reasons why a modern data project should not rely on JSON or CSV.
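As a quick illustration (a sketch using PyArrow and the same hypothetical file name), the schema can be read straight from the Parquet footer without scanning any row data:

import pyarrow.parquet as pq

# Only the file footer is read; no row data is scanned.
print(pq.read_schema('some_file.parquet'))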

Column metadata

Parquet stores metadata statistics for each column and lets users add their own column metadata as well.

The min / max column value metadata allows for Parquet predicate pushdown filtering that's supported by the Dask & Spark cluster computing frameworks.

Here's how to fetch the column statistics with PyArrow.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)

This prints something like:

<pyarrow._parquet.Statistics object at 0x11ac17eb0>
  has_min_max: True
  min: 1
  max: 9
  null_count: 0
  distinct_count: 0
  num_values: 3
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE
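
Those min/max statistics are what make predicate pushdown work: row groups whose statistics cannot satisfy a filter are skipped without being read. A minimal sketch with PyArrow, assuming the same hypothetical file has an 'id' column:

import pyarrow.parquet as pq

# Row groups whose min/max statistics rule out the predicate are skipped
# entirely instead of being read and filtered in memory.
table = pq.read_table('some_file.parquet', filters=[('id', '>', 5)])
print(table.num_rows)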

Complex column types

Parquet allows for complex column types like arrays, dictionaries, and nested schemas. There isn't a reliable method to store complex types in simple file formats like CSVs.
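For example, here's a sketch with PyArrow (the column names are made up) that writes array and nested struct columns to a Parquet file:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'id': [1, 2],
    'tags': [['a', 'b'], ['c']],                   # array (list) column
    'address': [{'city': 'NYC'}, {'city': 'LA'}],  # nested struct column
})
pq.write_table(table, 'complex_types.parquet')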

Compression

Columnar file formats store values of the same type next to each other, so they're easier to compress. This CSV file is relatively hard to compress:

first_name,age
ken,30
felicia,36
mia,2

The same data is easier to compress when values of the same type are stored together:

ken,felicia,mia
30,36,2

Parquet files are most commonly compressed with the Snappy compression algorithm. Snappy-compressed files are splittable and quick to inflate. Big data systems want to reduce file size on disk, but also want to make it quick to inflate the files and run analytical queries.
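With Pandas, the compression codec can be chosen when writing; a sketch (Snappy is also the default for the pyarrow engine):

import pandas as pd

df = pd.DataFrame({'first_name': ['ken', 'felicia', 'mia'], 'age': [30, 36, 2]})

# Snappy trades a slightly larger file for fast decompression at query time.
df.to_parquet('people.parquet', compression='snappy')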

Mutable nature of file

Parquet files are immutable, as described here. CSV files are mutable.

Adding a row to a CSV file is easy. You can't easily add a row to a Parquet file.
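As a sketch (the file names are made up), appending to a CSV is a cheap in-place file append, while the usual Parquet workaround is to write an additional file next to the existing ones:

import pandas as pd

new_row = pd.DataFrame([{'id': 4, 'firstname': 'ana'}])

# CSV: append the row to the existing file in place.
new_row.to_csv('some_file.csv', mode='a', header=False, index=False)

# Parquet: the existing file is immutable, so write a new file alongside it
# (or rewrite the dataset) instead of modifying it.
new_row.to_parquet('some_file_part_2.parquet')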

Data lakes

In a big data environment, you'll be working with hundreds or thousands of Parquet files. Disk partitioning of the files, avoiding big files, and compacting small files is important. The optimal disk layout of data depends on your query patterns.
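For example, a common layout is to partition on a column that queries usually filter on; a sketch with Pandas (the 'date' column and output path are made up, and the pyarrow engine is assumed):

import pandas as pd

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'id': [1, 2, 3],
})

# Writes one subdirectory per distinct 'date' value (Hive-style partitioning),
# so queries filtering on 'date' only touch the matching directories.
df.to_parquet('my_lake/', partition_cols=['date'])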

