Characteristics of Apache Parquet are :
In comparison to Avro, Sequence Files, RC File etc. I want an overview of the formats. I have already read : How Impala Works with Hadoop File Formats , it gives some insights on the formats but I would like to know how the access to data & storage of data is done in each of these formats. How parquet has an advantage over the others?
Tom's answer is quite detailed and exhaustive but you may also be interested in this simple study about Parquet vs Avro done at Allstate Insurance, summarized here:
"Overall, Parquet showed either similar or better results on every test [than Avro]. The query-performance differences on the larger datasets in Parquet’s favor are partly due to the compression results; when querying the wide dataset, Spark had to read 3.5x less data for Parquet than Avro. Avro did not perform well when processing the entire dataset, as suspected."
Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
Choosing the right file format is important to building performant data applications. The concepts outlined in this post carry over to Pandas, Dask, Spark, and Presto / AWS Athena.
Column pruning
Column pruning is a big performance improvement that's possible for column-based file formats (Parquet, ORC) and not possible for row-based file formats (CSV, Avro).
Suppose you have a dataset with 100 columns and want to read two of them into a DataFrame. Here's how you can perform this with Pandas if the data is stored in a Parquet file.
import pandas as pd
pd.read_parquet('some_file.parquet', columns = ['id', 'firstname'])
Parquet is a columnar file format, so Pandas can grab the columns relevant for the query and can skip the other columns. This is a massive performance improvement.
If the data is stored in a CSV file, you can read it like this:
import pandas as pd
pd.read_csv('some_file.csv', usecols = ['id', 'firstname'])
usecols
can't skip over entire columns because of the row nature of the CSV file format.
Spark doesn't require users to explicitly list the columns that'll be used in a query. Spark builds up an execution plan and will automatically leverage column pruning whenever possible. Of course, column pruning is only possible when the underlying file format is column oriented.
Popularity
Spark and Pandas have built-in readers writers for CSV, JSON, ORC, Parquet, and text files. They don't have built-in readers for Avro.
Avro is popular within the Hadoop ecosystem. Parquet has gained significant traction outside of the Hadoop ecosystem. For example, the Delta Lake project is being built on Parquet files.
Arrow is an important project that makes it easy to work with Parquet files with a variety of different languages (C, C++, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust), but doesn't support Avro. Parquet files are easier to work with because they are supported by so many different projects.
Schema
Parquet stores the file schema in the file metadata. CSV files don't store file metadata, so readers need to either be supplied with the schema or the schema needs to be inferred. Supplying a schema is tedious and inferring a schema is error prone / expensive.
Avro also stores the data schema in the file itself. Having schema in the files is a huge advantage and is one of the reasons why a modern data project should not rely on JSON or CSV.
Column metadata
Parquet stores metadata statistics for each column and lets users add their own column metadata as well.
The min / max column value metadata allows for Parquet predicate pushdown filtering that's supported by the Dask & Spark cluster computing frameworks.
Here's how to fetch the column statistics with PyArrow.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('some_file.parquet')
print(parquet_file.metadata.row_group(0).column(1).statistics)
<pyarrow._parquet.Statistics object at 0x11ac17eb0>
has_min_max: True
min: 1
max: 9
null_count: 0
distinct_count: 0
num_values: 3
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
Complex column types
Parquet allows for complex column types like arrays, dictionaries, and nested schemas. There isn't a reliable method to store complex types in simple file formats like CSVs.
Compression
Columnar file formats store related types in rows, so they're easier to compress. This CSV file is relatively hard to compress.
first_name,age
ken,30
felicia,36
mia,2
This data is easier to compress when the related types are stored in the same row:
ken,felicia,mia
30,36,2
Parquet files are most commonly compressed with the Snappy compression algorithm. Snappy compressed files are splittable and quick to inflate. Big data systems want to reduce file size on disk, but also want to make it quick to inflate the flies and run analytical queries.
Mutable nature of file
Parquet files are immutable, as described here. CSV files are mutable.
Adding a row to a CSV file is easy. You can't easily add a row to a Parquet file.
Data lakes
In a big data environment, you'll be working with hundreds or thousands of Parquet files. Disk partitioning of the files, avoiding big files, and compacting small files is important. The optimal disk layout of data depends on your query patterns.
Source: Stackoverflow.com