We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.
Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data.
Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites.
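To make the per-column idea concrete, here is a small illustrative sketch, not Parquet's actual implementation or API: each column of a toy columnar table is compressed with its own codec, so a highly repetitive column can use a heavy codec while a short column is left raw. The `encode_columns` and `decode_column` helpers and the codec table are hypothetical names introduced for this example.

```python
# Sketch of per-column compression: each column gets its own codec.
# This mimics the idea only; real Parquet stores codec choices in its
# column chunk metadata and uses its own page structure.
import bz2
import zlib

CODECS = {
    "gzip": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "none": (lambda b: b, lambda b: b),
}

def encode_columns(columns, codec_per_column):
    """Compress each column's raw bytes with the codec chosen for it."""
    out = {}
    for name, raw in columns.items():
        compress, _ = CODECS[codec_per_column.get(name, "none")]
        out[name] = compress(raw)
    return out

def decode_column(encoded, name, codec_per_column):
    """Reverse the per-column compression for a single column."""
    _, decompress = CODECS[codec_per_column.get(name, "none")]
    return decompress(encoded[name])

# A repetitive column benefits from gzip; a tiny one is left uncompressed.
table = {"user_id": b"42," * 1000, "flag": b"\x01"}
choice = {"user_id": "gzip", "flag": "none"}
enc = encode_columns(table, choice)
assert decode_column(enc, "user_id", choice) == table["user_id"]
assert len(enc["user_id"]) < len(table["user_id"])
```

Choosing the codec per column rather than per file is what lets a reader pay only for the columns it actually scans.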
We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.
The project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the Parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other Java-based utilities for interacting with Parquet.
The current stable version should always be available from Maven Central. The Thrift definitions can also be code-generated into any other Thrift-supported language. In the above example, there are N columns in this table, split into M row groups.
The file metadata contains the start locations of all the column chunk metadata.
More details on what is contained in the metadata can be found in the thrift files.
Metadata is written after the data to allow for single pass writing.
Readers are expected to first read the file metadata to find all the column chunks they are interested in.
The column chunks should then be read sequentially.
There are three types of metadata: file metadata, column (chunk) metadata, and page header metadata.
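The write-then-read flow above can be sketched with a toy footer-last file. This is a minimal illustration, not the real format: the leading and trailing 4-byte magic "PAR1", the trailing metadata block, and the 4-byte little-endian metadata length mirror the Parquet trailer layout, but the metadata payload here is plain JSON rather than Thrift, and `write_file`/`read_metadata` are hypothetical helpers.

```python
# Sketch of single-pass writing: column chunks first, then file metadata
# recording their start locations, then the metadata length and magic, so a
# reader can locate the footer by seeking from the end of the file.
import io
import json
import struct

MAGIC = b"PAR1"

def write_file(buf, column_chunks):
    buf.write(MAGIC)                         # leading magic number
    offsets = {}
    for name, data in column_chunks.items():
        offsets[name] = buf.tell()           # record each chunk's start location
        buf.write(data)
    meta = json.dumps(offsets).encode()      # file metadata (JSON stand-in)
    buf.write(meta)                          # metadata written after the data
    buf.write(struct.pack("<I", len(meta)))  # 4-byte metadata length
    buf.write(MAGIC)                         # trailing magic number

def read_metadata(buf):
    buf.seek(-8, io.SEEK_END)                # length field + trailing magic
    (meta_len,) = struct.unpack("<I", buf.read(4))
    assert buf.read(4) == MAGIC              # sanity-check the magic number
    buf.seek(-(8 + meta_len), io.SEEK_END)   # jump to the metadata start
    return json.loads(buf.read(meta_len))

buf = io.BytesIO()
write_file(buf, {"col_a": b"aaaa", "col_b": b"bb"})
offsets = read_metadata(buf)                 # read footer first...
buf.seek(offsets["col_a"])                   # ...then seek to a chosen chunk
assert buf.read(4) == b"aaaa"
```

Because the writer never seeks backwards, the file can be produced in a single streaming pass; the reader pays one seek to the end to discover where everything else lives.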