Snappy compression format

Our Apache Hadoop® based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for both interactive and long-running queries, serving the myriad needs of different teams at Uber.

Uber's growth over the last few years exponentially increased both the volume of data and the associated access loads required to process it. As data volume grows, so do the associated storage and compute costs, resulting in growing hardware purchasing requirements, higher resource usage, and even out-of-memory (OOM) errors or high GC pauses. The main goal of this blog is to address storage cost efficiency, but the side benefits also include reduced CPU, IO, and network consumption. We started several initiatives to reduce storage cost, including setting a TTL (Time to Live) on old partitions, moving data from hot/warm to cold storage, and reducing data size at the file format level. In this blog, we will focus on reducing the data size in storage at the file format level, essentially in Parquet.

Uber data is ingested into HDFS and registered as either raw or modeled tables, mainly in the Parquet format, with a small portion in the ORC file format. Our initiatives and the discussion in this blog are around Parquet.

Apache Parquet is a columnar storage file format that supports nested data for analytic frameworks. In a columnar storage format, the data is stored such that each value of a column sits next to the other values from that same column. This makes the subsequent encoding and compression of the column data more efficient. The Parquet format is depicted in the diagram below. The file is divided into row groups, and each row group consists of one column chunk per column. Each column chunk is in turn divided into pages, on which the encoding and the compression are performed.

Figure 1: Apache Parquet File Format Structure

When a Parquet file is written, the column data is assembled into pages, which are the unit for encoding and compression. Parquet provides a variety of encoding mechanisms, including plain, dictionary encoding, run-length encoding (RLE), delta encoding, and byte stream split encoding. Most encodings favor long runs of repeated values (such as continuous zeros) because they improve encoding efficiency. After encoding, compression makes the data smaller still, without losing information. Parquet supports several compression methods, including SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. Generally, choosing the right compression method is a trade-off between compression ratio and speed for reading and writing.
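To make this structure concrete, here is a minimal sketch of how one might inspect a file's footer metadata, assuming the pyarrow library and a hypothetical file name (neither is prescribed by this post). It prints the row groups, the column chunks within each, and the encodings and compression codec recorded for each chunk:

```python
import pyarrow.parquet as pq

# Hypothetical file name; any Parquet file will do.
pf = pq.ParquetFile("example.parquet")
meta = pf.metadata

print(f"row groups: {meta.num_row_groups}, columns: {meta.num_columns}")

for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    print(f"row group {rg_idx}: {rg.num_rows} rows")
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)  # metadata for one column chunk
        ratio = col.total_uncompressed_size / max(col.total_compressed_size, 1)
        print(f"  {col.path_in_schema}: codec={col.compression}, "
              f"encodings={col.encodings}, ratio={ratio:.2f}x")
```

Running a script like this against a production table is a quick way to confirm which codec and encodings a writer actually produced.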
Cost Efficiency Initiatives at the File Format (Parquet)

Our data platform leverages Apache Hive, Apache Presto, and Apache Spark for both interactive and long-running queries. Uber's data lake ingestion platform uses Apache Hudi, which was bootstrapped at Uber, as the table format, and Parquet is the first-class file format of Hudi. Parquet is also a first-class citizen in Spark as the storage format, and Presto likewise favors Parquet, with tons of optimizations built around it. Hive is the only engine that still produces data in the ORC format, and a significant portion of Hive-generated tables are in Parquet as well. In short, Parquet is the major file format at Uber, so the initiatives discussed below are around Parquet optimization, or adding Parquet tools, to achieve the goal of storage cost efficiency.

Bring ZSTD (Zstandard) Compression to Uber Data Lake

ZSTD Introduction

GZIP and SNAPPY compression were mainly used in our data lake before this initiative. ZSTD can achieve a higher compression ratio while maintaining reasonable compression and decompression speed (more information can be found here). It provides compression levels from 1 to 22, so that users can choose a level as a trade-off between compression ratio and compression speed. Generally, when the compression level increases, so does the compression ratio, at the cost of compression speed.

We ran a benchmark of the storage size reduction and the compute (vCore) gain/loss for reading the data after compression, compared with GZIP/SNAPPY. We picked the top 10 Hive tables by size, converted them to ZSTD, and observed the following size reductions: from GZIP to ZSTD (default compression level 3, hereinafter for all other conversions), we observed a ~7% drop in storage size, while from SNAPPY to ZSTD we saw a ~39% reduction.
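To see the level trade-off for yourself, here is a small sketch using the zstandard Python package (our choice for illustration; the post does not name a library) that compresses the same buffer at several levels and reports the ratio and elapsed time:

```python
import time
import zstandard as zstd

# A synthetic, repetitive payload so the compressor has something to find.
data = b"ride_id,city,fare\n" + b"12345,san_francisco,7.25\n" * 50_000

for level in (1, 3, 10, 22):  # ZSTD supports levels 1 through 22
    cctx = zstd.ZstdCompressor(level=level)
    start = time.perf_counter()
    compressed = cctx.compress(data)
    elapsed = time.perf_counter() - start
    print(f"level {level:2d}: ratio={len(data) / len(compressed):6.1f}x, "
          f"time={elapsed * 1000:7.2f} ms")
```

As the level rises you should generally see the ratio climb while compression slows, matching the trade-off described above.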
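A file-level conversion like the one benchmarked above can be sketched with pyarrow: read a SNAPPY- (or GZIP-) compressed Parquet file, rewrite it with ZSTD at the default level 3 used for the conversions above, and compare sizes on disk (file names here are hypothetical):

```python
import os
import pyarrow.parquet as pq

# Hypothetical file names standing in for one of the converted tables.
src, dst = "table_snappy.parquet", "table_zstd.parquet"

table = pq.read_table(src)
pq.write_table(table, dst, compression="zstd", compression_level=3)

src_size, dst_size = os.path.getsize(src), os.path.getsize(dst)
print(f"{src_size} -> {dst_size} bytes "
      f"({(1 - dst_size / src_size) * 100:.1f}% smaller)")
```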
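For tables written through Spark, switching the output codec is a one-line configuration. A minimal sketch, assuming hypothetical paths and a Spark build whose Parquet writer supports ZSTD (Spark 3.x does out of the box):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewrite-to-zstd").getOrCreate()

# Standard Spark option selecting the Parquet output codec.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# Hypothetical paths: rewrite a SNAPPY/GZIP table as ZSTD.
df = spark.read.parquet("/data/lake/top_table")
df.write.mode("overwrite").parquet("/data/lake/top_table_zstd")
```

The level itself is controlled by the underlying Parquet writer; in parquet-mr it can typically be tuned via the parquet.compression.codec.zstd.level property, whose default of 3 matches the conversions above.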