hudi/hudi-client at 77b0440eb4fdbb01d41aa27b4fa2a8d45a20b293 - hudi

Prashant Wason 77b0440eb4 [HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873)

* [HUDI-2634] Improved the metadata table bootstrap for very large tables.

The following improvements are implemented:
1. Memory overhead reduction:
  - Existing code caches the FileStatus for each file in memory.
  - Created a new class DirectoryInfo which is used to cache a directory's file list with only parts of the FileStatus (filename and file length). This reduces the memory requirements.

2. Improved parallelism:
  - Existing code collects all the listings on the driver and then creates the HoodieRecords on the driver.
  - This takes a long time for large tables (11 million HoodieRecords to be created).
  - Created a new function in SparkRDDWriteClient specifically for the bootstrap commit. In it, HoodieRecord creation is parallelized across executors so it completes quickly.

3. Fixed the setting that limits the number of parallel listings:
  - Existing code had a bug wherein 1500 executors were hardcoded to perform the listing. This leads to an exception due to the limit on Spark's result memory.
  - Corrected the use of the config.
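
To illustrate points 1 and 2 above, here is a minimal sketch, not the actual Hudi code. The names DirectoryInfoSketch and BootstrapSketch are hypothetical, the "record" is a plain string rather than a HoodieRecord, and a Java parallel stream merely stands in for distributing the work across Spark executors.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the DirectoryInfo idea: per directory, keep
// only each file's name and length instead of a full FileStatus object.
final class DirectoryInfoSketch {
    private final String relativePath;
    private final Map<String, Long> fileNameToLength = new HashMap<>();

    DirectoryInfoSketch(String relativePath) {
        this.relativePath = relativePath;
    }

    void addFile(String fileName, long lengthInBytes) {
        fileNameToLength.put(fileName, lengthInBytes);
    }

    String relativePath() {
        return relativePath;
    }

    int totalFiles() {
        return fileNameToLength.size();
    }

    long totalBytes() {
        return fileNameToLength.values().stream().mapToLong(Long::longValue).sum();
    }

    Map<String, Long> files() {
        return Collections.unmodifiableMap(fileNameToLength);
    }
}

final class BootstrapSketch {
    // Builds one string "record" per file. The parallel stream is only an
    // analogy for creating records on executors instead of on the driver.
    static List<String> toRecords(List<DirectoryInfoSketch> directories) {
        return directories.parallelStream()
                .flatMap(dir -> dir.files().entrySet().stream()
                        .map(e -> dir.relativePath() + "/" + e.getKey() + ":" + e.getValue()))
                .sorted()
                .collect(Collectors.toList());
    }
}
```

In this sketch each directory costs one small map entry per file, and record creation is expressed per directory so it can be spread over workers.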

Result:
Dataset has 1299 partitions and 12 million files.
file listing time = 1.5 mins
HoodieRecord creation time = 13 seconds
deltacommit duration = 2.6 mins

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>

2021-11-10 22:37:48 -05:00

hudi-client-common

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873)

2021-11-10 22:37:48 -05:00

hudi-flink-client

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873)

2021-11-10 22:37:48 -05:00

hudi-java-client

[HUDI-1877] Support records staying in same fileId after clustering (#3833)

2021-11-09 20:17:50 -08:00

hudi-spark-client

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873)

2021-11-10 22:37:48 -05:00

pom.xml

Moving to 0.10.0-SNAPSHOT on master branch.

2021-08-14 18:51:09 -07:00