1
0

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873)

* [HUDI-2634] Improved the metadata table bootstrap for very large tables.

Following improvements are implemented:
1. Memory overhead reduction:
  - Existing code caches FileStatus for each file in memory.
  - Created a new class DirectoryInfo which is used to cache a director's file list with parts of the FileStatus (only filename and file len). This reduces the memory requirements.

2. Improved parallelism:
  - Existing code collects all the listing to the Driver and then creates HoodieRecord on the Driver.
  - This takes a long time for large tables (11million HoodieRecords to be created)
  - Created a new function in SparkRDDWriteClient specifically for bootstrap commit. In it, the HoodieRecord creation is parallelized across executors so it completes fast.

3. Fixed setting to limit the number of parallel listings:
  - Existing code had a bug wherein 1500 executors were hardcoded to perform listing. This leads to exception due to limit in the spark's result memory.
  - Corrected the use of the config.

Result:
Dataset has 1299 partitions and 12Million files.
file listing time=1.5mins
HoodieRecord creation time=13seconds
deltacommit duration=2.6mins

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
This commit is contained in:
Prashant Wason
2021-11-10 19:37:48 -08:00
committed by GitHub
parent 90f9b4562a
commit 77b0440eb4
8 changed files with 188 additions and 76 deletions

View File

@@ -21,6 +21,7 @@ package org.apache.hudi.metadata;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.hudi.client.HoodieFlinkWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.data.HoodieData;
import org.apache.hudi.common.engine.HoodieEngineContext;
import org.apache.hudi.common.metrics.Registry;
import org.apache.hudi.common.model.FileSlice;
@@ -91,8 +92,9 @@ public class FlinkHoodieBackedTableMetadataWriter extends HoodieBackedTableMetad
}
@Override
protected void commit(List<HoodieRecord> records, String partitionName, String instantTime, boolean canTriggerTableService) {
protected void commit(HoodieData<HoodieRecord> hoodieDataRecords, String partitionName, String instantTime, boolean canTriggerTableService) {
ValidationUtils.checkState(enabled, "Metadata table cannot be committed to as it is not enabled");
List<HoodieRecord> records = (List<HoodieRecord>) hoodieDataRecords.get();
List<HoodieRecord> recordList = prepRecords(records, partitionName, 1);
try (HoodieFlinkWriteClient writeClient = new HoodieFlinkWriteClient(engineContext, metadataWriteConfig)) {