
[HUDI-1757] Assigns the buckets by record key for Flink writer (#2757)

Currently we assign the buckets by record partition path, which can
cause hotspots when the partition field is a datetime type. This change
assigns buckets by grouping the records by their key first; the assignment
is valid only if there is no conflict (two tasks writing to the same bucket).
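The effect of hashing on the record key instead of the partition path can be sketched with a small, self-contained example (illustrative only, not Hudi's actual hashing code; `taskFor` is a hypothetical helper): when every record lands in the same datetime partition, partition-path hashing sends all of them to one task, while record-key hashing spreads them across the parallelism.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketAssignDemo {
    // Hypothetical sketch: choose a write task by hashing some field of the record.
    static int taskFor(String hashField, int parallelism) {
        return Math.floorMod(hashField.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        String hotPartition = "2021-04-06"; // datetime partition shared by all records

        Set<Integer> tasksByPartition = new HashSet<>();
        Set<Integer> tasksByKey = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            String recordKey = "uuid-" + i;
            tasksByPartition.add(taskFor(hotPartition, parallelism)); // old scheme: hotspot
            tasksByKey.add(taskFor(recordKey, parallelism));          // new scheme: spread out
        }
        System.out.println("tasks used, partition-path hashing: " + tasksByPartition.size());
        System.out.println("tasks used, record-key hashing: " + tasksByKey.size());
    }
}
```

With distinct record keys, the key-based scheme uses all four tasks, while the partition-based scheme funnels every record to a single task.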

This patch also changes the coordinator execution to be asynchronous.
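The asynchronous coordinator execution mentioned above can be sketched roughly as follows (an assumption about the pattern, not the commit's actual coordinator code; class and method names are hypothetical): events are handed to a single-threaded executor so the caller is never blocked, while single-threaded execution keeps event handling sequential without extra locking.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCoordinatorSketch {
    // One worker thread: handling stays sequential, callers stay non-blocking.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final AtomicInteger handled = new AtomicInteger();

    // Queue the event and return immediately; the executor processes it later.
    void handleEventAsync(String event) {
        executor.execute(handled::incrementAndGet);
    }

    // Stop accepting events and wait for queued ones to finish.
    int drain() {
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled.get();
    }

    public static void main(String[] args) {
        AsyncCoordinatorSketch coordinator = new AsyncCoordinatorSketch();
        for (int i = 0; i < 10; i++) {
            coordinator.handleEventAsync("write-metadata-" + i);
        }
        System.out.println("events handled: " + coordinator.drain());
    }
}
```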
Danny Chan
2021-04-06 19:06:41 +08:00
committed by GitHub
parent 920537cac8
commit 9c369c607d
25 changed files with 638 additions and 400 deletions


@@ -260,11 +260,17 @@ public class HoodieFlinkWriteClient<T extends HoodieRecordPayload> extends
    * but cleaning action should trigger after all the write actions within a
    * checkpoint finish.
    *
-   * @param instantTime The latest successful commit time
+   * @param table         Table to commit on
+   * @param metadata      Commit Metadata corresponding to committed instant
+   * @param instantTime   Instant Time
+   * @param extraMetadata Additional Metadata passed by user
    */
-  public void postCommit(String instantTime) {
+  @Override
+  protected void postCommit(HoodieTable<T, List<HoodieRecord<T>>, List<HoodieKey>, List<WriteStatus>> table,
+                            HoodieCommitMetadata metadata,
+                            String instantTime,
+                            Option<Map<String, String>> extraMetadata) {
     try {
-      HoodieTable<?, ?, ?, ?> table = createTable(config, hadoopConf);
       // Delete the marker directory for the instant.
       new MarkerFiles(createTable(config, hadoopConf), instantTime)
           .quietDeleteMarkerDir(context, config.getMarkersDeleteParallelism());