hudi/hudi-spark-datasource/hudi-spark at 7da97c8096ad4eabb49b42c86013679ed5009451 - hudi

Files

Alexey Kudinkin 4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)

As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1 TB) with very large commits, where Hudi's commit metadata could reach hundreds of MB.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is invoked, but before the returned Dataset is dereferenced) starts to dominate the execution time of some of our queries.

- Rebased onto new APIs to avoid excessive allocations of Hadoop's Path
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaned up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy
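
The last two items above (caching parsed commit metadata and deferring schema resolution) can be illustrated with a minimal sketch. This is not the code from the commit: ParsedMetadata, MetadataCache, Lazy and expensiveParse are hypothetical stand-ins introduced only for illustration, and the real HoodieCommitMetadata and TableSchemaResolver APIs are not used here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

class CommitMetadataCacheSketch {

    // Hypothetical stand-in for a parsed commit metadata object.
    static final class ParsedMetadata {
        final String instantTime;
        final String schema;

        ParsedMetadata(String instantTime, String schema) {
            this.instantTime = instantTime;
            this.schema = schema;
        }
    }

    // Memoizes the parse result per instant so each payload is parsed at most once.
    static final class MetadataCache {
        private final Map<String, ParsedMetadata> cache = new ConcurrentHashMap<>();
        private final Function<String, ParsedMetadata> parser;

        MetadataCache(Function<String, ParsedMetadata> parser) {
            this.parser = parser;
        }

        ParsedMetadata get(String instantTime) {
            return cache.computeIfAbsent(instantTime, parser);
        }
    }

    // Computes a value on first access only and reuses it afterwards.
    static final class Lazy<T> implements Supplier<T> {
        private Supplier<T> initializer;
        private T value;
        private boolean initialized;

        Lazy(Supplier<T> initializer) {
            this.initializer = initializer;
        }

        @Override
        public synchronized T get() {
            if (!initialized) {
                value = initializer.get();
                initialized = true;
                initializer = null; // release the initializer once the value is computed
            }
            return value;
        }
    }

    // Counts how often the (assumed expensive) parse actually runs.
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    // Hypothetical parser; a real one would deserialize a large metadata payload.
    static ParsedMetadata expensiveParse(String instantTime) {
        PARSE_COUNT.incrementAndGet();
        return new ParsedMetadata(instantTime, "schema-for-" + instantTime);
    }

    public static void main(String[] args) {
        MetadataCache cache = new MetadataCache(CommitMetadataCacheSketch::expensiveParse);

        // Nothing is parsed until the lazy schema is first requested.
        Lazy<String> tableSchema = new Lazy<>(() -> cache.get("20220606131426").schema);
        System.out.println("parses before access: " + PARSE_COUNT.get());

        tableSchema.get();
        tableSchema.get();
        cache.get("20220606131426");
        System.out.println("parses after access: " + PARSE_COUNT.get());
    }
}
```

In this sketch the cache is keyed by instant time, which assumes a given instant's metadata does not change once written; a cache over mutable inputs would need invalidation.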

2022-06-06 13:14:26 -04:00

src

[HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)

2022-06-06 13:14:26 -04:00

pom.xml

Moving to 0.12.0-SNAPSHOT on master branch.

2022-04-06 15:24:10 +08:00

run_hoodie_app.sh

[HUDI-1040] Make Hudi support Spark 3 (#2208)

2020-12-09 15:52:23 -08:00

run_hoodie_generate_app.sh

[HUDI-1040] Make Hudi support Spark 3 (#2208)

2020-12-09 15:52:23 -08:00

run_hoodie_streaming_app.sh

[HUDI-1040] Make Hudi support Spark 3 (#2208)

2020-12-09 15:52:23 -08:00