
Prashant Wason 77b0440eb4 [HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873 )

* [HUDI-2634] Improved the metadata table bootstrap for very large tables.

The following improvements are implemented:
1. Memory overhead reduction:
  - Existing code caches the FileStatus for each file in memory.
  - Created a new class, DirectoryInfo, which caches a directory's file list with only parts of the FileStatus (filename and file length). This reduces the memory requirements.

2. Improved parallelism:
  - Existing code collects all the listings on the driver and then creates the HoodieRecords on the driver.
  - This takes a long time for large tables (11 million HoodieRecords to be created).
  - Created a new function in SparkRDDWriteClient specifically for the bootstrap commit. In it, HoodieRecord creation is parallelized across executors, so it completes quickly.

3. Fixed setting to limit the number of parallel listings:
  - Existing code had a bug wherein 1500 executors were hardcoded to perform the listing. This leads to an exception due to the limit on Spark's result memory.
  - Corrected the use of the config.
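
The first two improvements can be illustrated with a minimal, self-contained sketch. This is not the code from the commit: the class and field names below (DirectoryInfoSketch, FileEntry, PartitionRecord) are hypothetical, and a Java parallel stream stands in for Spark executors. It only shows the idea of keeping filename and length per directory and building the records in parallel rather than in a single loop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the actual Hudi DirectoryInfo class.
final class DirectoryInfoSketch {

  /** Stand-in for a full file listing entry, which would normally carry more fields. */
  static final class FileEntry {
    final String fileName;
    final long length;

    FileEntry(String fileName, long length) {
      this.fileName = fileName;
      this.length = length;
    }
  }

  /** Caches only filename -> length for one directory instead of full status objects. */
  static final class DirectoryInfo {
    final String partition;
    private final Map<String, Long> fileNameToLength = new HashMap<>();

    DirectoryInfo(String partition, List<FileEntry> listing) {
      this.partition = partition;
      for (FileEntry entry : listing) {
        fileNameToLength.put(entry.fileName, entry.length);
      }
    }

    int totalFiles() {
      return fileNameToLength.size();
    }

    long totalBytes() {
      long sum = 0L;
      for (long length : fileNameToLength.values()) {
        sum += length;
      }
      return sum;
    }
  }

  /** Stand-in for a metadata record: one summary per partition. */
  static final class PartitionRecord {
    final String partition;
    final int fileCount;
    final long totalBytes;

    PartitionRecord(String partition, int fileCount, long totalBytes) {
      this.partition = partition;
      this.fileCount = fileCount;
      this.totalBytes = totalBytes;
    }
  }

  /** Builds records in parallel; the parallel stream plays the role of the executors. */
  static List<PartitionRecord> toRecords(List<DirectoryInfo> directories) {
    return directories.parallelStream()
        .map(d -> new PartitionRecord(d.partition, d.totalFiles(), d.totalBytes()))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<DirectoryInfo> directories = Arrays.asList(
        new DirectoryInfo("2021/11/09", Arrays.asList(
            new FileEntry("a.parquet", 100L), new FileEntry("b.parquet", 250L))),
        new DirectoryInfo("2021/11/10", Arrays.asList(
            new FileEntry("c.parquet", 75L))));

    for (PartitionRecord record : toRecords(directories)) {
      System.out.println(record.partition + " files=" + record.fileCount
          + " bytes=" + record.totalBytes);
    }
  }
}
```

In the real change the work is distributed by Spark, so the memory and time savings depend on cluster settings that this sketch does not model.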

Result:
The dataset has 1299 partitions and 12 million files.
File listing time: 1.5 min
HoodieRecord creation time: 13 s
Deltacommit duration: 2.6 min

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>

2021-11-10 22:37:48 -05:00

.github

[MINOR] fix typo,'SPAKR' corrected to 'SPARK' (#3721 )

2021-09-26 21:52:35 +08:00

dependencies

[HUDI-2643] Remove duplicated hbase-common with tests classifier exists in bundles (#3886 )

2021-11-01 20:11:00 +08:00

docker

[HUDI-2530] Adding async compaction support to integ test suite framework (#3750 )

2021-10-08 11:30:48 -04:00

hudi-cli

[HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819 )

2021-11-05 09:31:42 -04:00

hudi-client

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873 )

2021-11-10 22:37:48 -05:00

hudi-common

[HUDI-2634] Improved the metadata table bootstrap for very large tables. (#3873 )

2021-11-10 22:37:48 -05:00

hudi-examples

HUDI-1827 : Add ORC support in Bootstrap Op (#3457 )

2021-11-06 12:23:20 -04:00

hudi-flink

[HUDI-2685] Support scheduling online compaction plan when there are no commit data (#3928 )

2021-11-11 10:13:21 +08:00

hudi-hadoop-mr

[HUDI-2086]redo the logical of mor_incremental_view for hive (#3203 )

2021-11-10 15:41:07 +08:00

hudi-integ-test

[HUDI-2614] Remove duplicated hadoop-hdfs with tests classifier exists in bundles (#3864 )

2021-10-26 22:36:10 +08:00

hudi-kafka-connect

[MINOR] Fix README for hudi-kafka-connect (#3858 )

2021-10-26 14:45:52 -07:00

hudi-spark-datasource

[HUDI-2442] Change default values for certain clustering configs (#3875 )

2021-11-10 14:23:24 +05:30

hudi-sync

[HUDI-2676] Hudi should synchronize owner information to hudi _rt/_ro table. (#3911 )

2021-11-03 20:36:01 +08:00

hudi-timeline-service

[HUDI-2600] Remove duplicated hadoop-common with tests classifier exists in bundles (#3847 )

2021-10-25 13:45:28 +08:00

hudi-utilities

[HUDI-2579] Make deltastreamer checkpoint state merging more explicit (#3820 )

2021-11-09 17:37:59 -05:00

packaging

[HUDI-2684] Use DefaultHoodieRecordPayload when precombine field is specified specifically (#3922 )

2021-11-04 16:23:36 +08:00

rfc

[MINOR] Add links to all the existing RFCs in rfc/README.md (#3876 )

2021-10-27 05:25:19 -07:00

scripts

[HUDI-2507] Generate more dependency list file for other bundles (#3773 )

2021-10-21 14:10:01 +08:00

style

[HUDI-2307] When using delete_partition with ds should not rely on the primary key (#3469 )

2021-08-14 02:53:39 -04:00

.asf.yaml

[MINOR] Remove commenting from Github, JIRA bridge (#3620 )

2021-09-07 21:54:58 -07:00

.codecov.yml

[MINOR] Disable codecov (#3314 )

2021-07-20 22:07:22 -07:00

.gitignore

[HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592 )

2021-09-10 18:20:26 -07:00

azure-pipelines.yml

[HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409 )

2021-09-07 09:44:30 -07:00

doap_HUDI.rdf

[MINOR] Update DOAP with 0.9.0 Release (#3537 )

2021-08-25 16:57:05 -04:00

LICENSE

[HUDI-764] [HUDI-765] ORC reader writer Implementation (#2999 )

2021-06-15 15:21:43 -07:00

NOTICE

[HUDI-764] [HUDI-765] ORC reader writer Implementation (#2999 )

2021-06-15 15:21:43 -07:00

pom.xml

[HUDI-1869] Upgrading Spark3 To 3.1 (#3844 )

2021-11-02 18:25:12 -07:00

README.md

[HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409 )

2021-09-07 09:44:30 -07:00

README.md

Apache Hudi

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).

https://hudi.apache.org/

Features

Upsert support with fast, pluggable indexing
Atomically publish data with rollback support
Snapshot isolation between writer & queries
Savepoints for data recovery
Manages file sizes, layout using statistics
Async compaction of row & columnar data
Timeline metadata to track lineage
Optimize data lake layout with clustering

Hudi supports three types of queries:

Snapshot Query - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g. Parquet + Avro).
Incremental Query - Provides a change stream with records inserted or updated after a point in time.
Read Optimized Query - Provides excellent snapshot query performance via purely columnar storage (e.g. Parquet).
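
As a rough sketch of how a reader might select one of these query types, the snippet below builds the option maps that would be passed to the Hudi Spark datasource. The option keys and values are given as recalled from the Hudi Spark datasource configuration and should be checked against the docs for your Hudi version; the class name, helper methods, and the begin instant value are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; only the option keys and values refer to Hudi itself.
final class QueryTypeOptionsSketch {

  // Assumed option keys; verify against the Hudi configuration reference.
  static final String QUERY_TYPE_KEY = "hoodie.datasource.query.type";
  static final String BEGIN_INSTANT_KEY = "hoodie.datasource.read.begin.instanttime";

  /** Options for a snapshot query. */
  static Map<String, String> snapshot() {
    Map<String, String> options = new HashMap<>();
    options.put(QUERY_TYPE_KEY, "snapshot");
    return options;
  }

  /** Options for an incremental query: records changed after beginInstant. */
  static Map<String, String> incremental(String beginInstant) {
    Map<String, String> options = new HashMap<>();
    options.put(QUERY_TYPE_KEY, "incremental");
    options.put(BEGIN_INSTANT_KEY, beginInstant);
    return options;
  }

  /** Options for a read optimized query (columnar base files only). */
  static Map<String, String> readOptimized() {
    Map<String, String> options = new HashMap<>();
    options.put(QUERY_TYPE_KEY, "read_optimized");
    return options;
  }

  public static void main(String[] args) {
    // The instant value below is a made-up example timestamp.
    System.out.println(snapshot());
    System.out.println(incremental("20211110223748"));
    System.out.println(readOptimized());
  }
}
```

With Spark on the classpath, such a map would typically be handed to a DataFrame reader via its options method before loading the table path.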

Learn more about Hudi at https://hudi.apache.org

Building Apache Hudi from source

Prerequisites for building Apache Hudi:

Unix-like system (such as Linux or Mac OS X)
Java 8 (Java 9 or 10 may work)
Git
Maven

# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

To build the Javadoc for all Java and Scala classes:

# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs

Build with Scala 2.12

The default supported Scala version is 2.11. To build for Scala 2.12, use the scala-2.12 profile:

mvn clean package -DskipTests -Dscala-2.12

Build with Spark 3.0.0

The default supported Spark version is 2.4.4. To build for Spark 3.0.0, use the spark3 profile:

mvn clean package -DskipTests -Dspark3

Build without spark-avro module

The default hudi-jar bundles the spark-avro module. To build without the spark-avro module, use the spark-shade-unbundle-avro profile:

# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

Running Tests

Unit tests can be run with the Maven profile unit-tests.

mvn -Punit-tests test

Functional tests, which are tagged with @Tag("functional"), can be run with the Maven profile functional-tests.

mvn -Pfunctional-tests test

To run tests with Spark event logging enabled, define the Spark event log directory. This allows you to visualize the test DAG and stages in the Spark History Server UI.

mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log

Quickstart

Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.
