- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables
Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
- Support Glue catalog and other metastore implementations
- Remove shading from hudi utilities bundle
- Add maven profile to optionally shade hive in utilities bundle
- Add spotless format fixing to project
- One time reformatting for conformity
- Build fails for formatting changes and mvn spotless:apply autofixes them
1. Remove dnl utils jar from git
2. Add LICENSE Headers in missing files
3. Fix NOTICE and LICENSE in all HUDI packages and in top-level
4. Fix License wording in certain HUDI source files
5. Include non java/scala code in RAT licensing check
6. Use whitelist to include dependencies as part of timeline-server bundling
- Documented principles applied for redesign at packaging/README.md
- No longer depends on incl commons-codec, commons-io, commons-pool, commons-dbcp, commons-lang, commons-logging, avro-mapred
- Introduce new FileIOUtils & added checkstyle rule for illegal import of above
- Parquet, Avro dependencies moved to provided scope to enable being picked up from Hive/Spark/Presto instead
- Pickup jackson jars for Hive sync tool from HIVE_HOME & unbundling jackson everywhere
- Remove hive-jdbc standalone jar from being bundled in Spark/Hive/Utilities bundles
- 6.5x reduced number of classes across bundles
- Redo all classes based on org.parquet only
- remove unuused dependencies like parquet-hadoop, common-configuration2
- timeline-service does not build a fat jar anymore
- Fix utilities and hadoop-mr bundles based on above
- [HUDI-172] Cleanup Maven POM/Classpath
- Fix ordering of dependencies in poms, to enable better resolution
- Idea is to place more specific ones at the top
- And place dependencies which use them below them
- [HUDI-68] : Automate demo steps on docker setup
- Move hive queries from hive cli to beeline
- Standardize on taking query input from text command files
- Deltastreamer ingest, also does hive sync in a single step
- Spark Incremental Query materialized as a derived Hive table using datasource
- Fix flakiness in HDFS spin up and output comparison
- Code cleanup around streamlining and loc reduction
- Also fixed pom to not shade some hive classs in spark, to enable hive sync