* Renamed `ZCurveOptimizeHelper` to `ZOrderingIndexHelper`;
Moved Z-index helper under `hudi.index.zorder` package
* Tidying up `ZOrderingIndexHelper`
* Fixing compilation
* Fixed index new/original table merging sequence to always prefer values from new index;
Cleaned up `HoodieSparkUtils`
* Added test for `mergeIndexSql`
* Abstracted Z-index name composition w/in `ZOrderingIndexHelper`;
* Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference
* Properly handle exceptions originating during pruning in `HoodieFileIndex`
* Make sure no errors are logged upon encountering `AnalysisException`
* Cleaned up Z-index updating sequence;
Tidying up comments, java-docs;
* Fixed Z-index to properly handle changes of the list of clustered columns
* Tidying up
* `lint`
* Suppressing `JavaDocStyle` first sentence check
* Fixed compilation
* Fixing incorrect `DecimalType` conversion
* Refactored test `TestTableLayoutOptimization`
- Added Z-index table composition test (against fixtures)
- Separated out GC test;
Tidying up
* Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one loaded from JSON
* Scaffolded `DataTypeUtils` to do basic checks of Spark types;
Added proper compatibility checking b/w old/new index-tables
* Added test for Z-index tables merging
* Fixed import being shaded by creating internal `hudi.util` package
* Fixed packaging for `TestOptimizeTable`
* Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema
* Make sure existing Z-index table schema is sync'd to source table's one
* Fixed shaded refs
* Fixed tests
* Fixed type conversion of Parquet provided metadata values into Spark expected schemas
* Fixed `composeIndexSchema` utility to propose proper schema
* Added more tests for Z-index:
- Checking that Z-index table is built correctly
- Checking that Z-index tables are merged correctly (during update)
* Fixing source table
* Fixing tests to read from Parquet w/ proper schema
* Refactored `ParquetUtils` utility reading stats from Parquet footers
* Fixed incorrect handling of Decimals extracted from Parquet footers
* Worked around issues in javac failing to compile stream's collection
* Fixed handling of `Date` type
* Fixed handling of `DateType` to be parsed as `LocalDate`
* Updated fixture;
Make sure test loads Z-index fixture using proper schema
* Removed superfluous schema adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well)
* Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe
* Tidying up
* Make sure schema is used upon reading to validate input files are in the appropriate format;
Tidying up;
* Worked around javac (1.8) inability to infer expression type properly
* Updated fixtures;
Tidying up
* Fixing compilation after rebase
* Assert that clustering has occurred in Z-order layout optimization testing
* Tidying up exception messages
* XXX
* Added test validating Z-index lookup filter correctness
* Added more test-cases;
Tidying up
* Added tests for string expressions
* Fixed incorrect Z-index filter lookup translations
* Added more test-cases
* Added proper handling of complex negations of AND/OR expressions by pushing NOT operator down into inner expressions for appropriate handling
* Added `-target:jvm-1.8` for `hudi-spark` module
* Adding more tests
* Added tests for non-indexed columns
* Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead
* Fixed tests
* Removing the parquet test files and disabling corresponding tests
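
The `DateType` and Decimal fixes above concern turning raw Parquet footer statistics into typed values. A minimal sketch of that decoding, assuming Parquet's documented encodings (DATE as days since the Unix epoch, DECIMAL as an unscaled integer plus a scale). `FooterValueDecoder` and its methods are hypothetical names for illustration and are not Hudi's `ParquetUtils` API:

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.LocalDate;

// Hypothetical helper (assumption for illustration); not part of Hudi.
final class FooterValueDecoder {
  private FooterValueDecoder() {}

  // Parquet's DATE logical type is an INT32 counting days since 1970-01-01.
  static LocalDate decodeDate(int daysSinceEpoch) {
    return LocalDate.ofEpochDay(daysSinceEpoch);
  }

  // Binary-backed DECIMAL values hold the unscaled value as big-endian
  // two's complement, which is exactly what BigInteger(byte[]) expects.
  static BigDecimal decodeDecimal(byte[] unscaledBigEndian, int scale) {
    return new BigDecimal(new BigInteger(unscaledBigEndian), scale);
  }

  // INT32/INT64-backed DECIMAL values carry the unscaled value directly.
  static BigDecimal decodeDecimal(long unscaled, int scale) {
    return BigDecimal.valueOf(unscaled, scale);
  }
}
```

Dropping the scale, or reading the bytes as an unsigned number, are two ways such a conversion can plausibly go wrong; whether either was the actual bug fixed here is not stated in this log.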
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
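
The filter-translation items above (pushing NOT down through AND/OR, and rewriting predicates on non-indexed columns as `TrueLiteral`) can be illustrated with a toy expression tree. This is only a sketch under stated assumptions: the real code operates on Spark Catalyst expressions, and every class and method name below (`FilterRewriteSketch`, `Pred`, `pushNotDown`, `relax`, `translate`) is hypothetical.

```java
import java.util.Set;

// Hypothetical mini expression tree (assumption); not Hudi or Spark classes.
final class FilterRewriteSketch {
  interface Expr {}

  static final class Pred implements Expr { // leaf predicate over one column
    final String column;
    Pred(String column) { this.column = column; }
    @Override public String toString() { return column; }
  }

  static final class Not implements Expr {
    final Expr child;
    Not(Expr child) { this.child = child; }
    @Override public String toString() { return "NOT(" + child + ")"; }
  }

  static final class And implements Expr {
    final Expr left, right;
    And(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() { return "(" + left + " AND " + right + ")"; }
  }

  static final class Or implements Expr {
    final Expr left, right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() { return "(" + left + " OR " + right + ")"; }
  }

  static final Expr TRUE = new Expr() {
    @Override public String toString() { return "TRUE"; }
  };

  // Applies De Morgan's laws until NOT only wraps leaf predicates.
  static Expr pushNotDown(Expr e) {
    if (e instanceof Not) {
      Expr c = ((Not) e).child;
      if (c instanceof Not) return pushNotDown(((Not) c).child);
      if (c instanceof And) {
        return new Or(pushNotDown(new Not(((And) c).left)), pushNotDown(new Not(((And) c).right)));
      }
      if (c instanceof Or) {
        return new And(pushNotDown(new Not(((Or) c).left)), pushNotDown(new Not(((Or) c).right)));
      }
      return e; // NOT over a leaf stays as-is
    }
    if (e instanceof And) return new And(pushNotDown(((And) e).left), pushNotDown(((And) e).right));
    if (e instanceof Or) return new Or(pushNotDown(((Or) e).left), pushNotDown(((Or) e).right));
    return e;
  }

  // Expects pushed-down input. A leaf (negated or not) over a non-indexed
  // column becomes TRUE, so it can never cause a file to be pruned.
  static Expr relax(Expr e, Set<String> indexed) {
    if (e instanceof And) return new And(relax(((And) e).left, indexed), relax(((And) e).right, indexed));
    if (e instanceof Or) return new Or(relax(((Or) e).left, indexed), relax(((Or) e).right, indexed));
    Expr leaf = (e instanceof Not) ? ((Not) e).child : e;
    if (leaf instanceof Pred && !indexed.contains(((Pred) leaf).column)) return TRUE;
    return e;
  }

  static Expr translate(Expr e, Set<String> indexed) {
    return relax(pushNotDown(e), indexed);
  }
}
```

The order matters in this sketch: replacing a non-indexed leaf with TRUE before pushing NOT down would turn `NOT(TRUE)` into an always-false filter and prune files that may still contain matching rows.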
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules
- Simple usages of Spark using jsc.parallelize() have been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
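
The module split described above can be pictured with a deliberately simplified sketch. It assumes an engine abstraction exposing a `map` over a list with a parallelism hint; the class names (`EngineContextSketch`, `LocalEngineContextSketch`, `BaseNameLengthsSketch`) are hypothetical and do not mirror the actual signatures in `hudi-client-common`.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine abstraction (assumption); stands in for the
// EngineContext#map style of call mentioned in the list above.
abstract class EngineContextSketch {
  abstract <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Engine-free implementation used here so the example runs without Spark.
// A Spark-backed subclass would live in the Spark module and delegate to
// jsc.parallelize(data, parallelism) followed by map and collect.
final class LocalEngineContextSketch extends EngineContextSketch {
  @Override
  <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    return data.stream().map(func).collect(Collectors.toList());
  }
}

// "BaseXYZ"-style class: depends only on the abstraction, never on Spark.
class BaseNameLengthsSketch {
  private final EngineContextSketch context;

  BaseNameLengthsSketch(EngineContextSketch context) {
    this.context = context;
  }

  List<Integer> lengths(List<String> names) {
    return context.map(names, String::length, Math.max(1, names.size()));
  }
}
```

With this shape, code in the common module compiles without any Spark classes on the classpath, and only the engine-specific subclass needs the Spark dependency.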
- Add spotless format fixing to project
- One time reformatting for conformity
- Build fails for formatting changes and mvn spotless:apply autofixes them
- Documented principles applied for redesign at packaging/README.md
- No longer depends on or includes commons-codec, commons-io, commons-pool, commons-dbcp, commons-lang, commons-logging, avro-mapred
- Introduce new FileIOUtils & added checkstyle rule for illegal import of above
- Parquet, Avro dependencies moved to provided scope to enable being picked up from Hive/Spark/Presto instead
- Pickup jackson jars for Hive sync tool from HIVE_HOME & unbundling jackson everywhere
- Remove hive-jdbc standalone jar from being bundled in Spark/Hive/Utilities bundles
- Reduced the number of classes across bundles by 6.5x
The code-style rules follow google style with some changes:
1. Increase line length from 100 to 120
2. Disable JavaDoc-related checkstyle rules, as fixing them needs more manual work.
Both source and test code are checked for code-style