121 Commits

Author SHA1 Message Date
Alexey Kudinkin
5755ff25a4 [HUDI-2814] Addressing issues w/ Z-order Layout Optimization (#4060)
* `ZCurveOptimizeHelper` > `ZOrderingIndexHelper`;
Moved Z-index helper under `hudi.index.zorder` package

* Tidying up `ZOrderingIndexHelper`

* Fixing compilation

* Fixed index new/original table merging sequence to always prefer values from new index;
Cleaned up `HoodieSparkUtils`

* Added test for `mergeIndexSql`

* Abstracted Z-index name composition w/in `ZOrderingIndexHelper`;

* Fixed `DataSkippingUtils` to interrupt pruning in case data filter contains non-indexed column reference

* Properly handle exceptions originating during pruning in `HoodieFileIndex`

* Make sure no errors are logged upon encountering `AnalysisException`

* Cleaned up Z-index updating sequence;
Tidying up comments, java-docs;

* Fixed Z-index to properly handle changes of the list of clustered columns

* Tidying up

* `lint`

* Suppressing `JavaDocStyle` first sentence check

* Fixed compilation

* Fixing incorrect `DecimalType` conversion

* Refactored test `TestTableLayoutOptimization`
  - Added Z-index table composition test (against fixtures)
  - Separated out GC test;
Tidying up

* Fixed tests re-shuffling column order for Z-Index table `DataFrame` to align w/ the one loaded from JSON

* Scaffolded `DataTypeUtils` to do basic checks of Spark types;
Added proper compatibility checking b/w old/new index-tables

* Added test for Z-index tables merging

* Fixed import being shaded by creating internal `hudi.util` package

* Fixed packaging for `TestOptimizeTable`

* Revised `updateMetadataIndex` seq to provide Z-index updating process w/ source table schema

* Make sure existing Z-index table schema is sync'd to source table's one

* Fixed shaded refs

* Fixed tests

* Fixed type conversion of Parquet provided metadata values into Spark expected schemas

* Fixed `composeIndexSchema` utility to propose proper schema

* Added more tests for Z-index:
  - Checking that Z-index table is built correctly
  - Checking that Z-index tables are merged correctly (during update)

* Fixing source table

* Fixing tests to read from Parquet w/ proper schema

* Refactored `ParquetUtils` utility reading stats from Parquet footers

* Fixed incorrect handling of Decimals extracted from Parquet footers

* Worked around issues in javac failing to compile stream's collection

* Fixed handling of `Date` type

* Fixed handling of `DateType` to be parsed as `LocalDate`

* Updated fixture;
Make sure test loads Z-index fixture using proper schema

* Removed superfluous schema adjusting when reading from Parquet, since Spark is actually able to perfectly restore schema (given Parquet was previously written by Spark as well)

* Fixing race-condition in Parquet's `DateStringifier` trying to share `SimpleDateFormat` object which is inherently not thread-safe

* Tidying up

* Make sure schema is used upon reading to validate input files are in the appropriate format;
Tidying up;

* Worked around javac (1.8) inability to infer expression type properly

* Updated fixtures;
Tidying up

* Fixing compilation after rebase

* Assert clustering have in Z-order layout optimization testing

* Tidying up exception messages

* XXX

* Added test validating Z-index lookup filter correctness

* Added more test-cases;
Tidying up

* Added tests for string expressions

* Fixed incorrect Z-index filter lookup translations

* Added more test-cases

* Added proper handling of complex negations of AND/OR expressions by pushing the NOT operator down into inner expressions

* Added `-target:jvm-1.8` for `hudi-spark` module

* Adding more tests

* Added tests for non-indexed columns

* Properly handle non-indexed columns by falling back to a re-write of containing expression as `TrueLiteral` instead

* Fixed tests

* Removing the parquet test files and disabling corresponding tests

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-11-26 10:02:15 -08:00
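
The NOT push-down mentioned in the commit above can be illustrated with De Morgan's laws. The following is only a minimal sketch on a toy expression tree: it does not use Spark's Catalyst classes and is not the code from the commit, and every name in it (`NotPushDown`, `Pred`, `pushDown`, `show`) is hypothetical.

```java
// Illustrative only: a toy boolean-expression tree, not Spark's Catalyst API.
final class NotPushDown {
    interface Expr {}

    /** Leaf predicate such as "a"; `negated` marks a NOT applied directly to it. */
    static final class Pred implements Expr {
        final String name;
        final boolean negated;
        Pred(String name) { this(name, false); }
        Pred(String name, boolean negated) { this.name = name; this.negated = negated; }
    }

    static final class And implements Expr {
        final Expr left, right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Or implements Expr {
        final Expr left, right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Not implements Expr {
        final Expr child;
        Not(Expr child) { this.child = child; }
    }

    /** Returns an equivalent tree in which NOT appears only on leaf predicates. */
    static Expr pushDown(Expr e) {
        return rewrite(e, false);
    }

    // `negate` is true when an odd number of NOTs enclose the current node.
    private static Expr rewrite(Expr e, boolean negate) {
        if (e instanceof Not) {
            return rewrite(((Not) e).child, !negate);
        }
        if (e instanceof And) {
            And a = (And) e;
            Expr l = rewrite(a.left, negate);
            Expr r = rewrite(a.right, negate);
            // De Morgan: NOT (l AND r) == (NOT l) OR (NOT r)
            return negate ? new Or(l, r) : new And(l, r);
        }
        if (e instanceof Or) {
            Or o = (Or) e;
            Expr l = rewrite(o.left, negate);
            Expr r = rewrite(o.right, negate);
            // De Morgan: NOT (l OR r) == (NOT l) AND (NOT r)
            return negate ? new And(l, r) : new Or(l, r);
        }
        Pred p = (Pred) e;
        return negate ? new Pred(p.name, !p.negated) : p;
    }

    /** Renders a tree as text, fully parenthesizing AND/OR nodes. */
    static String show(Expr e) {
        if (e instanceof Not) {
            return "NOT " + show(((Not) e).child);
        }
        if (e instanceof And) {
            And a = (And) e;
            return "(" + show(a.left) + " AND " + show(a.right) + ")";
        }
        if (e instanceof Or) {
            Or o = (Or) e;
            return "(" + show(o.left) + " OR " + show(o.right) + ")";
        }
        Pred p = (Pred) e;
        return (p.negated ? "NOT " : "") + p.name;
    }
}
```

For example, `NOT (a AND (b OR NOT c))` is rewritten to `(NOT a OR (NOT b AND c))`. Once negations sit on the leaves, each leaf can be handled on its own, which is presumably why the commit pushes NOT inward before handling the inner expressions.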
Alexey Kudinkin
60b23b9797 [HUDI-2788] Fixing issues w/ Z-order Layout Optimization (#4026)
* Simplifying, tidying up

* Fixed packaging for `TestOptimizeTable`

* Cleaned up `HoodieFileIndex` file filtering seq;
Removed optimization manually reading Parquet table circumventing Spark

* Refactored `DataSkippingUtils`:
  - Fixed checks to validate all statistics cols are present
  - Fixed some predicates being constructed incorrectly
  - Rewrote comments for easier comprehension, added more notes
  - Tidying up

* Tidying up tests

* `lint`

* Fixing compilation

* `TestOptimizeTable` > `TestTableLayoutOptimization`;
Added assertions to test data skipping paths

* Fixed tests to properly hit data-skipping path

* Fixed pruned files candidates lookup seq to conservatively include all non-indexed files

* Added java-doc

* Fixed compilation
2021-11-24 10:10:28 -08:00
Yann Byron
a234833f0a [HUDI-2759] extract HoodieCatalogTable to coordinate spark catalog table and hoodie table (#3998) 2021-11-24 02:12:38 -08:00
xiarixiaoyao
9de9951348 [HUDI-2778] Optimize statistics collection related code, add some docs for z-order and fix some bugs (#4013)
* [HUDI-2778] Optimize statistics collection related code and add more docs for z-order.

* add test code for multi-thread parquet footer read
2021-11-22 21:46:02 -08:00
Sivabalan Narayanan
fc9ca6a07a [HUDI-2559] Converting commit timestamp format to millisecs (#4024)
- Adds support for generating commit timestamps with millisecs granularity. 
- Older commit timestamps (in secs granularity) will be suffixed with 999 and parsed with millisecs format.
2021-11-22 11:44:38 -05:00
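
The backward-compatibility rule described in the commit above (seconds-granularity timestamps are suffixed with 999 and then parsed in the milliseconds format) could look roughly like the sketch below. The patterns `yyyyMMddHHmmss` and `yyyyMMddHHmmssSSS` are assumptions that the commit message does not state, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper; the timestamp layouts below are assumed, not taken from the commit.
final class CommitTimestamps {
    private static final int SECS_LENGTH = 14;   // assumed layout: yyyyMMddHHmmss
    private static final int MILLIS_LENGTH = 17; // assumed layout: yyyyMMddHHmmssSSS

    // Built with appendValue for the millisecond part, because the plain pattern
    // "yyyyMMddHHmmssSSS" does not parse on Java 8 (adjacent-value parsing issue).
    private static final DateTimeFormatter MILLIS_FORMAT = new DateTimeFormatterBuilder()
            .appendPattern("yyyyMMddHHmmss")
            .appendValue(ChronoField.MILLI_OF_SECOND, 3)
            .toFormatter();

    /** Pads a seconds-granularity timestamp with "999"; leaves millisecond ones untouched. */
    static String normalize(String timestamp) {
        if (timestamp.length() == MILLIS_LENGTH) {
            return timestamp;
        }
        if (timestamp.length() == SECS_LENGTH) {
            return timestamp + "999";
        }
        throw new IllegalArgumentException("Unexpected commit timestamp: " + timestamp);
    }

    /** Parses either granularity using the single milliseconds format. */
    static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(normalize(timestamp), MILLIS_FORMAT);
    }
}
```

One plausible reading of the 999 suffix is that an old timestamp then sorts after any millisecond-granularity timestamp generated within the same second, but the commit message does not state the motivation.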
wenningd
3dc6262437 [HUDI-2242] Add configuration inference logic for few options (#3359)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-19 19:38:38 -08:00
wenningd
24def0b30d [HUDI-2362] Add external config file support (#3416)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-11-18 01:59:26 -08:00
董可伦
4d884bdaa9 [MINOR] Fix typo,'Hooide' corrected to 'Hoodie' (#4007) 2021-11-17 16:50:04 +08:00
Yann Byron
0bb6d8ff80 [HUDI-2706] refactor spark-sql to make consistent with DataFrame api (#3936) 2021-11-14 15:44:39 -08:00
Yann Byron
90529aa552 [HUDI-2495] Resolve inconsistent key generation for timestamp types by GenericRecord and Row (#3944) 2021-11-11 11:54:34 +08:00
Sagar Sumit
187bedf795 [HUDI-2442] Change default values for certain clustering configs (#3875) 2021-11-10 14:23:24 +05:30
Shawy Geng
cf2ecd77ba [HUDI-2679] Fix the TestMergeIntoLogOnlyTable typo. (#3918) 2021-11-07 10:19:17 -08:00
manasaks
e0285800fb HUDI-1827 : Add ORC support in Bootstrap Op (#3457)
Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-11-06 12:23:20 -04:00
董可伦
844346c3ab [HUDI-2471] Add support ignoring case in merge into (#3700) 2021-11-05 07:50:16 -07:00
Prashant Wason
b7ee341e14 [HUDI-1794] Moved static COMMIT_FORMATTER to thread local variable as SimpleDateFormat is not thread safe. (#2819) 2021-11-05 09:31:42 -04:00
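
The fix in the commit above follows a common pattern: since `SimpleDateFormat` keeps mutable state and is not safe for concurrent use, each thread gets its own instance through a `ThreadLocal`. The sketch below shows that pattern only; the class name and the date pattern are assumptions, not the code from the commit.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical holder class; the pattern "yyyyMMddHHmmss" is an assumption.
final class CommitFormatter {
    // One SimpleDateFormat per thread instead of a single shared static instance.
    private static final ThreadLocal<SimpleDateFormat> FORMATTER =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyyMMddHHmmss"));

    static String format(Date date) {
        return FORMATTER.get().format(date);
    }

    static Date parse(String text) {
        try {
            return FORMATTER.get().parse(text);
        } catch (ParseException e) {
            // Re-thrown unchecked to keep call sites simple in this sketch.
            throw new IllegalArgumentException("Not a valid commit time: " + text, e);
        }
    }
}
```

Callers use `CommitFormatter.format(new Date())` from any thread without extra synchronization. An immutable `java.time.format.DateTimeFormatter` would be another way to avoid the sharing problem.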
Yann Byron
6351e5f4d0 [HUDI-2538] persist some configs to hoodie.properties when the first write (#3823) 2021-11-03 10:04:23 +08:00
Yann Byron
1f17467f73 [HUDI-1869] Upgrading Spark3 To 3.1 (#3844)
Co-authored-by: pengzhiwei <pengzhiwei2015@icloud.com>
2021-11-02 18:25:12 -07:00
Carl-Zhou-CN
dee3a14aae [HUDI-2582] Support concurrent key gen for different tables with row writer path (#3817)
Co-authored-by: yao.zhou <yao.zhou@linkflowtech.com>
2021-11-02 18:05:09 -04:00
xiarixiaoyao
d194643b49 [HUDI-2101][RFC-28] support z-order for hudi (#3330)
* [HUDI-2101]support z-order for hudi

* Renaming some configs for consistency/simplicity.

* Minor code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-11-02 09:31:57 -07:00
Yann Byron
1e2be85a0f [HUDI-2482] support 'drop partition' sql (#3754) 2021-10-19 22:09:53 +08:00
Sivabalan Narayanan
8a487eafa7 [HUDI-2494] Fixing glob pattern to skip all hoodie meta paths (#3768) 2021-10-12 14:06:40 -04:00
Yann Byron
e91e611afb [HUDI-2456] support 'show partitions' sql (#3693) 2021-10-06 15:46:49 +08:00
Sivabalan Narayanan
5f32162a2f [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590)
* [HUDI-2285] Adding synchronous updates to metadata before completion of commits in data timeline.

- This patch adds synchronous updates to the metadata table. In other words, every write is first committed to the metadata table and then to the data table. While reading the metadata table, we ignore any delta commits that are present only in the metadata table and not in the data table timeline.
- Compaction of the metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in the data table. This ensures that all base files in the metadata table are always in sync with the data table (w/o any holes), and that any extra invalid commits can appear only among delta log files in the metadata table.
- Due to this, archival of the data table also fences itself up until the compacted instant in the metadata table.
- All writes to the metadata table happen within the data table lock, so the metadata table works in single-writer mode only. This might be tough to loosen, since all writers write to the same FILES partition and so would conflict anyway.
- As part of this, added lock acquisition in the data table for those operations that previously committed without one (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution; all we are doing here is committing while holding a lock, so that writes to the metadata table always come from a single writer.
- Also added a building block for adding buckets to partitions, which will be leveraged by other indexes like record level index, etc. For now, the FILES partition has only one bucket. In general, any number of buckets per partition is allowed, and each partition has a fixed fileId prefix with an incremental suffix for each bucket within each partition.
- Fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata the first time, but failed w/ the data table.
- Enabling metadata table by default.
- Adding more tests for metadata table

Co-authored-by: Prashant Wason <pwason@uber.com>
2021-10-06 00:17:52 -04:00
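
The read-side rule from the commit above (ignore delta commits that exist only in the metadata table and not in the data table timeline) amounts to a simple filter. The sketch below uses plain strings for instant times, and its names are hypothetical; it is not Hudi's timeline API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; instants are modelled as plain timestamp strings.
final class MetadataTimelineFilter {
    /**
     * Keeps only the metadata-table delta commits whose instant also appears
     * among the completed instants of the data table, preserving input order.
     */
    static List<String> visibleDeltaCommits(List<String> metadataDeltaCommits,
                                            Set<String> completedDataInstants) {
        List<String> visible = new ArrayList<>();
        for (String instant : metadataDeltaCommits) {
            if (completedDataInstants.contains(instant)) {
                visible.add(instant);
            }
        }
        return visible;
    }
}
```

Because every write lands in the metadata table first, a write that failed before completing on the data table can leave a metadata-only delta commit behind. Filtering on the data table timeline hides such leftovers from readers.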
Sagar Sumit
bc4966ea73 [HUDI-2484] Fix hive sync mode setting in Deltastreamer (#3712) 2021-09-24 13:05:42 -04:00
董可伦
5a94043f38 [HUDI-2343]Fix the exception for mergeInto when the primaryKey and preCombineField of source table and target table differ in case only (#3517) 2021-09-21 22:11:52 +08:00
liujinhui
9f3c4a2a7f [HUDI-2410] Fix getDefaultBootstrapIndexClass logical error (#3633) 2021-09-13 16:10:17 +08:00
vinoth chandar
ea59a7ff5f [HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409)
Update Azure CI ubuntu from 16.04 to 18.04 because 16.04 will be removed soon

Fixed some consistently failing tests

* fix TestCOWDataSourceStorage TestMORDataSourceStorage
* reset mocks

Also update readme badge



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2021-09-07 09:44:30 -07:00
wenningd
69cbcc9516 Merge pull request #3541 from rahil-c/rahil-c/HUDI-2359
[HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes
2021-08-27 16:28:51 -07:00
Satish M
55a80a817d [HUDI-2264] Refactor HoodieSparkSqlWriterSuite to add setup and teardown (#3544) 2021-08-26 10:01:48 -04:00
Rahil Chertara
694300477f [HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes 2021-08-25 16:35:35 -07:00
zhangyue19921010
de94787a85 [HUDI-2345] Hoodie columns sort partitioner for bulk insert (#3523)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-08-24 21:45:17 +08:00
董可伦
be8c1e499f Support referencing subquery with column aliases by table alias in merge into (#3380) 2021-08-21 21:53:16 +08:00
pengzhiwei
49829f8822 [HUDI-2339] Create Table If Not Exists Failed After Alter Table (#3510) 2021-08-20 14:21:10 +08:00
Udit Mehrotra
c350d05dd3 Restore 0.8.0 config keys with deprecated annotation (#3506)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-19 13:36:40 -07:00
Raymond Xu
4d508ef673 [MINOR] Fix SelectPackages in HoodieSparkFunctionalTestSuite (#3476) 2021-08-15 10:17:00 -07:00
liujinhui
b7da6cb33d [HUDI-2307] When using delete_partition with ds should not rely on the primary key (#3469)
- Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-08-14 02:53:39 -04:00
Sagar Sumit
9689278014 [HUDI-1363] Provide option to drop partition columns (#3465)
- Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2021-08-13 13:01:26 -04:00
董可伦
6602e55cd2 [HUDI-2279]Support column name matching for insert * and update set * in merge into (#3415) 2021-08-13 14:10:07 +08:00
Sagar Sumit
0544d70d8f [MINOR] Deprecate older configs (#3464)
Rename and deprecate props in HoodieWriteConfig

Rename and deprecate older props
2021-08-12 20:31:04 -07:00
Sivabalan Narayanan
c9fa3cffaf [HUDI-1774] Adding support for delete_partitions to spark data source (#3437) 2021-08-11 01:03:01 -04:00
Shawy Geng
a5e496fe23 [HUDI-2292] MOR should not predicate pushdown when reading with payload_combine type (#3443) 2021-08-11 12:17:39 +08:00
swuferhong
5448cdde7e [HUDI-2170] [HUDI-1763] Always choose the latest record for HoodieRecordPayload (#3401) 2021-08-11 10:20:55 +08:00
Sivabalan Narayanan
1196736185 [HUDI-1129] Improving schema evolution support in hudi (#2927)
* Adding support to ingest records with old schema after table's schema is evolved

* Rebasing against latest master

- Trimming test file to be < 800 lines
- Renaming config names

* Addressing feedback

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-08-10 09:15:37 -07:00
zhangyue19921010
73d898322b [MINOR] Fix travis from errors (#3432) 2021-08-10 08:25:49 -07:00
pengzhiwei
41a9986a76 [HUDI-2208] Support Bulk Insert For Spark Sql (#3328) 2021-08-09 00:18:31 -04:00
pengzhiwei
32a50d8ddb [HUDI-2243] Support Time Travel Query For Hoodie Table (#3360) 2021-08-07 19:07:22 -04:00
pengzhiwei
55d2e786db [HUDI-1842] Spark Sql Support For pre-existing Hoodie Table (#3393) 2021-08-07 07:49:26 -04:00
pengzhiwei
3f8ca1a355 [HUDI-2182] Support Compaction Command For Spark Sql (#3277) 2021-08-06 15:12:10 +08:00
Sivabalan Narayanan
1df5ded433 [HUDI-2273] Migrating some long running tests to functional test profile (#3398) 2021-08-04 19:08:50 -04:00
pengzhiwei
5574e092fb [HUDI-2232] [SQL] MERGE INTO fails with table having nested struct (#3379) 2021-08-04 18:20:29 +08:00