Y Ethan Guo
4ddd094ba2
[HUDI-3689] Disable flaky tests in TestHoodieDeltaStreamer ( #5127 )
2022-03-24 16:42:44 -07:00
Raymond Xu
ff136658a0
[HUDI-3689] Fix delta streamer tests ( #5124 )
2022-03-24 14:19:53 -07:00
Raymond Xu
686da41696
[HUDI-3689] Fix UT failures in TestHoodieDeltaStreamer ( #5120 )
2022-03-24 09:10:33 -07:00
Alexey Kudinkin
ccc3728002
[HUDI-3684] Fixing NPE in ParquetUtils ( #5102 )
* Make sure nulls are properly handled in `HoodieColumnRangeMetadata`
2022-03-24 17:37:38 +05:30
Sagar Sumit
fe2c3989e3
[HUDI-3689] Fix glob path and hive sync in deltastreamer tests ( #5117 )
* Remove glob pattern basePath from the deltastreamer tests.
* [HUDI-3689] Fix file scheme config for CI failure in TestHoodieRealTimeRecordReader
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com >
2022-03-24 15:48:35 +05:30
Rajesh Mahindra
5f570ea151
[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs ( #4175 )
- Refactor hive sync tool / config to use reflection and standardize configs
Co-authored-by: sivabalan <n.siva.b@gmail.com >
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local >
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com >
2022-03-21 22:56:31 -04:00
YueZhang
8ca9a54db0
[HUDI-3376] Add an option to skip under deletion files for HoodieMetadataTableValidator ( #4994 )
Co-authored-by: yuezhang <yuezhang@freewheel.tv >
2022-03-16 18:31:00 -07:00
wangxianghu
3b59b76952
[HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string ( #4987 )
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
* add ut
* Address comment
2022-03-15 15:06:30 +04:00
Sivabalan Narayanan
22c3ce73db
[HUDI-3621] Fixing NullPointerException in DeltaStreamer ( #5039 )
2022-03-14 18:34:17 -04:00
wangxianghu
e8918b6c2c
[HUDI-3569] Introduce ChainedJsonKafkaSourcePostProcessor to support setting multi processors at once ( #4969 )
2022-03-11 17:49:30 -05:00
wangxianghu
b00180342e
[HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor ( #5019 )
2022-03-11 15:03:42 +04:00
wangxianghu
83cff3afee
[HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema ( #4972 )
* [HUDI-3522] Introduce DropColumnSchemaPostProcessor to support drop columns from schema
* Fix case sensitivity
2022-03-11 09:30:37 +04:00
Sivabalan Narayanan
9dc6df5dca
[HUDI-3595] Fixing NULL schema provider for empty batch ( #5002 )
2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5
[HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns ( #4818 )
NOTE: This change is the first part of a series to clean up Hudi's Spark DataSource related implementations, making sure there is minimal code duplication among them and that the implementations are consistent and performant.
This PR makes sure that BaseFileOnlyViewRelation only reads projected columns and avoids unnecessary serde from Row to InternalRow.
Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
wangxianghu
548000b0d6
[HUDI-3568] Introduce ChainedSchemaPostProcessor to support setting multi processors at once ( #4968 )
2022-03-09 11:16:22 +04:00
Sagar Sumit
575bc63468
[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param ( #4848 )
Rework of #4761
This diff introduces the following changes:
- Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that record generation scales up as needed.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, the type param stored in the payload is used instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com >
2022-03-08 10:39:04 -05:00
Sivabalan Narayanan
3539578ccb
[HUDI-3213] Making commit preserve metadata to true for compaction ( #4811 )
* Making commit preserve metadata to true
* Fixing integ tests
* Fixing preserve commit metadata for metadata table
* fixed bootstrap tests
* temp diff
* Fixing merge handle
* renaming fallback record
* fixing build issue
* Fixing test failures
2022-03-07 18:02:05 +05:30
wangxianghu
c9ffdc493e
[HUDI-3525] Introduce JsonkafkaSourceProcessor to support data preprocessing before it is transformed to DataSet ( #4930 )
2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa
[HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema ( #4921 )
2022-03-06 15:37:09 -05:00
Pratyaksh Sharma
907e60c252
[HUDI-3264]: made schema registry urls configurable with MTDS ( #4779 )
2022-03-02 15:30:41 -05:00
Alexey Kudinkin
85f47b53df
[HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds ( #4866 )
2022-03-01 22:15:26 -08:00
Sivabalan Narayanan
f7088a957c
[HUDI-3497] Adding Datatable validator tool ( #4902 )
2022-02-28 22:46:32 -05:00
Y Ethan Guo
257052a94d
[HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator ( #4878 )
2022-02-28 18:49:30 -08:00
Sivabalan Narayanan
2f99e8458a
[HUDI-3521] Fixing Kafka key and value serializer value type from class to string ( #4919 )
2022-02-27 11:13:13 -05:00
Sivabalan Narayanan
9678c3fbcf
[MINOR] Fixing checkpoint management in S3IncrSource ( #4871 )
2022-02-22 09:15:16 -05:00
wangxianghu
4d1f74ebea
[HUDI-3464] Fix wrong exception thrown from HiveSchemaProvider ( #4865 )
2022-02-22 10:20:20 +04:00
Sivabalan Narayanan
14dbbdf4c7
[HUDI-2189] Adding delete partitions support to DeltaStreamer ( #4787 )
2022-02-22 00:01:30 -05:00
Sivabalan Narayanan
d36fe24c9e
[HUDI-3455] Fixing checkpoint management in hoodie incr source ( #4850 )
2022-02-21 08:19:57 -05:00
Sivabalan Narayanan
66ac1446dd
[MINOR] Moving spark scheduling configs out of DataSourceOptions ( #4843 )
2022-02-20 13:49:18 -05:00
Y Ethan Guo
fba5822ee3
[HUDI-3430] Fix Deltastreamer to properly shut down the services upon failure ( #4824 )
2022-02-18 08:44:56 -05:00
Y Ethan Guo
9a05940a74
[HUDI-3366] Remove hardcoded logic of disabling metadata table in tests ( #4792 )
2022-02-15 16:41:47 -05:00
YueZhang
76e2faa28d
[HUDI-3370] The files recorded in the commit may not match the actual ones for MOR Compaction ( #4753 )
* use HoodieCommitMetadata to replace writeStatuses computation
Co-authored-by: yuezhang <yuezhang@freewheel.tv >
2022-02-14 11:12:52 +08:00
冯健
55777fec05
[HUDI-2413] fix Sql source's checkpoint issue ( #3648 )
* [HUDI-2413] fix Sql source's checkpoint
* Fixing sql source checkpoint handling
* Fixing docs
Co-authored-by: jian.feng <fengjian428@gmial.com >
Co-authored-by: sivabalan <n.siva.b@gmail.com >
2022-02-14 08:07:48 +05:30
Y Ethan Guo
6aba00e84f
[MINOR] Fix typos in Spark client related classes ( #4781 )
2022-02-13 06:41:58 -08:00
wangxianghu
ce9762d588
[MINOR] unused import ( #4799 )
2022-02-12 13:11:37 +04:00
zhangxiang17
9518f78610
[HUDI-3413] Fix Jackson parse error on empty message from JsonKafkaSource when using HoodieDeltaStreamer ( #4794 )
2022-02-12 11:37:29 +04:00
Sivabalan Narayanan
60831d6906
[HUDI-3361] Fixing missing begin checkpoint in HoodieIncremental pull ( #4755 )
2022-02-08 12:03:07 -05:00
Sivabalan Narayanan
ab73047958
Adding support for custom scheduler configs with streaming sink ( #4762 )
2022-02-08 14:44:10 +05:30
YueZhang
1636876e8a
[HUDI-3320] Hoodie metadata table validator ( #4721 )
Co-authored-by: yuezhang <yuezhang@freewheel.tv >
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com >
2022-02-08 00:29:44 -08:00
Vinish Reddy
8ab6f17149
[HUDI-3373] Add zero value metrics for empty data source and PROMETHEUS_PUSHGATEWAY reporter ( #4760 )
2022-02-07 15:17:46 -05:00
ForwardXu
773b317983
[HUDI-2941] Show _hoodie_operation in spark sql results ( #4649 )
2022-02-07 06:28:13 -08:00
Sivabalan Narayanan
24f738fe68
[HUDI-3360] Adding retries to deltastreamer for source errors ( #4744 )
2022-02-07 08:10:06 -05:00
ehui
538db185ca
[HUDI-2491] Expose HMS mode metastore uri config option for spark writer ( #3962 )
2022-02-07 18:13:51 +05:30
YueZhang
de206acbae
[HUDI-3369] New ScheduleAndExecute mode for HoodieCompactor and hudi-cli ( #4750 )
Schedule and execute compaction plan in one single mode.
2022-02-07 15:01:34 +05:30
Y Ethan Guo
b8601a9f58
[HUDI-2656] Generalize HoodieIndex for flexible record data type ( #3893 )
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com >
2022-02-03 20:24:04 -08:00
Raymond Xu
caef3d5c58
[HUDI-3330] Remove fixture test tables for multi writer tests ( #4704 )
2022-02-02 17:50:10 +05:30
jsbali
7ce0f4522b
[HUDI-2711] Fall back to full table scan for IncrementalRelation if underlying files have been cleared or moved by cleaner ( #3946 )
Co-authored-by: sivabalan <n.siva.b@gmail.com >
2022-01-31 23:03:18 -05:00
Harsha Teja Kanna
ed7aa138e8
[MINOR] Added log to debug checkpoint resumption when set to 0 ( #4650 )
2022-01-28 23:08:25 -05:00
Sagar Sumit
e78b2f1b55
[HUDI-2943] Complete pending clustering before deltastreamer sync ( #4572 )
2022-01-28 21:28:04 -05:00
Manoj Govindassamy
f87c47352a
[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field ( #4449 )
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
  Since the key field in the metadata record payload is a duplicate of the
  Key in the Cell, the redundant key field in the record can be emptied
  to save on the cost.
- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
  with the key field emptied by default. The HFile writer checks whether
  the record has the metadata payload schema field 'key' and, if so, trims
  the key from the record payload.
- HoodieHFileReader, when reading the serialized records back from disk,
  materializes the missing key fields, if any. The HFile reader checks
  whether the record has the metadata payload schema field 'key' and, if
  so, materializes the key in the record payload.
- Tests have been added to verify the default virtual keys and key
  deduplication support for the metadata table records.
Co-authored-by: Vinoth Chandar <vinoth@apache.org >
2022-01-26 13:34:04 -05:00