ForwardXu
08fd80c913
[HUDI-3221] Support querying a table as of a savepoint ( #4720 )
2022-03-08 10:02:34 -08:00
Sagar Sumit
575bc63468
[HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param ( #4848 )
...
Rework of #4761
This diff introduces the following changes:
- Write stats are converted to metadata index records during the commit. They now use the HoodieData type so that record generation scales with need.
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, use the type param stored in the payload instead of a hardcoded type.
- Delta writes can change column ranges, and the column stats index needs to be updated with the new ranges to stay consistent with the table dataset. This fix adds column stats index update support for delta writes.
Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
2022-03-08 10:39:04 -05:00
Raymond Xu
ed26c5265c
[HUDI-3584] Skip integ test modules by default ( #4986 )
2022-03-08 06:32:04 -08:00
ForwardXu
25385805aa
[HUDI-3574] Improve maven module configs for different spark profiles ( #4970 )
2022-03-08 01:01:05 -08:00
Danny Chan
fe53bd2dea
[HUDI-2677] Add DFS based message queue for flink writer[part3] ( #4961 )
2022-03-08 15:43:21 +08:00
Bo
b6bdb46f7f
[MINOR][HUDI-3460] Fix HoodieDataSourceITCase
...
close #4959
2022-03-08 12:18:43 +08:00
todd5167
34bc752853
[HUDI-3573] Flink CleanFunction executes clean on initialization ( #4936 )
...
For the Flink insert overwrite operation, do the cleaning each time before the write.
2022-03-08 11:53:54 +08:00
Sivabalan Narayanan
29040762fa
[HUDI-3576] Configuring timeline refreshes based on latest commit ( #4973 )
2022-03-07 17:01:49 -05:00
YueZhang
53826d69e4
[HUDI-2747] Support setting --sparkMaster for MDT CLI ( #4964 )
...
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-07 16:57:03 -05:00
Alexey Kudinkin
a66fd40692
[HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS ( #4739 )
...
- This change makes sure MT records are updated appropriately on HDFS: previously, after Log File append operations, MT records were updated with just the size of the deltas being appended to the original files. This was found to cause issues with Rollbacks, which instead updated MT with records bearing the full file size.
- To hedge against similar issues going forward, this PR removes this discrepancy and streamlines the flow so that the MT always ingests records bearing full file sizes.
2022-03-07 15:38:27 -05:00
Alexey Kudinkin
f0bcee3c01
[HUDI-3561] Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize ( #4954 )
...
- Avoid including whole MultipleSparkJobExecutionStrategy object into the closure for Spark to serialize
2022-03-07 13:42:03 -05:00
Sivabalan Narayanan
3539578ccb
[HUDI-3213] Making commit preserve metadata to true for compaction ( #4811 )
...
* Making commit preserve metadata to true
* Fixing integ tests
* Fixing preserve commit metadata for metadata table
* fixed bootstrap tests
* temp diff
* Fixing merge handle
* renaming fallback record
* fixing build issue
* Fixing test failures
2022-03-07 18:02:05 +05:30
苏承祥
6f57bbfac4
[HUDI-3069] Improve HoodieMergedLogRecordScanner to avoid putting unnecessary hoodie records ( #4932 )
...
* log scanner optimization
* payload equals switches to `=`
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-03-07 14:35:55 +08:00
wangxianghu
c9ffdc493e
[HUDI-3525] Introduce JsonkafkaSourceProcessor to support preprocessing data before it is transformed to DataSet ( #4930 )
2022-03-06 15:41:01 -05:00
wangxianghu
4b471772aa
[HUDI-3520] Introduce DeleteSupportSchemaPostProcessor to support adding _hoodie_is_deleted column to schema ( #4921 )
2022-03-06 15:37:09 -05:00
Aditya Tiwari
051ad0b033
[HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas ( #4468 )
...
* Fixing Hive getSchema for RT tables
* Addressing feedback
* temp diff
* fixing tests after spark datasource read support for metadata table is merged to master
* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader
Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
Sivabalan Narayanan
6a46130037
[HUDI-2761] Fixing timeline server for repeated refreshes ( #4812 )
...
* Fixing timeline server for repeated refreshes
2022-03-05 10:04:16 +08:00
Bo Cui
0986d5a01d
[HUDI-3460] Add reader merge memory option for flink ( #4911 )
...
* flink TM memory Optimization
2022-03-04 19:29:29 +08:00
Raymond Xu
b4362fac45
[HUDI-3348] Add UT to verify HoodieRealtimeFileSplit serde ( #4951 )
2022-03-04 11:19:16 +04:00
Yuwei XIAO
f449807630
[MINOR] fix UTC timezone config ( #4950 )
2022-03-04 11:09:39 +04:00
ForwardXu
6faed3d90a
[HUDI-3161][RFC-47] Add Call Procedure Command for Spark SQL ( #4607 )
2022-03-03 20:02:46 -08:00
shibei
62f534d002
[HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL ( #4901 )
...
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL
Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00
RexAn
be9a264885
[HUDI-3548] Fix async clustering not working when the user specifies key "hoodie.datasource.clustering.async.enable" directly ( #4905 )
...
Co-authored-by: Rex An <bonean131@gmail.com>
2022-03-03 19:14:07 -05:00
Danny Chan
a4ba0fff07
[HUDI-3552] Strengthen NetworkUtils#getHostname by checking network interfaces first ( #4942 )
...
* In some complex network environments, the current code returns the wildcard address 0.0.0.0, which is not desired.
2022-03-03 21:11:08 +08:00
Sivabalan Narayanan
876a891979
[HUDI-3544] Fixing "populate meta fields" update to metadata table ( #4941 )
...
* Fixing populateMeta fields update to metadata table
* Fix checkstyle violations
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-03 17:02:25 +05:30
Manoj Govindassamy
51ee5005a6
[HUDI-2973] RFC-27: Data skipping index to improve query performance ( #4728 )
...
- Updating the schema used for data skipping index
2022-03-03 15:56:22 +05:30
Pratyaksh Sharma
907e60c252
[HUDI-3264] Make schema registry URLs configurable with MTDS ( #4779 )
2022-03-02 15:30:41 -05:00
liujinhui
527bd34b1c
[MINOR] RFC-38 markdown content error ( #4933 )
...
* Minor content error
2022-03-02 19:40:28 +04:00
Sivabalan Narayanan
f8945eca08
[MINOR] Adding more test props to integ tests ( #4935 )
2022-03-02 08:10:43 -05:00
Danny Chan
1d57bd17c2
[MINOR] Cosmetic changes following HUDI-3315 ( #4934 )
2022-03-02 17:44:52 +08:00
Gary Li
10d866f083
[HUDI-3315] RFC-35 Part-1 Support bucket index in Flink writer ( #4679 )
...
* Support bucket index in Flink writer
* Use record key as default index key
2022-03-02 15:14:44 +08:00
Alexey Kudinkin
85f47b53df
[HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible builds ( #4866 )
2022-03-01 22:15:26 -08:00
yuzhaojing
3b2da9f138
[HUDI-2631] In CompactFunction, set up the write schema each time with the latest schema ( #4000 )
...
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-03-02 11:18:17 +08:00
stayrascal
3cfb52c413
[MINOR] fix get builtin function issue from Hudi catalog ( #4917 )
2022-03-02 11:16:19 +08:00
Bo Cui
3fdc9332e5
[HUDI-3516] Implement record iterator for HoodieDataBlock ( #4909 )
...
* Use an iterator to avoid eager materialization and be memory friendly
2022-03-02 10:19:36 +08:00
ForwardXu
a81a6326d5
[HUDI-3441] Add support for "marker delete" in hudi-cli ( #4922 )
2022-03-01 16:03:53 +08:00
Sivabalan Narayanan
f7088a957c
[HUDI-3497] Adding Datatable validator tool ( #4902 )
2022-02-28 22:46:32 -05:00
Y Ethan Guo
257052a94d
[HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator ( #4878 )
2022-02-28 18:49:30 -08:00
yuzhaojing
44b8ab6048
[HUDI-3418] Save timeout option for RemoteFileSystemView ( #4809 )
...
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-28 15:16:40 -05:00
wenningd
18dc89cf79
[HUDI-3450] Avoid passing empty string spark master to hudi cli ( #4844 )
...
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-02-28 11:37:24 -05:00
Y Ethan Guo
05e395ae5f
[HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x ( #4897 )
2022-02-28 11:14:35 -05:00
stayrascal
8f1e4f5b3e
[HUDI-3528] Fix String conversion issue and override putAll method in TypedProperties.java ( #4920 )
2022-02-28 10:45:47 -05:00
Sivabalan Narayanan
4a59876c8b
[HUDI-2917] Rollback insert data appended to log file when using HBase Index ( #4840 )
...
Co-authored-by: guanziyue <guanziyue@gmail.com>
2022-02-28 08:13:17 -05:00
Bo Cui
193215201c
[MINOR] Change MINI_BATCH_SIZE to 2048 ( #4862 )
...
ParquetColumnarRowSplitReader#batchSize is 2048, so changing MINI_BATCH_SIZE to 2048 will reduce the memory cache.
2022-02-28 10:45:28 +08:00
Sivabalan Narayanan
d5444ff7ff
[HUDI-3018] Adding validation to dataframe schema to ensure reserved field does not have a different data type ( #4852 )
2022-02-27 11:59:23 -05:00
Sivabalan Narayanan
2f99e8458a
[HUDI-3521] Fixing Kafka key and value serializer value type from class to string ( #4919 )
2022-02-27 11:13:13 -05:00
Raymond Xu
c77b2591d0
[HUDI-2439] Remove SparkBoundedInMemoryExecutor ( #4860 )
2022-02-26 08:02:12 -05:00
Sivabalan Narayanan
1379300b5b
[HUDI-3483] Adding insert override nodes to integ test suite and few clean ups ( #4895 )
2022-02-26 08:00:15 -05:00
Sagar Sumit
6a5cfb45b9
[MINOR] Fix table type in input format test ( #4912 )
2022-02-25 13:51:53 -05:00
苏承祥
92cdc5987a
[HUDI-3515] Making rdd unpersist optional at the end of writes ( #4898 )
...
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-02-25 11:30:10 -05:00