1
0
Commit Graph

605 Commits

Author SHA1 Message Date
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Alexey Kudinkin
ccc3728002 [HUDI-3684] Fixing NPE in ParquetUtils (#5102)
* Make sure nulls are properly handled in `HoodieColumnRangeMetadata`
2022-03-24 17:37:38 +05:30
Danny Chan
a1c42fcc07 [minor] Checks the data block type for archived timeline (#5106) 2022-03-24 14:10:43 +08:00
Sivabalan Narayanan
52f0498330 Fixing non partitioned all files record in MDT (#5108) 2022-03-23 19:26:39 -07:00
Sagar Sumit
f96ba7abf0 [HUDI-3642] Handle NPE due to empty requested replacecommit metadata (#5090) 2022-03-23 12:13:02 -07:00
Rajesh Mahindra
5f570ea151 [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs (#4175)
- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Rajesh Mahindra <rmahindra@Rajeshs-MacBook-Pro.local>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-21 22:56:31 -04:00
Pratyaksh Sharma
ca0931d332 [HUDI-1436]: Provide an option to trigger clean every nth commit (#4385)
- Provided option to trigger clean every nth commit with default number of commits as 1 so that existing users are not affected.
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-21 20:06:30 -04:00
Y Ethan Guo
15d1c18625 [MINOR] Remove flaky assert in TestInLineFileSystem (#5069) 2022-03-20 18:58:30 -04:00
Alexey Kudinkin
1b6e201160 [HUDI-3663] Fixing Column Stats index to properly handle first Data Table commit (#5070)
* Fixed metadata conversion util to extract schema from `HoodieCommitMetadata`

* Fixed failure to fetch columns to index in empty table

* Abort indexing seq in case there are no columns to index

* Fallback to index at least primary key columns, in case no writer schema could be obtained to index all columns

* Fixed `getRecordFields` incorrectly ignoring default value

* Make sure Hudi metadata fields are also indexed
2022-03-20 10:24:13 +05:30
Alexey Kudinkin
099c2c099a [HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877)
Refactoring Spark DataSource Relations to avoid code duplication. 

Following Relations were in scope:

- BaseFileOnlyViewRelation
- MergeOnReadSnapshotRelaation
- MergeOnReadIncrementalRelation
2022-03-18 22:32:16 -07:00
Raymond Xu
7446ff95a7 [HUDI-2439] Replace RDD with HoodieData in HoodieSparkTable and commit executors (#4856)
- Adopt HoodieData in Spark action commit executors
- Make Spark independent DeleteHelper, WriteHelper, MergeHelper in hudi-client-common
- Make HoodieTable in WriteClient APIs have raw type to decouple with Client's generic types
2022-03-17 04:17:56 -07:00
Y Ethan Guo
5ba2d9ab2f [HUDI-3494] Consider triggering condition of MOR compaction during archival (#4974) 2022-03-17 01:28:11 -04:00
Sagar Sumit
d514570e90 [HUDI-3633] Allow non-string values to be set in TypedProperties (#5045)
* [HUDI-3633] Allow non-string values to be set in TypedProperties

* Override getProperty to ignore instanceof string check
2022-03-15 22:33:22 +04:00
Alexey Kudinkin
5e8ff8d793 [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948) 2022-03-15 10:38:36 -07:00
l-shen
9bdda2a312 [HUDI-3619] Fix HoodieOperation fromValue using wrong constant value (#5033)
Co-authored-by: root <l-shen@localhost.localdomain>
2022-03-15 16:34:31 +04:00
wangxianghu
3b59b76952 [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string (#4987)
* [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string

* add ut

* Address comment
2022-03-15 15:06:30 +04:00
Sagar Sumit
eee96e9af3 [HUDI-3593] Restore TypedProperties and flush checksum in table config (#5013)
Create new TypedProperties while performing clustering

Add OrderedProperties and minor refactoring

Add javadoc and remove getters from OrderedProperties
2022-03-13 07:58:55 +05:30
RexAn
93277b2bcd [HUDI-3592] Fix NPE of DefaultHoodieRecordPayload if Property is empty (#4999)
Co-authored-by: Rex An <bonean131@gmail.com>
2022-03-11 17:45:40 -05:00
苏承祥
faed6996ee [HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-03-11 18:58:49 +08:00
Sivabalan Narayanan
9dc6df5dca [HUDI-3595] Fixing NULL schema provider for empty batch (#5002) 2022-03-10 22:52:55 -05:00
Alexey Kudinkin
034addaef5 [HUDI-3396] Make sure BaseFileOnlyViewRelation only reads projected columns (#4818)
NOTE: This change is first part of the series to clean up Hudi's Spark DataSource related implementations, making sure there's minimal code duplication among them, implementations are consistent and performant

This PR is making sure that BaseFileOnlyViewRelation only reads projected columns as well as avoiding unnecessary serde from Row to InternalRow

Brief change log
- Introduced HoodieBaseRDD as a base for all custom RDD impls
- Extracted common fields/methods to HoodieBaseRelation
- Cleaned up and streamlined HoodieBaseFileViewOnlyRelation
- Fixed all of the Relations to avoid superfluous Row <> InternalRow conversions
2022-03-09 21:45:25 -05:00
MrSleeping123
8859b48b2a [HUDI-3383] Sync column comments while syncing a hive table (#4960)
Desc: Add a hive sync config(hoodie.datasource.hive_sync.sync_comment). This config defaults to false.
While syncing data source to hudi, add column comments to source avro schema, and the sync_comment is true, syncing column comments to the hive table.
2022-03-10 09:44:39 +08:00
Sagar Sumit
575bc63468 [HUDI-3356][HUDI-3203] HoodieData for metadata index records; BloomFilter construction from index based on the type param (#4848)
Rework of #4761 
This diff introduces following changes:

- Write stats are converted to metadata index records during the commit. Making them use the HoodieData type so that the record generation scales up with needs. 
- Metadata index init support for bloom filter and column stats partitions.
- When building the BloomFilter from the index records, using the type param stored in the payload instead of hardcoded type.
- Delta writes can change column ranges and the column stats index need to be properly updated with new ranges to be consistent with the table dataset. This fix add column stats index update support for the delta writes.

Co-authored-by: Manoj Govindassamy <manoj.govindassamy@gmail.com>
2022-03-08 10:39:04 -05:00
Alexey Kudinkin
a66fd40692 [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS (#4739)
- This change makes sure MT records are updated appropriately on HDFS: previously after Log File append operations MT records were updated w/ just the size of the deltas being appended to the original files, which have been found to be the cause of issues in case of Rollbacks that were instead updating MT with records bearing the full file-size.

- To make sure that we hedge against similar issues going f/w, this PR alleviates this discrepancy and streamlines the flow of MT table always ingesting records bearing full file-sizes.
2022-03-07 15:38:27 -05:00
Sivabalan Narayanan
3539578ccb [HUDI-3213] Making commit preserve metadata to true for compaction (#4811)
* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
2022-03-07 18:02:05 +05:30
苏承祥
6f57bbfac4 [HUDI-3069] Improve HoodieMergedLogRecordScanner avoid putting unnecessary hoodie records (#4932)
* log scanner optimization
* payload equals switches to `=`

Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-03-07 14:35:55 +08:00
Aditya Tiwari
051ad0b033 [HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables

* Addressing feedback

* temp diff

* fixing tests after spark datasource read support for metadata table is merged to master

* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
Yuwei XIAO
f449807630 [MINOR] fix UTC timezone config (#4950) 2022-03-04 11:09:39 +04:00
shibei
62f534d002 [HUDI-3445] Support Clustering Command Based on Call Procedure Command for Spark SQL (#4901)
* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

* [HUDI-3445] Clustering Command Based on Call Procedure Command for Spark SQL

Co-authored-by: shibei <huberylee.li@alibaba-inc.com>
2022-03-04 09:33:16 +08:00
Danny Chan
a4ba0fff07 [HUDI-3552] Strength the NetworkUtils#getHostname by checking network interfaces first (#4942)
* In some complex network environment, the current code returns wildcard address 0.0.0.0 which is not desired.
2022-03-03 21:11:08 +08:00
Alexey Kudinkin
85f47b53df [HUDI-3469] Refactor HoodieTestDataGenerator to provide for reproducible Builds (#4866) 2022-03-01 22:15:26 -08:00
Bo Cui
3fdc9332e5 [HUDI-3516] Implement record iterator for HoodieDataBlock (#4909)
*  Use iterator to void eager materialization to be memory friendly
2022-03-02 10:19:36 +08:00
Y Ethan Guo
257052a94d [HUDI-3465] Add validation of column stats and bloom filters in HoodieMetadataTableValidator (#4878) 2022-02-28 18:49:30 -08:00
yuzhaojing
44b8ab6048 [HUDI-3418] Save timeout option for remote RemoteFileSystemView (#4809)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-28 15:16:40 -05:00
Y Ethan Guo
05e395ae5f [HUDI-3341] Fix log file reader for S3 with hadoop-aws 2.7.x (#4897) 2022-02-28 11:14:35 -05:00
stayrascal
8f1e4f5b3e [HUDI-3528] Fix String convert issue and overwrite putAll method in TypedProperties.java (#4920) 2022-02-28 10:45:47 -05:00
Sivabalan Narayanan
d5444ff7ff [HUDI-3018] Adding validation to dataframe scheme to ensure reserved field does not have diff data type (#4852) 2022-02-27 11:59:23 -05:00
Raymond Xu
c77b2591d0 [HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860) 2022-02-26 08:02:12 -05:00
YueZhang
742810070b [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile() (#4810) 2022-02-25 16:46:27 +05:30
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer ; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
Y Ethan Guo
4e8accc179 [HUDI-3486] Fix wrong field order for constructing HoodieMetadataColumnStats (#4875) 2022-02-23 10:27:02 +05:30
yuzhaojing
dabae80423 [HUDI-3420] Remove duplicates type in HoodieClusteringGroup.avsc (#4808)
Co-authored-by: yuzhaojing <yuzhaojing@bytedance.com>
2022-02-23 10:49:47 +08:00
Danny Chan
4affdd0c8f [HUDI-3461] The archived timeline for flink streaming reader should not be reused (#4861)
* Before the patch, the flink streaming reader caches the meta client thus the archived timeline,
  when fetching the instant details from the reused timeline, the exception throws
* Add a method in HoodieTableMetaClient to return a fresh new archived timeline each time
2022-02-22 15:54:29 +08:00
Y Ethan Guo
7e1ea06eb9 [MINOR] Fix typos and improve docs in HoodieMetadataConfig (#4867) 2022-02-21 19:36:20 -08:00
Prashant Wason
0dee8edc97 [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present using a config. (#4212)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-02-21 21:53:03 -05:00
Pratyaksh Sharma
bf16bc122a [HUDI-349]: Added new cleaning policy based on number of hours (#3646) 2022-02-21 09:04:42 -05:00
Sivabalan Narayanan
17cb5cb433 [HUDI-3432] Fixing restore with metadata enabled (#4849)
* Fixing restore with metadata enabled

* Fixing test failures
2022-02-21 18:25:30 +05:30
YueZhang
359fbfde79 [HUDI-2648] Retry FileSystem action instead of failed directly. (#3887)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-02-20 15:31:31 -05:00
Bo Cui
83279971a1 [HUDI-3446] Supports batch reader in BootstrapOperator#loadRecords (#4837)
* [HUDI-3446] Supports batch Reader in BootstrapOperator#loadRecords
2022-02-19 21:21:48 +08:00
Sagar Sumit
ed106f671e [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties (#4712)
Fix dependency conflict

Fix repairs command

Implement putIfAbsent for DDB lock provider

Add upgrade step and validate while fetching configs

Validate checksum for latest table version only while fetching config

Move generateChecksum to BinaryUtil

Rebase and resolve conflict

Fix table version check
2022-02-18 10:17:06 +05:30