Commit Graph

958 Commits

Author SHA1 Message Date
Rahil C
1a5a9f7f03 [HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (#6164)
Co-authored-by: Udit Mehrotra <uditme@amazon.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-07-23 21:08:21 -07:00
Danny Chan
a0ffd05b77 [HUDI-4448] Remove the latest commit refresh for timeline server (#6179) 2022-07-23 16:10:53 -07:00
Alexey Kudinkin
2d745057ea [HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations (#5708) 2022-07-23 16:59:16 -05:00
Sagar Sumit
da28e38fe3 [HUDI-4071] Make NONE sort mode as default for bulk insert (#6195) 2022-07-23 14:37:04 -05:00
Alexey Kudinkin
eea4a692c0 [HUDI-4039] Make sure all builtin KeyGenerators properly implement Spark specific APIs (#5523)
This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)
2022-07-22 08:35:07 -07:00
Sivabalan Narayanan
2e0dd29714 [HUDI-4204] Fixing NPE with row writer path and with OCC (#5850) 2022-07-21 15:57:34 -07:00
Y Ethan Guo
50cdb867c7 [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (#6113)
Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.
2022-07-21 11:38:25 -07:00
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
Alexey Kudinkin
de37774e12 [HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428)
Currently, all Hudi Relations bear a performance gap relative to Spark's HadoopFsRelation,
and the reason for that is the SchemaPruning optimization rule (pruning nested schemas),
which is unfortunately predicated on usage of HadoopFsRelation, meaning that it's
not applied when any other relation is used.

This change ports this rule to Hudi relations (MOR, Incremental, etc.)
by leveraging the HoodieSparkSessionExtensions mechanism to inject
a modified version of the original SchemaPruning rule
that is adapted to work w/ Hudi's custom relations.

- Added customOptimizerRules to HoodieAnalysis
- Added NestedSchemaPrunning Spark's Optimizer rule
- Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas)
- Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions
- Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource
- Disabled fallback to HadoopFsRelation
2022-07-21 15:06:06 +05:30
Danny Chan
e3675fe9b0 [HUDI-4372] Enable metadata table by default for flink (#6066) 2022-07-20 16:10:19 +08:00
冯健
382d19e85b [HUDI-4065] Add FileBasedLockProvider (#6071) 2022-07-19 07:52:47 +08:00
liujinhui
1959b843b7 [HUDI-4409] Improve LockManager wait logic when catching exception (#6122) 2022-07-18 22:45:52 +08:00
Alexey Kudinkin
4bda6afe0b [HUDI-4249] Fixing in-memory HoodieData implementation to operate lazily (#5855) 2022-07-16 18:26:48 -05:00
Danny Chan
6aec9d754f [HUDI-4408] Reuse old rollover file as base file for flink merge handle (#6120) 2022-07-16 20:46:23 +08:00
Danny Chan
05606708fa [HUDI-4393] Add marker file for target file when flink merge handle rolls over (#6103) 2022-07-14 16:00:08 +08:00
liujinhui
126b88b48d [HUDI-2150] Rename/Restructure configs for better modularity (#6061)
- Move clean related configuration to HoodieCleanConfig
- Move Archival related configuration to HoodieArchivalConfig
- Move hoodie.compaction.payload.class to HoodiePayloadConfig
2022-07-09 20:00:48 +05:30
xiarixiaoyao
b686c07407 [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017)
* [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields.

* fix comments

Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>
2022-07-09 03:08:38 +08:00
Danny Chan
7eeaff9ee0 [HUDI-4357] Support flink 1.15.x (#6050) 2022-07-06 13:42:58 +08:00
xi chaomin
23c9c5c296 [HUDI-3836] Improve the way of fetching metadata partitions from table (#5286)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-07-05 07:50:17 -07:00
Danny Chan
47792a3186 [HUDI-4353] Column stats data skipping for flink (#6026) 2022-07-03 08:29:31 +08:00
bschell
fd7d25ab63 [HUDI-1176] Upgrade hudi to log4j2 (#5366)
* Move to log4j2

cr: https://code.amazon.com/reviews/CR-71010705

* Upgrade unit tests to log4j2

* update exclusion

Co-authored-by: Brandon Scheller <bschelle@amazon.com>
2022-06-28 12:54:23 -07:00
Alexey Kudinkin
ed823f1c6f [HUDI-4320] Make sure HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED could be specified by the writer (#5970)
Fixed the sequence that determines whether Parquet's legacy-format writing property should be overridden, so that the override only kicks in when the property has not been explicitly specified by the caller
2022-06-28 12:27:32 -07:00
Sagar Sumit
8846849a03 [HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (#5930) 2022-06-27 17:27:16 +05:30
cxzl25
72fa19bcc9 [HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (#5959) 2022-06-27 11:09:30 +08:00
Alexey Kudinkin
c86edfc28e [HUDI-4319] Fixed Parquet's PLAIN_DICTIONARY encoding not being applied when bulk-inserting (#5966)
* Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)
2022-06-24 23:52:28 -04:00
xi chaomin
30ebdc708b [HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (#5874) 2022-06-24 02:47:36 -04:00
Zhaojing Yu
6456bd3a51 [HUDI-4273] Support inline schedule clustering for Flink stream (#5890)
* [HUDI-4273] Support inline schedule clustering for Flink stream

* delete deprecated clustering plan strategy and add clustering ITTest
2022-06-24 11:28:06 +08:00
Zhaojing Yu
c7e430bb46 Revert master (#5925)
* Revert "udate"

This reverts commit 092e35c1e3.

* Revert "[HUDI-3475] Initialize hudi table management module."

This reverts commit 4640a3bbb8.
2022-06-21 16:58:50 +08:00
喻兆靖
4640a3bbb8 [HUDI-3475] Initialize hudi table management module. 2022-06-21 15:21:30 +08:00
RexAn
17ac5a4573 [HUDI-4173] Fix wrong results if the user reads a hudi table with no base files by glob paths (#5723) 2022-06-20 23:02:34 +05:30
huberylee
d4f0326b4b [HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894) 2022-06-20 14:29:21 +08:00
Danny Chan
0811bb38fb [HUDI-4255] Make the flink merge and replace handle intermediate file visible (#5866) 2022-06-15 14:23:23 +08:00
Danny Chan
25bbff64cf [minor] Following HUDI-4207, remove the new wrapper #init method (#5865) 2022-06-15 08:48:13 +08:00
HunterXHunter
264b15df87 [HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (#5788)
Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception
2022-06-13 10:36:06 -04:00
xi chaomin
e89f5627e4 [HUDI-3682] testReaderFilterRowKeys fails in TestHoodieOrcReaderWriter (#5790)
TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields.

This fix writes data with schema which contains meta fields and calls writeAvroWithMetadata for writing.

Co-authored-by: xicm <xicm@asiainfo.com>
2022-06-13 10:22:12 -04:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects them to implement certain APIs which are not implemented at the moment; instead, we're using a custom Resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.
2022-06-07 16:30:46 -07:00
Sivabalan Narayanan
f85cd9b16d [HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773)
- Keys fetched from the metadata table, especially from the base file reader, are not sorted, and hence may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). This patch fixes that. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208
2022-06-07 08:19:52 -04:00
Sivabalan Narayanan
21b903fddb [HUDI-4197] Fix Async indexer to support building FILES partition (#5766)
- When the async indexer is invoked only with the "FILES" partition, it fails. This patch fixes it to work with the async indexer. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to first index the "FILES" partition, followed by other partitions. In general, we have a limitation of building only one index at a time w/ AsyncIndexer; guards have been added to ensure these conditions are met.
2022-06-06 15:47:11 -04:00
Sivabalan Narayanan
4f6fc726d0 [HUDI-4140] Fixing hive style partitioning and default partition with bulk insert row writer with SimpleKeyGen and virtual keys (#5664)
Bulk insert row writer code path had a gap wrt hive style partitioning and default partition when virtual keys are enabled with SimpleKeyGen.  This patch fixes the issue.
2022-06-06 10:21:00 -07:00
Alexey Kudinkin
4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)
As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1 TB) having pretty fat commits, where Hudi's commit metadata could reach into 100s of MBs.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy
2022-06-06 13:14:26 -04:00
marchpure
73b0be3c96 [HUDI-4192] HoodieHFileReader scan top cells after bottom cells throws NullPointerException (#5755)
SeekTo top cells to avoid NullPointerException
2022-06-06 12:07:26 +08:00
KnightChess
51602a34f7 [HUDI-4179] Cluster with sort columns invalid (#5739) 2022-06-02 20:28:21 +08:00
Danny Chan
7f8630cc57 [HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716)
The timeline refresh on table initialization invokes the fs view #sync, which now performs two actions:

1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata
2. if this is a local fs view, clear all the local states; if this is a remote fs view, send a request to sync the remote fs view

But looking at the construction, the meta client is freshly instantiated, so the timeline is already the latest;
the table is also freshly constructed, so the fs view has no local states. That means the #sync is entirely unnecessary.

In this patch, the metadata lifecycle and the data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata
is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view:

1. if the fs view is local, the visibility is based on the client table metadata client's latest commit
2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline

From the client's perspective, there is no need to care about the refresh action anymore, whether or not the metadata table is enabled.
That makes the client logic clearer and less error-prone.

Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients send requests to #sync the
remote fs view, the server would encounter conflicts and the clients would encounter response errors.
2022-06-02 09:48:48 +08:00
Danny Chan
329da34ee0 [HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (#5703)
If the avro file is corrupted, an InvalidAvroMagicException is thrown.
2022-05-30 13:08:02 +08:00
苏承祥
7e86884604 [HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (#5563)
Co-authored-by: 苏承祥 <sucx@tuya.com>
2022-05-28 22:35:47 -07:00
komao
8d2f009048 [HUDI-4124] Add valid check in Spark Datasource configs (#5637)
Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
2022-05-26 05:21:28 -07:00
RexAn
98c5c6c654 [HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (#5502)
* Along the lines of RDDCustomColumnsSortPartitioner but for Row
2022-05-26 10:39:04 +05:30
Danny Chan
4e42ed5eae [HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (part2) (#5676) 2022-05-26 11:21:39 +08:00
Sagar Sumit
cf837b4900 [HUDI-3193] Decouple hudi-aws from hudi-client-common (#5666)
Move HoodieMetricsCloudWatchConfig to hudi-client-common
2022-05-25 19:38:56 +05:30
喻兆靖
c20db99a7b [HUDI-2207] Support independent flink hudi clustering function 2022-05-24 20:16:48 +08:00