2720 Commits

Author SHA1 Message Date
Danny Chan
e33149be9a [HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236) 2022-04-07 15:17:39 +08:00
Alexey Kudinkin
d43b4cd95e [HUDI-3739] Fix handling of the isNotNull predicate in Data Skipping (#5224)
- Fix handling of the isNotNull predicate in Data Skipping
2022-04-06 15:17:36 -04:00
Raymond Xu
b2f09a1fee [HUDI-3340] Fix deploy_staging_jars command (#5243) 2022-04-06 12:14:23 -07:00
rkkalluri
939b3d1b07 [HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205) 2022-04-06 13:35:32 -04:00
Raymond Xu
ca273274b0 [HUDI-3340] Fix deploy_staging_jars for different profiles (#5240) 2022-04-06 09:42:11 -07:00
Alexey Kudinkin
9e87d164b3 [HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208)
- Adding capability to fetch Metadata Records by key prefix so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available; other code paths use their own HFileScanner with positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup through LogRecordScanner impls
- Cleaning up HoodieHFileReader impl

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-06 12:11:08 -04:00
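
A minimal sketch of the key-prefix lookup idea from HUDI-3760, assuming records are stored sorted by key (as in an HFile). The class, method, and key layout below are hypothetical illustrations and not Hudi APIs: the lookup seeks to the first key at or after the prefix and scans forward only while keys still match, rather than reading the whole index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are Hudi APIs.
class KeyPrefixLookupSketch {

    // Seek to the first key >= prefix, then scan forward while keys still
    // start with the prefix. Sorted order keeps all matches contiguous.
    static List<String> lookupByKeyPrefix(NavigableMap<String, String> sortedRecords, String prefix) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : sortedRecords.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break; // past the prefix range, nothing further can match
            }
            matches.add(entry.getValue());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed key layout "<column>|<file>", chosen only for this example.
        NavigableMap<String, String> index = new TreeMap<>();
        index.put("price|file-1", "min=1,max=9");
        index.put("price|file-2", "min=4,max=7");
        index.put("qty|file-1", "min=0,max=3");
        System.out.println(lookupByKeyPrefix(index, "price|"));
    }
}
```
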
BruceLin
7612549bcc [MINOR] Fixing build failure when using flink-1.13 (#5214) 2022-04-06 16:07:20 +08:00
Sivabalan Narayanan
8683fb1d49 [HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232) 2022-04-06 13:26:53 +05:30
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Alexey Kudinkin
8baeb816d5 [HUDI-3723] Fixed stack overflows in Record Iterators (#5235) 2022-04-05 23:12:13 -04:00
Sagar Sumit
898be6174a [HUDI-3782] Fixing table config when any of the index is disabled (#5222) 2022-04-05 23:06:52 -04:00
Vinoth Govindarajan
92ca426ab7 [HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)
* dbt example models to demonstrate hudi dbt integration

* Fixed readme text
2022-04-05 08:58:13 -07:00
Yann Byron
3195f51562 [HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201) 2022-04-05 01:31:41 -07:00
ForwardXu
325b3d610a [HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-05 01:10:11 -07:00
ForwardXu
3449e86989 [HUDI-3780] improve drop partitions (#5178) 2022-04-05 11:52:33 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
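
A minimal sketch of how the 'hoodie.partition.metafile.use.data.format' switch from HUDI-3290 could select the metafile format. Only the config key and its default of false come from the commit message; the class, method, and the "properties" / "parquet" format labels are hypothetical and do not reflect Hudi's actual implementation.

```java
import java.util.Properties;

// Hypothetical sketch: only the config key is taken from the commit message.
class PartitionMetafileFormatSketch {

    static final String USE_DATA_FORMAT = "hoodie.partition.metafile.use.data.format";

    // Returns the format label for the partition metafile: the base file
    // format when the config is enabled, otherwise the legacy properties format.
    static String metafileFormat(Properties tableConfig, String baseFileFormat) {
        // Default is false for backward compatibility, per the commit message.
        boolean useDataFormat =
            Boolean.parseBoolean(tableConfig.getProperty(USE_DATA_FORMAT, "false"));
        return useDataFormat ? baseFileFormat : "properties";
    }

    public static void main(String[] args) {
        Properties tableConfig = new Properties();
        System.out.println(metafileFormat(tableConfig, "parquet"));
        tableConfig.setProperty(USE_DATA_FORMAT, "true");
        System.out.println(metafileFormat(tableConfig, "parquet"));
    }
}
```
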
Vinoth Govindarajan
8add740d22 [HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration (#4503) 2022-04-03 03:53:25 -07:00
Y Ethan Guo
c34eb07598 [MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217) 2022-04-03 16:12:14 +05:30
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Sagar Sumit
74eb09be9b [HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213) 2022-04-02 18:22:57 -04:00
Vinoth Govindarajan
20964df770 [HUDI-3357] MVP implementation of BigQuerySyncTool (#5125)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-02 13:18:06 -07:00
Y Ethan Guo
c19f505b5a [HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216) 2022-04-02 13:16:17 -07:00
todd5167
eef3f9c74a [HUDI-3771] flink supports sync table information to aws glue (#5202) 2022-04-02 21:16:10 +08:00
YueZhang
020786a5f9 [HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-04-02 16:31:06 +05:30
Y Ethan Guo
b1e7e1f14e [HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204) 2022-04-01 20:17:02 -07:00
Y Ethan Guo
fb45fc9cb9 [HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209) 2022-04-01 20:14:07 -07:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManager and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since some compilation fails

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code, fix UT failures

* fixed mistake

* rebase code, fixed new comments

* rebase code, and prepare to address new comments

* address commits

* address new comments

* fix new issues

* control fallback to the original write logic
2022-04-01 13:20:24 -07:00
Raymond Xu
9275b8fc7e [HUDI-3468][RFC-49] Support sync with DataHub (#5022) 2022-04-01 12:27:01 -07:00
Sagar Sumit
dfdd2de99c [HUDI-3225] [RFC-45] for async metadata indexing (#4640)
* Add RFC for async metadata indexing

Add more details

* Add changes since last discussion

* Add another race condition handling

* Update rfc
2022-04-01 11:49:23 -07:00
Sivabalan Narayanan
7dfb168003 [HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194) 2022-04-01 11:27:40 -04:00
董可伦
23b31225df [HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200) 2022-04-01 21:17:49 +08:00
Danny Chan
6df14f15a3 [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880) 2022-04-01 20:46:51 +08:00
ForwardXu
98b4e9796e [HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957)
* [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata

* fix comments
2022-04-01 10:01:41 +08:00
Sagar Sumit
a048e940fd [HUDI-3743] Support DELETE_PARTITION for metadata table (#5169)
In order to drop any metadata partition (index), we can reuse the DELETE_PARTITION operation in metadata table. Subsequent to this, we can support drop index (with table config update) for async metadata indexer.

- Add a new API in HoodieTableMetadataWriter
- Currently only supported for Spark metadata writer
2022-03-31 21:29:17 -04:00
Sagar Sumit
28dafa774e [HUDI-2488][HUDI-3175] Implement async metadata indexing (#4693)
- Add a new action called INDEX, whose state transition is described in the RFC.
- Changes in timeline to support the new action.
- Add an index planner in ScheduleIndexActionExecutor.
- Add index plan executor in RunIndexActionExecutor.
- Add 3 APIs in HoodieTableMetadataWriter; a) scheduleIndex: will generate an index plan based on latest completed instant, initialize file groups and add a requested INDEX instant, b) index: executes the index plan and also takes care of writes that happened after indexing was requested, c) dropIndex: will drop index by removing the given metadata partition.
- Add 2 new table configs to serve as the source of truth for inflight and completed indexes.
- Support upgrade/downgrade taking care of the newly added configs.
- Add tool to trigger indexing in HoodieIndexer.
- Handle corner cases related to partial failures.
- Abort gracefully after deleting partition and instant.
- Handle other actions in timeline to consider before catching up
2022-04-01 01:33:12 +05:30
liuhe0702
1da196c1e8 [HUDI-2777] Improve HoodieSparkSqlWriter write performance (#5187) 2022-03-31 12:48:47 -07:00
codejoyan
51a701cef1 [HUDI-3020] Utility to create manifest file (#5153)
Co-authored-by: joyan <joyan.sil@walmart.com>
2022-03-31 07:22:03 -07:00
xiarixiaoyao
7889c7852f [HUDI-3729][SPARK] fixed the perf regression by enabling vectorizeReader for parquet files (#5168)
* [MINOR][SPARK] fixed the perf regression by enabling vectorizeReader for parquet files

* address comments

* add perf result
2022-03-31 20:09:26 +08:00
Sivabalan Narayanan
73a21092f8 [HUDI-3732] Fixing rollback validation (#5157)
* Fixing rollback validation

* Adding tests
2022-03-31 17:25:24 +05:30
ForwardXu
80011df995 [HUDI-3135] Make delete partitions lazy to be executed by the cleaner (#4489)
As of now, delete partitions ensures that all file groups are deleted, but the partition itself is not deleted. As a result, getting all partitions may also return the deleted partitions, although no data is served for them since all their file groups are gone. This patch fixes that by letting the cleaner delete a partition once all file groups pertaining to it are deleted.

- Fixed the CleanPlanActionExecutor to return meta info about list of partitions to be deleted. If there are no valid file groups for a partition, clean planner will include the partition to be deleted.
- Fixed HoodieCleanPlan avro schema to include the list of partitions to be deleted
- CleanActionExecutor is fixed to delete partitions if any (as per clean plan)
- Same info is added to HoodieCleanMetadata
- When applying clean metadata, the metadata table checks for partitions to be deleted and updates the "all_partitions" record for the deleted partitions.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-31 03:35:39 -04:00
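
A minimal sketch of the planning rule described in HUDI-3135: a partition with no valid file groups left is included in the list of partitions to delete. The class, method, and input map are hypothetical stand-ins for illustration and are not the CleanPlanActionExecutor API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not the actual clean planner implementation.
class LazyPartitionCleanSketch {

    // Given the valid file groups remaining in each partition, return the
    // partitions the cleaner should delete: those with no valid file groups.
    static List<String> partitionsToDelete(Map<String, List<String>> validFileGroupsByPartition) {
        List<String> toDelete = new ArrayList<>();
        // TreeMap copy gives a deterministic, sorted iteration order.
        for (Map.Entry<String, List<String>> entry
                : new TreeMap<>(validFileGroupsByPartition).entrySet()) {
            if (entry.getValue().isEmpty()) {
                toDelete.add(entry.getKey());
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        Map<String, List<String>> state = new TreeMap<>();
        state.put("2022/03/29", List.of());             // all file groups already deleted
        state.put("2022/03/30", List.of("fg-1", "fg-2"));
        System.out.println(partitionsToDelete(state));
    }
}
```
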
Sivabalan Narayanan
3cdb590e15 [HUDI-3733] Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli (#5158)
* Adding HoodieFailedWritesCleaningPolicy for restore with hudi-cli

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-31 13:00:49 +05:30
Yuwei XIAO
ce45f7f129 [HUDI-3692] MetadataFileSystemView includes compaction in timeline (#5110) 2022-03-31 02:24:59 -04:00
Sivabalan Narayanan
4569734d60 [HUDI-3713] Guarding archival for multi-writer (#5138) 2022-03-31 01:44:31 -04:00
Y Ethan Guo
f6ff95f97c [MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184) 2022-03-30 21:48:54 -07:00
YueZhang
2dbb273d26 [HUDI-3721] Delete MDT if necessary when trigger rollback to savepoint (#5173)
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2022-03-30 20:26:37 -07:00
KnightChess
2c4554fada [HUDI-3750] Fix NPE when build HoodieFileIndex (#5134)
Co-authored-by: wulingqi <wulingqi@baijiahulian.com>
2022-03-30 19:19:05 -07:00
Sivabalan Narayanan
d80c80699f [MINOR] Fixing flakiness in TestHoodieSparkMergeOnReadTableRollback.testRollbackWithDeltaAndCompactionCommit (#5183) 2022-03-30 22:07:22 -04:00
Y Ethan Guo
4fb1a590b1 [HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176) 2022-03-30 18:08:35 -07:00
Y Ethan Guo
9830005e9b [HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171) 2022-03-30 17:35:56 -07:00