898 Commits

Author SHA1 Message Date
Shawy Geng
ad773b3d96 [HUDI-3654] Preparations for hudi metastore. (#5572)
* [HUDI-3654] Preparations for hudi metastore.

Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>
2022-05-17 09:47:10 +08:00
Danny Chan
43e08193ef [HUDI-4098] Metadata table heartbeat for instant has expired, last heartbeat 0 (#5583) 2022-05-16 17:40:08 +08:00
Yuwei XIAO
61030d8e7a [HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480)
1. basic write path (insert/upsert) implementation
2. adapt simple bucket index
2022-05-16 11:07:01 +08:00
xi chaomin
6e16e719cd [HUDI-3980] Support kerberos hbase index (#5464)
- Add configurations in HoodieHBaseIndexConfig.java to support kerberos hbase connection.

Co-authored-by: xicm <xicm@asiainfo.com>
2022-05-14 07:37:31 -04:00
wqwl611
52e63b39d6 [HUDI-4097] add table info to jobStatus (#5529)
Co-authored-by: wqwl611 <wqwl611@gmail.com>
2022-05-13 21:01:15 -04:00
Alexey Kudinkin
4a8589f222 [HUDI-4038] Avoid calling getDataSize after every record written (#5497)
- getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost.

Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-05-11 08:08:31 -04:00
Sivabalan Narayanan
6285a239a3 [HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462)
- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.
2022-05-09 12:40:22 -04:00
guanziyue
abb4893b25 [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (#4264) 2022-05-05 13:49:34 -07:00
Sagar Sumit
1562bb658f [HUDI-4031] Avoid clustering update handling when no pending replacecommit (#5487) 2022-05-04 10:17:11 -04:00
xicm
f492c52ee4 [HUDI-3862] Fix default configurations of HoodieHBaseIndexConfig (#5308)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-04-29 16:21:52 -07:00
LiChuang
4e928a6fe1 [HUDI-3943] Some description fixes for 0.10.1 docs (#5447) 2022-04-28 15:18:56 -07:00
Danny Chan
e1ccf2e00b [HUDI-3977] Flink hudi table with date type partition path throws HoodieNotSupportedException (#5432) 2022-04-27 13:19:55 +08:00
Yuwei XIAO
f2ba0fead2 [HUDI-3085] Improve bulk insert partitioner abstraction (#4441) 2022-04-25 18:42:17 +08:00
Alexey Kudinkin
c05a4e7b6f [HUDI-3934] Fix Spark32HoodieParquetFileFormat not being compatible w/ Spark 3.2.0 (#5378)
- Due to the fact that Spark 3.2.1 is non-BWC w/ 3.2.0, we have to handle all these incompatibilities in Spark32HoodieParquetFileFormat. This PR is addressing that.

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-21 21:00:38 -04:00
xiarixiaoyao
037f89ee7c [HUDI-3921] Fixed schema evolution cannot work with HUDI-3855 (#5376)
- When column names are renamed (schema evolution enabled), renamed columns weren't handled correctly while copying records from the old data file with HoodieMergeHandle.
2022-04-21 18:27:54 -04:00
Sagar Sumit
de5fa1fe03 [HUDI-3940] Fix retry count increment in lock manager (#5387) 2022-04-21 16:52:05 -04:00
Alexey Kudinkin
4b296f79cc [HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377) 2022-04-21 01:36:19 -07:00
Sivabalan Narayanan
a9506aa545 [HUDI-3938] Fix default value for num retries to acquire lock (#5380) 2022-04-21 01:08:43 -07:00
Alexey Kudinkin
f7544e23ac [HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364)
- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
- Fallback to append partition values in cases when the source columns are not persisted in data-file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
2022-04-20 19:30:27 +08:00
Sagar Sumit
4f44e6aeb5 [HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342)
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-04-18 22:28:46 -04:00
Sagar Sumit
1718bcab84 [HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347) 2022-04-18 13:34:04 -04:00
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
董可伦
99dd1cb6e6 [HUDI-3835] Add UT for delete in java client (#5270) 2022-04-15 15:03:48 -04:00
Sivabalan Narayanan
57612c5c32 [HUDI-3848] Fixing restore with cleaned up commits (#5288) 2022-04-15 14:47:53 -04:00
Y Ethan Guo
bab691692e [HUDI-3686] Fix inline and async table service check in HoodieWriteConfig (#5307) 2022-04-13 17:33:26 -04:00
Alexey Kudinkin
7b78dff45f [HUDI-3855] Fixing FILENAME_METADATA_FIELD not being correctly updated in HoodieMergeHandle (#5296)
Fixing FILENAME_METADATA_FIELD not being correctly updated in HoodieMergeHandle, in cases when old-record is carried over from existing file as is.

- Revisited HoodieFileWriter API to accept HoodieKey instead of HoodieRecord
- Fixed FILENAME_METADATA_FIELD not being overridden in cases when simply old record is carried over
- Exposing standard JVM's debugger ports in Docker setup
2022-04-12 20:42:15 -04:00
Alexey Kudinkin
101b82a679 [HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274)
* Fixing incorrect selection of MT partitions to be updated

* Ensure that metadata partitions table config is inherited correctly

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-12 13:37:52 +05:30
Sivabalan Narayanan
f91e9e63e1 [HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261) 2022-04-11 21:02:43 -07:00
Sagar Sumit
3d8fc78c66 [HUDI-3844] Update props in indexer based on table config (#5293) 2022-04-11 18:16:06 -04:00
Sivabalan Narayanan
2245a9515f [HUDI-3798] Fixing ending of a transaction by different owner and removing some extraneous methods in trxn manager (#5255) 2022-04-11 10:16:07 +05:30
董可伦
15c264535f [MINOR] Fix typos in the comments of HoodieMergeHandle (#5271) 2022-04-09 17:51:58 -07:00
Y Ethan Guo
3e97c88c4f [HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268) 2022-04-09 15:30:11 -04:00
Alexey Kudinkin
81b25c543a [HUDI-3825] Fixing Column Stats Index updating sequence (#5267) 2022-04-08 23:14:08 -07:00
Alexey Kudinkin
d7cc767dbc [HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259)
* Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record

* Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME`

* Cleaned up `HoodieBackedTableMetadataWriter`

* Make sure REPLACE_COMMITS are handled as well
2022-04-08 15:58:31 +05:30
Y Ethan Guo
9d744bb35c [HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245) 2022-04-07 15:32:34 +05:30
Danny Chan
e33149be9a [HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236) 2022-04-07 15:17:39 +08:00
Alexey Kudinkin
9e87d164b3 [HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208)
- Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available; other code paths use their own HFileScanner w/ positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup t/h LogRecordScanner impls
- Cleaning up HoodieHFileReader impl

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-06 12:11:08 -04:00
Sivabalan Narayanan
8683fb1d49 [HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232) 2022-04-06 13:26:53 +05:30
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Sagar Sumit
898be6174a [HUDI-3782] Fixing table config when any of the index is disabled (#5222) 2022-04-05 23:06:52 -04:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

- Turned on parquet meta format for COW data source tests
- Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Y Ethan Guo
c34eb07598 [MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217) 2022-04-03 16:12:14 +05:30
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Sagar Sumit
74eb09be9b [HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213) 2022-04-02 18:22:57 -04:00
YueZhang
020786a5f9 [HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-04-02 16:31:06 +05:30
Y Ethan Guo
fb45fc9cb9 [HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209) 2022-04-01 20:14:07 -07:00
xiarixiaoyao
444ff496a4 [RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910)
* [HUDI-2560] introduce id_based schema to support full schema evolution.

* add test for FileBasedInternalSchemaStorageManager and rebase code

* add support for change column type and fix some test case

* fix some bugs encountered in the production env and delete useless code

* fix test error

* rebase code

* fixed some nested schema change bugs

* [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark

* [use dummyInternalSchema instead of null]

* add support for spark3.1.x

* remove support for spark3.1.x, since some compilation fails

* support spark3.1.x

* rebase and prepare solve all comments

* address all comments

* rebase code

* fixed the count(*) bug

* try to get internalSchema by parsing the commit file/history file directly instead of using metaclient, which is time-consuming
address some comments

* fixed all comments

* fix new comments

* rebase code, fix failed UT

* fixed mistake

* rebase code, fixed new comments

* rebase code, and prepare to address new comments

* address commits

* address new comments

* fix new issues

* control fallback to original write logic
2022-04-01 13:20:24 -07:00
董可伦
23b31225df [HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200) 2022-04-01 21:17:49 +08:00
Danny Chan
6df14f15a3 [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC (#4880) 2022-04-01 20:46:51 +08:00