Commit Graph

2794 Commits

Author SHA1 Message Date
Sivabalan Narayanan
52ea1e4964 [MINOR] fixing timeline server for integ tests (#5289) 2022-04-11 10:14:51 -04:00
RexXiong
5c41e30ac5 [HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle (#5250)
Co-authored-by: lvshuang.xjs <lvshuang.xjs@alibaba-inc.com>
2022-04-11 05:44:46 -07:00
Sivabalan Narayanan
2245a9515f [HUDI-3798] Fixing ending of a transaction by different owner and removing some extraneous methods in trxn manager (#5255) 2022-04-11 10:16:07 +05:30
Y Ethan Guo
63a099c5b7 [HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284) 2022-04-10 17:59:29 -07:00
Sivabalan Narayanan
12731f5b89 [HUDI-3842] Integ tests for non partitioned datasets (#5276)
- Adding non-partitioned support to integ tests
- Fixing some of the test yamls and properties
2022-04-10 20:09:48 -04:00
Alexey Kudinkin
976840e8eb [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs (#5244)
Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ between the write and read paths. More details can be found in HUDI-3812.

- Fixing Data Skipping configuration to respect MT configs (on the Read path)
- Tightening up DS handling of cases when no top-level columns are in the target query
- Enhancing tests to cover all possible cases
2022-04-10 13:43:47 -04:00
Alexey Kudinkin
7a9d48d126 [HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266)
Fixing performance hits in reading Column Stats Index:

[HUDI-3834] There's a substantial performance degradation in the Builder classes that Avro 1.10 generates by default: they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection and takes a hit when executed on the hot path (this was bringing the overall runtime of reading the full Column Stats Index of 800k records to 60s, whereas it now takes a mere 3s)

Addressing memory churn caused by overuse of Hadoop's Path creation: the Path constructor is not a lightweight operation and produces quite a bit of memory churn, adding pressure on GC. Cleaning up such avoidable allocations to make sure there is no unnecessary added pressure on GC.
2022-04-10 13:42:06 -04:00
董可伦
15c264535f [MINOR] Fix typos in the comments of HoodieMergeHandle (#5271) 2022-04-09 17:51:58 -07:00
Y Ethan Guo
3e97c88c4f [HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268) 2022-04-09 15:30:11 -04:00
Raymond Xu
5e65aefc61 [HUDI-3837] Fix license and rat check settings (#5273)
- add missing licenses
- fix CI setting to run rat plugin
- fix deploy script to include integ test modules
2022-04-09 11:01:18 -07:00
Alexey Kudinkin
81b25c543a [HUDI-3825] Fixing Column Stats Index updating sequence (#5267) 2022-04-08 23:14:08 -07:00
Y Ethan Guo
1cc7542357 [MINOR] Update README of docker build setup (#5256) 2022-04-08 16:12:25 -07:00
satishm
26eb7b8183 [HUDI-3571] Spark datasource continuous checkpoint should have own fs variable (#5265) 2022-04-08 07:16:01 -04:00
Alexey Kudinkin
d7cc767dbc [HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259)
* Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record

* Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME`

* Cleaned up `HoodieBackedTableMetadataWriter`

* Make sure REPLACE_COMMITS are handled as well
2022-04-08 15:58:31 +05:30
Danny Chan
67215abaf0 [HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260) 2022-04-08 14:33:56 +08:00
KnightChess
7a6272fba1 [HUDI-3781] fix spark delete sql can not delete record (#5215) 2022-04-08 14:26:40 +08:00
Sagar Sumit
df87095ef0 [HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252)
* Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition.

* If the list of log file paths in the split is empty, then fall back to the usual behaviour.
2022-04-08 09:59:36 +05:30
Y Ethan Guo
672974c412 [HUDI-3823] Fix hudi-hive-sync-bundle to include HBase dependencies and shading (#5257) 2022-04-07 17:30:33 -07:00
Sivabalan Narayanan
ef06e4a526 [HUDI-3810] Fixing lazy read for metadata log record readers (#5241) 2022-04-07 18:40:51 -04:00
Y Ethan Guo
cd2c346df6 [HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234) 2022-04-07 13:03:03 -07:00
Sivabalan Narayanan
b3c834a242 [HUDI-3571] Spark datasource continuous ingestion tool (#5156) 2022-04-07 14:13:46 -04:00
董可伦
6a8396420c [HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051) 2022-04-07 04:21:03 -07:00
Y Ethan Guo
9d744bb35c [HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245) 2022-04-07 15:32:34 +05:30
xiarixiaoyao
531381faff [HUDI-3096] fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark. (#4421) 2022-04-07 17:21:25 +08:00
Danny Chan
e33149be9a [HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236) 2022-04-07 15:17:39 +08:00
Alexey Kudinkin
d43b4cd95e [HUDI-3739] Fix handling of the isNotNull predicate in Data Skipping (#5224)
- Fix handling of the isNotNull predicate in Data Skipping
2022-04-06 15:17:36 -04:00
Raymond Xu
b2f09a1fee [HUDI-3340] Fix deploy_staging_jars command (#5243) 2022-04-06 12:14:23 -07:00
rkkalluri
939b3d1b07 [HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205) 2022-04-06 13:35:32 -04:00
Raymond Xu
ca273274b0 [HUDI-3340] Fix deploy_staging_jars for different profiles (#5240) 2022-04-06 09:42:11 -07:00
Alexey Kudinkin
9e87d164b3 [HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208)
- Adding capability to fetch Metadata Records by key prefix so that Data Skipping can fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader: a few code paths use the cached scanner if available, while other code paths use their own HFileScanner with positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through the Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup through LogRecordScanner impls
- Cleaning up HoodieHFileReader impl

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-06 12:11:08 -04:00
BruceLin
7612549bcc [MINOR] Fixing build failure when using flink-1.13 (#5214) 2022-04-06 16:07:20 +08:00
Sivabalan Narayanan
8683fb1d49 [HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232) 2022-04-06 13:26:53 +05:30
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Alexey Kudinkin
8baeb816d5 [HUDI-3723] Fixed stack overflows in Record Iterators (#5235) 2022-04-05 23:12:13 -04:00
Sagar Sumit
898be6174a [HUDI-3782] Fixing table config when any of the index is disabled (#5222) 2022-04-05 23:06:52 -04:00
Vinoth Govindarajan
92ca426ab7 [HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)
* dbt example models to demonstrate hudi dbt integration

* Fixed readme text
2022-04-05 08:58:13 -07:00
Yann Byron
3195f51562 [HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201) 2022-04-05 01:31:41 -07:00
ForwardXu
325b3d610a [HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-05 01:10:11 -07:00
ForwardXu
3449e86989 [HUDI-3780] improve drop partitions (#5178) 2022-04-05 11:52:33 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when files other than Parquet/ORC are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Vinoth Govindarajan
8add740d22 [HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration (#4503) 2022-04-03 03:53:25 -07:00
Y Ethan Guo
c34eb07598 [MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217) 2022-04-03 16:12:14 +05:30
Sivabalan Narayanan
84064a9b08 [HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207) 2022-04-02 23:44:10 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00
Sagar Sumit
74eb09be9b [HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213) 2022-04-02 18:22:57 -04:00
Vinoth Govindarajan
20964df770 [HUDI-3357] MVP implementation of BigQuerySyncTool (#5125)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-02 13:18:06 -07:00
Y Ethan Guo
c19f505b5a [HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216) 2022-04-02 13:16:17 -07:00
todd5167
eef3f9c74a [HUDI-3771] flink supports sync table information to aws glue (#5202) 2022-04-02 21:16:10 +08:00
YueZhang
020786a5f9 [HUDI-3451] Delete metadata table when the write client disables MDT (#5186)
* Add checks for metadata table init to avoid possible out-of-sync

* Revise the logic to reuse existing table config

* Revise docs and naming

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
2022-04-02 16:31:06 +05:30
Y Ethan Guo
b1e7e1f14e [HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204) 2022-04-01 20:17:02 -07:00