lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Sagar Sumit	3d8fc78c66	[HUDI-3844] Update props in indexer based on table config (#5293 )	2022-04-11 18:16:06 -04:00
Alexey Kudinkin	458fdd5611	[HUDI-3841] Fixing Column Stats in the presence of Schema Evolution (#5275 ) Currently, Data Skipping does not correctly handle the case where column stats are not aligned and, for example, some of the (column, file) combinations are missing from the CSI. This can occur in different scenarios (schema evolution, CSI config changes) and has to be handled properly when composing the CSI projection for Data Skipping. This PR addresses that. - Added appropriate aligning for the transposed CSI projection	2022-04-11 15:45:53 -04:00
Sivabalan Narayanan	52ea1e4964	[MINOR] fixing timeline server for integ tests (#5289 )	2022-04-11 10:14:51 -04:00
RexXiong	5c41e30ac5	[HUDI-3817] shade parquet dependency for hudi-hadoop-mr-bundle (#5250 ) Co-authored-by: lvshuang.xjs <lvshuang.xjs@alibaba-inc.com>	2022-04-11 05:44:46 -07:00
Sivabalan Narayanan	2245a9515f	[HUDI-3798] Fixing ending of a transaction by different owner and removing some extraneous methods in trxn manager (#5255 )	2022-04-11 10:16:07 +05:30
Y Ethan Guo	63a099c5b7	[HUDI-3847] Fix NPE due to null schema in HoodieMetadataTableValidator (#5284 )	2022-04-10 17:59:29 -07:00
Sivabalan Narayanan	12731f5b89	[HUDI-3842] Integ tests for non partitioned datasets (#5276 ) - Adding non-partitioned support to integ tests - Fixing some of the test yamls and properties	2022-04-10 20:09:48 -04:00
Alexey Kudinkin	976840e8eb	[HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs (#5244 ) Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ between the write and read paths. More details can be found in HUDI-3812. - Fixing Data Skipping configuration to respect MT configs (on the Read path) - Tightening up DS handling of cases when no top-level columns are in the target query - Enhancing tests to cover all possible cases	2022-04-10 13:43:47 -04:00
Alexey Kudinkin	7a9d48d126	[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266 ) Fixing performance hits in reading Column Stats Index: [HUDI-3834] There's substantial performance degradation in Avro 1.10 default generated Builder classes: by default they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection and takes a hit when executed on the hot path (this was bringing the overall runtime to read a full Column Stats Index of 800k records to 60s, whereas it now takes a mere 3s). Addressing memory churn from over-used Hadoop Path creation: the Path ctor is not a lightweight sequence and produces quite a bit of memory churn, adding pressure on GC. Cleaning up such avoidable allocations to make sure there's no unnecessary added pressure on GC.	2022-04-10 13:42:06 -04:00
董可伦	15c264535f	[MINOR] Fix typos in the comments of HoodieMergeHandle (#5271 )	2022-04-09 17:51:58 -07:00
Y Ethan Guo	3e97c88c4f	[HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268 )	2022-04-09 15:30:11 -04:00
Raymond Xu	5e65aefc61	[HUDI-3837] Fix license and rat check settings (#5273 ) - add missing licenses - fix CI setting to run rat plugin - fix deploy script to include integ test modules	2022-04-09 11:01:18 -07:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Y Ethan Guo	1cc7542357	[MINOR] Update README of docker build setup (#5256 )	2022-04-08 16:12:25 -07:00
satishm	26eb7b8183	[HUDI-3571] Spark datasource continuous checkpoint should have own fs variable (#5265 )	2022-04-08 07:16:01 -04:00
Alexey Kudinkin	d7cc767dbc	[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259 ) * Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record * Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME` * Cleaned up `HoodieBackedTableMetadataWriter` * Make sure REPLACE_COMMITS are handled as well	2022-04-08 15:58:31 +05:30
Danny Chan	67215abaf0	[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260 )	2022-04-08 14:33:56 +08:00
KnightChess	7a6272fba1	[HUDI-3781] fix spark delete sql can not delete record (#5215 )	2022-04-08 14:26:40 +08:00
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252 ) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30
Y Ethan Guo	672974c412	[HUDI-3823] Fix hudi-hive-sync-bundle to include HBase dependencies and shading (#5257 )	2022-04-07 17:30:33 -07:00
Sivabalan Narayanan	ef06e4a526	[HUDI-3810] Fixing lazy read for metadata log record readers (#5241 )	2022-04-07 18:40:51 -04:00
Y Ethan Guo	cd2c346df6	[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234 )	2022-04-07 13:03:03 -07:00
Sivabalan Narayanan	b3c834a242	[HUDI-3571] Spark datasource continuous ingestion tool (#5156 )	2022-04-07 14:13:46 -04:00
董可伦	6a8396420c	[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051 )	2022-04-07 04:21:03 -07:00
Y Ethan Guo	9d744bb35c	[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245 )	2022-04-07 15:32:34 +05:30
xiarixiaoyao	531381faff	[HUDI-3096] fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark. (#4421 )	2022-04-07 17:21:25 +08:00
Danny Chan	e33149be9a	[HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236 )	2022-04-07 15:17:39 +08:00
Alexey Kudinkin	d43b4cd95e	[HUDI-3739] Fix handling of the `isNotNull` predicate in Data Skipping (#5224 ) - Fix handling of the isNotNull predicate in Data Skipping	2022-04-06 15:17:36 -04:00
Raymond Xu	b2f09a1fee	[HUDI-3340] Fix deploy_staging_jars command (#5243 )	2022-04-06 12:14:23 -07:00
rkkalluri	939b3d1b07	[HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205 )	2022-04-06 13:35:32 -04:00
Raymond Xu	ca273274b0	[HUDI-3340] Fix deploy_staging_jars for different profiles (#5240 )	2022-04-06 09:42:11 -07:00
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index. - Fixed usages of HFileScanner in HFileReader. Some code paths use the cached scanner if available; other code paths use their own HFileScanner with positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup through LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
BruceLin	7612549bcc	[MINOR] Fixing build failure when using flink-1.13 (#5214 )	2022-04-06 16:07:20 +08:00
Sivabalan Narayanan	8683fb1d49	[HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232 )	2022-04-06 13:26:53 +05:30
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Alexey Kudinkin	8baeb816d5	[HUDI-3723] Fixed stack overflows in Record Iterators (#5235 )	2022-04-05 23:12:13 -04:00
Sagar Sumit	898be6174a	[HUDI-3782] Fixing table config when any of the index is disabled (#5222 )	2022-04-05 23:06:52 -04:00
Vinoth Govindarajan	92ca426ab7	[HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220 ) * dbt example models to demonstrate hudi dbt integration * Fixed readme text	2022-04-05 08:58:13 -07:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201 )	2022-04-05 01:31:41 -07:00
ForwardXu	325b3d610a	[HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-05 01:10:11 -07:00
ForwardXu	3449e86989	[HUDI-3780] improve drop partitions (#5178 )	2022-04-05 11:52:33 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non-Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which, when enabled (default false for backward compatibility), ensures that partition metafiles are saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Vinoth Govindarajan	8add740d22	[HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration (#4503 )	2022-04-03 03:53:25 -07:00
Y Ethan Guo	c34eb07598	[MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217 )	2022-04-03 16:12:14 +05:30
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207 )	2022-04-02 23:44:10 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	74eb09be9b	[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213 )	2022-04-02 18:22:57 -04:00
Vinoth Govindarajan	20964df770	[HUDI-3357] MVP implementation of BigQuerySyncTool (#5125 ) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-02 13:18:06 -07:00
Y Ethan Guo	c19f505b5a	[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216 )	2022-04-02 13:16:17 -07:00
todd5167	eef3f9c74a	[HUDI-3771] flink supports sync table information to aws glue (#5202 )	2022-04-02 21:16:10 +08:00

2746 Commits