lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Alexey Kudinkin	7a9d48d126	[HUDI-3834] Fixing performance hits in reading Column Stats Index (#5266) Fixing performance hits in reading Column Stats Index: [HUDI-3834] There's substantial performance degradation in Avro 1.10 default generated Builder classes: they by default rely on SpecificData.getForSchema that load corresponding model's class using reflection, which takes a hit when executed on the hot-path (this was bringing overall runtime to read full Column Stats Index of 800k records to 60s, whereas now it's taking mere 3s) Addressing memory churn by over-used Hadoop's Path creation: Path ctor is not a lightweight sequence and produces quite a bit of memory churn adding pressure on GC. Cleaning such avoidable allocations up to make sure there's no unnecessarily added pressure on GC.	2022-04-10 13:42:06 -04:00
董可伦	15c264535f	[MINOR] Fix typos in the comments of HoodieMergeHandle (#5271)	2022-04-09 17:51:58 -07:00
Y Ethan Guo	3e97c88c4f	[HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268)	2022-04-09 15:30:11 -04:00
Raymond Xu	5e65aefc61	[HUDI-3837] Fix license and rat check settings (#5273) - add missing licenses - fix CI setting to run rat plugin - fix deploy script to include integ test modules	2022-04-09 11:01:18 -07:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267)	2022-04-08 23:14:08 -07:00
Y Ethan Guo	1cc7542357	[MINOR] Update README of docker build setup (#5256)	2022-04-08 16:12:25 -07:00
satishm	26eb7b8183	[HUDI-3571] Spark datasource continuous checkpoint should have own fs variable (#5265)	2022-04-08 07:16:01 -04:00
Alexey Kudinkin	d7cc767dbc	[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259) * Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record * Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME` * Cleaned up `HoodieBackedTableMetadataWriter` * Make sure REPLACE_COMMITS are handled as well	2022-04-08 15:58:31 +05:30
Danny Chan	67215abaf0	[HUDI-3827] Promote the inetAddress picking strategy for NetworkUtils#getHostname (#5260)	2022-04-08 14:33:56 +08:00
KnightChess	7a6272fba1	[HUDI-3781] fix spark delete sql can not delete record (#5215)	2022-04-08 14:26:40 +08:00
Sagar Sumit	df87095ef0	[HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252) * Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition. * If the list of log file paths in the split is empty, then fallback to usual behaviour.	2022-04-08 09:59:36 +05:30
Y Ethan Guo	672974c412	[HUDI-3823] Fix hudi-hive-sync-bundle to include HBase dependencies and shading (#5257)	2022-04-07 17:30:33 -07:00
Sivabalan Narayanan	ef06e4a526	[HUDI-3810] Fixing lazy read for metadata log record readers (#5241)	2022-04-07 18:40:51 -04:00
Y Ethan Guo	cd2c346df6	[HUDI-3637] Exclude uncommitted log files from metadata table validation (#5234)	2022-04-07 13:03:03 -07:00
Sivabalan Narayanan	b3c834a242	[HUDI-3571] Spark datasource continuous ingestion tool (#5156)	2022-04-07 14:13:46 -04:00
董可伦	6a8396420c	[HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051)	2022-04-07 04:21:03 -07:00
Y Ethan Guo	9d744bb35c	[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245)	2022-04-07 15:32:34 +05:30
xiarixiaoyao	531381faff	[HUDI-3096] fixed the bug that the cow table(contains decimalType) write by flink cannot be read by spark. (#4421)	2022-04-07 17:21:25 +08:00
Danny Chan	e33149be9a	[HUDI-3808] Flink bulk_insert timestamp(3) can not be read by Spark (#5236)	2022-04-07 15:17:39 +08:00
Alexey Kudinkin	d43b4cd95e	[HUDI-3739] Fix handling of the `isNotNull` predicate in Data Skipping (#5224) - Fix handling of the isNotNull predicate in Data Skipping	2022-04-06 15:17:36 -04:00
Raymond Xu	b2f09a1fee	[HUDI-3340] Fix deploy_staging_jars command (#5243)	2022-04-06 12:14:23 -07:00
rkkalluri	939b3d1b07	[HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205)	2022-04-06 13:35:32 -04:00
Raymond Xu	ca273274b0	[HUDI-3340] Fix deploy_staging_jars for different profiles (#5240)	2022-04-06 09:42:11 -07:00
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats Index records pertaining to the columns being queried by, instead of reading out whole Index. - Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS - Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader - Wiring key-prefix lookup t/h LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
BruceLin	7612549bcc	[MINOR] Fixing build failure when using flink-1.13 (#5214)	2022-04-06 16:07:20 +08:00
Sivabalan Narayanan	8683fb1d49	[HUDI-3800] Fixed preserve commit metadata for compaction for untouched records (#5232)	2022-04-06 13:26:53 +05:30
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Alexey Kudinkin	8baeb816d5	[HUDI-3723] Fixed stack overflows in Record Iterators (#5235)	2022-04-05 23:12:13 -04:00
Sagar Sumit	898be6174a	[HUDI-3782] Fixing table config when any of the index is disabled (#5222)	2022-04-05 23:06:52 -04:00
Vinoth Govindarajan	92ca426ab7	[HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220) * dbt example models to demonstrate hudi dbt integration * Fixed readme text	2022-04-05 08:58:13 -07:00
Yann Byron	3195f51562	[HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201)	2022-04-05 01:31:41 -07:00
ForwardXu	325b3d610a	[HUDI-3795] Fix hudi-examples checkstyle and maven enforcer error (#5221) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-05 01:10:11 -07:00
ForwardXu	3449e86989	[HUDI-3780] improve drop partitions (#5178)	2022-04-05 11:52:33 +08:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Vinoth Govindarajan	8add740d22	[HUDI-3534] [RFC-34] Added the implementation details for the BigQuery integration (#4503)	2022-04-03 03:53:25 -07:00
Y Ethan Guo	c34eb07598	[MINOR] Reuse deleteMetadataTable for disabling metadata table (#5217)	2022-04-03 16:12:14 +05:30
Sivabalan Narayanan	84064a9b08	[HUDI-3772] Fixing auto adjustment of lock configs for deltastreamer (#5207)	2022-04-02 23:44:10 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	74eb09be9b	[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213)	2022-04-02 18:22:57 -04:00
Vinoth Govindarajan	20964df770	[HUDI-3357] MVP implementation of BigQuerySyncTool (#5125) Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-02 13:18:06 -07:00
Y Ethan Guo	c19f505b5a	[HUDI-3784] Improve docs and logs of HoodieMetadataTableValidator (#5216)	2022-04-02 13:16:17 -07:00
todd5167	eef3f9c74a	[HUDI-3771] flink supports sync table information to aws glue (#5202)	2022-04-02 21:16:10 +08:00
YueZhang	020786a5f9	[HUDI-3451] Delete metadata table when the write client disables MDT (#5186) * Add checks for metadata table init to avoid possible out-of-sync * Revise the logic to reuse existing table config * Revise docs and naming Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-04-02 16:31:06 +05:30
Y Ethan Guo	b1e7e1f14e	[HUDI-3708] Fix failure with HoodieMetadataRecord due to schema compatibility check (#5204)	2022-04-01 20:17:02 -07:00
Y Ethan Guo	fb45fc9cb9	[HUDI-3773] Fix parallelism used for metadata table bloom filter index (#5209)	2022-04-01 20:14:07 -07:00
xiarixiaoyao	444ff496a4	[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910) * [HUDI-2560] introduce id_based schema to support full schema evolution. * add test for FileBasedInternalSchemaStorageManager and rebase code * add support for change column type and fix some test case * fix some bugs encountered in the production env and delete useless code * fix test error * rebase code * fixed some nested schema change bugs * [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark * [use dummyInternalSchema instead of null] * add support for spark3.1.x * remove support for spark3.1.x, since some compile fail * support spark3.1.x * rebase and prepare solve all comments * address all comments * rebase code * fixed the count() bug try to get internalSchema by parser commit file/history file directly, not use metaclient which is time cost address some comments * fixed all comments * fix new comments * rebase code, fix UT failed * fixed mistake * rebase code, fixed new comments * rebase code, and prepare for address new comments * address commits * address new comments * fix new issues * control fallback original write logical	2022-04-01 13:20:24 -07:00
Raymond Xu	9275b8fc7e	[HUDI-3468][RFC-49] Support sync with DataHub (#5022)	2022-04-01 12:27:01 -07:00
Sagar Sumit	dfdd2de99c	[HUDI-3225] [RFC-45] for async metadata indexing (#4640) * Add RFC for async metadata indexing Add more details * Add changes since last discussion * Add another race condition handling * Update rfc	2022-04-01 11:49:23 -07:00
Sivabalan Narayanan	7dfb168003	[HUDI-3763] Fixing hadoop conf class loading for inline reading (#5194)	2022-04-01 11:27:40 -04:00
董可伦	23b31225df	[HUDI-3769] Optimize the logs of HoodieMergeHandle and BufferedConnectWriter (#5200)	2022-04-01 21:17:49 +08:00

2738 Commits