lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
v-zhangjc9	d9581682a2	Add option to control use hsync or not	2024-05-24 15:17:37 +08:00
v-zhangjc9	215a794fd3	Add victoria metrics reporter	2024-05-24 15:17:37 +08:00
v-zhangjc9	32f7e323dc	Change version to private	2024-05-24 15:16:38 +08:00
Alexey Kudinkin	cfd0c1ee34	[HUDI-4081][HUDI-4472] Addressing Spark SQL vs Spark DS performance gap (#6213 )	2022-07-28 15:36:03 -07:00
Shawn Chang	70b5cf6dab	[MINOR] Minor changes around Spark 3.3 support (#6231 ) Co-authored-by: Shawn Chang <yxchang@amazon.com>	2022-07-28 09:32:34 -07:00
Danny Chan	07eedd3ef6	[HUDI-4484] Add default lock config options for flink metadata table (#6222 )	2022-07-28 20:57:13 +08:00
Shawn Chang	cdaec5a8da	[HUDI-4186] Support Hudi with Spark 3.3.0 (#5943 ) Co-authored-by: Shawn Chang <yxchang@amazon.com>	2022-07-27 14:47:49 -07:00
Y Ethan Guo	924c30c7ea	[HUDI-4469] Flip reuse flag to true in HoodieBackedTableMetadata to improve file listing (#6214 )	2022-07-27 14:04:59 -07:00
冯健	e5faf2cc84	[HUDI-4210] Create custom hbase index to solve data skew issue on hbase regions (#5797 )	2022-07-26 18:09:17 +08:00
Alexey Kudinkin	e7c8df7e8b	[HUDI-4250][HUDI-4202] Optimize performance of Column Stats Index reading in Data Skipping (#5746 ) We provide an alternative way of fetching Column Stats Index within the reading process to avoid the penalty of a more heavy-weight execution scheduled through a Spark engine.	2022-07-25 15:36:12 -07:00
Sagar Sumit	6e7ac45735	[HUDI-3884] Support archival beyond savepoint commits (#5837 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-07-25 13:42:29 -05:00
Shiyan Xu	eee6a02f77	[HUDI-4456] Clean up test resources (#6203 )	2022-07-25 10:13:06 -05:00
Tim Brown	76a28daeb0	[HUDI-4456] Close FileSystem in SparkClientFunctionalTestHarness (#6201 )	2022-07-24 21:42:15 -07:00
Vander	2a08a65f71	[MINOR] Fix typos in Spark client related classes (#6204 )	2022-07-24 21:41:42 -07:00
Rahil C	1a5a9f7f03	[HUDI-4439] Fix Amazon CloudWatch reporter for metadata enabled tables (#6164 ) Co-authored-by: Udit Mehrotra <uditme@amazon.com> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-07-23 21:08:21 -07:00
Danny Chan	a0ffd05b77	[HUDI-4448] Remove the latest commit refresh for timeline server (#6179 )	2022-07-23 16:10:53 -07:00
Alexey Kudinkin	2d745057ea	[HUDI-4420] Fixing table schema delineation on partition/data schema for Spark relations (#5708 )	2022-07-23 16:59:16 -05:00
Sagar Sumit	da28e38fe3	[HUDI-4071] Make NONE sort mode as default for bulk insert (#6195 )	2022-07-23 14:37:04 -05:00
Alexey Kudinkin	eea4a692c0	[HUDI-4039] Make sure all builtin `KeyGenerator`s properly implement Spark specific APIs (#5523 ) This set of changes makes sure that all builtin KeyGenerators properly implement Spark-specific APIs in a performant way (minimizing key-generators overhead)	2022-07-22 08:35:07 -07:00
Sivabalan Narayanan	2e0dd29714	[HUDI-4204] Fixing NPE with row writer path and with OCC (#5850 )	2022-07-21 15:57:34 -07:00
Y Ethan Guo	50cdb867c7	[HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table (#6113 ) Fixes the missing bloom filters in metadata table in the non-partitioned table due to incorrect record key generation, because of wrong file names when generating the metadata payload for the bloom filter.	2022-07-21 11:38:25 -07:00
Alexey Kudinkin	a33bdd32e3	[HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470 )	2022-07-21 06:20:47 -07:00
Alexey Kudinkin	de37774e12	[HUDI-3896] Porting Nested Schema Pruning optimization for Hudi's custom Relations (#5428 ) Currently, all Hudi Relations bear performance gap relative to Spark's HadoopFsRelation and the reason to that is SchemaPruning optimization rule (pruning nested schemas) that is unfortunately predicated on usage of HadoopFsRelation, meaning that it's not applied in cases when any other relation is used. This change is porting this rule to Hudi relations (MOR, Incremental, etc) by the virtue of leveraging HoodieSparkSessionExtensions mechanism injecting modified version of the original SchemaPruning rule that is adopted to work w/ Hudi's custom relations. - Added customOptimizerRules to HoodieAnalysis - Added NestedSchemaPrunning Spark's Optimizer rule - Handle Spark's Optimizer pruned data schema (to effectively prune nested schemas) - Enable HoodieClientTestHarness to inject HoodieSparkSessionExtensions - Injecting Spark Session extensions for TestMORDataSource, TestCOWDataSource - Disabled fallback to HadoopFsRelation	2022-07-21 15:06:06 +05:30
Danny Chan	e3675fe9b0	[HUDI-4372] Enable matadata table by default for flink (#6066 )	2022-07-20 16:10:19 +08:00
冯健	382d19e85b	[HUDI-4065] Add FileBasedLockProvider (#6071 )	2022-07-19 07:52:47 +08:00
liujinhui	1959b843b7	[HUDI-4409] Improve LockManager wait logic when catch exception (#6122 )	2022-07-18 22:45:52 +08:00
Alexey Kudinkin	4bda6afe0b	[HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily (#5855 )	2022-07-16 18:26:48 -05:00
Danny Chan	6aec9d754f	[HUDI-4408] Reuse old rollover file as base file for flink merge handle (#6120 )	2022-07-16 20:46:23 +08:00
Danny Chan	05606708fa	[HUDI-4393] Add marker file for target file when flink merge handle rolls over (#6103 )	2022-07-14 16:00:08 +08:00
liujinhui	126b88b48d	[HUDI-2150] Rename/Restructure configs for better modularity (#6061 ) - Move clean related configuration to HoodieCleanConfig - Move Archival related configuration to HoodieArchivalConfig - hoodie.compaction.payload.class move this to HoodiePayloadConfig	2022-07-09 20:00:48 +05:30
xiarixiaoyao	b686c07407	[HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017 ) * [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields. * fix comments Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>	2022-07-09 03:08:38 +08:00
Danny Chan	7eeaff9ee0	[HUDI-4357] Support flink 1.15.x (#6050 )	2022-07-06 13:42:58 +08:00
xi chaomin	23c9c5c296	[HUDI-3836] Improve the way of fetching metadata partitions from table (#5286 ) Co-authored-by: xicm <xicm@asiainfo.com>	2022-07-05 07:50:17 -07:00
Danny Chan	47792a3186	[HUDI-4353] Column stats data skipping for flink (#6026 )	2022-07-03 08:29:31 +08:00
bschell	fd7d25ab63	[HUDI-1176] Upgrade hudi to log4j2 (#5366 ) * Move to log4j2 cr: https://code.amazon.com/reviews/CR-71010705 * Upgrade unit tests to log4j2 * update exclusion Co-authored-by: Brandon Scheller <bschelle@amazon.com>	2022-06-28 12:54:23 -07:00
Alexey Kudinkin	ed823f1c6f	[HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (#5970 ) Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller	2022-06-28 12:27:32 -07:00
Sagar Sumit	8846849a03	[HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (#5930 )	2022-06-27 17:27:16 +05:30
cxzl25	72fa19bcc9	[HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (#5959 )	2022-06-27 11:09:30 +08:00
Alexey Kudinkin	c86edfc28e	[HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (#5966 ) * Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)	2022-06-24 23:52:28 -04:00
xi chaomin	30ebdc708b	[HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (#5874 )	2022-06-24 02:47:36 -04:00
Zhaojing Yu	6456bd3a51	[HUDI-4273] Support inline schedule clustering for Flink stream (#5890 ) * [HUDI-4273] Support inline schedule clustering for Flink stream * delete deprecated clustering plan strategy and add clustering ITTest	2022-06-24 11:28:06 +08:00
Zhaojing Yu	c7e430bb46	Revert master (#5925 ) * Revert "udate" This reverts commit `092e35c1e3`. * Revert "[HUDI-3475] Initialize hudi table management module." This reverts commit `4640a3bbb8`.	2022-06-21 16:58:50 +08:00
喻兆靖	4640a3bbb8	[HUDI-3475] Initialize hudi table management module.	2022-06-21 15:21:30 +08:00
RexAn	17ac5a4573	[HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (#5723 )	2022-06-20 23:02:34 +05:30
huberylee	d4f0326b4b	[HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894 )	2022-06-20 14:29:21 +08:00
Danny Chan	0811bb38fb	[HUDI-4255] Make the flink merge and replace handle intermediate file visible (#5866 )	2022-06-15 14:23:23 +08:00
Danny Chan	25bbff64cf	[minor] Following HUDI-4207, remote the new wrapper #init method (#5865 )	2022-06-15 08:48:13 +08:00
HunterXHunter	264b15df87	[HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (#5788 ) Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception	2022-06-13 10:36:06 -04:00
xi chaomin	e89f5627e4	[HUDI-3682] testReaderFilterRowKeys fails in TestHoodieOrcReaderWriter (#5790 ) TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in current UT does not populate the meta field and the schema does not contains meta fields. This fix writes data with schema which contains meta fields and calls writeAvroWithMetadata for writing. Co-authored-by: xicm <xicm@asiainfo.com>	2022-06-13 10:22:12 -04:00
Alexey Kudinkin	35afdb4316	[HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737 ) There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects it to implement certain APIs which are not implemented at the moment, instead we're using custom Resolution rule (in HoodieSpark3Analysis) to instead manually fallback to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.	2022-06-07 16:30:46 -07:00

1 2 3 4 5 ...

972 Commits