lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Alexey Kudinkin	4f7ea8c79a	[HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (#5733 ) As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1 TB) having pretty fat commits where Hudi's commit metadata could reach into 100s of MB. Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time. - Rebased onto new APIs to avoid excessive Hadoop's Path allocations - Eliminated hasOperationField completely to avoid repetitive computations - Cleaning up duplication in HoodieActiveTimeline - Added caching for common instances of HoodieCommitMetadata - Made tableStructSchema lazy;	2022-06-06 13:14:26 -04:00
marchpure	73b0be3c96	[HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw NullPointerException (#5755 ) SeekTo top cells avoid NullPointerException	2022-06-06 12:07:26 +08:00
KnightChess	51602a34f7	[HUDI-4179] Cluster with sort columns invalid (#5739 )	2022-06-02 20:28:21 +08:00
Danny Chan	7f8630cc57	[HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716 ) The timeline refresh on table initialization invokes the fs view #sync, which has two actions now: 1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata 2. if this is a local fs view, clear all the local states; if this is a remote fs view, send a request to sync the remote fs view. But consider the construction: the meta client is freshly instantiated, so the timeline is already the latest; the table is also freshly constructed, so the fs view has no local states. That means the #sync is entirely unnecessary. In this patch, the metadata lifecycle and data set fs view are kept in sync: when the fs view is refreshed, the underlying metadata is also refreshed synchronously. The freshness of the metadata follows the same rules as the data fs view: 1. if the fs view is local, the visibility is based on the client table metadata client's latest commit 2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline. From the perspective of the client, there is no need to care about the refresh action anymore, no matter whether the metadata table is enabled or not. That makes the client logic clearer and less error-prone. Removing the timeline refresh has another benefit: it avoids unnecessary #refresh of the remote fs view. If all the clients send requests to #sync the remote fs view, the server would encounter conflicts and the clients would encounter a response error.	2022-06-02 09:48:48 +08:00
Danny Chan	329da34ee0	[HUDI-4163] Catch general exception instead of IOException while fetching rollback plan during rollback (#5703 ) If the avro file is corrupted, an InvalidAvroMagicException is thrown.	2022-05-30 13:08:02 +08:00
苏承祥	7e86884604	[HUDI-4086] Use CustomizedThreadFactory in async compaction and clustering (#5563 ) Co-authored-by: 苏承祥 <sucx@tuya.com>	2022-05-28 22:35:47 -07:00
komao	8d2f009048	[HUDI-4124] Add valid check in Spark Datasource configs (#5637 ) Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>	2022-05-26 05:21:28 -07:00
RexAn	98c5c6c654	[HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (#5502 ) * Along the lines of RDDCustomColumnsSortPartitioner but for Row	2022-05-26 10:39:04 +05:30
Danny Chan	4e42ed5eae	[HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (part2) (#5676 )	2022-05-26 11:21:39 +08:00
Sagar Sumit	cf837b4900	[HUDI-3193] Decouple hudi-aws from hudi-client-common (#5666 ) Move HoodieMetricsCloudWatchConfig to hudi-client-common	2022-05-25 19:38:56 +05:30
喻兆靖	c20db99a7b	[HUDI-2207] Support independent flink hudi clustering function	2022-05-24 20:16:48 +08:00
Danny Chan	eb219010d2	[HUDI-4145] Archives the metadata file in HoodieInstant.State sequence (#5669 )	2022-05-24 17:33:30 +08:00
Sivabalan Narayanan	c05ebf2417	[HUDI-2473] Fixing compaction write operation in commit metadata (#5203 )	2022-05-24 13:03:21 +05:30
Danny Chan	676d5cefe0	[HUDI-4138] Fix the concurrency modification of hoodie table config for flink (#5660 ) * Remove the metadata cleaning strategy for flink, that means the multi-modal index may be affected * Improve the HoodieTable#clearMetadataTablePartitionsConfig to only update table config when necessary * Remove the modification of read code path in HoodieTableConfig	2022-05-24 13:07:55 +08:00
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
Danny Chan	c7576f7613	[HUDI-4130] Remove the upgrade/downgrade for flink #initTable (#5642 )	2022-05-20 21:31:23 +08:00
Danny Chan	6f37863ba8	[HUDI-4114] Remove the unnecessary fs view sync for BaseWriteClient#initTable (#5617 ) No need to #sync actively because the table instance is freshly instantiated and its view manager has empty view instances; the fs view would be synced lazily when it is requested.	2022-05-19 10:59:05 +08:00
Danny Chan	f1f8a1abb7	[HUDI-4109] Copy the old record directly when it is chosen for merging (#5603 )	2022-05-18 10:17:00 +08:00
Danny Chan	ebbe56e862	[minor] Some code refactoring for LogFileComparator and Instant instantiation (#5600 )	2022-05-18 09:30:09 +08:00
BruceLin	99555c897a	[HUDI-4110] Clean the marker files for flink compaction (#5604 )	2022-05-17 21:09:27 +08:00
Danny Chan	d52d13302d	[HUDI-4101] BucketIndexPartitioner should take partition path for better dispersion (#5590 )	2022-05-17 10:34:57 +08:00
Shawy Geng	ad773b3d96	[HUDI-3654] Preparations for hudi metastore. (#5572 ) * [HUDI-3654] Preparations for hudi metastore. Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>	2022-05-17 09:47:10 +08:00
Danny Chan	43e08193ef	[HUDI-4098] Metadata table heartbeat for instant has expired, last heartbeat 0 (#5583 )	2022-05-16 17:40:08 +08:00
Yuwei XIAO	61030d8e7a	[HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480 ) 1. basic write path(insert/upsert) implementation 2. adapt simple bucket index	2022-05-16 11:07:01 +08:00
xi chaomin	6e16e719cd	[HUDI-3980] Support kerberos hbase index (#5464 ) - Add configurations in HoodieHBaseIndexConfig.java to support kerberos hbase connection. Co-authored-by: xicm <xicm@asiainfo.com>	2022-05-14 07:37:31 -04:00
wqwl611	52e63b39d6	[HUDI-4097] add table info to jobStatus (#5529 ) Co-authored-by: wqwl611 <wqwl611@gmail.com>	2022-05-13 21:01:15 -04:00
Alexey Kudinkin	4a8589f222	[HUDI-4038] Avoid calling `getDataSize` after every record written (#5497 ) - getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-05-11 08:08:31 -04:00
Sivabalan Narayanan	6285a239a3	[HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462 ) - Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen. - Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord. - Other minor fixes around using static values instead of looking up hashmap.	2022-05-09 12:40:22 -04:00
guanziyue	abb4893b25	[HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (#4264 )	2022-05-05 13:49:34 -07:00
Sagar Sumit	1562bb658f	[HUDI-4031] Avoid clustering update handling when no pending replacecommit (#5487 )	2022-05-04 10:17:11 -04:00
xicm	f492c52ee4	[HUDI-3862] Fix default configurations of HoodieHBaseIndexConfig (#5308 ) Co-authored-by: xicm <xicm@asiainfo.com>	2022-04-29 16:21:52 -07:00
LiChuang	4e928a6fe1	[HUDI-3943] Some description fixes for 0.10.1 docs (#5447 )	2022-04-28 15:18:56 -07:00
Danny Chan	e1ccf2e00b	[HUDI-3977] Flink hudi table with date type partition path throws HoodieNotSupportedException (#5432 )	2022-04-27 13:19:55 +08:00
Yuwei XIAO	f2ba0fead2	[HUDI-3085] Improve bulk insert partitioner abstraction (#4441 )	2022-04-25 18:42:17 +08:00
Alexey Kudinkin	c05a4e7b6f	[HUDI-3934] Fix `Spark32HoodieParquetFileFormat` not being compatible w/ Spark 3.2.0 (#5378 ) - Due to the fact that Spark 3.2.1 is non-BWC w/ 3.2.0, we have to handle all these incompatibilities in Spark32HoodieParquetFileFormat. This PR is addressing that. Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-21 21:00:38 -04:00
xiarixiaoyao	037f89ee7c	[HUDI-3921] Fixed schema evolution cannot work with HUDI-3855 (#5376 ) - when column names are renamed (schema evolution enabled), renamed columns weren't handled well while copying records from the old data file with HoodieMergeHandle.	2022-04-21 18:27:54 -04:00
Sagar Sumit	de5fa1fe03	[HUDI-3940] Fix retry count increment in lock manager (#5387 )	2022-04-21 16:52:05 -04:00
Alexey Kudinkin	4b296f79cc	[HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377 )	2022-04-21 01:36:19 -07:00
Sivabalan Narayanan	a9506aa545	[HUDI-3938] Fix default value for num retries to acquire lock (#5380 )	2022-04-21 01:08:43 -07:00
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
Sagar Sumit	4f44e6aeb5	[HUDI-3899] Drop index to delete pending index instants from timeline if applicable (#5342 ) Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-04-18 22:28:46 -04:00
Sagar Sumit	1718bcab84	[HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347 )	2022-04-18 13:34:04 -04:00
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
董可伦	99dd1cb6e6	[HUDI-3835] Add UT for delete in java client (#5270 )	2022-04-15 15:03:48 -04:00
Sivabalan Narayanan	57612c5c32	[HUDI-3848] Fixing restore with cleaned up commits (#5288 )	2022-04-15 14:47:53 -04:00
Y Ethan Guo	bab691692e	[HUDI-3686] Fix inline and async table service check in HoodieWriteConfig (#5307 )	2022-04-13 17:33:26 -04:00
Alexey Kudinkin	7b78dff45f	[HUDI-3855] Fixing `FILENAME_METADATA_FIELD` not being correctly updated in `HoodieMergeHandle` (#5296 ) Fixing FILENAME_METADATA_FIELD not being correctly updated in HoodieMergeHandle, in cases when old-record is carried over from existing file as is. - Revisited HoodieFileWriter API to accept HoodieKey instead of HoodieRecord - Fixed FILENAME_METADATA_FIELD not being overridden in cases when simply old record is carried over - Exposing standard JVM's debugger ports in Docker setup	2022-04-12 20:42:15 -04:00
Alexey Kudinkin	101b82a679	[HUDI-3839] Fixing incorrect selection of MT partitions to be updated (#5274 ) * Fixing incorrect selection of MT partitions to be updated * Ensure that metadata partitions table config is inherited correctly Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-12 13:37:52 +05:30
Sivabalan Narayanan	f91e9e63e1	[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261 )	2022-04-11 21:02:43 -07:00
Sagar Sumit	3d8fc78c66	[HUDI-3844] Update props in indexer based on table config (#5293 )	2022-04-11 18:16:06 -04:00

969 Commits