lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
冯健	382d19e85b	[HUDI-4065] Add FileBasedLockProvider (#6071 )	2022-07-19 07:52:47 +08:00
Alexey Kudinkin	4bda6afe0b	[HUDI-4249] Fixing in-memory `HoodieData` implementation to operate lazily (#5855 )	2022-07-16 18:26:48 -05:00
liujinhui	126b88b48d	[HUDI-2150] Rename/Restructure configs for better modularity (#6061 ) - Move clean related configuration to HoodieCleanConfig - Move Archival related configuration to HoodieArchivalConfig - hoodie.compaction.payload.class move this to HoodiePayloadConfig	2022-07-09 20:00:48 +05:30
xiarixiaoyao	b686c07407	[HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields (#6017 ) * [HUDI-4276] Reconcile schema-inject null values for missing fields and add new fields. * fix comments Co-authored-by: public (bdcee5037027) <mengtao0326@qq.com>	2022-07-09 03:08:38 +08:00
xi chaomin	23c9c5c296	[HUDI-3836] Improve the way of fetching metadata partitions from table (#5286 ) Co-authored-by: xicm <xicm@asiainfo.com>	2022-07-05 07:50:17 -07:00
Alexey Kudinkin	ed823f1c6f	[HUDI-4320] Make sure `HoodieStorageConfig.PARQUET_WRITE_LEGACY_FORMAT_ENABLED` could be specified by the writer (#5970 ) Fixed sequence determining whether Parquet's legacy-format writing property should be overridden to only kick in when it has not been explicitly specified by the caller	2022-06-28 12:27:32 -07:00
Sagar Sumit	8846849a03	[HUDI-4291] Fix flaky TestCleanPlanExecutor#testKeepLatestFileVersions (#5930 )	2022-06-27 17:27:16 +05:30
cxzl25	72fa19bcc9	[HUDI-4316] Support for spillable diskmap configuration when constructing HoodieMergedLogRecordScanner (#5959 )	2022-06-27 11:09:30 +08:00
Alexey Kudinkin	c86edfc28e	[HUDI-4319] Fixed Parquet's `PLAIN_DICTIONARY` encoding not being applied when bulk-inserting (#5966 ) * Fixed Dictionary encoding config not being properly propagated to Parquet writer (making it unable to apply it, substantially bloating the storage footprint)	2022-06-24 23:52:28 -04:00
xi chaomin	30ebdc708b	[HUDI-3735] TestHoodieSparkMergeOnReadTableRollback is flaky (#5874 )	2022-06-24 02:47:36 -04:00
Zhaojing Yu	c7e430bb46	Revert master (#5925 ) * Revert "udate" This reverts commit `092e35c1e3`. * Revert "[HUDI-3475] Initialize hudi table management module." This reverts commit `4640a3bbb8`.	2022-06-21 16:58:50 +08:00
喻兆靖	4640a3bbb8	[HUDI-3475] Initialize hudi table management module.	2022-06-21 15:21:30 +08:00
RexAn	17ac5a4573	[HUDI-4173] Fix wrong results if the user read no base files hudi table by glob paths (#5723 )	2022-06-20 23:02:34 +05:30
huberylee	d4f0326b4b	[HUDI-4275] Refactor rollback inflight instant for clustering/compaction to reuse some code (#5894 )	2022-06-20 14:29:21 +08:00
Alexey Kudinkin	35afdb4316	[HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737 ) There are multiple issues with our current DataSource V2 integrations: b/c we advertise Hudi tables as V2, Spark expects it to implement certain APIs which are not implemented at the moment, instead we're using custom Resolution rule (in HoodieSpark3Analysis) to instead manually fallback to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.	2022-06-07 16:30:46 -07:00
Alexey Kudinkin	4f7ea8c79a	[HUDI-4176] Fixing `TableSchemaResolver` to avoid repeated `HoodieCommitMetadata` parsing (#5733 ) As has been outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1Tb) having pretty fat commits where Hudi's commit metadata could reach into 100s of Mbs. Given the size some of ours commit metadata instances Spark's parsing and resolving phase (when spark.sql(...) is involved, but before returned Dataset is dereferenced) starts to dominate some of our queries' execution time. - Rebased onto new APIs to avoid excessive Hadoop's Path allocations - Eliminated hasOperationField completely to avoid repeatitive computations - Cleaning up duplication in HoodieActiveTimeline - Added caching for common instances of HoodieCommitMetadata - Made tableStructSchema lazy;	2022-06-06 13:14:26 -04:00
KnightChess	51602a34f7	[HUDI-4179] Cluster with sort cloumns invalid (#5739 )	2022-06-02 20:28:21 +08:00
Danny Chan	7f8630cc57	[HUDI-4167] Remove the timeline refresh with initializing hoodie table (#5716 ) The timeline refresh on table initialization invokes the fs view #sync, which has two actions now: 1. reload the timeline of the fs view, so that the next fs view request is based on this timeline metadata 2. if this is a local fs view, clear all the local states; if this is a remote fs view, send request to sync the remote fs view But, let's see the construction, the meta client is instantiated freshly so the timeline is already the latest, the table is also constructed freshly, so the fs view has no local states, that means, the #sync is unnecessary totally. In this patch, the metadata lifecycle and data set fs view are kept in sync, when the fs view is refreshed, the underneath metadata is also refreshed synchronouly. The freshness of the metadata follows the same rules as data fs view: 1. if the fs view is local, the visibility is based on the client table metadata client's latest commit 2. if the fs view is remote, the timeline server would #sync the fs view and metadata together based on the lagging server local timeline From the perspective of client, no need to care about the refresh action anymore no matter whether the metadata table is enabled or not. That make the client logic more clear and less error-prone. Removes the timeline refresh has another benefit: if avoids unncecessary #refresh of the remote fs view, if all the clients send request to #sync the remote fs view, the server would encounter conflicts and the client encounters a response error.	2022-06-02 09:48:48 +08:00
RexAn	98c5c6c654	[HUDI-4040] Bulk insert Support CustomColumnsSortPartitioner with Row (#5502 ) * Along the lines of RDDCustomColumnsSortPartitioner but for Row	2022-05-26 10:39:04 +05:30
Heap	47b764ec33	[HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655 )	2022-05-23 15:28:48 -07:00
Danny Chan	ebbe56e862	[minor] Some code refactoring for LogFileComparator and Instant instantiation (#5600 )	2022-05-18 09:30:09 +08:00
Shawy Geng	ad773b3d96	[HUDI-3654] Preparations for hudi metastore. (#5572 ) * [HUDI-3654] Preparations for hudi metastore. Co-authored-by: gengxiaoyu <gengxiaoyu@bytedance.com>	2022-05-17 09:47:10 +08:00
Yuwei XIAO	61030d8e7a	[HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480 ) 1. basic write path(insert/upsert) implementation 2. adapt simple bucket index	2022-05-16 11:07:01 +08:00
xi chaomin	6e16e719cd	[HUDI-3980] Suport kerberos hbase index (#5464 ) - Add configurations in HoodieHBaseIndexConfig.java to support kerberos hbase connection. Co-authored-by: xicm <xicm@asiainfo.com>	2022-05-14 07:37:31 -04:00
wqwl611	52e63b39d6	[HUDI-4097] add table info to jobStatus (#5529 ) Co-authored-by: wqwl611 <wqwl611@gmail.com>	2022-05-13 21:01:15 -04:00
Alexey Kudinkin	4a8589f222	[HUDI-4038] Avoid calling `getDataSize` after every record written (#5497 ) - getDataSize has non-trivial overhead in the current ParquetWriter impl, requiring traversal of already composed Column Groups in memory. Instead we can sample these calls to getDataSize to amortize its cost. Co-authored-by: sivabalan <n.siva.b@gmail.com>	2022-05-11 08:08:31 -04:00
Sivabalan Narayanan	6285a239a3	[HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462 ) - Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen. - Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord. - Other minor fixes around using static values instead of looking up hashmap.	2022-05-09 12:40:22 -04:00
guanziyue	abb4893b25	[HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor exit gracefully (#4264 )	2022-05-05 13:49:34 -07:00
Sagar Sumit	1562bb658f	[HUDI-4031] Avoid clustering update handling when no pending replacecommit (#5487 )	2022-05-04 10:17:11 -04:00
Yuwei XIAO	f2ba0fead2	[HUDI-3085] Improve bulk insert partitioner abstraction (#4441 )	2022-04-25 18:42:17 +08:00
Alexey Kudinkin	c05a4e7b6f	[HUDI-3934] Fix `Spark32HoodieParquetFileFormat` not being compatible w/ Spark 3.2.0 (#5378 ) - Due to the fact that Spark 3.2.1 is non-BWC w/ 3.2.0, we have to handle all these incompatibilities in Spark32HoodieParquetFileFormat. This PR is addressing that. Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-21 21:00:38 -04:00
Alexey Kudinkin	f7544e23ac	[HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364 ) - Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row - Amended `SparkAdapter`s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not - Fallback to append partition values in cases when the source columns are not persisted in data-file - Fixing HoodieBaseRelation incorrectly handling mandatory columns	2022-04-20 19:30:27 +08:00
Sagar Sumit	1718bcab84	[HUDI-3707] Fix target schema handling in HoodieSparkUtils while creating RDD (#5347 )	2022-04-18 13:34:04 -04:00
董可伦	b8e465fdfc	[MINOR] Fix typos in log4j-surefire.properties (#5212 )	2022-04-15 13:33:37 -07:00
Sivabalan Narayanan	57612c5c32	[HUDI-3848] Fixing restore with cleaned up commits (#5288 )	2022-04-15 14:47:53 -04:00
Alexey Kudinkin	7b78dff45f	[HUDI-3855] Fixing `FILENAME_METADATA_FIELD` not being correctly updated in `HoodieMergeHandle` (#5296 ) Fixing FILENAME_METADATA_FIELD not being correctly updated in HoodieMergeHandle, in cases when old-record is carried over from existing file as is. - Revisited HoodieFileWriter API to accept HoodieKey instead of HoodieRecord - Fixed FILENAME_METADATA_FIELD not being overridden in cases when simply old record is carried over - Exposing standard JVM's debugger ports in Docker setup	2022-04-12 20:42:15 -04:00
Sivabalan Narayanan	f91e9e63e1	[HUDI-3799] Fixing not deleting empty instants w/o archiving (#5261 )	2022-04-11 21:02:43 -07:00
Y Ethan Guo	3e97c88c4f	[HUDI-3807] Add a new config to control the use of metadata index in HoodieBloomIndex (#5268 )	2022-04-09 15:30:11 -04:00
Alexey Kudinkin	81b25c543a	[HUDI-3825] Fixing Column Stats Index updating sequence (#5267 )	2022-04-08 23:14:08 -07:00
Alexey Kudinkin	d7cc767dbc	[HUDI-3825] Fixing non-partitioned table Partition Records persistence in MT (#5259 ) * Filter out empty string (for non-partitioned table) being added to "__all_partitions__" record * Instead of filtering, transform empty partition-id to `NON_PARTITIONED_NAME` * Cleaned up `HoodieBackedTableMetadataWriter` * Make sure REPLACE_COMMITS are handled as well	2022-04-08 15:58:31 +05:30
Y Ethan Guo	9d744bb35c	[HUDI-3805] Delete existing corrupted requested rollback plan during rollback (#5245 )	2022-04-07 15:32:34 +05:30
Alexey Kudinkin	9e87d164b3	[HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208 ) - Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only Column Stats - Index records pertaining to the columns being queried by, instead of reading out whole Index. - Fixed usages of HFileScanner in HFileReader. few code paths uses cached scanner if available. Other code paths uses its own HFileScanner w/ positional read. Brief change log - Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading t/h Spark DS - Adding methods enabling key-prefix lookups to HoodiFileReader, HoodieHFileReader - Wiring key-prefix lookup t/h LogRecordScanner impls - Cleaning up HoodieHFileReader impl Co-authored-by: sivabalan <n.siva.b@gmail.com> Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-06 12:11:08 -04:00
Raymond Xu	e96f08f355	Moving to 0.12.0-SNAPSHOT on master branch.	2022-04-06 15:24:10 +08:00
Sagar Sumit	898be6174a	[HUDI-3782] Fixing table config when any of the index is disabled (#5222 )	2022-04-05 23:06:52 -04:00
Prashant Wason	b28f0d6ceb	[HUDI-3290] Different file formats for the partition metadata file. (#5179 ) * [HUDI-3290] Different file formats for the partition metadata file. Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition. Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset. For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting. * Code review comments - Adding a new command to migrate from text to base file formats for meta file. - Reimplementing readFromFS() to first read the text format, then base format - Avoid extra exists() checks in readFromFS() - Added unit tests, enabled parquet format across hoodie-hadoop-mr - Code cleanup, restructuring, naming consistency. * Wiring in all the other Spark code paths to respect this config - Turned on parquet meta format for COW data source tests - Removed the deltastreamer command line to keep it shorter * populate HoodiePartitionMetadata#format after readFromFS() Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>	2022-04-04 08:08:20 -07:00
Alexey Kudinkin	cc3737be50	[HUDI-3664] Fixing Column Stats Index composition (#5181 ) Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>	2022-04-02 17:15:52 -07:00
Sagar Sumit	74eb09be9b	[HUDI-3776] Fix BloomIndex incorrectly using ColStats to lookup records locations (#5213 )	2022-04-02 18:22:57 -04:00
YueZhang	020786a5f9	[HUDI-3451] Delete metadata table when the write client disables MDT (#5186 ) * Add checks for metadata table init to avoid possible out-of-sync * Revise the logic to reuse existing table config * Revise docs and naming Co-authored-by: yuezhang <yuezhang@freewheel.tv> Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>	2022-04-02 16:31:06 +05:30
xiarixiaoyao	444ff496a4	[RFC-33] [HUDI-2429][Stacked on HUDI-2560] Support full Schema evolution for Spark (#4910 ) * [HUDI-2560] introduce id_based schema to support full schema evolution. * add test for FileBasedInternalSchemaStorageManger and rebase code * add support for change column type and fix some test case * fix some bugs encountered in the production env and delete useless code * fix test error * rebase code * fixed some nested schema change bugs * [HUDI-2429][Stacked On HUDI-2560]Support full schema evolution for spark * [use dummyInternalSchema instead of null] * add support for spark3.1.x * remove support for spark3.1.x , sicne some compile fail * support spark3.1.x * rebase and prepare solve all comments * address all comments * rebase code * fixed the count() bug try to get internalSchema by parser commit file/history file directly, not use metaclient which is time cost address some comments * fixed all comments * fix new comments * rebase code,fix UT failed * fixed mistake * rebase code ,fixed new comments * rebase code , and prepare for address new comments * address commits * address new comments * fix new issues * control fallback original write logical	2022-04-01 13:20:24 -07:00
ForwardXu	98b4e9796e	[HUDI-3406] Rollback incorrectly relying on FS listing instead of Com… (#4957 ) * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * [HUDI-3406] Rollback incorrectly relying on FS listing instead of Commit Metadata * fix comments * fix comments * fix comments	2022-04-01 10:01:41 +08:00

1 2 3 4 5 ...

394 Commits