lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Prashant Wason	8eed440694	[HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. (#3210 ) * [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. If the rolled-back instant was synced to the Metadata Table, a corresponding deltacommit with the same timestamp should have been created on the Metadata Table timeline. To ensure we can always perform this check, the Metadata Table instants should not be archived until their corresponding instants are present in the dataset timeline. But ensuring this requires a large number of instants to be kept on the metadata table. In this change, the metadata table will keep at least the number of instants that the main dataset is keeping. If the instant being rolled back was before the metadata table timeline, the code will throw an exception and the metadata table will have to be re-bootstrapped. This should be a very rare occurrence and should occur only when the dataset is being repaired by rolling back multiple commits or restoring to a much older time. * Fixed checkstyle * Improvements from review comments. Fixed checkstyle Replaced explicit null check with Option.ofNullable Removed redundant function getSynedInstantTime * Renamed getSyncedInstantTime and getSyncedInstantTimeForReader. Sync is confusing so renamed to getUpdateTime() and getReaderTime(). * Removed getReaderTime which is only for testing as the same method can be accessed during testing differently without making it part of the public interface. * Fix compilation error * Reverting changes to HoodieMetadataFileSystemView Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-13 21:23:34 -07:00
Sagar Sumit	0544d70d8f	[MINOR] Deprecate older configs (#3464 ) Rename and deprecate props in HoodieWriteConfig Rename and deprecate older props	2021-08-12 20:31:04 -07:00
Prashant Wason	76bc686a77	[HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427 ) * [HUDI-1292] Created a config to enable/disable syncing of metadata table. - Metadata Table should only be synced from a single pipeline to prevent conflicts. - Skip syncing metadata table for clustering and compaction - Renamed useFileListingMetadata Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-08-12 15:45:57 -07:00
Prashant Wason	b3e430f24b	[HUDI-2017] Add API to set a metric in the registry. (#3084 ) The Registry.add() API adds the new value to the existing metric value. For some use cases we need an API to set/replace the existing value. The Metadata Table is synced in the preWrite() and postWrite() functions of commit. As part of the sync, the current sizes and basefile/logfile counts are published as metrics. If we use the Registry.add() API, the counts and sizes are incorrectly published as the sum of the two values. This is corrected by using the Registry.set() API instead.	2021-08-11 16:47:16 -07:00
Y Ethan Guo	4783176554	[HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233 ) - Can be enabled for cloud stores like S3. Not supported for hdfs yet, due to partial write failures.	2021-08-11 11:48:13 -04:00
Sivabalan Narayanan	c9fa3cffaf	[HUDI-1774] Adding support for delete_partitions to spark data source (#3437 )	2021-08-11 01:03:01 -04:00
swuferhong	5448cdde7e	[HUDI-2170] [HUDI-1763] Always choose the latest record for HoodieRecordPayload (#3401 )	2021-08-11 10:20:55 +08:00
zhangyue19921010	b4441abcf7	[HUDI-2194] Skip the latest N partitions when choosing partitions to create ClusteringPlan (#3300 ) * skip the latest partitions based on hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions (default 0 means skip nothing) * change config version * add ut Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-08-09 10:10:15 -07:00
Sagar Sumit	70b6bd485f	[HUDI-1468] Support custom clustering strategies and preserve commit metadata as part of clustering (#3419 ) Co-authored-by: Satish Kotha <satishkotha@uber.com>	2021-08-06 22:53:08 -04:00
wenningd	91bb0d1318	[HUDI-2255] Refactor Datasource options (#3373 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-08-03 17:50:30 -07:00
satishkotha	826a04d142	[HUDI-2072] Add pre-commit validator framework (#3153 ) * [HUDI-2072] Add pre-commit validator framework * trigger Travis rebuild	2021-08-03 12:07:45 -07:00
Sivabalan Narayanan	fe508376fa	[HUDI-2177][HUDI-2200] Adding virtual keys support for MOR table (#3315 )	2021-08-02 09:45:09 -04:00
Gary Li	6353fc865f	[HUDI-2218] Fix missing HoodieWriteStat in HoodieCreateHandle (#3341 )	2021-07-30 02:36:57 -07:00
Shawy Geng	44e41dc9bb	[HUDI-2117] Unpersist the input rdd after the commit is completed to … (#3207 ) Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-07-29 08:16:58 -07:00
rmahindra123	8fef50e237	[HUDI-2044] Integrate consumers with rocksDB and compression within External Spillable Map (#3318 )	2021-07-28 01:31:03 -04:00
Sivabalan Narayanan	61148c1c43	[HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306 )	2021-07-26 17:21:04 -04:00
xiarixiaoyao	5353243449	[HUDI-2214] Residual temporary files after clustering are not cleaned up (#3335 )	2021-07-26 10:26:20 -07:00
Gary Li	a5638b995b	[MINOR] Close log scanner after compaction completed (#3294 )	2021-07-26 17:39:13 +08:00
Sivabalan Narayanan	d5026e9a24	[HUDI-2161] Adding support to disable meta columns with bulk insert operation (#3247 )	2021-07-19 20:43:48 -04:00
Jintao Guan	2debb9b3ed	[HUDI-1828] Update unit tests to support ORC as the base file format (#3237 )	2021-07-15 00:05:42 +08:00
zhangyue19921010	c8a2033c27	[HUDI-2144] Bug-Fix: Offline clustering (HoodieClusteringJob) will cause insert action losing data (#3240 ) * fixed * add testUpsertPartitionerWithSmallFileHandlingAndClusteringPlan ut * fix CheckStyle Co-authored-by: yuezhang <yuezhang@freewheel.tv>	2021-07-12 18:14:17 -07:00
Sagar Sumit	5804ad8e32	[HUDI-1483] Support async clustering for deltastreamer and Spark streaming (#3142 ) - Integrate async clustering service with HoodieDeltaStreamer and HoodieStreamingSink - Added methods in HoodieAsyncService to reuse code	2021-07-11 14:43:38 -04:00
Yungthuis	1d3cd06572	[HUDI-2134] Add generics to avoid forced conversion in BaseSparkCommitActionExecutor#partition (#3232 )	2021-07-08 13:31:38 +08:00
Sivabalan Narayanan	16e90d30ea	[HUDI-1105] Adding dedup support for Bulk Insert w/ Rows (#2206 )	2021-07-07 17:38:26 -04:00
Sivabalan Narayanan	ea9e5d0e8b	[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (#3149 )	2021-07-07 11:15:25 -04:00
wangxianghu	650c4455c6	[HUDI-2122] Improvement in packaging insert into smallfiles (#3213 )	2021-07-05 09:30:57 -07:00
wangxianghu	62a1ad8b3a	[HUDI-1930] Bootstrap support configure KeyGenerator by type (#3170 ) * [HUDI-1930] Bootstrap support configure KeyGenerator by type	2021-07-03 20:27:37 +08:00
pengzhiwei	6eca06d074	[HUDI-2105] Compaction Failed For MergeInto MOR Table (#3190 )	2021-07-01 23:40:14 +08:00
wenningd	d412fb2fe6	[HUDI-89] Add configOption & refactor all configs based on that (#2833 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2021-06-30 14:26:30 -07:00
Prashant Wason	11e64b2db0	[HUDI-1717] Metadata Reader should merge all the un-synced but complete instants from the dataset timeline. (#3082 )	2021-06-22 23:52:18 +08:00
Wei	7865da1e15	[MINOR] Fix Javadoc wrong references (#3115 )	2021-06-18 21:51:54 -07:00
wangxianghu	7261f08507	[HUDI-1929] Support configure KeyGenerator by type (#2993 )	2021-06-08 09:26:10 -04:00
pengzhiwei	f760ec543e	[HUDI-1659] Basic Implement Of Spark Sql Support For Hoodie (#2645 ) Main functions: Support CREATE TABLE for hoodie. Support CTAS. Support INSERT for hoodie, including dynamic partition and static partition insert. Support MergeInto for hoodie. Support DELETE. Support UPDATE. Both spark2 & spark3 are supported, based on DataSourceV1. Main changes: Add a SQL parser for spark2. Add HoodieAnalysis for SQL resolution and logical plan rewrite. Add command implementations for CREATE TABLE, INSERT, MERGE INTO & CTAS. In order to push down the update & insert logic to the HoodieRecordPayload for MergeInto, I made some changes to the HoodieWriteHandler and other related classes. 1. Add the inputSchema for parsing the incoming record. This is because the inputSchema for MergeInto is different from the writeSchema, as there are some transforms in the update & insert expressions. 2. Add WRITE_SCHEMA to HoodieWriteConfig to pass the write schema for merge into. 3. Pass properties to HoodieRecordPayload#getInsertValue to pass the insert expression and table schema. Verify this pull request: Add TestCreateTable to test creating hoodie tables and CTAS. Add TestInsertTable to test inserts into hoodie tables. Add TestMergeIntoTable to test merges into hoodie tables. Add TestUpdateTable to test updates of hoodie tables. Add TestDeleteTable to test deletes from hoodie tables. Add TestSqlStatement to test the currently supported DDL/DML.	2021-06-07 23:24:32 -07:00
rmpifer	0709c62a6b	[HUDI-1800] Exclude file slices in pending compaction when performing small file sizing (#2902 ) Co-authored-by: Ryan Pifer <ryanpife@amazon.com>	2021-05-29 08:06:01 -04:00
Y Ethan Guo	a96034d38d	[HUDI-1888] Fix NPE when the nested partition path field has null value (#2957 )	2021-05-21 08:28:11 -04:00
wangxianghu	ced068e1ee	[MINOR] Remove unused method in BaseSparkCommitActionExecutor (#2965 )	2021-05-20 10:18:07 +08:00
TeRS-K	be9db2c4f5	[HUDI-1055] Remove hardcoded parquet in tests (#2740 ) * Remove hardcoded parquet in tests * Use DataFileUtils.getInstance * Renaming DataFileUtils to BaseFileUtils Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2021-05-11 10:01:45 -07:00
satishkotha	2999586509	[HUDI-1690] use jsc union instead of rdd union (#2872 )	2021-04-26 23:35:01 -07:00
Danny Chan	d047e91d86	[HUDI-1837] Add optional instant range to log record scanner for log (#2870 )	2021-04-26 16:53:18 +08:00
Aditya Tiwari	ec2334ceac	[HUDI-1716]: Resolving default values for schema from dataframe (#2765 ) - Adding default values and setting null as first entry in UNION data types in avro schema. Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>	2021-04-19 10:05:20 -04:00
hj2016	1da16dfd2e	[HUDI-1784] Added print detailed stack log when hbase connection error (#2799 )	2021-04-12 13:46:06 +08:00
pengzhiwei	684622c7c9	[HUDI-1591] Implement Spark's FileIndex for Hudi to support queries via Hudi DataSource using non-globbed table path and partition pruning (#2651 )	2021-04-01 11:12:28 -07:00
Jintao Guan	1277c62398	[HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (#2627 ) * [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator * update NonpartitionedKeyGenerator to support composite record keys * update NonpartitionedKeyGenerator	2021-03-18 15:33:31 -07:00
n3nash	74241947c1	[HUDI-845] Added locking capability to allow multiple writers (#2374 ) * [HUDI-845] Added locking capability to allow multiple writers 1. Added LockProvider API for pluggable lock methodologies 2. Added Resolution Strategy API to allow for pluggable conflict resolution 3. Added TableService client API to schedule table services 4. Added Transaction Manager for wrapping actions within transactions	2021-03-16 16:43:53 -07:00
hj2016	77ba561a6b	[HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (#2188 ) Co-authored-by: huangjing <huangjing@clinbrain.com> Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>	2021-02-23 20:56:58 -05:00
n3nash	ffcfb58bac	[HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359 ) 1. Refactor rollback and move cleaning failed commits logic into cleaner 2. Introduce hoodie heartbeat to ascertain failed commits 3. Fix test cases	2021-02-19 20:12:22 -08:00
Sivabalan Narayanan	c9fcf964b2	[HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534 )	2021-02-20 09:54:26 +08:00
Karl_Wang	9431aabfab	[HUDI-1381] Schedule compaction based on time elapsed (#2260 ) - introduce configs to control how compaction is triggered - Compaction can be triggered using time, number of delta commits and/or combinations - Default behaviour remains the same.	2021-02-17 07:44:53 -08:00
pengzhiwei	37972071ff	[HUDI-1109] Support Spark Structured Streaming read from Hudi table (#2485 )	2021-02-17 03:36:29 -08:00
Danny Chan	4c5b6923cc	[HUDI-1557] Make Flink write pipeline write task scalable (#2506 ) This is step 2 of RFC-24: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal This PR introduces a BucketAssigner that assigns a bucket ID (partition path & fileID) to each stream record. There is no longer any need to look up the index and partition these records later in the pipeline: the write target location is decided before the write, and each record's location is computed when the BucketAssigner receives it, so the indexing is done in a streaming style. Computing locations for a whole batch of records at once is resource-consuming and puts pressure on the engine, so we should avoid it in a streaming system.	2021-02-06 22:03:52 +08:00

87 Commits