lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Ankush Kanungo	f5e31be086	[HUDI-1685] keep updating current date for every batch (#2671 )	2021-03-12 15:53:01 -08:00
Sivabalan Narayanan	5cf2f2618b	[HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (#2577 )	2021-03-07 16:40:40 -05:00
pengzhiwei	bc883db5de	[HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596 )	2021-03-05 14:10:27 +08:00
Raymond Xu	f53bca404f	[HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621 ) - Add a config to allow parsing custom date format in `DatePartitionPathSelector`. Currently it assumes date partition string in the format of `yyyy-MM-dd`. - Fix a bug where `UnsupportedOperationException` was thrown when sort `eligibleFiles` in-place. Changed to sort it and store in a new list.	2021-03-04 21:01:51 -08:00
t0il3ts0ap	4fa43359cb	[MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (#2617 )	2021-03-03 09:49:18 +08:00
liujinhui	617cc24ad1	[HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227 ) Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>	2021-02-25 07:08:13 -05:00
n3nash	ffcfb58bac	[HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359 ) 1. Refactor rollback and move cleaning failed commits logic into cleaner 2. Introduce hoodie heartbeat to ascertain failed commits 3. Fix test cases	2021-02-19 20:12:22 -08:00
Sivabalan Narayanan	c9fcf964b2	[HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534 )	2021-02-20 09:54:26 +08:00
Sivabalan Narayanan	d5f202821b	Adding fixes to test suite framework. Adding clustering node and validate async operations node. (#2400 )	2021-02-12 09:29:21 -08:00
Volodymyr Burenin	17802569fd	[HUDI-1538] Try to init class trying different signatures instead of checking its name (#2476 ) * [HUDI-1538] Try to init class trying different signatures instead of checking its name. * Removed unused imports Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>	2021-02-03 12:29:08 -08:00
wangxianghu	23f2ef3efb	[HUDI-623] Remove UpgradePayloadFromUberToApache (#2455 )	2021-01-28 17:48:50 -08:00
vinoth chandar	5ca0625b27	[HUDI 1308] Harden RFC-15 Implementation based on production testing (#2441 ) Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation. * Pass a single instance of HoodieTableMetadata everywhere * Fix tests and add config for enabling metrics - Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths() - Consequently, IOException is never thrown and many files had to be adjusted - More diligent handling of open file handles in metadata table - Added config for controlling reuse of connections - Added config for turning off fallback to listing, so we can see tests fail - Changed all ipf listing code to cache/amortize the open/close for better performance - Timelineserver also reuses connections, for better performance - Without timelineserver, when metadata table is opened from executors, reuse is not allowed - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument. - Fix TestHoodieBackedTableMetadata#testSync	2021-01-19 21:20:28 -08:00
Udit Mehrotra	7ce3ac778e	[HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths (#2417 ) * [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths * Adding testClass for FileSystemBackedTableMetadata Co-authored-by: Nishith Agarwal <nagarwal@uber.com>	2021-01-10 21:19:52 -08:00
lw0090	368c1a8f5c	[HUDI-1399] support a independent clustering spark job to asynchronously clustering (#2379 ) * [HUDI-1481] add structured streaming and delta streamer clustering unit test * [HUDI-1399] support a independent clustering spark job to asynchronously clustering * [HUDI-1399] support a independent clustering spark job to asynchronously clustering * [HUDI-1498] Read clustering plan from requested file for inflight instant (#2389) * [HUDI-1399] support a independent clustering spark job with schedule generate instant time Co-authored-by: satishkotha <satishkotha@uber.com>	2021-01-09 17:30:16 -08:00
puyvqi	1a836f9a84	[HUDI-1514] Avoid raw type use for parameter of Transformer interface (#2420 )	2021-01-09 18:36:13 +08:00
Udit Mehrotra	17df517b81	[HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common (#2410 )	2021-01-07 11:34:06 -08:00
Udit Mehrotra	4e64226844	[HUDI-1450] Use metadata table for listing in HoodieROTablePathFilter (apache#2326) [HUDI-1394] [RFC-15] Use metadata table (if present) to get all partition paths (apache#2351)	2021-01-04 07:59:47 -08:00
lw0090	9e6889a8ce	[HUDI-1481] add structured streaming and delta streamer clustering unit test (#2360 )	2020-12-27 20:27:09 -08:00
Sivabalan Narayanan	8cf6a7223f	[HUDI-1331] Adding support for validating entire dataset and long running tests in test suite framework (#2168 ) * trigger rebuild * [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (#1927) * Adding support for validating records and long running tests in test suite framework * Adding partial validate node * Fixing spark session initiation in Validate nodes * Fixing validation * Adding hive table validation to ValidateDatasetNode * Rebasing with latest commits from master * Addressing feedback * Addressing comments Co-authored-by: lamber-ken <lamberken@163.com> Co-authored-by: linshan-ma <mabin194046@163.com>	2020-12-26 09:29:24 -08:00
Sivabalan Narayanan	33d338f392	[HUDI-115] Adding DefaultHoodieRecordPayload to honor ordering with combineAndGetUpdateValue (#2311 ) * Added ability to pass in `properties` to payload methods, so they can perform table/record specific merges * Added default methods so existing payload classes are backwards compatible. * Adding DefaultHoodiePayload to honor ordering while merging two records * Fixing default payload based on feedback	2020-12-19 19:19:42 -08:00
Bhavani Sudha Saktheeswaran	14d5d1100c	[HUDI-1406] Add date partition based source input selector for Delta streamer (#2264 ) - Adds ability to list only recent date based partitions from source data. - Parallelizes listing for faster tailing of DFSSources	2020-12-17 03:59:30 -08:00
Raymond Xu	bd9cceccb5	[HUDI-1395] Fix partition path using FSUtils (#2312 ) Fixed the logic to get partition path in Copier and Exporter utilities.	2020-12-10 10:19:19 +08:00
wenningd	fce1453fa6	[HUDI-1040] Make Hudi support Spark 3 (#2208 ) * Fix flaky MOR unit test * Update Spark APIs to make it be compatible with both spark2 & spark3 * Refactor bulk insert v2 part to make Hudi be able to compile with Spark3 * Add spark3 profile to handle fasterxml & spark version * Create hudi-spark-common module & refactor hudi-spark related modules Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-12-09 15:52:23 -08:00
liujinhui	62b392b49c	[HUDI-1343] Add standard schema postprocessor which would rewrite the schema using spark-avro conversion (#2192 ) Co-authored-by: liujh <liujh@t3go.cn>	2020-12-03 19:28:34 -08:00
wangxianghu	a23230c8c2	[HUDI-1400] Replace Operation enum with WriteOperationType (#2259 )	2020-11-19 13:40:04 +08:00
Shen Hong	49407169ac	[HUDI-1209] Properties File must be optional when running deltastreamer (#2085 )	2020-10-21 17:49:28 -07:00
Ho Tien Vu	af5ef4d49d	[HUDI-1330] handle prefix filtering at directory level (#2157 ) The current DFSPathSelector only ignore prefix(_, .) at the file level while files under subdirectories e.g. (.checkpoint/*) are still considered which result in bad-format exception during reading.	2020-10-20 23:20:19 -07:00
Ho Tien Vu	fd269ddeb0	[MINOR] Make sure factory method is used to instantiate DFSPathSelector (#2187 ) * Move createSourceSelector into DFSPathSelector factory method * Replace constructor call with factory method * Added some javadoc	2020-10-20 17:52:31 +08:00
lw0090	2126f13e13	[HUDI-791] Replace null by Option in Delta Streamer (#2171 )	2020-10-11 18:29:57 -07:00
Prashant Wason	788d236c44	[HUDI-1303] Some improvements for the HUDI Test Suite. (#2128 ) 1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs. 2. Fix CleanNode constructor which is not correctly implemented 3. When generating upserts, allows more granular control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts. 4. Fixed generation of records of specific size - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size. - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map. 5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp). 6. Making --workload-generator-classname an optional parameter as most probably the default will be used	2020-10-07 08:33:51 -04:00
Pratyaksh Sharma	524193eb4b	[HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566 ) Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-10-06 20:34:03 -07:00
Shen Hong	b335459c80	[HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088 )	2020-10-04 11:34:21 -07:00
Mathieu	1f7add9291	[HUDI-1089] Refactor hudi-client to support multi-engine (#1827 ) - This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules - Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc - Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common` - Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies - To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-10-01 14:25:29 -07:00
liujinhui	20b9b399c9	[HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators (#2074 )	2020-09-29 13:44:31 +08:00
Alexander Filipchik	c8e19e2def	[HUDI-801] Adding a way to post process schema after it is fetched (#1524 ) * [HUDI-801] Adding a way to post process schema after it is fetched Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com> Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-09-19 11:18:36 -07:00
shenh062326	581d54097c	[HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long	2020-09-15 20:58:29 -07:00
liujinhui	6c84ef20ac	[HUDI-1282] Check whether the topic exists before deltastreamer consumes Kafka (#2090 )	2020-09-16 10:43:52 +08:00
Abhishek Modi	53d1e55110	Test Suite should work with Docker + Unit Tests	2020-09-08 22:41:14 -07:00
wenningd	2fee087f0f	[HUDI-1181] Fix decimal type display issue for record key field (#1953 ) * [HUDI-1181] Fix decimal type display issue for record key field * Remove getNestedFieldVal method from DataSourceUtils * resolve comments Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-09-08 17:50:54 -07:00
Dongwook	8d19ebfd0f	[HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703 ) For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases. This patch makes the following fixes in this regard. - Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism" - Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"	2020-09-01 12:55:31 -04:00
Satish Kotha	ea983ff912	[HUDI-1137] Add option to configure different path selector	2020-08-24 13:26:44 -07:00
Mathieu	b883b6d268	[HUDI-1122] Introduce a kafka implementation of hoodie write commit ca… (#1886 )	2020-08-20 23:00:59 +08:00
vinoth chandar	9bde6d616c	[HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965 ) - Maturity levels one of : evolving, stable, deprecated - Took a pass and marked out most of the existing public API	2020-08-13 23:28:17 -07:00
Udit Mehrotra	e4a2d98f79	[HUDI-426] Bootstrap datasource integration (#1702 )	2020-08-09 14:06:13 -07:00
wenningd	9fe2d2b14a	[HUDI-427] [HUDI-971] Implement CLI support for performing bootstrap (#1869 ) * [HUDI-971] Clean partitions & fileIds returned by HFileBootstrapIndex * [HUDI-427] Implement CLI support for performing bootstrap Co-authored-by: Wenning Ding <wenningd@amazon.com> Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>	2020-08-08 12:37:29 -07:00
Mathieu	b51646dcc7	[HUDI-1151] Fix NPE when no new data in kafka using HoodieDeltaStreamer (#1921 )	2020-08-07 00:03:20 +08:00
lw0090	51ea27d665	[HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810 ) - Generalize the hive-sync module for syncing to multiple metastores - Added new options for datasource - Added new command line for delta streamer Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-08-05 21:34:55 -07:00
Balaji Varadarajan	7a2429f5ba	[HUDI-575] Spark Streaming with async compaction support (#1752 )	2020-08-05 07:50:15 -07:00
Sreeram Ramji	217a84192c	[HUDI-1140] Fix Jcommander issue for --hoodie-conf in DeltaStreamer (#1898 )	2020-08-04 21:42:51 -07:00
Sivabalan Narayanan	ab11ba43e1	[REVERT] "[HUDI-1058] Make delete marker configurable (#1819 )" (#1914 ) This reverts commit `433d7d2c98`.	2020-08-04 15:20:38 -07:00


198 Commits