1
0
Commit Graph

232 Commits

Author SHA1 Message Date
Nick Young
f4e3b94971 [HUDI-1742] Improve table level config priority for HoodieMultiTableDeltaStreamer (#2744) 2021-04-26 22:05:06 +08:00
Sivabalan Narayanan
3e4fa170cf [HUDI-1835] Fixing kafka native config param for auto offset reset (#2864) 2021-04-25 12:16:09 -04:00
pengzhiwei
aacb8be521 [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) 2021-04-20 14:21:38 -07:00
Jintao Guan
3253079507 [HUDI-1764] Add Hudi-CLI support for clustering (#2773)
* tmp base

* update

* update unit test

* update

* update

* update CLI parameters

* linting

* update doSchedule in HoodieClusteringJob

* update

* update diff according to comments
2021-04-20 09:46:42 -07:00
Aditya Tiwari
ec2334ceac [HUDI-1716]: Resolving default values for schema from dataframe (#2765)
- Adding default values and setting null as first entry in UNION data types in avro schema. 

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
2021-04-19 10:05:20 -04:00
Roc Marshal
62bb9e10d9 [Hotfix][utilities] Optimized codes (#2821) 2021-04-15 09:40:14 +08:00
wangxianghu
040756d8c0 [HUDI-1785] Move OperationConverter to hudi-client-common for code reuse (#2798) 2021-04-12 16:22:33 +08:00
li36909
dadd081d45 [HUDI-1751] DeltaStreamer print many unnecessary warn log (#2754) 2021-04-07 00:47:03 -07:00
li36909
920537cac8 [HUDI-1749] Clean/Compaction/Rollback command maybe never exit when operation fail (#2752) 2021-04-05 23:23:15 -07:00
pengzhiwei
684622c7c9 [HUDI-1591] Implement Spark's FileIndex for Hudi to support queries via Hudi DataSource using non-globbed table path and partition pruning (#2651) 2021-04-01 11:12:28 -07:00
Gary Li
452f5e2d66 [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (#2727) 2021-03-29 06:04:48 -07:00
n3nash
bec70413c0 [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (#2731) 2021-03-27 10:07:10 -07:00
garyli1019
6e803e08b1 Moving to 0.9.0-SNAPSHOT on master branch. 2021-03-24 21:37:14 +08:00
n3nash
01a1d7997b [HUDI-1712] Rename & standardize config to match other configs (#2708) 2021-03-24 17:24:02 +08:00
n3nash
d7b18783bd [HUDI-1709] Improving config names and adding hive metastore uri config (#2699) 2021-03-22 01:22:06 -07:00
Volodymyr Burenin
900de34e45 [HUDI-1650] Custom avro kafka deserializer. (#2619)
* Custom avro kafka deserializer

Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>
Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2021-03-20 00:51:08 -07:00
Sivabalan Narayanan
161d530f93 Fixing kafka auto.reset.offsets config param key (#2691) 2021-03-19 12:54:29 -07:00
n3nash
74241947c1 [HUDI-845] Added locking capability to allow multiple writers (#2374)
* [HUDI-845] Added locking capability to allow multiple writers
1. Added LockProvider API for pluggable lock methodologies
2. Added Resolution Strategy API to allow for pluggable conflict resolution
3. Added TableService client API to schedule table services
4. Added Transaction Manager for wrapping actions within transactions
2021-03-16 16:43:53 -07:00
Vinoth Govindarajan
16864aee14 [HUDI-1695] Fixed the error messaging (#2679) 2021-03-16 11:30:26 +08:00
Ankush Kanungo
f5e31be086 [HUDI-1685] keep updating current date for every batch (#2671) 2021-03-12 15:53:01 -08:00
Sivabalan Narayanan
5cf2f2618b [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (#2577) 2021-03-07 16:40:40 -05:00
pengzhiwei
bc883db5de [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596) 2021-03-05 14:10:27 +08:00
Raymond Xu
f53bca404f [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (#2621)
- Add a config to allow parsing custom date format in `DatePartitionPathSelector`. Currently it assumes date partition string in the format of `yyyy-MM-dd`.
- Fix a bug where `UnsupportedOperationException` was thrown when sort `eligibleFiles` in-place. Changed to sort it and store in a new list.
2021-03-04 21:01:51 -08:00
t0il3ts0ap
4fa43359cb [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (#2617) 2021-03-03 09:49:18 +08:00
liujinhui
617cc24ad1 [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227)
Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2021-02-25 07:08:13 -05:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move cleaning failed commits logic into cleaner
2. Introduce hoodie heartbeat to ascertain failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
Sivabalan Narayanan
c9fcf964b2 [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534) 2021-02-20 09:54:26 +08:00
Sivabalan Narayanan
d5f202821b Adding fixes to test suite framework. Adding clustering node and validate async operations node. (#2400) 2021-02-12 09:29:21 -08:00
Volodymyr Burenin
17802569fd [HUDI-1538] Try to init class trying different signatures instead of checking its name (#2476)
* [HUDI-1538] Try to init class trying different signatures instead of checking its name.

* Removed unused imports

Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com>
2021-02-03 12:29:08 -08:00
wangxianghu
23f2ef3efb [HUDI-623] Remove UpgradePayloadFromUberToApache (#2455) 2021-01-28 17:48:50 -08:00
Vinoth Chandar
3719e7b388 Moving to 0.8.0-SNAPSHOT on master branch. 2021-01-20 11:31:22 -08:00
vinoth chandar
5ca0625b27 [HUDI 1308] Harden RFC-15 Implementation based on production testing (#2441)
Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.

* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
 - Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
 - Consequently, IOException is never thrown and many files had to be adjusted
- More diligent handling of open file handles in metadata table
 - Added config for controlling reuse of connections
 - Added config for turning off fallback to listing, so we can see tests fail
 - Changed all ipf listing code to cache/amortize the open/close for better performance
 - Timelineserver also reuses connections, for better performance
 - Without timelineserver, when metadata table is opened from executors, reuse is not allowed
 - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument.
 -  Fix TestHoodieBackedTableMetadata#testSync
2021-01-19 21:20:28 -08:00
Sivabalan Narayanan
a43e191d6c [MINOR] Bumping snapshot version to 0.7.0 (#2435) 2021-01-16 09:56:28 -05:00
Udit Mehrotra
7ce3ac778e [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partiton paths (#2417)
* [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths

* Adding testClass for FileSystemBackedTableMetadata

Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
2021-01-10 21:19:52 -08:00
lw0090
368c1a8f5c [HUDI-1399] support a independent clustering spark job to asynchronously clustering (#2379)
* [HUDI-1481]  add  structured streaming and delta streamer clustering unit test

* [HUDI-1399] support a independent clustering spark job to asynchronously clustering

* [HUDI-1399]  support a  independent clustering spark job to asynchronously clustering

* [HUDI-1498] Read clustering plan from requested file for inflight instant (#2389)

* [HUDI-1399]  support  a independent clustering spark job with schedule generate instant time

Co-authored-by: satishkotha <satishkotha@uber.com>
2021-01-09 17:30:16 -08:00
puyvqi
1a836f9a84 [HUDI-1514] Avoid raw type use for parameter of Transformer interface (#2420) 2021-01-09 18:36:13 +08:00
Udit Mehrotra
17df517b81 [HUDI-1510] Move HoodieEngineContext and its dependencies to hudi-common (#2410) 2021-01-07 11:34:06 -08:00
Udit Mehrotra
4e64226844 [HUDI-1450] Use metadata table for listing in HoodieROTablePathFilter (apache#2326)
[HUDI-1394] [RFC-15] Use metadata table (if present) to get all partition paths (apache#2351)
2021-01-04 07:59:47 -08:00
lw0090
9e6889a8ce [HUDI-1481] add structured streaming and delta streamer clustering unit test (#2360) 2020-12-27 20:27:09 -08:00
Sivabalan Narayanan
8cf6a7223f [HUDI-1331] Adding support for validating entire dataset and long running tests in test suite framework (#2168)
* trigger rebuild

* [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (#1927)

* Adding support for validating records and long running tests in test sutie framework

* Adding partial validate node

* Fixing spark session initiation in Validate nodes

* Fixing validation

* Adding hive table validation to ValidateDatasetNode

* Rebasing with latest commits from master

* Addressing feedback

* Addressing comments

Co-authored-by: lamber-ken <lamberken@163.com>
Co-authored-by: linshan-ma <mabin194046@163.com>
2020-12-26 09:29:24 -08:00
Sivabalan Narayanan
33d338f392 [HUDI-115] Adding DefaultHoodieRecordPayload to honor ordering with combineAndGetUpdateValue (#2311)
* Added ability to pass in `properties` to payload methods, so they can perform table/record specific merges
* Added default methods so existing payload classes are backwards compatible. 
* Adding DefaultHoodiePayload to honor ordering while merging two records
* Fixing default payload based on feedback
2020-12-19 19:19:42 -08:00
Bhavani Sudha Saktheeswaran
14d5d1100c [HUDI-1406] Add date partition based source input selector for Delta streamer (#2264)
- Adds ability to list only recent date based partitions from source data.
- Parallelizes listing for faster tailing of DFSSources
2020-12-17 03:59:30 -08:00
Raymond Xu
bd9cceccb5 [HUDI-1395] Fix partition path using FSUtils (#2312)
Fixed the logic to get partition path in Copier and Exporter utilities.
2020-12-10 10:19:19 +08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
liujinhui
62b392b49c [HUDI-1343] Add standard schema postprocessor which would rewrite the schema using spark-avro conversion (#2192)
Co-authored-by: liujh <liujh@t3go.cn>
2020-12-03 19:28:34 -08:00
wangxianghu
a23230c8c2 [HUDI-1400] Replace Operation enum with WriteOperationType (#2259) 2020-11-19 13:40:04 +08:00
Shen Hong
49407169ac [HUDI-1209] Properties File must be optional when running deltastreamer (#2085) 2020-10-21 17:49:28 -07:00
Ho Tien Vu
af5ef4d49d [HUDI-1330] handle prefix filtering at directory level (#2157)
The current DFSPathSelector only ignore prefix(_, .) at the file level while files under subdirectories
e.g. (.checkpoint/*) are still considered which result in bad-format exception during reading.
2020-10-20 23:20:19 -07:00
Ho Tien Vu
fd269ddeb0 [MINOR] Make sure factory method is used to instanciate DFSPathSelector (#2187)
* Move createSourceSelector into DFSPathSelector factory method
* Replace constructor call with factory method
* Added some javadoc
2020-10-20 17:52:31 +08:00
lw0090
2126f13e13 [HUDI-791] Replace null by Option in Delta Streamer (#2171) 2020-10-11 18:29:57 -07:00