1
0
Commit Graph

297 Commits

Author SHA1 Message Date
Shen Hong
236d1b0dec [HUDI-1439] Remove scala dependency from hudi-client-common (#2306) 2020-12-11 00:36:37 -08:00
Danny Chan
4bc45a391a [HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313) 2020-12-10 20:02:02 +08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
wangxianghu
de2fbeac33 [HUDI-1412] Make HoodieWriteConfig support setting different default … (#2278)
* [HUDI-1412] Make HoodieWriteConfig support setting different default value according to engine type
2020-12-07 09:29:53 +08:00
lw0090
1f0d5c077e [HUDI-1349] spark sql support overwrite use insert_overwrite_table (#2196) 2020-12-03 12:26:21 -08:00
rmpifer
78fd122594 [HUDI-1196] Update HoodieKey when deduplicating records with global index (#2248)
- Works only for overwrite payload (default)
- Does not alter current semantics otherwise 

Co-authored-by: Ryan Pifer <ryanpife@amazon.com>
2020-12-01 13:50:46 -08:00
Prashant Wason
ac23d2587f [HUDI-1357] Added a check to validate records are not lost during merges. (#2216)
- Turned off by default
2020-12-01 13:44:57 -08:00
leesf
3d5e9fee7f [MINOR] refactor code in HoodieMergeHandle (#2272) 2020-11-28 21:47:05 +08:00
Balaji Varadarajan
0ebef1c0a0 [HUDI-1358] Fix leaks in DiskBasedMap and LazyFileIterable (#2249) 2020-11-23 10:56:26 -08:00
Shen Hong
d9411c38db [HUDI-1364] Add HoodieJavaEngineContext to hudi-java-client (#2222) 2020-11-23 10:06:28 -08:00
Gary Li
c8d5ea2752 [MINOR] clean up and add comments to flink client (#2261) 2020-11-19 15:27:52 +08:00
wangxianghu
4d05680038 [HUDI-1327] Introduce base implemetation of hudi-flink-client (#2176) 2020-11-18 17:57:11 +08:00
Balaji Varadarajan
42b6aeca28 [HUDI-1358] Fix Memory Leak in HoodieLogFormatWriter (#2217) 2020-11-09 19:26:13 -08:00
wangxianghu
d160abb437 [HUDI-912] Refactor and relocate KeyGenerator to support more engines (#2200)
* [HUDI-912] Refactor and relocate KeyGenerator to support more engines

* Rename KeyGenerators
2020-11-02 13:12:51 -08:00
Venkatesh Rudraraju
59f995a3f5 Use RateLimiter instead of sleep. Repartition WriteStatus to optimize Hbase index writes (#1484) 2020-11-02 08:33:27 -08:00
wangxianghu
e206ddd431 [MINOR] Private the NoArgsConstructor of SparkMergeHelper and code clean (#2194) 2020-10-26 12:22:11 +08:00
lw0090
8545ea3856 [HUDI-1118] Cleanup rollback files residing in .hoodie folder (#2205) 2020-10-25 21:04:56 -07:00
Prashant Wason
49e855c348 [HUDI-1326] Added an API to force publish metrics and flush them. (#2152)
* [HUDI-1326] Added an API to force publish metrics and flush them.

Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite.

* Code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-24 16:47:24 -07:00
Raymond Xu
14c4611857 [MINOR] Fix caller to SparkBulkInsertCommitActionExecutor (#2195)
Fixed calling the wrong constructor
2020-10-21 19:50:10 -07:00
lw0090
4d80e1e221 [HUDI-284] add more test for UpdateSchemaEvolution (#2127)
Unit test different schema evolution scenarios.
2020-10-19 07:38:04 -07:00
wangxianghu
c7d962efff [HUDI-1328] Introduce HoodieFlinkEngineContext to hudi-flink-client (#2161) 2020-10-14 09:30:49 +08:00
satishkotha
0d407342ef [HUDI-1304] Add unit test for testing compaction on replaced file groups (#2150) 2020-10-12 16:48:29 -07:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
hj2016
c0472d3317 [HUDI-1184] Fix the support of hbase index partition path change (#1978)
When the hbase index is used, when the record partition is changed to another partition, the path does not change according to the value of the partition column

Co-authored-by: huangjing <huangjing@clinbrain.com>
2020-10-11 19:05:57 -07:00
dugenkui
b58daf29ba [MINOR] remove unused generics type (#2163) 2020-10-11 18:38:42 -07:00
vinoyang
eafd7bf289 [MINOR] Fix wrong javadoc and refactor some naming issues (#2156) 2020-10-09 15:09:26 -07:00
Raymond Xu
1d1d91d444 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2143)
* [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

Remove APIs in `HoodieTestUtils`
- listAllDataFilesAndLogFilesInPath
- listAllLogFilesInPath
- listAllDataFilesInPath
- writeRecordsToLogFiles
- createCleanFiles
- createPendingCleanFiles

Migrate the callers to use `HoodieTestTable` and `HoodieWriteableTestTable` with new APIs added
- listAllBaseAndLogFiles
- listAllLogFiles
- listAllBaseFiles
- withLogAppends
- addClean
- addInflightClean

Also added related APIs in `FileCreateUtils`
- createCleanFile
- createRequestedCleanFile
- createInflightCleanFile
2020-10-09 10:21:27 +08:00
Pratyaksh Sharma
524193eb4b [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566)
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
2020-10-06 20:34:03 -07:00
lw0090
fdae388626 [HUDI-1203] add port configuration for EmbeddedTimelineService (#2142) 2020-10-05 11:36:54 -07:00
Prashant Wason
6c610b91ef [HUDI-1305] Added an API to shutdown and remove the metrics reporter. (#2132)
This helps in removing reporter once the test has complete. Prevents log pollution from un-necessary metric logs.

- Added an API to shutdown the metrics reporter after tests.
2020-10-04 09:30:04 -07:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
satishkotha
a99e93bed5 [HUDI-1072] Introduce REPLACE top level action. Implement insert_overwrite operation on top of replace action (#2048) 2020-09-29 17:04:25 -07:00
Raymond Xu
1be0b06ef8 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2112)
Remove APIs in HoodieTestUtils

- HoodieTestUtils#createInflightCommitFiles
- HoodieTestUtils#getCommitFilePath
- HoodieTestUtils#doesCommitExist

and migrate usages to HoodieTestTable in

- hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRollbacksCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestCommitsCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/testutils/HoodieTestCommitMetadataGenerator.java
- hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
2020-09-26 21:21:47 +08:00
dugenkui
ae68b2b355 [MINOR] fix typos (#2116) 2020-09-26 20:40:33 +08:00
dugenkui
6837118c21 [MINOR] Improve description (#2113) 2020-09-25 22:21:37 +08:00
lw0090
fcc497eff1 [HUDI-1268] fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss (#2099) 2020-09-22 09:57:20 -07:00
Kaiux
8087016504 [HUDI-1213] Set Default for the bootstrap config : hoodie.bootstrap.full.input.provider (#2087) 2020-09-22 03:28:19 -07:00
Raymond Xu
7c45894f43 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2094)
Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes
- TestClientRollback
- TestCopyOnWriteRollbackActionExecutor

Use FileCreateUtils APIs in CompactionTestUtils.

Then remove unused deprecated APIs after migration.
2020-09-19 17:55:24 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
shenh062326
581d54097c [HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long 2020-09-15 20:58:29 -07:00
Raymond Xu
83e39e2b17 [HUDI-781] Add HoodieWriteableTestTable (#2040)
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWrittableTestTable for test cases in
  - ITTestRepairsCommand.java
  - TestHoodieIndex.java
  - TestHoodieKeyLocationFetchHandle.java
  - TestHoodieGlobalBloomIndex.java
  - TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
  - dataFile changed to baseFile
2020-09-07 17:54:36 +08:00
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Raymond Xu
0360bef217 [MINOR] Improve helper methods in TestCleaner (#2052)
- Use private static assert methods
- Use ParameterizedTest
- Rename HoodieTestTable APIs
2020-08-29 14:06:25 +08:00
Mathieu
fa81248247 [HUDI-531] Add java doc for hudi test suite general classes (#1900) 2020-08-28 08:44:40 +08:00
Mathieu
7e68c42eb1 [HUDI-1223] Remove unused UpdateHandler class in HoodieCopyOnWriteTable (#2032) 2020-08-26 08:46:19 +08:00
Trevor
6a4dc7384c [HUDI-1218] Introduce BulkInsertSortMode as Independent class (#2021) 2020-08-25 19:04:13 +08:00
Trevor
7291607ae3 [MINOR] Remove unused log code in HoodieReadClient (#2000) 2020-08-22 21:45:50 +08:00
Shen Hong
1d09c02f1c [HUDI-1083] Optimization in determining insert bucket location for a given key (#1868)
- To determine insert bucket location for a given key, hudi walks through all insert buckets with O(N) cost, while this patch adds an optimization to make it O(logN).
2020-08-22 07:41:39 -04:00
Raymond Xu
3a2ae16961 [HUDI-781] Introduce HoodieTestTable for test preparation (#1997) 2020-08-21 11:46:33 +08:00