
222 Commits

Author SHA1 Message Date
Danny Chan
4bc45a391a [HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313) 2020-12-10 20:02:02 +08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to be compatible with both Spark 2 and Spark 3

* Refactor the bulk insert v2 part so that Hudi compiles with Spark 3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
lw0090
1f0d5c077e [HUDI-1349] spark sql support overwrite use insert_overwrite_table (#2196) 2020-12-03 12:26:21 -08:00
Prashant Wason
ac23d2587f [HUDI-1357] Added a check to validate records are not lost during merges. (#2216)
- Turned off by default
2020-12-01 13:44:57 -08:00
Guy Khazma
b826c53e33 [HUDI-1373] Add Support for OpenJ9 JVM (#2231)
* add support for OpenJ9 VM
* add 32bit openJ9
* Pulled the memory layout specs into their own classes.
2020-12-01 13:19:40 -08:00
Balaji Varadarajan
0ebef1c0a0 [HUDI-1358] Fix leaks in DiskBasedMap and LazyFileIterable (#2249) 2020-11-23 10:56:26 -08:00
Balaji Varadarajan
42b6aeca28 [HUDI-1358] Fix Memory Leak in HoodieLogFormatWriter (#2217) 2020-11-09 19:26:13 -08:00
wenningd
0364498ae3 [HUDI-1375] Fix bug in HoodieAvroUtils.removeMetadataFields() method (#2232)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-11-05 17:30:17 -08:00
satishkotha
33ec88fc38 [HUDI-1352] Add FileSystemView APIs to query pending clustering operations (#2202) 2020-11-05 08:49:58 -08:00
Venkatesh Rudraraju
59f995a3f5 Use RateLimiter instead of sleep. Repartition WriteStatus to optimize Hbase index writes (#1484) 2020-11-02 08:33:27 -08:00
Guy Khazma
35d406de40 [HUDI-1344] IBM Cloud Object Storage Support (#2182) 2020-10-18 17:24:53 +08:00
lw0090
b66c3ef23a [HUDI-1298] Add better error messages when IOException occurs during log file reading (#2133) 2020-10-13 00:45:10 -07:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
dugenkui
b58daf29ba [MINOR] remove unused generics type (#2163) 2020-10-11 18:38:42 -07:00
dugenkui
032bc3b08f [MINOR] NPE Optimization for Option (#2158) 2020-10-11 17:55:41 -07:00
dugenkui
d4d4c8c899 [MINOR] Fix typo and others (#2164)
* remove HoodieSerializationException that will never be thrown
* remove unused method, make HoodieException more readable
* fix typo
2020-10-11 17:52:44 -07:00
lw0090
585ce0094d [HUDI-1301] use spark INCREMENTAL mode query hudi dataset support schema version. (#2125) 2020-10-10 20:53:41 +08:00
dugenkui
00271af64e [MINOR] Fix typo (#2159)
* fix typo

* fix typo
2020-10-09 14:52:55 -07:00
Raymond Xu
1d1d91d444 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2143)
* [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

Remove APIs in `HoodieTestUtils`
- listAllDataFilesAndLogFilesInPath
- listAllLogFilesInPath
- listAllDataFilesInPath
- writeRecordsToLogFiles
- createCleanFiles
- createPendingCleanFiles

Migrate the callers to use `HoodieTestTable` and `HoodieWriteableTestTable` with new APIs added
- listAllBaseAndLogFiles
- listAllLogFiles
- listAllBaseFiles
- withLogAppends
- addClean
- addInflightClean

Also added related APIs in `FileCreateUtils`
- createCleanFile
- createRequestedCleanFile
- createInflightCleanFile
2020-10-09 10:21:27 +08:00
Shen Hong
b335459c80 [HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088) 2020-10-04 11:34:21 -07:00
satishkotha
a99e93bed5 [HUDI-1072] Introduce REPLACE top level action. Implement insert_overwrite operation on top of replace action (#2048) 2020-09-29 17:04:25 -07:00
hongdd
32c9cad52c [HUDI-840] Avoid blank file created by HoodieLogFormatWriter (#1567) 2020-09-29 08:02:15 -07:00
leesf
b0f1b736f8 [MINOR] Fix checkstyle (#2117) 2020-09-26 22:25:19 +08:00
Raymond Xu
1be0b06ef8 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2112)
Remove APIs in HoodieTestUtils

- HoodieTestUtils#createInflightCommitFiles
- HoodieTestUtils#getCommitFilePath
- HoodieTestUtils#doesCommitExist

and migrate usages to HoodieTestTable in

- hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRollbacksCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestCommitsCommand.java
- hudi-cli/src/test/java/org/apache/hudi/cli/testutils/HoodieTestCommitMetadataGenerator.java
- hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
2020-09-26 21:21:47 +08:00
dugenkui
ae68b2b355 [MINOR] fix typos (#2116) 2020-09-26 20:40:33 +08:00
Raymond Xu
7c45894f43 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2094)
Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes
- TestClientRollback
- TestCopyOnWriteRollbackActionExecutor

Use FileCreateUtils APIs in CompactionTestUtils.

Then remove unused deprecated APIs after migration.
2020-09-19 17:55:24 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for creating new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
shenh062326
581d54097c [HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long 2020-09-15 20:58:29 -07:00
Karl-WangSK
a1cff8abae [HUDI-1255] Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage (#2056)
Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage

## Brief change log

Update the current value only for the fields that you want to change.

The default payload OverwriteWithLatestAvroPayload overwrites the whole record when compared to `orderingVal`. This doesn't meet our need when we just want to change specified fields.
For example (suppose the default value is null):
```
current Value 
Field:      name   age   gender
Value:     karl     20    male
```
```
insert Value
Field:      name   age   gender
Value:     null     30    null
```
```
After insert:
Field:      name   age   gender
Value:     karl     30    male
```
## Verify this pull request

Added TestOverwriteNonDefaultsWithLatestAvroPayload to verify the change.
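
The merge rule in the example above can be sketched without Avro, using plain maps as stand-ins for records. This is only an illustration under stated assumptions: `NonDefaultMergeSketch` and `mergeNonDefaults` are hypothetical names, not the payload's actual API, and every field is assumed to default to null.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class NonDefaultMergeSketch {

    // Hypothetical helper (not a Hudi API): start from the stored record and
    // copy over only those incoming fields whose value differs from the default.
    public static Map<String, Object> mergeNonDefaults(Map<String, Object> current,
                                                       Map<String, Object> incoming,
                                                       Map<String, Object> defaults) {
        Map<String, Object> merged = new LinkedHashMap<>(current);
        for (Map.Entry<String, Object> field : incoming.entrySet()) {
            // A field missing from `defaults` is treated as having a null default.
            Object defaultValue = defaults.get(field.getKey());
            if (!Objects.equals(field.getValue(), defaultValue)) {
                merged.put(field.getKey(), field.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> current = new LinkedHashMap<>();
        current.put("name", "karl");
        current.put("age", 20);
        current.put("gender", "male");

        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("name", null);
        incoming.put("age", 30);
        incoming.put("gender", null);

        // Empty map: every default is null, as in the example above.
        Map<String, Object> defaults = new LinkedHashMap<>();

        System.out.println(mergeNonDefaults(current, incoming, defaults));
    }
}
```

With the inputs from the example, only `age` is replaced, which matches the "After insert" row.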
2020-09-09 21:54:21 -07:00
linshan-ma
063a98fc2b [HUDI-1254] TypedProperties can not get values by initializing an existing properties (#2059) 2020-09-09 23:42:41 +08:00
Abhishek Modi
53d1e55110 Test Suite should work with Docker + Unit Tests 2020-09-08 22:41:14 -07:00
wenningd
2fee087f0f [HUDI-1181] Fix decimal type display issue for record key field (#1953)
* [HUDI-1181] Fix decimal type display issue for record key field

* Remove getNestedFieldVal method from DataSourceUtils

* resolve comments

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-09-08 17:50:54 -07:00
Prashant Wason
fe7c9e71eb [MINOR] Fix BindException when running tests on shared machines. (#2070)
When unit tests are run on shared machines (e.g. jenkins cluster), the unit tests sometimes fail due to BindException in starting HDFS Cluster. This is because the port chosen may have been bound by another process using the same machine. The fix is to retry the port selection a few times.
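
The retry idea can be sketched with a plain `ServerSocket`. This is not the code from the commit, which retries while starting the HDFS cluster. `PortRetrySketch` and `bindWithRetry` are hypothetical names, and walking upwards from a base port is an assumption made only for illustration.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

public class PortRetrySketch {

    // Hypothetical helper: try up to maxAttempts consecutive ports, starting at
    // basePort, and return the first socket that binds successfully.
    public static ServerSocket bindWithRetry(int basePort, int maxAttempts) throws IOException {
        BindException lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int port = basePort + attempt;
            if (port > 65535) {
                break; // ran out of valid port numbers
            }
            try {
                return new ServerSocket(port);
            } catch (BindException e) {
                // Another process on the machine holds this port; try the next one.
                lastFailure = e;
            }
        }
        throw new IOException("No free port found after " + maxAttempts + " attempts", lastFailure);
    }
}
```

The caller owns the returned socket and should close it when done.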
2020-09-07 19:30:45 -07:00
Raymond Xu
83e39e2b17 [HUDI-781] Add HoodieWriteableTestTable (#2040)
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWriteableTestTable for test cases in
  - ITTestRepairsCommand.java
  - TestHoodieIndex.java
  - TestHoodieKeyLocationFetchHandle.java
  - TestHoodieGlobalBloomIndex.java
  - TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
  - dataFile changed to baseFile
2020-09-07 17:54:36 +08:00
Sreeram Ramji
6537af2676 [HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured (#2014) 2020-09-04 09:08:30 -07:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.
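
As a rough illustration of why a key-sorted layout helps, the sketch below uses an in-memory `TreeMap` as a stand-in for a sorted key-value file. It does not use the HFile API; `SortedKeyLookupSketch` and its methods are hypothetical, and the byte arrays merely represent Avro-encoded records.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedKeyLookupSketch {

    // Keys are kept in sorted order, so point and range lookups
    // do not need to scan every entry.
    private final NavigableMap<String, byte[]> store = new TreeMap<>();

    // Store a record under its record key (value = encoded record bytes).
    public void put(String recordKey, byte[] encodedRecord) {
        store.put(recordKey, encodedRecord);
    }

    // Point lookup by record key; returns null when the key is absent.
    public byte[] get(String recordKey) {
        return store.get(recordKey);
    }

    // Range lookup: all entries with fromInclusive <= key < toExclusive.
    public SortedMap<String, byte[]> range(String fromInclusive, String toExclusive) {
        return store.subMap(fromInclusive, toExclusive);
    }
}
```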

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have an implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils to not use scala imports, reflection-based conversion, etc.
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Satish Kotha
4dbeabffa3 [HUDI-1228] Add utility method to query extra metadata 2020-08-28 12:23:47 -07:00
Balajee Nagasubramaniam
cc555ba188 [HUDI-1133] Tune buffer sizes for the diskbased external spillable map 2020-08-25 14:23:58 -07:00
Satish Kotha
492ddcbb06 [HUDI-1191] Add incremental meta client API to query partitions modified in a time window 2020-08-25 12:40:10 -07:00
Prashant Wason
218d4a6836 [HUDI-1135] Make timeline server timeout settings configurable. 2020-08-24 18:09:00 -07:00
Prashant Wason
9b1f16b604 [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class. 2020-08-24 18:08:17 -07:00
Mathieu
f8dcd5334e [HUDI-1217] Improve avroToBytes method of HoodieAvroUtils (#2018) 2020-08-24 17:33:28 +08:00
Raymond Xu
3a2ae16961 [HUDI-781] Introduce HoodieTestTable for test preparation (#1997) 2020-08-21 11:46:33 +08:00
Abhishek Modi
bedbb825e0 [HUDI-1025] Meter RPC calls in HoodieWrapperFileSystem (#1916) 2020-08-18 22:42:05 +08:00
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
vinoth chandar
9bde6d616c [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965)
- Maturity levels one of : evolving, stable, deprecated
- Took a pass and marked out most of the existing public API
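
A minimal sketch of how such maturity annotations might be declared and read back at runtime. The names below (`ApiMaturitySketch`, `PublicApiClassSketch`, `PublicApiMethodSketch`, `Maturity`, `ExampleWriter`) are hypothetical stand-ins chosen for illustration, not the definitions added by this commit.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class ApiMaturitySketch {

    // The three maturity levels named in the commit message.
    public enum Maturity { EVOLVING, STABLE, DEPRECATED }

    // Hypothetical class-level marker.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface PublicApiClassSketch {
        Maturity maturity();
    }

    // Hypothetical method-level marker.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface PublicApiMethodSketch {
        Maturity maturity();
    }

    // Example of a class marked as public API.
    @PublicApiClassSketch(maturity = Maturity.STABLE)
    public static class ExampleWriter {
        @PublicApiMethodSketch(maturity = Maturity.EVOLVING)
        public void write() {
            // no-op; only the annotations matter here
        }
    }

    // Reads the class-level maturity, or returns null when the class is unmarked.
    public static Maturity classMaturity(Class<?> type) {
        PublicApiClassSketch marker = type.getAnnotation(PublicApiClassSketch.class);
        return marker == null ? null : marker.maturity();
    }
}
```

Runtime retention is an assumption here; it lets tooling or tests check the markers by reflection.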
2020-08-13 23:28:17 -07:00
Sivabalan Narayanan
379cf0786f [HUDI-1013] Adding Bulk Insert V2 implementation (#1834)
- Adding ability to use native spark row writing for bulk_insert
 - Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
 - Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
 - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
 - Fixed all built-in key generators with new APIs
 - Made the field position map lazily created upon the first call to row based apis
 - Implemented native row based key generators for CustomKeyGenerator
 - Fixed all the tests, with these new APIs

Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-13 00:33:39 -07:00
wenningd
8b928e9bca [HUDI-808] Support cleaning bootstrap source data (#1870)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
2020-08-11 01:43:46 -07:00
Balaji Varadarajan
626f78f6f6 Revert "[HUDI-781] Introduce HoodieTestTable for test preparation (#1871)"
This reverts commit b2e703d442.
2020-08-10 22:13:02 -07:00
Raymond Xu
b2e703d442 [HUDI-781] Introduce HoodieTestTable for test preparation (#1871) 2020-08-11 09:44:03 +08:00