1
0
Commit Graph

218 Commits

Author SHA1 Message Date
Raymond Xu
7c45894f43 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2094)
Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes
- TestClientRollback
- TestCopyOnWriteRollbackActionExecutor

Use FileCreateUtils APIs in CompactionTestUtils.

Then remove unused deprecated APIs after migration.
2020-09-19 17:55:24 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
shenh062326
581d54097c [HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long 2020-09-15 20:58:29 -07:00
Karl-WangSK
a1cff8abae [HUDI-1255] Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage (#2056)
Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage

## Brief change log

update current value for several fields that you want to change.

The default payload OverwriteWithLatestAvroPayload overwrite the whole record when 

compared to `orderingVal`.This doesn't meet our need when we just want to change specified fields.
For example: (suppose Default value is null)
```
current Value 
Field:      name   age   gender
Value:     karl     20    male
```
```
insert Value
Field:      name   age   gender
Value:     null     30    null
```
```
After insert:
Field:      name   age   gender
Value:     karl     30    male
```
## Verify this pull request

Added TestOverwriteNonDefaultsWithLatestAvroPayload to verify the change.
2020-09-09 21:54:21 -07:00
linshan-ma
063a98fc2b [HUDI-1254] TypedProperties can not get values by initializing an existing properties (#2059) 2020-09-09 23:42:41 +08:00
Prashant Wason
fe7c9e71eb [MINOR] Fix BindException when running tests of shared machines. (#2070)
When unit tests are run on shared machines (e.g. jenkins cluster), the unit tests sometimes fail due to BindException in starting HDFS Cluster. This is because the port chosen may have been bound by another process using the same machine. The fix is to retry the port selection a few times.
2020-09-07 19:30:45 -07:00
Raymond Xu
83e39e2b17 [HUDI-781] Add HoodieWriteableTestTable (#2040)
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWrittableTestTable for test cases in
  - ITTestRepairsCommand.java
  - TestHoodieIndex.java
  - TestHoodieKeyLocationFetchHandle.java
  - TestHoodieGlobalBloomIndex.java
  - TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
  - dataFile changed to baseFile
2020-09-07 17:54:36 +08:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Satish Kotha
4dbeabffa3 [HUDI-1228] Add utility method to query extra metadata 2020-08-28 12:23:47 -07:00
Satish Kotha
492ddcbb06 [HUDI-1191] Add incremental meta client API to query partitions modified in a time window 2020-08-25 12:40:10 -07:00
Mathieu
f8dcd5334e [HUDI-1217] Improve avroToBytes method of HoodieAvroUtils (#2018) 2020-08-24 17:33:28 +08:00
Raymond Xu
3a2ae16961 [HUDI-781] Introduce HoodieTestTable for test preparation (#1997) 2020-08-21 11:46:33 +08:00
Abhishek Modi
bedbb825e0 [HUDI-1025] Meter RPC calls in HoodieWrapperFileSystem (#1916) 2020-08-18 22:42:05 +08:00
Sivabalan Narayanan
379cf0786f [HUDI-1013] Adding Bulk Insert V2 implementation (#1834)
- Adding ability to use native spark row writing for bulk_insert
 - Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
 - Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
 - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
 - Fixed all built-in key generators with new APIs
 - Made the field position map lazily created upon the first call to row based apis
 - Implemented native row based key generators for CustomKeyGenerator
 - Fixed all the tests, with these new APIs

Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-13 00:33:39 -07:00
wenningd
8b928e9bca [HUDI-808] Support cleaning bootstrap source data (#1870)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
2020-08-11 01:43:46 -07:00
Balaji Varadarajan
626f78f6f6 Revert "[HUDI-781] Introduce HoodieTestTable for test preparation (#1871)"
This reverts commit b2e703d442.
2020-08-10 22:13:02 -07:00
Raymond Xu
b2e703d442 [HUDI-781] Introduce HoodieTestTable for test preparation (#1871) 2020-08-11 09:44:03 +08:00
wenningd
9fe2d2b14a [HUDI-427] [HUDI-971] Implement CLI support for performing bootstrap (#1869)
* [HUDI-971] Clean partitions & fileIds returned by HFileBootstrapIndex
* [HUDI-427] Implement CLI support for performing bootstrap

Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
2020-08-08 12:37:29 -07:00
Raymond Xu
5ee676e34f [MINOR] Move a test method to Transformations (#1934)
- Move TestHoodieKeyLocationFetchHandle#getRecordsPerPartition to Transformations
- Improve some var namings
2020-08-08 18:25:55 +08:00
Gary Li
4f74a84607 [HUDI-69] Support Spark Datasource for MOR table - RDD approach (#1848)
- This PR implements Spark Datasource for MOR table in the RDD approach.
- Implemented SnapshotRelation
- Implemented HudiMergeOnReadRDD
- Implemented separate Iterator to handle merge and unmerge record reader.
- Added TestMORDataSource to verify this feature.
- Clean up test file name, add tests for mixed query type tests
 - We can now revert the change made in DefaultSource

Co-authored-by: Vinoth Chandar <vchandar@confluent.io>
2020-08-07 00:28:14 -07:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Sivabalan Narayanan
ab11ba43e1 [REVERT] "[HUDI-1058] Make delete marker configurable (#1819)" (#1914)
This reverts commit 433d7d2c98.
2020-08-04 15:20:38 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Shen Hong
433d7d2c98 [HUDI-1058] Make delete marker configurable (#1819) 2020-08-03 11:06:31 -04:00
Raymond Xu
10e4268792 [HUDI-995] Use Transformations, Assertions and SchemaTestUtil (#1884)
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
2020-08-01 20:57:18 +08:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
Raymond Xu
ca36c44cb3 [HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common (#1873) 2020-07-27 19:21:45 +08:00
Raymond Xu
0cb24e4a2d [MINOR] Use HoodieActiveTimeline.COMMIT_FORMATTER (#1874) 2020-07-24 18:48:56 -07:00
Gary Li
467d097dae [MINOR] Add Databricks File System to StorageSchemes (#1877) 2020-07-24 18:47:09 -07:00
Sivabalan Narayanan
5b6026ba43 [HUDI-802] Fixing deletes for inserts in same batch in write path (#1792)
* Fixing deletes for inserts in same batch in write path
* Fixing delta streamer tests
* Adding tests for OverwriteWithLatestAvroPayload
2020-07-22 19:39:57 -07:00
DeyinZhong
743ef322b8 [HUDI-871] Add support for Tencent Cloud Object Storage(COS) (#1855)
Co-authored-by: deyzhong <deyzhong@tencent.com>
2020-07-22 17:40:19 +08:00
lw0090
1ec89e9a94 [HUDI-839] Introducing support for rollbacks using marker files (#1756)
* [HUDI-839] Introducing rollback strategy using marker files

 - Adds a new mechanism for rollbacks where it's based on the marker files generated during the write
 - Consequently, marker file/dir deletion now happens post commit, instead of during finalize 
 - Marker files are also generated for AppendHandle, making it consistent throughout the write path 
 - Until upgrade-downgrade mechanism can upgrade non-marker based inflight writes to marker based, this should only be turned on for new datasets.
 - Added marker dir deletion after successful commit/rollback, individual files are not deleted during finalize
 - Fail safe for deleting marker directories, now during timeline archival process
 - Added check to ensure completed instants are not rolled back using marker based strategy. This will be incorrect
 - Reworked tests to rollback inflight instants, instead of completed instants whenever necessary
 - Added an unit test for MarkerBasedRollbackStrategy


Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-07-20 22:41:42 -07:00
Udit Mehrotra
1aae437257 [HUDI-1102] Add common useful Spark related and Table path detection utilities (#1841)
Co-authored-by: Mehrotra <uditme@amazon.com>
2020-07-18 16:16:32 -07:00
Raymond Xu
3b9a30528b [HUDI-996] Add functional test suite for hudi-utilities (#1746)
- Share resources for functional tests
- Add suite for functional test classes from hudi-utilities
2020-07-05 16:44:31 -07:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
hongdd
f3a701757b [HUDI-696] Add unit test for CommitsCommand (#1724) 2020-06-18 21:42:13 +08:00
Balaji Varadarajan
fb283934a3 [HUDI-990] Timeline API : filterCompletedAndCompactionInstants needs to handle requested state correctly. Also ensure timeline gets reloaded after we revert committed transactions 2020-06-04 02:52:21 -07:00
cxzl25
7c59095314 [HUDI-975] Add unit tests in TestHoodieTableFileSystemView to test view for non-partitioned table (#1692) 2020-06-01 07:23:28 -07:00
Sivabalan Narayanan
5a0d3f1cf9 [HUDI-786] Fixing read beyond inline length in InlineFS (#1616) 2020-05-28 12:59:11 -07:00
Raymond Xu
03f136361a [HUDI-811] Restructure test packages in hudi-common (#1644)
* [HUDI-811] Restructure test packages in hudi-common
2020-05-27 16:28:17 +08:00
hongdd
802d16c8c9 [HUDI-707] Add unit test for StatsCommand (#1645) 2020-05-21 18:28:04 +08:00
Pratyaksh Sharma
6a0aa9a645 [HUDI-803] Replaced used of NullNode with JsonProperties.NULL_VALUE in HoodieAvroUtils (#1538)
- added more test cases in TestHoodieAvroUtils.class

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-05-20 09:04:43 -07:00
Sivabalan Narayanan
29edf4b3b8 [HUDI-407] Adding Simple Index to Hoodie. (#1402)
This index finds the location by joining incoming records with records from base files.
2020-05-17 18:32:24 -07:00
Alexander Filipchik
83796b3189 [HUDI-793] Adding proper default to hudi metadata fields and proper handling to rewrite routine (#1513)
* Adding proper default to hudi metadata fields and proper handling to rewrite routine
* Handle fields declared with a null default

Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com>
2020-05-13 18:04:38 -07:00
liujinhui
32ea4c70ff [HUDI-869] Add support for alluxio (#1608) 2020-05-13 21:00:34 +08:00
vinoth chandar
f92b9fdcc4 [MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics (#1606) 2020-05-10 19:23:26 -04:00
Raymond Xu
366bb10d8c [HUDI-812] Migrate hudi common tests to JUnit 5 (#1590)
* [HUDI-812] Migrate hudi-common tests to JUnit 5
2020-05-06 19:15:20 +08:00
vinoth chandar
c4b71622b9 [MINOR] Reorder HoodieTimeline#compareTimestamp arguments for better readability (#1575)
- reads nicely as (instantTime1, GREATER_THAN_OR_EQUALS, instantTime2) etc
2020-04-30 09:19:39 -07:00
Raymond Xu
06dae30297 [HUDI-810] Migrate ClientTestHarness to JUnit 5 (#1553) 2020-04-28 23:38:16 +08:00
vinoth chandar
19ca0b5629 [HUDI-785] Refactor compaction/savepoint execution based on ActionExector abstraction (#1548)
- Savepoint and compaction classes moved to table.action.* packages
 - HoodieWriteClient#savepoint(...) returns void
 - Renamed HoodieCommitArchiveLog -> HoodieTimelineArchiveLog
 - Fixed tests to take into account the additional validation done
 - Moved helper code into CompactHelpers and SavepointHelpers
2020-04-25 18:26:44 -07:00