Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage
## Brief change log
update current value for several fields that you want to change.
The default payload OverwriteWithLatestAvroPayload overwrite the whole record when
compared to `orderingVal`.This doesn't meet our need when we just want to change specified fields.
For example: (suppose Default value is null)
```
current Value
Field: name age gender
Value: karl 20 male
```
```
insert Value
Field: name age gender
Value: null 30 null
```
```
After insert:
Field: name age gender
Value: karl 30 male
```
## Verify this pull request
Added TestOverwriteNonDefaultsWithLatestAvroPayload to verify the change.
* [HUDI-1181] Fix decimal type display issue for record key field
* Remove getNestedFieldVal method from DataSourceUtils
* resolve comments
Co-authored-by: Wenning Ding <wenningd@amazon.com>
When unit tests are run on shared machines (e.g. jenkins cluster), the unit tests sometimes fail due to BindException in starting HDFS Cluster. This is because the port chosen may have been bound by another process using the same machine. The fix is to retry the port selection a few times.
- Introduce HoodieWriteableTestTable for writing records into files
- Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable
- Adopt HoodieWrittableTestTable for test cases in
- ITTestRepairsCommand.java
- TestHoodieIndex.java
- TestHoodieKeyLocationFetchHandle.java
- TestHoodieGlobalBloomIndex.java
- TestHoodieBloomIndex.java
- Renamed HoodieTestTable and FileCreateUtils APIs
- dataFile changed to baseFile
* [HUDI-960] Implementation of the HFile base and log file format.
1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)
Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)
HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.
Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning
Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
- Breaking up limitFileSize(), per parquet and hfile base files
- Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)
* Fixing checkstyle
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- Adding ability to use native spark row writing for bulk_insert
- Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
- Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
- Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
- Fixed all built-in key generators with new APIs
- Made the field position map lazily created upon the first call to row based apis
- Implemented native row based key generators for CustomKeyGenerator
- Fixed all the tests, with these new APIs
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- This pull request adds upgrade/downgrade infra for smooth transition from list based rollback to marker based rollback*
- A new property called hoodie.table.version is added to hoodie.properties file as part of this. Whenever hoodie is launched with newer table version i.e 1(or moving from pre 0.6.0 to 0.6.0), an upgrade step will be executed automatically to adhere to marker based rollback.*
- This automatic upgrade step will happen just once per dataset as the hoodie.table.version will be updated in property file after upgrade is completed once*
- Similarly, a command line tool for Downgrading is added if incase some user wants to downgrade hoodie from table version 1 to 0 or move from hoodie 0.6.0 to pre 0.6.0*
- *Added UpgradeDowngrade to assist in upgrading or downgrading hoodie table*
- *Added Interfaces for upgrade and downgrade and concrete implementations for upgrading from 0 to 1 and downgrading from 1 to 0.*
- *Made some changes to ListingBasedRollbackHelper to expose just rollback stats w/o performing actual rollback, which will be consumed by Upgrade infra*
- Reworking failure handling for upgrade/downgrade
- Changed tests accordingly, added one test around left over cleanup
- New tables now write table version into hoodie.properties
- Clean up code naming, abstractions.
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
- This PR implements Spark Datasource for MOR table in the RDD approach.
- Implemented SnapshotRelation
- Implemented HudiMergeOnReadRDD
- Implemented separate Iterator to handle merge and unmerge record reader.
- Added TestMORDataSource to verify this feature.
- Clean up test file name, add tests for mixed query type tests
- We can now revert the change made in DefaultSource
Co-authored-by: Vinoth Chandar <vchandar@confluent.io>
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables
Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
* [HUDI-839] Introducing rollback strategy using marker files
- Adds a new mechanism for rollbacks where it's based on the marker files generated during the write
- Consequently, marker file/dir deletion now happens post commit, instead of during finalize
- Marker files are also generated for AppendHandle, making it consistent throughout the write path
- Until upgrade-downgrade mechanism can upgrade non-marker based inflight writes to marker based, this should only be turned on for new datasets.
- Added marker dir deletion after successful commit/rollback, individual files are not deleted during finalize
- Fail safe for deleting marker directories, now during timeline archival process
- Added check to ensure completed instants are not rolled back using marker based strategy. This will be incorrect
- Reworked tests to rollback inflight instants, instead of completed instants whenever necessary
- Added an unit test for MarkerBasedRollbackStrategy
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Notable changes:
1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
4. HiveSyncTool accepts the base file format as a CLI parameter
5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
6. HoodieSparkSqlWriter accepts the base file format as a parameter
add hoodie delta streamer mock source example and dfs source and kafka source examples
Signed-off-by: dengziming <dengziming1993@gmail.com>
add defaultSparkConf utils method
change version of hudi-examples to 0.5.2-SNAPSHOT
change the artifcatId of hudi-spark and hudi-utilities
alter some code to adapt kafka2.0
Update scritps
Add license