1
0
Commit Graph

93 Commits

Author SHA1 Message Date
rmpifer
fed01cd3c9 [MINOR] Update spark master default to yarn (#2148) 2020-10-05 15:22:28 -07:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
dugenkui
ae68b2b355 [MINOR] fix typos (#2116) 2020-09-26 20:40:33 +08:00
hongdd
2eaba0962a [HUDI-544] Archived commits command code cleanup (#1242)
* Archived commits command code cleanup
2020-09-25 09:36:41 -07:00
Pratyaksh Sharma
73e5b4c7bb [HUDI-796] Add deduping logic for upserts case (#1558) 2020-09-18 19:37:52 +08:00
Sivabalan Narayanan
379cf0786f [HUDI-1013] Adding Bulk Insert V2 implementation (#1834)
- Adding ability to use native spark row writing for bulk_insert
 - Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
 - Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
 - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
 - Fixed all built-in key generators with new APIs
 - Made the field position map lazily created upon the first call to row based apis
 - Implemented native row based key generators for CustomKeyGenerator
 - Fixed all the tests, with these new APIs

Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-13 00:33:39 -07:00
Sivabalan Narayanan
ff53e8f0b6 [HUDI-1014] Adding Upgrade and downgrade infra for smooth transitioning from list based rollback to marker based rollback (#1858)
- This pull request adds upgrade/downgrade infra for smooth transition from list based rollback to marker based rollback*
 - A new property called hoodie.table.version is added to hoodie.properties file as part of this. Whenever hoodie is launched with newer table version i.e 1(or moving from pre 0.6.0 to 0.6.0), an upgrade step will be executed automatically to adhere to marker based rollback.*
 - This automatic upgrade step will happen just once per dataset as the hoodie.table.version will be updated in property file after upgrade is completed once*
 - Similarly, a command line tool for Downgrading is added if incase some user wants to downgrade hoodie from table version 1 to 0 or move from hoodie 0.6.0 to pre 0.6.0*
 - *Added UpgradeDowngrade to assist in upgrading or downgrading hoodie table*
 - *Added Interfaces for upgrade and downgrade and concrete implementations for upgrading from 0 to 1 and downgrading from 1 to 0.*
 - *Made some changes to ListingBasedRollbackHelper to expose just rollback stats w/o performing actual rollback, which will be consumed by Upgrade infra*
- Reworking failure handling for upgrade/downgrade
 - Changed tests accordingly, added one test around left over cleanup
 - New tables now write table version into hoodie.properties
 - Clean up code naming, abstractions.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-09 15:32:43 -07:00
wenningd
9fe2d2b14a [HUDI-427] [HUDI-971] Implement CLI support for performing bootstrap (#1869)
* [HUDI-971] Clean partitions & fileIds returned by HFileBootstrapIndex
* [HUDI-427] Implement CLI support for performing bootstrap

Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
2020-08-08 12:37:29 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
hongdd
fa419213f6 [HUDI-703] Add test for HoodieSyncCommand (#1774) 2020-07-28 08:31:43 +08:00
hongdd
12ef8c9249 [HUDI-708] Add temps show and unit test for TempViewCommand (#1770) 2020-07-23 08:43:46 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
hongdd
f3a701757b [HUDI-696] Add unit test for CommitsCommand (#1724) 2020-06-18 21:42:13 +08:00
hongdd
5099a91edd [HUDI-709] Add unit test for UtilsCommand (#1686) 2020-06-18 19:54:14 +08:00
hongdd
fcabc8fbca [HUDI-1019] Clean refresh command in CLI (#1725) 2020-06-14 14:30:28 +08:00
hongdd
802d16c8c9 [HUDI-707] Add unit test for StatsCommand (#1645) 2020-05-21 18:28:04 +08:00
rolandjohann
244d47494e [HUDI-888] fix NullPointerException in HoodieCompactor (#1622) 2020-05-20 04:22:35 -07:00
hongdd
161a798337 [HUDI-706] Add unit test for SavepointsCommand (#1624) 2020-05-19 18:36:01 +08:00
hongdd
57132f79bb [HUDI-705] Add unit test for RollbacksCommand (#1611) 2020-05-18 14:04:06 +08:00
hongdd
3a2fe13fcb [HUDI-701] Add unit test for HDFSParquetImportCommand (#1574) 2020-05-14 19:15:49 +08:00
Balaji Varadarajan
8d0e23173b [HUDI-820] cleaner repair command should only inspect clean metadata files (#1542) 2020-05-11 09:25:54 +08:00
hongdd
f921469afc [HUDI-704] Add test for RepairsCommand (#1554) 2020-05-07 23:02:28 +08:00
vinoth chandar
c4b71622b9 [MINOR] Reorder HoodieTimeline#compareTimestamp arguments for better readability (#1575)
- reads nicely as (instantTime1, GREATER_THAN_OR_EQUALS, instantTime2) etc
2020-04-30 09:19:39 -07:00
hongdd
9059bce977 [HUDI-702] Add test for HoodieLogFileCommand (#1522) 2020-04-29 18:47:27 +08:00
vinoth chandar
19ca0b5629 [HUDI-785] Refactor compaction/savepoint execution based on ActionExector abstraction (#1548)
- Savepoint and compaction classes moved to table.action.* packages
 - HoodieWriteClient#savepoint(...) returns void
 - Renamed HoodieCommitArchiveLog -> HoodieTimelineArchiveLog
 - Fixed tests to take into account the additional validation done
 - Moved helper code into CompactHelpers and SavepointHelpers
2020-04-25 18:26:44 -07:00
Prashant Wason
62bd3e7ded [HUDI-757] Added hudi-cli command to export metadata of Instants.
Example:
hudi:db.table-> export instants --localFolder /tmp/ --limit 5 --actions clean,rollback,commit --desc false
2020-04-21 12:41:19 -07:00
Prashant Wason
19d29ac7d0 [HUDI-741] Added checks to validate Hoodie's schema evolution.
HUDI specific validation of schema evolution should ensure that a newer schema can be used for the dataset by checking that the data written using the old schema can be read using the new schema.

Code changes:

1. Added a new config in HoodieWriteConfig to enable schema validation check (disabled by default)
2. Moved code that reads schema from base/log files into hudi-common from hudi-hive-sync
3. Added writerSchema to the extraMetadata of compaction commits in MOR table. This is same as that for commits on COW table.

Testing changes:

4. Extended TestHoodieClientBase to add insertBatch API which allows inserting a new batch of unique records into a HUDI table
5. Added a unit test to verify schema evolution for both COW and MOR tables.
6. Added unit tests for schema compatiblity checks.
2020-04-15 23:34:59 -07:00
hongdd
644c1cc8bd [HUDI-698]Add unit test for CleansCommand (#1449) 2020-04-14 17:54:47 +08:00
vinoth chandar
661b0b3bab [HUDI-761] Refactoring rollback and restore actions using the ActionExecutor abstraction (#1492)
- rollback() and restore() table level APIs introduced
- Restore is implemented by wrapping calls to rollback executor
- Existing tests transparently cover this, since its just a refactor
2020-04-13 08:29:19 -07:00
hongdd
a464a2972e [HUDI-700]Add unit test for FileSystemViewCommand (#1490) 2020-04-11 10:12:21 +08:00
hongdd
4e5c8671ef [HUDI-740]Fix can not specify the sparkMaster and code clean for SparkUtil (#1452) 2020-04-08 21:33:15 +08:00
Shaofeng Shi
78b3194e82 [HUDI-751] Fix some coding issues reported by FindBugs (#1470) 2020-03-31 21:19:32 +08:00
lamber-ken
dbc9acd23a [HUDI-716] Exception: Not an Avro data file when running HoodieCleanClient.runClean (#1432) 2020-03-30 11:19:17 -07:00
Suneel Marthi
fa36082554 [HUDI-746] Reduce build warnings < 10 (#1465) 2020-03-30 11:46:52 +08:00
vinoth chandar
e057c27603 [HUDI-744] Restructure hudi-common and clean up files under util packages (#1462)
- Brings more order and cohesion to the classes in hudi-common
 - Utils classes related to a particular concept (avro, timeline,...) are placed near to the package
 - common.fs package now contains all the filesystem level classes including wrapper filesystem
 - bloom.filter package renamed to just bloom
 - config package contains classes that help store properties
 - common.fs.inline package contains all the inline filesystem classes/impl
 - common.table.timeline now consolidates all timeline related classes
 - common.table.view consolidates all the classes related to filesystem view metadata
 - common.table.timeline.versioning contains all classes related to versioning of timeline
 - Fix few unit tests as a result
 - Moved the test packages around to match the source file move
 - Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
2020-03-29 10:58:49 -07:00
leesf
07c3c5d797 [HUDI-679] Make io package Spark free (#1460)
* [HUDI-679] Make io package Spark free
2020-03-29 16:54:00 +08:00
Suneel Marthi
8c3001363d HUDI-479: Eliminate or Minimize use of Guava if possible (#1159) 2020-03-28 03:11:32 -04:00
Zhiyuan Zhao
0241b21f77 [HUDI-65] commitTime rename to instantTime (#1431) 2020-03-22 18:06:00 -07:00
hongdd
3ef9e885ca [HUDI-715] Fix duplicate name in TableCommand (#1410) 2020-03-16 17:19:57 +08:00
hongdd
0f892ef62c [HUDI-692] Add delete savepoint for cli (#1397)
* Add delete savepoint for cli
* Add check
* Move JavaSparkContext to try
2020-03-11 16:49:02 -07:00
satishkotha
7194514aff [HUDI-689] Change CLI command names to not have overlap (#1392) 2020-03-11 16:29:54 -07:00
Satish Kotha
3d3781810c [CLI] Add export to table 2020-03-06 08:53:23 -08:00
vinoth chandar
71170fafe7 [HUDI-554] Cleanup package structure in hudi-client (#1346)
- Just package, class moves and renames with the following intent
 - `client` now has all the various client classes, that do the transaction management
 - `func` renamed to `execution` and some helpers moved to `client/utils`
 - All compaction code under `io` now under `table/compact`
 - Rollback code under `table/rollback` and in general all code for individual operations under `table`
 - `exception` `config`, `metrics` left untouched
 - Moved the tests also accordingly
 - Fixed some flaky tests
2020-02-27 08:05:58 -08:00
Suneel Marthi
078d4825d9 [HUDI-624]: Split some of the code from PR for HUDI-479 (#1344) 2020-02-21 14:22:21 +08:00
Satish Kotha
20ed2516d3 [HUDI-571] Add show archived compaction(s) to CLI 2020-02-14 10:58:28 -08:00
lamber-ken
01c868ab86 [HUDI-574] Fix CLI counts small file inserts as updates (#1321) 2020-02-13 22:20:58 +08:00
Satish Kotha
63b42166b1 CLI - add option to print additional commit metadata 2020-02-12 14:11:24 -08:00
Satish Kotha
462fd02556 [HUDI-571] Add 'commits show archived' command to CLI 2020-02-05 11:25:34 -08:00
lamber-ken
46842f4e92 [MINOR] Remove the declaration of thrown RuntimeException (#1305) 2020-02-05 23:23:20 +08:00
Balaji Varadarajan
923e2b4a1e [HUDI-535] Ensure Compaction Plan is always written in .aux folder to avoid 0.5.0/0.5.1 reader-writer compatibility issues (#1229) 2020-01-17 10:56:35 -08:00