128 Commits

Author SHA1 Message Date
n3nash
01a1d7997b [HUDI-1712] Rename & standardize config to match other configs (#2708) 2021-03-24 17:24:02 +08:00
n3nash
d7b18783bd [HUDI-1709] Improving config names and adding hive metastore uri config (#2699) 2021-03-22 01:22:06 -07:00
n3nash
74241947c1 [HUDI-845] Added locking capability to allow multiple writers (#2374)
* [HUDI-845] Added locking capability to allow multiple writers
1. Added LockProvider API for pluggable lock methodologies
2. Added Resolution Strategy API to allow for pluggable conflict resolution
3. Added TableService client API to schedule table services
4. Added Transaction Manager for wrapping actions within transactions
2021-03-16 16:43:53 -07:00
Balajee Nagasubramaniam
d8af24d8a2 [HUDI-1635] Improvements to Hudi Test Suite (#2628) 2021-03-09 13:29:38 -08:00
pengzhiwei
bc883db5de [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (#2596) 2021-03-05 14:10:27 +08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move cleaning failed commits logic into cleaner
2. Introduce hoodie heartbeat to ascertain failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
Sivabalan Narayanan
c9fcf964b2 [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534) 2021-02-20 09:54:26 +08:00
Sivabalan Narayanan
d5f202821b Adding fixes to test suite framework. Adding clustering node and validate async operations node. (#2400) 2021-02-12 09:29:21 -08:00
cooper
048633da1a [MINOR] Improve code readability, remove the continue keyword (#2459) 2021-01-22 13:47:14 +08:00
vinoth chandar
5ca0625b27 [HUDI 1308] Harden RFC-15 Implementation based on production testing (#2441)
Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.

* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
 - Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
 - Consequently, IOException is never thrown and many files had to be adjusted
- More diligent handling of open file handles in metadata table
 - Added config for controlling reuse of connections
 - Added config for turning off fallback to listing, so we can see tests fail
 - Changed all ipf listing code to cache/amortize the open/close for better performance
 - Timelineserver also reuses connections, for better performance
 - Without timelineserver, when metadata table is opened from executors, reuse is not allowed
 - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument.
 -  Fix TestHoodieBackedTableMetadata#testSync
2021-01-19 21:20:28 -08:00
n3nash
749f657856 [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (#2424) 2021-01-14 12:47:50 -08:00
Udit Mehrotra
7ce3ac778e [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths (#2417)
* [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths

* Adding testClass for FileSystemBackedTableMetadata

Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
2021-01-10 21:19:52 -08:00
Udit Mehrotra
4e64226844 [HUDI-1450] Use metadata table for listing in HoodieROTablePathFilter (apache#2326)
[HUDI-1394] [RFC-15] Use metadata table (if present) to get all partition paths (apache#2351)
2021-01-04 07:59:47 -08:00
Balajee Nagasubramaniam
e33a8f733c [HUDI-1147] Modify GenericRecordFullPayloadGenerator to generate vali… (#2045)
* [HUDI-1147] Modify GenericRecordFullPayloadGenerator to generate valid timestamps

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-29 16:33:19 -05:00
Sivabalan Narayanan
8cf6a7223f [HUDI-1331] Adding support for validating entire dataset and long running tests in test suite framework (#2168)
* trigger rebuild

* [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (#1927)

* Adding support for validating records and long running tests in test suite framework

* Adding partial validate node

* Fixing spark session initiation in Validate nodes

* Fixing validation

* Adding hive table validation to ValidateDatasetNode

* Rebasing with latest commits from master

* Addressing feedback

* Addressing comments

Co-authored-by: lamber-ken <lamberken@163.com>
Co-authored-by: linshan-ma <mabin194046@163.com>
2020-12-26 09:29:24 -08:00
Sivabalan Narayanan
33d338f392 [HUDI-115] Adding DefaultHoodieRecordPayload to honor ordering with combineAndGetUpdateValue (#2311)
* Added ability to pass in `properties` to payload methods, so they can perform table/record specific merges
* Added default methods so existing payload classes are backwards compatible. 
* Adding DefaultHoodiePayload to honor ordering while merging two records
* Fixing default payload based on feedback
2020-12-19 19:19:42 -08:00
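The merge behaviour described in the commit above (33d338f392) can be illustrated with a minimal sketch. It is not Hudi's `DefaultHoodieRecordPayload` code: the `Rec` class, its fields, and the `merge` helper are hypothetical names introduced here, and the tie-breaking rule (an incoming record with an equal ordering value replaces the stored one) is an assumption.

```java
public class OrderingMergeSketch {
    // Hypothetical record shape: a key, an ordering value (for example an event time), and a payload.
    public static final class Rec {
        public final String key;
        public final long orderingValue;
        public final String payload;

        public Rec(String key, long orderingValue, String payload) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.payload = payload;
        }
    }

    // Keeps the stored record only when its ordering value is strictly higher;
    // otherwise the incoming record wins (assumed tie-breaking rule).
    public static Rec merge(Rec stored, Rec incoming) {
        return stored.orderingValue > incoming.orderingValue ? stored : incoming;
    }

    public static void main(String[] args) {
        Rec stored = new Rec("id-1", 20L, "written at t=20");
        Rec lateArrival = new Rec("id-1", 10L, "written at t=10");
        // The late-arriving record has a lower ordering value, so the stored one is kept.
        System.out.println(merge(stored, lateArrival).payload);
    }
}
```

Without an ordering check, the record that arrives last would overwrite the stored one regardless of which is logically newer.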
Balajee Nagasubramaniam
5388c7f7a3 [HUDI-1470] Use the latest writer schema, when reading from existing parquet files in the hudi-test-suite (#2344) 2020-12-18 19:18:52 +08:00
Danny Chan
4bc45a391a [HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313) 2020-12-10 20:02:02 +08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
wangxianghu
a23230c8c2 [HUDI-1400] Replace Operation enum with WriteOperationType (#2259) 2020-11-19 13:40:04 +08:00
Sivabalan Narayanan
a205dd10fa [HUDI-1338] Adding Delete support to test suite framework (#2172)
- Adding Delete support to test suite.
  - Added DeleteNode
  - Added support to generate delete records
2020-11-01 00:15:41 -04:00
Prashant Wason
6310a2307a [HUDI-1351] Improvements to the hudi test suite for scalability and repeated testing. (#2197)
1. Added the --clean-input and --clean-output parameters to clean the input and output directories before starting the job
2. Added the --delete-old-input parameter to delete older batches for data already ingested. This helps keep the number of redundant files low.
3. Added the --input-parallelism parameter to restrict the parallelism when generating input data. This helps keep the number of generated input files low.
4. Added an option start_offset to Dag Nodes. Without the ability to specify start offsets, data is generated into existing partitions. With a start offset, the DAG can control which partition the data is written to.
5. Fixed generation of records for correct number of partitions
  - In the existing implementation, the partition is chosen as a random long. This does not guarantee that the exact number of requested partitions is created.
6. Changed variable blacklistedFields to be a Set as that is faster than List for membership checks.
7. Fixed integer division for Math.ceil. If two integers are divided, the result is not a double unless one of the integers is cast to double.
2020-10-29 06:50:37 -07:00
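Item 7 of the commit above (6310a2307a) describes a common Java pitfall, sketched below. The class and method names are hypothetical and are not taken from the hudi-test-suite code. The quotient of two `int` operands is truncated before `Math.ceil` is applied, so nothing is rounded up unless one operand is cast to `double` first.

```java
public class CeilDivisionDemo {
    // Integer division truncates first, so Math.ceil receives a whole number.
    public static int batchesTruncated(int totalRecords, int recordsPerBatch) {
        return (int) Math.ceil(totalRecords / recordsPerBatch);
    }

    // Casting one operand to double keeps the fractional part for Math.ceil.
    public static int batchesRoundedUp(int totalRecords, int recordsPerBatch) {
        return (int) Math.ceil((double) totalRecords / recordsPerBatch);
    }

    public static void main(String[] args) {
        System.out.println(batchesTruncated(10, 4)); // prints 2
        System.out.println(batchesRoundedUp(10, 4)); // prints 3
    }
}
```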
n3nash
e109a61803 1. Fix merge on read DAG to make docker demo pass (#2092)
2. Fix repeat_count, rollback node
2020-10-28 22:34:26 -04:00
Prashant Wason
49e855c348 [HUDI-1326] Added an API to force publish metrics and flush them. (#2152)
* [HUDI-1326] Added an API to force publish metrics and flush them.

Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite.

* Code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-24 16:47:24 -07:00
Prashant Wason
788d236c44 [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upserts, allow more granular control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
   - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
2020-10-07 08:33:51 -04:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() have been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes in `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
vinoyang
5aaaf8bff1 [MINOR] Change the log level of the dag scheduler for the test suite (#2134) 2020-09-30 17:17:44 +08:00
hongdd
2eaba0962a [HUDI-544] Archived commits command code cleanup (#1242)
* Archived commits command code cleanup
2020-09-25 09:36:41 -07:00
wenningd
d37977b310 [MINOR] Remove useless config for bootstrap integ testing (#2102)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-09-22 13:29:59 -07:00
Balajee Nagasubramaniam
fec7cd3c97 [HUDI-1130] hudi-test-suite support for schema evolution (can be triggered on any insert/upsert DAG node). 2020-09-08 22:43:59 -07:00
Abhishek Modi
53d1e55110 Test Suite should work with Docker + Unit Tests 2020-09-08 22:41:14 -07:00
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
The delete API does not use "hoodie.delete.shuffle.parallelism", whereas upsert uses "hoodie.upsert.shuffle.parallelism". In certain cases this creates a performance difference between deleting through the upsert API with "EmptyHoodieRecordPayload" and deleting through the delete API.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have an implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
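The key/value layout and lookup pattern described in the commit above (6461927eac) can be sketched with an in-memory sorted map. This is only an illustration of why sorted key-value storage supports point and range lookups. It does not use the HFile, HFileWriter, or HFileReader APIs, the class and method names are hypothetical, and it uses `String` keys and placeholder payload bytes where the commit uses record-key bytes and Avro-encoded records.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedKeyValueSketch {
    // In-memory stand-in for a file whose entries are kept sorted by record key.
    private final TreeMap<String, byte[]> entries = new TreeMap<>();

    public void put(String recordKey, byte[] encodedRecord) {
        entries.put(recordKey, encodedRecord);
    }

    // Point lookup: finds one record without scanning the others.
    public byte[] get(String recordKey) {
        return entries.get(recordKey);
    }

    // Range lookup: all entries with startKey <= key < endKey, in key order.
    public SortedMap<String, byte[]> range(String startKey, String endKey) {
        return entries.subMap(startKey, endKey);
    }

    public static void main(String[] args) {
        SortedKeyValueSketch store = new SortedKeyValueSketch();
        for (String key : new String[] {"key-003", "key-001", "key-002", "key-004"}) {
            // Placeholder bytes; the commit stores an Avro-encoded GenericRecord here.
            store.put(key, ("payload-for-" + key).getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(store.get("key-002"), StandardCharsets.UTF_8));
        System.out.println(store.range("key-002", "key-004").keySet());
    }
}
```

A columnar format such as Parquet serves the opposite access pattern, which is the trade-off the commit lists under its limitations.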
Mathieu
fa81248247 [HUDI-531] Add java doc for hudi test suite general classes (#1900) 2020-08-28 08:44:40 +08:00
Mathieu
df8f099c99 [HUDI-532] Add java doc for the test classes of hudi test suite (#1901) 2020-08-26 08:49:01 +08:00
Satish Kotha
ea983ff912 [HUDI-1137] Add option to configure different path selector 2020-08-24 13:26:44 -07:00
Balaji Varadarajan
b8f4a30efd Fix Integration test flakiness in HoodieJavaStreamingApp (#1967) 2020-08-14 01:42:15 -07:00
Sivabalan Narayanan
9c24151929 [HUDI-1175] Commenting out testsuite tests from Integration tests until we investigate the CI flakiness (#1945) 2020-08-10 21:00:57 -07:00
Yungthuis
8b66524090 [MINOR] Remove unused import (#1932)
Co-authored-by: tom_glb <goodMorning_glb@hotmail.com>
2020-08-08 12:04:31 +08:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Balaji Varadarajan
61e027fadd [MINOR] Adding timeout for each command execution in docker and capture output. This will help get stdout/stderr of stuck commands (#1918) 2020-08-05 07:46:34 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Bhavani Sudha Saktheeswaran
4ebd2db05b [MINOR] Suppressing full hive log and fetching only exceptions with context (#1903)
Co-authored-by: Bhavani Sudha Saktheeswaran <bsaktheeswaran@moveworks.ai>
2020-08-02 19:44:51 -07:00
Mathieu
30dcd5cf06 [MINOR] Remove redundant import in hudi-integ-test (#1899) 2020-08-02 21:30:23 +08:00
n3nash
727f1df62c [MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894) 2020-07-31 19:01:25 -07:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
hongdd
fa419213f6 [HUDI-703] Add test for HoodieSyncCommand (#1774) 2020-07-28 08:31:43 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creating specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardcoded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
leesf
492f324bc7 [HUDI-938] Removing incubating/incubator from project (#1658) 2020-05-24 18:28:13 +08:00
Mathieu
25a0080b2f [HUDI-714] Add javadoc and comments to hudi write method link (#1409)
* [HUDI-714] Add javadoc and comments to hudi write method link
2020-05-16 08:36:51 -04:00