1
0
Commit Graph

80 Commits

Author SHA1 Message Date
Balajee Nagasubramaniam
e33a8f733c [HUDI-1147] Modify GenericRecordFullPayloadGenerator to generate vali… (#2045)
* [HUDI-1147] Modify GenericRecordFullPayloadGenerator to generate valid timestamps

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-29 16:33:19 -05:00
Sivabalan Narayanan
8cf6a7223f [HUDI-1331] Adding support for validating entire dataset and long running tests in test suite framework (#2168)
* trigger rebuild

* [HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (#1927)

* Adding support for validating records and long running tests in test sutie framework

* Adding partial validate node

* Fixing spark session initiation in Validate nodes

* Fixing validation

* Adding hive table validation to ValidateDatasetNode

* Rebasing with latest commits from master

* Addressing feedback

* Addressing comments

Co-authored-by: lamber-ken <lamberken@163.com>
Co-authored-by: linshan-ma <mabin194046@163.com>
2020-12-26 09:29:24 -08:00
Sivabalan Narayanan
33d338f392 [HUDI-115] Adding DefaultHoodieRecordPayload to honor ordering with combineAndGetUpdateValue (#2311)
* Added ability to pass in `properties` to payload methods, so they can perform table/record specific merges
* Added default methods so existing payload classes are backwards compatible. 
* Adding DefaultHoodiePayload to honor ordering while merging two records
* Fixing default payload based on feedback
2020-12-19 19:19:42 -08:00
Balajee Nagasubramaniam
5388c7f7a3 [HUDI-1470] Use the latest writer schema, when reading from existing parquet files in the hudi-test-suite (#2344) 2020-12-18 19:18:52 +08:00
Danny Chan
4bc45a391a [HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313) 2020-12-10 20:02:02 +08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
wangxianghu
a23230c8c2 [HUDI-1400] Replace Operation enum with WriteOperationType (#2259) 2020-11-19 13:40:04 +08:00
Sivabalan Narayanan
a205dd10fa [HUDI-1338] Adding Delete support to test suite framework (#2172)
- Adding Delete support to test suite. 
         Added DeleteNode 
         Added support to generate delete records
2020-11-01 00:15:41 -04:00
Prashant Wason
6310a2307a [HUDI-1351] Improvements to the hudi test suite for scalability and repeated testing. (#2197)
1. Added the --clean-input and --clean-output parameters to clean the input and output directories before starting the job
2. Added the --delete-old-input parameter to deleted older batches for data already ingested. This helps keep number of redundant files low.
3. Added the --input-parallelism parameter to restrict the parallelism when generating input data. This helps keeping the number of generated input files low.
4. Added an option start_offset to Dag Nodes. Without ability to specify start offsets, data is generated into existing partitions. With start offset, DAG can control on which partition, the data is to be written.
5. Fixed generation of records for correct number of partitions
  - In the existing implementation, the partition is chosen as a random long. This does not guarantee exact number of requested partitions to be created.
6. Changed variable blacklistedFields to be a Set as that is faster than List for membership checks.
7. Fixed integer division for Math.ceil. If two integers are divided, the result is not double unless one of the integer is casted to double.
2020-10-29 06:50:37 -07:00
n3nash
e109a61803 1. Fix merge on read DAG to make docker demo pass (#2092)
1. Fix merge on read DAG to make docker demo pass (#2092)
2. Fix repeat_count, rollback node
2020-10-28 22:34:26 -04:00
Prashant Wason
49e855c348 [HUDI-1326] Added an API to force publish metrics and flush them. (#2152)
* [HUDI-1326] Added an API to force publish metrics and flush them.

Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite.

* Code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-24 16:47:24 -07:00
Prashant Wason
788d236c44 [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upsets, allows more granualar control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
   - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
2020-10-07 08:33:51 -04:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
vinoyang
5aaaf8bff1 [MINOR] Change the log level of the dag scheduler for the test suite (#2134) 2020-09-30 17:17:44 +08:00
vinoyang
c0c0095fa9 [MINOR] Reformat prepare_integration_suite script (#2126) 2020-09-28 14:12:57 -07:00
hongdd
2eaba0962a [HUDI-544] Archived commits command code cleanup (#1242)
* Archived commits command code cleanup
2020-09-25 09:36:41 -07:00
wenningd
d37977b310 [MINOR] Remove useless config for bootstrap integ testing (#2102)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-09-22 13:29:59 -07:00
Balajee Nagasubramaniam
fec7cd3c97 [HUDI-1130] hudi-test-suite support for schema evolution (can be triggered on any insert/upsert DAG node). 2020-09-08 22:43:59 -07:00
Abhishek Modi
53d1e55110 Test Suite should work with Docker + Unit Tests 2020-09-08 22:41:14 -07:00
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Mathieu
fa81248247 [HUDI-531] Add java doc for hudi test suite general classes (#1900) 2020-08-28 08:44:40 +08:00
hongdd
dedc4517dd [HUDI-978] Specify version information for each component separately (#1772) 2020-08-26 21:08:09 +08:00
Mathieu
df8f099c99 [HUDI-532] Add java doc for the test classes of hudi test suite (#1901) 2020-08-26 08:49:01 +08:00
Satish Kotha
ea983ff912 [HUDI-1137] Add option to configure different path selector 2020-08-24 13:26:44 -07:00
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
Balaji Varadarajan
b8f4a30efd Fix Integration test flakiness in HoodieJavaStreamingApp (#1967) 2020-08-14 01:42:15 -07:00
Sivabalan Narayanan
9c24151929 [HUDI-1175] Commenting out testsuite tests from Integration tests until we investigate the CI flakiness (#1945) 2020-08-10 21:00:57 -07:00
Yungthuis
8b66524090 [MINOR] Remove unused import (#1932)
Co-authored-by: tom_glb <goodMorning_glb@hotmail.com>
2020-08-08 12:04:31 +08:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Balaji Varadarajan
61e027fadd [MINOR] Adding timeout for each command execution in docker and capture output. This will help get stdout/stderr of stuck commands (#1918) 2020-08-05 07:46:34 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Bhavani Sudha Saktheeswaran
4ebd2db05b [MINOR] Suppressing full hive log and fetching only exceptions with context (#1903)
Co-authored-by: Bhavani Sudha Saktheeswaran <bsaktheeswaran@moveworks.ai>
2020-08-02 19:44:51 -07:00
Mathieu
30dcd5cf06 [MINOR] Remove redundant import in hudi-integ-test (#1899) 2020-08-02 21:30:23 +08:00
n3nash
727f1df62c [MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894) 2020-07-31 19:01:25 -07:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
hongdd
fa419213f6 [HUDI-703] Add test for HoodieSyncCommand (#1774) 2020-07-28 08:31:43 +08:00
Raymond Xu
31247e9b34 [HUDI-896] Report test coverage by modules & parallelize CI (#1753)
- use codecov flags for each module to report coverage
- parallelize CI jobs for shorter time
- add a testcase for MetricsReporterFactory (to trigger codecov comment)
2020-06-27 23:16:12 -07:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
leesf
492f324bc7 [HUDI-938] Removing incubating/incubator from project (#1658) 2020-05-24 18:28:13 +08:00
Joey
2600d2de8d [MINOR] Fix apache-rat violations (#1639)
* MINOR Fix apache-rat violations. Also, enabling RAT for hudi-utilities and hudi-integ-test
2020-05-18 11:16:49 -07:00
Mathieu
25a0080b2f [HUDI-714]Add javadoc and comments to hudi write method link (#1409)
* [HUDI-714] Add javadoc and comments to hudi write method link
2020-05-16 08:36:51 -04:00
Udit Mehrotra
404c7e82d9 [HUDI-884] Shade avro and parquet-avro in hudi-hive-sync-bundle (#1618)
Co-authored-by: Mehrotra <uditme@amazon.com>
2020-05-12 11:40:31 -07:00
Raymond Xu
d65efe659d [HUDI-780] Migrate test cases to Junit 5 (#1504) 2020-04-15 12:35:01 -07:00
Satish Kotha
3c803421e0 rename variable per review comments 2020-04-08 21:56:59 -07:00
Satish Kotha
1f6be820f3 [HUDI-758] Modify Integration test to include incremental queries for MOR tables 2020-04-08 21:56:59 -07:00
Pratyaksh Sharma
d610252d6b [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment (#1150)
* [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment
2020-04-07 16:10:26 -07:00
Ramachandran Madtas Subramaniam
639ec20412 [HUDI-562] Enable testing at debug log level
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprised when DEBUG log level is enabled in
production.
2020-04-02 11:14:35 -07:00
Suneel Marthi
8c3001363d HUDI-479: Eliminate or Minimize use of Guava if possible (#1159) 2020-03-28 03:11:32 -04:00
Suneel Marthi
99b7e9eb9e [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java (#1350)
* [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java
2020-03-13 20:28:05 -04:00