1
0
Commit Graph

61 Commits

Author SHA1 Message Date
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have a implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
Mathieu
fa81248247 [HUDI-531] Add java doc for hudi test suite general classes (#1900) 2020-08-28 08:44:40 +08:00
hongdd
dedc4517dd [HUDI-978] Specify version information for each component separately (#1772) 2020-08-26 21:08:09 +08:00
Mathieu
df8f099c99 [HUDI-532] Add java doc for the test classes of hudi test suite (#1901) 2020-08-26 08:49:01 +08:00
Satish Kotha
ea983ff912 [HUDI-1137] Add option to configure different path selector 2020-08-24 13:26:44 -07:00
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
Balaji Varadarajan
b8f4a30efd Fix Integration test flakiness in HoodieJavaStreamingApp (#1967) 2020-08-14 01:42:15 -07:00
Sivabalan Narayanan
9c24151929 [HUDI-1175] Commenting out testsuite tests from Integration tests until we investigate the CI flakiness (#1945) 2020-08-10 21:00:57 -07:00
Yungthuis
8b66524090 [MINOR] Remove unused import (#1932)
Co-authored-by: tom_glb <goodMorning_glb@hotmail.com>
2020-08-08 12:04:31 +08:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Balaji Varadarajan
61e027fadd [MINOR] Adding timeout for each command execution in docker and capture output. This will help get stdout/stderr of stuck commands (#1918) 2020-08-05 07:46:34 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Bhavani Sudha Saktheeswaran
4ebd2db05b [MINOR] Suppressing full hive log and fetching only exceptions with context (#1903)
Co-authored-by: Bhavani Sudha Saktheeswaran <bsaktheeswaran@moveworks.ai>
2020-08-02 19:44:51 -07:00
Mathieu
30dcd5cf06 [MINOR] Remove redundant import in hudi-integ-test (#1899) 2020-08-02 21:30:23 +08:00
n3nash
727f1df62c [MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894) 2020-07-31 19:01:25 -07:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
hongdd
fa419213f6 [HUDI-703] Add test for HoodieSyncCommand (#1774) 2020-07-28 08:31:43 +08:00
Raymond Xu
31247e9b34 [HUDI-896] Report test coverage by modules & parallelize CI (#1753)
- use codecov flags for each module to report coverage
- parallelize CI jobs for shorter time
- add a testcase for MetricsReporterFactory (to trigger codecov comment)
2020-06-27 23:16:12 -07:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
leesf
492f324bc7 [HUDI-938] Removing incubating/incubator from project (#1658) 2020-05-24 18:28:13 +08:00
Joey
2600d2de8d [MINOR] Fix apache-rat violations (#1639)
* MINOR Fix apache-rat violations. Also, enabling RAT for hudi-utilities and hudi-integ-test
2020-05-18 11:16:49 -07:00
Mathieu
25a0080b2f [HUDI-714]Add javadoc and comments to hudi write method link (#1409)
* [HUDI-714] Add javadoc and comments to hudi write method link
2020-05-16 08:36:51 -04:00
Udit Mehrotra
404c7e82d9 [HUDI-884] Shade avro and parquet-avro in hudi-hive-sync-bundle (#1618)
Co-authored-by: Mehrotra <uditme@amazon.com>
2020-05-12 11:40:31 -07:00
Raymond Xu
d65efe659d [HUDI-780] Migrate test cases to Junit 5 (#1504) 2020-04-15 12:35:01 -07:00
Satish Kotha
3c803421e0 rename variable per review comments 2020-04-08 21:56:59 -07:00
Satish Kotha
1f6be820f3 [HUDI-758] Modify Integration test to include incremental queries for MOR tables 2020-04-08 21:56:59 -07:00
Pratyaksh Sharma
d610252d6b [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment (#1150)
* [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment
2020-04-07 16:10:26 -07:00
Ramachandran Madtas Subramaniam
639ec20412 [HUDI-562] Enable testing at debug log level
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprised when DEBUG log level is enabled in
production.
2020-04-02 11:14:35 -07:00
Suneel Marthi
8c3001363d HUDI-479: Eliminate or Minimize use of Guava if possible (#1159) 2020-03-28 03:11:32 -04:00
Suneel Marthi
99b7e9eb9e [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java (#1350)
* [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java
2020-03-13 20:28:05 -04:00
lamber-ken
90227eeda7 [HUDI-673] Rename hudi-hive-bundle to hudi-hive-sync-bundle 2020-03-07 21:44:35 +08:00
yanghua
0dc8e493aa Moving to 0.6.0-SNAPSHOT on master branch. 2020-03-01 15:08:30 +08:00
Suneel Marthi
594da28fbf [HUDI-595] code cleanup, refactoring code out of PR# 1159 (#1302) 2020-02-04 21:52:03 +08:00
Suneel Marthi
5b7bb142dc [HUDI-583] Code Cleanup, remove redundant code, and other changes (#1237) 2020-02-02 18:03:44 +08:00
lamber-ken
cdb028f1f3 [MINOR] Fix missing groupId / version property of dependency 2020-01-25 09:19:55 -08:00
leesf
6e59c1c777 Moving to 0.5.2-SNAPSHOT on master branch. 2020-01-20 10:51:33 -08:00
wenningd
292c1e2ff4 [HUDI-238] Make Hudi support Scala 2.12 (#1226)
* [HUDI-238] Rename scala related artifactId & add maven profile to support Scala 2.12
2020-01-17 14:02:21 -08:00
vinoth chandar
c2c0f6b13d [HUDI-509] Renaming code in sync with cWiki restructuring (#1212)
- Storage Type replaced with Table Type (remaining instances)
 - View types replaced with query types;
 - ReadOptimized view referred as Snapshot Query
 - TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
 - HoodieDataFile renamed to HoodieBaseFile
 - Hive Sync tool will register RO tables for MOR with a `_ro` suffix
 - Datasource/Deltastreamer options renamed accordingly
 - Support fallback to old config values as well, so migration is painless
 - Config for controlling _ro suffix addition
 - Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
2020-01-16 23:58:47 -08:00
Udit Mehrotra
ad50008a59 [HUDI-91][HUDI-12]Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types
- Upgrade Spark to 2.4.4, Parquet to 1.10.1, Avro to 1.8.2
- Remove spark-avro from hudi-spark-bundle. Users need to provide --packages org.apache.spark:spark-avro:2.4.4 when running spark-shell or spark-submit
- Replace com.databricks:spark-avro with org.apache.spark:spark-avro
- Shade avro in hudi-hadoop-mr-bundle to make sure it does not conflict with hive's avro version.
2020-01-12 15:03:11 -08:00
Pratyaksh Sharma
dde21e7315 [HUDI-402]: code clean up in test cases 2019-12-31 11:10:49 -08:00
lamber-ken
ba514cfea0 [MINOR] Remove redundant plus operator (#1097) 2019-12-12 05:42:05 +08:00
谢磊
f9139c0f61 [HUDI-366] Refactor some module codes based on new ImportOrder code style rule (#1055)
[HUDI-366] Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test / hudi- utilities based on new ImportOrder code style rule
2019-11-27 21:32:43 +08:00
Sivabalan Narayanan
c3355109b1 [HUDI-328] Adding delete api to HoodieWriteClient (#1004)
[HUDI-328]  Adding delete api to HoodieWriteClient and Spark DataSource
2019-11-22 15:05:25 -08:00
Mehrotra
92c69f5703 Migrate integration tests to spark 2.4.4 2019-11-13 16:53:24 -08:00
Sivabalan Narayanan
23b303e4b1 [HUDI-218] Adding Presto support to Integration Test (#1003) 2019-11-11 06:21:49 -08:00
Wenning Ding
bd77dc792c Add MOR integration testing 2019-11-02 19:49:04 -07:00
Balaji Varadarajan
a6390aefc4 [HUDI-312] Make docker hdfs cluster ephemeral. This is needed to fix flakiness in integration tests. Also, Fix DeltaStreamer hanging issue due to uncaught exception 2019-11-01 11:49:59 -07:00
leesf
b19bed442d [HUDI-296] Explore use of spotless to auto fix formatting errors (#945)
- Add spotless format fixing to project
- One time reformatting for conformity
- Build fails for formatting changes and mvn spotless:apply autofixes them
2019-10-10 05:19:40 -07:00
Balaji Varadarajan
9b66ea41fd [HUDI-121] Remove leftover notice file and replace com.uber.hoodie with org.apache.hudi in log4j properties 2019-10-04 09:18:57 -07:00