1
0
Commit Graph

186 Commits

Author SHA1 Message Date
Shen Hong
49407169ac [HUDI-1209] Properties File must be optional when running deltastreamer (#2085) 2020-10-21 17:49:28 -07:00
Ho Tien Vu
af5ef4d49d [HUDI-1330] handle prefix filtering at directory level (#2157)
The current DFSPathSelector only ignore prefix(_, .) at the file level while files under subdirectories
e.g. (.checkpoint/*) are still considered which result in bad-format exception during reading.
2020-10-20 23:20:19 -07:00
Ho Tien Vu
fd269ddeb0 [MINOR] Make sure factory method is used to instanciate DFSPathSelector (#2187)
* Move createSourceSelector into DFSPathSelector factory method
* Replace constructor call with factory method
* Added some javadoc
2020-10-20 17:52:31 +08:00
lw0090
2126f13e13 [HUDI-791] Replace null by Option in Delta Streamer (#2171) 2020-10-11 18:29:57 -07:00
Prashant Wason
788d236c44 [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upsets, allows more granualar control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
   - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
2020-10-07 08:33:51 -04:00
Pratyaksh Sharma
524193eb4b [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566)
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
2020-10-06 20:34:03 -07:00
Shen Hong
b335459c80 [HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088) 2020-10-04 11:34:21 -07:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
liujinhui
20b9b399c9 [HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators (#2074) 2020-09-29 13:44:31 +08:00
Alexander Filipchik
c8e19e2def [HUDI-801] Adding a way to post process schema after it is fetched (#1524)
* [HUDI-801] Adding a way to post process schema after it is fetched

Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com>
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
2020-09-19 11:18:36 -07:00
shenh062326
581d54097c [HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long 2020-09-15 20:58:29 -07:00
liujinhui
6c84ef20ac [HUDI-1282] Check whether the topic exists before deltastrmer consumes Kafka (#2090) 2020-09-16 10:43:52 +08:00
Abhishek Modi
53d1e55110 Test Suite should work with Docker + Unit Tests 2020-09-08 22:41:14 -07:00
wenningd
2fee087f0f [HUDI-1181] Fix decimal type display issue for record key field (#1953)
* [HUDI-1181] Fix decimal type display issue for record key field

* Remove getNestedFieldVal method from DataSourceUtils

* resolve comments

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-09-08 17:50:54 -07:00
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
Satish Kotha
ea983ff912 [HUDI-1137] Add option to configure different path selector 2020-08-24 13:26:44 -07:00
Mathieu
b883b6d268 [HUDI-1122] Introduce a kafka implementation of hoodie write commit ca… (#1886) 2020-08-20 23:00:59 +08:00
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
vinoth chandar
9bde6d616c [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965)
- Maturity levels one of : evolving, stable, deprecated
- Took a pass and marked out most of the existing public API
2020-08-13 23:28:17 -07:00
Udit Mehrotra
e4a2d98f79 [HUDI-426] Bootstrap datasource integration (#1702) 2020-08-09 14:06:13 -07:00
wenningd
9fe2d2b14a [HUDI-427] [HUDI-971] Implement CLI support for performing bootstrap (#1869)
* [HUDI-971] Clean partitions & fileIds returned by HFileBootstrapIndex
* [HUDI-427] Implement CLI support for performing bootstrap

Co-authored-by: Wenning Ding <wenningd@amazon.com>
Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>
2020-08-08 12:37:29 -07:00
Mathieu
b51646dcc7 [HUDI-1151] Fix NPE when no new data in kafka using HoodieDeltaStreamer (#1921) 2020-08-07 00:03:20 +08:00
lw0090
51ea27d665 [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810)
- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer 

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-05 21:34:55 -07:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Sreeram Ramji
217a84192c [HUDI-1140] Fix Jcommander issue for --hoodie-conf in DeltaStreamer (#1898) 2020-08-04 21:42:51 -07:00
Sivabalan Narayanan
ab11ba43e1 [REVERT] "[HUDI-1058] Make delete marker configurable (#1819)" (#1914)
This reverts commit 433d7d2c98.
2020-08-04 15:20:38 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Shen Hong
433d7d2c98 [HUDI-1058] Make delete marker configurable (#1819) 2020-08-03 11:06:31 -04:00
n3nash
727f1df62c [MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894) 2020-07-31 19:01:25 -07:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
Raymond Xu
ca36c44cb3 [HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common (#1873) 2020-07-27 19:21:45 +08:00
Sivabalan Narayanan
5b6026ba43 [HUDI-802] Fixing deletes for inserts in same batch in write path (#1792)
* Fixing deletes for inserts in same batch in write path
* Fixing delta streamer tests
* Adding tests for OverwriteWithLatestAvroPayload
2020-07-22 19:39:57 -07:00
lw0090
1ec89e9a94 [HUDI-839] Introducing support for rollbacks using marker files (#1756)
* [HUDI-839] Introducing rollback strategy using marker files

 - Adds a new mechanism for rollbacks where it's based on the marker files generated during the write
 - Consequently, marker file/dir deletion now happens post commit, instead of during finalize 
 - Marker files are also generated for AppendHandle, making it consistent throughout the write path 
 - Until upgrade-downgrade mechanism can upgrade non-marker based inflight writes to marker based, this should only be turned on for new datasets.
 - Added marker dir deletion after successful commit/rollback, individual files are not deleted during finalize
 - Fail safe for deleting marker directories, now during timeline archival process
 - Added check to ensure completed instants are not rolled back using marker based strategy. This will be incorrect
 - Reworked tests to rollback inflight instants, instead of completed instants whenever necessary
 - Added an unit test for MarkerBasedRollbackStrategy


Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-07-20 22:41:42 -07:00
Prashant Wason
b71f25f210 [HUDI-92] Provide reasonable names for Spark DAG stages in HUDI. (#1289) 2020-07-19 10:29:25 -07:00
GuoPhilipse
abfebd30f3 [MINOR] Update parameter description (#1821) 2020-07-11 22:57:12 +08:00
Pratyaksh Sharma
9627a385fe [HUDI-916]: Added support for multiple input formats in TimestampBasedKeyGenerator (#1648) 2020-07-10 15:28:45 -04:00
Pratyaksh Sharma
c7f1a781ab [HUDI-728]: Implemented custom key generator (#1433) 2020-07-09 07:35:07 -04:00
Trevor
d58644b657 [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen (#1779) 2020-07-08 21:07:34 -07:00
Raymond Xu
3b9a30528b [HUDI-996] Add functional test suite for hudi-utilities (#1746)
- Share resources for functional tests
- Add suite for functional test classes from hudi-utilities
2020-07-05 16:44:31 -07:00
Balaji Varadarajan
8919be6a5d [HUDI-855] Run Cleaner async with writing (#1577)
- Cleaner can now run concurrently with write operation 
- Configs to turn on/off

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-06-28 02:04:50 -07:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
Shen Hong
89e37d5273 [HUDI-908] Add some data types to HoodieTestDataGenerator and fix some some bugs. (#1690) 2020-06-22 08:13:28 -07:00
Raymond Xu
8a9fdd603e [HUDI-1023] Add validation error messages in delta sync (#1710)
- Remove explicitly specifying BLOOM_INDEX since thats the default anyway
2020-06-19 12:12:35 -07:00
Raymond Xu
ab724af5c4 [MINOR] Rename TestSourceConfig to SourceConfigs (#1749) 2020-06-19 12:08:19 -07:00
Litianye
ede6c9bda4 [HUDI-1006] Deltastreamer use kafkaSource with offset reset strategy:latest can't consume data (#1719) 2020-06-14 18:01:44 +08:00
liujinhui
97ab97b726 [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug (#1652) 2020-06-08 20:46:47 +08:00
Raymond Xu
742c204099 [HUDI-811] Restructure test packages in hudi-client/cli (#1689) 2020-06-02 10:25:42 +08:00
Raymond Xu
03f136361a [HUDI-811] Restructure test packages in hudi-common (#1644)
* [HUDI-811] Restructure test packages in hudi-common
2020-05-27 16:28:17 +08:00
Raymond Xu
6c450957ce [HUDI-690] Filter out inflight compaction in exporter (#1667) 2020-05-26 09:23:34 -07:00
Balaji Varadarajan
74ecc27e92 [HUDI-846][HUDI-848] Enable Incremental cleaning and embedded timeline-server by default (#1634) 2020-05-20 05:29:43 -07:00