Commit Graph

83 Commits

Author SHA1 Message Date
Balaji Varadarajan
b8f4a30efd Fix Integration test flakiness in HoodieJavaStreamingApp (#1967) 2020-08-14 01:42:15 -07:00
vinoth chandar
9bde6d616c [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965)
- Maturity levels one of : evolving, stable, deprecated
- Took a pass and marked out most of the existing public API
2020-08-13 23:28:17 -07:00
Sivabalan Narayanan
379cf0786f [HUDI-1013] Adding Bulk Insert V2 implementation (#1834)
- Adding ability to use native spark row writing for bulk_insert
 - Controlled by `ENABLE_ROW_WRITER_OPT_KEY` datasource write option
 - Introduced KeyGeneratorInterface in hudi-client, moved KeyGenerator back to hudi-spark
 - Simplified the new API additions to just two new methods : getRecordKey(row), getPartitionPath(row)
 - Fixed all built-in key generators with new APIs
 - Made the field position map lazily created upon the first call to row based apis
 - Implemented native row based key generators for CustomKeyGenerator
 - Fixed all the tests, with these new APIs

Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-13 00:33:39 -07:00
Udit Mehrotra
e4a2d98f79 [HUDI-426] Bootstrap datasource integration (#1702) 2020-08-09 14:06:13 -07:00
Gary Li
4f74a84607 [HUDI-69] Support Spark Datasource for MOR table - RDD approach (#1848)
- This PR implements Spark Datasource for MOR table in the RDD approach.
- Implemented SnapshotRelation
- Implemented HudiMergeOnReadRDD
- Implemented separate Iterator to handle merge and unmerge record reader.
- Added TestMORDataSource to verify this feature.
- Clean up test file name, add tests for mixed query type tests
 - We can now revert the change made in DefaultSource

Co-authored-by: Vinoth Chandar <vchandar@confluent.io>
2020-08-07 00:28:14 -07:00
lw0090
51ea27d665 [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810)
- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer 

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-05 21:34:55 -07:00
Balaji Varadarajan
7a2429f5ba [HUDI-575] Spark Streaming with async compaction support (#1752) 2020-08-05 07:50:15 -07:00
Sivabalan Narayanan
ab11ba43e1 [REVERT] "[HUDI-1058] Make delete marker configurable (#1819)" (#1914)
This reverts commit 433d7d2c98.
2020-08-04 15:20:38 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Shen Hong
433d7d2c98 [HUDI-1058] Make delete marker configurable (#1819) 2020-08-03 11:06:31 -04:00
Y Ethan Guo
ccd70a7e48 [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert (#1149)
* [HUDI-472] Introduce the configuration and new modes of record sorting for bulk_insert (#1149). Three sorting modes are implemented: global sort ("global_sort"), local sort inside each RDD partition ("partition_sort"), and no sort ("none")
2020-07-31 09:52:42 -04:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
Udit Mehrotra
1aae437257 [HUDI-1102] Add common useful Spark related and Table path detection utilities (#1841)
Co-authored-by: Mehrotra <uditme@amazon.com>
2020-07-18 16:16:32 -07:00
miaomiaomiao
10e457278b [HUDI-1078]Fix IllegalArgumentException in Delete data demo of Quick-Start Guide (#1808) 2020-07-13 11:38:06 -04:00
Pratyaksh Sharma
9627a385fe [HUDI-916]: Added support for multiple input formats in TimestampBasedKeyGenerator (#1648) 2020-07-10 15:28:45 -04:00
Pratyaksh Sharma
c7f1a781ab [HUDI-728]: Implemented custom key generator (#1433) 2020-07-09 07:35:07 -04:00
mabin001
8c4ff185f1 [HUDI-1064]Trim hoodie table name (#1805) 2020-07-07 19:10:16 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creating specific data blocks for base file formats (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardcoded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
Shen Hong
89e37d5273 [HUDI-908] Add some data types to HoodieTestDataGenerator and fix some bugs. (#1690) 2020-06-22 08:13:28 -07:00
Bhavani Sudha Saktheeswaran
9697fbf71e [HUDI-936] Fix fetch ordering val in HoodieSparkSqlWriter to remove unnecessary conversion to String (#1659) 2020-05-26 21:09:02 -07:00
rolandjohann
459356e292 [HUDI-863] get decimal properties from derived spark DataType (#1596) 2020-05-18 04:28:27 -07:00
Mathieu
25a0080b2f [HUDI-714]Add javadoc and comments to hudi write method link (#1409)
2020-05-16 08:36:51 -04:00
Gary Li
a64afdfd17 HUDI-528 Handle empty commit in incremental pulling (#1612) 2020-05-14 22:55:25 -07:00
cxzl25
32bada29dc [HUDI-889] Writer supports useJdbc configuration when hive synchronization is enabled (#1627) 2020-05-14 00:20:13 +08:00
Shen Hong
295d00beea [HUDI-880] Replace part of spark context by hadoop configuration in HoodieTable. (#1614) 2020-05-11 23:33:57 -07:00
AakashPradeep
5e0f5e5521 [HUDI-852] adding check for table name for Append Save mode (#1580)
* adding check for table name for Append Save mode

* adding existing table validation for delete and upsert operation

Co-authored-by: Aakash Pradeep <apradeep@twilio.com>
2020-05-03 23:09:17 -07:00
Dongwook
ddd105bb31 [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource (#1500) 2020-04-20 08:38:18 -07:00
Pratyaksh Sharma
d610252d6b [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment (#1150)
2020-04-07 16:10:26 -07:00
vinoth chandar
eaf6cc2d90 [HUDI-756] Organize Cleaning Action execution into a single package in hudi-client (#1485)
- Introduced a thin abstraction ActionExecutor, that all actions will implement
- Pulled cleaning code from table, writeclient into a single package
- CleanHelper is now CleanPlanner, HoodieCleanClient is no longer around
- Minor refactor of HoodieTable factory method
- HoodieTable.create() methods with and without metaclient passed in
- HoodieTable constructor now does not do a redundant instantiation
- Fixed existing unit tests to work at the HoodieWriteClient level
2020-04-04 00:07:34 -07:00
wenningd
ce0a4c64d0 [HUDI-713] Fix conversion of Spark array of struct type to Avro schema (#1406)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-03-30 15:52:15 -07:00
Suneel Marthi
fa36082554 [HUDI-746] Reduce build warnings < 10 (#1465) 2020-03-30 11:46:52 +08:00
vinoth chandar
e057c27603 [HUDI-744] Restructure hudi-common and clean up files under util packages (#1462)
- Brings more order and cohesion to the classes in hudi-common
 - Utils classes related to a particular concept (avro, timeline,...) are placed near to the package
 - common.fs package now contains all the filesystem level classes including wrapper filesystem
 - bloom.filter package renamed to just bloom
 - config package contains classes that help store properties
 - common.fs.inline package contains all the inline filesystem classes/impl
 - common.table.timeline now consolidates all timeline related classes
 - common.table.view consolidates all the classes related to filesystem view metadata
 - common.table.timeline.versioning contains all classes related to versioning of timeline
 - Fix few unit tests as a result
 - Moved the test packages around to match the source file move
 - Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
2020-03-29 10:58:49 -07:00
Suneel Marthi
8c3001363d HUDI-479: Eliminate or Minimize use of Guava if possible (#1159) 2020-03-28 03:11:32 -04:00
Zhiyuan Zhao
0241b21f77 [HUDI-65] commitTime rename to instantTime (#1431) 2020-03-22 18:06:00 -07:00
Udit Mehrotra
c40a0d4e91 [HUDI-656][Performance] Return a dummy Spark relation after writing the DataFrame (#1394)
Co-authored-by: Mehrotra <uditme@amazon.com>
2020-03-11 20:27:46 -07:00
hongdd
f93e64fee4 [HUDI-681]Remove embeddedTimelineService from HoodieReadClient (#1388)
2020-03-09 18:31:04 +08:00
vinoyang
ee5b32f5d4 [HUDI-652] Decouple HoodieReadClient and AbstractHoodieClient to break the inheritance chain (#1372)
* Removed timeline server support
* Removed try-with-resource
2020-03-06 09:59:35 -08:00
Udit Mehrotra
2d04014581 [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by Date type columns (#1330) 2020-03-01 10:42:58 -08:00
vinoth chandar
71170fafe7 [HUDI-554] Cleanup package structure in hudi-client (#1346)
- Just package, class moves and renames with the following intent
 - `client` now has all the various client classes, that do the transaction management
 - `func` renamed to `execution` and some helpers moved to `client/utils`
 - All compaction code under `io` now under `table/compact`
 - Rollback code under `table/rollback` and in general all code for individual operations under `table`
 - `exception` `config`, `metrics` left untouched
 - Moved the tests also accordingly
 - Fixed some flaky tests
2020-02-27 08:05:58 -08:00
lamber-ken
11fb2c2614 [HUDI-580] Fix incorrect license header in files 2020-02-25 08:54:26 -08:00
YanJia-Gary-Li
4e7fcde4a6 [HUDI-597] Enable incremental pulling from defined partitions (#1348) 2020-02-24 11:46:30 -08:00
Suneel Marthi
f9d2f66dc1 [HUDI-622]: Remove VisibleForTesting annotation and import from code (#1343)
* HUDI:622: Remove VisibleForTesting annotation and import from code
2020-02-20 15:17:53 +08:00
Suneel Marthi
5b7bb142dc [HUDI-583] Code Cleanup, remove redundant code, and other changes (#1237) 2020-02-02 18:03:44 +08:00
leesf
652224edc8 [HUDI-578] Trim recordKeyFields and partitionPathFields in ComplexKeyGenerator (#1281)
* [HUDI-578] Trim recordKeyFields and partitionPathFields in ComplexKeyGenerator

* add tests
2020-01-29 16:26:26 -08:00
Y Ethan Guo
d0ee95ed16 [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion (#1246) 2020-01-18 16:40:56 -08:00
vinoth chandar
c2c0f6b13d [HUDI-509] Renaming code in sync with cWiki restructuring (#1212)
- Storage Type replaced with Table Type (remaining instances)
 - View types replaced with query types;
 - ReadOptimized view referred as Snapshot Query
 - TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
 - HoodieDataFile renamed to HoodieBaseFile
 - Hive Sync tool will register RO tables for MOR with a `_ro` suffix
 - Datasource/Deltastreamer options renamed accordingly
 - Support fallback to old config values as well, so migration is painless
 - Config for controlling _ro suffix addition
 - Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
2020-01-16 23:58:47 -08:00
Scheller
1daba24065 Add GlobalDeleteKeyGenerator
Adds new GlobalDeleteKeyGenerator for record_key deletes with global indices. Also refactors key generators into their own package.
2020-01-15 17:01:29 -08:00
Sivabalan Narayanan
2248fd9aea Fixing checkstyle issues 2020-01-15 14:21:26 -08:00
Sivabalan Narayanan
2b2f23aa60 Fixing delete util method 2020-01-15 14:21:26 -08:00
Sivabalan Narayanan
87fdb769f0 Adding util methods to assist in adding deletion support to Quick Start 2020-01-15 14:21:26 -08:00