lanyuanxiaoyao/hudi - hudi - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Shen Hong	49407169ac	[HUDI-1209] Properties File must be optional when running deltastreamer (#2085 )	2020-10-21 17:49:28 -07:00
Ho Tien Vu	af5ef4d49d	[HUDI-1330] handle prefix filtering at directory level (#2157 ) The current DFSPathSelector only ignore prefix(_, .) at the file level while files under subdirectories e.g. (.checkpoint/*) are still considered which result in bad-format exception during reading.	2020-10-20 23:20:19 -07:00
Ho Tien Vu	fd269ddeb0	[MINOR] Make sure factory method is used to instanciate DFSPathSelector (#2187 ) * Move createSourceSelector into DFSPathSelector factory method * Replace constructor call with factory method * Added some javadoc	2020-10-20 17:52:31 +08:00
lw0090	2126f13e13	[HUDI-791] Replace null by Option in Delta Streamer (#2171 )	2020-10-11 18:29:57 -07:00
Prashant Wason	788d236c44	[HUDI-1303] Some improvements for the HUDI Test Suite. (#2128 ) 1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs. 2. Fix CleanNode constructor which is not correctly implemented 3. When generating upsets, allows more granualar control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts. 4. Fixed generation of records of specific size - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size. - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map. 5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp). 6. Making --workload-generator-classname an optional parameter as most probably the default will be used	2020-10-07 08:33:51 -04:00
Pratyaksh Sharma	524193eb4b	[HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566 ) Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-10-06 20:34:03 -07:00
Shen Hong	b335459c80	[HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088 )	2020-10-04 11:34:21 -07:00
Mathieu	1f7add9291	[HUDI-1089] Refactor hudi-client to support multi-engine (#1827 ) - This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules - Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc - Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common` - Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies - To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-10-01 14:25:29 -07:00
liujinhui	20b9b399c9	[HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators (#2074 )	2020-09-29 13:44:31 +08:00
Alexander Filipchik	c8e19e2def	[HUDI-801] Adding a way to post process schema after it is fetched (#1524 ) * [HUDI-801] Adding a way to post process schema after it is fetched Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com> Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-09-19 11:18:36 -07:00
shenh062326	581d54097c	[HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long	2020-09-15 20:58:29 -07:00
liujinhui	6c84ef20ac	[HUDI-1282] Check whether the topic exists before deltastrmer consumes Kafka (#2090 )	2020-09-16 10:43:52 +08:00
Abhishek Modi	53d1e55110	Test Suite should work with Docker + Unit Tests	2020-09-08 22:41:14 -07:00
wenningd	2fee087f0f	[HUDI-1181] Fix decimal type display issue for record key field (#1953 ) * [HUDI-1181] Fix decimal type display issue for record key field * Remove getNestedFieldVal method from DataSourceUtils * resolve comments Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-09-08 17:50:54 -07:00
Dongwook	8d19ebfd0f	[HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703 ) For Delete API, "hoodie.delete.shuffle.parallelism" isn't used as opposed to "hoodie.upsert.shuffle.parallelism" is used for upsert, this creates the performance difference between delete by upsert API with "EmptyHoodieRecordPayload" and delete API for certain cases. This patch makes the following fixes in this regard. - Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism" - Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"	2020-09-01 12:55:31 -04:00
Satish Kotha	ea983ff912	[HUDI-1137] Add option to configure different path selector	2020-08-24 13:26:44 -07:00
Mathieu	b883b6d268	[HUDI-1122] Introduce a kafka implementation of hoodie write commit ca… (#1886 )	2020-08-20 23:00:59 +08:00
Bhavani Sudha Saktheeswaran	4226d75144	Moving to 0.6.1-SNAPSHOT on master branch.	2020-08-14 12:54:15 -07:00
vinoth chandar	9bde6d616c	[HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965 ) - Maturity levels one of : evolving, stable, deprecated - Took a pass and marked out most of the existing public API	2020-08-13 23:28:17 -07:00
Udit Mehrotra	e4a2d98f79	[HUDI-426] Bootstrap datasource integration (#1702 )	2020-08-09 14:06:13 -07:00
wenningd	9fe2d2b14a	[HUDI-427] [HUDI-971] Implement CLI support for performing bootstrap (#1869 ) * [HUDI-971] Clean partitions & fileIds returned by HFileBootstrapIndex * [HUDI-427] Implement CLI support for performing bootstrap Co-authored-by: Wenning Ding <wenningd@amazon.com> Co-authored-by: Balaji Varadarajan <vbalaji@apache.org>	2020-08-08 12:37:29 -07:00
Mathieu	b51646dcc7	[HUDI-1151] Fix NPE when no new data in kafka using HoodieDeltaStreamer (#1921 )	2020-08-07 00:03:20 +08:00
lw0090	51ea27d665	[HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810 ) - Generalize the hive-sync module for syncing to multiple metastores - Added new options for datasource - Added new command line for delta streamer Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-08-05 21:34:55 -07:00
Balaji Varadarajan	7a2429f5ba	[HUDI-575] Spark Streaming with async compaction support (#1752 )	2020-08-05 07:50:15 -07:00
Sreeram Ramji	217a84192c	[HUDI-1140] Fix Jcommander issue for --hoodie-conf in DeltaStreamer (#1898 )	2020-08-04 21:42:51 -07:00
Sivabalan Narayanan	ab11ba43e1	[REVERT] "[HUDI-1058] Make delete marker configurable (#1819 )" (#1914 ) This reverts commit `433d7d2c98`.	2020-08-04 15:20:38 -07:00
vinoth chandar	539621bd33	[HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876 ) - [HUDI-418] Bootstrap Index Implementation using HFile with unit-test - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices - [HUDI-421] Bootstrap Write Client with tests - [HUDI-425] Added HoodieDeltaStreamer support - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly - [HUDI-424] Simplify Record reader implementation - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables Co-authored-by: Mehrotra <uditme@amazon.com> Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Balaji Varadarajan <varadarb@uber.com>	2020-08-03 20:19:21 -07:00
Shen Hong	433d7d2c98	[HUDI-1058] Make delete marker configurable (#1819 )	2020-08-03 11:06:31 -04:00
n3nash	727f1df62c	[MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894 )	2020-07-31 19:01:25 -07:00
Nishith Agarwal	2fc2b01d86	[HUDI-394] Provide a basic implementation of test suite	2020-07-30 21:21:15 -07:00
Raymond Xu	ca36c44cb3	[HUDI-995] Move TestRawTripPayload and HoodieTestDataGenerator to hudi-common (#1873 )	2020-07-27 19:21:45 +08:00
Sivabalan Narayanan	5b6026ba43	[HUDI-802] Fixing deletes for inserts in same batch in write path (#1792 ) * Fixing deletes for inserts in same batch in write path * Fixing delta streamer tests * Adding tests for OverwriteWithLatestAvroPayload	2020-07-22 19:39:57 -07:00
lw0090	1ec89e9a94	[HUDI-839] Introducing support for rollbacks using marker files (#1756 ) * [HUDI-839] Introducing rollback strategy using marker files - Adds a new mechanism for rollbacks where it's based on the marker files generated during the write - Consequently, marker file/dir deletion now happens post commit, instead of during finalize - Marker files are also generated for AppendHandle, making it consistent throughout the write path - Until upgrade-downgrade mechanism can upgrade non-marker based inflight writes to marker based, this should only be turned on for new datasets. - Added marker dir deletion after successful commit/rollback, individual files are not deleted during finalize - Fail safe for deleting marker directories, now during timeline archival process - Added check to ensure completed instants are not rolled back using marker based strategy. This will be incorrect - Reworked tests to rollback inflight instants, instead of completed instants whenever necessary - Added an unit test for MarkerBasedRollbackStrategy Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-07-20 22:41:42 -07:00
Prashant Wason	b71f25f210	[HUDI-92] Provide reasonable names for Spark DAG stages in HUDI. (#1289 )	2020-07-19 10:29:25 -07:00
GuoPhilipse	abfebd30f3	[MINOR] Update parameter description (#1821 )	2020-07-11 22:57:12 +08:00
Pratyaksh Sharma	9627a385fe	[HUDI-916]: Added support for multiple input formats in TimestampBasedKeyGenerator (#1648 )	2020-07-10 15:28:45 -04:00
Pratyaksh Sharma	c7f1a781ab	[HUDI-728]: Implemented custom key generator (#1433 )	2020-07-09 07:35:07 -04:00
Trevor	d58644b657	[HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen (#1779 )	2020-07-08 21:07:34 -07:00
Raymond Xu	3b9a30528b	[HUDI-996] Add functional test suite for hudi-utilities (#1746 ) - Share resources for functional tests - Add suite for functional test classes from hudi-utilities	2020-07-05 16:44:31 -07:00
Balaji Varadarajan	8919be6a5d	[HUDI-855] Run Cleaner async with writing (#1577 ) - Cleaner can now run concurrently with write operation - Configs to turn on/off Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-06-28 02:04:50 -07:00
Prashant Wason	2603cfb33e	[HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687 ) Notable changes: 1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format 2. HoodieDataBlock abstraction for creation specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock) 3. All hardocded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format 4. HiveSyncTool accepts the base file format as a CLI parameter 5. HoodieDeltaStreamer accepts the base file format as a CLI parameter 6. HoodieSparkSqlWriter accepts the base file format as a parameter	2020-06-25 23:46:55 -07:00
Shen Hong	89e37d5273	[HUDI-908] Add some data types to HoodieTestDataGenerator and fix some some bugs. (#1690 )	2020-06-22 08:13:28 -07:00
Raymond Xu	8a9fdd603e	[HUDI-1023] Add validation error messages in delta sync (#1710 ) - Remove explicitly specifying BLOOM_INDEX since thats the default anyway	2020-06-19 12:12:35 -07:00
Raymond Xu	ab724af5c4	[MINOR] Rename `TestSourceConfig` to `SourceConfigs` (#1749 )	2020-06-19 12:08:19 -07:00
Litianye	ede6c9bda4	[HUDI-1006] Deltastreamer use kafkaSource with offset reset strategy:latest can't consume data (#1719 )	2020-06-14 18:01:44 +08:00
liujinhui	97ab97b726	[HUDI-918] Fix kafkaOffsetGen can not read kafka data bug (#1652 )	2020-06-08 20:46:47 +08:00
Raymond Xu	742c204099	[HUDI-811] Restructure test packages in hudi-client/cli (#1689 )	2020-06-02 10:25:42 +08:00
Raymond Xu	03f136361a	[HUDI-811] Restructure test packages in hudi-common (#1644 ) * [HUDI-811] Restructure test packages in hudi-common	2020-05-27 16:28:17 +08:00
Raymond Xu	6c450957ce	[HUDI-690] Filter out inflight compaction in exporter (#1667 )	2020-05-26 09:23:34 -07:00
Balaji Varadarajan	74ecc27e92	[HUDI-846][HUDI-848] Enable Incremental cleaning and embedded timeline-server by default (#1634 )	2020-05-20 05:29:43 -07:00

1 2 3 4

186 Commits