lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
vinoyang	eafd7bf289	[MINOR] Fix wrong javadoc and refactor some naming issues (#2156 )	2020-10-09 15:09:26 -07:00
dugenkui	00271af64e	[MINOR] Fix typo (#2159 ) * fix typo * fix typo	2020-10-09 14:52:55 -07:00
Raymond Xu	1d1d91d444	[HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2143 ) * [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable Remove APIs in `HoodieTestUtils` - listAllDataFilesAndLogFilesInPath - listAllLogFilesInPath - listAllDataFilesInPath - writeRecordsToLogFiles - createCleanFiles - createPendingCleanFiles Migrate the callers to use `HoodieTestTable` and `HoodieWriteableTestTable` with new APIs added - listAllBaseAndLogFiles - listAllLogFiles - listAllBaseFiles - withLogAppends - addClean - addInflightClean Also added related APIs in `FileCreateUtils` - createCleanFile - createRequestedCleanFile - createInflightCleanFile	2020-10-09 10:21:27 +08:00
Prashant Wason	788d236c44	[HUDI-1303] Some improvements for the HUDI Test Suite. (#2128 ) 1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs. 2. Fix CleanNode constructor which is not correctly implemented 3. When generating upserts, allows more granular control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts. 4. Fixed generation of records of specific size - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size. - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map. 5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp). 6. Making --workload-generator-classname an optional parameter as most probably the default will be used	2020-10-07 08:33:51 -04:00
Pratyaksh Sharma	524193eb4b	[HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566 ) Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-10-06 20:34:03 -07:00
rmpifer	fed01cd3c9	[MINOR] Update spark master default to yarn (#2148 )	2020-10-05 15:22:28 -07:00
lw0090	fdae388626	[HUDI-1203] add port configuration for EmbeddedTimelineService (#2142 )	2020-10-05 11:36:54 -07:00
Shen Hong	b335459c80	[HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088 )	2020-10-04 11:34:21 -07:00
Pratyaksh Sharma	080ba3ed54	[HUDI-1199] relocated jetty in hudi-utilities-bundle pom (#1990 ) * [HUDI-1199]: relocated jetty in hudi-utilities-bundle pom * [HUDI-1199]: re trigger travis build	2020-10-04 11:22:01 -07:00
Prashant Wason	6c610b91ef	[HUDI-1305] Added an API to shutdown and remove the metrics reporter. (#2132 ) This helps in removing the reporter once the test has completed. Prevents log pollution from unnecessary metric logs. - Added an API to shutdown the metrics reporter after tests.	2020-10-04 09:30:04 -07:00
Mathieu	1f7add9291	[HUDI-1089] Refactor hudi-client to support multi-engine (#1827 ) - This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules - Simple usages of Spark using jsc.parallelize() has been redone using EngineContext#map, EngineContext#flatMap etc - Code changes in the PR, break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common` - Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies - To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-10-01 14:25:29 -07:00
vinoyang	5aaaf8bff1	[MINOR] Change the log level of the dag scheduler for the test suite (#2134 )	2020-09-30 17:17:44 +08:00
satishkotha	a99e93bed5	[HUDI-1072] Introduce REPLACE top level action. Implement insert_overwrite operation on top of replace action (#2048 )	2020-09-29 17:04:25 -07:00
hongdd	32c9cad52c	[HUDI-840] Avoid blank file created by HoodieLogFormatWriter (#1567 )	2020-09-29 08:02:15 -07:00
liujinhui	20b9b399c9	[HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators (#2074 )	2020-09-29 13:44:31 +08:00
vinoyang	c0c0095fa9	[MINOR] Reformat prepare_integration_suite script (#2126 )	2020-09-28 14:12:57 -07:00
liujinhui	a86f5574ed	[HUDI-1192] Make create hive database automatically configurable (#1968 )	2020-09-27 14:10:13 +08:00
leesf	b0f1b736f8	[MINOR] Fix checkstyle (#2117 )	2020-09-26 22:25:19 +08:00
Raymond Xu	1be0b06ef8	[HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2112 ) Remove APIs in HoodieTestUtils - HoodieTestUtils#createInflightCommitFiles - HoodieTestUtils#getCommitFilePath - HoodieTestUtils#doesCommitExist and migrate usages to HoodieTestTable in - hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRollbacksCommand.java - hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestUpgradeDowngradeCommand.java - hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestCommitsCommand.java - hudi-cli/src/test/java/org/apache/hudi/cli/testutils/HoodieTestCommitMetadataGenerator.java - hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java	2020-09-26 21:21:47 +08:00
dugenkui	ae68b2b355	[MINOR] fix typos (#2116 )	2020-09-26 20:40:33 +08:00
Mathieu	1dd6635fbb	[MINOR] Fix ClassCastException when use QuickstartUtils generate data (#2105 )	2020-09-25 10:13:39 -07:00
hongdd	2eaba0962a	[HUDI-544] Archived commits command code cleanup (#1242 ) * Archived commits command code cleanup	2020-09-25 09:36:41 -07:00
dugenkui	6837118c21	[MINOR] Improve description (#2113 )	2020-09-25 22:21:37 +08:00
vinoth chandar	83d2e03cf7	[MINOR] Adding scripts to checkout and push to PRs (#2109 ) - Tested the checkout_pr.sh locally - Tested a dryrun of pr_push_command.sh	2020-09-24 15:01:32 -07:00
wenningd	d37977b310	[MINOR] Remove useless config for bootstrap integ testing (#2102 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-09-22 13:29:59 -07:00
lw0090	fcc497eff1	[HUDI-1268] fix UpgradeDowngrade fs Rename issue for hdfs and aliyun oss (#2099 )	2020-09-22 09:57:20 -07:00
Kaiux	8087016504	[HUDI-1213] Set Default for the bootstrap config : hoodie.bootstrap.full.input.provider (#2087 )	2020-09-22 03:28:19 -07:00
Alexander Filipchik	c8e19e2def	[HUDI-801] Adding a way to post process schema after it is fetched (#1524 ) * [HUDI-801] Adding a way to post process schema after it is fetched Co-authored-by: Alex Filipchik <alex.filipchik@csscompany.com> Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>	2020-09-19 11:18:36 -07:00
Raymond Xu	7c45894f43	[HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2094 ) Migrate deprecated APIs in HoodieTestUtils to HoodieTestTable for test classes - TestClientRollback - TestCopyOnWriteRollbackActionExecutor Use FileCreateUtils APIs in CompactionTestUtils. Then remove unused deprecated APIs after migration.	2020-09-19 17:55:24 +08:00
Pratyaksh Sharma	73e5b4c7bb	[HUDI-796] Add deduping logic for upserts case (#1558 )	2020-09-18 19:37:52 +08:00
Udit Mehrotra	bf65269f66	[HUDI-1230] Fix for preventing MOR datasource jobs from hanging via spark-submit (#2046 )	2020-09-17 20:03:35 -07:00
Raymond Xu	3201665295	[HUDI-995] Use HoodieTestTable in more classes (#2079 ) * [HUDI-995] Use HoodieTestTable in more classes Migrate test data prep logic in - TestStatsCommand - TestHoodieROTablePathFilter Re-implement methods for create new commit times in HoodieTestUtils and HoodieClientTestHarness - Move relevant APIs to HoodieTestTable - Migrate usages After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils	2020-09-17 09:29:07 -07:00
shenh062326	581d54097c	[HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long	2020-09-15 20:58:29 -07:00
liujinhui	6c84ef20ac	[HUDI-1282] Check whether the topic exists before DeltaStreamer consumes Kafka (#2090 )	2020-09-16 10:43:52 +08:00
Balaji Varadarajan	5e61454a6c	[HUDI-802] AWSDmsTransformer does not handle insert and delete of a row in a single batch correctly (#2084 )	2020-09-11 16:11:42 -07:00
Karl-WangSK	a1cff8abae	[HUDI-1255] Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage (#2056 ) Add new Payload(OverwriteNonDefaultsWithLatestAvroPayload) for updating specified fields in storage ## Brief change log Update the current value for only the fields that you want to change. The default payload OverwriteWithLatestAvroPayload overwrites the whole record when compared to `orderingVal`. This doesn't meet our need when we just want to change specified fields. For example: (suppose Default value is null) ``` current Value Field: name age gender Value: karl 20 male ``` ``` insert Value Field: name age gender Value: null 30 null ``` ``` After insert: Field: name age gender Value: karl 30 male ``` ## Verify this pull request Added TestOverwriteNonDefaultsWithLatestAvroPayload to verify the change.	2020-09-09 21:54:21 -07:00
linshan-ma	063a98fc2b	[HUDI-1254] TypedProperties can not get values by initializing an existing properties (#2059 )	2020-09-09 23:42:41 +08:00
Balajee Nagasubramaniam	fec7cd3c97	[HUDI-1130] hudi-test-suite support for schema evolution (can be triggered on any insert/upsert DAG node).	2020-09-08 22:43:59 -07:00
Abhishek Modi	53d1e55110	Test Suite should work with Docker + Unit Tests	2020-09-08 22:41:14 -07:00
wenningd	2fee087f0f	[HUDI-1181] Fix decimal type display issue for record key field (#1953 ) * [HUDI-1181] Fix decimal type display issue for record key field * Remove getNestedFieldVal method from DataSourceUtils * resolve comments Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-09-08 17:50:54 -07:00
Gary Li	e3cf34dff9	Merge pull request #2077 from chuangehh/typofix [MINOR] Fix typo in the pom files	2020-09-08 00:02:08 -07:00
chuangehh	51b16bd36f	[MINOR] fix typo	2020-09-08 11:55:38 +08:00
Prashant Wason	fe7c9e71eb	[MINOR] Fix BindException when running tests of shared machines. (#2070 ) When unit tests are run on shared machines (e.g. jenkins cluster), the unit tests sometimes fail due to BindException in starting HDFS Cluster. This is because the port chosen may have been bound by another process using the same machine. The fix is to retry the port selection a few times.	2020-09-07 19:30:45 -07:00
Raymond Xu	83e39e2b17	[HUDI-781] Add HoodieWriteableTestTable (#2040 ) - Introduce HoodieWriteableTestTable for writing records into files - Migrate writeParquetFiles() in HoodieClientTestUtils to HoodieWriteableTestTable - Adopt HoodieWriteableTestTable for test cases in - ITTestRepairsCommand.java - TestHoodieIndex.java - TestHoodieKeyLocationFetchHandle.java - TestHoodieGlobalBloomIndex.java - TestHoodieBloomIndex.java - Renamed HoodieTestTable and FileCreateUtils APIs - dataFile changed to baseFile	2020-09-07 17:54:36 +08:00
Sreeram Ramji	6537af2676	[HUDI-1153] Spark DataSource and Streaming Write must fail when operation type is misconfigured (#2014 )	2020-09-04 09:08:30 -07:00
Dongwook	8d19ebfd0f	[HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703 ) The delete API does not use "hoodie.delete.shuffle.parallelism", whereas upsert uses "hoodie.upsert.shuffle.parallelism"; in certain cases this creates a performance difference between deleting through the upsert API with "EmptyHoodieRecordPayload" and deleting through the delete API. This patch makes the following fixes in this regard. - Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism" - Repartition inputRDD as "hoodie.delete.shuffle.parallelism" in case "hoodie.combine.before.delete=false"	2020-09-01 12:55:31 -04:00
Gary Li	48a58c98a1	[MINOR] fix get classname for hive sync (#2008 )	2020-08-31 16:26:10 -07:00
Prashant Wason	6461927eac	[HUDI-960] Implementation of the HFile base and log file format. (#1804 ) * [HUDI-960] Implementation of the HFile base and log file format. 1. Includes HFileWriter and HFileReader 2. Includes HFileInputFormat for both snapshot and realtime input format for Hive 3. Unit test for new code 4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported) Advantage: HFile file format saves data as binary key-value pairs. This implementation chooses the following values: 1. Key = Hoodie Record Key (as bytes) 2. Value = Avro encoded GenericRecord (as bytes) HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format. Limitations: HFile storage format has certain limitations when used as a general purpose data storage format. 1. Does not have an implemented reader for Presto and SparkSQL 2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning Other changes: - Remove databricks/avro from pom - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc - Breaking up limitFileSize(), per parquet and hfile base files - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction - Throw UnsupportedException in HFileReader.getRecordKeys() - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise) * Fixing checkstyle Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-08-31 08:05:59 -07:00
Mathieu	6df8f88d86	[HUDI-1252] Remove unused class NoOpBulkInsertPartitioner in DataSourceTestUtils (#2054 )	2020-08-31 03:03:10 -07:00
Thinking Chen	6b417d1a86	[HUDI-1225] Fix: Avro Date logical type not handled correctly when converting to Spark Row (#2047 )	2020-08-29 01:16:42 -07:00
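
The field-level merge described in the [HUDI-1255] commit message above (karl/20/male updated with null/30/null) can be illustrated with a small sketch. This is not Hudi's implementation and does not use Avro or the `OverwriteNonDefaultsWithLatestAvroPayload` class; the class `NonDefaultMergeSketch`, the method `mergeNonDefaults`, and the use of plain maps in place of records are all assumptions made for illustration. It only shows the idea that an incoming field replaces the stored one when it differs from that field's default.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class NonDefaultMergeSketch {

    // Hypothetical helper (not part of Hudi): copies the current record, then
    // overwrites only those fields whose incoming value differs from the default.
    static Map<String, Object> mergeNonDefaults(Map<String, Object> current,
                                                Map<String, Object> incoming,
                                                Map<String, Object> defaults) {
        Map<String, Object> merged = new LinkedHashMap<>(current);
        for (Map.Entry<String, Object> field : incoming.entrySet()) {
            Object defaultValue = defaults.get(field.getKey());
            if (!Objects.equals(field.getValue(), defaultValue)) {
                merged.put(field.getKey(), field.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Defaults are null for every field, as in the commit message's example.
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("name", null);
        defaults.put("age", null);
        defaults.put("gender", null);

        Map<String, Object> current = new LinkedHashMap<>();
        current.put("name", "karl");
        current.put("age", 20);
        current.put("gender", "male");

        Map<String, Object> incoming = new LinkedHashMap<>();
        incoming.put("name", null);
        incoming.put("age", 30);
        incoming.put("gender", null);

        // Only "age" carries a non-default value, so only "age" changes.
        System.out.println(mergeNonDefaults(current, incoming, defaults));
    }
}
```

With the values from the commit message, only `age` is overwritten, while `name` and `gender` keep their stored values.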

1174 Commits