Commit Graph

1212 Commits

Author SHA1 Message Date
wangxianghu
a23230c8c2 [HUDI-1400] Replace Operation enum with WriteOperationType (#2259) 2020-11-19 13:40:04 +08:00
wangxianghu
4d05680038 [HUDI-1327] Introduce base implementation of hudi-flink-client (#2176) 2020-11-18 17:57:11 +08:00
Karl_Wang
430d4b428e [HUDI-1377] remove duplicate code (#2235) 2020-11-10 10:08:08 -08:00
Balaji Varadarajan
42b6aeca28 [HUDI-1358] Fix Memory Leak in HoodieLogFormatWriter (#2217) 2020-11-09 19:26:13 -08:00
wenningd
0364498ae3 [HUDI-1375] Fix bug in HoodieAvroUtils.removeMetadataFields() method (#2232)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-11-05 17:30:17 -08:00
satishkotha
33ec88fc38 [HUDI-1352] Add FileSystemView APIs to query pending clustering operations (#2202) 2020-11-05 08:49:58 -08:00
lw0090
5f5c15b0d9 [HUDI-892] RealtimeParquetInputFormat skip adding projection columns if there are no log files (#2190)
* [HUDI-892] RealtimeParquetInputFormat skip adding projection columns if there are no log files
* [HUDI-892]  for test
* [HUDI-892]  fix bug generate array from split
* [HUDI-892] revert test log
2020-11-02 20:00:12 -08:00
wangxianghu
d160abb437 [HUDI-912] Refactor and relocate KeyGenerator to support more engines (#2200)
* [HUDI-912] Refactor and relocate KeyGenerator to support more engines

* Rename KeyGenerators
2020-11-02 13:12:51 -08:00
Venkatesh Rudraraju
59f995a3f5 Use RateLimiter instead of sleep. Repartition WriteStatus to optimize Hbase index writes (#1484) 2020-11-02 08:33:27 -08:00
Sivabalan Narayanan
a205dd10fa [HUDI-1338] Adding Delete support to test suite framework (#2172)
- Adding Delete support to test suite.
  - Added DeleteNode
  - Added support to generate delete records
2020-11-01 00:15:41 -04:00
Prashant Wason
6310a2307a [HUDI-1351] Improvements to the hudi test suite for scalability and repeated testing. (#2197)
1. Added the --clean-input and --clean-output parameters to clean the input and output directories before starting the job
2. Added the --delete-old-input parameter to delete older batches for data already ingested. This helps keep the number of redundant files low.
3. Added the --input-parallelism parameter to restrict the parallelism when generating input data. This helps keep the number of generated input files low.
4. Added an option start_offset to Dag Nodes. Without the ability to specify start offsets, data is generated into existing partitions. With a start offset, the DAG can control which partition the data is written to.
5. Fixed generation of records for correct number of partitions
  - In the existing implementation, the partition is chosen as a random long. This does not guarantee that the exact number of requested partitions is created.
6. Changed variable blacklistedFields to be a Set as that is faster than List for membership checks.
7. Fixed integer division for Math.ceil. If two integers are divided, the result is not a double unless one of the integers is cast to double.
2020-10-29 06:50:37 -07:00
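Item 7 of commit 6310a2307a above describes a general Java pitfall rather than anything specific to Hudi. The following is a minimal sketch of that pitfall, not the code from the commit: the class name `CeilDivisionSketch` and both method names are hypothetical and chosen only for illustration.

```java
// Hypothetical illustration; not taken from the Hudi code base.
final class CeilDivisionSketch {

    // Pitfall: records / batchSize is integer division, so the fraction is
    // already discarded before Math.ceil receives the value.
    static int batchesTruncated(int records, int batchSize) {
        return (int) Math.ceil(records / batchSize);
    }

    // Fix: casting one operand to double makes the division floating-point,
    // so Math.ceil can round the fractional result up.
    static int batchesWithCast(int records, int batchSize) {
        return (int) Math.ceil((double) records / batchSize);
    }

    public static void main(String[] args) {
        System.out.println(batchesTruncated(10, 3)); // prints 3
        System.out.println(batchesWithCast(10, 3));  // prints 4
    }
}
```

When both operands are non-negative ints, the integer-only form `(records + batchSize - 1) / batchSize` is a common alternative that avoids floating point altogether.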
liujinhui
736a940854 [HUDI-1274] Make hive synchronization support hourly partitions (#2122) 2020-10-29 11:29:50 +08:00
n3nash
e109a61803 1. Fix merge on read DAG to make docker demo pass (#2092)
2. Fix repeat_count, rollback node
2020-10-28 22:34:26 -04:00
wangxianghu
e206ddd431 [MINOR] Make the NoArgsConstructor of SparkMergeHelper private and clean up code (#2194) 2020-10-26 12:22:11 +08:00
lw0090
8545ea3856 [HUDI-1118] Cleanup rollback files residing in .hoodie folder (#2205) 2020-10-25 21:04:56 -07:00
Prashant Wason
49e855c348 [HUDI-1326] Added an API to force publish metrics and flush them. (#2152)
* [HUDI-1326] Added an API to force publish metrics and flush them.

Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite.

* Code cleanups

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-24 16:47:24 -07:00
Raymond Xu
14c4611857 [MINOR] Fix caller to SparkBulkInsertCommitActionExecutor (#2195)
Fixed calling the wrong constructor
2020-10-21 19:50:10 -07:00
Shen Hong
49407169ac [HUDI-1209] Properties File must be optional when running deltastreamer (#2085) 2020-10-21 17:49:28 -07:00
Pratyaksh Sharma
e4931744eb [HUDI-1200] fixed NPE in CustomKeyGenerator (#2093)
- config field is no longer transient in key generator
- verified that the key generator object is shipped from the driver to executors just once and reused for each record
2020-10-20 23:36:25 -07:00
Ho Tien Vu
af5ef4d49d [HUDI-1330] handle prefix filtering at directory level (#2157)
The current DFSPathSelector only ignores prefixes (_, .) at the file level, while files under subdirectories,
e.g. (.checkpoint/*), are still considered, which results in a bad-format exception during reading.
2020-10-20 23:20:19 -07:00
Ho Tien Vu
fd269ddeb0 [MINOR] Make sure factory method is used to instantiate DFSPathSelector (#2187)
* Move createSourceSelector into DFSPathSelector factory method
* Replace constructor call with factory method
* Added some javadoc
2020-10-20 17:52:31 +08:00
Bhavani Sudha Saktheeswaran
6490b029dd [HUDI-1345] Remove Hbase and htrace relocation from utilities bundle (#2185) 2020-10-19 16:11:08 -05:00
lw0090
4d80e1e221 [HUDI-284] add more tests for UpdateSchemaEvolution (#2127)
Unit test different schema evolution scenarios.
2020-10-19 07:38:04 -07:00
Guy Khazma
35d406de40 [HUDI-1344] IBM Cloud Object Storage Support (#2182) 2020-10-18 17:24:53 +08:00
lw0090
ec6267c303 [HUDI-307] add test to check that timestamp, date, and decimal types are written and read consistently (#2177) 2020-10-18 17:18:50 +08:00
rmpifer
a44f66869f [HUDI-1289] Remove relocation of pattern for hbase dependencies and add shading of guava in hadoop, spark, and presto bundles (#2147)
- Update hudi-spark-bundle pom to not relocate hbase and htrace pattern
- Remove codec relocation, which was causing an error because codec is not included in the bundle
2020-10-14 17:04:35 -07:00
satishkotha
7fa641ea9a [HUDI-1302] Add support for timestamp field in HiveSync (#2129) 2020-10-13 22:58:00 -07:00
wangxianghu
c7d962efff [HUDI-1328] Introduce HoodieFlinkEngineContext to hudi-flink-client (#2161) 2020-10-14 09:30:49 +08:00
lw0090
b66c3ef23a [HUDI-1298] Add better error messages when IOException occurs during log file reading (#2133) 2020-10-13 00:45:10 -07:00
satishkotha
0d407342ef [HUDI-1304] Add unit test for testing compaction on replaced file groups (#2150) 2020-10-12 16:48:29 -07:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
hj2016
c0472d3317 [HUDI-1184] Fix the support of hbase index partition path change (#1978)
When the hbase index is used and a record's partition is changed to another partition, the path does not change according to the value of the partition column

Co-authored-by: huangjing <huangjing@clinbrain.com>
2020-10-11 19:05:57 -07:00
dugenkui
b58daf29ba [MINOR] remove unused generics type (#2163) 2020-10-11 18:38:42 -07:00
lw0090
2126f13e13 [HUDI-791] Replace null by Option in Delta Streamer (#2171) 2020-10-11 18:29:57 -07:00
dugenkui
032bc3b08f [MINOR] NPE Optimization for Option (#2158) 2020-10-11 17:55:41 -07:00
dugenkui
d4d4c8c899 [MINOR] Fix typo and others (#2164)
* remove HoodieSerializationException that will never be thrown
* remove unused method, make HoodieException more readable
* fix typo
2020-10-11 17:52:44 -07:00
lw0090
86db4da33c [HUDI-1339] delete useless import in hudi-spark module (#2173) 2020-10-11 17:10:52 -07:00
lw0090
585ce0094d [HUDI-1301] support schema version when querying a hudi dataset in spark INCREMENTAL mode (#2125) 2020-10-10 20:53:41 +08:00
vinoyang
eafd7bf289 [MINOR] Fix wrong javadoc and refactor some naming issues (#2156) 2020-10-09 15:09:26 -07:00
dugenkui
00271af64e [MINOR] Fix typo (#2159)
* fix typo

* fix typo
2020-10-09 14:52:55 -07:00
Raymond Xu
1d1d91d444 [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2143)
* [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

Remove APIs in `HoodieTestUtils`
- listAllDataFilesAndLogFilesInPath
- listAllLogFilesInPath
- listAllDataFilesInPath
- writeRecordsToLogFiles
- createCleanFiles
- createPendingCleanFiles

Migrate the callers to use `HoodieTestTable` and `HoodieWriteableTestTable` with new APIs added
- listAllBaseAndLogFiles
- listAllLogFiles
- listAllBaseFiles
- withLogAppends
- addClean
- addInflightClean

Also added related APIs in `FileCreateUtils`
- createCleanFile
- createRequestedCleanFile
- createInflightCleanFile
2020-10-09 10:21:27 +08:00
Prashant Wason
788d236c44 [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which was not correctly implemented
3. When generating upserts, allow more granular control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both inserts and upserts.
4. Fixed generation of records of specific size
   - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Making --workload-generator-classname an optional parameter as most probably the default will be used
2020-10-07 08:33:51 -04:00
Pratyaksh Sharma
524193eb4b [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode (#1566)
Co-authored-by: Balaji Varadarajan <balaji.varadarajan@robinhood.com>
2020-10-06 20:34:03 -07:00
rmpifer
fed01cd3c9 [MINOR] Update spark master default to yarn (#2148) 2020-10-05 15:22:28 -07:00
lw0090
fdae388626 [HUDI-1203] add port configuration for EmbeddedTimelineService (#2142) 2020-10-05 11:36:54 -07:00
Shen Hong
b335459c80 [HUDI-1208] Ordering Field should be optional when precombine is turned off (#2088) 2020-10-04 11:34:21 -07:00
Pratyaksh Sharma
080ba3ed54 [HUDI-1199] relocated jetty in hudi-utilities-bundle pom (#1990)
* [HUDI-1199]: relocated jetty in hudi-utilities-bundle pom

* [HUDI-1199]: re trigger travis build
2020-10-04 11:22:01 -07:00
Prashant Wason
6c610b91ef [HUDI-1305] Added an API to shutdown and remove the metrics reporter. (#2132)
This helps in removing the reporter once the test has completed. Prevents log pollution from unnecessary metric logs.

- Added an API to shutdown the metrics reporter after tests.
2020-10-04 09:30:04 -07:00
Mathieu
1f7add9291 [HUDI-1089] Refactor hudi-client to support multi-engine (#1827)
- This change breaks `hudi-client` into `hudi-client-common` and `hudi-spark-client` modules 
- Simple usages of Spark using jsc.parallelize() have been redone using EngineContext#map, EngineContext#flatMap etc
- Code changes in the PR break classes into `BaseXYZ` parent classes with no spark dependencies living in `hudi-client-common`
- Classes on `hudi-spark-client` are named `SparkXYZ` extending the parent classes with all the Spark dependencies
- To simplify/cleanup, HoodieIndex#fetchRecordLocation has been removed and its usages in tests replaced with alternatives

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-10-01 14:25:29 -07:00
vinoyang
5aaaf8bff1 [MINOR] Change the log level of the dag scheduler for the test suite (#2134) 2020-09-30 17:17:44 +08:00