63 Commits

Author SHA1 Message Date
n3nash
e109a61803 1. Fix merge on read DAG to make docker demo pass (#2092)
2. Fix repeat_count, rollback node
2020-10-28 22:34:26 -04:00
Prashant Wason
788d236c44 [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
1. Use the DAG Node's label from the yaml as its name instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor, which was not correctly implemented.
3. When generating upserts, allow more granular control over the number of inserts and upserts: zero or more inserts and upserts can be specified instead of always requiring both.
4. Fixed generation of records of specific size
   - The current code was using a class variable "shouldAddMore" which was reset to false after the first record generation causing subsequent records to be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields. When generating records, for complex fields we read the field size from this map.
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older Dag Nodes is ignored (as their AVRO files will have an older timestamp).
6. Make --workload-generator-classname an optional parameter, since the default will most likely be used.
2020-10-07 08:33:51 -04:00
shenh062326
581d54097c [HUDI-1143] Change timestamp field in HoodieTestDataGenerator from double to long 2020-09-15 20:58:29 -07:00
Abhishek Modi
53d1e55110 Test Suite should work with Docker + Unit Tests 2020-09-08 22:41:14 -07:00
Dongwook
8d19ebfd0f [HUDI-993] Let delete API use "hoodie.delete.shuffle.parallelism" (#1703)
The delete API does not use "hoodie.delete.shuffle.parallelism", whereas the upsert API uses "hoodie.upsert.shuffle.parallelism". In certain cases this creates a performance difference between deleting through the upsert API with "EmptyHoodieRecordPayload" and deleting through the delete API.

This patch makes the following fixes in this regard. 
- Let deduplicateKeys method use "hoodie.delete.shuffle.parallelism"
- Repartition inputRDD to "hoodie.delete.shuffle.parallelism" partitions when "hoodie.combine.before.delete=false"
2020-09-01 12:55:31 -04:00
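The parallelism handling described in the HUDI-993 entry above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Hudi's code: the function `plan_delete`, its return shape, and the assumed default for the combine flag are hypothetical, and only the two configuration keys come from the commit message.

```python
# Hypothetical sketch: only the two config keys below appear in the commit message.
DELETE_PARALLELISM = "hoodie.delete.shuffle.parallelism"
COMBINE_BEFORE_DELETE = "hoodie.combine.before.delete"


def plan_delete(keys, config):
    """Return (keys_to_delete, num_partitions) for a delete request.

    Assumes `config` is a dict of string values that contains the delete
    parallelism key. Treating a missing combine flag as "true" is an
    assumption made for this sketch.
    """
    parallelism = int(config[DELETE_PARALLELISM])
    combine = config.get(COMBINE_BEFORE_DELETE, "true").lower() == "true"
    if combine:
        # Deduplicate the keys, keeping first-seen order.
        keys = list(dict.fromkeys(keys))
    # In both branches the work is sized by the delete parallelism,
    # not by the upsert parallelism.
    return list(keys), parallelism
```

With the combine flag set to "false", the keys pass through unchanged but are still spread over the delete parallelism, which mirrors the repartition step named in the commit message.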
Bhavani Sudha Saktheeswaran
4226d75144 Moving to 0.6.1-SNAPSHOT on master branch. 2020-08-14 12:54:15 -07:00
Sivabalan Narayanan
9c24151929 [HUDI-1175] Commenting out testsuite tests from Integration tests until we investigate the CI flakiness (#1945) 2020-08-10 21:00:57 -07:00
lw0090
51ea27d665 [HUDI-875] Abstract hudi-sync-common, and support hudi-hive-sync, hudi-dla-sync (#1810)
- Generalize the hive-sync module for syncing to multiple metastores
- Added new options for datasource
- Added new command line for delta streamer 

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-05 21:34:55 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
- [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
- [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-421] Bootstrap Write Client with tests
- [HUDI-425] Added HoodieDeltaStreamer support
- [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
- [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
- [HUDI-424] Simplify Record reader implementation
- [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
- [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
n3nash
727f1df62c [MINOR] Suppressing spark logs for hudi-integ and hudi-utilities (#1894) 2020-07-31 19:01:25 -07:00
Nishith Agarwal
2fc2b01d86 [HUDI-394] Provide a basic implementation of test suite 2020-07-30 21:21:15 -07:00
hongdd
fa419213f6 [HUDI-703] Add test for HoodieSyncCommand (#1774) 2020-07-28 08:31:43 +08:00
sathyaprakashg
df2e0c760e HUDI-942 Increase default value number of delta commits for inline compaction (#1664)
Co-authored-by: Sathyaprakash Govindasamy <sathyaprakashg@zillowgroup.com>
2020-06-10 16:16:44 -07:00
Vinoth Govindarajan
8cb86b4d36 Added python3 to the spark_base docker image to support pyspark (#1632) 2020-05-31 22:53:50 -07:00
Satish Kotha
1f6be820f3 [HUDI-758] Modify Integration test to include incremental queries for MOR tables 2020-04-08 21:56:59 -07:00
lamber-ken
90227eeda7 [HUDI-673] Rename hudi-hive-bundle to hudi-hive-sync-bundle 2020-03-07 21:44:35 +08:00
lamber-ken
ccbf543607 [HUDI-654] Rename hudi-hive to hudi-hive-sync 2020-03-06 22:13:16 +08:00
yanghua
0dc8e493aa Moving to 0.6.0-SNAPSHOT on master branch. 2020-03-01 15:08:30 +08:00
lamber-ken
323fffad0d [HUDI-606] Improve execute build_local_docker_images.sh script 2020-02-26 19:38:19 +08:00
lamber-ken
11fb2c2614 [HUDI-580] Fix incorrect license header in files 2020-02-25 08:54:26 -08:00
lamber-ken
cdb028f1f3 [MINOR] Fix missing groupId / version property of dependency 2020-01-25 09:19:55 -08:00
leesf
fc8d4a71ad [MINOR] fix license issue (#1273) 2020-01-23 02:03:49 +08:00
leesf
ed54eb20a5 [MINOR] Add missing licenses (#1271) 2020-01-22 08:06:45 -05:00
lamber-ken
a54535ed5a [MINOR] Fix invalid maven repo address (#1265) 2020-01-21 04:41:59 -08:00
leesf
6e59c1c777 Moving to 0.5.2-SNAPSHOT on master branch. 2020-01-20 10:51:33 -08:00
wenningd
292c1e2ff4 [HUDI-238] Make Hudi support Scala 2.12 (#1226)
* [HUDI-238] Rename scala related artifactId & add maven profile to support Scala 2.12
2020-01-17 14:02:21 -08:00
vinoth chandar
c2c0f6b13d [HUDI-509] Renaming code in sync with cWiki restructuring (#1212)
- Storage Type replaced with Table Type (remaining instances)
- View types replaced with query types
- ReadOptimized view referred to as Snapshot Query
- TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
- HoodieDataFile renamed to HoodieBaseFile
- Hive Sync tool will register RO tables for MOR with a `_ro` suffix
- Datasource/Deltastreamer options renamed accordingly
- Support fallback to old config values as well, so migration is painless
- Config for controlling _ro suffix addition
- Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
2020-01-16 23:58:47 -08:00
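The "fallback to old config values" item in the HUDI-509 entry above describes a common migration pattern: read the renamed key first, then fall back to the deprecated one. The following is a minimal sketch of that pattern in plain Python. The helper `get_with_fallback` and both key names are hypothetical and chosen only for illustration; they are not taken from Hudi.

```python
def get_with_fallback(config, new_key, old_key, default=None):
    """Look up `new_key`, falling back to the deprecated `old_key`.

    Hypothetical helper: the new key wins when both are present, so
    existing configs keep working while users migrate.
    """
    if new_key in config:
        return config[new_key]
    if old_key in config:
        return config[old_key]
    return default


# Hypothetical key names, used only to exercise the helper.
NEW_KEY = "example.table.type"
OLD_KEY = "example.storage.type"

legacy_config = {OLD_KEY: "MERGE_ON_READ"}
table_type = get_with_fallback(legacy_config, NEW_KEY, OLD_KEY, "COPY_ON_WRITE")
```

Preferring the new key when both are set keeps the behavior predictable during a partial migration.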
yuehan124
c78092d2d3 [HUDI-501] Execute docker/setup_demo.sh in any directory 2020-01-06 10:26:06 -08:00
lamber-ken
d9fbe33339 [HOTFIX] Fix error configuration item of dockerfile-maven-plugin 2019-11-19 16:30:03 -08:00
Balaji Varadarajan
f7c2f8cedc [HUDI-329] Presto Containers for integration test must allow newly built local jars to override 2019-11-13 17:35:34 -08:00
Mehrotra
92c69f5703 Migrate integration tests to spark 2.4.4 2019-11-13 16:53:24 -08:00
Sivabalan Narayanan
23b303e4b1 [HUDI-218] Adding Presto support to Integration Test (#1003) 2019-11-11 06:21:49 -08:00
Balaji Varadarajan
a6390aefc4 [HUDI-312] Make docker hdfs cluster ephemeral. This is needed to fix flakiness in integration tests. Also fix DeltaStreamer hanging issue due to uncaught exception 2019-11-01 11:49:59 -07:00
leesf
b19bed442d [HUDI-296] Explore use of spotless to auto fix formatting errors (#945)
- Add spotless format fixing to project
- One time reformatting for conformity
- Build fails for formatting changes and mvn spotless:apply autofixes them
2019-10-10 05:19:40 -07:00
Balaji Varadarajan
9b66ea41fd [HUDI-121] Remove leftover notice file and replace com.uber.hoodie with org.apache.hudi in log4j properties 2019-10-04 09:18:57 -07:00
Balaji Varadarajan
6da2f9ac7c [HUDI-287] Address comments during review of release candidate
1. Remove LICENSE and NOTICE files in hoodie child modules.
2. Remove developers and contributor section from pom
3. Also ensure any failures in the validation script are reported appropriately
4. Make hoodie parent pom consistent with that of its parent apache-21 (https://github.com/apache/maven-apache-parent/blob/apache-21/pom.xml)
2019-10-03 09:00:07 -07:00
Balaji Varadarajan
6e8a28bcae HUDI-121 : Address comments during RC2 voting
1. Remove dnl utils jar from git
2. Add LICENSE Headers in missing files
3. Fix NOTICE and LICENSE in all HUDI packages and in top-level
4. Fix License wording in certain HUDI source files
5. Include non java/scala code in RAT licensing check
6. Use whitelist to include dependencies as part of timeline-server bundling
2019-09-30 15:42:15 -07:00
Balaji Varadarajan
c1e7d0e5a6 [HUDI-121] Update Release notes and fix master version 2019-09-17 09:50:30 -07:00
Balaji Varadarajan
7190c022bb [HUDI-249] Updating Notice files 2019-09-13 13:50:58 -07:00
Balaji Varadarajan
d2525c31b7 Moving to 0.6.0-SNAPSHOT on master branch. 2019-09-13 09:58:29 -07:00
Balaji Varadarajan
58623631d4 [HUDI-249] Update Release-notes. Add sign-artifacts to POM and release related scripts. Add missing license headers 2019-09-13 08:41:29 -07:00
leesf
8b150a3c6b [HUDI-230] Add missing Apache License in some files 2019-08-30 09:38:28 -07:00
Balaji Varadarajan
5f9fa82f47 HUDI-124 : Exclude jdk.tools from hadoop-common and update Notice files (#858) 2019-08-28 16:20:47 -07:00
Vinoth Chandar
78e0721507 [HUDI-159] Precursor cleanup to reduce build warnings 2019-08-26 19:41:00 -07:00
vinoth chandar
6edf0b9def [HUDI-68] Pom cleanup & demo automation (#846)
- [HUDI-172] Cleanup Maven POM/Classpath
  - Fix ordering of dependencies in poms, to enable better resolution
  - Idea is to place more specific ones at the top
  - And place dependencies which use them below them
- [HUDI-68] : Automate demo steps on docker setup
  - Move hive queries from hive cli to beeline
  - Standardize on taking query input from text command files
  - Deltastreamer ingest, also does hive sync in a single step
  - Spark Incremental Query materialized as a derived Hive table using datasource
  - Fix flakiness in HDFS spin up and output comparison
  - Code cleanup around streamlining and loc reduction
  - Also fixed pom to not shade some hive classes in spark, to enable hive sync
2019-08-22 20:18:50 -07:00
Bhavani Sudha Saktheeswaran
92eed6aca8 [HUDI-82] Adds Presto integration in Docker demo (#847) 2019-08-22 19:40:36 -07:00
Balaji Varadarajan
a4f9d7575f HUDI-123 Rename code packages/constants to org.apache.hudi (#830)
- Rename com.uber.hoodie to org.apache.hudi
- Flag to pass com.uber.hoodie Input formats for hoodie-sync
- Works with HUDI demo. 
- Also tested for backwards compatibility with datasets built by com.uber.hoodie packages
- Migration guide : https://cwiki.apache.org/confluence/display/HUDI/Migration+Guide+From+com.uber.hoodie+to+org.apache.hudi
2019-08-11 17:48:17 -07:00
Balaji Varadarajan
ec965892b0 HUDI-149 - Remove platform dependencies and update NOTICE plugin 2019-08-05 08:57:15 -07:00
Balaji Varadarajan
479908fd20 HUDI-125 : Change License for all source files and update RAT configurations 2019-06-09 11:41:55 -07:00
Balaji Varadarajan
30b0f2636f Changes related to Licensing work
1. Went through the dependency list once to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects, such as Flink, do this).
   To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating NOTICE.txt files so that future package compliance issues can be identified early as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
2019-06-07 17:58:57 -07:00