
110 Commits

Author SHA1 Message Date
Balaji Varadarajan
9f18a1ca80 Fixing bugs found during running hoodie demo (#760) 2019-06-28 17:49:23 -07:00
Ho Tien Vu
e48e35385a Added preemptive check for 'spark.scheduler.mode'
When running the docker demo, a NoSuchElementException was thrown because spark.scheduler.mode was not set.
We also want to perform this check before initializing the Spark Context, to avoid polluting the SparkConf
with unused config.
2019-06-25 13:39:41 -07:00
Balaji Varadarajan
a0d7ab2384 HUDI-70 : Making DeltaStreamer run in continuous mode with concurrent compaction 2019-06-18 17:48:14 -07:00
Nishith Agarwal
129e433641 - Upgrading to Hive 2.x
- Eliminating in-memory deltaRecordsMap
- Use writerSchema to generate generic record needed by custom payloads
- changes to make tests work with hive 2.x
2019-06-13 12:46:14 -07:00
Balaji Varadarajan
51d122b5c3 Close opened Hoodie Clients to properly shut down the embedded timeline service 2019-06-11 20:22:14 -07:00
Balaji Varadarajan
479908fd20 HUDI-125 : Change License for all source files and update RAT configurations 2019-06-09 11:41:55 -07:00
Balaji Varadarajan
30b0f2636f Changes related to Licensing work
1. Went through the dependency list once to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects such as Flink do this).
   To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automated generation of NOTICE.txt files so that future package compliance issues can be identified early as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
2019-06-07 17:58:57 -07:00
Vinoth Chandar
7b4a28ecf8 Move dependency repos to https URLs 2019-05-31 20:37:03 -07:00
Vinoth Chandar
acd74129cd Create hoodie-utilities-bundle to host the shaded jar
- hoodie-utilities can now be pulled in as compile time dependency
  - Lets users test their DeltaStreamer transformers, for example
  - Tested the docker demo works & takes in the bundle
  - Doc changes to follow, to move DeltaStreamer commands to bundle jar
2019-05-30 22:46:24 -07:00
vinothchandar
66c0b81b49 [maven-release-plugin] prepare for next development iteration 2019-05-28 19:17:26 -07:00
vinothchandar
227785c022 [maven-release-plugin] prepare release hoodie-0.4.7 2019-05-28 19:17:15 -07:00
leiline
f120427607 HUDI-105 : Fix up offsets not available on leader exception (#650)
* Fix up offsets not available on leader exception
2019-05-23 19:32:31 -07:00
Balaji Varadarajan
2fe526d548 Allow users to set hoodie configs for Compactor, Cleaner and HDFSParquetImporter utility scripts 2019-05-23 17:35:53 -07:00
Balaji Varadarajan
145034c5fa Spark Stage retry handling 2019-05-21 14:49:51 -07:00
Balaji Varadarajan
a7e6cf5197 Support nested types for recordKey, partitionPath and combineKey 2019-05-18 07:14:58 -07:00
Balaji Varadarajan
64fec64097 Timeline Service with Incremental View Syncing support 2019-05-16 13:25:33 -07:00
vinothchandar
446f99aa0f [maven-release-plugin] prepare for next development iteration 2019-05-14 07:29:22 -07:00
vinothchandar
cc38abecc8 [maven-release-plugin] prepare release hoodie-0.4.6 2019-05-14 07:29:11 -07:00
Balaji Varadarajan
6e1e626357 Minor CLI documentation change in delta-streamer 2019-05-14 04:05:47 -07:00
Balaji Varadarajan
9cce9abf4d Fix various errors found by long running delta-streamer tests
1. Parquet Avro schema mismatch errors during ingestion were sometimes silently ignored due to a race condition in BoundedInMemoryExecutor. This was reproducible when running a long-running delta-streamer with the wrong schema, and it caused data loss.
2. Fix behavior of Delta-Streamer to error out by default if there are any error records.
3. Fix a bug in tracking write errors in WriteStats. Earlier, the write errors tracked sampled errors as opposed to total errors.
4. Delta Streamer did not commit the changes made as part of inline compaction because auto-commit was force disabled. Fix this behavior to always auto-commit inline compaction, as it would not otherwise commit.
2019-05-13 10:47:34 -07:00
Abhishek Sharma
e2dcef8606 HUDI-101: added exclusion filters for signature files. 2019-05-07 18:35:18 -07:00
Omkar Joshi
738635306b migrating kryo's dependency from twitter chill to plain kryo library 2019-05-06 20:32:00 -07:00
vinothchandar
57a8b9cc8c Making DataSource/DeltaStreamer use defaults for combining
- Addresses issue where insert will combine and remove duplicates within batch
 - Setting default insert combining to false (write client default)
 - Set to true if filtering duplicates on insert/bulk_insert
2019-05-01 13:21:21 -07:00
Omkar Joshi
e35d24f31d Revert "Replacing Apache commons-lang3 object serializer with Kryo serializer"
This reverts commit a6c45feb2c.
2019-04-17 09:23:37 -07:00
Bhavani Sudha Saktheeswaran
83b6aa5e91 Fix multiple issues when using build_local_docker_images for setting up the demo
Details here - https://issues.apache.org/jira/browse/HUDI-98
2019-04-15 10:10:05 -07:00
Vinoth Chandar
b34a204a52 Fixing small file handling, inline compaction defaults
- Small file limit is now 100MB by default
 - Turned on inline compaction by default for MOR
 - Changes take effect on DataSource and DeltaStreamer
2019-04-03 10:56:10 -07:00
Omkar Joshi
a6c45feb2c Replacing Apache commons-lang3 object serializer with Kryo serializer 2019-03-18 14:12:25 -07:00
Balaji Varadarajan
3ae6cb4ed5 FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly 2019-03-01 10:49:04 -08:00
vinothchandar
687395e40f [maven-release-plugin] prepare for next development iteration 2019-02-27 07:16:27 -08:00
vinothchandar
bbf40ef987 [maven-release-plugin] prepare release hoodie-0.4.5 2019-02-27 07:16:15 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
 (2) Add Hudi Incremental Source for Delta Streamer
 (3) Allow delta-streamer config-property to be passed as command-line
 (4) Add Hive Integration to Delta-Streamer and address Review comments
 (5) Ensure MultiPartKeysValueExtractor handles Hive-style partition description
 (6) Reuse same spark session on both source and transformer
 (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
 (8) Reuse Binary Avro coders
 (9) Add push down filter for Incremental source
 (10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00
arukavytsia
6946dd7557 General enhancements 2018-12-18 12:52:39 -08:00
Balaji Varadarajan
07324e7a20 Compaction validate, unschedule and repair 2018-10-25 14:12:47 -07:00
Vinoth Chandar
1fca9b21cc Add --filter-dupes to DeltaStreamer
- Optionally filter out duplicates before inserting data
 - Unit tests
2018-10-04 11:25:18 +05:30
vinothchandar
7ba842c0fe [maven-release-plugin] prepare for next development iteration 2018-09-28 11:27:00 +05:30
vinothchandar
5847b61f44 [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-28 11:26:15 +05:30
vinothchandar
9ca6f91e97 Perform consistency checks during write finalize
- Check to ensure written files are listable on storage
 - Docs reflected to capture how this helps with s3 storage
 - Unit tests added, corrections to existing tests
 - Fix DeltaStreamer to manage archived commits in a separate folder
2018-09-28 08:04:41 +05:30
Balaji Varadarajan
4c74dd4cad Travis CI tests need to be run in quieter mode (WARN log level) to avoid max log-size errors 2018-09-26 21:10:20 +05:30
Vinoth Chandar
bd5af89f12 [maven-release-plugin] rollback the release of hoodie-0.4.4 2018-09-13 15:01:53 +05:30
Vinoth Chandar
d1cc864a43 [maven-release-plugin] prepare for next development iteration 2018-09-12 23:59:47 +05:30
Vinoth Chandar
b748bc836d [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-12 23:59:34 +05:30
Vinoth Chandar
a5359662be Moving dependencies off cdh to apache + Hive2 support
- Tests redone in the process
 - Main changes are to RealtimeRecordReader and how it treats maps/arrays
 - Make hive sync work with Hive 1/2 and CDH environments
 - Fixes to make corner cases work for Hive queries
 - Spark Hive integration - Working version across Apache and CDH versions
 - Known Issue - https://github.com/uber/hudi/issues/439
2018-09-11 11:03:30 +05:30
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
 - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
 - Remove dependency on ConstructorUtils
 - Throw error if ordering value is not present, during key generation
 - Switch to shade plugin for hoodie-utilities
 - Added support for consumption for Confluent avro kafka serdes
 - Support for Confluent schema registry
 - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
 - Added support for BULK_INSERT operations as well
 - Pass in the payload class config properly into HoodieWriteClient
 - Fix documentation based on new usage
 - Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Balaji Varadarajan
e2dee68ccd Simplify and fix CLI to schedule and run compactions 2018-09-07 05:28:13 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the Parquet max file size, and if there is room for more inserts, they are added. 2018-09-06 08:52:08 +08:00
Vinoth Chandar
89cd6b0726 [maven-release-plugin] prepare for next development iteration 2018-08-22 21:30:05 -07:00
Vinoth Chandar
8d305c5a86 [maven-release-plugin] prepare release hoodie-0.4.3 2018-08-22 21:29:53 -07:00
Balaji Varadarajan
594059a19c Add CLI support to inspect, schedule and run compaction 2018-08-07 08:19:50 -07:00
Vinoth Chandar
34827d50e1 [maven-release-plugin] prepare for next development iteration 2018-06-11 08:59:13 -07:00
Vinoth Chandar
43ef385730 [maven-release-plugin] prepare release hoodie-0.4.2 2018-06-11 08:59:02 -07:00