1
0
Commit Graph

170 Commits

Author SHA1 Message Date
Balaji Varadarajan
8adaca3454 Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions 2019-02-11 18:30:21 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
 (2) Add Hudi Incremental Source for Delta Streamer
 (3) Allow delta-streamer config-property to be passed as command-line
 (4) Add Hive Integration to Delta-Streamer and address Review comments
 (5) Ensure MultiPartKeysValueExtractor  handle hive style partition description
 (6) Reuse same spark session on both source and transformer
 (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
 (8) Reuse Binary Avro coders
 (9) Add push down filter for Incremental source
 (10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00
Nishith Agarwal
7985eb72b5 Fixing behavior of Merge/CreateHandle for invalid/wrong schema records 2019-01-28 16:01:03 -08:00
Nishith Agarwal
994d42d307 cleaner should now use commit timeline and not include deltacomits 2019-01-28 10:46:33 -08:00
Nishith Agarwal
68723764ed Adding compaction to HoodieClient example 2019-01-28 10:23:44 -08:00
Nishith Agarwal
169e3f66bb Filtering partition paths before performing a list status on all partitions 2019-01-25 11:34:00 -08:00
Nishith Agarwal
110df7190b Enabling hard deletes for MergeOnRead table type 2018-12-31 12:49:58 -08:00
arukavytsia
6946dd7557 General enhancements 2018-12-18 12:52:39 -08:00
Balaji Varadarajan
30c5f8b7bd Ensure Hoodie works for non-partitioned Hive table 2018-12-12 13:35:16 -08:00
xubo245
466ff73ffb fix some spell errorin Hudi 2018-12-12 13:06:25 -08:00
Nishith Agarwal
7243ce40c9 Serializing the complete payload object instead of serializing just the GenericRecord
Removing Converter hierarchy as we now depend purely on JavaSerialization and require the payload to be java serializable
2018-12-04 11:43:41 -08:00
Nishith Agarwal
e83dde3b95 Returning empty Statues for an empty spark partition caused due to incorrect bin packing 2018-12-04 11:41:38 -08:00
Balaji Varadarajan
f999e4960c Avoid WriteStatus collect() call when committing batch 2018-11-28 10:41:49 -08:00
Vinoth Chandar
fa65db9c4c Explicitly handle lack of append() support during LogWriting 2018-11-27 17:58:43 -08:00
Nishith Agarwal
d0fde47458 Fixing number of insert buckets to be generated by rounding off to the closest greater integer 2018-11-15 10:04:45 -08:00
Vinoth Chandar
1362942aa3 Enabling auto tuning of insert splits by default 2018-11-08 09:48:23 -08:00
Balaji Varadarajan
07324e7a20 Compaction validate, unschedule and repair 2018-10-25 14:12:47 -07:00
jiale.tan
1628d044ac feat(SparkDataSource): add additional feature to drop later arriving dups 2018-10-16 11:52:50 -07:00
jiale.tan
98fd97b65f feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup 2018-09-29 19:55:20 +05:30
vinothchandar
9ca6f91e97 Perform consistency checks during write finalize
- Check to ensure written files are listable on storage
 - Docs reflected to capture how this helps with s3 storage
 - Unit tests added, corrections to existing tests
 - Fix DeltaStreamer to manage archived commits in a separate folder
2018-09-28 08:04:41 +05:30
Balaji Varadarajan
4c74dd4cad Travis CI tests needs to be run in quieter mode (WARN log level) to avoid max log-size errors 2018-09-26 21:10:20 +05:30
Balaji Varadarajan
5cb28e7b1f Explicitly release resources in LogFileReader and TestHoodieClientBase 2018-09-20 13:24:57 +05:30
Balaji Varadarajan
605af8a82f Reduce minimum delta-commits required for compaction 2018-09-12 01:23:28 +05:30
Vinoth Chandar
eca49a255e Rebasing and fixing conflicts against master 2018-09-11 11:03:30 +05:30
Nishith Agarwal
2b1af18941 Adding check for rolling stats not present to handle backwards compatibility of existing timeline 2018-09-10 11:53:46 +08:00
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
 - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
 - Remove dependency on ConstructorUtils
 - Throw error if ordering value is not present, during key generation
 - Switch to shade plugin for hoodie-utilities
 - Added support for consumption for Confluent avro kafka serdes
 - Support for Confluent schema registry
 - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
 - Added support for BULK_INSERT operations as well
 - Pass in the payload class config properly into HoodieWriteClient
 - Fix documentation based on new usage
 - Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Nishith Agarwal
0fe92dee55 Fix a failing test case intermittenly in TestMergeOnRead due to incorrect prev commit time 2018-09-08 09:39:18 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size and if there is scope to add inserts the add it. 2018-09-06 08:52:08 +08:00
Nishith Agarwal
324de298bc Removing dependency on apache-commons lang 3, adding necessary classes as needed 2018-09-06 08:26:48 +08:00
Kaushik Devarajaiah
e624480259 Throttling to limit QPS from HbaseIndex 2018-08-21 21:10:38 -07:00
Nishith Agarwal
3746ace76a Fixing Null pointer exception in finally block 2018-08-21 21:07:53 -07:00
Nishith Agarwal
88274b8261 Adding another metric to HoodieWriteStat to determine if there were inserts converted to updates, added one test for this 2018-08-14 06:22:16 -07:00
Balaji Varadarajan
ea23c9b7a0 Minor bug fixes found during testing 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
2e12c86d01 Ensure Compaction Operation compacts the data file as defined in the workload 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
2f8ce93030 Async Compaction Main API changes 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
9b78523d62 Ensure Cleaner and Archiver do not delete file-slices and workload marked for compaction 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
0a0451a765 Ensure Compaction workload is stored in write-once meta-data files separate from timeline files.
This avoids concurrency issues when compactor(s) and ingestor are running in parallel.
    In the Next PR -> Safety concern regarding Cleaner retaining all meta-data and file-slices for pending compactions will be addressed
2018-08-07 08:19:50 -07:00
Balaji Varadarajan
9d99942564 Track fileIds with pending compaction in FileSystemView to provide correct API semantics 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
1b61f04e05 (1) Define CompactionWorkload in avro to allow storing them in instant files.
(2) Split APIs in HoodieRealtimeCompactor to separate generating compaction workload from running compaction
2018-08-07 08:19:50 -07:00
Balaji Varadarajan
6d01ae8ca0 FileSystemView and Timeline level changes to support Async Compaction 2018-08-07 08:19:50 -07:00
Nishith Agarwal
44caf0d40c Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled 2018-07-18 12:53:35 -07:00
Omkar Joshi
f62890ca1f adding setters so that subclasses can set it 2018-07-18 12:53:11 -07:00
Nishith Agarwal
34ab54a9d3 Fixing bug introducted in rollback for MOR table type with inserts into log files 2018-07-17 17:20:34 -07:00
Nishith Agarwal
a6fe96fdfe Changing Day based compaction strategy to be IO agnostic 2018-06-18 15:22:56 -07:00
Nishith Agarwal
3da063f83b Adding ability for inserts to be written to log files 2018-06-11 14:08:59 -07:00
Vinoth Chandar
85dd265b7b Improving out of box experience for data source
- Fixes #246
 - Bump up default parallelism to 1500, to handle large upserts
 - Add docs on s3 confuration & tuning tips with tested spark knobs
 - Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
 - Improve speed of ROTablePathFilter by removing directory check
 - Move to spark-avro 4.0 to handle issue with nested fields with same name
 - Keep AvroConversionUtils in sync with spark-avro 4.0
2018-06-10 19:16:44 -07:00
Sunil Ramaiah
a97814462d Added a filter function to filter the record keys in a parquet file 2018-05-17 19:01:11 -07:00
Nishith Agarwal
23d53763c4 enabling global index for MOR 2018-05-16 10:36:25 -07:00
Balaji Varadarajan
dfc0c61eb7 Support union mode in HoodieRealtimeRecordReader for pure insert workloads
Also Replace BufferedIteratorPayload abstraction with function passing
2018-05-10 17:39:56 -07:00
Nishith Agarwal
93f345a032 Minor fixes for MergeOnRead MVP release readiness 2018-05-09 07:23:58 -07:00