1
0
Commit Graph

56 Commits

Author SHA1 Message Date
Nishith Agarwal
e484e91807 adding new config to separate shuffle and write parallelism 2017-08-18 16:05:25 -07:00
Nishith Agarwal
5ee4ac40ae Use CompletedFileSystemView instead of CompactedView considering deltacommits 2017-08-07 12:26:42 -07:00
Vinoth Chandar
86209640f7 Adding range based pruning to bloom index
- keys compared lexicographically using String::compareTo
 - Range metadata additionally written into parquet file footers
 - Trim fat & few optimizations to speed up indexing
 - Add param to control whether input shall be cached, to speed up lookup
 - Add param to turn on/off range pruning
 - Auto compute of parallelism now simply factors in amount of comparisons done
 - More accurate parallelism computation when range pruning is on
 - tests added & hardened, docs updated
2017-08-04 13:22:13 -07:00
Nishith Agarwal
0b26b60a5c fix for cleaning log files(mor) 2017-08-02 11:54:42 -07:00
Nishith Agarwal
19c22b231e 1. Use HoodieLogFormat to archive commits and other actions 2. Introduced avro schema for commits and compactions and an avro wrapper schema 2017-07-26 14:27:44 -07:00
Nishith Agarwal
616c9a68c3 Enabled deletes in merge_on_read 2017-07-26 13:37:27 -07:00
Prasanna Rajaperumal
5cc071f74e Savepoint should not create a hole in the commit timeline 2017-06-27 16:36:09 -07:00
Vinoth Chandar
754ab88a2d Introduce ReadOptimizedView & RealtimeView out of TableFileSystemView
- Usage now marks code as clearly using either RO or RT views, for future evolution
  - Tests on all of FileGroups and FileSlices
2017-06-22 17:16:13 -07:00
Vinoth Chandar
c00f1a9ed9 Refactoring HoodieTableFileSystemView using FileGroups/FileSlices
- Merged all filter* and get* methods
 - new constructor takes filestatus[]
 - All existing tests pass
 - FileGroup is all files that belong to a fileID within a partition
 - FileSlice is a generation of data and log files, starting at a base commit
2017-06-22 17:16:13 -07:00
Vinoth Chandar
23e7badd8a Rename IO Handles & introduce stub for BucketedIndex
- UpdateHandle -> MergeHandle, InsertHandle -> CreateHandle
 - Also bunch of code cleanup in different places
2017-06-22 17:16:13 -07:00
Kaushik Devarajaiah
3aa8083913 Correct clean bug that causes clean failure when partitionPaths are empty 2017-06-20 15:45:32 -07:00
gekath
52c507f83e Writes relative paths to .commit files
Handle case where path is read in as null from commit file

Merged with updated release
2017-06-16 12:51:19 -07:00
gekath
db7311f85e Writes relative paths to .commit files instead of absolute paths
Clean up code

Removed commented out code

Fixed merge conflict with master
2017-06-16 12:51:19 -07:00
Kaushik Devarajaiah
521555c576 Parallelize file version deletes during clean and related tests 2017-06-15 18:20:42 -07:00
Prasanna Rajaperumal
dda28c0b4b Rollback inflight commits as well when rolling back to savepoint 2017-06-14 11:03:27 -07:00
Prasanna Rajaperumal
db6150c5ef Refactor hoodie-hive 2017-06-09 13:06:33 -07:00
Prasanna Rajaperumal
bae98efeee Delete other instant files (.clean) as well during commit archival 2017-05-24 13:51:49 -07:00
Prasanna Rajaperumal
240c91241b Implement HoodieLogFormat replacing Avro as the default log format 2017-05-23 08:35:11 -07:00
Nishith Agarwal
3c984447da view scheme added 2017-05-22 12:27:40 -07:00
Prasanna Rajaperumal
70dd7a25ea Clean should not create a .inflight file 2017-05-22 10:48:35 -07:00
Zeeshan Qureshi
43a55b09fd Add GCS to supported filesystems 2017-05-18 10:30:34 -07:00
Vinoth Chandar
da17c5c607 Introduce getCommitsAndCompactionsTimeline() explicitly & adjust usage across code base 2017-05-01 21:48:27 -07:00
Vinoth Chandar
bae0528013 Cleanup calls to HoodieTimeline.compareTimeStamps 2017-05-01 21:48:27 -07:00
Prasanna Rajaperumal
8974e11161 Make sure properties set in HoodieWriteConfig is propogated down to individual configs. Fix a race condition which lets InputFormat to think file size is 0 when it is actually not 2017-04-27 10:52:25 -07:00
Prasanna Rajaperumal
91b088f29f Implement Compaction policy abstraction. Implement LogSizeBased Bounded IO Compaction as the default strategy 2017-04-20 16:59:06 -07:00
Vinoth Chandar
2b6322318c CR feedback 2017-04-03 18:28:01 -07:00
Vinoth Chandar
e0fc4ec38e Documentation update + helper method for WriteConfig builder 2017-04-03 18:28:01 -07:00
Vinoth Chandar
dce35ff0d7 Adding a config to control whether date partitioning can be assumed
- false by default
 - CAUTION: If you have an existing tables without partition metadata, you need to set this to "true"
2017-04-03 18:28:01 -07:00
Vinoth Chandar
3129770fd0 Create .hoodie_partition_metadata in each partition, linking back to basepath
- Concurreny handled via taskID, failure recovery handled via renames
 - Falls back to search 3 levels up
 - Cli tool has command to add this to existing tables
2017-04-03 18:28:01 -07:00
Prasanna Rajaperumal
1e802ad4f2 Move HoodieAvroReader to hoodie-common, it will be used for compaction and in the record reader 2017-04-03 13:58:35 -07:00
Prasanna Rajaperumal
aee136777b Fixes needed to run merge-on-read testing on production scale data 2017-04-02 22:25:47 -07:00
Yash Sharma
d6f94b998d Hoodie operability with S3 2017-03-28 05:08:54 -07:00
prazanna
0e3f635adb remove hardcoding of autoClean 2017-03-23 15:54:26 -07:00
fishie9
b7047ab4fb Pass in String StroageLevel for WriteStatus (#113) 2017-03-23 04:31:30 -07:00
prazanna
f1b7afad21 Add config for index parallelism and make clean public (#109)
* Add config for index parallelism and make clean public

* Review comments on clean api modification
2017-03-21 17:36:46 -07:00
ovj
21898907c1 tool for importing hive tables (in parquet format) into hoodie dataset (#89)
* tool for importing hive tables (in parquet format) into hoodie dataset

* review fixes

* review fixes

* review fixes
2017-03-21 14:42:13 -07:00
prazanna
d835710c51 Metadata timeline marks an already complete instant as complete again (#98) 2017-03-17 12:42:26 -07:00
Prasanna Rajaperumal
d83b671ada Implement Savepoints and required metadata timeline - Part 2 2017-03-13 23:09:29 -07:00
prazanna
6f36e1eaaf Implement Savepoints and required metadata timeline (#86)
- Introduce avro to save clean metadata with details about the last commit that was retained
- Save rollback metadata in the meta timeline
- Create savepoint metadata and add API to createSavepoint, deleteSavepoint and rollbackToSavepoint
- Savepointed commit should not be rolledback or cleaned or archived
- introduce cli commands to show, create and rollback to savepoints
- Write unit tests to test savepoints and rollbackToSavepoints
2017-03-13 15:12:03 -07:00
vinoth chandar
69d3950a32 Revamped Deltastreamer (#93)
* Add analytics to site

* Fix ugly favicon

* New & Improved HoodieDeltaStreamer

 - Can incrementally consume from HDFS or Kafka, with exactly-once semantics!
 - Supports Json/Avro data, Source can also do custom things
 - Source is totally pluggable, via reflection
 - Key generation is pluggable, currently added SimpleKeyGenerator
 - Schema provider is pluggable, currently Filebased schemas
 - Configurable field to break ties during preCombine
 - Finally, can also plugin the HoodieRecordPayload, to get other merge types than overwriting
 - Handles efficient avro serialization in Spark

 Pending :
 - Rewriting of HiveIncrPullSource
 - Hive sync via hoodie-hive
 - Cleanup & tests

* Minor fixes from master rebase

* Implementation of HiveIncrPullSource
 - Copies commit by commit from source to target

* Adding TimestampBasedKeyGenerator
 - Supports unix time & date strings
2017-03-13 12:41:29 -07:00
siddharthagunda
348a48aa80 Add delete support to Hoodie (#85) 2017-03-04 01:33:49 -08:00
vinoth chandar
116a78094f Cleanup code based on Java8 Lambdas (#84) 2017-02-27 15:52:13 -08:00
Prasanna Rajaperumal
1132f3533d Merge and pull master commits 2017-02-21 17:53:28 -08:00
prazanna
eb46e7c72b Implement Merge on Read Storage (#76)
1. Create HoodieTable abstraction for commits and fileSystemView
2. HoodieMergeOnReadTable created
3. View is now always obtained from the table and the correct view based on the table type is returned
2017-02-21 16:24:38 -08:00
prazanna
11d2fd3428 Introduce RealtimeTableView and Implement HoodieRealtimeTableCompactor (#73) 2017-02-21 16:24:18 -08:00
Prasanna Rajaperumal
ccd8cb2407 Take 2: Refactor hoodie-common and create right abstractions for Hoodie Storage V2.0
- Refactored timelines to be a single timeline for all active events and one for archived events. CommitTimeline and other timelines can be inferred by applying a filter on the activeTimelime
- Introduced HoodieInstant to abstract different types of action, commit time and if isInFlight
- Implemented other review comments
2017-02-21 16:23:53 -08:00
Prasanna Rajaperumal
8ee777a9bb Refactor hoodie-common and create right abstractions for Hoodie Storage V2.0
The following is the gist of changes done

- All low-level operation of creating a commit code was in HoodieClient which made it hard to share code if there was a compaction commit.
- HoodieTableMetadata contained a mix of metadata and filtering files. (Also few operations required FileSystem to be passed in because those were called from TaskExecutors and others had FileSystem as a global variable). Since merge-on-read requires a lot of that code, but will have to change slightly on how it operates on the metadata and how it filters the files. The two set of operation are split into HoodieTableMetaClient and TableFileSystemView.
- Everything (active commits, archived commits, cleaner log, save point log and in future delta and compaction commits) in HoodieTableMetaClient is a HoodieTimeline. Timeline is a series of instants, which has an in-built concept of inflight and completed commit markers.
- A timeline can be queries for ranges, contains and also use to create new datapoint (create a new commit etc). Commit (and all the above metadata) creation/deletion is streamlined in a timeline
- Multiple timelines can be merged into a single timeline, giving us an audit timeline to whatever happened in a hoodie dataset. This also helps with #55.
- Move to java 8 and introduce java 8 succinct syntax in refactored code
2017-02-21 16:23:53 -08:00
Prasanna Rajaperumal
0e234ac0ef Moving to Spark 2.1.0 2017-02-20 16:47:52 -08:00
Prasanna Rajaperumal
4a47d26818 Fixing a javadoc lint issue 2017-02-20 15:57:58 -08:00
Prasanna Rajaperumal
7178cb5a3f Fixing a javadoc lint issue 2017-02-20 15:41:32 -08:00