58 Commits

Author SHA1 Message Date
Jian Xu
dfd1979c51 Handle inflight clean instants during Hoodie instants archiving 2018-03-05 15:01:58 -08:00
Jian Xu
5d5c306e64 Add new APIs in HoodieReadClient and HoodieWriteClient 2018-02-28 13:58:12 -08:00
Nishith Agarwal
30049383f5 Small File Size correction handling for MOR table type 2018-02-07 11:01:10 -08:00
Nishith Agarwal
2116815261 Fixing Rollback for compaction/commit operation, added check for null commit
- Fallback to old way of rollback by listing all partitions
 - Added null check to ensure only partitions which are to be rolled back are considered
 - Added location (committime) to workload stat
 - Added checks in CompactedScanner to guard against task retries
 - Introduce new logic for rollback (bounded by instant_time and target_instant time)
 - Reversed logfiles order
2018-02-06 16:55:23 -08:00
Nishith Agarwal
be0b1f3e57 Adding global indexing to HbaseIndex implementation
- Adding tests for HbaseIndex
 - Enabling global index functionality
2018-02-05 15:21:22 -08:00
Jian Xu
15e669c60c Incorporating code review feedback for finalizeWrite for COW #4 2018-02-02 11:38:25 -08:00
Jian Xu
3736243fb3 Rebases with latest upstream 2018-02-02 11:38:25 -08:00
Jian Xu
363e35bb0f Add finalizeWrite support for HoodieMergeHandle 2018-02-02 11:38:25 -08:00
Jian Xu
2fe4fef625 Incorporating code review feedback for finalizeWrite for COW 2018-02-02 11:38:25 -08:00
Jian Xu
c874248f23 Add FinalizeWrite in HoodieCreateHandle for COW tables 2018-02-02 11:38:25 -08:00
vinothchandar
21ce846f18 Remove stateful fs member from HoodieTestUtils & FSUtils 2018-01-17 23:34:21 -08:00
vinothchandar
cf7f7aabb9 Nicer handling of timeline archival for Cloud storage
- When append() is not supported, rollover to new file always (instead of failing)
 - Provide way to configure archive log folder (avoids small files inside .hoodie)
 - Datasets written via Spark datasource archive to .hoodie/archived
 - HoodieClientExample will now retain only 2,3 commits to exercise archival path during dev cycles
 - Few tweaks to code structure around CommitArchiveLog
2018-01-17 23:34:21 -08:00
Vinoth Chandar
0cd186c899 Multi FS Support
- Reviving PR 191, to make FileSystem creation off actual path
 - Streamline all filesystem access to HoodieTableMetaClient
 - Hadoop Conf from Spark Context serialized & passed to executor code too
 - Pick up env vars prefixed with HOODIE_ENV_ into Configuration object
 - Cleanup usage of FSUtils.getFS, piggybacking off HoodieTableMetaClient.getFS
 - Adding s3a to supported schemes & support escaping "." in env vars
 - Tests use HoodieTestUtils.getDefaultHadoopConf
2018-01-17 23:34:21 -08:00
Nishith Agarwal
44839b88c6 Removing compaction action type and associated compaction timeline operations, replace with commit action type 2018-01-09 09:56:15 -08:00
Nishith Agarwal
4aed5c7338 Adding a new Partition/Time based compaction strategy 2017-12-05 16:30:38 -08:00
Nishith Agarwal
9b610f82c7 Separating out compaction() API 2017-11-14 22:56:29 -08:00
Vinoth Chandar
e45679f5e2 Reformatting code per Google Code Style all over 2017-11-12 23:19:02 -08:00
Nishith Agarwal
c7d63a7622 1) Separated rollback as a table operation 2) Implement rollback for MOR 2017-10-12 07:36:46 -07:00
Vinoth Chandar
274aaf49fe Incorporating code review feedback for DataSource 2017-10-02 20:44:53 -07:00
Vinoth Chandar
64e0573aca Adding hoodie-spark to support Spark Datasource for Hoodie
- Write with COW/MOR paths work fully
 - Read with RO view works on both storages*
 - Incremental view supported on COW
 - Refactored out HoodieReadClient methods, to just contain key based access
 - HoodieDataSourceHelpers class can be now used to construct inputs to datasource
 - Tests in hoodie-client using new helpers and mechanisms
 - Basic tests around save modes & insert/upserts (more to follow)
 - Bumped up scala to 2.11, since 2.10 is deprecated & complains with scalatest
 - Updated documentation to describe usage
 - New sample app written using the DataSource API
2017-10-02 20:44:53 -07:00
Kaushik Devarajaiah
c98ee057fc capture record metadata before deflating for record counting 2017-10-02 10:46:06 -07:00
Omkar Joshi
ec40d04d51 Fixing UpsertPartitioner to ensure that input records are deterministically assigned to output partitions 2017-09-07 17:03:56 -07:00
Nishith Agarwal
e2d13c6305 Fix build failing issues 2017-09-07 10:54:36 -07:00
Vinoth Chandar
45dd8980c3 Temporary fix for build break after rebase 2017-08-04 17:36:39 -07:00
Vinoth Chandar
86209640f7 Adding range based pruning to bloom index
- keys compared lexicographically using String::compareTo
 - Range metadata additionally written into parquet file footers
 - Trim fat & few optimizations to speed up indexing
 - Add param to control whether input shall be cached, to speed up lookup
 - Add param to turn on/off range pruning
 - Auto compute of parallelism now simply factors in amount of comparisons done
 - More accurate parallelism computation when range pruning is on
 - tests added & hardened, docs updated
2017-08-04 13:22:13 -07:00
Nishith Agarwal
0b26b60a5c fix for cleaning log files(mor) 2017-08-02 11:54:42 -07:00
Nishith Agarwal
19c22b231e 1. Use HoodieLogFormat to archive commits and other actions 2. Introduced avro schema for commits and compactions and an avro wrapper schema 2017-07-26 14:27:44 -07:00
Nishith Agarwal
616c9a68c3 Enabled deletes in merge_on_read 2017-07-26 13:37:27 -07:00
Prasanna Rajaperumal
5cc071f74e Savepoint should not create a hole in the commit timeline 2017-06-27 16:36:09 -07:00
Vinoth Chandar
754ab88a2d Introduce ReadOptimizedView & RealtimeView out of TableFileSystemView
- Usage now marks code as clearly using either RO or RT views, for future evolution
 - Tests on all of FileGroups and FileSlices
2017-06-22 17:16:13 -07:00
Vinoth Chandar
c00f1a9ed9 Refactoring HoodieTableFileSystemView using FileGroups/FileSlices
- Merged all filter* and get* methods
 - new constructor takes filestatus[]
 - All existing tests pass
 - FileGroup is all files that belong to a fileID within a partition
 - FileSlice is a generation of data and log files, starting at a base commit
2017-06-22 17:16:13 -07:00
Vinoth Chandar
23e7badd8a Rename IO Handles & introduce stub for BucketedIndex
- UpdateHandle -> MergeHandle, InsertHandle -> CreateHandle
 - Also bunch of code cleanup in different places
2017-06-22 17:16:13 -07:00
Kaushik Devarajaiah
3aa8083913 Correct clean bug that causes clean failure when partitionPaths are empty 2017-06-20 15:45:32 -07:00
gekath
52c507f83e Writes relative paths to .commit files
Handle case where path is read in as null from commit file

Merged with updated release
2017-06-16 12:51:19 -07:00
gekath
db7311f85e Writes relative paths to .commit files instead of absolute paths
Clean up code

Removed commented out code

Fixed merge conflict with master
2017-06-16 12:51:19 -07:00
Kaushik Devarajaiah
521555c576 Parallelize file version deletes during clean and related tests 2017-06-15 18:20:42 -07:00
Vinoth Chandar
da17c5c607 Introduce getCommitsAndCompactionsTimeline() explicitly & adjust usage across code base 2017-05-01 21:48:27 -07:00
Vinoth Chandar
bae0528013 Cleanup calls to HoodieTimeline.compareTimeStamps 2017-05-01 21:48:27 -07:00
Prasanna Rajaperumal
7bca428a0a Test to check if properties set are properly propagated 2017-04-28 12:47:14 -07:00
Prasanna Rajaperumal
3f97bdcccf Test to check if properties set are properly propagated 2017-04-28 12:40:58 -07:00
Prasanna Rajaperumal
91b088f29f Implement Compaction policy abstraction. Implement LogSizeBased Bounded IO Compaction as the default strategy 2017-04-20 16:59:06 -07:00
Vinoth Chandar
dce35ff0d7 Adding a config to control whether date partitioning can be assumed
- false by default
 - CAUTION: If you have existing tables without partition metadata, you need to set this to "true"
2017-04-03 18:28:01 -07:00
Vinoth Chandar
f9fd16069d FSUtils.getAllPartitionsPaths() works based on .hoodie_partition_metadata
- clean/rollback/write paths covered by existing tests
 - Snapshot copier fixed to copy metadata file also, and test fixed
 - Existing tables need to be repaired by addition of metadata, before this can be rolled out
2017-04-03 18:28:01 -07:00
Prasanna Rajaperumal
aee136777b Fixes needed to run merge-on-read testing on production scale data 2017-04-02 22:25:47 -07:00
ovj
21898907c1 tool for importing hive tables (in parquet format) into hoodie dataset (#89)
* tool for importing hive tables (in parquet format) into hoodie dataset

* review fixes

* review fixes

* review fixes
2017-03-21 14:42:13 -07:00
Prasanna Rajaperumal
d83b671ada Implement Savepoints and required metadata timeline - Part 2 2017-03-13 23:09:29 -07:00
prazanna
6f36e1eaaf Implement Savepoints and required metadata timeline (#86)
- Introduce avro to save clean metadata with details about the last commit that was retained
- Save rollback metadata in the meta timeline
- Create savepoint metadata and add API to createSavepoint, deleteSavepoint and rollbackToSavepoint
- Savepointed commit should not be rolled back, cleaned, or archived
- introduce cli commands to show, create and rollback to savepoints
- Write unit tests to test savepoints and rollbackToSavepoints
2017-03-13 15:12:03 -07:00
siddharthagunda
348a48aa80 Add delete support to Hoodie (#85) 2017-03-04 01:33:49 -08:00
Prasanna Rajaperumal
1132f3533d Merge and pull master commits 2017-02-21 17:53:28 -08:00
prazanna
eb46e7c72b Implement Merge on Read Storage (#76)
1. Create HoodieTable abstraction for commits and fileSystemView
2. HoodieMergeOnReadTable created
3. View is now always obtained from the table and the correct view based on the table type is returned
2017-02-21 16:24:38 -08:00