1
0
Commit Graph

86 Commits

Author SHA1 Message Date
Nishith Agarwal
d495484399 Write smaller sized multiple blocks to log file instead of a large one
- Use SizeEstimator to size number of records to write
	- Configurable block size
   	- Configurable log file size
2018-02-23 07:31:39 -08:00
Vinoth Chandar
eb3d0c470f Fix formatting in HoodieWriteClient 2018-02-14 10:03:20 -08:00
Nishith Agarwal
7076c2e9f0 refactor classes to accept Map passed by RealtimeCompactor to avoid multiple map creations in HoodieMergeHandle 2018-02-07 11:16:01 -08:00
Nishith Agarwal
30049383f5 Small File Size correction handling for MOR table type 2018-02-07 11:01:10 -08:00
Nishith Agarwal
2116815261 Fixing Rollback for compaction/commit operation, added check for null commit
- Fallback to old way of rollback by listing all partitions
	- Added null check to ensure only partitions which are to be rolledback are considered
	- Added location (committime) to workload stat
	- Added checks in CompactedScanner to guard against task retries
	- Introduce new logic for rollback (bounded by instant_time and target_instant time)
        - Reversed logfiles order
2018-02-06 16:55:23 -08:00
Nishith Agarwal
be0b1f3e57 Adding global indexing to HbaseIndex implementation
- Adding tests or HbaseIndex
	- Enabling global index functionality
2018-02-05 15:21:22 -08:00
Jian Xu
15e669c60c Incorporating code review feedback for finalizeWrite for COW #4 2018-02-02 11:38:25 -08:00
Jian Xu
3736243fb3 Rebases with latest upstream 2018-02-02 11:38:25 -08:00
Jian Xu
363e35bb0f Add finalizeWrite support for HoodieMergeHandle 2018-02-02 11:38:25 -08:00
Jian Xu
acae6586f3 Incorporating code review feedback for finalizeWrite for COW #3 2018-02-02 11:38:25 -08:00
Jian Xu
37f2cdd7e4 Incorporating code review feedback for finalizeWrite for COW #2 2018-02-02 11:38:25 -08:00
Jian Xu
2fe4fef625 Incorporating code review feedback for finalizeWrite for COW 2018-02-02 11:38:25 -08:00
Jian Xu
c874248f23 Add FinalizeWrite in HoodieCreateHandle for COW tables 2018-02-02 11:38:25 -08:00
Nishith Agarwal
e10100fe32 Reducing list status calls from listing logfile versions, some associated refactoring 2018-01-29 08:26:39 -08:00
Nishith Agarwal
937ae322ba Reducing memory footprint required in HoodieAvroDataBlock and HoodieAppendHandle 2018-01-29 08:22:29 -08:00
vinothchandar
cf7f7aabb9 Nicer handling of timeline archival for Cloud storage
- When append() is not supported, rollover to new file always (instead of failing)
 - Provide way to configure archive log folder (avoids small files inside .hoodie)
 - Datasets written via Spark datasource archive to .hoodie/archived
 - HoodieClientExample will now retain only 2,3 commits to exercise archival path during dev cycles
 - Few tweaks to code structure around CommitArchiveLog
2018-01-17 23:34:21 -08:00
Vinoth Chandar
0cd186c899 Multi FS Support
- Reviving PR 191, to make FileSystem creation off actual path
 - Streamline all filesystem access to HoodieTableMetaClient
 - Hadoop Conf from Spark Context serialized & passed to executor code too
 - Pick up env vars prefixed with HOODIE_ENV_ into Configuration object
 - Cleanup usage of FSUtils.getFS, piggybacking off HoodieTableMetaClient.getFS
 - Adding s3a to supported schemes & support escaping "." in env vars
 - Tests use HoodieTestUtils.getDefaultHadoopConf
2018-01-17 23:34:21 -08:00
Nishith Agarwal
44839b88c6 Removing compaction action type and associated compaction timeline operations, replace with commit action type 2018-01-09 09:56:15 -08:00
Nishith Agarwal
4aed5c7338 Adding a new Partition/Time based compaction strategy 2017-12-05 16:30:38 -08:00
Nishith Agarwal
9b610f82c7 Separating out compaction() API 2017-11-14 22:56:29 -08:00
Vinoth Chandar
e45679f5e2 Reformatting code per Google Code Style all over 2017-11-12 23:19:02 -08:00
Nishith Agarwal
abe964bebd Implementing custom payload/merge hooks abstractions for application specific merge logic 2017-11-07 18:55:55 -08:00
Nishith Agarwal
c7d63a7622 1) Separated rollback as a table operation 2) Implement rollback for MOR 2017-10-12 07:36:46 -07:00
Vinoth Chandar
64e0573aca Adding hoodie-spark to support Spark Datasource for Hoodie
- Write with COW/MOR paths work fully
 - Read with RO view works on both storages*
 - Incremental view supported on COW
 - Refactored out HoodieReadClient methods, to just contain key based access
 - HoodieDataSourceHelpers class can be now used to construct inputs to datasource
 - Tests in hoodie-client using new helpers and mechanisms
 - Basic tests around save modes & insert/upserts (more to follow)
 - Bumped up scala to 2.11, since 2.10 is deprecated & complains with scalatest
 - Updated documentation to describe usage
 - New sample app written using the DataSource API
2017-10-02 20:44:53 -07:00
Kaushik Devarajaiah
c98ee057fc capture record metadata before deflating for record counting 2017-10-02 10:46:06 -07:00
Vinoth Chandar
f2980052cd Revert effects of PR #259 2017-09-28 10:29:58 -07:00
Vinoth Chandar
9f98ae643b Adding canIndexLogFiles(), isImplicitWithStorage(), isGlobal() to HoodieIndex 2017-09-28 10:19:29 -07:00
Eric Sayle
6230e15191 Update deprecated hash function
Guava deprecated hashString(String) in v15, and removed it in v16.
Replace call with hashUnencodedString(String), which replace it, to
be compatible with newer versions of Guava.
2017-09-18 17:39:19 -07:00
Omkar Joshi
5c639c0b05 Adding support for UserDefinedBulkInsertPartitioner 2017-09-08 20:55:13 -07:00
Omkar Joshi
ec40d04d51 Fixing UpsertPartitioner to ensure that input records are deterministically assigned to output partitions 2017-09-07 17:03:56 -07:00
Nishith Agarwal
e484e91807 adding new config to separate shuffle and write parallelism 2017-08-18 16:05:25 -07:00
Nishith Agarwal
5ee4ac40ae Use CompletedFileSystemView instead of CompactedView considering deltacommits 2017-08-07 12:26:42 -07:00
Vinoth Chandar
86209640f7 Adding range based pruning to bloom index
- keys compared lexicographically using String::compareTo
 - Range metadata additionally written into parquet file footers
 - Trim fat & few optimizations to speed up indexing
 - Add param to control whether input shall be cached, to speed up lookup
 - Add param to turn on/off range pruning
 - Auto compute of parallelism now simply factors in amount of comparisons done
 - More accurate parallelism computation when range pruning is on
 - tests added & hardened, docs updated
2017-08-04 13:22:13 -07:00
Nishith Agarwal
0b26b60a5c fix for cleaning log files(mor) 2017-08-02 11:54:42 -07:00
Nishith Agarwal
19c22b231e 1. Use HoodieLogFormat to archive commits and other actions 2. Introduced avro schema for commits and compactions and an avro wrapper schema 2017-07-26 14:27:44 -07:00
Nishith Agarwal
616c9a68c3 Enabled deletes in merge_on_read 2017-07-26 13:37:27 -07:00
Prasanna Rajaperumal
5cc071f74e Savepoint should not create a hole in the commit timeline 2017-06-27 16:36:09 -07:00
Vinoth Chandar
754ab88a2d Introduce ReadOptimizedView & RealtimeView out of TableFileSystemView
- Usage now marks code as clearly using either RO or RT views, for future evolution
  - Tests on all of FileGroups and FileSlices
2017-06-22 17:16:13 -07:00
Vinoth Chandar
c00f1a9ed9 Refactoring HoodieTableFileSystemView using FileGroups/FileSlices
- Merged all filter* and get* methods
 - new constructor takes filestatus[]
 - All existing tests pass
 - FileGroup is all files that belong to a fileID within a partition
 - FileSlice is a generation of data and log files, starting at a base commit
2017-06-22 17:16:13 -07:00
Vinoth Chandar
23e7badd8a Rename IO Handles & introduce stub for BucketedIndex
- UpdateHandle -> MergeHandle, InsertHandle -> CreateHandle
 - Also bunch of code cleanup in different places
2017-06-22 17:16:13 -07:00
Kaushik Devarajaiah
3aa8083913 Correct clean bug that causes clean failure when partitionPaths are empty 2017-06-20 15:45:32 -07:00
gekath
52c507f83e Writes relative paths to .commit files
Handle case where path is read in as null from commit file

Merged with updated release
2017-06-16 12:51:19 -07:00
gekath
db7311f85e Writes relative paths to .commit files instead of absolute paths
Clean up code

Removed commented out code

Fixed merge conflict with master
2017-06-16 12:51:19 -07:00
Kaushik Devarajaiah
521555c576 Parallelize file version deletes during clean and related tests 2017-06-15 18:20:42 -07:00
Prasanna Rajaperumal
dda28c0b4b Rollback inflight commits as well when rolling back to savepoint 2017-06-14 11:03:27 -07:00
Prasanna Rajaperumal
db6150c5ef Refactor hoodie-hive 2017-06-09 13:06:33 -07:00
Prasanna Rajaperumal
bae98efeee Delete other instant files (.clean) as well during commit archival 2017-05-24 13:51:49 -07:00
Prasanna Rajaperumal
240c91241b Implement HoodieLogFormat replacing Avro as the default log format 2017-05-23 08:35:11 -07:00
Nishith Agarwal
3c984447da view scheme added 2017-05-22 12:27:40 -07:00
Prasanna Rajaperumal
70dd7a25ea Clean should not create a .inflight file 2017-05-22 10:48:35 -07:00