92 Commits

Author SHA1 Message Date
jiale.tan
98fd97b65f feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup 2018-09-29 19:55:20 +05:30
vinothchandar
9ca6f91e97 Perform consistency checks during write finalize
- Check to ensure written files are listable on storage
- Docs reflected to capture how this helps with s3 storage
- Unit tests added, corrections to existing tests
- Fix DeltaStreamer to manage archived commits in a separate folder
2018-09-28 08:04:41 +05:30
Balaji Varadarajan
5cb28e7b1f Explicitly release resources in LogFileReader and TestHoodieClientBase 2018-09-20 13:24:57 +05:30
Vinoth Chandar
eca49a255e Rebasing and fixing conflicts against master 2018-09-11 11:03:30 +05:30
Nishith Agarwal
2b1af18941 Adding check for rolling stats not present to handle backwards compatibility of existing timeline 2018-09-10 11:53:46 +08:00
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
- DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
- Remove dependency on ConstructorUtils
- Throw error if ordering value is not present, during key generation
- Switch to shade plugin for hoodie-utilities
- Added support for consumption for Confluent avro kafka serdes
- Support for Confluent schema registry
- KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
- Added support for BULK_INSERT operations as well
- Pass in the payload class config properly into HoodieWriteClient
- Fix documentation based on new usage
- Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Nishith Agarwal
0fe92dee55 Fix an intermittently failing test case in TestMergeOnRead due to incorrect prev commit time 2018-09-08 09:39:18 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size, and if there is room for more inserts, they are added. 2018-09-06 08:52:08 +08:00
Nishith Agarwal
324de298bc Removing dependency on apache-commons lang 3, adding necessary classes as needed 2018-09-06 08:26:48 +08:00
Kaushik Devarajaiah
e624480259 Throttling to limit QPS from HbaseIndex 2018-08-21 21:10:38 -07:00
Nishith Agarwal
88274b8261 Adding another metric to HoodieWriteStat to determine if there were inserts converted to updates, added one test for this 2018-08-14 06:22:16 -07:00
Balaji Varadarajan
2f8ce93030 Async Compaction Main API changes 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
9b78523d62 Ensure Cleaner and Archiver do not delete file-slices and workload marked for compaction 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
0a0451a765 Ensure Compaction workload is stored in write-once meta-data files separate from timeline files.
This avoids concurrency issues when compactor(s) and ingestor are running in parallel.
In the next PR -> Safety concern regarding Cleaner retaining all meta-data and file-slices for pending compactions will be addressed
2018-08-07 08:19:50 -07:00
Balaji Varadarajan
1b61f04e05 (1) Define CompactionWorkload in avro to allow storing them in instant files.
(2) Split APIs in HoodieRealtimeCompactor to separate generating compaction workload from running compaction
2018-08-07 08:19:50 -07:00
Balaji Varadarajan
6d01ae8ca0 FileSystemView and Timeline level changes to support Async Compaction 2018-08-07 08:19:50 -07:00
Nishith Agarwal
44caf0d40c Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled 2018-07-18 12:53:35 -07:00
Nishith Agarwal
34ab54a9d3 Fixing bug introduced in rollback for MOR table type with inserts into log files 2018-07-17 17:20:34 -07:00
Nishith Agarwal
a6fe96fdfe Changing Day based compaction strategy to be IO agnostic 2018-06-18 15:22:56 -07:00
Nishith Agarwal
3da063f83b Adding ability for inserts to be written to log files 2018-06-11 14:08:59 -07:00
Balaji Varadarajan
dfc0c61eb7 Support union mode in HoodieRealtimeRecordReader for pure insert workloads
Also Replace BufferedIteratorPayload abstraction with function passing
2018-05-10 17:39:56 -07:00
Nishith Agarwal
04655e9e85 Adding metrics for MOR and COW 2018-04-26 09:32:45 -07:00
Sunil Ramaiah
b9b9b24993 Added more comments and removed the extra new lines 2018-04-25 13:09:15 -07:00
Sunil Ramaiah
4d1fba24c9 Fix for updating duplicate records in same/different files in same partition 2018-04-25 13:09:15 -07:00
Nishith Agarwal
c3c205fc02 Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream 2018-04-18 08:05:19 -07:00
Nishith Agarwal
720e42f52a Parallelized read-write operations in Hoodie Merge phase 2018-04-12 11:46:42 -07:00
Balaji Varadarajan
6c226ca21a Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases 2018-04-09 16:34:58 -07:00
Balaji Varadarajan
788e4f2d2e CodeStyle formatting to conform to basic Checkstyle rules.
The code-style rules follow google style with some changes:

1. Increase line length from 100 to 120
2. Disable JavaDoc related checkstyles as this needs more manual work.

Both source and test code are checked for code-style
2018-03-30 11:09:40 -07:00
Nishith Agarwal
1b756db221 Adding config for parquet compression ratio 2018-03-25 22:17:36 -07:00
Kaushik Devarajaiah
291a88ba94 DeduplicateRecords based on recordKey if global index is used 2018-03-22 09:15:44 -07:00
Omkar Joshi
c5b4cb1b75 Spawning parallel writer thread to separate reading records from spark and writing records to parquet file 2018-03-15 16:58:14 -07:00
Jian Xu
7f079632a6 Use hadoopConf in HoodieTableMetaClient and related tests 2018-03-12 11:47:55 -07:00
Nishith Agarwal
0eaa21111a Re-factoring Compaction as first level API in WriteClient similar to upsert/insert 2018-03-07 16:16:39 -08:00
Nishith Agarwal
5405a6287b Introducing HoodieLogFormat V2 with versioning support
- HoodieLogFormat V2 has support for LogFormat evolution through versioning
  - LogVersion is associated with a LogBlock not a LogFile
  - Based on a version for a LogBlock, appropriate code path is executed
- Implemented LazyReading of Hoodie Log Blocks with Memory / IO tradeoff
- Implemented Reverse pointer to be able to traverse the log in reverse
- Introduce new MAGIC for backwards compatibility with logs without versions
2018-03-06 21:14:11 -08:00
Jian Xu
dfd1979c51 Handle inflight clean instants during Hoodie instants archiving 2018-03-05 15:01:58 -08:00
Jian Xu
5d5c306e64 Add new APIs in HoodieReadClient and HoodieWriteClient 2018-02-28 13:58:12 -08:00
Nishith Agarwal
30049383f5 Small File Size correction handling for MOR table type 2018-02-07 11:01:10 -08:00
Nishith Agarwal
2116815261 Fixing Rollback for compaction/commit operation, added check for null commit
- Fallback to old way of rollback by listing all partitions
- Added null check to ensure only partitions which are to be rolled back are considered
- Added location (committime) to workload stat
- Added checks in CompactedScanner to guard against task retries
- Introduce new logic for rollback (bounded by instant_time and target_instant time)
- Reversed logfiles order
2018-02-06 16:55:23 -08:00
Nishith Agarwal
be0b1f3e57 Adding global indexing to HbaseIndex implementation
- Adding tests for HbaseIndex
- Enabling global index functionality
2018-02-05 15:21:22 -08:00
Jian Xu
15e669c60c Incorporating code review feedback for finalizeWrite for COW #4 2018-02-02 11:38:25 -08:00
Jian Xu
3736243fb3 Rebases with latest upstream 2018-02-02 11:38:25 -08:00
Jian Xu
363e35bb0f Add finalizeWrite support for HoodieMergeHandle 2018-02-02 11:38:25 -08:00
Jian Xu
2fe4fef625 Incorporating code review feedback for finalizeWrite for COW 2018-02-02 11:38:25 -08:00
Jian Xu
c874248f23 Add FinalizeWrite in HoodieCreateHandle for COW tables 2018-02-02 11:38:25 -08:00
vinothchandar
21ce846f18 Remove stateful fs member from HoodieTestUtils & FSUtils 2018-01-17 23:34:21 -08:00
vinothchandar
cf7f7aabb9 Nicer handling of timeline archival for Cloud storage
- When append() is not supported, rollover to new file always (instead of failing)
- Provide way to configure archive log folder (avoids small files inside .hoodie)
- Datasets written via Spark datasource archive to .hoodie/archived
- HoodieClientExample will now retain only 2,3 commits to exercise archival path during dev cycles
- Few tweaks to code structure around CommitArchiveLog
2018-01-17 23:34:21 -08:00
Vinoth Chandar
0cd186c899 Multi FS Support
- Reviving PR 191, to make FileSystem creation off actual path
- Streamline all filesystem access to HoodieTableMetaClient
- Hadoop Conf from Spark Context serialized & passed to executor code too
- Pick up env vars prefixed with HOODIE_ENV_ into Configuration object
- Cleanup usage of FSUtils.getFS, piggybacking off HoodieTableMetaClient.getFS
- Adding s3a to supported schemes & support escaping "." in env vars
- Tests use HoodieTestUtils.getDefaultHadoopConf
2018-01-17 23:34:21 -08:00
Nishith Agarwal
44839b88c6 Removing compaction action type and associated compaction timeline operations, replace with commit action type 2018-01-09 09:56:15 -08:00
Nishith Agarwal
4aed5c7338 Adding a new Partition/Time based compaction strategy 2017-12-05 16:30:38 -08:00
Nishith Agarwal
9b610f82c7 Separating out compaction() API 2017-11-14 22:56:29 -08:00