1
0
Commit Graph

79 Commits

Author SHA1 Message Date
Omkar Joshi
738635306b migrating kryo's dependency from twitter chill to plain kryo library 2019-05-06 20:32:00 -07:00
Nishith Agarwal
26f24b6728 Removing OLD MAGIC header since a) it's no longer used b) causes issues when the data actually has OLD MAGIC 2019-04-25 20:47:16 -07:00
Balaji Varadarajan
2f1e3e15fb Revert "Read and apply schema for each log block from the metadata header instead of the latest schema"
This reverts commit 9e7ce19b06.
2019-04-18 08:54:34 -07:00
Omkar Joshi
e35d24f31d Revert "Replacing Apache commons-lang3 object serializer with Kryo serializer"
This reverts commit a6c45feb2c.
2019-04-17 09:23:37 -07:00
Nishith Agarwal
9e7ce19b06 Read and apply schema for each log block from the metadata header instead of the latest schema 2019-04-16 17:20:03 -07:00
Jing Chen
d1d33f725e [HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated folder names 2019-04-03 17:37:03 -07:00
Omkar Joshi
a6c45feb2c Replacing Apache commons-lang3 object serializer with Kryo serializer 2019-03-18 14:12:25 -07:00
Balaji Varadarajan
3ae6cb4ed5 FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly 2019-03-01 10:49:04 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
 (2) Add Hudi Incremental Source for Delta Streamer
 (3) Allow delta-streamer config-property to be passed as command-line
 (4) Add Hive Integration to Delta-Streamer and address Review comments
 (5) Ensure MultiPartKeysValueExtractor  handle hive style partition description
 (6) Reuse same spark session on both source and transformer
 (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
 (8) Reuse Binary Avro coders
 (9) Add push down filter for Incremental source
 (10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00
Nishith Agarwal
110df7190b Enabling hard deletes for MergeOnRead table type 2018-12-31 12:49:58 -08:00
Nishith Agarwal
7243ce40c9 Serializing the complete payload object instead of serializing just the GenericRecord
Removing Converter hierarchy as we now depend purely on JavaSerialization and require the payload to be java serializable
2018-12-04 11:43:41 -08:00
Vinoth Chandar
fa65db9c4c Explicitly handle lack of append() support during LogWriting 2018-11-27 17:58:43 -08:00
Balaji Varadarajan
07324e7a20 Compaction validate, unschedule and repair 2018-10-25 14:12:47 -07:00
Xinli shang
d904fe69ca Fix addMetadataFields() to carry over 'props' 2018-10-24 10:55:13 -07:00
Balaji Varadarajan
9710b5a3a6 Ensure Hoodie metadata folder and files are filtered out when constructing Parquet Data Source 2018-10-01 14:27:14 +05:30
jiale.tan
98fd97b65f feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup 2018-09-29 19:55:20 +05:30
Balaji Varadarajan
4c74dd4cad Travis CI tests needs to be run in quieter mode (WARN log level) to avoid max log-size errors 2018-09-26 21:10:20 +05:30
Yishuang Lu
faf93b6340 Fix the name of avro schema file in Test
Fixed the name of avro schema file in Test

Signed-off-by: Yishuang Lu <luystu@gmail.com>
2018-09-24 21:58:34 +05:30
Balaji Varadarajan
5cb28e7b1f Explicitly release resources in LogFileReader and TestHoodieClientBase 2018-09-20 13:24:57 +05:30
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
 - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
 - Remove dependency on ConstructorUtils
 - Throw error if ordering value is not present, during key generation
 - Switch to shade plugin for hoodie-utilities
 - Added support for consumption for Confluent avro kafka serdes
 - Support for Confluent schema registry
 - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
 - Added support for BULK_INSERT operations as well
 - Pass in the payload class config properly into HoodieWriteClient
 - Fix documentation based on new usage
 - Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size and if there is scope to add inserts the add it. 2018-09-06 08:52:08 +08:00
Nishith Agarwal
324de298bc Removing dependency on apache-commons lang 3, adding necessary classes as needed 2018-09-06 08:26:48 +08:00
Balaji Varadarajan
2e12c86d01 Ensure Compaction Operation compacts the data file as defined in the workload 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
2f8ce93030 Async Compaction Main API changes 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
9b78523d62 Ensure Cleaner and Archiver do not delete file-slices and workload marked for compaction 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
0a0451a765 Ensure Compaction workload is stored in write-once meta-data files separate from timeline files.
This avoids concurrency issues when compactor(s) and ingestor are running in parallel.
    In the Next PR -> Safety concern regarding Cleaner retaining all meta-data and file-slices for pending compactions will be addressed
2018-08-07 08:19:50 -07:00
Balaji Varadarajan
9d99942564 Track fileIds with pending compaction in FileSystemView to provide correct API semantics 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
6d01ae8ca0 FileSystemView and Timeline level changes to support Async Compaction 2018-08-07 08:19:50 -07:00
Nishith Agarwal
44caf0d40c Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled 2018-07-18 12:53:35 -07:00
Xavier Jodoin
8ad8030f2a Fix wrong use of TemporaryFolder junit rule 2018-06-10 23:31:42 -07:00
Sunil Ramaiah
a97814462d Added a filter function to filter the record keys in a parquet file 2018-05-17 19:01:11 -07:00
Balaji Varadarajan
dfc0c61eb7 Support union mode in HoodieRealtimeRecordReader for pure insert workloads
Also Replace BufferedIteratorPayload abstraction with function passing
2018-05-10 17:39:56 -07:00
Nishith Agarwal
93f345a032 Minor fixes for MergeOnRead MVP release readiness 2018-05-09 07:23:58 -07:00
Nishith Agarwal
75df72f575 Adding a fix/workaround when fs.append() unable to return a valid outputstream 2018-05-08 18:46:17 -07:00
Nishith Agarwal
04655e9e85 Adding metrics for MOR and COW 2018-04-26 09:32:45 -07:00
Nishith Agarwal
c3c205fc02 Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream 2018-04-18 08:05:19 -07:00
Balaji Varadarajan
6c226ca21a Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases 2018-04-09 16:34:58 -07:00
Balaji Varadarajan
788e4f2d2e CodeStyle formatting to conform to basic Checkstyle rules.
The code-style rules follow google style with some changes:

1. Increase line length from 100 to 120
2. Disable JavaDoc related checkstyles as this needs more manual work.

Both source and test code are checked for code-style
2018-03-30 11:09:40 -07:00
Nishith Agarwal
987f5d6b96 Making ExternalSpillableMap generic for any datatype
- Introduced concept of converters to be able to serde generic datatype for SpillableMap
	- Fixed/Added configs to Hoodie Configs
	- Changed HoodieMergeHandle to start using SpillableMap
2018-03-28 07:56:07 -07:00
Xavier Jodoin
fa787ab5ab Replace deprecated jackson version 2018-03-27 14:27:20 -07:00
Nishith Agarwal
123da020e2 - Fixing memory leak due to HoodieLogFileReader holding on to a logblock
- Removed inMemory HashMap usage in merge(..) code in LogScanner
2018-03-16 12:43:31 -07:00
Jian Xu
7f079632a6 Use hadoopConf in HoodieTableMetaClient and related tests 2018-03-12 11:47:55 -07:00
Nishith Agarwal
5405a6287b Introducing HoodieLogFormat V2 with versioning support
- HoodieLogFormat V2 has support for LogFormat evolution through versioning
			- LogVersion is associated with a LogBlock not a LogFile
			- Based on a version for a LogBlock, approporiate code path is executed
		- Implemented LazyReading of Hoodie Log Blocks with Memory / IO tradeoff
		- Implemented Reverse pointer to be able to traverse the log in reverse
		- Introduce new MAGIC for backwards compatibility with logs without versions
2018-03-06 21:14:11 -08:00
Jian Xu
dfd1979c51 Handle inflight clean instants during Hoodie instants archiving 2018-03-05 15:01:58 -08:00
Nishith Agarwal
6fec9655a8 Added support for Disk Spillable Compaction to prevent OOM issues 2018-02-26 16:00:35 -08:00
Nishith Agarwal
2116815261 Fixing Rollback for compaction/commit operation, added check for null commit
- Fallback to old way of rollback by listing all partitions
	- Added null check to ensure only partitions which are to be rolledback are considered
	- Added location (committime) to workload stat
	- Added checks in CompactedScanner to guard against task retries
	- Introduce new logic for rollback (bounded by instant_time and target_instant time)
        - Reversed logfiles order
2018-02-06 16:55:23 -08:00
Jian Xu
37f2cdd7e4 Incorporating code review feedback for finalizeWrite for COW #2 2018-02-02 11:38:25 -08:00
Jian Xu
c874248f23 Add FinalizeWrite in HoodieCreateHandle for COW tables 2018-02-02 11:38:25 -08:00
Nishith Agarwal
937ae322ba Reducing memory footprint required in HoodieAvroDataBlock and HoodieAppendHandle 2018-01-29 08:22:29 -08:00
vinothchandar
21ce846f18 Remove stateful fs member from HoodieTestUtils & FSUtils 2018-01-17 23:34:21 -08:00