194 Commits

Author SHA1 Message Date
Balaji Varadarajan
9cce9abf4d Fix various errors found by long running delta-streamer tests
1. Parquet Avro schema mismatch errors when ingesting are sometimes silently ignored due to race-condition in BoundedInMemoryExecutor. This was reproducible when running long-running delta-streamer with wrong schema and it caused data-loss
2. Fix behavior of Delta-Streamer to error out by default if there are any error records
3. Fix a bug in tracking write errors in WriteStats. Earlier the write errors were tracking sampled errors as opposed to total errors.
4. Delta Streamer does not commit the changes done as part of inline compaction as auto-commit is force disabled. Fix this behavior to always auto-commit inline compaction as it would not otherwise commit.
2019-05-13 10:47:34 -07:00
Omkar Joshi
738635306b migrating kryo's dependency from twitter chill to plain kryo library 2019-05-06 20:32:00 -07:00
Nishith Agarwal
a33a55fcb5 Caching Avro Binary encoder/decoder to avoid creating new one for every record 2019-05-06 11:28:08 -07:00
Nishith Agarwal
26f24b6728 Removing OLD MAGIC header since a) it's no longer used b) causes issues when the data actually has OLD MAGIC 2019-04-25 20:47:16 -07:00
Balaji Varadarajan
2f1e3e15fb Revert "Read and apply schema for each log block from the metadata header instead of the latest schema"
This reverts commit 9e7ce19b06.
2019-04-18 08:54:34 -07:00
Omkar Joshi
e35d24f31d Revert "Replacing Apache commons-lang3 object serializer with Kryo serializer"
This reverts commit a6c45feb2c.
2019-04-17 09:23:37 -07:00
Nishith Agarwal
9e7ce19b06 Read and apply schema for each log block from the metadata header instead of the latest schema 2019-04-16 17:20:03 -07:00
Nishith Agarwal
2577014617 1. Minor changes to fix compaction 2. Adding 2 compaction policies 2019-04-03 17:38:17 -07:00
Jing Chen
d1d33f725e [HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated folder names 2019-04-03 17:37:03 -07:00
Nishith Agarwal
3d9041e216 Fixing source schema and writer schema distinction in payloads 2019-03-26 19:44:27 -07:00
Nishith Agarwal
9e59da7fd9 Refactor HoodieTable Rollback to write one rollback instant for a batch of commits to rollback 2019-03-19 10:10:16 -07:00
Nishith Agarwal
0dd4a90b03 Enable multi/nested rollbacks for MOR table type 2019-03-19 10:10:16 -07:00
Omkar Joshi
a6c45feb2c Replacing Apache commons-lang3 object serializer with Kryo serializer 2019-03-18 14:12:25 -07:00
Omkar Joshi
4a8bec7ea5 Handling duplicate record update for single partition (duplicates in single or different parquet files) 2019-03-10 20:15:17 -07:00
Balaji Varadarajan
3ae6cb4ed5 FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly 2019-03-01 10:49:04 -08:00
vinothchandar
687395e40f [maven-release-plugin] prepare for next development iteration 2019-02-27 07:16:27 -08:00
vinothchandar
bbf40ef987 [maven-release-plugin] prepare release hoodie-0.4.5 2019-02-27 07:16:15 -08:00
Bhavani Sudha Saktheeswaran
639c287cab Close FSDataInputStream for meta file open in HoodiePartitionMetadata 2019-02-15 22:16:31 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
(2) Add Hudi Incremental Source for Delta Streamer
(3) Allow delta-streamer config-property to be passed as command-line
(4) Add Hive Integration to Delta-Streamer and address Review comments
(5) Ensure MultiPartKeysValueExtractor handles hive style partition description
(6) Reuse same spark session on both source and transformer
(7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
(8) Reuse Binary Avro coders
(9) Add push down filter for Incremental source
(10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00
Nishith Agarwal
d1bb804577 Passing a path filter to avoid including folders under .hoodie directory as partition paths 2019-01-11 19:21:09 -08:00
Nishith Agarwal
110df7190b Enabling hard deletes for MergeOnRead table type 2018-12-31 12:49:58 -08:00
arukavytsia
6946dd7557 General enhancements 2018-12-18 12:52:39 -08:00
Balaji Varadarajan
30c5f8b7bd Ensure Hoodie works for non-partitioned Hive table 2018-12-12 13:35:16 -08:00
xubo245
466ff73ffb Fix some spelling errors in Hudi 2018-12-12 13:06:25 -08:00
Nishith Agarwal
7243ce40c9 Serializing the complete payload object instead of serializing just the GenericRecord
Removing Converter hierarchy as we now depend purely on JavaSerialization and require the payload to be java serializable
2018-12-04 11:43:41 -08:00
Vinoth Chandar
fa65db9c4c Explicitly handle lack of append() support during LogWriting 2018-11-27 17:58:43 -08:00
Balaji Varadarajan
25cd05b24e Useful Hudi CLI commands to debug/analyze production workloads 2018-10-30 10:28:01 -07:00
Balaji Varadarajan
07324e7a20 Compaction validate, unschedule and repair 2018-10-25 14:12:47 -07:00
Xinli shang
d904fe69ca Fix addMetadataFields() to carry over 'props' 2018-10-24 10:55:13 -07:00
Balaji Varadarajan
9710b5a3a6 Ensure Hoodie metadata folder and files are filtered out when constructing Parquet Data Source 2018-10-01 14:27:14 +05:30
jiale.tan
98fd97b65f feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup 2018-09-29 19:55:20 +05:30
vinothchandar
7ba842c0fe [maven-release-plugin] prepare for next development iteration 2018-09-28 11:27:00 +05:30
vinothchandar
5847b61f44 [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-28 11:26:15 +05:30
vinothchandar
9ca6f91e97 Perform consistency checks during write finalize
- Check to ensure written files are listable on storage
- Docs reflected to capture how this helps with s3 storage
- Unit tests added, corrections to existing tests
- Fix DeltaStreamer to manage archived commits in a separate folder
2018-09-28 08:04:41 +05:30
Balaji Varadarajan
4c74dd4cad Travis CI tests need to be run in quieter mode (WARN log level) to avoid max log-size errors 2018-09-26 21:10:20 +05:30
Yishuang Lu
faf93b6340 Fix the name of avro schema file in Test
Fixed the name of avro schema file in Test

Signed-off-by: Yishuang Lu <luystu@gmail.com>
2018-09-24 21:58:34 +05:30
Balaji Varadarajan
5cb28e7b1f Explicitly release resources in LogFileReader and TestHoodieClientBase 2018-09-20 13:24:57 +05:30
Balaji Varadarajan
2728f96505 Add dummy classes to dump all classes loaded as part of packaging modules to ensure javadoc and sources jars are getting created 2018-09-18 09:24:33 +05:30
Vinoth Chandar
bd5af89f12 [maven-release-plugin] rollback the release of hoodie-0.4.4 2018-09-13 15:01:53 +05:30
Vinoth Chandar
d1cc864a43 [maven-release-plugin] prepare for next development iteration 2018-09-12 23:59:47 +05:30
Vinoth Chandar
b748bc836d [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-12 23:59:34 +05:30
Vinoth Chandar
a5359662be Moving dependencies off cdh to apache + Hive2 support
- Tests redone in the process
- Main changes are to RealtimeRecordReader and how it treats maps/arrays
- Make hive sync work with Hive 1/2 and CDH environments
- Fixes to make corner cases for Hive queries
- Spark Hive integration - Working version across Apache and CDH versions
- Known Issue - https://github.com/uber/hudi/issues/439
2018-09-11 11:03:30 +05:30
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
- DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
- Remove dependency on ConstructorUtils
- Throw error if ordering value is not present, during key generation
- Switch to shade plugin for hoodie-utilities
- Added support for consumption for Confluent avro kafka serdes
- Support for Confluent schema registry
- KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
- Added support for BULK_INSERT operations as well
- Pass in the payload class config properly into HoodieWriteClient
- Fix documentation based on new usage
- Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Balaji Varadarajan
fb95dbdedb CLI to create and desc hoodie table 2018-09-08 10:03:38 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size and if there is scope to add inserts then add them. 2018-09-06 08:52:08 +08:00
Nishith Agarwal
324de298bc Removing dependency on apache-commons lang 3, adding necessary classes as needed 2018-09-06 08:26:48 +08:00
Vinoth Chandar
89cd6b0726 [maven-release-plugin] prepare for next development iteration 2018-08-22 21:30:05 -07:00
Vinoth Chandar
8d305c5a86 [maven-release-plugin] prepare release hoodie-0.4.3 2018-08-22 21:29:53 -07:00
Nishith Agarwal
3746ace76a Fixing Null pointer exception in finally block 2018-08-21 21:07:53 -07:00
Nishith Agarwal
88274b8261 Adding another metric to HoodieWriteStat to determine if there were inserts converted to updates, added one test for this 2018-08-14 06:22:16 -07:00