1
0
Commit Graph

60 Commits

Author SHA1 Message Date
Balaji Varadarajan
a7e6cf5197 Support nested types for recordKey, partitionPath and combineKey 2019-05-18 07:14:58 -07:00
vinothchandar
446f99aa0f [maven-release-plugin] prepare for next development iteration 2019-05-14 07:29:22 -07:00
vinothchandar
cc38abecc8 [maven-release-plugin] prepare release hoodie-0.4.6 2019-05-14 07:29:11 -07:00
Balaji Varadarajan
978470af33 Rollback inflights when using Spark [Streaming] write 2019-05-02 12:51:02 -07:00
vinothchandar
57a8b9cc8c Making DataSource/DeltaStreamer use defaults for combining
- Addresses issue where insert will combine and remove duplicates within batch
 - Setting default insert combining to false (write client default)
 - Set to true if filtering duplicates on insert/bulk_insert
2019-05-01 13:21:21 -07:00
Naoki Takezoe
461ce18bd1 Fix to enable hoodie.datasource.read.incr.filters 2019-04-26 11:14:06 -07:00
lyogev
9ef51deb84 Add empty payload class to support deletes via apache spark 2019-04-17 23:00:20 -07:00
Vinoth Chandar
b34a204a52 Fixing small file handling, inline compaction defaults
- Small file limit is now 100MB by default
 - Turned on inline compaction by default for MOR
 - Changes take effect on DataSource and DeltaStreamer
2019-04-03 10:56:10 -07:00
Vinoth Chandar
e56c1612e4 Fixed HUDI-87 : Remove schemastr from BaseAvroPayload 2019-03-27 23:03:25 -07:00
Nishith Agarwal
3d9041e216 Fixing source schema and writer schema distinction in payloads 2019-03-26 19:44:27 -07:00
kaka11chen
b514e1ab18 Fix avro doesn't have short and byte byte. 2019-03-06 16:09:24 -08:00
vinothchandar
687395e40f [maven-release-plugin] prepare for next development iteration 2019-02-27 07:16:27 -08:00
vinothchandar
bbf40ef987 [maven-release-plugin] prepare release hoodie-0.4.5 2019-02-27 07:16:15 -08:00
Kent Yao
8dddecf00f handle no such element exception in HoodieSparkSqlWriter 2019-02-15 22:11:48 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
 (2) Add Hudi Incremental Source for Delta Streamer
 (3) Allow delta-streamer config-property to be passed as command-line
 (4) Add Hive Integration to Delta-Streamer and address Review comments
 (5) Ensure MultiPartKeysValueExtractor  handle hive style partition description
 (6) Reuse same spark session on both source and transformer
 (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
 (8) Reuse Binary Avro coders
 (9) Add push down filter for Incremental source
 (10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00
Kent Yao
2b55f0751f Using immutable map instead of mutables to generate parameters 2019-01-30 16:09:40 -08:00
arukavytsia
6946dd7557 General enhancements 2018-12-18 12:52:39 -08:00
Balaji Varadarajan
30c5f8b7bd Ensure Hoodie works for non-partitioned Hive table 2018-12-12 13:35:16 -08:00
xubo245
466ff73ffb fix some spell errorin Hudi 2018-12-12 13:06:25 -08:00
jiale.tan
bf65219b73 feat(SparkDataSource): add structured streaming 2018-12-04 16:33:00 -08:00
Nishith Agarwal
7243ce40c9 Serializing the complete payload object instead of serializing just the GenericRecord
Removing Converter hierarchy as we now depend purely on JavaSerialization and require the payload to be java serializable
2018-12-04 11:43:41 -08:00
jiale.tan
1628d044ac feat(SparkDataSource): add additional feature to drop later arriving dups 2018-10-16 11:52:50 -07:00
Vinoth Chandar
1fca9b21cc Add --filter-dupes to DeltaStreamer
- Optionally filter out duplicates before inserting data
 - Unit tests
2018-10-04 11:25:18 +05:30
Balaji Varadarajan
f3418e4718 Docker Container Build and Run setup with foundations for adding docker integration tests. Docker images built with Hadoop 2.8.4 Hive 2.3.3 and Spark 2.3.1 and published to docker-hub
Look at quickstart document for how to setup docker and run demo
2018-10-02 09:28:21 +05:30
vinothchandar
7ba842c0fe [maven-release-plugin] prepare for next development iteration 2018-09-28 11:27:00 +05:30
vinothchandar
5847b61f44 [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-28 11:26:15 +05:30
vinothchandar
9ca6f91e97 Perform consistency checks during write finalize
- Check to ensure written files are listable on storage
 - Docs reflected to capture how this helps with s3 storage
 - Unit tests added, corrections to existing tests
 - Fix DeltaStreamer to manage archived commits in a separate folder
2018-09-28 08:04:41 +05:30
Balaji Varadarajan
4c74dd4cad Travis CI tests needs to be run in quieter mode (WARN log level) to avoid max log-size errors 2018-09-26 21:10:20 +05:30
Vinoth Chandar
f44bcc5b03 Fix bug with incrementally pulling older data 2018-09-18 02:34:00 +05:30
Vinoth Chandar
bd5af89f12 [maven-release-plugin] rollback the release of hoodie-0.4.4 2018-09-13 15:01:53 +05:30
Vinoth Chandar
d1cc864a43 [maven-release-plugin] prepare for next development iteration 2018-09-12 23:59:47 +05:30
Vinoth Chandar
b748bc836d [maven-release-plugin] prepare release hoodie-0.4.4 2018-09-12 23:59:34 +05:30
Balaji Varadarajan
18a39715c9 Bump up versions in packaging modules and remove commons-lang3 dep 2018-09-11 11:03:30 +05:30
Vinoth Chandar
a5359662be Moving depedencies off cdh to apache + Hive2 support
- Tests redone in the process
 - Main changes are to RealtimeRecordReader and how it treats maps/arrays
 - Make hive sync work with Hive 1/2 and CDH environments
 - Fixes to make corner cases for Hive queries
 - Spark Hive integration - Working version across Apache and CDH versions
 - Known Issue - https://github.com/uber/hudi/issues/439
2018-09-11 11:03:30 +05:30
Vinoth Chandar
d58ddbd999 Reworking the deltastreamer tool
- Standardize version of jackson
 - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration
 - Remove dependency on ConstructorUtils
 - Throw error if ordering value is not present, during key generation
 - Switch to shade plugin for hoodie-utilities
 - Added support for consumption for Confluent avro kafka serdes
 - Support for Confluent schema registry
 - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions
 - Added support for BULK_INSERT operations as well
 - Pass in the payload class config properly into HoodieWriteClient
 - Fix documentation based on new usage
 - Adding tests on deltastreamer, sources and all new util classes.
2018-09-08 10:24:32 +08:00
Nishith Agarwal
459e523d9e 1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size and if there is scope to add inserts the add it. 2018-09-06 08:52:08 +08:00
Nishith Agarwal
324de298bc Removing dependency on apache-commons lang 3, adding necessary classes as needed 2018-09-06 08:26:48 +08:00
Vinoth Chandar
89cd6b0726 [maven-release-plugin] prepare for next development iteration 2018-08-22 21:30:05 -07:00
Vinoth Chandar
8d305c5a86 [maven-release-plugin] prepare release hoodie-0.4.3 2018-08-22 21:29:53 -07:00
Balaji Varadarajan
2e12c86d01 Ensure Compaction Operation compacts the data file as defined in the workload 2018-08-07 08:19:50 -07:00
Balaji Varadarajan
2f8ce93030 Async Compaction Main API changes 2018-08-07 08:19:50 -07:00
Nishith Agarwal
3da063f83b Adding ability for inserts to be written to log files 2018-06-11 14:08:59 -07:00
Vinoth Chandar
34827d50e1 [maven-release-plugin] prepare for next development iteration 2018-06-11 08:59:13 -07:00
Vinoth Chandar
43ef385730 [maven-release-plugin] prepare release hoodie-0.4.2 2018-06-11 08:59:02 -07:00
Vinoth Chandar
85dd265b7b Improving out of box experience for data source
- Fixes #246
 - Bump up default parallelism to 1500, to handle large upserts
 - Add docs on s3 confuration & tuning tips with tested spark knobs
 - Fix bug to not duplicate hoodie metadata fields when input dataframe is another hoodie dataset
 - Improve speed of ROTablePathFilter by removing directory check
 - Move to spark-avro 4.0 to handle issue with nested fields with same name
 - Keep AvroConversionUtils in sync with spark-avro 4.0
2018-06-10 19:16:44 -07:00
Balaji Varadarajan
788e4f2d2e CodeStyle formatting to conform to basic Checkstyle rules.
The code-style rules follow google style with some changes:

1. Increase line length from 100 to 120
2. Disable JavaDoc related checkstyles as this needs more manual work.

Both source and test code are checked for code-style
2018-03-30 11:09:40 -07:00
Nishith Agarwal
987f5d6b96 Making ExternalSpillableMap generic for any datatype
- Introduced concept of converters to be able to serde generic datatype for SpillableMap
	- Fixed/Added configs to Hoodie Configs
	- Changed HoodieMergeHandle to start using SpillableMap
2018-03-28 07:56:07 -07:00
Xavier Jodoin
fa787ab5ab Replace deprecated jackson version 2018-03-27 14:27:20 -07:00
Jian Xu
7f079632a6 Use hadoopConf in HoodieTableMetaClient and related tests 2018-03-12 11:47:55 -07:00
Vinoth Chandar
73534d467f [maven-release-plugin] prepare for next development iteration 2018-03-07 21:04:10 -08:00