lanyuanxiaoyao/hudi

Author	SHA1	Message	Date
lyogev	9ef51deb84	Add empty payload class to support deletes via apache spark	2019-04-17 23:00:20 -07:00
Vinoth Chandar	b34a204a52	Fixing small file handling, inline compaction defaults - Small file limit is now 100MB by default - Turned on inline compaction by default for MOR - Changes take effect on DataSource and DeltaStreamer	2019-04-03 10:56:10 -07:00
Kent Yao	8dddecf00f	handle no such element exception in HoodieSparkSqlWriter	2019-02-15 22:11:48 -08:00
Balaji Varadarajan	3a0044216c	New Features in DeltaStreamer : (1) Apply transformation when using delta-streamer to ingest data. (2) Add Hudi Incremental Source for Delta Streamer (3) Allow delta-streamer config-property to be passed as command-line (4) Add Hive Integration to Delta-Streamer and address Review comments (5) Ensure MultiPartKeysValueExtractor handle hive style partition description (6) Reuse same spark session on both source and transformer (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource (8) Reuse Binary Avro coders (9) Add push down filter for Incremental source (10) Add Hoodie DeltaStreamer metrics to track total time taken	2019-02-11 18:22:05 -08:00
Kent Yao	2b55f0751f	Using immutable map instead of mutables to generate parameters	2019-01-30 16:09:40 -08:00
Balaji Varadarajan	30c5f8b7bd	Ensure Hoodie works for non-partitioned Hive table	2018-12-12 13:35:16 -08:00
xubo245	466ff73ffb	fix some spelling errors in Hudi	2018-12-12 13:06:25 -08:00
jiale.tan	bf65219b73	feat(SparkDataSource): add structured streaming	2018-12-04 16:33:00 -08:00
Nishith Agarwal	7243ce40c9	Serializing the complete payload object instead of serializing just the GenericRecord. Removing Converter hierarchy as we now depend purely on JavaSerialization and require the payload to be java serializable	2018-12-04 11:43:41 -08:00
jiale.tan	1628d044ac	feat(SparkDataSource): add additional feature to drop later arriving dups	2018-10-16 11:52:50 -07:00
Balaji Varadarajan	f3418e4718	Docker Container Build and Run setup with foundations for adding docker integration tests. Docker images built with Hadoop 2.8.4, Hive 2.3.3 and Spark 2.3.1 and published to docker-hub. Look at quickstart document for how to set up docker and run demo	2018-10-02 09:28:21 +05:30
Balaji Varadarajan	4c74dd4cad	Travis CI tests need to be run in quieter mode (WARN log level) to avoid max log-size errors	2018-09-26 21:10:20 +05:30
Vinoth Chandar	f44bcc5b03	Fix bug with incrementally pulling older data	2018-09-18 02:34:00 +05:30
Balaji Varadarajan	18a39715c9	Bump up versions in packaging modules and remove commons-lang3 dep	2018-09-11 11:03:30 +05:30
Vinoth Chandar	a5359662be	Moving dependencies off cdh to apache + Hive2 support - Tests redone in the process - Main changes are to RealtimeRecordReader and how it treats maps/arrays - Make hive sync work with Hive 1/2 and CDH environments - Fixes to make corner cases for Hive queries - Spark Hive integration - Working version across Apache and CDH versions - Known Issue - https://github.com/uber/hudi/issues/439	2018-09-11 11:03:30 +05:30
Vinoth Chandar	d58ddbd999	Reworking the deltastreamer tool - Standardize version of jackson - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration - Remove dependency on ConstructorUtils - Throw error if ordering value is not present, during key generation - Switch to shade plugin for hoodie-utilities - Added support for consumption for Confluent avro kafka serdes - Support for Confluent schema registry - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions - Added support for BULK_INSERT operations as well - Pass in the payload class config properly into HoodieWriteClient - Fix documentation based on new usage - Adding tests on deltastreamer, sources and all new util classes.	2018-09-08 10:24:32 +08:00
Balaji Varadarajan	2e12c86d01	Ensure Compaction Operation compacts the data file as defined in the workload	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	2f8ce93030	Async Compaction Main API changes	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	788e4f2d2e	CodeStyle formatting to conform to basic Checkstyle rules. The code-style rules follow google style with some changes: 1. Increase line length from 100 to 120 2. Disable JavaDoc related checkstyles as this needs more manual work. Both source and test code are checked for code-style	2018-03-30 11:09:40 -07:00
Vinoth Chandar	0cd186c899	Multi FS Support - Reviving PR 191, to make FileSystem creation off actual path - Streamline all filesystem access to HoodieTableMetaClient - Hadoop Conf from Spark Context serialized & passed to executor code too - Pick up env vars prefixed with HOODIE_ENV_ into Configuration object - Cleanup usage of FSUtils.getFS, piggybacking off HoodieTableMetaClient.getFS - Adding s3a to supported schemes & support escaping "." in env vars - Tests use HoodieTestUtils.getDefaultHadoopConf	2018-01-17 23:34:21 -08:00
Vinoth Chandar	e45679f5e2	Reformatting code per Google Code Style all over	2017-11-12 23:19:02 -08:00
Vinoth Chandar	274aaf49fe	Incorporating code review feedback for DataSource	2017-10-02 20:44:53 -07:00
Vinoth Chandar	64e0573aca	Adding hoodie-spark to support Spark Datasource for Hoodie - Write with COW/MOR paths work fully - Read with RO view works on both storages* - Incremental view supported on COW - Refactored out HoodieReadClient methods, to just contain key based access - HoodieDataSourceHelpers class can be now used to construct inputs to datasource - Tests in hoodie-client using new helpers and mechanisms - Basic tests around save modes & insert/upserts (more to follow) - Bumped up scala to 2.11, since 2.10 is deprecated & complains with scalatest - Updated documentation to describe usage - New sample app written using the DataSource API	2017-10-02 20:44:53 -07:00

23 Commits