lanyuanxiaoyao/hudi

Author	SHA1	Message	Date
Balaji Varadarajan	5823c1ebd7	HUDI-138 - Meta Files handling also need to support consistency guard	2019-07-13 22:02:55 -07:00
Balaji Varadarajan	a0d7ab2384	HUDI-70 : Making DeltaStreamer run in continuous mode with concurrent compaction	2019-06-18 17:48:14 -07:00
Balaji Varadarajan	a1483f2c5f	HUDI-148 Small File selection logic for MOR must skip fileIds selected for pending compaction correctly	2019-06-17 18:35:17 -07:00
Balaji Varadarajan	cd7623e216	All Opened hoodie clients in tests needs to be closed TestMergeOnReadTable must use embedded timeline server	2019-06-13 12:30:07 -07:00
Balaji Varadarajan	136f8478a3	TestMergeOnReadTable must use embedded timeline server	2019-06-12 19:08:09 -07:00
Balaji Varadarajan	04fc86b43d	Turn on embedded server for all client tests	2019-06-12 18:14:55 -07:00
Vinoth Chandar	b791473a6d	Introduce HoodieReadHandle abstraction into index - Generalized BloomIndex to work with file ids instead of paths - Abstracted away Bloom filter checking into HoodieLookupHandle - Abstracted away range information retrieval into HoodieRangeInfoHandle	2019-06-12 10:46:14 -07:00
Balaji Varadarajan	479908fd20	HUDI-125 : Change License for all source files and update RAT configurations	2019-06-09 11:41:55 -07:00
Balaji Varadarajan	d0d2fa0337	Reduce logging in unit-test runs	2019-05-24 23:43:54 -07:00
Venkat	f2d91a455e	default implementation for HBase index qps allocator (#685) * default implementation and configs for HBase index qps allocator * Test for QPS allocator and address CR * fix QPS allocator test	2019-05-24 18:43:46 -07:00
Vinoth Chandar	4074c5eb23	Fixed HUDI-116 : Handle duplicate record keys across partitions - Join based on HoodieKey and not RecordKey during tagging - Unit tests changed to run with duplicate keys - Special casing GlobalBloom to still join by recordkey	2019-05-24 18:32:49 -07:00
Balaji Varadarajan	145034c5fa	Spark Stage retry handling	2019-05-21 14:49:51 -07:00
David Muto (pseudomuto)	3fd2fd6e9d	Remove redundant string from file comp rdd	2019-05-21 13:07:32 -07:00
Balaji Varadarajan	64fec64097	Timeline Service with Incremental View Syncing support	2019-05-16 13:25:33 -07:00
Vinoth Chandar	a0e62b7919	Bucketized Bloom Filter checking - Tackles the skew seen in sort based partitioning/checking - Parameterized the HoodieBloomIndex test - Config to turn on/off (on by default) - Unit tests & also tested at scale	2019-05-11 16:38:28 -07:00
Sivabalan Narayanan	7129dc5bb7	Improving Tag location using interval trees for index files Adding interface for index look up Adding index filtering implementations for global bloom index too	2019-04-29 11:38:23 -07:00
Nishith Agarwal	2577014617	1. Minor changes to fix compaction 2. Adding 2 compaction policies	2019-04-03 17:38:17 -07:00
Vinoth Chandar	5847f0c934	Fix HUDI-27 : Support num_cores > 1 for writing through spark - Users using spark.executor.cores > 1 used to fail due to "FileSystem closed" - This is due to HoodieWrapperFileSystem closing the wrapped filesytem obj - FileSystem.getInternal caching code races threads and closes the extra fs instance(s) - Bumped up num cores in tests to 8, speeds up tests by 3-4 mins	2019-03-28 15:56:21 -07:00
Vinoth Chandar	f1410bfdcd	Fixes HUDI-38: Reduce memory overhead of WriteStatus - For implicit indexes (e.g BloomIndex), don't buffer up written records - By default, only collect 10% of failing records to avoid OOMs - Improves debuggability via above, since data errors can now show up in collect() - Unit tests & fixing subclasses & adjusting tests	2019-03-28 10:32:59 -07:00
Vinoth Chandar	372fbc4733	Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsRetained - Added a precondition check, otherwise incr pull could miss commits - Lowered default cleaner retention to 10, to enable simpler understanding for newbies - Bumped down min/max instants to retain as well	2019-03-27 11:02:17 -07:00
Nishith Agarwal	3d9041e216	Fixing source schema and writer schema distinction in payloads	2019-03-26 19:44:27 -07:00
Nishith Agarwal	9e59da7fd9	Refactor HoodieTable Rollback to write one rollback instant for a batch of commits to rollback	2019-03-19 10:10:16 -07:00
Nishith Agarwal	0dd4a90b03	Enable multi/nested rollbacks for MOR table type	2019-03-19 10:10:16 -07:00
Balaji Varadarajan	3ae6cb4ed5	FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly	2019-03-01 10:49:04 -08:00
Balaji Varadarajan	8adaca3454	Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions	2019-02-11 18:30:21 -08:00
Balaji Varadarajan	3a0044216c	New Features in DeltaStreamer : (1) Apply transformation when using delta-streamer to ingest data. (2) Add Hudi Incremental Source for Delta Streamer (3) Allow delta-streamer config-property to be passed as command-line (4) Add Hive Integration to Delta-Streamer and address Review comments (5) Ensure MultiPartKeysValueExtractor handle hive style partition description (6) Reuse same spark session on both source and transformer (7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource (8) Reuse Binary Avro coders (9) Add push down filter for Incremental source (10) Add Hoodie DeltaStreamer metrics to track total time taken	2019-02-11 18:22:05 -08:00
Nishith Agarwal	7985eb72b5	Fixing behavior of Merge/CreateHandle for invalid/wrong schema records	2019-01-28 16:01:03 -08:00
Nishith Agarwal	994d42d307	cleaner should now use commit timeline and not include deltacomits	2019-01-28 10:46:33 -08:00
Nishith Agarwal	110df7190b	Enabling hard deletes for MergeOnRead table type	2018-12-31 12:49:58 -08:00
xubo245	466ff73ffb	fix some spell errorin Hudi	2018-12-12 13:06:25 -08:00
Nishith Agarwal	d0fde47458	Fixing number of insert buckets to be generated by rounding off to the closest greater integer	2018-11-15 10:04:45 -08:00
Balaji Varadarajan	07324e7a20	Compaction validate, unschedule and repair	2018-10-25 14:12:47 -07:00
jiale.tan	1628d044ac	feat(SparkDataSource): add additional feature to drop later arriving dups	2018-10-16 11:52:50 -07:00
jiale.tan	98fd97b65f	feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup	2018-09-29 19:55:20 +05:30
vinothchandar	9ca6f91e97	Perform consistency checks during write finalize - Check to ensure written files are listable on storage - Docs reflected to capture how this helps with s3 storage - Unit tests added, corrections to existing tests - Fix DeltaStreamer to manage archived commits in a separate folder	2018-09-28 08:04:41 +05:30
Balaji Varadarajan	5cb28e7b1f	Explicitly release resources in LogFileReader and TestHoodieClientBase	2018-09-20 13:24:57 +05:30
Vinoth Chandar	eca49a255e	Rebasing and fixing conflicts against master	2018-09-11 11:03:30 +05:30
Nishith Agarwal	2b1af18941	Adding check for rolling stats not present to handle backwards compatibility of existing timeline	2018-09-10 11:53:46 +08:00
Vinoth Chandar	d58ddbd999	Reworking the deltastreamer tool - Standardize version of jackson - DFSPropertiesConfiguration replaces usage of commons PropertiesConfiguration - Remove dependency on ConstructorUtils - Throw error if ordering value is not present, during key generation - Switch to shade plugin for hoodie-utilities - Added support for consumption for Confluent avro kafka serdes - Support for Confluent schema registry - KafkaSource now deals with skews nicely, by doing round robin allocation of source limit across partitions - Added support for BULK_INSERT operations as well - Pass in the payload class config properly into HoodieWriteClient - Fix documentation based on new usage - Adding tests on deltastreamer, sources and all new util classes.	2018-09-08 10:24:32 +08:00
Nishith Agarwal	0fe92dee55	Fix a failing test case intermittenly in TestMergeOnRead due to incorrect prev commit time	2018-09-08 09:39:18 +08:00
Nishith Agarwal	459e523d9e	1. Small file size handling for inserts into log files. In summary, the total size of the log file is compared with the parquet max file size and if there is scope to add inserts the add it.	2018-09-06 08:52:08 +08:00
Nishith Agarwal	324de298bc	Removing dependency on apache-commons lang 3, adding necessary classes as needed	2018-09-06 08:26:48 +08:00
Kaushik Devarajaiah	e624480259	Throttling to limit QPS from HbaseIndex	2018-08-21 21:10:38 -07:00
Nishith Agarwal	88274b8261	Adding another metric to HoodieWriteStat to determine if there were inserts converted to updates, added one test for this	2018-08-14 06:22:16 -07:00
Balaji Varadarajan	2f8ce93030	Async Compaction Main API changes	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	9b78523d62	Ensure Cleaner and Archiver do not delete file-slices and workload marked for compaction	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	0a0451a765	Ensure Compaction workload is stored in write-once meta-data files separate from timeline files. This avoids concurrency issues when compactor(s) and ingestor are running in parallel. In the Next PR -> Safety concern regarding Cleaner retaining all meta-data and file-slices for pending compactions will be addressed	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	1b61f04e05	(1) Define CompactionWorkload in avro to allow storing them in instant files. (2) Split APIs in HoodieRealtimeCompactor to separate generating compaction workload from running compaction	2018-08-07 08:19:50 -07:00
Balaji Varadarajan	6d01ae8ca0	FileSystemView and Timeline level changes to support Async Compaction	2018-08-07 08:19:50 -07:00
Nishith Agarwal	44caf0d40c	Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled	2018-07-18 12:53:35 -07:00

125 Commits