Commit Graph

418 Commits

Author SHA1 Message Date
Vinoth Chandar
7002ca6775 Update release notes for 0.4.6 release 2019-05-14 05:16:58 -07:00
Balaji Varadarajan
6e1e626357 Minor CLI documentation change in delta-streamer 2019-05-14 04:05:47 -07:00
Nishith Agarwal
af46078a82 converting map task memory from mb to bytes 2019-05-13 21:23:30 -07:00
Balaji Varadarajan
9cce9abf4d Fix various errors found by long running delta-streamer tests
1. Parquet Avro schema mismatch errors when ingesting are sometimes silently ignored due to race-condition in BoundedInMemoryExecutor. This was reproducible when running long-running delta-streamer with wrong schema and it caused data-loss
2. Fix behavior of Delta-Streamer to error out by default if there are any error records
3. Fix a bug in tracking write errors in WriteStats. Earlier the write errors were tracking sampled errors as opposed to total errors.
4. Delta Streamer does not commit the changes done as part of inline compaction as auto-commit is force disabled. Fix this behavior to always auto-commit inline compaction as it would not otherwise commit.
2019-05-13 10:47:34 -07:00
Vinoth Chandar
a0e62b7919 Bucketized Bloom Filter checking
- Tackles the skew seen in sort based partitioning/checking
- Parameterized the HoodieBloomIndex test
- Config to turn on/off (on by default)
- Unit tests & also tested at scale
2019-05-11 16:38:28 -07:00
David Muto (pseudomuto)
4b27cc72bb Don't raise when spark-defaults.conf doesn't exist 2019-05-08 17:30:23 -07:00
Abhishek Sharma
e2dcef8606 HUDI-101: added exclusion filters for signature files. 2019-05-07 18:35:18 -07:00
Omkar Joshi
738635306b migrating kryo's dependency from twitter chill to plain kryo library 2019-05-06 20:32:00 -07:00
Nishith Agarwal
a33a55fcb5 Caching Avro Binary encoder/decoder to avoid creating new one for every record 2019-05-06 11:28:08 -07:00
Balaji Varadarajan
ee1feb7c75 Revert "HUDI-101: added mevn-shade plugin with filters."
Creates fat jars for all hoodie packages

This reverts commit f47f0eb6cb.
2019-05-05 18:39:38 -07:00
Abhishek Sharma
f47f0eb6cb HUDI-101: added mevn-shade plugin with filters. 2019-05-03 13:49:51 -07:00
Balaji Varadarajan
978470af33 Rollback inflights when using Spark [Streaming] write 2019-05-02 12:51:02 -07:00
vinothchandar
57a8b9cc8c Making DataSource/DeltaStreamer use defaults for combining
- Addresses issue where insert will combine and remove duplicates within batch
- Setting default insert combining to false (write client default)
- Set to true if filtering duplicates on insert/bulk_insert
2019-05-01 13:21:21 -07:00
Vinoth Chandar
ea20d47248 Introduce config to control interval tree pruning
- turned on by default
- Minor code refactoring/restructuring
2019-04-29 11:38:23 -07:00
Sivabalan Narayanan
7129dc5bb7 Improving Tag location using interval trees for index files
Adding interface for index look up

Adding index filtering implementations for global bloom index too
2019-04-29 11:38:23 -07:00
Naoki Takezoe
461ce18bd1 Fix to enable hoodie.datasource.read.incr.filters 2019-04-26 11:14:06 -07:00
Nishith Agarwal
26f24b6728 Removing OLD MAGIC header since a) it's no longer used b) causes issues when the data actually has OLD MAGIC 2019-04-25 20:47:16 -07:00
Balaji Varadarajan
2f1e3e15fb Revert "Read and apply schema for each log block from the metadata header instead of the latest schema"
This reverts commit 9e7ce19b06.
2019-04-18 08:54:34 -07:00
lyogev
9ef51deb84 Add empty payload class to support deletes via apache spark 2019-04-17 23:00:20 -07:00
Balaji Varadarajan
243c58f77c Move to apachehudi dockerhub repository & use openjdk docker containers 2019-04-17 16:37:58 -07:00
Balaji Varadarajan
36ef94004e Fix Hive RT query failure in hoodie demo 2019-04-17 16:36:32 -07:00
Omkar Joshi
e35d24f31d Revert "Replacing Apache commons-lang3 object serializer with Kryo serializer"
This reverts commit a6c45feb2c.
2019-04-17 09:23:37 -07:00
Nishith Agarwal
9e7ce19b06 Read and apply schema for each log block from the metadata header instead of the latest schema 2019-04-16 17:20:03 -07:00
Bhavani Sudha Saktheeswaran
83b6aa5e91 Fix multiple issues when using build_local_docker_images for setting up the demo
Details here - https://issues.apache.org/jira/browse/HUDI-98
2019-04-15 10:10:05 -07:00
Nishith Agarwal
a8feee9293 Performing commit archiving in batches to avoid keeping a huge chunk in memory 2019-04-10 15:17:04 -07:00
Balaji Varadarajan
b07110b9fd Essential Hive packages missing in hoodie spark bundle 2019-04-09 21:42:42 -07:00
Nishith Agarwal
2577014617 1. Minor changes to fix compaction 2. Adding 2 compaction policies 2019-04-03 17:38:17 -07:00
Jing Chen
d1d33f725e [HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated folder names 2019-04-03 17:37:03 -07:00
Vinoth Chandar
b34a204a52 Fixing small file handling, inline compaction defaults
- Small file limit is now 100MB by default
- Turned on inline compaction by default for MOR
- Changes take effect on DataSource and DeltaStreamer
2019-04-03 10:56:10 -07:00
Vinoth Chandar
51f4908989 Follow up HUDI-27 : Call super.close() in HoodieWraperFileSystem::close() 2019-04-02 21:31:41 -07:00
Vinoth Chandar
5847f0c934 Fix HUDI-27 : Support num_cores > 1 for writing through spark
- Users using spark.executor.cores > 1 used to fail due to "FileSystem closed"
- This is due to HoodieWrapperFileSystem closing the wrapped filesytem obj
- FileSystem.getInternal caching code races threads and closes the extra fs instance(s)
- Bumped up num cores in tests to 8, speeds up tests by 3-4 mins
2019-03-28 15:56:21 -07:00
Vinoth Chandar
f1410bfdcd Fixes HUDI-38: Reduce memory overhead of WriteStatus
- For implicit indexes (e.g BloomIndex), don't buffer up written records
- By default, only collect 10% of failing records to avoid OOMs
- Improves debuggability via above, since data errors can now show up in collect()
- Unit tests & fixing subclasses & adjusting tests
2019-03-28 10:32:59 -07:00
Vinoth Chandar
e56c1612e4 Fixed HUDI-87 : Remove schemastr from BaseAvroPayload 2019-03-27 23:03:25 -07:00
Vinoth Chandar
372fbc4733 Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsRetained
- Added a precondition check, otherwise incr pull could miss commits
- Lowered default cleaner retention to 10, to enable simpler understanding for newbies
- Bumped down min/max instants to retain as well
2019-03-27 11:02:17 -07:00
Nishith Agarwal
3d9041e216 Fixing source schema and writer schema distinction in payloads 2019-03-26 19:44:27 -07:00
ambition119
395806fc68 [HUDI-63] Removed unused BucketedIndex code 2019-03-26 10:12:47 -07:00
Balaji Varadarajan
194d904c99 run_hive_sync tool must be able to handle case where there are multiple standalone jdbc jars in hive installation dir 2019-03-21 09:58:20 -07:00
Jing Chen
a2a052abd9 add a script that shuts down demo cluster gracefully 2019-03-19 11:01:06 -07:00
Nishith Agarwal
9e59da7fd9 Refactor HoodieTable Rollback to write one rollback instant for a batch of commits to rollback 2019-03-19 10:10:16 -07:00
Nishith Agarwal
0dd4a90b03 Enable multi/nested rollbacks for MOR table type 2019-03-19 10:10:16 -07:00
Omkar Joshi
a6c45feb2c Replacing Apache commons-lang3 object serializer with Kryo serializer 2019-03-18 14:12:25 -07:00
kaka11chen
48797b1ae1 Add compression codec configurations for HoodieParquetWriter. 2019-03-18 07:48:20 -07:00
smarthi
621f2b878d HUDI-75: Add KEYS 2019-03-18 07:46:25 -07:00
Vinoth Chandar
57bbed21de Removing docs folder from master branch
- Only asf-site branch contains the docs
- Helps streamline doc contributions
2019-03-14 18:19:30 -07:00
Balaji Varadarajan
adc8cac743 Fix hive sync (libfb version mismatch) and deltastreamer issue (missing cmdline argument) in demo 2019-03-13 16:14:32 -07:00
Bhavani Sudha Saktheeswaran
3c647a99cf Fix quickstart documentation for querying via Presto 2019-03-13 15:34:50 -07:00
Omkar Joshi
4a8bec7ea5 Handling duplicate record update for single partition (duplicates in single or different parquet files) 2019-03-10 20:15:17 -07:00
kaka11chen
b514e1ab18 Fix avro doesn't have short and byte byte. 2019-03-06 16:09:24 -08:00
Balaji Varadarajan
3ae6cb4ed5 FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly 2019-03-01 10:49:04 -08:00
Vinoth Chandar
363df2c12e Upgrade various jar, gem versions for maintenance 2019-03-01 10:14:00 -08:00