Balaji Varadarajan
a7e6cf5197
Support nested types for recordKey, partitionPath and combineKey
2019-05-18 07:14:58 -07:00
Vinoth Chandar
e43efa042f
Downgrading fasterxml jackson to 2.6.7 to be spark compatible
2019-05-16 13:53:54 -07:00
Balaji Varadarajan
64fec64097
Timeline Service with Incremental View Syncing support
2019-05-16 13:25:33 -07:00
vinothchandar
446f99aa0f
[maven-release-plugin] prepare for next development iteration
2019-05-14 07:29:22 -07:00
vinothchandar
cc38abecc8
[maven-release-plugin] prepare release hoodie-0.4.6
2019-05-14 07:29:11 -07:00
Vinoth Chandar
7002ca6775
Update release notes for 0.4.6 release
2019-05-14 05:16:58 -07:00
Balaji Varadarajan
6e1e626357
Minor CLI documentation change in delta-streamer
2019-05-14 04:05:47 -07:00
Nishith Agarwal
af46078a82
Converting map task memory from MB to bytes
2019-05-13 21:23:30 -07:00
Balaji Varadarajan
9cce9abf4d
Fix various errors found by long running delta-streamer tests
1. Parquet Avro schema mismatch errors during ingestion were sometimes silently ignored due to a race condition in BoundedInMemoryExecutor. This was reproducible when running a long-running delta-streamer with the wrong schema, and it caused data loss.
2. Fix behavior of Delta-Streamer to error out by default if there are any error records
3. Fix a bug in tracking write errors in WriteStats. Previously, the write error count reflected only the sampled errors rather than the total errors.
4. Delta Streamer did not commit the changes made by inline compaction because auto-commit is forcibly disabled. Fix this to always auto-commit inline compaction, since it would not otherwise be committed.
2019-05-13 10:47:34 -07:00
Vinoth Chandar
a0e62b7919
Bucketized Bloom Filter checking
- Tackles the skew seen in sort based partitioning/checking
- Parameterized the HoodieBloomIndex test
- Config to turn on/off (on by default)
- Unit tests & also tested at scale
2019-05-11 16:38:28 -07:00
David Muto (pseudomuto)
4b27cc72bb
Don't raise when spark-defaults.conf doesn't exist
2019-05-08 17:30:23 -07:00
Abhishek Sharma
e2dcef8606
HUDI-101: added exclusion filters for signature files.
2019-05-07 18:35:18 -07:00
Omkar Joshi
738635306b
migrating kryo's dependency from twitter chill to plain kryo library
2019-05-06 20:32:00 -07:00
Nishith Agarwal
a33a55fcb5
Caching Avro Binary encoder/decoder to avoid creating new one for every record
2019-05-06 11:28:08 -07:00
Balaji Varadarajan
ee1feb7c75
Revert "HUDI-101: added maven-shade plugin with filters."
Creates fat jars for all hoodie packages
This reverts commit f47f0eb6cb.
2019-05-05 18:39:38 -07:00
Abhishek Sharma
f47f0eb6cb
HUDI-101: added maven-shade plugin with filters.
2019-05-03 13:49:51 -07:00
Balaji Varadarajan
978470af33
Rollback inflights when using Spark [Streaming] write
2019-05-02 12:51:02 -07:00
vinothchandar
57a8b9cc8c
Making DataSource/DeltaStreamer use defaults for combining
- Addresses issue where insert will combine and remove duplicates within batch
- Setting default insert combining to false (write client default)
- Set to true if filtering duplicates on insert/bulk_insert
2019-05-01 13:21:21 -07:00
Vinoth Chandar
ea20d47248
Introduce config to control interval tree pruning
- turned on by default
- Minor code refactoring/restructuring
2019-04-29 11:38:23 -07:00
Sivabalan Narayanan
7129dc5bb7
Improving Tag location using interval trees for index files
Adding interface for index look up
Adding index filtering implementations for global bloom index too
2019-04-29 11:38:23 -07:00
Naoki Takezoe
461ce18bd1
Fix to enable hoodie.datasource.read.incr.filters
2019-04-26 11:14:06 -07:00
Nishith Agarwal
26f24b6728
Removing OLD MAGIC header since a) it's no longer used and b) it causes issues when the data actually contains OLD MAGIC
2019-04-25 20:47:16 -07:00
Balaji Varadarajan
2f1e3e15fb
Revert "Read and apply schema for each log block from the metadata header instead of the latest schema"
This reverts commit 9e7ce19b06.
2019-04-18 08:54:34 -07:00
lyogev
9ef51deb84
Add empty payload class to support deletes via apache spark
2019-04-17 23:00:20 -07:00
Balaji Varadarajan
243c58f77c
Move to apachehudi dockerhub repository & use openjdk docker containers
2019-04-17 16:37:58 -07:00
Balaji Varadarajan
36ef94004e
Fix Hive RT query failure in hoodie demo
2019-04-17 16:36:32 -07:00
Omkar Joshi
e35d24f31d
Revert "Replacing Apache commons-lang3 object serializer with Kryo serializer"
This reverts commit a6c45feb2c.
2019-04-17 09:23:37 -07:00
Nishith Agarwal
9e7ce19b06
Read and apply schema for each log block from the metadata header instead of the latest schema
2019-04-16 17:20:03 -07:00
Bhavani Sudha Saktheeswaran
83b6aa5e91
Fix multiple issues when using build_local_docker_images for setting up the demo
Details here - https://issues.apache.org/jira/browse/HUDI-98
2019-04-15 10:10:05 -07:00
Nishith Agarwal
a8feee9293
Performing commit archiving in batches to avoid keeping a huge chunk in memory
2019-04-10 15:17:04 -07:00
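A minimal sketch of the batching idea named in the commit above, not the actual Hudi implementation: items are consumed in fixed-size chunks so that only one chunk is held in memory at a time. The names `batched`, `archive_in_batches`, `write_batch`, and `batch_size` are hypothetical and chosen only for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def archive_in_batches(
    instants: Iterable[str],
    write_batch: Callable[[List[str]], None],
    batch_size: int = 10,
) -> int:
    """Hand instants to write_batch one chunk at a time; return the total count.

    write_batch is a hypothetical sink standing in for whatever persists
    a chunk of archived commits.
    """
    total = 0
    for batch in batched(instants, batch_size):
        write_batch(batch)
        total += len(batch)
    return total
```

Because `instants` may be a generator, the full list of commits never has to be materialized at once.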
Balaji Varadarajan
b07110b9fd
Essential Hive packages missing in hoodie spark bundle
2019-04-09 21:42:42 -07:00
Nishith Agarwal
2577014617
Minor changes to fix compaction; adding 2 compaction policies
2019-04-03 17:38:17 -07:00
Jing Chen
d1d33f725e
[HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated folder names
2019-04-03 17:37:03 -07:00
Vinoth Chandar
b34a204a52
Fixing small file handling, inline compaction defaults
- Small file limit is now 100MB by default
- Turned on inline compaction by default for MOR
- Changes take effect on DataSource and DeltaStreamer
2019-04-03 10:56:10 -07:00
Vinoth Chandar
51f4908989
Follow up HUDI-27 : Call super.close() in HoodieWrapperFileSystem::close()
2019-04-02 21:31:41 -07:00
Vinoth Chandar
5847f0c934
Fix HUDI-27 : Support num_cores > 1 for writing through spark
- Jobs running with spark.executor.cores > 1 used to fail with "FileSystem closed"
- This was due to HoodieWrapperFileSystem closing the wrapped filesystem object
- FileSystem.getInternal caching code races threads and closes the extra fs instance(s)
- Bumped up num cores in tests to 8, speeds up tests by 3-4 mins
2019-03-28 15:56:21 -07:00
Vinoth Chandar
f1410bfdcd
Fixes HUDI-38: Reduce memory overhead of WriteStatus
- For implicit indexes (e.g. BloomIndex), don't buffer up written records
- By default, only collect 10% of failing records to avoid OOMs
- Improves debuggability via above, since data errors can now show up in collect()
- Unit tests & fixing subclasses & adjusting tests
2019-03-28 10:32:59 -07:00
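A minimal sketch of the sampling approach described in the commit above, not the actual WriteStatus code: every failure is counted, but only a configurable fraction of failing records is retained, which bounds memory use. The class name `SampledWriteStatus`, its fields, and the `seed` parameter are hypothetical; only the 10% default comes from the commit message.

```python
import random
from typing import List, Optional


class SampledWriteStatus:
    """Counts every failure but retains only a sampled subset of failed records."""

    def __init__(self, failure_fraction: float = 0.1, seed: Optional[int] = None):
        if not 0.0 <= failure_fraction <= 1.0:
            raise ValueError("failure_fraction must be between 0 and 1")
        self.failure_fraction = failure_fraction
        self.total_errors = 0                 # exact count of all failures
        self.failed_records: List[str] = []   # sampled subset kept for debugging
        self._rng = random.Random(seed)

    def mark_failure(self, record_key: str) -> None:
        """Record a failure; keep the record itself only with the sampling probability."""
        self.total_errors += 1
        if self._rng.random() < self.failure_fraction:
            self.failed_records.append(record_key)
```

Keeping `total_errors` separate from the retained sample means the error count stays exact even though most failing records are discarded.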
Vinoth Chandar
e56c1612e4
Fixed HUDI-87 : Remove schemastr from BaseAvroPayload
2019-03-27 23:03:25 -07:00
Vinoth Chandar
372fbc4733
Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsRetained
- Added a precondition check, otherwise incr pull could miss commits
- Lowered default cleaner retention to 10, to enable simpler understanding for newbies
- Bumped down min/max instants to retain as well
2019-03-27 11:02:17 -07:00
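A minimal sketch of the precondition named in the commit above, assuming both settings are plain integers. The function name `check_archival_config` and the error message are hypothetical and do not reflect the actual Hudi code.

```python
def check_archival_config(min_instants_to_keep: int, cleaner_commits_retained: int) -> None:
    """Raise ValueError unless min_instants_to_keep > cleaner_commits_retained.

    The commit message gives the reason: without this check, incremental
    pull could miss commits.
    """
    if min_instants_to_keep <= cleaner_commits_retained:
        raise ValueError(
            "minInstantsToKeep (%d) must be greater than cleanerCommitsRetained (%d)"
            % (min_instants_to_keep, cleaner_commits_retained)
        )
```

Failing at configuration time surfaces the misconfiguration before any write runs.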
Nishith Agarwal
3d9041e216
Fixing source schema and writer schema distinction in payloads
2019-03-26 19:44:27 -07:00
ambition119
395806fc68
[HUDI-63] Removed unused BucketedIndex code
2019-03-26 10:12:47 -07:00
Balaji Varadarajan
194d904c99
run_hive_sync tool must be able to handle case where there are multiple standalone jdbc jars in hive installation dir
2019-03-21 09:58:20 -07:00
Jing Chen
a2a052abd9
add a script that shuts down demo cluster gracefully
2019-03-19 11:01:06 -07:00
Nishith Agarwal
9e59da7fd9
Refactor HoodieTable Rollback to write one rollback instant for a batch of commits to rollback
2019-03-19 10:10:16 -07:00
Nishith Agarwal
0dd4a90b03
Enable multi/nested rollbacks for MOR table type
2019-03-19 10:10:16 -07:00
Omkar Joshi
a6c45feb2c
Replacing Apache commons-lang3 object serializer with Kryo serializer
2019-03-18 14:12:25 -07:00
kaka11chen
48797b1ae1
Add compression codec configurations for HoodieParquetWriter.
2019-03-18 07:48:20 -07:00
smarthi
621f2b878d
HUDI-75: Add KEYS
2019-03-18 07:46:25 -07:00
Vinoth Chandar
57bbed21de
Removing docs folder from master branch
- Only asf-site branch contains the docs
- Helps streamline doc contributions
2019-03-14 18:19:30 -07:00
Balaji Varadarajan
adc8cac743
Fix hive sync (libfb version mismatch) and deltastreamer issue (missing cmdline argument) in demo
2019-03-13 16:14:32 -07:00