Commit Graph

268 Commits

Author SHA1 Message Date
Ron Barabash
1b61eb45e0 Adding support for optional skipping single archiving failures 2019-06-20 22:54:45 -07:00
Balaji Varadarajan
66c7fa2322 Reword confusing message and reducing the severity level 2019-06-20 22:46:09 -07:00
Balaji Varadarajan
2c40e8419e Ensure TableMetaClient and FileSystem instances have exclusive copy of Configuration 2019-06-20 14:05:00 -07:00
Balaji Varadarajan
a0d7ab2384 HUDI-70 : Making DeltaStreamer run in continuous mode with concurrent compaction 2019-06-18 17:48:14 -07:00
Balaji Varadarajan
a1483f2c5f HUDI-148 Small File selection logic for MOR must skip fileIds selected for pending compaction correctly 2019-06-17 18:35:17 -07:00
Nishith Agarwal
8e08d498c9 Reading baseCommitTime from the latest file slice as opposed to the tagged record value 2019-06-17 16:46:16 -07:00
Balaji Varadarajan
cd7623e216 All opened hoodie clients in tests need to be closed
TestMergeOnReadTable must use embedded timeline server
2019-06-13 12:30:07 -07:00
Balaji Varadarajan
136f8478a3 TestMergeOnReadTable must use embedded timeline server 2019-06-12 19:08:09 -07:00
Balaji Varadarajan
04fc86b43d Turn on embedded server for all client tests 2019-06-12 18:14:55 -07:00
Vinoth Chandar
b791473a6d Introduce HoodieReadHandle abstraction into index
- Generalized BloomIndex to work with file ids instead of paths
- Abstracted away Bloom filter checking into HoodieLookupHandle
- Abstracted away range information retrieval into HoodieRangeInfoHandle
2019-06-12 10:46:14 -07:00
Balaji Varadarajan
065173211e HUDI-147 Compaction Inflight Rollback not deleting Marker directory 2019-06-09 11:45:54 -07:00
Balaji Varadarajan
479908fd20 HUDI-125 : Change License for all source files and update RAT configurations 2019-06-09 11:41:55 -07:00
Balaji Varadarajan
30b0f2636f Changes related to Licensing work
1. Go through the dependency list once to ensure compliance. Generated the current NOTICE list in all submodules (other Apache projects such as Flink do this).
   To be on the conservative side regarding licensing, NOTICE.txt lists all dependencies, including transitive ones. Pending compliance questions are reported in https://issues.apache.org/jira/browse/LEGAL-461
2. Automate generating NOTICE.txt files so that future package compliance issues can be identified early as part of the code-review process.
3. Added NOTICE.txt and LICENSE.txt to all HUDI jars
2019-06-07 17:58:57 -07:00
guanjianhui
6b5abb5d92 fix maven pom 2019-05-29 16:16:29 -07:00
Balaji Varadarajan
d860fb18b6 HUDI-139 Compaction running twice due to duplicate "map" transformation while finalizing compaction 2019-05-29 15:12:30 -07:00
vinothchandar
66c0b81b49 [maven-release-plugin] prepare for next development iteration 2019-05-28 19:17:26 -07:00
vinothchandar
227785c022 [maven-release-plugin] prepare release hoodie-0.4.7 2019-05-28 19:17:15 -07:00
Balaji Varadarajan
33f5208c1e Only inflight commit timeline (.commit/.deltacommit) must be used when checking for sanity during compaction scheduling 2019-05-28 16:54:20 -07:00
Balaji Varadarajan
d0d2fa0337 Reduce logging in unit-test runs 2019-05-24 23:43:54 -07:00
Venkat
f2d91a455e default implementation for HBase index qps allocator (#685)
* default implementation and configs for HBase index qps allocator

* Test for QPS allocator and address CR

* fix QPS allocator test
2019-05-24 18:43:46 -07:00
Balaji Varadarajan
99b0c72aa6 HUDI-131 Zero File Listing in Compactor run 2019-05-24 18:34:14 -07:00
Vinoth Chandar
4074c5eb23 Fixed HUDI-116 : Handle duplicate record keys across partitions
- Join based on HoodieKey and not RecordKey during tagging
- Unit tests changed to run with duplicate keys
- Special casing GlobalBloom to still join by recordkey
2019-05-24 18:32:49 -07:00
Balaji Varadarajan
145034c5fa Spark Stage retry handling 2019-05-21 14:49:51 -07:00
David Muto (pseudomuto)
3fd2fd6e9d Remove redundant string from file comp rdd 2019-05-21 13:07:32 -07:00
Balaji Varadarajan
64fec64097 Timeline Service with Incremental View Syncing support 2019-05-16 13:25:33 -07:00
vinothchandar
446f99aa0f [maven-release-plugin] prepare for next development iteration 2019-05-14 07:29:22 -07:00
vinothchandar
cc38abecc8 [maven-release-plugin] prepare release hoodie-0.4.6 2019-05-14 07:29:11 -07:00
Balaji Varadarajan
9cce9abf4d Fix various errors found by long running delta-streamer tests
1. Parquet Avro schema mismatch errors during ingestion are sometimes silently ignored due to a race condition in BoundedInMemoryExecutor. This was reproducible when running a long-running delta-streamer with the wrong schema, and it caused data loss
2. Fix behavior of Delta-Streamer to error out by default if there are any error records
3. Fix a bug in tracking write errors in WriteStats. Earlier, the write errors were tracking sampled errors as opposed to total errors.
4. Delta Streamer does not commit the changes done as part of inline compaction, as auto-commit is force disabled. Fix this behavior to always auto-commit inline compaction, as it would not otherwise commit.
2019-05-13 10:47:34 -07:00
Vinoth Chandar
a0e62b7919 Bucketized Bloom Filter checking
- Tackles the skew seen in sort based partitioning/checking
- Parameterized the HoodieBloomIndex test
- Config to turn on/off (on by default)
- Unit tests & also tested at scale
2019-05-11 16:38:28 -07:00
Vinoth Chandar
ea20d47248 Introduce config to control interval tree pruning
- turned on by default
- Minor code refactoring/restructuring
2019-04-29 11:38:23 -07:00
Sivabalan Narayanan
7129dc5bb7 Improving Tag location using interval trees for index files
Adding interface for index look up

Adding index filtering implementations for global bloom index too
2019-04-29 11:38:23 -07:00
Nishith Agarwal
a8feee9293 Performing commit archiving in batches to avoid keeping a huge chunk in memory 2019-04-10 15:17:04 -07:00
Nishith Agarwal
2577014617 1. Minor changes to fix compaction 2. Adding 2 compaction policies 2019-04-03 17:38:17 -07:00
Vinoth Chandar
b34a204a52 Fixing small file handling, inline compaction defaults
- Small file limit is now 100MB by default
- Turned on inline compaction by default for MOR
- Changes take effect on DataSource and DeltaStreamer
2019-04-03 10:56:10 -07:00
Vinoth Chandar
51f4908989 Follow up HUDI-27 : Call super.close() in HoodieWrapperFileSystem::close() 2019-04-02 21:31:41 -07:00
Vinoth Chandar
5847f0c934 Fix HUDI-27 : Support num_cores > 1 for writing through spark
- Users using spark.executor.cores > 1 used to fail due to "FileSystem closed"
- This is due to HoodieWrapperFileSystem closing the wrapped filesystem obj
- FileSystem.getInternal caching code races threads and closes the extra fs instance(s)
- Bumped up num cores in tests to 8, speeds up tests by 3-4 mins
2019-03-28 15:56:21 -07:00
Vinoth Chandar
f1410bfdcd Fixes HUDI-38: Reduce memory overhead of WriteStatus
- For implicit indexes (e.g BloomIndex), don't buffer up written records
- By default, only collect 10% of failing records to avoid OOMs
- Improves debuggability via above, since data errors can now show up in collect()
- Unit tests & fixing subclasses & adjusting tests
2019-03-28 10:32:59 -07:00
Vinoth Chandar
e56c1612e4 Fixed HUDI-87 : Remove schemastr from BaseAvroPayload 2019-03-27 23:03:25 -07:00
Vinoth Chandar
372fbc4733 Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsRetained
- Added a precondition check, otherwise incr pull could miss commits
- Lowered default cleaner retention to 10, to enable simpler understanding for newbies
- Bumped down min/max instants to retain as well
2019-03-27 11:02:17 -07:00
Nishith Agarwal
3d9041e216 Fixing source schema and writer schema distinction in payloads 2019-03-26 19:44:27 -07:00
ambition119
395806fc68 [HUDI-63] Removed unused BucketedIndex code 2019-03-26 10:12:47 -07:00
Nishith Agarwal
9e59da7fd9 Refactor HoodieTable Rollback to write one rollback instant for a batch of commits to rollback 2019-03-19 10:10:16 -07:00
Nishith Agarwal
0dd4a90b03 Enable multi/nested rollbacks for MOR table type 2019-03-19 10:10:16 -07:00
kaka11chen
48797b1ae1 Add compression codec configurations for HoodieParquetWriter. 2019-03-18 07:48:20 -07:00
Omkar Joshi
4a8bec7ea5 Handling duplicate record update for single partition (duplicates in single or different parquet files) 2019-03-10 20:15:17 -07:00
Balaji Varadarajan
3ae6cb4ed5 FileSystem View must treat same fileIds present in different partitions as different file-groups and handle pending compaction correctly 2019-03-01 10:49:04 -08:00
vinothchandar
687395e40f [maven-release-plugin] prepare for next development iteration 2019-02-27 07:16:27 -08:00
vinothchandar
bbf40ef987 [maven-release-plugin] prepare release hoodie-0.4.5 2019-02-27 07:16:15 -08:00
Balaji Varadarajan
8adaca3454 Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions 2019-02-11 18:30:21 -08:00
Balaji Varadarajan
3a0044216c New Features in DeltaStreamer :
(1) Apply transformation when using delta-streamer to ingest data.
(2) Add Hudi Incremental Source for Delta Streamer
(3) Allow delta-streamer config-property to be passed as command-line
(4) Add Hive Integration to Delta-Streamer and address Review comments
(5) Ensure MultiPartKeysValueExtractor handles hive style partition description
(6) Reuse same spark session on both source and transformer
(7) Support extracting partition fields from _hoodie_partition_path for HoodieIncrSource
(8) Reuse Binary Avro coders
(9) Add push down filter for Incremental source
(10) Add Hoodie DeltaStreamer metrics to track total time taken
2019-02-11 18:22:05 -08:00