Importing Hoodie Client from internal repo

Abberved History: * 25c6991 Removed non-opensource modules * a62abf3 Removing email from pom.xml * 0931b68 Misspelt in the copyright * c1cac7d Preperation for OSS: Added License and rat plugin check. Also added meta information about the project in pom.xml * 16b07b3 Preparation of OSS - Remove hoodie specific URL from hoodie cli * fd3e0dd Small code cleanups * 8aa7e34 Adding a de-duplication command to CLI * b464842 Adding a de-duplication command to CLI * 59265b1 RegisterDataset should pass the right zkNodeName after the support for multiple databases added * b295f70 [maven-release-plugin] prepare for next development iteration * 1006e4b [maven-release-plugin] prepare release hoodie-0.2.4 * 4c99437 Move to using hdrone release 0.7.4 * 1891939 Auto tuning the buckets needed for plain inserts also - Off by default for now - Enhanced an existing unit test * b4563bd Change HoodieReadClient to use commit metadata for incremental pull * ee20183 Add full file path onto HoodieWriteStat - This will become an issue later on for incremental processing use cases - Tested with cli, that is able to read older HoodieCommitMetadata * 7dcd5d5 Address skew in cleaner work distribution * 8d7c15d Fixing bug around partial failures of rollback * d4ada1d Empty RDD should not throw java.lang.IllegalArgumentException: Positive number of slices required * 076bea9 Dont clean if there are no partitions to clean * c014f80 Minor changes to SQLStreamer * a96d4df Minor changes to SQLStreamer * bc289cc [maven-release-plugin] prepare for next development iteration * 4160107 [maven-release-plugin] prepare release hoodie-0.2.3 * 409b07a [maven-release-plugin] prepare for next development iteration * 3d71514 [maven-release-plugin] prepare release hoodie-0.2.2 * 4969d52 Fix test failures * ac62609 Implement Review Comments for: Parallelize cleaning and including cleaning time and commit archival time in commit time graphite reporting * cebe65a Parallelize cleaning and including cleaning time and 
commit archival time in commit time graphite reporting * 2e5b372 Migrating to CDH 5.7.2 * 899ae12 Remove filtering of /tmp/hive/hive paths from HoodieInputFormat. This fixes Join with temporary tables with HoodieCombineHiveInputFormat * 69a68f6 Implement equals and hashCode for HoodieTableMetadata, its used in hash based structures * 12d29c6 Update hive staging url * 1c5c88a Copy filterExists to WriteClient * 76aee67 [maven-release-plugin] prepare for next development iteration * 1f0a715 [maven-release-plugin] prepare release hoodie-0.2.1 * dbfd1d4 HoodieReadClient and HoodieWriteClient separation * c39a98b Revamped HoodieRecordPayload API that supports merging of old & new values during update * 79e5bbd Add a helper to configure SparkConf for SparkSQL on Hoodie tables * f56f423 [maven-release-plugin] prepare for next development iteration * 780fc44 [maven-release-plugin] prepare release hoodie-0.2 * 1ea2238 Modifying the git utl * b0af8dc Depending on hdrone release version * 7753693 Removing a System.out.println which got in by mistake * 1f5b019 Adding HBase Config to HoodieClientConfig * 2fce97f Implement Review comments and merge into master * f389820 Bunch of API changes * 909a856 HoodieClientConfig split up and revamp * c2ad946 Fix TestHoodieClient to not double persist in testFilterExists * 3ab0da6 Fix breaking test * 2860542 CR feedback for small inserts turned to updates * 0dfce57 Small inserts are now turned into upserts * bb1a8b3 Add filterExist API for Hoodie Records * d983c24 Implement review comments * c0bd5d1 Implement HoodieClient.checkExists() * db078f6 Pick up HoodieTable based on hoodie.properties * ad023e9 Refactor upsert() using HoodieTable interface * ee9b9b3 Refactor upsert() using HoodieTable interface * 2d6fdc9 Adding a utility to generate the percentage of updates in commit * ea3ad58 Adding additional optimizations to remove similar queries from the perf test (using levenshtein distance) * 1e443a0 Add test case for the added support for 
SchemaEvolution during updates * 1cadcbb Add more logging * 6163dfe Parquet read of old file should have the right read schema specified * 29c746a Few fixes in ReduceByKey parallelism, HoodieInputFormat.filterFiles for non-hoodie paths and more logging in upsert schema issues * 5a33af6 Fixing an issue in HoodieReader, target temp directory not created * 09a5e8e Adding more logging in HoodieReader * 1474250 Adding more logging in HoodieReader * a3b0567 Make targetDb not required in HoodieReader * e9c08b9 Setting the inputformat as the CombineHiveInputFormat in the HoodieReader * 61c75d2 Hoodie Query Performance: Add Support for CombineHiveInputFormat and implement CombineFileInputFormat * 38c6e44 Improvements to Hoodie Reader * ac7398a Add totalWriteErrors to HoodieCommitMetadata * fc0536e Change archive location to be under .hoodie * e313294 Implement Hive Perf comparison for Hoodie and non-Hoodie datasets * 17cfe2a Fix bug in HoodieInputFormat, where it filters out files from archived commits * 30de990 Add note about showpartitions command to README * 8634ffb Add commits showpartitions command to show break down per partition * 324b24e Adding a CLI command to print file size stats * 56532ff T484792. Deterministically report metrics during shutdown * 3571768 Fixes to Hoodie Cleaner. Upgrade HDrone version. Changes to HoodieReader. 
* a02c97f Bumping hdrone-api to 0.7.2 * b29ce67 Bug in RegisterDataset dataset creation * 5a15a9a Fixing bug in cleaning up partial files * dbf6669 Comment out predicate pushdown test * 44ed4d1 Merge branch 'lazyitr-fixes-1' |\ | * e913d3b Fixing bug in LazyInsertIterable | * 8a1fecd Wrapping upsert() inside HoodieUpsertException | * 39cfe39 Fixing bug in LazyInsertIterable - Return a List<WriteStatus> to handle last record in itr, belonging to a separate file - Remove insert() related code form UpsertMapFunction | * 00252e5 Making TestHoodieBloomIndex less flaky * | 6f2d417 Making TestHoodieBloomIndex less flaky * | 63ebbdc fs.mkdirs does not honor permission umask passed. Need to use the static method FileSystem.mkdirs for that. * | f49ef67 Adding more logging to Hoodie Reader * | 9f5a699 Fixing permission on the base intermediate folder created in HoodieReader |/ * 70e501f Fixing the drop table before create table in HoodieReader * 120cda8 Hoodie tools jar should not require jars in the CDH classpath to be available. Needed for HoodieReader to run in Docker. * 60b59de Adding client configurations. 
Needed to run the HoodieReader in Docker (where CDH is not installed) * fece98d Merge conflicts w/ master * 64e58b0 Auto tuning parallelism in BloomIndex & Upsert() * 930199e Fixing skew in Index join when new partition paths dont exist yet * 9a3e511 Adding subpartitioning to scale join in HoodieBloomIndex * 57512a7 Changing sort key for IndexLookup to (filename, record) to split more evenly * 3ede14c Major changes to BloomIndex & Upsert DAG * 1c4071a Implement Dataset creation if a Hoodie dataset was not already registered * 944f007 Implement Review comments * 6a5b675 Implement Review Comments * bfde3a9 Implement review comments * d195ab3 Implementing Commit Archiving * 8af656b Exception refactor - part 2 * 697a699 HoodieTableMetadata refactor and Exception refactor * 7804ca3 Adding HoodieAppendLog (fork of SequenceFile) & Initial Impl of HoodieCommitArchiveLog * 2db4931 Adjust partitionFileRDD parallelism to max(recordRDD partitions, total partitions) * 23405c5 Config name changes * 5e673ea Implementing more CLI commands * 918cfce Moving to 0.1.1-SNAPSHOT * afad497 Change the master branch to 0.2-SNAPSHOT * 832c1a7 Make sure the bloom filter reading and tagging has a parellel factor >= group by parallelism * 0a6a6d3 Prepare the v0.1 version * 72cfbe2 The snapshoter should also copy hoodie.properties file * 3b0ee45 Add one more metric * 488f1c7 Add switch for cleaning out inflight commits * a259b6f Adding textutils jar to hoodie build * 36e3118 Fix Hoodie CLI - ClassNotFound and added more logging to JDBC Incremental pull * 2c8f554 Fix Predicate pushdown during incremental pull * 888ec20 Add one more graphite metrics * a671dfc Ensure files picked for cleaning are part of some valid commit * ba5cd65 Adding cleaning based on last X commits * 7dc76d3 Organize config values by category * 9da6474 Move cleaning logic into HoodieCleaner class * 7becba9 Change the update metric name * d32b1f3 Fix some graphite issues * 365ee14 hot fix a stupid bug I made * 93eab43 Adding 
a hoodie.table.type value to hoodie.properties on init * 075c646 Add the database name to the sync * 3bae059 Adding HoodieKey as metadata field into Record * 61513fa Add stats and more cli commands * b0cb112 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * aaa1bf8 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * 3a3db73 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * c413342 Fail the job if exception during writing old records * 7304d3d Exclude javax.servlet from hive-jdbc * 3d65b50 Add the datestr <> '0000-00-00' back to the incremental sql * 0577661 HoodieIncrementalConfig not used anymore * 5338004 Fixing multiple minor issues we found during the SQLStreamer demo preperation * 0744283 Fix the Hive server and Spark Hive client mismatch by setting userClassPathFirst=true and creating a assembly jar with all hadoop related dependencies excluded * c189dc0 Kickoff hdrone sync after SQLStreamer finishing committing to target hoodie dataset * 1eb8da0 Check if the .commit file is empty * f95386a Add support for rollbacking .inflight commit in Admin CLI * 97595ea Update the record count when upserting * 49139cd Remove table config and add _SUCCESS tag * 8500a48 Catch the exception when upserting * 10bcc19 Merge branch 'sqlload' |\ | * 10fcc88 More log statements | * ca6b71d Merge with master | |\ | | * b33db25 Merge remote-tracking branch 'origin/sqlload' into sqlload | | |\ | | | * 8fca7c6 insert() takes a JavaRDD<HoodieRecord> again | | * | 63db8c6 Fix test breakage from javax.servlet pom dependency | | * | b2cff33 insert() takes a JavaRDD<HoodieRecord> again | | * | 0162930 Minor Fixes | | * | a0eb0b8 Minor Fixes | | * | 5853e7c Minor fixed to HoodieSQLStreamer | | * | 379bbed HoodieSQLStreamer improvements | | * | 22bf816 Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | | * | 4cacde6 Remove setJsonPayload() and other non-generic calls from 
HoodieRecordPayload | | * | 5f985f3 Refactor of AvroParquetIO and create proper abstraction for StorageWriter | | * | 6b90bb0 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer | | * | ff24ce8 Implementation of HoodieSQLStreamer | | * | abae08a Implementation of HoodieSQLStreamer | * | | c2d306d Fixes to HoodieSQLStreamer | | |/ | |/| | * | 70bad72 Minor Fixes | * | 8da6abf Minor Fixes | * | 6b9d16b Minor fixed to HoodieSQLStreamer | * | f76f5b8 HoodieSQLStreamer improvements | * | 5f1425e Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | * | 616e2ee Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | * | 9e77ef9 Refactor of AvroParquetIO and create proper abstraction for StorageWriter | * | 14e4812 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer | * | 3b05f04 Implementation of HoodieSQLStreamer | * | 1484c34 Implementation of HoodieSQLStreamer * | | b3b9754 Standardize UTF-8 for getBytes() calls | |/ |/| * | 8cde079 Add graphite metrics to HoodieClient * | b94afad Add testcase for the snapshot copy |/ * 8567225 T417977. 
WriteStatus for failed records * 11d7cd2 Add code to deflate the HoodieRecord after writing it to storage * 9edafb4 Add a daily snapshot job * 2962bf6 Fix the last file non-closed issue * d995b6b SizeAwareParquetWriter will now have a fixed compression ratio * 6b5f67f HoodieWrapperFileSystem should initialize the underlying filesystem with default uri * 2a607c2 Merging conflicts with master * ac9852d Auto size parquet files to just under block size based on incoming records size * 3c4c0d0 Remove client code leaks & add parallelism config for sorting * 1e51e30 Add UpsertHandle * 685ca1f Add hoodie cli * ded7f6c CR feedback incorporated * d532089 Change the return type to a RDD * 22533c1 Fix bug in cleanup logic by using TaskContext.getPartitionId() in place of unitNumber * 86532fb Implement insert() using sorting, to align file sizes easily * 0967e1c Add hook to compare old record with new incoming record * f48b048 Merge branch 'sort-based-dag' |\ | * 3614cec Rename write() -> upsert() and load() -> insert() * | 65cf631 Parquet version mismatch in HoodieInputFormat * | 160303b Formatting change * | 2c079c8 Formatting change |/ * e4eb658 Fix formatting * 025114a Add test for HoodieAvroWriteSupport * 6fd11ef Fix small bug in HoodieCommits & correct doc to reflect exclusivity of findCommitsInRange - Added simple unit test * 05659c9 Add tests around HoodieClient apis * 8d3f73e Fix some small bugs * 7f1c4bc Modify HoodieInputFormatTest to make it certain that incremental pull is only pulling the required records * 2b73ba0 Remove direct versioning in pom * dd5695f Comment change * f62eef7 Unit test for predicate pushdown * 9941dad Fixing an issue which results in unsorted commits * 5e71506 Update README * 219e103 InputFormat unit tests * 8f1c7ba Enable cobertura coverage to be run with mvn test * 01f76e3 Call out self-join limitation in README * 4284a73 Defaulting to Google Java Style and reformatting existing code * de2cbda Making sure that incremental does not send 
duplicate records * f6a3833 Implement Review comments * 1de5025 Refactor in HoodieTableMetadata, HoodieInputFormat * 549ad9a Fixing broken test schemas * fbb2190 update the unit number * 9353ba9 Change the io number to 1 for old load data * e28f0cf Add commit metadata fields to create_table.sql * d06e93d Pull avroFn & dedupeFn into a single HoodieClientHooks class * b6d387f Changes to sequence_no/commit metadata addition * 212d237 Add some benchmark results to the code * 70d7715 Add commit rollback logic * 54a4d0f Use FSUtils helper to detemine fileId * 4b672ad Core classes refactoring * f705fab Move partitionPath back into HoodieKey * 39b3ff3 Cleanup Sample job & add a detailed quickstart * 981c6f7 fix the hoodie-query-meta pom * 371ab34 Publish hoodie to uber internal artifactory * b4e83bc improvement on the bloom index tag job * 779b502 Change to use hadoop's bloom filter * cfbd9e6 Add bloom filter indexing mechanism * f519c47 Initial Implementation of storing the client metadata for hoodie queries * d5eccea Initial Implementation of storing the client metadata for hoodie queries * ef34482 Pass on the HDrone configuration profile as an argument * 5578cd3 Implement initial incremental tailing support in InputFormat and provide a seperate module for Hdrone registration to be created as a oozie trigger * b08e5ff Merge branch 'master' into AddBloomFilterWriteSupport * 20b7e8e fix a typo * 4c39407 Quick fix for the HBASE indx duplicates records issue * 6dca38f Adding code to sync to hive using hdrone * 55a1d44 Fixes to InputFormat. Created a placeholder OutputFormat. 
* beda7ed Revise the globPartitions to avoid the bad partition paths * 5d889c0 Fix a wrong config * a60fbdf First version to add load function * 4b90944 Adding detailed metadata to each commit * 4a97a6c Changes to backfill script + enabling spark event log * ada2b79 Discard records without partition path & move parquet writer to snappy * 954c933 Adding backfill script - Cleanups & additional cmd line options to job - Changed iounit logic to special case 2010-2014 again * 8b5e288 Breaking apart backfill job & single run into two classes * ebdcbea Handle partial failures in update() * 4bf6ffe Fixing an issue where file name is not present * e468bff Fix couple of issues with Hbase indexing and commit ts checks * 17da30c Changing de-dupe implementation to be a Spark reduceByKey * 248c725 removed coalescing which was put in there for testing * 1b3f929 Implement compression when storing large json strings in memory * 5bada98 Changes to accomodate task failure handling, on top of cleaner * 66f895a Clean out files generated by previous failed attempts * 9cbe370 Implementing a rudimentary cleaner & avro conversion rewrite * 3606658 Adding configs for iounits & reduce parallelism * 066c2f5 Registering the Hoodie classes with Kryo * 342eed1 Implementing a rudimentary cleaner * 0d20d1d Merge branch 'trip-test-run' |\ | * 6eafdbb Adding de-dupe step before writing/shuffling * | 34baba7 Packaging hadoop-common with the hadoop-mr InputFormat JAR |/ * d5856db Merge HoodieInputFormat with existing code. Factor out common logic into hadoop-common. 
Tune the partitions, spark executors, parquet parameters to be able to run on a single day of input data * e8885ce Introduce IOUnit to split parallelize inserts * ab1977a Pushing in a real Spark job that works off real data * 0c86645 HoodirInputFormat with TestDataSimulator * 6af483c Initial checkin for HoodieInputFormat * 99c58f2 Implementing HBase backed index * 4177529 First major chunk of Hoodie Spark Client Impl * 29fad70 Benchmark bloom filter file read performance * 18f52a4 Checking in the simulation code, measuring cost of trip's file-level updates * 885f444 Adding basic datastructures for Client, key & record. * 72e7b4d Initial commit
2016-12-16 14:34:42 -08:00
parent 0512da094b
commit 81874a8406
69 changed files with 10464 additions and 11 deletions
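Note on the test file in the diff below: testCheckUUIDsAgainstOneFile relies on a two-step lookup. A bloom filter may report a key that was never written to the file (a false positive), but it never misses a key that was added, so every filter hit still has to be confirmed against the keys actually stored in the file. The following is a minimal sketch of that idea only. BloomSketch, checkCandidates, the FNV-1a second hash, and the sizes are hypothetical names and choices made for illustration; they are not Hoodie's BloomFilter or HoodieBloomIndexCheckFunction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; NOT Hoodie's BloomFilter or HoodieBloomIndexCheckFunction.
public class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String key) {
        int hash = 0x811C9DC5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;
        }
        return hash;
    }

    // i-th bit position via double hashing: h1 + i * h2 (mod numBits).
    private int position(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), numBits);
    }

    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // False means "definitely absent"; true means "possibly present".
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Step 1: skip keys the filter rules out. Step 2: confirm the rest
    // against the keys actually stored in the file.
    public static List<String> checkCandidates(List<String> candidates,
                                               BloomSketch filter,
                                               Set<String> keysInFile) {
        List<String> found = new ArrayList<>();
        for (String key : candidates) {
            if (filter.mightContain(key) && keysInFile.contains(key)) {
                found.add(key);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The file holds k1 and k2; the filter additionally holds k3.
        Set<String> keysInFile = new HashSet<>(Arrays.asList("k1", "k2"));
        BloomSketch filter = new BloomSketch(1024, 3);
        for (String key : Arrays.asList("k1", "k2", "k3")) {
            filter.add(key);
        }
        List<String> found =
            checkCandidates(Arrays.asList("k1", "k2", "k3", "k4"), filter, keysInFile);
        System.out.println(found); // prints [k1, k2]
    }
}
```

The setup mirrors the test: two keys are in the file, a third is only in the filter, and a fourth is in neither, so only the first two survive the confirmation step whether or not the filter produces a false positive.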
--- a/hoodie-client/src/test/java/com/uber/hoodie/index/TestHoodieBloomIndex.java
+++ b/hoodie-client/src/test/java/com/uber/hoodie/index/TestHoodieBloomIndex.java
@@ -0,0 +1,439 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.google.common.base.Optional;
+import com.google.common.collect.Lists;
+
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.avro.HoodieAvroWriteSupport;
+import com.uber.hoodie.common.BloomFilter;
+import com.uber.hoodie.common.TestRawTripPayload;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.model.HoodieRecord;
+import com.uber.hoodie.common.model.HoodieTableMetadata;
+import com.uber.hoodie.common.model.HoodieTestUtils;
+import com.uber.hoodie.common.util.FSUtils;
+import com.uber.hoodie.common.util.HoodieAvroUtils;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+import scala.Tuple2;
+
+import java.io.File;
+import java.io.IOException;
+import java.text.SimpleDateFormat;
+import java.util.*;
+
+import static org.junit.Assert.*;
+
+public class TestHoodieBloomIndex {
+    private JavaSparkContext jsc = null;
+    private String basePath = null;
+    private transient final FileSystem fs;
+
+    public TestHoodieBloomIndex() throws Exception {
+        fs = FSUtils.getFs();
+    }
+
+    @Before
+    public void init() throws IOException {
+        // Initialize a local spark env
+        SparkConf sparkConf = new SparkConf().setAppName("TestHoodieBloomIndex").setMaster("local[4]");
+        jsc = new JavaSparkContext(sparkConf);
+        // Create a temp folder as the base path
+        TemporaryFolder folder = new TemporaryFolder();
+        folder.create();
+        basePath = folder.getRoot().getAbsolutePath();
+        HoodieTestUtils.initializeHoodieDirectory(basePath);
+    }
+
+    @Test
+    public void testLoadUUIDsInMemory() throws IOException {
+        // Create an RDD of four hoodie records spanning two partitions
+        String recordStr1 = "{\"_row_key\":\"1eb5b87a-1feh-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+        String recordStr2 = "{\"_row_key\":\"2eb5b87b-1feu-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+        String recordStr3 = "{\"_row_key\":\"3eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+        String recordStr4 = "{\"_row_key\":\"4eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2015-01-31T03:16:41.415Z\",\"number\":32}";
+
+        String schemaStr =
+            IOUtils.toString(getClass().getResourceAsStream("/exampleSchema.txt"), "UTF-8");
+        TestRawTripPayload rowChange1 = new TestRawTripPayload(recordStr1);
+        HoodieRecord record1 = new HoodieRecord(new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath()), rowChange1);
+        TestRawTripPayload rowChange2 = new TestRawTripPayload(recordStr2);
+        HoodieRecord record2 = new HoodieRecord(new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath()), rowChange2);
+        TestRawTripPayload rowChange3 = new TestRawTripPayload(recordStr3);
+        HoodieRecord record3 = new HoodieRecord(new HoodieKey(rowChange3.getRowKey(), rowChange3.getPartitionPath()), rowChange3);
+        TestRawTripPayload rowChange4 = new TestRawTripPayload(recordStr4);
+        HoodieRecord record4 = new HoodieRecord(new HoodieKey(rowChange4.getRowKey(), rowChange4.getPartitionPath()), rowChange4);
+
+        JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record1, record2, record3, record4));
+
+        // Load to memory
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+        HoodieBloomIndex index = new HoodieBloomIndex(config, jsc);
+        Map<String, Iterable<String>> map = index.getPartitionToRowKeys(recordRDD);
+        assertEquals(2, map.size());
+        List<String> list1 = Lists.newArrayList(map.get("2016/01/31"));
+        List<String> list2 = Lists.newArrayList(map.get("2015/01/31"));
+        assertEquals(3, list1.size());
+        assertEquals(1, list2.size());
+    }
+
+    @Test
+    public void testLoadInvolvedFiles() throws IOException {
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+        HoodieBloomIndex index = new HoodieBloomIndex(config, jsc);
+
+        // Create some partitions, and put some files
+        // "2016/01/21": 0 files
+        // "2016/04/01": 1 file (2_0_20160401010101.parquet)
+        // "2015/03/12": 3 files (1_0_20150312101010.parquet, 3_0_20150312101010.parquet, 4_0_20150312101010.parquet)
+        new File(basePath + "/2016/01/21").mkdirs();
+        new File(basePath + "/2016/04/01").mkdirs();
+        new File(basePath + "/2015/03/12").mkdirs();
+        new File(basePath + "/2016/04/01/2_0_20160401010101.parquet").createNewFile();
+        new File(basePath + "/2015/03/12/1_0_20150312101010.parquet").createNewFile();
+        new File(basePath + "/2015/03/12/3_0_20150312101010.parquet").createNewFile();
+        new File(basePath + "/2015/03/12/4_0_20150312101010.parquet").createNewFile();
+        List<String> partitions = Arrays.asList("2016/01/21", "2016/04/01", "2015/03/12");
+        HoodieTableMetadata metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        JavaPairRDD<String, String> rdd = index.loadInvolvedFiles(partitions, metadata);
+        // Still 0, as there is no valid commit yet
+        assertEquals(0, rdd.count());
+
+        // Add some commits
+        new File(basePath + "/.hoodie").mkdirs();
+        new File(basePath + "/.hoodie/20160401010101.commit").createNewFile();
+        new File(basePath + "/.hoodie/20150312101010.commit").createNewFile();
+        metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        rdd = index.loadInvolvedFiles(partitions, metadata);
+        final List<Tuple2<String, String>> filesList = rdd.collect();
+        assertEquals(4, filesList.size());
+
+        // The result is no longer sorted, so compare the files as sets.
+        Set<String> actualFiles = new HashSet<String>(){{
+            add(filesList.get(0)._1 + "/" + filesList.get(0)._2);
+            add(filesList.get(1)._1 + "/" + filesList.get(1)._2);
+            add(filesList.get(2)._1 + "/" + filesList.get(2)._2);
+            add(filesList.get(3)._1 + "/" + filesList.get(3)._2);
+        }};
+
+        Set<String> expected = new HashSet<String>() {{
+            add("2016/04/01/2_0_20160401010101.parquet");
+            add("2015/03/12/1_0_20150312101010.parquet");
+            add("2015/03/12/3_0_20150312101010.parquet");
+            add("2015/03/12/4_0_20150312101010.parquet");
+        }};
+        assertEquals(expected, actualFiles);
+    }
+
+    @Test
+    public void testCheckUUIDsAgainstOneFile() throws IOException, InterruptedException, ClassNotFoundException {
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+        HoodieBloomIndex index = new HoodieBloomIndex(config, jsc);
+
+        String schemaStr =
+            IOUtils.toString(getClass().getResourceAsStream("/exampleSchema.txt"), "UTF-8");
+        Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(schemaStr));
+        // Create some records to use
+        String recordStr1 = "{\"_row_key\":\"1eb5b87a-1feh-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+        String recordStr2 = "{\"_row_key\":\"2eb5b87b-1feu-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+        String recordStr3 = "{\"_row_key\":\"3eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+        String recordStr4 = "{\"_row_key\":\"4eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":32}";
+        TestRawTripPayload rowChange1 = new TestRawTripPayload(recordStr1);
+        HoodieRecord record1 = new HoodieRecord(new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath()), rowChange1);
+        TestRawTripPayload rowChange2 = new TestRawTripPayload(recordStr2);
+        HoodieRecord record2 = new HoodieRecord(new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath()), rowChange2);
+        TestRawTripPayload rowChange3 = new TestRawTripPayload(recordStr3);
+        HoodieRecord record3 = new HoodieRecord(new HoodieKey(rowChange3.getRowKey(), rowChange3.getPartitionPath()), rowChange3);
+        TestRawTripPayload rowChange4 = new TestRawTripPayload(recordStr4);
+        HoodieRecord record4 = new HoodieRecord(new HoodieKey(rowChange4.getRowKey(), rowChange4.getPartitionPath()), rowChange4);
+
+
+        // We write record1, record2 to a parquet file, but the bloom filter contains (record1, record2, record3).
+        BloomFilter filter = new BloomFilter(10000, 0.0000001);
+        filter.add(record3.getRecordKey());
+        String filename = writeParquetFile("2016/01/31", Arrays.asList(record1, record2), schema, filter, true);
+
+        // The bloom filter contains 3 records
+        assertTrue(filter.mightContain(record1.getRecordKey()));
+        assertTrue(filter.mightContain(record2.getRecordKey()));
+        assertTrue(filter.mightContain(record3.getRecordKey()));
+        assertFalse(filter.mightContain(record4.getRecordKey()));
+
+        // Compare with file
+        List<String> uuids = Arrays.asList(record1.getRecordKey(), record2.getRecordKey(),
+                record3.getRecordKey(), record4.getRecordKey());
+
+        List<String> results = HoodieBloomIndexCheckFunction.checkCandidatesAgainstFile(uuids,
+                new Path(basePath + "/2016/01/31/" + filename));
+        assertEquals(2, results.size());
+        assertTrue(results.get(0).equals("1eb5b87a-1feh-4edd-87b4-6ec96dc405a0")
+                || results.get(1).equals("1eb5b87a-1feh-4edd-87b4-6ec96dc405a0"));
+        assertTrue(results.get(0).equals("2eb5b87b-1feu-4edd-87b4-6ec96dc405a0")
+                || results.get(1).equals("2eb5b87b-1feu-4edd-87b4-6ec96dc405a0"));
+        // TODO(vc): Need more coverage on actual filenames
+        //assertTrue(results.get(0)._2().equals(filename));
+        //assertTrue(results.get(1)._2().equals(filename));
+    }
+
+    @Test
+    public void testTagLocationWithEmptyRDD() throws Exception {
+        // Tag an empty RDD: there are no records to look up
+        JavaRDD<HoodieRecord> recordRDD = jsc.emptyRDD();
+        // Also create the metadata and config
+        HoodieTableMetadata metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+
+        // Let's tag
+        HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, jsc);
+
+        try {
+            bloomIndex.tagLocation(recordRDD, metadata);
+        } catch (IllegalArgumentException e) {
+            fail("EmptyRDD should not result in IllegalArgumentException: Positive number of slices required");
+        }
+    }
+
+
+    @Test
+    public void testTagLocation() throws Exception {
+        // We have some records to be tagged (two different partitions)
+        String schemaStr =
+            IOUtils.toString(getClass().getResourceAsStream("/exampleSchema.txt"), "UTF-8");
+        Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(schemaStr));
+
+        String recordStr1 = "{\"_row_key\":\"1eb5b87a-1feh-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+        String recordStr2 = "{\"_row_key\":\"2eb5b87b-1feu-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+        String recordStr3 = "{\"_row_key\":\"3eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+        String recordStr4 = "{\"_row_key\":\"4eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2015-01-31T03:16:41.415Z\",\"number\":32}";
+        TestRawTripPayload rowChange1 = new TestRawTripPayload(recordStr1);
+        HoodieRecord record1 = new HoodieRecord(new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath()), rowChange1);
+        TestRawTripPayload rowChange2 = new TestRawTripPayload(recordStr2);
+        HoodieRecord record2 = new HoodieRecord(new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath()), rowChange2);
+        TestRawTripPayload rowChange3 = new TestRawTripPayload(recordStr3);
+        HoodieRecord record3 = new HoodieRecord(new HoodieKey(rowChange3.getRowKey(), rowChange3.getPartitionPath()), rowChange3);
+        TestRawTripPayload rowChange4 = new TestRawTripPayload(recordStr4);
+        HoodieRecord record4 = new HoodieRecord(new HoodieKey(rowChange4.getRowKey(), rowChange4.getPartitionPath()), rowChange4);
+        JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record1, record2, record3, record4));
+
+        // Also create the metadata and config
+        HoodieTableMetadata metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+
+        // Let's tag
+        HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, jsc);
+        JavaRDD<HoodieRecord> taggedRecordRDD = bloomIndex.tagLocation(recordRDD, metadata);
+
+        // Should not find any files
+        for (HoodieRecord record : taggedRecordRDD.collect()) {
+            assertFalse(record.isCurrentLocationKnown());
+        }
+
+        // We create three parquet files, each having one record (across two different partitions)
+        String filename1 = writeParquetFile("2016/01/31", Arrays.asList(record1), schema, null, true);
+        String filename2 = writeParquetFile("2016/01/31", Arrays.asList(record2), schema, null, true);
+        String filename3 = writeParquetFile("2015/01/31", Arrays.asList(record4), schema, null, true);
+
+        // We do the tag again
+        metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        taggedRecordRDD = bloomIndex.tagLocation(recordRDD, metadata);
+
+        // Check results
+        for (HoodieRecord record : taggedRecordRDD.collect()) {
+            if (record.getRecordKey().equals("1eb5b87a-1feh-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record.getCurrentLocation().getFileId().equals(FSUtils.getFileId(filename1)));
+            } else if (record.getRecordKey().equals("2eb5b87b-1feu-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record.getCurrentLocation().getFileId().equals(FSUtils.getFileId(filename2)));
+            } else if (record.getRecordKey().equals("3eb5b87c-1fej-4edd-87b4-6ec96dc405a0")) {
+                assertFalse(record.isCurrentLocationKnown());
+            } else if (record.getRecordKey().equals("4eb5b87c-1fej-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record.getCurrentLocation().getFileId().equals(FSUtils.getFileId(filename3)));
+            }
+        }
+    }
+
+    @Test
+    public void testCheckExists() throws Exception {
+        // We have some records whose keys will be looked up (two different partitions)
+        String schemaStr =
+            IOUtils.toString(getClass().getResourceAsStream("/exampleSchema.txt"), "UTF-8");
+        Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(schemaStr));
+
+        String recordStr1 = "{\"_row_key\":\"1eb5b87a-1feh-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+        String recordStr2 = "{\"_row_key\":\"2eb5b87b-1feu-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+        String recordStr3 = "{\"_row_key\":\"3eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+        String recordStr4 = "{\"_row_key\":\"4eb5b87c-1fej-4edd-87b4-6ec96dc405a0\",\"time\":\"2015-01-31T03:16:41.415Z\",\"number\":32}";
+        TestRawTripPayload rowChange1 = new TestRawTripPayload(recordStr1);
+        HoodieKey key1 = new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath());
+        HoodieRecord record1 = new HoodieRecord(key1, rowChange1);
+        TestRawTripPayload rowChange2 = new TestRawTripPayload(recordStr2);
+        HoodieKey key2 = new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath());
+        HoodieRecord record2 = new HoodieRecord(key2, rowChange2);
+        TestRawTripPayload rowChange3 = new TestRawTripPayload(recordStr3);
+        HoodieKey key3 = new HoodieKey(rowChange3.getRowKey(), rowChange3.getPartitionPath());
+        HoodieRecord record3 = new HoodieRecord(key3, rowChange3);
+        TestRawTripPayload rowChange4 = new TestRawTripPayload(recordStr4);
+        HoodieKey key4 = new HoodieKey(rowChange4.getRowKey(), rowChange4.getPartitionPath());
+        HoodieRecord record4 = new HoodieRecord(key4, rowChange4);
+        JavaRDD<HoodieKey> keysRDD = jsc.parallelize(Arrays.asList(key1, key2, key3, key4));
+
+        // Also create the metadata and config
+        HoodieTableMetadata metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+
+        // Look up the file location for each key
+        HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, jsc);
+        JavaPairRDD<HoodieKey, Optional<String>> taggedRecordRDD = bloomIndex.fetchRecordLocation(keysRDD, metadata);
+
+        // Should not find any files
+        for (Tuple2<HoodieKey, Optional<String>> record : taggedRecordRDD.collect()) {
+            assertFalse(record._2.isPresent());
+        }
+
+        // We create three parquet files, each having one record (across two different partitions)
+        String filename1 = writeParquetFile("2016/01/31", Arrays.asList(record1), schema, null, true);
+        String filename2 = writeParquetFile("2016/01/31", Arrays.asList(record2), schema, null, true);
+        String filename3 = writeParquetFile("2015/01/31", Arrays.asList(record4), schema, null, true);
+
+        // We look up the locations again
+        metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        taggedRecordRDD = bloomIndex.fetchRecordLocation(keysRDD, metadata);
+
+        // Check results
+        for (Tuple2<HoodieKey, Optional<String>> record : taggedRecordRDD.collect()) {
+            if (record._1.getRecordKey().equals("1eb5b87a-1feh-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record._2.isPresent());
+                Path path1 = new Path(record._2.get());
+                assertEquals(FSUtils.getFileId(filename1), FSUtils.getFileId(path1.getName()));
+            } else if (record._1.getRecordKey().equals("2eb5b87b-1feu-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record._2.isPresent());
+                Path path2 = new Path(record._2.get());
+                assertEquals(FSUtils.getFileId(filename2), FSUtils.getFileId(path2.getName()));
+            } else if (record._1.getRecordKey().equals("3eb5b87c-1fej-4edd-87b4-6ec96dc405a0")) {
+                assertFalse(record._2.isPresent());
+            } else if (record._1.getRecordKey().equals("4eb5b87c-1fej-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record._2.isPresent());
+                Path path3 = new Path(record._2.get());
+                assertEquals(FSUtils.getFileId(filename3), FSUtils.getFileId(path3.getName()));
+            }
+        }
+    }
+
+
+    @Test
+    public void testBloomFilterFalseError() throws IOException, InterruptedException {
+        // We have two hoodie records
+        String recordStr1 = "{\"_row_key\":\"1eb5b87a-1feh-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+        String recordStr2 = "{\"_row_key\":\"2eb5b87b-1feu-4edd-87b4-6ec96dc405a0\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+
+        // We write record1 to a parquet file, using a bloom filter having both records
+        String schemaStr =
+            IOUtils.toString(getClass().getResourceAsStream("/exampleSchema.txt"), "UTF-8");
+        Schema schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(schemaStr));
+
+        TestRawTripPayload rowChange1 = new TestRawTripPayload(recordStr1);
+        HoodieRecord record1 = new HoodieRecord(new HoodieKey(rowChange1.getRowKey(), rowChange1.getPartitionPath()), rowChange1);
+        TestRawTripPayload rowChange2 = new TestRawTripPayload(recordStr2);
+        HoodieRecord record2 = new HoodieRecord(new HoodieKey(rowChange2.getRowKey(), rowChange2.getPartitionPath()), rowChange2);
+
+        BloomFilter filter = new BloomFilter(10000, 0.0000001);
+        filter.add(record2.getRecordKey());
+        String filename = writeParquetFile("2016/01/31", Arrays.asList(record1), schema, filter, true);
+        assertTrue(filter.mightContain(record1.getRecordKey()));
+        assertTrue(filter.mightContain(record2.getRecordKey()));
+
+        // We do the tag
+        JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record1, record2));
+        HoodieTableMetadata metadata = new HoodieTableMetadata(fs, basePath, "testTable");
+        HoodieWriteConfig config = HoodieWriteConfig.newBuilder().withPath(basePath).build();
+        HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, jsc);
+        JavaRDD<HoodieRecord> taggedRecordRDD = bloomIndex.tagLocation(recordRDD, metadata);
+
+        // Check results
+        for (HoodieRecord record : taggedRecordRDD.collect()) {
+            if (record.getRecordKey().equals("1eb5b87a-1feh-4edd-87b4-6ec96dc405a0")) {
+                assertTrue(record.getCurrentLocation().getFileId().equals(FSUtils.getFileId(filename)));
+            } else if (record.getRecordKey().equals("2eb5b87b-1feu-4edd-87b4-6ec96dc405a0")) {
+                assertFalse(record.isCurrentLocationKnown());
+            }
+        }
+    }
+
+    private String writeParquetFile(String partitionPath, List<HoodieRecord> records, Schema schema,
+                                    BloomFilter filter, boolean createCommitTime) throws IOException, InterruptedException {
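+        // Commit times have one-second resolution (yyyyMMddHHmmss); sleep so each file gets a distinct commit time
+        Thread.sleep(1000);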
+        String commitTime = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
+        String fileId = UUID.randomUUID().toString();
+        String filename = FSUtils.makeDataFileName(commitTime, 1, fileId);
+
+        return writeParquetFile(partitionPath, filename, records, schema, filter, createCommitTime);
+    }
+
+    private String writeParquetFile(String partitionPath, String filename, List<HoodieRecord> records, Schema schema,
+                                    BloomFilter filter, boolean createCommitTime) throws IOException {
+        if (filter == null) {
+            filter = new BloomFilter(10000, 0.0000001);
+        }
+        HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(new AvroSchemaConverter().convert(schema), schema, filter);
+        ParquetWriter writer = new ParquetWriter(new Path(basePath + "/" + partitionPath + "/" + filename),
+                writeSupport, CompressionCodecName.GZIP, 120 * 1024 * 1024, ParquetWriter.DEFAULT_PAGE_SIZE);
+        int seqId = 1;
+        String commitTime = FSUtils.getCommitTime(filename);
+        for (HoodieRecord record : records) {
+            GenericRecord avroRecord = (GenericRecord) record.getData().getInsertValue(schema);
+            HoodieAvroUtils.addCommitMetadataToRecord(avroRecord, commitTime, "" + seqId++);
+            HoodieAvroUtils.addHoodieKeyToRecord(avroRecord, record.getRecordKey(), record.getPartitionPath(), filename);
+            writer.write(avroRecord);
+            filter.add(record.getRecordKey());
+        }
+        writer.close();
+
+        if (createCommitTime) {
+            // Also make sure the commit is valid
+            new File(basePath + "/" + HoodieTableMetadata.METAFOLDER_NAME).mkdirs();
+            new File(basePath + "/" + HoodieTableMetadata.METAFOLDER_NAME + "/" + commitTime + ".commit").createNewFile();
+        }
+        return filename;
+    }
+
+    @After
+    public void clean() {
+        if (jsc != null) {
+            jsc.stop();
+        }
+        if (basePath != null) {
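+            // File.delete() fails on a non-empty directory, so remove the test folder recursively
+            org.apache.commons.io.FileUtils.deleteQuietly(new File(basePath));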
+        }
+    }
+}