Importing Hoodie Client from internal repo

Abbreviated History: * 25c6991 Removed non-opensource modules * a62abf3 Removing email from pom.xml * 0931b68 Misspelt in the copyright * c1cac7d Preperation for OSS: Added License and rat plugin check. Also added meta information about the project in pom.xml * 16b07b3 Preparation of OSS - Remove hoodie specific URL from hoodie cli * fd3e0dd Small code cleanups * 8aa7e34 Adding a de-duplication command to CLI * b464842 Adding a de-duplication command to CLI * 59265b1 RegisterDataset should pass the right zkNodeName after the support for multiple databases added * b295f70 [maven-release-plugin] prepare for next development iteration * 1006e4b [maven-release-plugin] prepare release hoodie-0.2.4 * 4c99437 Move to using hdrone release 0.7.4 * 1891939 Auto tuning the buckets needed for plain inserts also - Off by default for now - Enhanced an existing unit test * b4563bd Change HoodieReadClient to use commit metadata for incremental pull * ee20183 Add full file path onto HoodieWriteStat - This will become an issue later on for incremental processing use cases - Tested with cli, that is able to read older HoodieCommitMetadata * 7dcd5d5 Address skew in cleaner work distribution * 8d7c15d Fixing bug around partial failures of rollback * d4ada1d Empty RDD should not throw java.lang.IllegalArgumentException: Positive number of slices required * 076bea9 Dont clean if there are no partitions to clean * c014f80 Minor changes to SQLStreamer * a96d4df Minor changes to SQLStreamer * bc289cc [maven-release-plugin] prepare for next development iteration * 4160107 [maven-release-plugin] prepare release hoodie-0.2.3 * 409b07a [maven-release-plugin] prepare for next development iteration * 3d71514 [maven-release-plugin] prepare release hoodie-0.2.2 * 4969d52 Fix test failures * ac62609 Implement Review Comments for: Parallelize cleaning and including cleaning time and commit archival time in commit time graphite reporting * cebe65a Parallelize cleaning and including cleaning time and
commit archival time in commit time graphite reporting * 2e5b372 Migrating to CDH 5.7.2 * 899ae12 Remove filtering of /tmp/hive/hive paths from HoodieInputFormat. This fixes Join with temporary tables with HoodieCombineHiveInputFormat * 69a68f6 Implement equals and hashCode for HoodieTableMetadata, its used in hash based structures * 12d29c6 Update hive staging url * 1c5c88a Copy filterExists to WriteClient * 76aee67 [maven-release-plugin] prepare for next development iteration * 1f0a715 [maven-release-plugin] prepare release hoodie-0.2.1 * dbfd1d4 HoodieReadClient and HoodieWriteClient separation * c39a98b Revamped HoodieRecordPayload API that supports merging of old & new values during update * 79e5bbd Add a helper to configure SparkConf for SparkSQL on Hoodie tables * f56f423 [maven-release-plugin] prepare for next development iteration * 780fc44 [maven-release-plugin] prepare release hoodie-0.2 * 1ea2238 Modifying the git utl * b0af8dc Depending on hdrone release version * 7753693 Removing a System.out.println which got in by mistake * 1f5b019 Adding HBase Config to HoodieClientConfig * 2fce97f Implement Review comments and merge into master * f389820 Bunch of API changes * 909a856 HoodieClientConfig split up and revamp * c2ad946 Fix TestHoodieClient to not double persist in testFilterExists * 3ab0da6 Fix breaking test * 2860542 CR feedback for small inserts turned to updates * 0dfce57 Small inserts are now turned into upserts * bb1a8b3 Add filterExist API for Hoodie Records * d983c24 Implement review comments * c0bd5d1 Implement HoodieClient.checkExists() * db078f6 Pick up HoodieTable based on hoodie.properties * ad023e9 Refactor upsert() using HoodieTable interface * ee9b9b3 Refactor upsert() using HoodieTable interface * 2d6fdc9 Adding a utility to generate the percentage of updates in commit * ea3ad58 Adding additional optimizations to remove similar queries from the perf test (using levenshtein distance) * 1e443a0 Add test case for the added support for 
SchemaEvolution during updates * 1cadcbb Add more logging * 6163dfe Parquet read of old file should have the right read schema specified * 29c746a Few fixes in ReduceByKey parallelism, HoodieInputFormat.filterFiles for non-hoodie paths and more logging in upsert schema issues * 5a33af6 Fixing an issue in HoodieReader, target temp directory not created * 09a5e8e Adding more logging in HoodieReader * 1474250 Adding more logging in HoodieReader * a3b0567 Make targetDb not required in HoodieReader * e9c08b9 Setting the inputformat as the CombineHiveInputFormat in the HoodieReader * 61c75d2 Hoodie Query Performance: Add Support for CombineHiveInputFormat and implement CombineFileInputFormat * 38c6e44 Improvements to Hoodie Reader * ac7398a Add totalWriteErrors to HoodieCommitMetadata * fc0536e Change archive location to be under .hoodie * e313294 Implement Hive Perf comparison for Hoodie and non-Hoodie datasets * 17cfe2a Fix bug in HoodieInputFormat, where it filters out files from archived commits * 30de990 Add note about showpartitions command to README * 8634ffb Add commits showpartitions command to show break down per partition * 324b24e Adding a CLI command to print file size stats * 56532ff T484792. Deterministically report metrics during shutdown * 3571768 Fixes to Hoodie Cleaner. Upgrade HDrone version. Changes to HoodieReader. 
* a02c97f Bumping hdrone-api to 0.7.2 * b29ce67 Bug in RegisterDataset dataset creation * 5a15a9a Fixing bug in cleaning up partial files * dbf6669 Comment out predicate pushdown test * 44ed4d1 Merge branch 'lazyitr-fixes-1' |\ | * e913d3b Fixing bug in LazyInsertIterable | * 8a1fecd Wrapping upsert() inside HoodieUpsertException | * 39cfe39 Fixing bug in LazyInsertIterable - Return a List<WriteStatus> to handle last record in itr, belonging to a separate file - Remove insert() related code form UpsertMapFunction | * 00252e5 Making TestHoodieBloomIndex less flaky * | 6f2d417 Making TestHoodieBloomIndex less flaky * | 63ebbdc fs.mkdirs does not honor permission umask passed. Need to use the static method FileSystem.mkdirs for that. * | f49ef67 Adding more logging to Hoodie Reader * | 9f5a699 Fixing permission on the base intermediate folder created in HoodieReader |/ * 70e501f Fixing the drop table before create table in HoodieReader * 120cda8 Hoodie tools jar should not require jars in the CDH classpath to be available. Needed for HoodieReader to run in Docker. * 60b59de Adding client configurations. 
Needed to run the HoodieReader in Docker (where CDH is not installed) * fece98d Merge conflicts w/ master * 64e58b0 Auto tuning parallelism in BloomIndex & Upsert() * 930199e Fixing skew in Index join when new partition paths dont exist yet * 9a3e511 Adding subpartitioning to scale join in HoodieBloomIndex * 57512a7 Changing sort key for IndexLookup to (filename, record) to split more evenly * 3ede14c Major changes to BloomIndex & Upsert DAG * 1c4071a Implement Dataset creation if a Hoodie dataset was not already registered * 944f007 Implement Review comments * 6a5b675 Implement Review Comments * bfde3a9 Implement review comments * d195ab3 Implementing Commit Archiving * 8af656b Exception refactor - part 2 * 697a699 HoodieTableMetadata refactor and Exception refactor * 7804ca3 Adding HoodieAppendLog (fork of SequenceFile) & Initial Impl of HoodieCommitArchiveLog * 2db4931 Adjust partitionFileRDD parallelism to max(recordRDD partitions, total partitions) * 23405c5 Config name changes * 5e673ea Implementing more CLI commands * 918cfce Moving to 0.1.1-SNAPSHOT * afad497 Change the master branch to 0.2-SNAPSHOT * 832c1a7 Make sure the bloom filter reading and tagging has a parellel factor >= group by parallelism * 0a6a6d3 Prepare the v0.1 version * 72cfbe2 The snapshoter should also copy hoodie.properties file * 3b0ee45 Add one more metric * 488f1c7 Add switch for cleaning out inflight commits * a259b6f Adding textutils jar to hoodie build * 36e3118 Fix Hoodie CLI - ClassNotFound and added more logging to JDBC Incremental pull * 2c8f554 Fix Predicate pushdown during incremental pull * 888ec20 Add one more graphite metrics * a671dfc Ensure files picked for cleaning are part of some valid commit * ba5cd65 Adding cleaning based on last X commits * 7dc76d3 Organize config values by category * 9da6474 Move cleaning logic into HoodieCleaner class * 7becba9 Change the update metric name * d32b1f3 Fix some graphite issues * 365ee14 hot fix a stupid bug I made * 93eab43 Adding 
a hoodie.table.type value to hoodie.properties on init * 075c646 Add the database name to the sync * 3bae059 Adding HoodieKey as metadata field into Record * 61513fa Add stats and more cli commands * b0cb112 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * aaa1bf8 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * 3a3db73 New Hoodie CLI Framework. Implement CLI function parity with the current CLI * c413342 Fail the job if exception during writing old records * 7304d3d Exclude javax.servlet from hive-jdbc * 3d65b50 Add the datestr <> '0000-00-00' back to the incremental sql * 0577661 HoodieIncrementalConfig not used anymore * 5338004 Fixing multiple minor issues we found during the SQLStreamer demo preperation * 0744283 Fix the Hive server and Spark Hive client mismatch by setting userClassPathFirst=true and creating a assembly jar with all hadoop related dependencies excluded * c189dc0 Kickoff hdrone sync after SQLStreamer finishing committing to target hoodie dataset * 1eb8da0 Check if the .commit file is empty * f95386a Add support for rollbacking .inflight commit in Admin CLI * 97595ea Update the record count when upserting * 49139cd Remove table config and add _SUCCESS tag * 8500a48 Catch the exception when upserting * 10bcc19 Merge branch 'sqlload' |\ | * 10fcc88 More log statements | * ca6b71d Merge with master | |\ | | * b33db25 Merge remote-tracking branch 'origin/sqlload' into sqlload | | |\ | | | * 8fca7c6 insert() takes a JavaRDD<HoodieRecord> again | | * | 63db8c6 Fix test breakage from javax.servlet pom dependency | | * | b2cff33 insert() takes a JavaRDD<HoodieRecord> again | | * | 0162930 Minor Fixes | | * | a0eb0b8 Minor Fixes | | * | 5853e7c Minor fixed to HoodieSQLStreamer | | * | 379bbed HoodieSQLStreamer improvements | | * | 22bf816 Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | | * | 4cacde6 Remove setJsonPayload() and other non-generic calls from 
HoodieRecordPayload | | * | 5f985f3 Refactor of AvroParquetIO and create proper abstraction for StorageWriter | | * | 6b90bb0 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer | | * | ff24ce8 Implementation of HoodieSQLStreamer | | * | abae08a Implementation of HoodieSQLStreamer | * | | c2d306d Fixes to HoodieSQLStreamer | | |/ | |/| | * | 70bad72 Minor Fixes | * | 8da6abf Minor Fixes | * | 6b9d16b Minor fixed to HoodieSQLStreamer | * | f76f5b8 HoodieSQLStreamer improvements | * | 5f1425e Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | * | 616e2ee Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload | * | 9e77ef9 Refactor of AvroParquetIO and create proper abstraction for StorageWriter | * | 14e4812 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer | * | 3b05f04 Implementation of HoodieSQLStreamer | * | 1484c34 Implementation of HoodieSQLStreamer * | | b3b9754 Standardize UTF-8 for getBytes() calls | |/ |/| * | 8cde079 Add graphite metrics to HoodieClient * | b94afad Add testcase for the snapshot copy |/ * 8567225 T417977. 
WriteStatus for failed records * 11d7cd2 Add code to deflate the HoodieRecord after writing it to storage * 9edafb4 Add a daily snapshot job * 2962bf6 Fix the last file non-closed issue * d995b6b SizeAwareParquetWriter will now have a fixed compression ratio * 6b5f67f HoodieWrapperFileSystem should initialize the underlying filesystem with default uri * 2a607c2 Merging conflicts with master * ac9852d Auto size parquet files to just under block size based on incoming records size * 3c4c0d0 Remove client code leaks & add parallelism config for sorting * 1e51e30 Add UpsertHandle * 685ca1f Add hoodie cli * ded7f6c CR feedback incorporated * d532089 Change the return type to a RDD * 22533c1 Fix bug in cleanup logic by using TaskContext.getPartitionId() in place of unitNumber * 86532fb Implement insert() using sorting, to align file sizes easily * 0967e1c Add hook to compare old record with new incoming record * f48b048 Merge branch 'sort-based-dag' |\ | * 3614cec Rename write() -> upsert() and load() -> insert() * | 65cf631 Parquet version mismatch in HoodieInputFormat * | 160303b Formatting change * | 2c079c8 Formatting change |/ * e4eb658 Fix formatting * 025114a Add test for HoodieAvroWriteSupport * 6fd11ef Fix small bug in HoodieCommits & correct doc to reflect exclusivity of findCommitsInRange - Added simple unit test * 05659c9 Add tests around HoodieClient apis * 8d3f73e Fix some small bugs * 7f1c4bc Modify HoodieInputFormatTest to make it certain that incremental pull is only pulling the required records * 2b73ba0 Remove direct versioning in pom * dd5695f Comment change * f62eef7 Unit test for predicate pushdown * 9941dad Fixing an issue which results in unsorted commits * 5e71506 Update README * 219e103 InputFormat unit tests * 8f1c7ba Enable cobertura coverage to be run with mvn test * 01f76e3 Call out self-join limitation in README * 4284a73 Defaulting to Google Java Style and reformatting existing code * de2cbda Making sure that incremental does not send 
duplicate records * f6a3833 Implement Review comments * 1de5025 Refactor in HoodieTableMetadata, HoodieInputFormat * 549ad9a Fixing broken test schemas * fbb2190 update the unit number * 9353ba9 Change the io number to 1 for old load data * e28f0cf Add commit metadata fields to create_table.sql * d06e93d Pull avroFn & dedupeFn into a single HoodieClientHooks class * b6d387f Changes to sequence_no/commit metadata addition * 212d237 Add some benchmark results to the code * 70d7715 Add commit rollback logic * 54a4d0f Use FSUtils helper to detemine fileId * 4b672ad Core classes refactoring * f705fab Move partitionPath back into HoodieKey * 39b3ff3 Cleanup Sample job & add a detailed quickstart * 981c6f7 fix the hoodie-query-meta pom * 371ab34 Publish hoodie to uber internal artifactory * b4e83bc improvement on the bloom index tag job * 779b502 Change to use hadoop's bloom filter * cfbd9e6 Add bloom filter indexing mechanism * f519c47 Initial Implementation of storing the client metadata for hoodie queries * d5eccea Initial Implementation of storing the client metadata for hoodie queries * ef34482 Pass on the HDrone configuration profile as an argument * 5578cd3 Implement initial incremental tailing support in InputFormat and provide a seperate module for Hdrone registration to be created as a oozie trigger * b08e5ff Merge branch 'master' into AddBloomFilterWriteSupport * 20b7e8e fix a typo * 4c39407 Quick fix for the HBASE indx duplicates records issue * 6dca38f Adding code to sync to hive using hdrone * 55a1d44 Fixes to InputFormat. Created a placeholder OutputFormat. 
* beda7ed Revise the globPartitions to avoid the bad partition paths * 5d889c0 Fix a wrong config * a60fbdf First version to add load function * 4b90944 Adding detailed metadata to each commit * 4a97a6c Changes to backfill script + enabling spark event log * ada2b79 Discard records without partition path & move parquet writer to snappy * 954c933 Adding backfill script - Cleanups & additional cmd line options to job - Changed iounit logic to special case 2010-2014 again * 8b5e288 Breaking apart backfill job & single run into two classes * ebdcbea Handle partial failures in update() * 4bf6ffe Fixing an issue where file name is not present * e468bff Fix couple of issues with Hbase indexing and commit ts checks * 17da30c Changing de-dupe implementation to be a Spark reduceByKey * 248c725 removed coalescing which was put in there for testing * 1b3f929 Implement compression when storing large json strings in memory * 5bada98 Changes to accomodate task failure handling, on top of cleaner * 66f895a Clean out files generated by previous failed attempts * 9cbe370 Implementing a rudimentary cleaner & avro conversion rewrite * 3606658 Adding configs for iounits & reduce parallelism * 066c2f5 Registering the Hoodie classes with Kryo * 342eed1 Implementing a rudimentary cleaner * 0d20d1d Merge branch 'trip-test-run' |\ | * 6eafdbb Adding de-dupe step before writing/shuffling * | 34baba7 Packaging hadoop-common with the hadoop-mr InputFormat JAR |/ * d5856db Merge HoodieInputFormat with existing code. Factor out common logic into hadoop-common. 
Tune the partitions, spark executors, parquet parameters to be able to run on a single day of input data * e8885ce Introduce IOUnit to split parallelize inserts * ab1977a Pushing in a real Spark job that works off real data * 0c86645 HoodirInputFormat with TestDataSimulator * 6af483c Initial checkin for HoodieInputFormat * 99c58f2 Implementing HBase backed index * 4177529 First major chunk of Hoodie Spark Client Impl * 29fad70 Benchmark bloom filter file read performance * 18f52a4 Checking in the simulation code, measuring cost of trip's file-level updates * 885f444 Adding basic datastructures for Client, key & record. * 72e7b4d Initial commit
2016-12-16 14:34:42 -08:00
parent 0512da094b
commit 81874a8406
69 changed files with 10464 additions and 11 deletions
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/HBaseIndex.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/HBaseIndex.java
@@ -0,0 +1,229 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.google.common.base.Optional;
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.WriteStatus;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.model.HoodieRecordLocation;
+import com.uber.hoodie.common.model.HoodieRecordPayload;
+import com.uber.hoodie.common.model.HoodieTableMetadata;
+import com.uber.hoodie.common.model.HoodieRecord;
+
+import com.uber.hoodie.config.HoodieIndexConfig;
+import com.uber.hoodie.exception.HoodieDependentSystemUnavailableException;
+import com.uber.hoodie.exception.HoodieIndexException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.*;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function2;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+/**
+ * Hoodie Index implementation backed by HBase
+ */
+public class HBaseIndex<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+    private static final byte[] SYSTEM_COLUMN_FAMILY = Bytes.toBytes("_s");
+    private static final byte[] COMMIT_TS_COLUMN = Bytes.toBytes("commit_ts");
+    private static final byte[] FILE_NAME_COLUMN = Bytes.toBytes("file_name");
+    private static final byte[] PARTITION_PATH_COLUMN = Bytes.toBytes("partition_path");
+
+    private static final Logger logger = LogManager.getLogger(HBaseIndex.class);
+
+    private final String tableName;
+
+    public HBaseIndex(HoodieWriteConfig config, JavaSparkContext jsc) {
+        super(config, jsc);
+        this.tableName = config.getProps().getProperty(HoodieIndexConfig.HBASE_TABLENAME_PROP);
+    }
+
+    @Override
+    public JavaPairRDD<HoodieKey, Optional<String>> fetchRecordLocation(
+        JavaRDD<HoodieKey> hoodieKeys, HoodieTableMetadata metadata) {
+        throw new UnsupportedOperationException("HBase index does not implement check exist yet");
+    }
+
+    private static Connection hbaseConnection = null;
+
+    private Connection getHBaseConnection() {
+        Configuration hbaseConfig = HBaseConfiguration.create();
+        String quorum = config.getProps().getProperty(HoodieIndexConfig.HBASE_ZKQUORUM_PROP);
+        hbaseConfig.set("hbase.zookeeper.quorum", quorum);
+        String port = config.getProps().getProperty(HoodieIndexConfig.HBASE_ZKPORT_PROP);
+        hbaseConfig.set("hbase.zookeeper.property.clientPort", port);
+        try {
+            return ConnectionFactory.createConnection(hbaseConfig);
+        } catch (IOException e) {
+            throw new HoodieDependentSystemUnavailableException(
+                HoodieDependentSystemUnavailableException.HBASE, quorum + ":" + port);
+        }
+    }
+
+    /**
+     * Function that tags each HoodieRecord with an existing location, if known.
+     */
+    class LocationTagFunction
+            implements Function2<Integer, Iterator<HoodieRecord<T>>, Iterator<HoodieRecord<T>>> {
+
+        private final HoodieTableMetadata metadata;
+
+        LocationTagFunction(HoodieTableMetadata metadata) {
+            this.metadata = metadata;
+        }
+
+        @Override
+        public Iterator<HoodieRecord<T>> call(Integer partitionNum,
+                                           Iterator<HoodieRecord<T>> hoodieRecordIterator) {
+            // Grab the global HBase connection
+            synchronized (HBaseIndex.class) {
+                if (hbaseConnection == null) {
+                    hbaseConnection = getHBaseConnection();
+                }
+            }
+            List<HoodieRecord<T>> taggedRecords = new ArrayList<>();
+            HTable hTable = null;
+            try {
+                hTable = (HTable) hbaseConnection.getTable(TableName.valueOf(tableName));
+                // Do the tagging.
+                while (hoodieRecordIterator.hasNext()) {
+                    HoodieRecord<T> rec = hoodieRecordIterator.next();
+                    // TODO(vc): This may need to be a multi get.
+                    Result result = hTable.get(
+                            new Get(Bytes.toBytes(rec.getRecordKey())).setMaxVersions(1)
+                                    .addColumn(SYSTEM_COLUMN_FAMILY, COMMIT_TS_COLUMN)
+                                    .addColumn(SYSTEM_COLUMN_FAMILY, FILE_NAME_COLUMN)
+                                    .addColumn(SYSTEM_COLUMN_FAMILY, PARTITION_PATH_COLUMN));
+
+                    // first, attempt to grab location from HBase
+                    if (result.getRow() != null) {
+                        String commitTs =
+                                Bytes.toString(result.getValue(SYSTEM_COLUMN_FAMILY, COMMIT_TS_COLUMN));
+                        String fileId =
+                                Bytes.toString(result.getValue(SYSTEM_COLUMN_FAMILY, FILE_NAME_COLUMN));
+
+                        // if the last commit ts for this row is less than the system commit ts
+                        if (!metadata.isCommitsEmpty() && metadata.isCommitTsSafe(commitTs)) {
+                            rec.setCurrentLocation(new HoodieRecordLocation(commitTs, fileId));
+                        }
+                    }
+                    taggedRecords.add(rec);
+                }
+            } catch (IOException e) {
+                throw new HoodieIndexException(
+                    "Failed to Tag indexed locations because of exception with HBase Client", e);
+            }
+
+            finally {
+                if (hTable != null) {
+                    try {
+                        hTable.close();
+                    } catch (IOException e) {
+                        // Ignore
+                    }
+                }
+
+            }
+            return taggedRecords.iterator();
+        }
+    }
+
+    @Override
+    public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+                                             HoodieTableMetadata metadata) {
+        return recordRDD.mapPartitionsWithIndex(this.new LocationTagFunction(metadata), true);
+    }
+
+    class UpdateLocationTask implements Function2<Integer, Iterator<WriteStatus>, Iterator<WriteStatus>> {
+        @Override
+        public Iterator<WriteStatus> call(Integer partition, Iterator<WriteStatus> statusIterator) {
+
+            List<WriteStatus> writeStatusList = new ArrayList<>();
+            // Grab the global HBase connection
+            synchronized (HBaseIndex.class) {
+                if (hbaseConnection == null) {
+                    hbaseConnection = getHBaseConnection();
+                }
+            }
+            HTable hTable = null;
+            try {
+                hTable = (HTable) hbaseConnection.getTable(TableName.valueOf(tableName));
+                while (statusIterator.hasNext()) {
+                    WriteStatus writeStatus = statusIterator.next();
+                    List<Put> puts = new ArrayList<>();
+                    try {
+                        for (HoodieRecord rec : writeStatus.getWrittenRecords()) {
+                            if (!writeStatus.isErrored(rec.getKey())) {
+                                Put put = new Put(Bytes.toBytes(rec.getRecordKey()));
+                                HoodieRecordLocation loc = rec.getNewLocation();
+                                put.addColumn(SYSTEM_COLUMN_FAMILY, COMMIT_TS_COLUMN,
+                                    Bytes.toBytes(loc.getCommitTime()));
+                                put.addColumn(SYSTEM_COLUMN_FAMILY, FILE_NAME_COLUMN,
+                                    Bytes.toBytes(loc.getFileId()));
+                                put.addColumn(SYSTEM_COLUMN_FAMILY, PARTITION_PATH_COLUMN,
+                                    Bytes.toBytes(rec.getPartitionPath()));
+                                puts.add(put);
+                            }
+                        }
+                        hTable.put(puts);
+                        hTable.flushCommits();
+                    } catch (Exception e) {
+                        Exception we = new Exception("Error updating index for " + writeStatus, e);
+                        logger.error(we.getMessage(), we);
+                        writeStatus.setGlobalError(we);
+                    }
+                    writeStatusList.add(writeStatus);
+                }
+            } catch (IOException e) {
+                throw new HoodieIndexException(
+                    "Failed to Update Index locations because of exception with HBase Client", e);
+            } finally {
+                if (hTable != null) {
+                    try {
+                        hTable.close();
+                    } catch (IOException e) {
+                        // Ignore
+                    }
+                }
+            }
+            return writeStatusList.iterator();
+        }
+    }
+
+    @Override
+    public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD,
+                                               HoodieTableMetadata metadata) {
+        return writeStatusRDD.mapPartitionsWithIndex(new UpdateLocationTask(), true);
+    }
+
+    @Override
+    public boolean rollbackCommit(String commitTime) {
+        // TODO (weiy)
+        return true;
+    }
+}
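Review note on the `TODO(vc)` in `LocationTagFunction`: the tagging loop issues one `hTable.get(...)` per record. A multi-get would instead group record keys into batches and pass each batch to HBase's `Table.get(List<Get>)`. The sketch below shows only the batching step, using the plain JDK so it runs without an HBase cluster. The class `MultiGetBatchingSketch`, the method `toBatches`, and the batch size are hypothetical names chosen for illustration; none of them are part of this commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper (not part of this commit): groups an iterator into fixed-size batches. */
final class MultiGetBatchingSketch {

    private MultiGetBatchingSketch() {
    }

    /**
     * Drains {@code source} into consecutive lists of at most {@code batchSize} elements.
     * In HBaseIndex, each batch of record keys could be turned into a List<Get> and sent
     * in one round trip; that wiring is an assumption and is not shown here.
     */
    static <E> List<List<E>> toBatches(Iterator<E> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
        }
        List<List<E>> batches = new ArrayList<>();
        List<E> current = new ArrayList<>(batchSize);
        while (source.hasNext()) {
            current.add(source.next());
            if (current.size() == batchSize) {
                // Batch is full: hand it off and start a new one.
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            // Keep the final, possibly smaller, batch.
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> recordKeys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        for (List<String> batch : toBatches(recordKeys.iterator(), 2)) {
            System.out.println(batch);
        }
    }
}
```

The helper materializes every batch in memory, which matches the existing code's choice to collect all tagged records into a list before returning an iterator. A streaming variant would be needed if partitions are too large for that.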
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieBloomIndex.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieBloomIndex.java
@@ -0,0 +1,422 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Optional;
+
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.WriteStatus;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.model.HoodieRecord;
+import com.uber.hoodie.common.model.HoodieRecordLocation;
+import com.uber.hoodie.common.model.HoodieRecordPayload;
+import com.uber.hoodie.common.model.HoodieTableMetadata;
+import com.uber.hoodie.common.util.FSUtils;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFlatMapFunction;
+import org.apache.spark.api.java.function.PairFunction;
+
+import scala.Tuple2;
+import java.util.*;
+
+/**
+ * Indexing mechanism based on bloom filters. Each parquet file stores a bloom filter of its
+ * row keys in its metadata.
+ */
+public class HoodieBloomIndex<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+
+    private static final Logger logger = LogManager.getLogger(HoodieBloomIndex.class);
+
+    // we need to limit the join such that it stays within 1.5GB per Spark partition. (SPARK-1476)
+    private static final int SPARK_MAXIMUM_BYTES_PER_PARTITION = 1500 * 1024 * 1024;
+    // this is how much a triplet of (partitionPath, fileId, recordKey) costs.
+    private static final int BYTES_PER_PARTITION_FILE_KEY_TRIPLET = 300;
+    private static int MAX_ITEMS_PER_JOIN_PARTITION = SPARK_MAXIMUM_BYTES_PER_PARTITION / BYTES_PER_PARTITION_FILE_KEY_TRIPLET;
+
+    public HoodieBloomIndex(HoodieWriteConfig config, JavaSparkContext jsc) {
+        super(config, jsc);
+    }
+
+    /**
+     * Tags each incoming record with its existing file location, if the bloom index finds one.
+     */
+    @Override
+    public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD, final HoodieTableMetadata metadata) {
+
+        // Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+        JavaPairRDD<String, String> partitionRecordKeyPairRDD = recordRDD
+                .mapToPair(new PairFunction<HoodieRecord<T>, String, String>() {
+                    @Override
+                    public Tuple2<String, String> call(HoodieRecord<T> record) throws Exception {
+                        return new Tuple2<>(record.getPartitionPath(), record.getRecordKey());
+                    }
+                });
+
+        // Step 2: Look up the index for every (partitionPath, recordKey) pair
+        JavaPairRDD<String, String> rowKeyFilenamePairRDD =
+            lookupIndex(partitionRecordKeyPairRDD, metadata);
+
+        // Step 3: Cache the result, for subsequent stages.
+        rowKeyFilenamePairRDD.cache();
+        long totalTaggedRecords = rowKeyFilenamePairRDD.count();
+        logger.info("Number of update records (ones tagged with a fileID): " + totalTaggedRecords);
+
+
+        // Step 4: Tag the incoming records, as inserts or updates, by joining with existing record keys
+        // Cost: 4 sec.
+        return tagLocationBacktoRecords(rowKeyFilenamePairRDD, recordRDD);
+    }
+
+    public JavaPairRDD<HoodieKey, Optional<String>> fetchRecordLocation(
+        JavaRDD<HoodieKey> hoodieKeys, final HoodieTableMetadata metadata) {
+        JavaPairRDD<String, String> partitionRecordKeyPairRDD =
+            hoodieKeys.mapToPair(new PairFunction<HoodieKey, String, String>() {
+                @Override
+                public Tuple2<String, String> call(HoodieKey key) throws Exception {
+                    return new Tuple2<>(key.getPartitionPath(), key.getRecordKey());
+                }
+            });
+
+        // Look up the index for all (partitionPath, recordKey) pairs
+        JavaPairRDD<String, String> rowKeyFilenamePairRDD =
+            lookupIndex(partitionRecordKeyPairRDD, metadata);
+
+        JavaPairRDD<String, HoodieKey> rowKeyHoodieKeyPairRDD =
+            hoodieKeys.mapToPair(new PairFunction<HoodieKey, String, HoodieKey>() {
+                @Override
+                public Tuple2<String, HoodieKey> call(HoodieKey key) throws Exception {
+                    return new Tuple2<>(key.getRecordKey(), key);
+                }
+            });
+
+        return rowKeyHoodieKeyPairRDD.leftOuterJoin(rowKeyFilenamePairRDD).mapToPair(
+            new PairFunction<Tuple2<String, Tuple2<HoodieKey, Optional<String>>>, HoodieKey, Optional<String>>() {
+                @Override
+                public Tuple2<HoodieKey, Optional<String>> call(
+                    Tuple2<String, Tuple2<HoodieKey, Optional<String>>> keyPathTuple)
+                    throws Exception {
+                    Optional<String> recordLocationPath;
+                    if (keyPathTuple._2._2.isPresent()) {
+                        String fileName = keyPathTuple._2._2.get();
+                        String partitionPath = keyPathTuple._2._1.getPartitionPath();
+                        recordLocationPath = Optional
+                            .of(new Path(new Path(metadata.getBasePath(), partitionPath), fileName)
+                                .toUri().getPath());
+                    } else {
+                        recordLocationPath = Optional.absent();
+                    }
+                    return new Tuple2<>(keyPathTuple._2._1, recordLocationPath);
+                }
+            });
+    }
+
+    /**
+     * Look up the location of each record key and return a pair <record_key, location> for every
+     * record key that is already present; record keys that are not present are dropped.
+     *
+     * @param partitionRecordKeyPairRDD
+     * @param metadata
+     * @return
+     */
+    private JavaPairRDD<String, String> lookupIndex(
+        JavaPairRDD<String, String> partitionRecordKeyPairRDD, final HoodieTableMetadata metadata) {
+        // Obtain records per partition, in the incoming records
+        Map<String, Object> recordsPerPartition = partitionRecordKeyPairRDD.countByKey();
+        List<String> affectedPartitionPathList = new ArrayList<>(recordsPerPartition.keySet());
+
+        // Step 2: Load all involved files as <Partition, filename> pairs
+        JavaPairRDD<String, String> partitionFilePairRDD =
+            loadInvolvedFiles(affectedPartitionPathList, metadata);
+        Map<String, Object> filesPerPartition = partitionFilePairRDD.countByKey();
+
+        // Compute total subpartitions, to split partitions into.
+        Map<String, Long> subpartitionCountMap =
+            computeSubPartitions(recordsPerPartition, filesPerPartition);
+
+        // Step 3: Obtain an RDD that maps each incoming record key that already exists to the file that contains it.
+        return findMatchingFilesForRecordKeys(partitionFilePairRDD, partitionRecordKeyPairRDD,
+            subpartitionCountMap);
+    }
+
+    /**
+     * The index lookup can be skewed in three dimensions: #files, #partitions, #records.
+     *
+     * To handle skews smoothly, we need to compute how to split each partition
+     * into subpartitions. We do it here, in a way that keeps the size of each Spark join
+     * partition under 2GB.
+     *
+     * @param recordsPerPartition
+     * @param filesPerPartition
+     * @return
+     */
+    private Map<String, Long> computeSubPartitions(Map<String, Object> recordsPerPartition, Map<String, Object> filesPerPartition) {
+        Map<String, Long> subpartitionCountMap = new HashMap<>();
+        long totalRecords = 0;
+        long totalFiles = 0;
+
+        for (String partitionPath : recordsPerPartition.keySet()) {
+            long numRecords = (Long) recordsPerPartition.get(partitionPath);
+            long numFiles = filesPerPartition.containsKey(partitionPath) ? (Long) filesPerPartition.get(partitionPath) : 1L;
+            subpartitionCountMap.put(partitionPath, ((numFiles * numRecords) / MAX_ITEMS_PER_JOIN_PARTITION) + 1);
+
+            totalFiles += filesPerPartition.containsKey(partitionPath) ? (Long) filesPerPartition.get(partitionPath) : 0L;
+            totalRecords += numRecords;
+        }
+        logger.info("TotalRecords: " + totalRecords + ", TotalFiles: " + totalFiles + ", TotalAffectedPartitions: " + recordsPerPartition.size());
+        logger.info("Sub Partition Counts : " + subpartitionCountMap);
+        return subpartitionCountMap;
+    }
+
+    /**
+     * Load the input records as <Partition, RowKeys> in memory.
+     */
+    @VisibleForTesting
+    Map<String, Iterable<String>> getPartitionToRowKeys(JavaRDD<HoodieRecord<T>> recordRDD) {
+        // Have to wrap the map into a hashmap because of the need to broadcast (see: http://php.sabscape.com/blog/?p=671)
+        return recordRDD.mapToPair(new PairFunction<HoodieRecord<T>, String, String>() {
+            @Override
+            public Tuple2<String, String> call(HoodieRecord<T> record) {
+                return new Tuple2<>(record.getPartitionPath(), record.getRecordKey());
+            }
+        }).groupByKey().collectAsMap();
+    }
+
+    /**
+     * Load all involved files as <Partition, filename> pair RDD.
+     */
+    @VisibleForTesting
+    JavaPairRDD<String, String> loadInvolvedFiles(List<String> partitions, final HoodieTableMetadata metadata) {
+        return jsc.parallelize(partitions, Math.max(partitions.size(), 1))
+                .flatMapToPair(new PairFlatMapFunction<String, String, String>() {
+                    @Override
+                    public Iterable<Tuple2<String, String>> call(String partitionPath) {
+                        FileSystem fs = FSUtils.getFs();
+                        String latestCommitTime = metadata.getAllCommits().lastCommit();
+                        FileStatus[] filteredStatus = metadata.getLatestVersionInPartition(fs, partitionPath, latestCommitTime);
+                        List<Tuple2<String, String>> list = new ArrayList<>();
+                        for (FileStatus fileStatus : filteredStatus) {
+                            list.add(new Tuple2<>(partitionPath, fileStatus.getPath().getName()));
+                        }
+                        return list;
+                    }
+                });
+    }
+
+    @Override
+    public boolean rollbackCommit(String commitTime) {
+        // Nope, don't need to do anything.
+        return true;
+    }
+
+
+    /**
+     * When we subpartition records going into a partition, we still need to check them against
+     * all the files within the partition. Thus, we need to explode the (partition, file) pairs
+     * to (partition_subpartnum, file), so we can later join.
+     *
+     *
+     * @param partitionFilePairRDD
+     * @param subpartitionCountMap
+     * @return
+     */
+    private JavaPairRDD<String, String> explodePartitionFilePairRDD(JavaPairRDD<String, String> partitionFilePairRDD,
+                                                                    final Map<String, Long> subpartitionCountMap) {
+        return partitionFilePairRDD
+                .map(new Function<Tuple2<String, String>, List<Tuple2<String, String>>>() {
+                    @Override
+                    public List<Tuple2<String, String>> call(Tuple2<String, String> partitionFilePair) throws Exception {
+                        List<Tuple2<String, String>> explodedPartitionFilePairs = new ArrayList<>();
+                        for (long l = 0; l < subpartitionCountMap.get(partitionFilePair._1); l++) {
+                            explodedPartitionFilePairs.add(new Tuple2<>(
+                                    String.format("%s#%d", partitionFilePair._1, l),
+                                    partitionFilePair._2));
+                        }
+                        return explodedPartitionFilePairs;
+                    }
+                })
+                .flatMapToPair(new PairFlatMapFunction<List<Tuple2<String, String>>, String, String>() {
+                    @Override
+                    public Iterable<Tuple2<String, String>> call(List<Tuple2<String, String>> exploded) throws Exception {
+                        return exploded;
+                    }
+                });
+
+    }
+
+    /**
+     * To handle tons of incoming records to a partition, we need to split them into groups or create subpartitions.
+     * Here, we do a simple hash mod splitting, based on computed sub partitions.
+     *
+     * @param partitionRecordKeyPairRDD
+     * @param subpartitionCountMap
+     * @return
+     */
+    private JavaPairRDD<String, String> splitPartitionRecordKeysPairRDD(JavaPairRDD<String, String> partitionRecordKeyPairRDD,
+                                                                        final Map<String, Long> subpartitionCountMap) {
+        return partitionRecordKeyPairRDD
+                .mapToPair(new PairFunction<Tuple2<String, String>, String, String>() {
+                    @Override
+                    public Tuple2<String, String> call(Tuple2<String, String> partitionRecordKeyPair) throws Exception {
+                        long subpart = (partitionRecordKeyPair._2.hashCode() & Integer.MAX_VALUE) % subpartitionCountMap.get(partitionRecordKeyPair._1); // mask the sign bit: Math.abs(Integer.MIN_VALUE) is negative
+                        return new Tuple2<>(
+                                String.format("%s#%d", partitionRecordKeyPair._1, subpart),
+                                partitionRecordKeyPair._2);
+                    }
+                });
+    }
+
+
+    /**
+     * It's crucial to pick the right parallelism.
+     *
+     * totalSubPartitions : this is deemed a safe limit, to be nice with Spark.
+     * inputParallelism : typically the number of input files.
+     *
+     * We pick the max so that we are always safe, but go higher if, say, there are
+     * a lot of input files. (Otherwise, we would fall back to the number of partitions in the input and
+     * end up with slow performance.)
+     *
+     *
+     * @param inputParallelism
+     * @param subpartitionCountMap
+     * @return
+     */
+    private int determineParallelism(int inputParallelism, final Map<String, Long> subpartitionCountMap) {
+        // size the join parallelism to max(total number of sub partitions, total number of files).
+        int totalSubparts = 0;
+        for (long subparts : subpartitionCountMap.values()) {
+            totalSubparts += (int) subparts;
+        }
+        int joinParallelism = Math.max(totalSubparts, inputParallelism);
+        logger.info("InputParallelism: " + inputParallelism + ", " +
+                "TotalSubParts: " + totalSubparts + ", " +
+                "Join Parallelism set to : " + joinParallelism);
+        return joinParallelism;
+    }
+
+
+    /**
+     * Find out <RowKey, filename> pair. All workload grouped by file-level.
+     *
+     * Join PairRDD(PartitionPath, RecordKey) and PairRDD(PartitionPath, File), then repartition such that
+     * each RDD partition is a file. For each file, we (1) load the bloom filter, (2) load the rowKeys, (3) tag the rowKeys.
+     * Make sure the parallelism is at least the groupBy parallelism for tagging location.
+     */
+    private JavaPairRDD<String, String> findMatchingFilesForRecordKeys(JavaPairRDD<String, String> partitionFilePairRDD,
+                                                                       JavaPairRDD<String, String> partitionRecordKeyPairRDD,
+                                                                       final Map<String, Long> subpartitionCountMap) {
+
+        // prepare the two RDDs and their join parallelism
+        JavaPairRDD<String, String> subpartitionFilePairRDD = explodePartitionFilePairRDD(partitionFilePairRDD, subpartitionCountMap);
+        JavaPairRDD<String, String> subpartitionRecordKeyPairRDD = splitPartitionRecordKeysPairRDD(partitionRecordKeyPairRDD,
+                subpartitionCountMap);
+        int joinParallelism = determineParallelism(partitionRecordKeyPairRDD.partitions().size(), subpartitionCountMap);
+
+        // Perform a join, to bring all the files in each subpartition together with the record keys to be tested against them
+        JavaPairRDD<String, Tuple2<String, String>> joinedTripletRDD = subpartitionFilePairRDD.join(subpartitionRecordKeyPairRDD, joinParallelism);
+
+        // sort further based on filename, such that all checking for the file can happen within a single partition, on-the-fly
+        JavaPairRDD<String, Tuple2<String, HoodieKey>> fileSortedTripletRDD = joinedTripletRDD
+                .mapToPair(new PairFunction<Tuple2<String, Tuple2<String, String>>, String, Tuple2<String, HoodieKey>>() {
+                    @Override
+                    /**
+                     * Incoming triplet is (partitionPath_subpart) => (file, recordKey)
+                     */
+                    public Tuple2<String, Tuple2<String, HoodieKey>> call(Tuple2<String, Tuple2<String, String>> joinedTriplet) throws Exception {
+                        String partitionPath = joinedTriplet._1.split("#")[0]; // throw away the subpart
+                        String fileName = joinedTriplet._2._1;
+                        String recordKey = joinedTriplet._2._2;
+
+                        // make a sort key as <file>#<recordKey>, to handle skews
+                        return new Tuple2<>(String.format("%s#%s", fileName, recordKey),
+                                new Tuple2<>(fileName, new HoodieKey(recordKey, partitionPath)));
+                    }
+                }).sortByKey(true, joinParallelism);
+
+        return fileSortedTripletRDD
+            .mapPartitionsWithIndex(new HoodieBloomIndexCheckFunction(config.getBasePath()), true)
+            .flatMap(new FlatMapFunction<List<IndexLookupResult>, IndexLookupResult>() {
+                @Override
+                public Iterable<IndexLookupResult> call(List<IndexLookupResult> indexLookupResults)
+                    throws Exception {
+                    return indexLookupResults;
+                }
+            }).filter(new Function<IndexLookupResult, Boolean>() {
+                @Override
+                public Boolean call(IndexLookupResult lookupResult) throws Exception {
+                    return lookupResult.getMatchingRecordKeys().size() > 0;
+                }
+            }).flatMapToPair(new PairFlatMapFunction<IndexLookupResult, String, String>() {
+                @Override
+                public Iterable<Tuple2<String, String>> call(IndexLookupResult lookupResult)
+                    throws Exception {
+                    List<Tuple2<String, String>> vals = new ArrayList<>();
+                    for (String recordKey : lookupResult.getMatchingRecordKeys()) {
+                        vals.add(new Tuple2<>(recordKey, lookupResult.getFileName()));
+                    }
+                    return vals;
+                }
+            });
+    }
+
+    /**
+     * Tag the <rowKey, filename> back to the original HoodieRecord RDD.
+     */
+    private JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(JavaPairRDD<String, String> rowKeyFilenamePairRDD,
+                                                              JavaRDD<HoodieRecord<T>> recordRDD) {
+        JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD = recordRDD.mapToPair(
+                new PairFunction<HoodieRecord<T>, String, HoodieRecord<T>>() {
+                    @Override
+                    public Tuple2<String, HoodieRecord<T>> call(HoodieRecord<T> record) throws Exception {
+                        return new Tuple2<>(record.getRecordKey(), record);
+                    }
+                });
+
+        // The recordRDD may have more entries than rowKeyFilenamePairRDD (keys with no matching file are new inserts), so we do a left outer join.
+        return rowKeyRecordPairRDD.leftOuterJoin(rowKeyFilenamePairRDD).values().map(
+                new Function<Tuple2<HoodieRecord<T>, Optional<String>>, HoodieRecord<T>>() {
+                    @Override
+                    public HoodieRecord<T> call(Tuple2<HoodieRecord<T>, Optional<String>> v1) throws Exception {
+                        HoodieRecord<T> record = v1._1();
+                        if (v1._2().isPresent()) {
+                            String filename = v1._2().get();
+                            if (filename != null && !filename.isEmpty()) {
+                                record.setCurrentLocation(new HoodieRecordLocation(FSUtils.getCommitTime(filename),
+                                        FSUtils.getFileId(filename)));
+                            }
+                        }
+                        return record;
+                    }
+                });
+    }
+
+    @Override
+    public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD, HoodieTableMetadata metadata) {
+        return writeStatusRDD;
+    }
+}
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieBloomIndexCheckFunction.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieBloomIndexCheckFunction.java
@@ -0,0 +1,193 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.uber.hoodie.common.BloomFilter;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.util.ParquetUtils;
+import com.uber.hoodie.exception.HoodieException;
+import com.uber.hoodie.exception.HoodieIndexException;
+import com.uber.hoodie.func.LazyIterableIterator;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.function.Function2;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+
+import scala.Tuple2;
+
+/**
+ * Function performing actual checking of RDD partition containing (fileId, hoodieKeys) against the
+ * actual files
+ */
+public class HoodieBloomIndexCheckFunction implements Function2<Integer, Iterator<Tuple2<String, Tuple2<String, HoodieKey>>>, Iterator<List<IndexLookupResult>>> {
+
+    private static Logger logger = LogManager.getLogger(HoodieBloomIndexCheckFunction.class);
+
+    private final String basePath;
+
+    public HoodieBloomIndexCheckFunction(String basePath) {
+        this.basePath = basePath;
+    }
+
+    /**
+     * Given a list of row keys and one file, return only row keys existing in that file.
+     */
+    public static List<String> checkCandidatesAgainstFile(List<String> candidateRecordKeys, Path filePath) throws HoodieIndexException {
+        List<String> foundRecordKeys = new ArrayList<>();
+        try {
+            // Load all rowKeys from the file, to double-confirm
+            if (!candidateRecordKeys.isEmpty()) {
+                Set<String> fileRowKeys = ParquetUtils.readRowKeysFromParquet(filePath);
+                logger.info("Loading " + fileRowKeys.size() + " row keys from " + filePath);
+                if (logger.isDebugEnabled()) {
+                    logger.debug("Keys from " + filePath + " => " + fileRowKeys);
+                }
+                for (String rowKey : candidateRecordKeys) {
+                    if (fileRowKeys.contains(rowKey)) {
+                        foundRecordKeys.add(rowKey);
+                    }
+                }
+                logger.info("After checking with row keys, we have " + foundRecordKeys.size() + " results, for file " + filePath);
+                if (logger.isDebugEnabled()) {
+                    logger.debug("Keys matching for file " + filePath + " => " + foundRecordKeys);
+                }
+            }
+        } catch (Exception e){
+            throw new HoodieIndexException("Error checking candidate keys against file.", e);
+        }
+        return foundRecordKeys;
+    }
+
+    class LazyKeyCheckIterator extends LazyIterableIterator<Tuple2<String, Tuple2<String, HoodieKey>>, List<IndexLookupResult>> {
+
+        private List<String> candidateRecordKeys;
+
+        private BloomFilter bloomFilter;
+
+        private String currentFile;
+
+        private String currentParitionPath;
+
+        LazyKeyCheckIterator(Iterator<Tuple2<String, Tuple2<String, HoodieKey>>> filePartitionRecordKeyTripletItr) {
+            super(filePartitionRecordKeyTripletItr);
+            currentFile = null;
+            candidateRecordKeys = new ArrayList<>();
+            bloomFilter = null;
+            currentParitionPath = null;
+        }
+
+        @Override
+        protected void start() {
+        }
+
+        private void initState(String fileName, String partitionPath) throws HoodieIndexException {
+            try {
+                Path filePath = new Path(basePath + "/" + partitionPath + "/" + fileName);
+                bloomFilter = ParquetUtils.readBloomFilterFromParquetMetadata(filePath);
+                candidateRecordKeys = new ArrayList<>();
+                currentFile = fileName;
+                currentParitionPath = partitionPath;
+            } catch (Exception e) {
+                throw new HoodieIndexException("Error reading bloom filter from file.", e);
+            }
+        }
+
+        @Override
+        protected List<IndexLookupResult> computeNext() {
+
+            List<IndexLookupResult> ret = new ArrayList<>();
+            try {
+                // process one file in each go.
+                while (inputItr.hasNext()) {
+
+                    Tuple2<String, Tuple2<String, HoodieKey>> currentTuple = inputItr.next();
+                    String fileName = currentTuple._2._1;
+                    String partitionPath = currentTuple._2._2.getPartitionPath();
+                    String recordKey = currentTuple._2._2.getRecordKey();
+
+                    // lazily init state
+                    if (currentFile == null) {
+                        initState(fileName, partitionPath);
+                    }
+
+                    // if we are continuing on the current file
+                    if (fileName.equals(currentFile)) {
+                        // check record key against bloom filter of current file & add to possible keys if needed
+                        if (bloomFilter.mightContain(recordKey)) {
+                            if (logger.isDebugEnabled()) {
+                                logger.debug("#1 Adding " + recordKey + " as candidate for file " + fileName);
+                            }
+                            candidateRecordKeys.add(recordKey);
+                        }
+                    } else {
+                        // do the actual checking of file & break out
+                        Path filePath = new Path(basePath + "/" + currentParitionPath + "/" + currentFile);
+                        logger.info("#1 After bloom filter, the number of candidate row keys is reduced to " + candidateRecordKeys.size() + " for " + filePath);
+                        if (logger.isDebugEnabled()) {
+                            logger.debug("#The candidate row keys for " + filePath + " => " + candidateRecordKeys);
+                        }
+                        ret.add(new IndexLookupResult(currentFile, checkCandidatesAgainstFile(candidateRecordKeys, filePath)));
+
+                        initState(fileName, partitionPath);
+                        if (bloomFilter.mightContain(recordKey)) {
+                            if (logger.isDebugEnabled()) {
+                                logger.debug("#2 Adding " + recordKey + " as candidate for file " + fileName);
+                            }
+                            candidateRecordKeys.add(recordKey);
+                        }
+                        break;
+                    }
+                }
+
+                // handle case, where we ran out of input, finish pending work, update return val
+                if (!inputItr.hasNext()) {
+                    Path filePath = new Path(basePath + "/" + currentParitionPath + "/" + currentFile);
+                    logger.info("#2 After bloom filter, the number of candidate row keys is reduced to " + candidateRecordKeys.size() + " for " + filePath);
+                    if (logger.isDebugEnabled()) {
+                        logger.debug("#The candidate row keys for " + filePath + " => " + candidateRecordKeys);
+                    }
+                    ret.add(new IndexLookupResult(currentFile, checkCandidatesAgainstFile(candidateRecordKeys, filePath)));
+                }
+
+            } catch (Throwable e) {
+                if (e instanceof HoodieException) {
+                    throw e;
+                }
+                throw new HoodieIndexException("Error checking bloom filter index. ", e);
+            }
+
+            return ret;
+        }
+
+        @Override
+        protected void end() {
+        }
+    }
+
+
+    @Override
+    public Iterator<List<IndexLookupResult>> call(Integer partition,
+                                                  Iterator<Tuple2<String, Tuple2<String, HoodieKey>>> filePartitionRecordKeyTripletItr) throws Exception {
+        return new LazyKeyCheckIterator(filePartitionRecordKeyTripletItr);
+    }
+}
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieIndex.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/HoodieIndex.java
@@ -0,0 +1,101 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.google.common.base.Optional;
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.WriteStatus;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.model.HoodieRecordPayload;
+import com.uber.hoodie.common.model.HoodieTableMetadata;
+import com.uber.hoodie.common.model.HoodieRecord;
+
+import com.uber.hoodie.exception.HoodieIndexException;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+
+/**
+ * Base class for different types of indexes, which determine the mapping from a record key (uuid) to its file location
+ * <p/>
+ * TODO(vc): need methods for recovery and rollback
+ */
+public abstract class HoodieIndex<T extends HoodieRecordPayload> implements Serializable {
+    protected transient JavaSparkContext jsc = null;
+
+    public enum IndexType {
+        HBASE,
+        INMEMORY,
+        BLOOM
+    }
+
+    protected final HoodieWriteConfig config;
+
+    protected HoodieIndex(HoodieWriteConfig config, JavaSparkContext jsc) {
+        this.config = config;
+        this.jsc = jsc;
+    }
+
+    /**
+     * Checks whether the given [Keys] exist in the hoodie table and returns [Key, Optional<FullFilePath>].
+     * If the optional FullFilePath value is not present, then the key was not found. If the FullFilePath
+     * value is present, it is the path component (without scheme) of the URI of the underlying file.
+     *
+     * @param hoodieKeys
+     * @param metadata
+     * @return
+     */
+    public abstract JavaPairRDD<HoodieKey, Optional<String>> fetchRecordLocation(
+        JavaRDD<HoodieKey> hoodieKeys, final HoodieTableMetadata metadata);
+
+    /**
+     * Looks up the index and tags each incoming record with a location of a file that contains the
+     * row (if it is actually present)
+     */
+    public abstract JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+                                                      HoodieTableMetadata metadata) throws
+        HoodieIndexException;
+
+    /**
+     * Extracts the location of written records, and updates the index.
+     * <p/>
+     * TODO(vc): We may need to propagate the record as well in a WriteStatus class
+     */
+    public abstract JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD,
+                                                        HoodieTableMetadata metadata) throws
+        HoodieIndexException;
+
+    /**
+     * Roll back the effects of the commit made at commitTime.
+     */
+    public abstract boolean rollbackCommit(String commitTime);
+
+    public static <T extends HoodieRecordPayload> HoodieIndex<T> createIndex(
+            HoodieWriteConfig config, JavaSparkContext jsc) throws HoodieIndexException {
+        switch (config.getIndexType()) {
+            case HBASE:
+                return new HBaseIndex<>(config, jsc);
+            case INMEMORY:
+                return new InMemoryHashIndex<>(config, jsc);
+            case BLOOM:
+                return new HoodieBloomIndex<>(config, jsc);
+        }
+        throw new HoodieIndexException("Unsupported index type: " + config.getIndexType());
+    }
+}
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/InMemoryHashIndex.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/InMemoryHashIndex.java
@@ -0,0 +1,109 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import com.google.common.base.Optional;
+import com.uber.hoodie.config.HoodieWriteConfig;
+import com.uber.hoodie.WriteStatus;
+import com.uber.hoodie.common.model.HoodieKey;
+import com.uber.hoodie.common.model.HoodieRecord;
+import com.uber.hoodie.common.model.HoodieRecordLocation;
+import com.uber.hoodie.common.model.HoodieRecordPayload;
+import com.uber.hoodie.common.model.HoodieTableMetadata;
+
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.Function2;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ConcurrentMap;
+
+
+/**
+ * Hoodie Index implementation backed by an in-memory Hash map.
+ *
+ * ONLY USE FOR LOCAL TESTING
+ *
+ */
+public class InMemoryHashIndex<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+
+    private static final ConcurrentMap<HoodieKey, HoodieRecordLocation> recordLocationMap =
+            new ConcurrentHashMap<>();
+
+    public InMemoryHashIndex(HoodieWriteConfig config, JavaSparkContext jsc) {
+        super(config, jsc);
+    }
+
+    @Override
+    public JavaPairRDD<HoodieKey, Optional<String>> fetchRecordLocation(
+        JavaRDD<HoodieKey> hoodieKeys, final HoodieTableMetadata metadata) {
+        throw new UnsupportedOperationException("InMemory index does not support fetchRecordLocation yet");
+    }
+
+    /**
+     * Function that tags each HoodieRecord with an existing location, if known.
+     */
+    class LocationTagFunction
+            implements Function2<Integer, Iterator<HoodieRecord<T>>, Iterator<HoodieRecord<T>>> {
+        @Override
+        public Iterator<HoodieRecord<T>> call(Integer partitionNum,
+                                           Iterator<HoodieRecord<T>> hoodieRecordIterator) {
+            List<HoodieRecord<T>> taggedRecords = new ArrayList<>();
+            while (hoodieRecordIterator.hasNext()) {
+                HoodieRecord<T> rec = hoodieRecordIterator.next();
+                if (recordLocationMap.containsKey(rec.getKey())) {
+                    rec.setCurrentLocation(recordLocationMap.get(rec.getKey()));
+                }
+                taggedRecords.add(rec);
+            }
+            return taggedRecords.iterator();
+        }
+    }
+
+    @Override
+    public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+                                             HoodieTableMetadata metadata) {
+        return recordRDD.mapPartitionsWithIndex(new LocationTagFunction(), true);
+    }
+
+    @Override
+    public JavaRDD<WriteStatus> updateLocation(JavaRDD<WriteStatus> writeStatusRDD,
+                                               HoodieTableMetadata metadata) {
+        return writeStatusRDD.map(new Function<WriteStatus, WriteStatus>() {
+            @Override
+            public WriteStatus call(WriteStatus writeStatus) {
+                for (HoodieRecord record : writeStatus.getWrittenRecords()) {
+                    if (!writeStatus.isErrored(record.getKey())) {
+                        recordLocationMap.put(record.getKey(), record.getNewLocation());
+                    }
+                }
+                return writeStatus;
+            }
+        });
+    }
+
+    @Override
+    public boolean rollbackCommit(String commitTime) {
+        // TODO (weiy): rollback is not implemented; entries stay in the map
+        return true;
+    }
+}
--- a/hoodie-client/src/main/java/com/uber/hoodie/index/IndexLookupResult.java
+++ b/hoodie-client/src/main/java/com/uber/hoodie/index/IndexLookupResult.java
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2016 Uber Technologies, Inc. (hoodie-dev-group@uber.com)
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *          http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.index;
+
+import java.util.List;
+
+/**
+ * Encapsulates the result of an index lookup.
+ */
+public class IndexLookupResult {
+
+    private final String fileName;
+
+    // Record keys reported as matches for fileName by the lookup
+    private final List<String> matchingRecordKeys;
+
+    public IndexLookupResult(String fileName, List<String> matchingRecordKeys) {
+        this.fileName = fileName;
+        this.matchingRecordKeys = matchingRecordKeys;
+    }
+
+    public String getFileName() {
+        return fileName;
+    }
+
+    public List<String> getMatchingRecordKeys() {
+        return matchingRecordKeys;
+    }
+}