[HUDI-1822] Rewriting rfc-27 for data skipping index (#4280)
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
# RFC-27: Data skipping index to improve query performance

## Proposers

- @manojpec
- @shivnarayan
- @satish.kotha

## Approvers

- @rmpifer
- @uditme

## Status

JIRA: https://issues.apache.org/jira/browse/HUDI-1822

> Please keep the status updated in `rfc/README.md`.

## Abstract
Query engines typically scan large amounts of irrelevant data during query planning and execution. Some workarounds are
available to reduce the amount of irrelevant data scanned. These include:

- Partition pruning
- File pruning
  - Some data file formats contain metadata, including range information for certain columns (for parquet, this
    metadata is stored in the footer).
  - As part of query planning, all range information from data files is read.
  - Irrelevant data files are then pruned based on predicates and the available range information.

Partition pruning typically puts the burden on users to select the partitions where the data may exist. The file pruning
approach is expensive and does not scale when there are a large number of partitions and data files to be scanned. So we
propose a new solution that stores additional information in the Hudi metadata table to implement a data skipping index.
The goals of the data skipping index are to provide:

- Global index: Users query for the information they need without having to specify partitions. The index can
  effectively find data files in the table.
- Improved query planning: Efficiently find data files that have information for the specified query predicates.
- Support for multiple types of index: The initial implementation may provide a range index, but the goal is to provide
  a flexible framework to implement other types of index (e.g. bloom).
## Background

RFC-15 added metadata table support to Hudi for optimized file listing. RFC-37 is adding a metadata index and column
stats as another partition in the metadata table. This RFC will piggyback on the column stats partition that RFC-37 is
adding to the metadata table.

Note: The effectiveness of the index will be proportional to how the data is laid out. If every file contains data for a
commonly specified query predicate, the index may not be very effective.
## Implementation
At a high level, there are three components to implement index support:
- Storage format
- Metadata generation
- Query engine integration
### Column_Stats Index/Partition
We want to support multiple types of index (range, bloom, etc.), so it is important to generate different types of
records for different columns. The focus of this RFC will be the column range or column stats index, i.e. min and max
values, null counts, etc. Users can configure the commonly queried columns, and the column stats partition in the
metadata table will store all stats pertaining to the configured columns for every valid data file where the column is
present.

Similar to how we generate records for the files partition in the metadata table, we will generate HoodieMetadataRecords
for the column stats partition on any commit that gets applied to the metadata table. The basic building blocks of the
metadata table used for file listing will be reused for this column stats partition as well (how updates are applied to
the metadata table, how invalid data is ignored, etc.).

The column_stats partition stores statistics for all indexed columns in the Hudi data table. The index maintained in
this partition helps with predicate pushdown/data skipping, i.e. file filtering based on column predicates.

For the purpose of column predicate filtering, this partition can store statistics for any column as per the configs.

So, the high-level requirement for this column_stats partition (pertaining to this RFC) is:
- Given a list of columns and predicates (and optionally partitions), return a list of matching file names
### Storage format
To cater to the above requirement, we plan to encode the column name, partition path and file name into the keys in
HFile. Since HFile supports efficient range/prefix search, our lookup should be very fast.



We plan to generate unique and random hash IDs for all 3 components:
- ColumnID:
    - base64(hash32(column name))
    - on-disk size = 12 bytes per col_stat per file
- PartitionID:
    - base64(hash32(partition name))
    - on-disk size = 12 bytes per partition
- FileID:
    - base64(hash128(file name))
    - on-disk size = 24 bytes per file
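The ID scheme above can be sketched as follows. This is illustrative only: the actual hash functions and widths used by Hudi may differ, and `ColStatsKey`, `hashBytes`, and the MD5-truncation stand-in are assumptions made here for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

// Illustrative sketch only: Hudi's real hash functions and key layout may differ.
public class ColStatsKey {

  // hashN(name): first numBytes of an MD5 digest, standing in for a 32/128-bit hash.
  static byte[] hashBytes(String name, int numBytes) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(name.getBytes(StandardCharsets.UTF_8));
      return Arrays.copyOf(digest, numBytes);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  static String encode(byte[] bytes) {
    // URL-safe base64 without padding keeps keys compact and sortable as strings.
    return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
  }

  // [ColumnID][PartitionID][FileID]: fixed-width key, so no delimiters are needed.
  static String fileLevelKey(String col, String partition, String file) {
    return encode(hashBytes(col, 4)) + encode(hashBytes(partition, 4))
        + encode(hashBytes(file, 16));
  }

  // [ColumnID]"agg"[PartitionID]: one aggregated entry per column per partition.
  static String partitionAggKey(String col, String partition) {
    return encode(hashBytes(col, 4)) + "agg" + encode(hashBytes(partition, 4));
  }
}
```

Because the hash is deterministic, writers never need to look up previously assigned IDs, and the fixed key width lets readers split a key back into its components by offset alone.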
#### Design Choices for ID generation
1. Incremental IDs: Sequentially increasing IDs can be generated in the context of the ongoing commit/write. The ID can
   always start at 1, and to make the full ID unique enough, sequential IDs can be appended with the ongoing commit time.
   a. Pros:
      - The ID is simple to generate and doesn't depend on key lookups to resume ID generation across writers.
      - The overall ID can be shorter than hash-based IDs and still be unique.
      - Differential/delta encoding works well with sequential numbers and can achieve a high compression ratio (though
        we didn't see this in our tests).
   b. Cons:
      - The same column can be given several IDs across several commits spilled over several files. Complex merging
        logic is needed to coalesce them all when looking up any column of interest.
      - Doesn't work well with schema evolution. Even without schema evolution, changing IDs for the same column is
        itself a small schema evolution problem.

2. Hash IDs: Hashing utilities can be used to generate unique and random IDs of any length for a given
   column/partition/file name.
   a. Pros:
      - Deterministic name-to-ID generation.
      - Reverse lookup of ID to name is possible via a relatively much smaller meta index read.
      - ID length can be controlled for scaling needs.
      - Sharding and locality can be controlled by prefixing with more bits (doable with incremental IDs also).
   b. Cons:
      - Big-scale deployments demand a huge ID space for files, thereby needing 128-bit hashes. These are usually 32 hex
        digits, taking up at least 32 bytes per ID on disk. However, base64 encoding can help shave off a few bytes and
        get them down to 24 bytes.
      - Takes up more space in memory and on disk compared to sequential IDs. Theoretically, the compression ratio
        should be lower compared to sequential IDs.
Key format in the column_stats partition:
- [colId][PartitionId][FileId]
- [colId]+"agg"+[PartitionId]

The first type will be used to store one entry per column per file, and the second type will be used to store one
aggregated entry per column per partition. This will be a fixed-size key, so lookups don't have to search for ID
delimiters as in the case of incremental IDs.

These key encodings fit our requirements well. Since we are using HFile as the format, all keys are going to be sorted,
and hence range reads will be very effective for our use-case; we have consciously chosen the key format with this in
mind.

Given a list of columns and optionally partitions, return a list of matching file names:

1. We can do a prefix search of [ColumnID] or [ColumnID][PartitionID].
   - If both column IDs and partition IDs are supplied, we will do a range read of [colId][partitionId].
   - If the list of partitions is not available as part of the query, we will first look up [colId]+"agg" with a prefix
     search for partition-level stats, filter for those partitions which match the predicates, and then proceed as in
     the previous step.
2. Fetch only the entries of interest for the [colId][partitionId] list.
3. Look up the stats and filter for matching FileIDs.
4. Reverse lookup in the files partition to get the FileID to file name mapping.

Note:
As you can see, the reverse lookup of FileID to file name has to go to the "files" partition to satisfy our requirement.
So, the "files" partition will be extended with additional entries of FileID to file name mappings on the write path.
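The prefix/range reads above rely only on the keys being sorted. A minimal sketch, emulating the sorted HFile key space with a `TreeMap` (the `ColStatsLookup` class and `long[]` stats payload are illustrative stand-ins, not Hudi classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the prefix/range lookup over sorted keys; a TreeMap stands in for HFile.
public class ColStatsLookup {

  // Scan all entries whose key starts with the given prefix, relying on the
  // keys being sorted (as they are in HFile): seek to the prefix, read forward,
  // stop at the first key past the prefix range.
  static List<Map.Entry<String, long[]>> prefixScan(TreeMap<String, long[]> index, String prefix) {
    List<Map.Entry<String, long[]>> out = new ArrayList<>();
    for (Map.Entry<String, long[]> e : index.tailMap(prefix).entrySet()) {
      if (!e.getKey().startsWith(prefix)) {
        break; // sorted order guarantees nothing later can match
      }
      out.add(e);
    }
    return out;
  }
}
```

A [colId] prefix returns all files containing the column; a [colId][partitionId] prefix narrows the scan to one partition without touching any other key range.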
#### Sharding
Any partition in the metadata table needs to be instantiated with N file groups/shards upfront. The "files" partition is
small, and hence we went with just one file group. But for a record-level index we can't go with a single file group and
have to shard the data. We will employ some kind of hashing mechanism for the key-to-file-group mapping. On the write
path, entries will be sharded and written to different file groups. On the read path, the key to be looked up will be
hashed to find the right file group. For wildcard searches, all file groups will be looked up.

// To be revisited.
We plan to instantiate the number of file groups in the column stats partition based on the number of columns being
indexed. We can't estimate upfront the data scale to which the table might eventually grow, and hence have to go with
some estimates. A rough idea is to instantiate one file group per 10 columns being indexed, or to get rough input from
the user on whether the table will be small/medium/large scale and decide based on that.

Similar to how we generate records for the files partition in the metadata table, we will generate HoodieMetadataRecords
for the column stats partition on any commit that gets applied to the metadata table.
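The key-to-file-group mapping can be sketched as below. The hash-and-modulo scheme and the `ShardMapper` class are assumptions for illustration, not Hudi's actual implementation; the essential property is only that writers and readers agree deterministically on the shard for a given key.

```java
// Sketch of key-to-file-group sharding; the hash and modulo scheme is an
// assumption, not Hudi's actual implementation.
public class ShardMapper {

  // Map a key to one of numFileGroups shards deterministically, so the write
  // path and the read path route the same key to the same file group.
  static int fileGroupFor(String key, int numFileGroups) {
    int h = key.hashCode();
    return Math.floorMod(h, numFileGroups); // floorMod avoids negative indices
  }
}
```

A wildcard search has no key to hash, which is why it must fan out to every file group.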
### Metadata generation
The existing metadata payload schema will be extended and shared for this new "column_stats" partition as well. The type
field will be used to detect a column stats payload record. Here is the schema for the column stats payload record.
```
"namespace": "org.apache.hudi.avro.model",
"type": "record",
"name": "HoodieMetadataRecord",
"doc": "A record saved within the Metadata Table",
"fields": [
    {
        "name": "key",
        "type": "string"
    },
    {
        "name": "type",
        "doc": "Type of the metadata record",
        "type": "int"
    },
    {
        "name": "filesystemMetadata",
        ...
    },
    {
        "name": "ColumnStatsMetadata",
        "doc": "Contains information about column statistics for all data files in the table",
        "type": [
            "null",
            {
                "type": "record",
                "name": "HoodieColumnStats",
                "fields": [
                    {
                        "name": "rangeLow",
                        "type": ["null", "bytes"],
                        "doc": "Low end of the range. For now, this is a String. Based on the main data table schema, we can convert it to the appropriate type"
                    },
                    {
                        "name": "rangeHigh",
                        "type": ["null", "bytes"],
                        "doc": "High end of the range. For now, this is a String. Based on the main data table schema, we can convert it to the appropriate type"
                    },
                    {
                        "name": "total_values",
                        "type": ["long", "null"],
                        "doc": "Stores the total number of values for this column in the respective data file"
                    },
                    {
                        "name": "total_nulls",
                        "type": ["long", "null"],
                        "doc": "Stores the total number of null values for this column in the respective data file"
                    },
                    {
                        "name": "total_nans",
                        "type": ["long", "null"],
                        "doc": "Stores the total number of NaN values for this column in the respective data file"
                    },
                    {
                        "name": "total_size_on_disk",
                        "type": ["long", "null"],
                        "doc": "Stores the total size occupied by this column on disk in the respective data file"
                    },
                    {
                        "name": "isDeleted",
                        "type": "boolean",
                        "doc": "True if this file has been deleted"
                    }
                ]
            }
        ]
    }
]
```
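Deriving the per-file statistics for one column can be sketched as follows. The `ColumnStatsBuilder` class and the flat `long[]` return value are illustrative assumptions; in Hudi these fields would populate a `HoodieColumnStats` record as in the schema above.

```java
import java.util.List;
import java.util.Objects;

// Hedged sketch of computing rangeLow/rangeHigh, total_values and total_nulls
// for one column of one data file; not Hudi's actual writer code.
public class ColumnStatsBuilder {

  // Returns {min, max, totalValues, totalNulls} over the column's values.
  static long[] stats(List<Long> values) {
    long totalValues = values.size();
    long totalNulls = values.stream().filter(Objects::isNull).count();
    long min = values.stream().filter(Objects::nonNull)
        .mapToLong(Long::longValue).min().orElse(0L);
    long max = values.stream().filter(Objects::nonNull)
        .mapToLong(Long::longValue).max().orElse(0L);
    return new long[] {min, max, totalValues, totalNulls};
  }
}
```

In practice these values come from parquet footer metadata rather than a full column scan, as noted in the writer flow below.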
Column stats records hold all stats for the file. The key for a column stat record would be an encoded string as
discussed earlier.

```
key = base64_encode(hash64(column name) + hash64(partition name) + hash128(file path))
key = base64_encode(hash64(column name) + "agg" + hash64(partition name))
```

While hash-based IDs have quite a few desirable properties in the context of Hudi index lookups, there is an impact on
column-level schema changes. Refer to the [Schema Evolution](#Schema-Evolution) section for more details.
#### Writer flow
Let's walk through the writer flow to update the column_stats partition in the metadata table.

1. Files partition - prepare records for adding. // just calling out what's required in the context of the column_stats
   partition; the files partition in general will be updated as usual to store file listing information.
   - FileID => FileName mapping entries
   - PartitionID => PartitionName entry, if one does not already exist
   - Since these IDs are hash-based, no lookup of prior usages is required here. Otherwise, we would need to know the
     last assigned ID and then go about assigning new incremental/sequential IDs, which slows down performance
     significantly.
2. Column_stats partition - prepare records for adding.
   - [ColumnID][PartitionID][FileID] => ColumnStat
   - [ColumnId]"agg"[PartitionId] => ColumnStat
   - This involves reading the base file footers to fetch min, max and other stats to populate values for the record.
3. Commit all these records to the metadata table.

We need to ensure we have all the necessary info in the WriteStatus/commit metadata that gets passed to the metadata
writer for every commit. Reading parquet footers and metadata is unavoidable, but other than that, we should try to
embed all other info in the WriteStatus.
### Index integrations with query engines

#### Spark
We already added support for Z-ordering in 0.10.0, so we will reuse the data skipping code paths from there.

Here is the high-level flow of Z-ordering:
##### Write path
1. Sort the data (Z-order/Hilbert/Linear)
   - Triggered by clustering (right now)
   - RDDSpatialCurveOptimizationSortPartitioner
2. Build the "col stats" index (.hoodie/.colstatsindex)
   - Upon clustering completion we invoke ColumnStatsIndexHelper.updateColumnStatsIndexFor

##### Read path
1. (Spark SQL) asks for a list of files to fetch data from
   - HoodieFileIndex.listFiles
2. HoodieFileIndex reads the col stats index and applies the data predicates to fetch the list of candidate files
3. Returns it back to Spark

Given this, let's see how we can integrate the new column_stats partition.

##### Z-order write path
1. Sort the data (Z-order/Hilbert/Linear)
   - Triggered by clustering (right now)
   - RDDSpatialCurveOptimizationSortPartitioner
2. Do nothing further.
   - Upon clustering completion, the replace commit will get applied to the metadata table by default if metadata is
     enabled.

##### Read path
1. (Spark SQL) asks for a list of files to fetch data from
   - HoodieFileIndex.listFiles
2. HoodieFileIndex reads the col stats partition in the metadata table and applies the data predicates to fetch the
   list of candidate files
3. Returns it back to Spark

One caveat:
We can't get rid of the Z-order index completely right away. If the metadata table is not yet built out, or has entered
an inconsistent state and is not usable, we have to fall back to the existing way of building the index at the end of
Z-order clustering.
### Predicate filtering

#### How to apply query predicates in Hudi?
Query predicates are normally constructed in a tree-like structure, so this will follow the same pattern. The proposal
is to create a mapping utility from "engine" query predicates to a HudiExpression. This way the filtering logic is
engine agnostic.

For AND and OR operators we can translate to a tree node with left and right expressions. An example of what the
structure would look like (for the AND case) is shown below:

```java
public class HudiExpressionParentNode implements HudiExpression {
  HudiExpression left;
  HudiExpression right;

  @Override
  public boolean evaluate() {
    // AND node: both subtrees must be satisfiable for the file to qualify
    return left.evaluate() && right.evaluate();
  }
}
```
For LEAF nodes we can create an expression which contains the operator and the value we are comparing, to determine
whether the file group may have data relevant to this query. The common search expressions for the leaf nodes are:

1. Equal to - if the value in the search expression is greater than or equal to the lower bound and less than or equal
   to the upper bound in the file's column statistics, then true, else false
2. Less than - if the value in the search expression is greater than the lower bound in the file's column statistics,
   then true, else false
3. Less than or equal to - if the value in the search expression is greater than or equal to the lower bound in the
   file's column statistics, then true, else false
4. Greater than - if the value in the search expression is lower than the upper bound in the file's column statistics,
   then true, else false
5. Greater than or equal to - if the value in the search expression is lower than or equal to the upper bound in the
   file's column statistics, then true, else false

True tells us that there is a possibility that the file contains data matching the search expression, so it should be
included in the result set. False tells us that there is no possibility this file contains any data matching the search
expression, so it can be excluded from the results.
```java
public class HudiExpressionLeafNode<T extends Comparable<T>> implements HudiExpression {

  Operator op;   // (EQ, LT, LTEQ, GT, GTEQ)
  T literal;     // (INT, DOUBLE, FLOAT value)
  String column;

  @Override
  public boolean evaluate() {
    // evaluated against the file's column statistics (min/max) for `column`
  }
}
```
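The five leaf rules above can be written out concretely. This is a minimal sketch assuming numeric statistics; the `LeafEval` class and `mayMatch` helper are illustrative names, not part of the proposed API.

```java
// Hedged sketch: evaluating a leaf predicate against a file's [min, max] column
// range. Returns true if the file MIGHT contain matching rows, false only when
// a match is impossible (the safe direction for pruning).
public class LeafEval {
  enum Operator { EQ, LT, LTEQ, GT, GTEQ }

  static boolean mayMatch(Operator op, long literal, long min, long max) {
    switch (op) {
      case EQ:   return literal >= min && literal <= max;
      case LT:   return literal > min;  // some value below `literal` may exist
      case LTEQ: return literal >= min;
      case GT:   return literal < max;  // some value above `literal` may exist
      case GTEQ: return literal <= max;
      default:   return true;           // unknown operator: keep the file
    }
  }
}
```

Note the asymmetry: a `true` result is only a possibility, while `false` is a guarantee, which is what makes pruning on it safe.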
This way we can call evaluate on the root HudiExpression tree and it will determine whether the entire expression is
satisfied for the file group.

#### Hive
In order to implement predicate pushdown in Hive, we need access to the query predicate. The query predicate is not
passed to the InputFormat by default. The HiveStoragePredicateHandler interface needs to be implemented in order to
provide the query predicate to the InputFormat, and for this we need to create a custom HiveStorageHandler. Therefore
we will be creating a new storage handler, HudiStorageHandler.
```java
public interface HiveStorageHandler extends Configurable {
  public Class<? extends InputFormat> getInputFormatClass();
  public Class<? extends OutputFormat> getOutputFormatClass();
  public Class<? extends SerDe> getSerDeClass();
  public HiveMetaHook getMetaHook();
  public void configureTableJobProperties(
      TableDesc tableDesc,
      Map<String, String> jobProperties);
}
```
Everything remains the same with the input format, output format, and serde classes being used in existing Hudi tables
registered in Hive (HoodieParquetInputFormat is still being used). HudiStorageHandler would implement HiveStorageHandler
and HiveStoragePredicateHandler.

Hive adds the query predicate returned by the storage handler to the job configuration. This job configuration is then
supplied to the InputFormat. It can be fetched and deserialized as follows:
```java
String hiveFilter = jobConf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
if (hiveFilter != null) {
  ExprNodeGenericFuncDesc exprNodeDesc = SerializationUtilities
      .deserializeObject(hiveFilter, ExprNodeGenericFuncDesc.class);
  SearchArgument sarg = ConvertAstToSearchArg.create(job, exprNodeDesc);
}
```
The SearchArgument contains an ExpressionTree and a list of PredicateLeaf. The ExpressionTree is a tree structure used
to define the query predicate. If the operator is defined as OR, AND, or NOT, this indicates there are children
expressions, normally LEAFs.

```java
public class ExpressionTree {
  public enum Operator {OR, AND, NOT, LEAF, CONSTANT}
  private final Operator operator;
  private final List<ExpressionTree> children;
  private int leaf;
}
```
If the operator in the ExpressionTree is defined as LEAF, it corresponds to a PredicateLeaf defined in the
SearchArgument. The PredicateLeaf will contain information about the query predicate, such as the operator, column
name, and the literal being compared:

```java
private final org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf.Operator operator;
private final Type type;
private String columnName;
private final Object literal;
private final List<Object> literalList;
```

We can use this information and the SearchArgument to generate our HudiExpression. Then, in
HoodieParquetInputFormat.listStatus(), after fetching files from the FileSystemView for the remaining file groups, we
can apply the HudiExpression using column metadata.
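The recursive walk from the ExpressionTree to a HudiExpression can be sketched as below. Hive's classes are mocked by a minimal `Node` type here (an assumption purely for the example); only the recursion pattern mirrors what the integration would do.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hedged sketch of evaluating an ExpressionTree-style predicate tree, mirroring
// what a HudiExpression built from a SearchArgument would compute.
public class SargToHudi {
  enum Op { AND, OR, LEAF } // stand-in for Hive's ExpressionTree.Operator

  static class Node {
    final Op op;
    final List<Node> children;
    final BooleanSupplier leafEval; // stand-in for a PredicateLeaf bound to file stats

    Node(Op op, List<Node> children, BooleanSupplier leafEval) {
      this.op = op;
      this.children = children;
      this.leafEval = leafEval;
    }
  }

  // Recursively evaluate the tree, as HudiExpression.evaluate() would
  static boolean evaluate(Node n) {
    switch (n.op) {
      case LEAF: return n.leafEval.getAsBoolean();
      case AND:  return n.children.stream().allMatch(SargToHudi::evaluate);
      case OR:   return n.children.stream().anyMatch(SargToHudi::evaluate);
      default:   return true; // conservative: keep the file
    }
  }
}
```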
#### Presto
To be filled.

## Rollout/Adoption Plan

- What impact (if any) will there be on existing users?
- If we are changing behavior, how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?

## Test Plan

Describe in a few sentences how the RFC will be tested. How will we know that the implementation works as expected?
How will we know nothing broke?