[HUDI-3468][RFC-49] Support sync with DataHub (#5022)

2022-04-01 12:27:01 -07:00
parent dfdd2de99c
commit 9275b8fc7e
5 changed files with 154 additions and 51 deletions
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -14,61 +14,74 @@
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
+
 # RFCs
- - The RFC process is documented on our [site](https://hudi.apache.org/contribute/rfc-process). Please familiarize yourself with it, before working a new RFC.
- - Status can be one of these values: `UNDER REVIEW` (or) `IN PROGRESS` (or) `ABANDONED` (or) `COMPLETED`.
+
+The RFC process is documented on our [site](https://hudi.apache.org/contribute/rfc-process). Please familiarize yourself
+with it, before working a new RFC.
+
+Status can be one of these values.
+
+| Status | Meaning |
+| -------|-------------------------------------------------------|
+| `UNDER REVIEW` |  RFC has been proposed and community is actively debating the design/proposal.        |
+| `IN PROGRESS` |  The initial phase of implementation is underway.        |
+| `ONGOING` |  Some or most work has landed; community continues to improve or build follow on phases.         |
+| `ABANDONED` | The proposal was not implemented, due to various reasons.         |
+| `COMPLETED` |  All work is deemed complete.        |

 The list of all RFCs can be found here.

 > Older RFC content is still [here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).

-| RFC Number | Title | Status |
-| ---| ---| --- |
-| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer) | `COMPLETED` |
-| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439) | `IN PROGRESS` |
-| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965) | `COMPLETED` |
-| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622) | `COMPLETED` |
-| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233) | `ABANDONED` |
-| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file) | `ABANDONED` |
-| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) | `COMPLETED` |
-| 8 | [Record level indexing mechanisms for Hudi datasets](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets) | `UNDER REVIEW` |
-| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter) | `COMPLETED` |
-| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs) | `COMPLETED` |
-| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project) | `ABANDONED` |
-| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) | `COMPLETED` |
-| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520) | `COMPLETED` |
-| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller) | `COMPLETED` |
-| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) | `COMPLETED` |
-| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader) | `COMPLETED` |
-| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service) | `COMPLETED` |
-| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API) | `COMPLETED` |
-| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) | `COMPLETED` |
-| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records) | `IN PROGRESS` |
-| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual) | `COMPLETED` |
+| RFC Number | Title                                                                                                                                                                                                                | Status |
+| ---|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --- |
+| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer)                                                                         | `COMPLETED` |
+| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439)                                                                                                                    | `ONGOING` |
+| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965)                                                                             | `COMPLETED` |
+| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622)                                                                                                   | `COMPLETED` |
+| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233)                                                                                                                       | `ABANDONED` |
+| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file)                                                                           | `ABANDONED` |
+| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table)                                                     | `COMPLETED` |
+| 8 | [Record level indexing mechanisms for Hudi datasets](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets)                                                    | `ONGOING` |
+| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter)                                                                                       | `COMPLETED` |
+| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs)                                                                 | `COMPLETED` |
+| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project)                                     | `ABANDONED` |
+| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi)                               | `COMPLETED` |
+| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520)                                                                                                              | `COMPLETED` |
+| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller)                                                                                                     | `COMPLETED` |
+| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements)                                                                                        | `COMPLETED` |
+| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader)                                                   | `COMPLETED` |
+| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service)                           | `COMPLETED` |
+| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API)                                                                                                               | `COMPLETED` |
+| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance)                                                 | `COMPLETED` |
+| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records)                                                                                                         | `IN PROGRESS` |
+| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual)                                                                             | `COMPLETED` |
 | 22 | [Snapshot Isolation using Optimistic Concurrency Control for multi-writers](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers) | `COMPLETED` |
-| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection) | `UNDER REVIEW` |
-| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `UNDER REVIEW` |
-| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` |
-| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `IN PROGRESS` |
-| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `IN PROGRESS` |
-| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `IN PROGRESS` |
-| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `IN PROGRESS` |
-| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `UNDER REVIEW` |
-| 31 | [Hive integration Improvment](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment) | `UNDER REVIEW` |
-| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi) | `IN PROGRESS` |
-| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution) | `IN PROGRESS` |
-| 34 | [Hudi BigQuery Integration (WIP)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980) | `UNDER REVIEW` |
-| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
-| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
-| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `IN PROGRESS` |
-| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `IN PROGRESS` |
-| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `IN PROGRESS` |
-| 40 | [Hudi Connector for Trino](./rfc-40/rfc-40.md) | `IN PROGRESS` |
-| 41 | [Hudi Snowflake Integration] | `UNDER REVIEW` |
-| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `UNDER REVIEW` |
-| 43 | [Compaction / Clustering Service](./rfc-43/rfc-43.md) | `UNDER REVIEW` |
-| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `UNDER REVIEW` |
-| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `IN PROGRESS` |
-| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `UNDER REVIEW` |
-| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `UNDER REVIEW` |
-| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` |
+| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection)| `ABANDONED` | 
+| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `COMPLETED` | 
+| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` | 
+| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `ONGOING` | 
+| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `ONGOING` | 
+| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `COMPLETED` |
+| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `ONGOING` | 
+| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `UNDER REVIEW` | 
+| 31 | [Hive integration Improvement](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment)| `UNDER REVIEW` | 
+| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi)| `ONGOING` | 
+| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution)| `ONGOING` | 
+| 34 | [Hudi BigQuery Integration (WIP)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980) | `IN PROGRESS` | 
+| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly)| `UNDER REVIEW` | 
+| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server)| `UNDER REVIEW` | 
+| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `IN PROGRESS` | 
+| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `IN PROGRESS` | 
+| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `ONGOING` | 
+| 40 | [Hudi Connector for Trino](./rfc-40/rfc-40.md) | `ONGOING` | 
+| 41 | [Hudi Snowflake Integration] | `UNDER REVIEW`| 
+| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `IN PROGRESS` | 
+| 43 | [Compaction / Clustering Service](./rfc-43/rfc-43.md) | `UNDER REVIEW` | 
+| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `UNDER REVIEW` | 
+| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `ONGOING` | 
+| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `UNDER REVIEW` | 
+| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `UNDER REVIEW` | 
+| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` | 
+| 49 | [Support sync with DataHub](./rfc-49/rfc-49.md)    | `ONGOING` |
--- a/rfc/rfc-49/DatasetProfile.png
+++ b/rfc/rfc-49/DatasetProfile.png
--- a/rfc/rfc-49/DatasetProperties.png
+++ b/rfc/rfc-49/DatasetProperties.png
--- a/rfc/rfc-49/SchemaMetadata.png
+++ b/rfc/rfc-49/SchemaMetadata.png
--- a/rfc/rfc-49/rfc-49.md
+++ b/rfc/rfc-49/rfc-49.md
@@ -0,0 +1,90 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-49: Support sync with DataHub
+
+
+## Proposers
+- @xushiyan
+
+## Approvers
+ - @vinothchandar 
+ - @Sivabalan
+
+## Status
+
+JIRA: [HUDI-3468](https://issues.apache.org/jira/browse/HUDI-3468)
+
+## Overview
+
+Support sync with [DataHub](https://datahubproject.io/) to provide rich metadata capabilities for Hudi tables.
+
+> DataHub is an open-source metadata platform for the modern data stack.
+
+Read more in https://datahubproject.io/docs/#introduction
+
+## Implementation
+
+To sync with DataHub, we can make use of existing `hudi-sync` abstraction by extending `org.apache.hudi.sync.common.AbstractSyncTool`.
+
+The sync mechanism can be implemented via [Java Emitter](https://datahubproject.io/docs/metadata-integration/java/as-a-library). The main work is about 
+
+- take in user's configurations to connect to an existing DataHub instance
+- compose desired metadata for sync based on [DataHub's metadata model](https://datahubproject.io/docs/metadata-modeling/metadata-model)
+
+## Configurations
+
+Necessary configurations will be added using pattern `hoodie.sync.datahub.*` to connect to the user-operated DataHub instance.
+
+## Metadata Model
+
+A Hudi table maps to a [Dataset entity](https://datahubproject.io/docs/generated/metamodel/entities/dataset) in DataHub.
+
+#### Identifier
+
+A Dataset can be identified by urn consists of [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataplatform) (default `hudi`), table identifier (`<db>.<table>`), and optional/configurable environment suffix. An example:
+
+```text
+urn:li:dataset:(urn:li:dataPlatform:hudi,mydb,mytable,prod)
+```
+
+#### Schema
+
+Schema can be sync'ed via the `SchemaMetadata` aspect. `platformSchema` (raw schema) will be sync'ed using the avro schema string persisted in the commit metadata.
+
+![](SchemaMetadata.png)
+
+#### Dataset Properties
+
+Key-value table properties, e.g., last sync'ed commit timestamp, can be sync'ed via the `DatasetProperties` aspect.
+
+![](DatasetProperties.png)
+
+#### Column Stats
+
+Column stats, e.g., min/max value of selected fields, can be retrieved from Hudi metadata table's column stats partition, and sync'ed via the `fieldProfiles` of `DatasetProfile` aspect.
+
+![](DatasetProfile.png)
+
+## Rollout/Adoption Plan
+
+This is a new feature to be enabled by configuration. Users can choose to turn on or off at any time. This feature won't interfere with existing Hudi tables' operations.
+
+## Test Plan
+
+- Unit tests
+- Run a PoC setup with DataHub integration to verify the desired metadata are sync'ed