diff --git a/rfc/README.md b/rfc/README.md
index 4d8aba380..bc3bfde37 100644
--- a/rfc/README.md
+++ b/rfc/README.md
@@ -14,61 +14,74 @@
 See the License for the specific language governing permissions and
 limitations under the License.
 -->
+
 # RFCs
- - The RFC process is documented on our [site](https://hudi.apache.org/contribute/rfc-process). Please familiarize yourself with it, before working a new RFC.
- - Status can be one of these values: `UNDER REVIEW` (or) `IN PROGRESS` (or) `ABANDONED` (or) `COMPLETED`.
+
+The RFC process is documented on our [site](https://hudi.apache.org/contribute/rfc-process). Please familiarize yourself
+with it, before working a new RFC.
+
+Status can be one of these values.
+
+| Status | Meaning |
+| -------|-------------------------------------------------------|
+| `UNDER REVIEW` | RFC has been proposed and community is actively debating the design/proposal. |
+| `IN PROGRESS` | The initial phase of implementation is underway. |
+| `ONGOING` | Some or most work has landed; community continues to improve or build follow on phases. |
+| `ABANDONED` | The proposal was not implemented, due to various reasons. |
+| `COMPLETED` | All work is deemed complete. |
 
 The list of all RFCs can be found here.
 
 > Older RFC content is still [here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).
 
-| RFC Number | Title | Status |
-| ---| ---| --- |
-| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer) | `COMPLETED` |
-| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439) | `IN PROGRESS` |
-| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965) | `COMPLETED` |
-| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622) | `COMPLETED` |
-| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233) | `ABANDONED` |
-| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file) | `ABANDONED` |
-| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) | `COMPLETED` |
-| 8 | [Record level indexing mechanisms for Hudi datasets](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets) | `UNDER REVIEW` |
-| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter) | `COMPLETED` |
-| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs) | `COMPLETED` |
-| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project) | `ABANDONED` |
-| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) | `COMPLETED` |
-| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520) | `COMPLETED` |
-| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller) | `COMPLETED` |
-| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) | `COMPLETED` |
-| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader) | `COMPLETED` |
-| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service) | `COMPLETED` |
-| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API) | `COMPLETED` |
-| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) | `COMPLETED` |
-| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records) | `IN PROGRESS` |
-| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual) | `COMPLETED` |
+| RFC Number | Title | Status |
+| ---|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --- |
+| 1 | [CSV Source Support for Delta Streamer](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+01+%3A+CSV+Source+Support+for+Delta+Streamer) | `COMPLETED` |
+| 2 | [ORC Storage in Hudi](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708439) | `ONGOING` |
+| 3 | [Timeline Service with Incremental File System View Syncing](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113708965) | `COMPLETED` |
+| 4 | [Faster Hive incremental pull queries](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=115513622) | `COMPLETED` |
+| 5 | [HUI (Hudi WebUI)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233) | `ABANDONED` |
+| 6 | [Add indexing support to the log file](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+06+%3A+Add+indexing+support+to+the+log+file) | `ABANDONED` |
+| 7 | [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table) | `COMPLETED` |
+| 8 | [Record level indexing mechanisms for Hudi datasets](https://cwiki.apache.org/confluence/display/HUDI/RFC-08++Record+level+indexing+mechanisms+for+Hudi+datasets) | `ONGOING` |
+| 9 | [Hudi Dataset Snapshot Exporter](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter) | `COMPLETED` |
+| 10 | [Restructuring and auto-generation of docs](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+10+%3A+Restructuring+and+auto-generation+of+docs) | `COMPLETED` |
+| 11 | [Refactor of the configuration framework of hudi project](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project) | `ABANDONED` |
+| 12 | [Efficient Migration of Large Parquet Tables to Apache Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi) | `COMPLETED` |
+| 13 | [Integrate Hudi with Flink](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=141724520) | `COMPLETED` |
+| 14 | [JDBC incremental puller](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller) | `COMPLETED` |
+| 15 | [HUDI File Listing Improvements](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) | `COMPLETED` |
+| 16 | [Abstraction for HoodieInputFormat and RecordReader](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader) | `COMPLETED` |
+| 17 | [Abstract common meta sync module support multiple meta service](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+17+Abstract+common+meta+sync+module+support+multiple+meta+service) | `COMPLETED` |
+| 18 | [Insert Overwrite API](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+18+Insert+Overwrite+API) | `COMPLETED` |
+| 19 | [Clustering data for freshness and query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance) | `COMPLETED` |
+| 20 | [handle failed records](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+20+%3A+handle+failed+records) | `IN PROGRESS` |
+| 21 | [Allow HoodieRecordKey to be Virtual](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+21+%3A+Allow+HoodieRecordKey+to+be+Virtual) | `COMPLETED` |
 | 22 | [Snapshot Isolation using Optimistic Concurrency Control for multi-writers](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+22+%3A+Snapshot+Isolation+using+Optimistic+Concurrency+Control+for+multi-writers) | `COMPLETED` |
-| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection) | `UNDER REVIEW` |
-| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `UNDER REVIEW` |
-| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` |
-| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `IN PROGRESS` |
-| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `IN PROGRESS` |
-| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `IN PROGRESS` |
-| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `IN PROGRESS` |
-| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `UNDER REVIEW` |
-| 31 | [Hive integration Improvment](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment) | `UNDER REVIEW` |
-| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi) | `IN PROGRESS` |
-| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution) | `IN PROGRESS` |
-| 34 | [Hudi BigQuery Integration (WIP)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980) | `UNDER REVIEW` |
-| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly) | `UNDER REVIEW` |
-| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server) | `UNDER REVIEW` |
-| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `IN PROGRESS` |
-| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `IN PROGRESS` |
-| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `IN PROGRESS` |
-| 40 | [Hudi Connector for Trino](./rfc-40/rfc-40.md) | `IN PROGRESS` |
-| 41 | [Hudi Snowflake Integration] | `UNDER REVIEW` |
-| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `UNDER REVIEW` |
-| 43 | [Compaction / Clustering Service](./rfc-43/rfc-43.md) | `UNDER REVIEW` |
-| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `UNDER REVIEW` |
-| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `IN PROGRESS` |
-| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `UNDER REVIEW` |
-| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `UNDER REVIEW` |
-| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` |
\ No newline at end of file
+| 23 | [Hudi Observability metrics collection](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+23+%3A+Hudi+Observability+metrics+collection)| `ABANDONED` |
+| 24 | [Hoodie Flink Writer Proposal](https://cwiki.apache.org/confluence/display/HUDI/RFC-24%3A+Hoodie+Flink+Writer+Proposal) | `COMPLETED` |
+| 25 | [Spark SQL Extension For Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+25%3A+Spark+SQL+Extension+For+Hudi) | `COMPLETED` |
+| 26 | [Optimization For Hudi Table Query](https://cwiki.apache.org/confluence/display/HUDI/RFC-26+Optimization+For+Hudi+Table+Query) | `ONGOING` |
+| 27 | [Data skipping index to improve query performance](https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance) | `ONGOING` |
+| 28 | [Support Z-order curve](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181307144) | `COMPLETED` |
+| 29 | [Hash Index](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index) | `ONGOING` |
+| 30 | [Batch operation](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+30%3A+Batch+operation) | `UNDER REVIEW` |
+| 31 | [Hive integration Improvement](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+31%3A+Hive+integration+Improvment)| `UNDER REVIEW` |
+| 32 | [Kafka Connect Sink for Hudi](https://cwiki.apache.org/confluence/display/HUDI/RFC-32+Kafka+Connect+Sink+for+Hudi)| `ONGOING` |
+| 33 | [Hudi supports more comprehensive Schema Evolution](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution)| `ONGOING` |
+| 34 | [Hudi BigQuery Integration (WIP)](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980) | `IN PROGRESS` |
+| 35 | [Make Flink MOR table writing streaming friendly](https://cwiki.apache.org/confluence/display/HUDI/RFC-35%3A+Make+Flink+MOR+table+writing+streaming+friendly)| `UNDER REVIEW` |
+| 36 | [HUDI Metastore Server](https://cwiki.apache.org/confluence/display/HUDI/%5BWIP%5D+RFC-36%3A+HUDI+Metastore+Server)| `UNDER REVIEW` |
+| 37 | [Hudi Metadata based Bloom Index](rfc-37/rfc-37.md) | `IN PROGRESS` |
+| 38 | [Spark Datasource V2 Integration](./rfc-38/rfc-38.md) | `IN PROGRESS` |
+| 39 | [Incremental source for Debezium](./rfc-39/rfc-39.md) | `ONGOING` |
+| 40 | [Hudi Connector for Trino](./rfc-40/rfc-40.md) | `ONGOING` |
+| 41 | [Hudi Snowflake Integration] | `UNDER REVIEW`|
+| 42 | [Consistent Hashing Index](./rfc-42/rfc-42.md) | `IN PROGRESS` |
+| 43 | [Compaction / Clustering Service](./rfc-43/rfc-43.md) | `UNDER REVIEW` |
+| 44 | [Hudi Connector for Presto](./rfc-44/rfc-44.md) | `UNDER REVIEW` |
+| 45 | [Asynchronous Metadata Indexing](./rfc-45/rfc-45.md) | `ONGOING` |
+| 46 | [Optimizing Record Payload Handling](./rfc-46/rfc-46.md) | `UNDER REVIEW` |
+| 47 | [Add Call Produce Command for Spark SQL](./rfc-47/rfc-47.md) | `UNDER REVIEW` |
+| 48 | [LogCompaction for MOR tables](./rfc-48/rfc-48.md) | `UNDER REVIEW` |
+| 49 | [Support sync with DataHub](./rfc-49/rfc-49.md) | `ONGOING` |
diff --git a/rfc/rfc-49/DatasetProfile.png b/rfc/rfc-49/DatasetProfile.png
new file mode 100644
index 000000000..0f9fde187
Binary files /dev/null and b/rfc/rfc-49/DatasetProfile.png differ
diff --git a/rfc/rfc-49/DatasetProperties.png b/rfc/rfc-49/DatasetProperties.png
new file mode 100644
index 000000000..f3e584d7d
Binary files /dev/null and b/rfc/rfc-49/DatasetProperties.png differ
diff --git a/rfc/rfc-49/SchemaMetadata.png b/rfc/rfc-49/SchemaMetadata.png
new file mode 100644
index 000000000..8af653258
Binary files /dev/null and b/rfc/rfc-49/SchemaMetadata.png differ
diff --git a/rfc/rfc-49/rfc-49.md b/rfc/rfc-49/rfc-49.md
new file mode 100644
index 000000000..675ccf444
--- /dev/null
+++ b/rfc/rfc-49/rfc-49.md
@@ -0,0 +1,90 @@
+
+
+# RFC-49: Support sync with DataHub
+
+
+## Proposers
+- @xushiyan
+
+## Approvers
+ - @vinothchandar
+ - @Sivabalan
+
+## Status
+
+JIRA: [HUDI-3468](https://issues.apache.org/jira/browse/HUDI-3468)
+
+## Overview
+
+Support sync with [DataHub](https://datahubproject.io/) to provide rich metadata capabilities for Hudi tables.
+
+> DataHub is an open-source metadata platform for the modern data stack.
+
+Read more in https://datahubproject.io/docs/#introduction
+
+## Implementation
+
+To sync with DataHub, we can make use of the existing `hudi-sync` abstraction by extending `org.apache.hudi.sync.common.AbstractSyncTool`.
+
+The sync mechanism can be implemented via the [Java Emitter](https://datahubproject.io/docs/metadata-integration/java/as-a-library). The main work is to:
+
+- take in the user's configurations to connect to an existing DataHub instance
+- compose the desired metadata for sync based on [DataHub's metadata model](https://datahubproject.io/docs/metadata-modeling/metadata-model)
+
+## Configurations
+
+Necessary configurations will be added using the pattern `hoodie.sync.datahub.*` to connect to the user-operated DataHub instance.
+
+## Metadata Model
+
+A Hudi table maps to a [Dataset entity](https://datahubproject.io/docs/generated/metamodel/entities/dataset) in DataHub.
+
+#### Identifier
+
+A Dataset is identified by a URN consisting of a [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataplatform) (default `hudi`), a table identifier (`.`), and an optional, configurable environment suffix. An example:
+
+```text
+urn:li:dataset:(urn:li:dataPlatform:hudi,mydb,mytable,prod)
+```
+
+#### Schema
+
+The schema can be sync'ed via the `SchemaMetadata` aspect. `platformSchema` (raw schema) will be sync'ed using the avro schema string persisted in the commit metadata.
+
+![](SchemaMetadata.png)
+
+#### Dataset Properties
+
+Key-value table properties, e.g., the last sync'ed commit timestamp, can be sync'ed via the `DatasetProperties` aspect.
+
+![](DatasetProperties.png)
+
+#### Column Stats
+
+Column stats, e.g., min/max values of selected fields, can be retrieved from the Hudi metadata table's column stats partition, and sync'ed via the `fieldProfiles` of the `DatasetProfile` aspect.
+
+![](DatasetProfile.png)
+
+## Rollout/Adoption Plan
+
+This is a new feature to be enabled by configuration. Users can turn it on or off at any time. This feature won't interfere with existing Hudi tables' operations.
+
+## Test Plan
+
+- Unit tests
+- Run a PoC setup with DataHub integration to verify that the desired metadata is sync'ed