3.2 KiB
RFC-49: Support sync with DataHub
Proposers
- @xushiyan
Approvers
- @vinothchandar
- @Sivabalan
Status
JIRA: HUDI-3468
Overview
Support sync with DataHub to provide rich metadata capabilities for Hudi tables.
DataHub is an open-source metadata platform for the modern data stack.
Read more in https://datahubproject.io/docs/#introduction
Implementation
To sync with DataHub, we can make use of existing hudi-sync abstraction by extending org.apache.hudi.sync.common.AbstractSyncTool.
The sync mechanism can be implemented via Java Emitter. The main work is about
- take in user's configurations to connect to an existing DataHub instance
- compose desired metadata for sync based on DataHub's metadata model
Configurations
Necessary configurations will be added using pattern hoodie.sync.datahub.* to connect to the user-operated DataHub instance.
Metadata Model
A Hudi table maps to a Dataset entity in DataHub.
Identifier
A Dataset can be identified by urn consists of Data Platform (default hudi), table identifier (<db>.<table>), and optional/configurable environment suffix. An example:
urn:li:dataset:(urn:li:dataPlatform:hudi,mydb,mytable,prod)
Schema
Schema can be sync'ed via the SchemaMetadata aspect. platformSchema (raw schema) will be sync'ed using the avro schema string persisted in the commit metadata.
Dataset Properties
Key-value table properties, e.g., last sync'ed commit timestamp, can be sync'ed via the DatasetProperties aspect.
Column Stats
Column stats, e.g., min/max value of selected fields, can be retrieved from Hudi metadata table's column stats partition, and sync'ed via the fieldProfiles of DatasetProfile aspect.
Rollout/Adoption Plan
This is a new feature to be enabled by configuration. Users can choose to turn on or off at any time. This feature won't interfere with existing Hudi tables' operations.
Test Plan
- Unit tests
- Run a PoC setup with DataHub integration to verify the desired metadata are sync'ed


