A Flink SQL table has DDL that defines the table schema, we can use that
to infer the Avro schema and there is no need to declare a Avro schema
explicitly anymore.
But we still keep the config option for explicit Avro schema in case
there is corner cases that the inferred schema is not correct
(especially for the nullability).
Supports two read modes:
* Read the full data set starting from the latest commit instant and
subsequent incremental data set
* Read data set that starts from a specified commit instant
This is the #step 2 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal
This PR introduce a BucketAssigner that assigns bucket ID (partition
path & fileID) for each stream record.
There is no need to look up index and partition the records anymore in
the following pipeline for these records,
we actually decide the write target location before the write and each
record computes its location when the BucketAssigner receives it, thus,
the indexing is with streaming style.
Computing locations for a batch of records all at a time is resource
consuming so a pressure to the engine,
we should avoid that in streaming system.