To support object storage, these changes are needed:
* Use the Hadoop filesystem API so that plugin filesystems can be discovered
* Do not fetch the file size until the file handle is closed
* Do not close the opened filesystem, because it is a shared instance from the
filesystem cache
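The last point is subtle: a Hadoop-style filesystem cache hands out one shared handle per scheme, so closing "your" handle invalidates everyone else's. A minimal sketch of that cache behavior (class and method names here are illustrative, not Hadoop's actual internals):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a Hadoop-style filesystem cache: get(scheme) returns
// one shared handle per scheme, so callers must not close what they get.
public class FsCacheSketch {
  static class FileSystemHandle {
    final String scheme;
    boolean closed = false;
    FileSystemHandle(String scheme) { this.scheme = scheme; }
    void close() { closed = true; } // closing would break every other user
  }

  private static final Map<String, FileSystemHandle> CACHE = new HashMap<>();

  // Same scheme -> same cached instance, like FileSystem.get(conf).
  public static FileSystemHandle get(String scheme) {
    return CACHE.computeIfAbsent(scheme, FileSystemHandle::new);
  }

  public static void main(String[] args) {
    FileSystemHandle a = get("s3a");
    FileSystemHandle b = get("s3a");
    System.out.println(a == b); // shared instance: true
    // a.close(); // would invalidate b as well -- hence "do not close"
  }
}
```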
A Flink SQL table's DDL defines the table schema, so we can use it to infer
the Avro schema; there is no longer any need to declare an Avro schema
explicitly. We still keep the config option for an explicit Avro schema in
case there are corner cases where the inferred schema is not correct
(especially around nullability).
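Nullability is the usual trouble spot because a nullable SQL column maps to an Avro union with `"null"`. A sketch of that mapping (the type names and function are assumptions for illustration, not the actual inference code):

```java
public class AvroTypeSketch {
  // Illustrative mapping from a Flink SQL type name to an Avro type string.
  // Real inference walks the DDL's logical types; names here are assumptions.
  public static String toAvroType(String sqlType, boolean nullable) {
    final String avro;
    switch (sqlType) {
      case "INT":    avro = "\"int\"";    break;
      case "BIGINT": avro = "\"long\"";   break;
      case "DOUBLE": avro = "\"double\""; break;
      case "STRING": avro = "\"string\""; break;
      default: throw new IllegalArgumentException("unmapped type: " + sqlType);
    }
    // A nullable column becomes a union with "null" -- exactly where an
    // inferred schema can disagree with a hand-written one.
    return nullable ? "[\"null\", " + avro + "]" : avro;
  }

  public static void main(String[] args) {
    System.out.println(toAvroType("BIGINT", true)); // ["null", "long"]
  }
}
```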
Supports two read modes:
* Read the full data set starting from the latest commit instant, then the
subsequent incremental data sets
* Read the data set starting from a specified commit instant
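The two modes differ only in which commit instant reading starts from. A small sketch of that choice (the enum and method names are illustrative, not the actual option names):

```java
public class ReadModeSketch {
  // The two modes described above (names are illustrative).
  enum ReadMode { LATEST_AND_INCREMENTAL, FROM_SPECIFIED_INSTANT }

  // Picks the first commit instant to read from. Instants are the usual
  // lexicographically ordered timestamps, e.g. "20210301120000".
  public static String startInstant(ReadMode mode,
                                    String latestInstant,
                                    String specifiedInstant) {
    switch (mode) {
      case LATEST_AND_INCREMENTAL: return latestInstant;
      case FROM_SPECIFIED_INSTANT: return specifiedInstant;
      default: throw new IllegalStateException();
    }
  }

  public static void main(String[] args) {
    System.out.println(startInstant(
        ReadMode.FROM_SPECIFIED_INSTANT, "20210301120000", "20210201000000"));
  }
}
```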
- #initializeState executes before #open, so #checkPartitionsLoaded may see a
null `initialPartitionsToLoad`
- Only load the partitions that already exist
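The ordering issue above can be sketched as a null guard: any field populated in `#open` may still be null when `#checkPartitionsLoaded` runs from `#initializeState`. A minimal, self-contained sketch of the fix (class shape and semantics are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the ordering bug: Flink calls initializeState() before open(),
// so any field populated in open() may still be null inside
// checkPartitionsLoaded() when it runs from initializeState().
public class PartitionLoadSketch {
  private List<String> initialPartitionsToLoad; // set in open(), not before

  public void open(List<String> existingPartitions) {
    // Only load the partitions that already exist.
    this.initialPartitionsToLoad = new ArrayList<>(existingPartitions);
  }

  public boolean checkPartitionsLoaded() {
    // Null guard: tolerate being called from initializeState() before open().
    if (initialPartitionsToLoad == null) {
      return false;
    }
    return initialPartitionsToLoad.isEmpty();
  }
}
```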
This is step 2 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal
This PR introduces a BucketAssigner that assigns a bucket ID (partition path
& fileID) to each stream record. The subsequent pipeline no longer needs to
look up the index or partition the records: the write target location is
decided before the write, and each record's location is computed when the
BucketAssigner receives it, so indexing happens in a streaming style.
Computing locations for a whole batch of records at once is
resource-intensive and puts pressure on the engine, which a streaming system
should avoid.
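The per-record assignment can be sketched as follows; the hashing scheme and class names are assumptions for illustration, not the actual BucketAssigner logic:

```java
// Sketch of per-record bucket assignment: each record's write location
// (partition path & fileID) is decided as it arrives, with no batch-wide
// index lookup -- the "streaming style" indexing described above.
public class BucketAssignerSketch {
  static final class BucketInfo {
    final String partitionPath;
    final String fileId;
    BucketInfo(String partitionPath, String fileId) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
    }
  }

  private final int bucketsPerPartition;

  public BucketAssignerSketch(int bucketsPerPartition) {
    this.bucketsPerPartition = bucketsPerPartition;
  }

  // Deterministic: the same record key always lands in the same bucket.
  public BucketInfo assign(String recordKey, String partitionPath) {
    int bucket = Math.floorMod(recordKey.hashCode(), bucketsPerPartition);
    return new BucketInfo(partitionPath, partitionPath + "-file-" + bucket);
  }
}
```

Because the assignment is a pure function of the record, it costs O(1) per record instead of a batch-wide indexing pass.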
* Added the ability to pass `properties` into payload methods, so they can
perform table- or record-specific merges
* Added default methods so that existing payload classes stay backwards
compatible
* Added DefaultHoodiePayload to honor the ordering field while merging two
records
* Fixed the default payload based on feedback
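The backwards-compatibility trick is a default method that delegates the new properties-aware signature to the old one. A sketch under simplified types (the interface, signatures, and ordering encoding are assumptions, not Hudi's actual payload API):

```java
import java.util.Properties;

// Sketch: a default method keeps old payload classes compiling unchanged
// while new ones can receive per-table Properties during merge.
public class PayloadSketch {
  interface Payload {
    // Original signature that existing payload classes implement.
    String combine(String older, String newer);

    // New properties-aware overload; the default delegates to the old
    // method, so existing implementations keep working unchanged.
    default String combine(String older, String newer, Properties props) {
      return combine(older, newer);
    }
  }

  // A payload that honors an ordering value when merging, in the spirit of
  // DefaultHoodiePayload. Records here are "value:orderingField" strings,
  // purely for illustration.
  static class OrderingPayload implements Payload {
    @Override
    public String combine(String older, String newer) {
      return newer; // old behavior: last write wins
    }

    @Override
    public String combine(String older, String newer, Properties props) {
      long olderOrd = Long.parseLong(older.split(":")[1]);
      long newerOrd = Long.parseLong(newer.split(":")[1]);
      return newerOrd >= olderOrd ? newer : older; // highest ordering wins
    }
  }
}
```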