yuzhaojing
bc18c39835
[FLINK-1923] Exactly-once write for flink writer ( #3002 )
...
Co-authored-by: 喻兆靖 <yuzhaojing@bilibili.com >
2021-05-28 14:58:21 +08:00
Danny Chan
7fed7352bd
[HUDI-1865] Make embedded time line service singleton ( #2899 )
2021-05-27 13:38:33 +08:00
wangxianghu
e7020748b5
[HUDI-1920] Set archived as the default value of HOODIE_ARCHIVELOG_FOLDER_PROP_NAME ( #2978 )
2021-05-25 16:29:55 +08:00
Town
aba1eadbfc
[HUDI-1919] Type mismatch when streaming read copy_on_write table using flink ( #2986 )
...
* [HUDI-1919] Type mismatch when streaming read copy_on_write table using flink #2976
* Update ParquetSplitReaderUtil.java
2021-05-25 11:36:43 +08:00
zhangminglei
99b14a78e3
[HUDI-1918] Fix incorrect keyBy field cause serious data skew, to avoid multiple subtasks write to a partition at the same time ( #2972 )
2021-05-21 21:59:47 +08:00
swuferhong
928b09ea0b
[HUDI-1871] Fix hive conf for Flink writer hive meta sync ( #2968 )
2021-05-20 17:03:52 +08:00
Danny Chan
9b01d2f864
[HUDI-1915] Fix the file id for write data buffer before flushing ( #2966 )
2021-05-20 10:20:08 +08:00
Danny Chan
7d2971d4e2
[HUDI-1911] Reuse the partition path and file group id for flink write data buffer ( #2961 )
...
Reuse to reduce memory footprint.
2021-05-18 17:47:22 +08:00
Danny Chan
46a2399a45
[HUDI-1902] Global index for flink writer ( #2958 )
...
Supports deduplication for record keys with different partition path.
2021-05-18 13:55:38 +08:00
Danny Chan
8869b3b418
[HUDI-1902] Clean the corrupted files generated by FlinkMergeAndReplaceHandle ( #2949 )
...
Make the intermediate files of FlinkMergeAndReplaceHandle hidden, when
committing the instant, clean these files in case there was some
corrupted files left(in normal case, the intermediate files should be cleaned
by the FlinkMergeAndReplaceHandle itself).
2021-05-14 15:43:37 +08:00
Danny Chan
ad77cf42ba
[HUDI-1900] Always close the file handle for a flink mini-batch write ( #2943 )
...
Close the file handle eagerly to avoid corrupted files as much as
possible.
2021-05-14 10:25:18 +08:00
Danny Chan
b98c9ab439
[HUDI-1895] Close the file handles gracefully for flink write function to avoid corrupted files ( #2938 )
2021-05-12 18:44:10 +08:00
TeRS-K
be9db2c4f5
[HUDI-1055] Remove hardcoded parquet in tests ( #2740 )
...
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils
Co-authored-by: Vinoth Chandar <vinoth@apache.org >
2021-05-11 10:01:45 -07:00
hiscat
7a5af806cf
[HUDI-1818] Validate required fields for Flink HoodieTable ( #2930 )
2021-05-11 11:11:19 +08:00
hiscat
511ac4881d
[MINOR] optimize FilePathUtils ( #2931 )
2021-05-10 06:47:56 -07:00
Danny Chan
c1b331bcff
[HUDI-1886] Avoid to generates corrupted files for flink sink ( #2929 )
2021-05-10 10:43:03 +08:00
Danny Chan
bfbf993cbe
[HUDI-1878] Add max memory option for flink writer task ( #2920 )
...
Also removes the rate limiter because it has the similar functionality,
modify the create and merge handle cleans the retry files automatically.
2021-05-08 14:27:56 +08:00
Danny Chan
528f4ca988
[HUDI-1880] Support streaming read with compaction and cleaning ( #2921 )
2021-05-07 20:04:35 +08:00
hiscat
0a5863939b
[HUDI-1821] Remove legacy code for Flink writer ( #2868 )
2021-05-07 10:58:49 +08:00
dijie
c5220b96e9
[HUDI-1781] Fix Flink streaming reader throws ClassCastException ( #2900 )
2021-05-01 19:13:15 +08:00
Danny Chan
6848a683bd
[HUDI-1867] Streaming read for Flink COW table ( #2895 )
...
Supports streaming read for Copy On Write table.
2021-04-29 20:44:45 +08:00
Danny Chan
6e9c5dd765
[HUDI-1863] Add rate limiter to Flink writer to avoid OOM for bootstrap ( #2891 )
2021-04-29 20:32:10 +08:00
hiscat
63fa2b6186
[HUDI-1836] Logging consuming instant to StreamReadOperator#processSplits ( #2867 )
2021-04-27 14:00:59 +08:00
Danny Chan
5be3997f70
[HUDI-1841] Tweak the min max commits to keep when setting up cleaning retain commits for Flink ( #2875 )
2021-04-27 10:58:06 +08:00
Danny Chan
d047e91d86
[HUDI-1837] Add optional instant range to log record scanner for log ( #2870 )
2021-04-26 16:53:18 +08:00
Danny Chan
1b27259b53
[HUDI-1844] Add option to flush when total buckets memory exceeds the threshold ( #2877 )
...
Current code supports flushing as per-bucket memory usage, while the
buckets may still take too much memory for bootstrap from history data.
When the threshold hits, flush out half of the buckets with bigger
buffer size.
2021-04-25 23:06:53 +08:00
Danny Chan
a5789c4067
[HUDI-1829] Use while loop instead of recursive call in MergeOnReadInputFormat to avoid StackOverflow ( #2862 )
...
Recursive all is risky for StackOverflow when there are too many.
2021-04-23 09:59:36 +08:00
hiscat
cc81ddde01
[HUDI-1812] Add explicit index state TTL option for Flink writer ( #2853 )
2021-04-21 20:13:30 +08:00
Danny Chan
ac3589f006
[HUDI-1814] Non partitioned table for Flink writer ( #2859 )
2021-04-21 20:07:27 +08:00
Danny Chan
d6d52c6063
[HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type ( #2846 )
2021-04-20 21:27:09 +08:00
hj2016
62b8a341dd
[HUDI-1792] flink-client query error when processing files larger than 128mb ( #2814 )
...
Co-authored-by: huangjing <huangjing@clinbrain.com >
2021-04-16 13:59:19 +08:00
Danny Chan
b6d949b48a
[HUDI-1801] FlinkMergeHandle rolling over may miss to rename the latest file handle ( #2831 )
...
The FlinkMergeHandle may rename the N-1 th file handle instead of the
latest one, thus to cause data duplication.
2021-04-16 11:40:53 +08:00
hiscat
6d1aec604f
[HUDI-1798] Flink streaming reader should always monitor the delta commits files ( #2825 )
...
The streaming reader should only monitor the delta log files, if there are parquet commits but we recognize as logs, the reader would report FileNotFound exception.
2021-04-15 13:50:17 +08:00
Danny Chan
ab4a7b0b4a
[HUDI-1788] Insert overwrite (table) for Flink writer ( #2808 )
...
Supports `INSERT OVERWRITE` and `INSERT OVERWRITE TABLE` for Flink
writer.
2021-04-14 10:23:37 +08:00
Danny Chan
1ff99ca7d7
[HUDI-1786] Add option for merge max memory ( #2805 )
2021-04-12 17:03:58 +08:00
wangxianghu
040756d8c0
[HUDI-1785] Move OperationConverter to hudi-client-common for code reuse ( #2798 )
2021-04-12 16:22:33 +08:00
Danny Chan
6786581c48
[HUDI-1775] Add option for compaction parallelism ( #2785 )
2021-04-09 13:46:19 +08:00
hiscat
5b3608f149
[HUDI-1778] Add setter to CompactionPlanEvent and CompactionCommitEvent to have better SE/DE performance for Flink ( #2789 )
2021-04-08 19:40:37 +08:00
Danny Chan
9c369c607d
[HUDI-1757] Assigns the buckets by record key for Flink writer ( #2757 )
...
Currently we assign the buckets by record partition path which could
cause hotspot if the partition field is datetime type. Changes to assign
buckets by grouping the record whth their key first, the assignment is
valid if only there is no conflict(two task write to the same bucket).
This patch also changes the coordinator execution to be asynchronous.
2021-04-06 19:06:41 +08:00
Danny Chan
9804662bc8
[HUDI-1738] Emit deletes for flink MOR table streaming read ( #2742 )
...
Current we did a soft delete for DELETE row data when writes into hoodie
table. For streaming read of MOR table, the Flink reader detects the
delete records and still emit them if the record key semantics are still
kept.
This is useful and actually a must for streaming ETL pipeline
incremental computation.
2021-04-01 15:25:31 +08:00
Danny Chan
d415d45416
[HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer ( #2732 )
2021-03-29 10:47:29 +08:00
Danny Chan
8b774fe331
[HUDI-1495] Bump Flink version to 1.12.2 ( #2718 )
2021-03-26 14:25:57 +08:00
garyli1019
6e803e08b1
Moving to 0.9.0-SNAPSHOT on master branch.
2021-03-24 21:37:14 +08:00
Danny Chan
29b79c99b0
[hotfix] Log the error message for creating table source first ( #2711 )
2021-03-24 18:25:37 +08:00
Danny Chan
03668dbaf1
[HUDI-1710] Read optimized query type for Flink batch reader ( #2702 )
...
Read optimized query returns the records from:
* COW table: the latest parquet files
* MOR table: parquet file records from the latest compaction committed
2021-03-23 18:41:30 -07:00
Danny Chan
f74828fca1
[HUDI-1705] Flush as per data bucket for mini-batch write ( #2695 )
...
Detects the buffer size for each data bucket before flushing. So that we
avoid flushing data buckets with few records.
2021-03-19 16:30:54 +08:00
Danny Chan
f1e0018f12
[HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table ( #2694 )
...
The SQL PRIMARY KEY semantics is very same with Hoodie record key, using
PRIMARY KEY is more straight-forward way instead of a table option:
hoodie.datasource.write.recordkey.field.
After this change, both PRIMARY KEY and table option can define hoodie
record key, while the PRIMARY KEY has higher priority if both are
defined.
Note: a column with PRIMARY KEY constraint is forced to be non-nullable.
2021-03-18 20:21:52 +08:00
Danny Chan
968488fa3a
[HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down ( #2690 )
...
We should implement the interface HoodieTableSource.explainSource to
track the table source signature diff for all kinds of pushing down,
such as filter pushing or limit pushing.
2021-03-17 23:05:18 +08:00
Danny Chan
76bf2cc790
[HUDI-1692] Bounded source for stream writer ( #2674 )
...
Supports bounded source such as VALUES for stream mode writer.
2021-03-15 19:42:36 +08:00
Danny Chan
fc6c5f4285
[HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module ( #2669 )
...
* Add required dependencies for hudi-flink-bundle module
* Some packages reorganization of hudi-flink module
2021-03-15 16:02:05 +08:00