183 Commits

Author SHA1 Message Date
Shiyan Xu
eee6a02f77 [HUDI-4456] Clean up test resources (#6203) 2022-07-25 10:13:06 -05:00
Rahil C
af10a97e7a [HUDI-4435] Fix Avro field not found issue introduced by Avro 1.10 (#6155)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-22 17:26:16 -07:00
Rahil C
2bf7920bd9 [MINOR] Add logger for HoodieCopyOnWriteTableInputFormat (#6161)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2022-07-21 22:27:18 +05:30
Alexey Kudinkin
a33bdd32e3 [HUDI-3993] Replacing UDF in Bulk Insert w/ RDD transformation (#5470) 2022-07-21 06:20:47 -07:00
Sivabalan Narayanan
7da97c8096 [HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747)
- When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.
2022-06-06 15:48:21 -04:00
Heap
47b764ec33 [HUDI-4134] Fix Method naming consistency issues in FSUtils (#5655) 2022-05-23 15:28:48 -07:00
Alexey Kudinkin
f7544e23ac [HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364)
- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`'s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
- Fall back to appending partition values in cases when the source columns are not persisted in the data file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
2022-04-20 19:30:27 +08:00
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
Danny Chan
0281725c6b [MINOR] Inline the partition path logic into the builder (#5310) 2022-04-13 16:54:39 +05:30
Sagar Sumit
df87095ef0 [HUDI-3454] Fix partition name in all code paths for LogRecordScanner (#5252)
* Depend on FSUtils#getRelativePartitionPath(basePath, logFilePath.getParent) to get the partition.

* If the list of log file paths in the split is empty, then fall back to the usual behaviour.
2022-04-08 09:59:36 +05:30
董可伦
6a8396420c [HUDI-3643] Fix hive count exception when the table is empty and the path depth is less than 3 (#5051) 2022-04-07 04:21:03 -07:00
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Y Ethan Guo
eaa4c4f2e2 [HUDI-1180] Upgrade HBase to 2.4.9 (#5004)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-03-24 19:04:53 -07:00
Sagar Sumit
fe2c3989e3 [HUDI-3689] Fix glob path and hive sync in deltastreamer tests (#5117)
* Remove glob pattern basePath from the deltastreamer tests.

* [HUDI-3689] Fix file scheme config for CI failure in TestHoodieRealTimeRecordReader

Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-03-24 15:48:35 +05:30
Alexey Kudinkin
099c2c099a [HUDI-3457] Refactored Spark DataSource Relations to avoid code duplication (#4877)
Refactoring Spark DataSource Relations to avoid code duplication. 

The following Relations were in scope:

- BaseFileOnlyViewRelation
- MergeOnReadSnapshotRelation
- MergeOnReadIncrementalRelation
2022-03-18 22:32:16 -07:00
冯健
bf191f8d46 [HUDI-3645] Fix NPE caused by multiple threads accessing non-thread-safe HashMap (#5028)
- Change HashMap in HoodieROTablePathFilter to ConcurrentHashMap
2022-03-17 14:20:28 +05:30
Alexey Kudinkin
5e8ff8d793 [HUDI-3514] Rebase Data Skipping flow to rely on MT Column Stats index (#4948) 2022-03-15 10:38:36 -07:00
Aditya Tiwari
051ad0b033 [HUDI-3130] Fixing Hive getSchema for RT tables addressing different partitions having different schemas (#4468)
* Fixing Hive getSchema for RT tables

* Addressing feedback

* temp diff

* fixing tests after spark datasource read support for metadata table is merged to master

* Adding multi-partition schema evolution tests to HoodieRealTimeRecordReader

Co-authored-by: Aditya Tiwari <aditya.tiwari@flipkart.com>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
2022-03-06 07:51:35 +05:30
Raymond Xu
b4362fac45 [HUDI-3348] Add UT to verify HoodieRealtimeFileSplit serde (#4951) 2022-03-04 11:19:16 +04:00
Raymond Xu
c77b2591d0 [HUDI-2439] Remove SparkBoundedInMemoryExecutor (#4860) 2022-02-26 08:02:12 -05:00
Sagar Sumit
6a5cfb45b9 [MINOR] Fix table type in input format test (#4912) 2022-02-25 13:51:53 -05:00
Alexey Kudinkin
85e8a5c4de [HUDI-1296] Support Metadata Table in Spark Datasource (#4789)
* Bootstrapping initial support for Metadata Table in Spark Datasource

- Consolidated Avro/Row conversion utilities to center around Spark's AvroDeserializer; removed duplication
- Bootstrapped HoodieBaseRelation
- Updated HoodieMergeOnReadRDD to be able to handle Metadata Table
- Modified MOR relations to be able to read different Base File formats (Parquet, HFile)
2022-02-24 16:23:13 -05:00
Alexey Kudinkin
aaddaf524a [HUDI-3280] Cleaning up Hive-related hierarchies after refactoring (#4743) 2022-02-16 15:36:37 -08:00
Sivabalan Narayanan
ba4e732ba7 [HUDI-2987] Update all deprecated calls to new apis in HoodieRecordPayload (#4681) 2022-02-10 19:19:33 -05:00
Alexey Kudinkin
464027ec37 [HUDI-3239] Convert BaseHoodieTableFileIndex to Java (#4669)
Converting BaseHoodieTableFileIndex to Java, removing Scala as a dependency from "hudi-common"
2022-02-09 18:42:08 -05:00
Alexey Kudinkin
973087f385 [HUDI-3276] Rebased Parquet-based FileInputFormat impls to inherit from MapredParquetInputFormat (#4667)
Rebased Parquet-based FileInputFormat impls to inherit from MapredParquetInputFormat, to make sure that Hive is appropriately recognizing those impls and applying corresponding optimizations.

- Converted HoodieRealtimeFileInputFormatBase and HoodieFileInputFormatBase into standalone implementations that could be instantiated as standalone objects (which could be used for delegation)
- Renamed HoodieFileInputFormatBase > HoodieCopyOnWriteTableInputFormat, HoodieRealtimeFileInputFormatBase > HoodieMergeOnReadTableInputFormat
- Scaffolded HoodieParquetFileInputFormatBase for all Parquet impls to inherit from
- Rebased Parquet impls onto HoodieParquetFileInputFormatBase
2022-02-08 15:21:45 -05:00
Alexey Kudinkin
3f263b82ce [HUDI-3206] Unify Hive's MOR implementations to avoid duplication (#4559)
Unify Hive's MOR implementations to avoid duplication across implementations for different file formats (Parquet, HFile, etc.)

- Extracted HoodieRealtimeFileInputFormatBase (extending COW HoodieFileInputFormatBase base)
- Rebased Parquet, HFile implementations onto HoodieRealtimeFileInputFormatBase
- Tidying up
2022-02-07 14:06:28 -05:00
Y Ethan Guo
b8601a9f58 [HUDI-2656] Generalize HoodieIndex for flexible record data type (#3893)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-02-03 20:24:04 -08:00
Alexey Kudinkin
69dfcda116 [HUDI-3191] Removing duplicating file-listing process w/in Hive's MOR FileInputFormats (#4556) 2022-02-03 14:01:41 -08:00
Alexey Kudinkin
a68e1dc2db [HUDI-431] Adding support for Parquet in MOR LogBlocks (#4333)
- Adding support for Parquet in MOR tables Log blocks

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>
2022-02-02 14:35:05 -05:00
Manoj Govindassamy
f87c47352a [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field (#4449)
* [HUDI-2763] Metadata table records - support for key deduplication and virtual keys
- The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.

- HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.

- HoodieHFileReader, when reading the serialized records back from disk,
materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema field 'key' and if so
it does the key materialization in the record payload.

- Tests have been added to verify the default virtual keys and key
   deduplication support for the metadata table records.

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2022-01-26 13:34:04 -05:00
董可伦
56cd8ffae0 [HUDI-2837] Add support for using database name in incremental query (#4083) 2022-01-22 22:11:27 -08:00
Alexey Kudinkin
4bea758738 [HUDI-3191] Rebasing Hive's FileInputFormat onto AbstractHoodieTableFileIndex (#4531) 2022-01-18 14:54:51 -08:00
Yuwei XIAO
d36533735f [HUDI-3194] fix MOR snapshot query during compaction (#4540) 2022-01-17 17:24:24 -05:00
Alexey Kudinkin
75caa7d3d8 [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines (#4520) 2022-01-16 22:46:20 -08:00
Alexey Kudinkin
6cdcd89afa [HUDI-3094] Unify Hive's InputFormat implementations to avoid duplication (#4417) 2022-01-11 15:02:13 -08:00
xuzifu666
f0c2912d35 [MINOR] Remove unused methods in HoodieColumnProjectionUtils (#4408) 2022-01-06 15:36:13 -08:00
Sivabalan Narayanan
a66212d204 [HUDI-2966] Closing LogRecordScanner in compactor (#4478)
* Closing LogRecordScanner in compactor

* Addressing comments
2022-01-05 10:57:18 +08:00
RexAn
f612a20815 [HUDI-2779] Cache BaseDir if HudiTableNotFound Exception thrown (#4014) 2021-12-09 16:04:11 +05:30
xuzifu666
c9e18d1e7d [HUDI-2942] add error message log in HoodieCombineHiveInputFormat (#4224) 2021-12-07 22:05:39 -08:00
xiarixiaoyao
57c4bf8152 [HUDI-2876] For Hive/Presto, Hudi should remove the temp file created by HoodieMergedLogRecordScanner when the query finishes. (#4139) 2021-12-06 21:33:10 +08:00
zhangyue19921010
5616830ae1 Revert "[HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests"
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
2021-12-04 08:26:53 +05:30
yuzhao.cyz
a1d0ff4209 Moving to 0.11.0-SNAPSHOT on master branch. 2021-11-27 17:22:10 +08:00
Sivabalan Narayanan
8340ccb503 [HUDI-2005] Removing direct fs call in HoodieLogFileReader (#3865) 2021-11-25 18:51:38 -05:00
Danny Chan
a2eb2b0b0a [HUDI-2480] FileSlice after pending compaction-requested instant-time… (#3703)
* [HUDI-2480] FileSlice after pending compaction-requested instant-time is ignored by MOR snapshot reader

* include file slice after a pending compaction for spark reader

Co-authored-by: garyli1019 <yanjia.gary.li@gmail.com>
2021-11-25 22:30:09 +08:00
Jimmy.Zhou
0d1e7ecdab [MINOR] Fix typo,'multipe' corrected to 'multiple' (#4068) 2021-11-22 17:20:23 -08:00
xiarixiaoyao
a0dae41409 [HUDI-2758] Remove redundant code in HoodieRealtimeInputFormatUtils.getRealtimeSplits (#3994) 2021-11-15 11:29:40 +08:00
xiarixiaoyao
a40ac62e0c [HUDI-2086] Redo the logic of mor_incremental_view for Hive (#3203) 2021-11-10 15:41:07 +08:00
Genmao Yu
f41539a9cb [HUDI-313] bugfix: NPE when running select count(*) from a realtime table with Tez (#3630)
Co-authored-by: dylonyu <dylonyu@tencent.com>
2021-11-06 12:16:13 -04:00