
323 Commits

Author SHA1 Message Date
ForwardXu
c5c4cfec91 [HUDI-3507] Support export command based on Call Produce Command (#5901) 2022-06-19 18:48:22 +08:00
huberylee
fec49dc12b [HUDI-4165] Support Create/Drop/Show/Refresh Index Syntax for Spark SQL (#5761)
* Support Create/Drop/Show/Refresh Index Syntax for Spark SQL
2022-06-17 18:33:58 +08:00
KnightChess
0ff34b6974 [HUDI-4214] improve repeat init write schema in ExpressionPayload (#5820)
* [HUDI-4214] improve repeat init write schema in ExpressionPayload
2022-06-16 17:58:37 +08:00
KnightChess
2bf0a1906d [HUDI-4217] improve repeat init object in ExpressionPayload (#5825) 2022-06-15 20:21:28 +08:00
superche
7b946cf351 [HUDI-3499] Add Call Procedure for show rollbacks (#5848)
* Add Call Procedure for show rollbacks

* fix

* add ut for show_rollback_detail and exception handle

Co-authored-by: superche <superche@tencent.com>
2022-06-15 16:50:15 +08:00
Y Ethan Guo
97ccf5dd18 [HUDI-4223] Fix NullPointerException from getLogRecordScanner when reading metadata table (#5840)
When the metadata table path is explicitly specified for reading in Spark, "hoodie.metadata.enable" is overridden to true for proper read behavior.
2022-06-11 13:19:24 -07:00
xi chaomin
2b3a85528a [HUDI-3889] Do not validate table config if save mode is set to Overwrite (#5619)
Co-authored-by: xicm <xicm@asiainfo.com>
2022-06-09 19:23:51 -04:00
Alexey Kudinkin
35afdb4316 [HUDI-4178] Addressing performance regressions in Spark DataSourceV2 Integration (#5737)
There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment; instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for the schema evaluation logic.
2022-06-07 16:30:46 -07:00
Sivabalan Narayanan
f85cd9b16d [HUDI-4200] Fixing sorting of keys fetched from metadata table (#5773)
- Keys fetched from the metadata table, especially from the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). This patch fixes that. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208
2022-06-07 08:19:52 -04:00
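To illustrate why commit f85cd9b16d above sorts keys before a lookup, here is a minimal Python sketch of a forward-only cursor over a key-sorted file. It is not Hudi's code: `lookup_sorted` and the in-memory `entries` list are hypothetical stand-ins for a reader over a sorted HFile. With sorted keys the cursor only moves forward; with unsorted keys a real reader would have to seek back to the start.

```python
def lookup_sorted(entries, keys):
    """Look up `keys` in `entries`, a list of (key, value) pairs sorted by key.

    Hypothetical illustration: the cursor `pos` only ever moves forward,
    which is possible only because the requested keys are sorted first.
    """
    results = {}
    pos = 0
    for key in sorted(keys):
        # Advance the cursor past all entries smaller than the requested key.
        while pos < len(entries) and entries[pos][0] < key:
            pos += 1
        # Record a hit if the cursor landed exactly on the key.
        if pos < len(entries) and entries[pos][0] == key:
            results[key] = entries[pos][1]
    return results


entries = [("a", 1), ("c", 3), ("e", 5)]
found = lookup_sorted(entries, ["e", "a", "d"])
```

Here the keys are requested out of order, and the missing key "d" is simply absent from the result.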
Alexey Kudinkin
4f7ea8c79a [HUDI-4176] Fixing TableSchemaResolver to avoid repeated HoodieCommitMetadata parsing (#5733)
As outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1 TB) with pretty fat commits, where Hudi's commit metadata could reach hundreds of MB.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy
2022-06-06 13:14:26 -04:00
Saisai Shao
bd26d633d7 [HUDI-4168] Add Call Procedure for marker deletion (#5738)
* Add Call Procedure for marker deletion
2022-06-05 11:05:38 +08:00
leesf
3759a38b99 [HUDI-4183] Fix using HoodieCatalog to create non-hudi tables (#5743) 2022-06-03 17:16:48 +08:00
Jin Xing
918c4f4e0b [HUDI-4149] Drop-Table fails when underlying table directory is broken (#5672) 2022-05-30 19:09:26 +08:00
ForwardXu
8fa8f26031 [MINOR] Fix Hive and meta sync config for sql statement (#5316) 2022-05-28 07:56:39 -07:00
Alexey Kudinkin
1767ff5e7c [HUDI-4161] Make sure partition values are taken from partition path (#5699) 2022-05-27 02:36:30 -07:00
Y Ethan Guo
752f956f03 [HUDI-3933] Add UT cases to cover different key gen (#5638) 2022-05-23 06:48:09 -07:00
Raymond Xu
271d1a79c0 [HUDI-4051] Allow nested field as primary key and preCombineField in spark sql (#5517)
* [HUDI-4051] Allow nested field as preCombineField in spark sql

* relax validation for primary key
2022-05-22 00:47:51 -07:00
uday08bce
32a5d268f5 [HUDI-3890] fix rat plugin issue with sql files (#5644) 2022-05-21 12:22:55 -04:00
Jin Xing
922f765ead [HUDI-4100] CTAS failed to clean up when given an illegal MANAGED table definition (#5588) 2022-05-21 22:41:18 +08:00
huberylee
85b146d3d5 [HUDI-3985] Refactor DLASyncTool to support read hoodie table as spark datasource table (#5532) 2022-05-20 22:25:32 +08:00
huberylee
6573469e73 [HUDI-4116] Unify clustering/compaction related procedures' output type (#5620)
* Unify clustering/compaction related procedures' output type

* Address review comments
2022-05-19 09:48:03 +08:00
Jin Xing
d422f69a0d [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand (#5564)
* [HUDI-4087] Support dropping RO and RT table in DropHoodieTableCommand

* Set hoodie.query.as.ro.table in serde properties
2022-05-17 14:12:50 +08:00
Yuwei XIAO
61030d8e7a [HUDI-3123] consistent hashing index: basic write path (upsert/insert) (#4480)
1. basic write path (insert/upsert) implementation
2. adapt simple bucket index
2022-05-16 11:07:01 +08:00
董可伦
75f847691f [HUDI-4001] Filter the properties should not be used when create table for Spark SQL (#5495) 2022-05-16 09:50:29 +08:00
Sivabalan Narayanan
0cec955fa2 [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests (#5501)
- Added pure immutable test yamls to integ test framework. Added SparkBulkInsertNode as part of it.
- Added delete_partition support to integ test framework using spark-datasource.
- Added a single yaml to test all non core write operations (insert overwrite, insert overwrite table and delete partitions)
- Added tests for 4 concurrent spark datasource writers (multi-writer tests).
- Fixed readme w/ sample commands for multi-writer.
2022-05-12 21:01:55 -04:00
Jin Xing
7f0c1f3ddf [HUDI-4079] Supports showing table comment for hudi with spark3 (#5546) 2022-05-11 22:28:58 +08:00
Sivabalan Narayanan
6285a239a3 [HUDI-3995] Making perf optimizations for bulk insert row writer path (#5462)
- Avoid using udf for key generator for SimpleKeyGen and NonPartitionedKeyGen.
- Fixed NonPartitioned Key generator to directly fetch record key from row rather than involving GenericRecord.
- Other minor fixes around using static values instead of looking up hashmap.
2022-05-09 12:40:22 -04:00
Raymond Xu
c319ee9cea [HUDI-4017] Improve spark sql coverage in CI (#5512)
Add GitHub actions tasks to run spark sql UTs under spark 3.1 and 3.2.
2022-05-06 05:52:06 -07:00
KnightChess
6ec039ba42 [MINOR] Update alter rename command class type for pattern matching (#5381) 2022-04-26 19:39:51 -07:00
Sivabalan Narayanan
762623a15c [HUDI-3972] Fixing hoodie.properties/tableConfig for no preCombine field with writes (#5424)
Fixed instantiation of a new table to set preCombine to null if it is not explicitly set by the user.
2022-04-25 23:03:10 -04:00
Y Ethan Guo
c4bc2deea0 [HUDI-3936] Fix projection for a nested field as pre-combined key (#5379)
This PR fixes the projection logic around a nested field which is used as the pre-combined key field. The fix is to only check and append the root level field for projection, i.e., "a", for a nested field "a.b.c" in the mandatory columns.

- Changes the logic to check and append the root level field for a required nested field in the mandatory columns in HoodieBaseRelation.appendMandatoryColumns
2022-04-21 20:17:57 -04:00
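The projection rule described in commit c4bc2deea0 above (append only the root-level field "a" for a required nested field "a.b.c") can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in HoodieBaseRelation.appendMandatoryColumns: the helper names `root_field` and `append_mandatory_columns` and the plain-list column representation are hypothetical.

```python
def root_field(field_path):
    """Return the root-level column of a dotted path, e.g. "a" for "a.b.c"."""
    return field_path.split(".", 1)[0]


def append_mandatory_columns(requested, mandatory):
    """Return `requested` plus the root of each mandatory column not already present.

    Hypothetical sketch: columns are plain strings and order is preserved.
    """
    result = list(requested)
    for column in mandatory:
        root = root_field(column)
        # Check and append the root-level field only, never the nested path.
        if root not in result:
            result.append(root)
    return result


projected = append_mandatory_columns(["id"], ["a.b.c", "ts"])
```

When the root column is already part of the requested projection, nothing is appended for the nested field.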
xiarixiaoyao
037f89ee7c [HUDI-3921] Fixed schema evolution cannot work with HUDI-3855 (#5376)
- When columns are renamed (schema evolution enabled), the renamed columns weren't handled correctly while copying records from the old data file with HoodieMergeHandle.
2022-04-21 18:27:54 -04:00
Alexey Kudinkin
4b296f79cc [HUDI-3935] Adding config to fallback to enabled Partition Values extraction from Partition path (#5377) 2022-04-21 01:36:19 -07:00
Alexey Kudinkin
f7544e23ac [HUDI-3204] Fixing partition-values being derived from partition-path instead of source columns (#5364)
- Scaffolded `Spark24HoodieParquetFileFormat` extending `ParquetFileFormat` and overriding the behavior of adding partition columns to every row
- Amended `SparkAdapter`'s `createHoodieParquetFileFormat` API to be able to configure whether to append partition values or not
- Fall back to appending partition values in cases when the source columns are not persisted in the data file
- Fixing HoodieBaseRelation incorrectly handling mandatory columns
2022-04-20 19:30:27 +08:00
Alexey Kudinkin
81bf771e56 [HUDI-3902] Fallback to HadoopFsRelation in cases non-involving Schema Evolution (#5352)
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-19 10:40:20 -07:00
董可伦
b8e465fdfc [MINOR] Fix typos in log4j-surefire.properties (#5212) 2022-04-15 13:33:37 -07:00
ForwardXu
6621f3cdbb [HUDI-3845] Fix delete mor table's partition with urlencode's error (#5282) 2022-04-14 01:49:00 -07:00
ForwardXu
44b3630b5d [HUDI-3826] Make truncate partition use delete_partition operation (#5272)
Make truncate partition and drop partition behave as drop partition with purge, which deletes all records via Hudi DELETE_PARTITION; the partition is removed from the metastore.
2022-04-14 00:53:05 -07:00
Alexey Kudinkin
458fdd5611 [HUDI-3841] Fixing Column Stats in the presence of Schema Evolution (#5275)
Currently, Data Skipping does not correctly handle the case when column stats are not aligned and, for example, some of the (column, file) combinations are missing from the CSI.

This could occur in different scenarios (schema evolution, CSI config changes), and has to be handled properly when we're composing CSI projection for Data Skipping. This PR addresses that.

- Added appropriate aligning for the transposed CSI projection
2022-04-11 15:45:53 -04:00
Alexey Kudinkin
976840e8eb [HUDI-3812] Fixing Data Skipping configuration to respect Metadata Table configs (#5244)
Addressing the problem of Data Skipping not respecting Metadata Table configs, which might differ between write and read paths. More details can be found in HUDI-3812.

- Fixing Data Skipping configuration to respect MT configs (on the Read path)
- Tightening up DS handling of cases when no top-level columns are in the target query
- Enhancing tests to cover all possible cases
2022-04-10 13:43:47 -04:00
Alexey Kudinkin
81b25c543a [HUDI-3825] Fixing Column Stats Index updating sequence (#5267) 2022-04-08 23:14:08 -07:00
KnightChess
7a6272fba1 [HUDI-3781] fix spark delete sql can not delete record (#5215) 2022-04-08 14:26:40 +08:00
Alexey Kudinkin
d43b4cd95e [HUDI-3739] Fix handling of the isNotNull predicate in Data Skipping (#5224)
- Fix handling of the isNotNull predicate in Data Skipping
2022-04-06 15:17:36 -04:00
rkkalluri
939b3d1b07 [HUDI-3726] Switching from non-partitioned to partitioned key gen does not throw any exception (#5205) 2022-04-06 13:35:32 -04:00
Alexey Kudinkin
9e87d164b3 [HUDI-3760] Adding capability to fetch Metadata Records by prefix (#5208)
- Adding capability to fetch Metadata Records by key prefix so that Data Skipping could fetch only the Column Stats Index records pertaining to the columns being queried, instead of reading out the whole Index.
- Fixed usages of HFileScanner in HFileReader. A few code paths use the cached scanner if available. Other code paths use their own HFileScanner with positional read.

Brief change log
- Rebasing ColumnStatsIndexSupport to rely on HoodieBackedTableMetadata in lieu of reading through Spark DS
- Adding methods enabling key-prefix lookups to HoodieFileReader, HoodieHFileReader
- Wiring key-prefix lookup through LogRecordScanner impls
- Cleaning up HoodieHFileReader impl

Co-authored-by: sivabalan <n.siva.b@gmail.com>
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-06 12:11:08 -04:00
Raymond Xu
e96f08f355 Moving to 0.12.0-SNAPSHOT on master branch. 2022-04-06 15:24:10 +08:00
Yann Byron
3195f51562 [HUDI-3748] write and select hudi table when enable hoodie.datasource.write.drop.partition.columns (#5201) 2022-04-05 01:31:41 -07:00
ForwardXu
3449e86989 [HUDI-3780] improve drop partitions (#5178) 2022-04-05 11:52:33 +08:00
Prashant Wason
b28f0d6ceb [HUDI-3290] Different file formats for the partition metadata file. (#5179)
* [HUDI-3290] Different file formats for the partition metadata file.

Partition metadata files are stored in each partition to help identify the base path of a table. These files are saved in the properties file format. Some query engines do not work when non Parquet/ORC files are found in a partition.

Added a new table config 'hoodie.partition.metafile.use.data.format' which when enabled (default false for backward compatibility) ensures that partition metafiles will be saved in the same format as the base files of a dataset.

For new datasets, the config can be set via hudi-cli. Deltastreamer has a new parameter --partition-metafile-use-data-format which will create a table with this setting.

* Code review comments

- Adding a new command to migrate from text to base file formats for meta file.
- Reimplementing readFromFS() to first read the text format, then base format
- Avoid extra exists() checks in readFromFS()
- Added unit tests, enabled parquet format across hoodie-hadoop-mr
- Code cleanup, restructuring, naming consistency.

* Wiring in all the other Spark code paths to respect this config

 - Turned on parquet meta format for COW data source tests
 - Removed the deltastreamer command line to keep it shorter

* populate HoodiePartitionMetadata#format after readFromFS()

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2022-04-04 08:08:20 -07:00
Alexey Kudinkin
cc3737be50 [HUDI-3664] Fixing Column Stats Index composition (#5181)
Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
2022-04-02 17:15:52 -07:00