Commit Graph

57 Commits

Author SHA1 Message Date
xiarixiaoyao
081061e14b [HUDI-1719] Hive on Spark/MR: incremental query of a MOR table returns an incorrect partition field (#2720) 2021-05-20 11:00:08 -04:00
xiarixiaoyao
6f7ff7e8ca [HUDI-1722] Fix NPE when a Hive Beeline/Spark SQL query selects specific fields on a MOR table (#2722) 2021-05-12 20:52:37 +08:00
TeRS-K
be9db2c4f5 [HUDI-1055] Remove hardcoded parquet in tests (#2740)
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-11 10:01:45 -07:00
jsbali
aa398f77f1 [HUDI-1789] Support reading older snapshots (#2809)
* [HUDI-1789] In HoodieParquetInputFormat we currently default to the latest version of base files.
This PR attempts to add a new jobConf
 `hoodie.%s.consume.snapshot.time`

This new config will allow us to read older snapshots.

-  Reusing hoodie.%s.consume.commit for point-in-time snapshot queries as well.
-  Adding javadocs and some more tests
2021-05-10 15:26:49 -07:00
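A minimal sketch of how a Hive/MapReduce job might set this property, assuming the `%s` placeholder is the table name (as with the other `hoodie.%s.consume.*` properties); the table name and instant time below are hypothetical:

```java
import org.apache.hadoop.mapred.JobConf;

public class SnapshotTimeQueryExample {
    public static void main(String[] args) {
        JobConf jobConf = new JobConf();
        String tableName = "trips"; // hypothetical table name
        // Pin reads to the base files as of this instant, instead of the latest snapshot.
        jobConf.set(String.format("hoodie.%s.consume.snapshot.time", tableName), "20210501120000");
    }
}
```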
xiarixiaoyao
1db904a12e [HUDI-1718] Fix query failure on the incremental view of a MOR table with multi-level partitions (#2716) 2021-05-05 00:34:20 -04:00
Raymond Xu
faf3785a2d [HUDI-1811] Fix TestHoodieRealtimeRecordReader (#2873)
Pass basePath with scheme 'file://' to HoodieRealtimeFileSplit
2021-04-30 11:16:55 -07:00
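For context, a small sketch of qualifying a local path with its filesystem scheme, which is the kind of fix this commit applies; the base path here is hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifyBasePathExample {
    public static void main(String[] args) throws IOException {
        Path basePath = new Path("/tmp/hoodie/table"); // hypothetical local base path
        FileSystem fs = basePath.getFileSystem(new Configuration());
        // makeQualified prepends the filesystem scheme (file: for the local filesystem),
        // so downstream consumers of the split see a fully qualified path.
        System.out.println(fs.makeQualified(basePath));
    }
}
```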
xiarixiaoyao
929eca43fe [HUDI-1817] Fix incorrect partition path when running an incremental query via Spark SQL (#2858) 2021-04-30 14:57:52 +08:00
satishkotha
7cc75e0be2 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611) 2021-03-04 17:43:31 -08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move the logic for cleaning failed commits into the cleaner
2. Introduce a hoodie heartbeat to detect failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
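A minimal sketch of the heartbeat idea, not Hudi's actual implementation: the writer periodically touches a per-instant marker file, and the cleaner treats an inflight commit whose heartbeat has expired as failed. The file layout and timeout are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class HeartbeatSketch {
    private static final long EXPIRY_MS = 2 * 60 * 1000; // hypothetical timeout

    // Writer side: touch a per-instant heartbeat file periodically while the commit is inflight.
    static void beat(Path heartbeatDir, String instantTime) throws IOException {
        Path f = heartbeatDir.resolve(instantTime);
        if (Files.notExists(f)) {
            Files.createFile(f);
        }
        Files.setLastModifiedTime(f, FileTime.fromMillis(System.currentTimeMillis()));
    }

    // Cleaner side: an inflight instant whose heartbeat has expired is considered a
    // failed commit and becomes a candidate for rollback.
    static boolean isFailed(Path heartbeatDir, String instantTime) throws IOException {
        Path f = heartbeatDir.resolve(instantTime);
        return System.currentTimeMillis() - Files.getLastModifiedTime(f).toMillis() > EXPIRY_MS;
    }
}
```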
Gary Li
605b617cfa [HUDI-1434] fix incorrect log file path in HoodieWriteStat (#2300)
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat

* HoodieWriteHandle#close() returns a list of WriteStatus objects

* Handle rolled-over log files and return a WriteStatus per log file written

 - Combined data and delete block logging into a single call
 - Lazily initialize and manage write status based on returned AppendResult
 - Use FSUtils.getFileSize() to set final file size, consistent with other handles
 - Added tests around returned values in AppendResult
 - Added validation of the file sizes returned in write stat

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-12-30 14:22:15 -08:00
Balaji Varadarajan
3ec9270e8e [HUDI-1490] Incremental Query should work even when there are partitions that have no incremental changes (#2371)
* Incremental Query should work even when there are partitions that have no incremental changes

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-26 12:17:49 -05:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for creating new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both the snapshot and realtime input formats for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general-purpose data storage format.
1. Does not have an implemented reader for Presto or SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on the query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils so it no longer relies on Scala imports / reflection-based conversion, etc.
 - Breaking up limitFileSize() per Parquet and HFile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
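A minimal sketch of the key/value encoding the commit describes (record key as bytes, Avro-encoded record as bytes); handing the resulting bytes to the actual HFile writer is elided:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class HFileKeyValueSketch {

    // Key: the Hoodie record key as UTF-8 bytes.
    static byte[] toKey(String recordKey) {
        return recordKey.getBytes(StandardCharsets.UTF_8);
    }

    // Value: the record serialized with Avro binary encoding.
    static byte[] toValue(GenericRecord record, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```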
Udit Mehrotra
8d04268264 [HUDI-1174] Changes for bootstrapped tables to work with presto (#1944)
The purpose of this pull request is to implement the changes required on the Hudi side to get bootstrapped tables integrated with Presto. The testing was done against Presto 0.232, and the following changes were identified to make it work:

The annotation UseRecordReaderFromInputFormat is required on HoodieParquetInputFormat as well, because reading bootstrapped tables needs to happen through the record reader to be able to perform the merge. On the Presto side, this annotation is already handled.

We need to internally maintain VIRTUAL_COLUMN_NAMES because Presto's internal Hive version hive-apache-1.2.2 has VirtualColumn as a class, versus the one we depend on in Hudi which is an enum.

Dependency changes in hudi-presto-bundle to avoid runtime exceptions.
2020-08-12 17:51:31 -07:00
Balaji Varadarajan
9bcd3221fd [HUDI-1144] Speedup spark read queries by caching metaclient in HoodieROPathFilter (#1919) 2020-08-05 09:19:10 -07:00
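A minimal sketch of the caching pattern this commit applies, with a placeholder standing in for the metaclient type; the cache is keyed by table base path so repeated accept() calls during query planning avoid re-reading table metadata:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingPathFilterSketch {
    // One cached metadata handle per table base path, so repeated accept() calls
    // during query planning do not re-read table metadata from storage.
    private final Map<String, Object> metaClientCache = new ConcurrentHashMap<>();

    Object getMetaClient(String basePath) {
        // computeIfAbsent builds the expensive client at most once per base path.
        return metaClientCache.computeIfAbsent(basePath, this::loadMetaClient);
    }

    private Object loadMetaClient(String basePath) {
        // Placeholder for the expensive metadata load (e.g. reading the timeline).
        return new Object();
    }
}
```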
Raymond Xu
10e4268792 [HUDI-995] Use Transformations, Assertions and SchemaTestUtil (#1884)
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
2020-08-01 20:57:18 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creating format-specific data blocks for base file formats (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardcoded references to Parquet / Parquet-based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
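Hypothetical shapes only, to illustrate the abstraction; the real HoodieFileWriter/HoodieFileReader interfaces differ in names and generics:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.avro.generic.GenericRecord;

// A format-agnostic writer: Parquet and HFile implementations would sit behind this.
interface BaseFileWriter extends AutoCloseable {
    void write(String recordKey, GenericRecord record) throws IOException;
}

// A format-agnostic reader, so callers no longer hardcode Parquet classes.
interface BaseFileReader extends AutoCloseable {
    Iterator<GenericRecord> readAll() throws IOException;
}
```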
Satish Kotha
a7fd331624 Add unit test for snapshot reads in hadoop-mr 2020-06-13 10:23:05 -07:00
Gary Li
37838cea60 [HUDI-822] decouple Hudi related logics from HoodieInputFormat (#1592)
- Refactoring business logic out of InputFormat into Utils helpers.
2020-06-09 06:10:16 -07:00
Shen Hong
2901f5423a [HUDI-1002] Ignore case when setting incremental mode in hive query (#1715) 2020-06-08 19:38:32 +08:00
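The gist of the fix as a minimal sketch; the exact constant compared against is an assumption here:

```java
public class ConsumeModeCheck {
    // Compare the consume mode case-insensitively, so "incremental", "INCREMENTAL",
    // and "Incremental" all select the incremental query path.
    static boolean isIncrementalMode(String mode) {
        return "INCREMENTAL".equalsIgnoreCase(mode);
    }
}
```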
Raymond Xu
03f136361a [HUDI-811] Restructure test packages in hudi-common (#1644)
* [HUDI-811] Restructure test packages in hudi-common
2020-05-27 16:28:17 +08:00
Raymond Xu
0d4848b68b [HUDI-811] Restructure test packages (#1607)
* restructure hudi-spark tests
* restructure hudi-timeline-service tests
* restructure hudi-hadoop-mr hudi-utilities tests
* restructure hudi-hive-sync tests
2020-05-13 15:37:03 -07:00
Raymond Xu
366bb10d8c [HUDI-812] Migrate hudi common tests to JUnit 5 (#1590)
* [HUDI-812] Migrate hudi-common tests to JUnit 5
2020-05-06 19:15:20 +08:00
bschell
e21441ad83 Add changes for presto mor queries (#1578)
Adds the necessary changes to Hudi to support Presto querying the realtime view of Hudi merge-on-read tables.

Co-authored-by: Brandon Scheller <bschelle@amazon.com>
2020-05-04 11:27:14 -07:00
Raymond Xu
6e15eebd81 [HUDI-809] Migrate CommonTestHarness to JUnit 5 (#1530) 2020-04-22 14:10:25 +08:00
n3nash
332072bc6d [HUDI-371] Supporting hive combine input format for realtime tables (#1503) 2020-04-20 20:40:06 -07:00
Raymond Xu
acdc4a8d00 [HUDI-798] Migrate to Mockito Jupiter for JUnit 5 (#1521) 2020-04-16 16:07:32 +08:00
Raymond Xu
d65efe659d [HUDI-780] Migrate test cases to Junit 5 (#1504) 2020-04-15 12:35:01 -07:00
satishkotha
c0f96e0726 [HUDI-687] Stop incremental reader on RO table when there is a pending compaction (#1396) 2020-04-10 10:45:41 -07:00
Ramachandran Madtas Subramaniam
f5f34bb1c1 [HUDI-568] Improve unit test coverage
Classes improved:
* HoodieTableMetaClient
* RocksDBDAO
* HoodieRealtimeFileSplit
2020-04-09 10:15:34 -07:00
Abhishek Modi
996f761232 Trying git merge --squash 2020-04-09 08:18:02 -07:00
Ramachandran Madtas Subramaniam
639ec20412 [HUDI-562] Enable testing at debug log level
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprises when the DEBUG log level is enabled in
production.
2020-04-02 11:14:35 -07:00
Suneel Marthi
fa36082554 [HUDI-746] Reduce build warnings < 10 (#1465) 2020-03-30 11:46:52 +08:00
vinoth chandar
e057c27603 [HUDI-744] Restructure hudi-common and clean up files under util packages (#1462)
- Brings more order and cohesion to the classes in hudi-common
 - Utils classes related to a particular concept (avro, timeline, ...) are placed near the corresponding package
 - common.fs package now contains all the filesystem level classes including wrapper filesystem
 - bloom.filter package renamed to just bloom
 - config package contains classes that help store properties
 - common.fs.inline package contains all the inline filesystem classes/impl
 - common.table.timeline now consolidates all timeline related classes
 - common.table.view consolidates all the classes related to filesystem view metadata
 - common.table.timeline.versioning contains all classes related to versioning of timeline
 - Fix a few unit tests as a result
 - Moved the test packages around to match the source file move
 - Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
2020-03-29 10:58:49 -07:00
Zhiyuan Zhao
0241b21f77 [HUDI-65] commitTime rename to instantTime (#1431) 2020-03-22 18:06:00 -07:00
Suneel Marthi
24e73816b2 [MINOR] Code Cleanup, remove redundant code (#1337) 2020-02-15 22:03:29 +08:00
Suneel Marthi
594da28fbf [HUDI-595] code cleanup, refactoring code out of PR# 1159 (#1302) 2020-02-04 21:52:03 +08:00
Suneel Marthi
5b7bb142dc [HUDI-583] Code Cleanup, remove redundant code, and other changes (#1237) 2020-02-02 18:03:44 +08:00
Bhavani Sudha Saktheeswaran
d09eacdc13 [HUDI-25] Optimize HoodieInputformat.listStatus() for faster Hive incremental queries on Hoodie
Summary:
    - The InputPathHandler class classifies input paths into incremental, non-incremental, and non-Hoodie paths.
    - Incremental queries leverage HoodieCommitMetadata to get the affected partitions and list only those partitions, as opposed to listing all partitions.
    - listStatus() processes each category separately.
2020-01-08 14:53:05 -08:00
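A minimal sketch of the classification flow, with hypothetical predicates; the real InputPathHandler consults table metadata and the job configuration to make these decisions:

```java
import java.util.ArrayList;
import java.util.List;

public class InputPathClassifierSketch {
    final List<String> incrementalPaths = new ArrayList<>();
    final List<String> nonIncrementalPaths = new ArrayList<>();
    final List<String> nonHoodiePaths = new ArrayList<>();

    void classify(List<String> inputPaths) {
        for (String path : inputPaths) {
            if (!isHoodieTablePath(path)) {
                nonHoodiePaths.add(path);       // plain Hive paths: listed as-is
            } else if (isIncrementalQueryRequested(path)) {
                incrementalPaths.add(path);     // pruned via HoodieCommitMetadata
            } else {
                nonIncrementalPaths.add(path);  // regular snapshot listing
            }
        }
    }

    private boolean isHoodieTablePath(String path) { return true; }            // placeholder
    private boolean isIncrementalQueryRequested(String path) { return false; } // placeholder
}
```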
vinoth chandar
9706f659db [HUDI-508] Standardizing on "Table" instead of "Dataset" across code (#1197)
- Docs were talking about storage types before, cWiki moved to "Table"
 - Most of code already has HoodieTable, HoodieTableMetaClient - correct naming
 - Replacing/renaming uses of "dataset" across code/comments
 - A few usages in comments and of Spark SQL Dataset remain unscathed
2020-01-07 12:52:32 -08:00
Pratyaksh Sharma
dde21e7315 [HUDI-402]: code clean up in test cases 2019-12-31 11:10:49 -08:00
Balaji Varadarajan
9a1f698eef [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset 2019-12-15 21:26:30 -08:00
lamber-ken
ba514cfea0 [MINOR] Remove redundant plus operator (#1097) 2019-12-12 05:42:05 +08:00
lamber-ken
2745b7552f [HUDI-379] Refactor the codes based on new JavadocStyle code style rule (#1079) 2019-12-06 12:59:28 +08:00
lamber-ken
b3e0ebbc4a [checkstyle] Add ConstantName java checkstyle rule (#1066)
* add SimplifyBooleanExpression java checkstyle rule
* collapse empty tags in scalastyle file
2019-12-04 18:59:15 +08:00
谢磊
f9139c0f61 [HUDI-366] Refactor some module codes based on new ImportOrder code style rule (#1055)
[HUDI-366] Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test / hudi-utilities based on new ImportOrder code style rule
2019-11-27 21:32:43 +08:00
谢磊
804e348d0e [HUDI-346] Set allowMultipleEmptyLines to false for EmptyLineSeparator rule (#1025) 2019-11-19 18:44:42 +08:00
vinoth chandar
e4c91ed13f [HUDI-290] Normalize test class name of all test classes (#951) 2019-10-22 20:19:11 -07:00
Udit Mehrotra
12523c379f [HUDI-298] Fix issue with incorrect column mapping causing bad data during on-the-fly merge of Real Time tables (#956)
* Fix issue with incorrect column mapping causing bad data during on-the-fly merge of Real Time tables
2019-10-16 02:05:53 -07:00