Commit Graph

57 Commits

Author SHA1 Message Date
xiarixiaoyao
081061e14b [HUDI-1719] Hive on Spark/MR: incremental query of a MOR table returns an incorrect partition field (#2720) 2021-05-20 11:00:08 -04:00
xiarixiaoyao
6f7ff7e8ca [HUDI-1722] Fix NPE when a Hive Beeline/Spark SQL query selects specific fields on a MOR table (#2722) 2021-05-12 20:52:37 +08:00
TeRS-K
be9db2c4f5 [HUDI-1055] Remove hardcoded parquet in tests (#2740)
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-11 10:01:45 -07:00
jsbali
aa398f77f1 [HUDI-1789] Support reading older snapshots (#2809)
* [HUDI-1789] In HoodieParquetInputFormat we currently default to the latest version of base files.
This PR attempts to add a new jobConf
 `hoodie.%s.consume.snapshot.time`

This new config will allow us to read older snapshots.

-  Reusing hoodie.%s.consume.commit for point-in-time snapshot queries as well.
-  Adding javadocs and some more tests
2021-05-10 15:26:49 -07:00
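A minimal sketch of how a Hive/MapReduce job might set this property, assuming the `%s` placeholder is the table name (as with the other `hoodie.%s.consume.*` properties); the table name and instant time below are hypothetical:

```java
import org.apache.hadoop.mapred.JobConf;

public class SnapshotTimeQueryExample {
    public static void main(String[] args) {
        JobConf jobConf = new JobConf();
        String tableName = "trips"; // hypothetical table name
        // Pin reads to the base files as of this instant, instead of the latest snapshot.
        jobConf.set(String.format("hoodie.%s.consume.snapshot.time", tableName), "20210501120000");
    }
}
```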
xiarixiaoyao
1db904a12e [HUDI-1718] Fix query failure on the incremental view of a MOR table with multi-level partitions (#2716) 2021-05-05 00:34:20 -04:00
Raymond Xu
faf3785a2d [HUDI-1811] Fix TestHoodieRealtimeRecordReader (#2873)
Pass basePath with scheme 'file://' to HoodieRealtimeFileSplit
2021-04-30 11:16:55 -07:00
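For context, a small sketch of qualifying a local path with its filesystem scheme, which is the kind of fix this commit applies; the base path here is hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifyBasePathExample {
    public static void main(String[] args) throws IOException {
        Path basePath = new Path("/tmp/hoodie/table"); // hypothetical local base path
        FileSystem fs = basePath.getFileSystem(new Configuration());
        // makeQualified prepends the filesystem scheme (file: for the local filesystem),
        // so downstream consumers of the split see a fully qualified path.
        System.out.println(fs.makeQualified(basePath));
    }
}
```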
xiarixiaoyao
929eca43fe [HUDI-1817] Fix incorrect partition path when running an incremental query via Spark SQL (#2858) 2021-04-30 14:57:52 +08:00
satishkotha
7cc75e0be2 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611) 2021-03-04 17:43:31 -08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move the logic for cleaning failed commits into the cleaner
2. Introduce a hoodie heartbeat to detect failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
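A minimal sketch of the heartbeat idea, not Hudi's actual implementation: the writer periodically touches a per-instant marker file, and the cleaner treats an inflight commit whose heartbeat has expired as failed. The file layout and timeout are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class HeartbeatSketch {
    private static final long EXPIRY_MS = 2 * 60 * 1000; // hypothetical timeout

    // Writer side: touch a per-instant heartbeat file periodically while the commit is inflight.
    static void beat(Path heartbeatDir, String instantTime) throws IOException {
        Path f = heartbeatDir.resolve(instantTime);
        if (Files.notExists(f)) {
            Files.createFile(f);
        }
        Files.setLastModifiedTime(f, FileTime.fromMillis(System.currentTimeMillis()));
    }

    // Cleaner side: an inflight instant whose heartbeat has expired is considered a
    // failed commit and becomes a candidate for rollback.
    static boolean isFailed(Path heartbeatDir, String instantTime) throws IOException {
        Path f = heartbeatDir.resolve(instantTime);
        return System.currentTimeMillis() - Files.getLastModifiedTime(f).toMillis() > EXPIRY_MS;
    }
}
```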
Gary Li
605b617cfa [HUDI-1434] fix incorrect log file path in HoodieWriteStat (#2300)
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat

* HoodieWriteHandle#close() returns a list of WriteStatus objects

* Handle rolled-over log files and return a WriteStatus per log file written

 - Combined data and delete block logging into a single call
 - Lazily initialize and manage write status based on returned AppendResult
 - Use FSUtils.getFileSize() to set final file size, consistent with other handles
 - Added tests around returned values in AppendResult
 - Added validation of the file sizes returned in write stat

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-12-30 14:22:15 -08:00
Balaji Varadarajan
3ec9270e8e [HUDI-1490] Incremental Query should work even when there are partitions that have no incremental changes (#2371)
* Incremental Query should work even when there are partitions that have no incremental changes

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-26 12:17:49 -05:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for creating new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both the snapshot and realtime input formats for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general-purpose data storage format.
1. Does not have an implemented reader for Presto or SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on the query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils so it no longer relies on Scala imports / reflection-based conversion, etc.
 - Breaking up limitFileSize() per Parquet and HFile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
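A minimal sketch of the key/value encoding the commit describes (record key as bytes, Avro-encoded record as bytes); handing the resulting bytes to the actual HFile writer is elided:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class HFileKeyValueSketch {

    // Key: the Hoodie record key as UTF-8 bytes.
    static byte[] toKey(String recordKey) {
        return recordKey.getBytes(StandardCharsets.UTF_8);
    }

    // Value: the record serialized with Avro binary encoding.
    static byte[] toValue(GenericRecord record, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```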
Udit Mehrotra
8d04268264 [HUDI-1174] Changes for bootstrapped tables to work with presto (#1944)
The purpose of this pull request is to implement the changes required on the Hudi side to get bootstrapped tables integrated with Presto. The testing was done against Presto 0.232, and the following changes were identified to make it work:

The annotation UseRecordReaderFromInputFormat is required on HoodieParquetInputFormat as well, because reading bootstrapped tables needs to happen through the record reader to be able to perform the merge. On the Presto side, this annotation is already handled.

We need to internally maintain VIRTUAL_COLUMN_NAMES because Presto's internal Hive version hive-apache-1.2.2 has VirtualColumn as a class, versus the one we depend on in Hudi which is an enum.

Dependency changes in hudi-presto-bundle to avoid runtime exceptions.
2020-08-12 17:51:31 -07:00
Balaji Varadarajan
9bcd3221fd [HUDI-1144] Speedup spark read queries by caching metaclient in HoodieROPathFilter (#1919) 2020-08-05 09:19:10 -07:00
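A minimal sketch of the caching pattern this commit applies, with a placeholder standing in for the metaclient type; the cache is keyed by table base path so repeated accept() calls during query planning avoid re-reading table metadata:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingPathFilterSketch {
    // One cached metadata handle per table base path, so repeated accept() calls
    // during query planning do not re-read table metadata from storage.
    private final Map<String, Object> metaClientCache = new ConcurrentHashMap<>();

    Object getMetaClient(String basePath) {
        // computeIfAbsent builds the expensive client at most once per base path.
        return metaClientCache.computeIfAbsent(basePath, this::loadMetaClient);
    }

    private Object loadMetaClient(String basePath) {
        // Placeholder for the expensive metadata load (e.g. reading the timeline).
        return new Object();
    }
}
```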
Raymond Xu
10e4268792 [HUDI-995] Use Transformations, Assertions and SchemaTestUtil (#1884)
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
2020-08-01 20:57:18 +08:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creating format-specific data blocks for base file formats (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardcoded references to Parquet / Parquet-based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
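Hypothetical shapes only, to illustrate the abstraction; the real HoodieFileWriter/HoodieFileReader interfaces differ in names and generics:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.avro.generic.GenericRecord;

// A format-agnostic writer: Parquet and HFile implementations would sit behind this.
interface BaseFileWriter extends AutoCloseable {
    void write(String recordKey, GenericRecord record) throws IOException;
}

// A format-agnostic reader, so callers no longer hardcode Parquet classes.
interface BaseFileReader extends AutoCloseable {
    Iterator<GenericRecord> readAll() throws IOException;
}
```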
Satish Kotha
a7fd331624 Add unit test for snapshot reads in hadoop-mr 2020-06-13 10:23:05 -07:00
Gary Li
37838cea60 [HUDI-822] decouple Hudi related logics from HoodieInputFormat (#1592)
- Refactoring business logic out of InputFormat into Utils helpers.
2020-06-09 06:10:16 -07:00
Shen Hong
2901f5423a [HUDI-1002] Ignore case when setting incremental mode in hive query (#1715) 2020-06-08 19:38:32 +08:00
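The gist of the fix as a minimal sketch; the exact constant compared against is an assumption here:

```java
public class ConsumeModeCheck {
    // Compare the consume mode case-insensitively, so "incremental", "INCREMENTAL",
    // and "Incremental" all select the incremental query path.
    static boolean isIncrementalMode(String mode) {
        return "INCREMENTAL".equalsIgnoreCase(mode);
    }
}
```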
Raymond Xu
03f136361a [HUDI-811] Restructure test packages in hudi-common (#1644)
* [HUDI-811] Restructure test packages in hudi-common
2020-05-27 16:28:17 +08:00
Raymond Xu
0d4848b68b [HUDI-811] Restructure test packages (#1607)
* restructure hudi-spark tests
* restructure hudi-timeline-service tests
* restructure hudi-hadoop-mr hudi-utilities tests
* restructure hudi-hive-sync tests
2020-05-13 15:37:03 -07:00
Raymond Xu
366bb10d8c [HUDI-812] Migrate hudi common tests to JUnit 5 (#1590)
* [HUDI-812] Migrate hudi-common tests to JUnit 5
2020-05-06 19:15:20 +08:00
bschell
e21441ad83 Add changes for presto mor queries (#1578)
Adds the necessary changes to Hudi to support Presto querying the realtime view of Hudi merge-on-read tables.

Co-authored-by: Brandon Scheller <bschelle@amazon.com>
2020-05-04 11:27:14 -07:00
Raymond Xu
6e15eebd81 [HUDI-809] Migrate CommonTestHarness to JUnit 5 (#1530) 2020-04-22 14:10:25 +08:00
n3nash
332072bc6d [HUDI-371] Supporting hive combine input format for realtime tables (#1503) 2020-04-20 20:40:06 -07:00
Raymond Xu
acdc4a8d00 [HUDI-798] Migrate to Mockito Jupiter for JUnit 5 (#1521) 2020-04-16 16:07:32 +08:00
Raymond Xu
d65efe659d [HUDI-780] Migrate test cases to Junit 5 (#1504) 2020-04-15 12:35:01 -07:00
satishkotha
c0f96e0726 [HUDI-687] Stop incremental reader on RO table when there is a pending compaction (#1396) 2020-04-10 10:45:41 -07:00
Ramachandran Madtas Subramaniam
f5f34bb1c1 [HUDI-568] Improve unit test coverage
Classes improved:
* HoodieTableMetaClient
* RocksDBDAO
* HoodieRealtimeFileSplit
2020-04-09 10:15:34 -07:00
Abhishek Modi
996f761232 Trying git merge --squash 2020-04-09 08:18:02 -07:00
Ramachandran Madtas Subramaniam
639ec20412 [HUDI-562] Enable testing at debug log level
This is to ensure that tests will execute all code paths, even the ones
written under DEBUG log levels. This will improve coverage as well as
ensure there are no surprises when the DEBUG log level is enabled in
production.
2020-04-02 11:14:35 -07:00
Suneel Marthi
fa36082554 [HUDI-746] Reduce build warnings < 10 (#1465) 2020-03-30 11:46:52 +08:00
vinoth chandar
e057c27603 [HUDI-744] Restructure hudi-common and clean up files under util packages (#1462)
- Brings more order and cohesion to the classes in hudi-common
 - Utils classes related to a particular concept (avro, timeline, ...) are placed near the corresponding package
 - common.fs package now contains all the filesystem level classes including wrapper filesystem
 - bloom.filter package renamed to just bloom
 - config package contains classes that help store properties
 - common.fs.inline package contains all the inline filesystem classes/impl
 - common.table.timeline now consolidates all timeline related classes
 - common.table.view consolidates all the classes related to filesystem view metadata
 - common.table.timeline.versioning contains all classes related to versioning of timeline
 - Fix a few unit tests as a result
 - Moved the test packages around to match the source file move
 - Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos
2020-03-29 10:58:49 -07:00
Zhiyuan Zhao
0241b21f77 [HUDI-65] commitTime rename to instantTime (#1431) 2020-03-22 18:06:00 -07:00
Suneel Marthi
24e73816b2 [MINOR] Code Cleanup, remove redundant code (#1337) 2020-02-15 22:03:29 +08:00
Suneel Marthi
594da28fbf [HUDI-595] code cleanup, refactoring code out of PR# 1159 (#1302) 2020-02-04 21:52:03 +08:00
Suneel Marthi
5b7bb142dc [HUDI-583] Code Cleanup, remove redundant code, and other changes (#1237) 2020-02-02 18:03:44 +08:00
Bhavani Sudha Saktheeswaran
d09eacdc13 [HUDI-25] Optimize HoodieInputformat.listStatus() for faster Hive incremental queries on Hoodie
Summary:
    - The InputPathHandler class classifies input paths into incremental, non-incremental, and non-Hoodie paths.
    - Incremental queries leverage HoodieCommitMetadata to get the affected partitions and list only those partitions, as opposed to listing all partitions.
    - listStatus() processes each category separately.
2020-01-08 14:53:05 -08:00
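A minimal sketch of the classification flow, with hypothetical predicates; the real InputPathHandler consults table metadata and the job configuration to make these decisions:

```java
import java.util.ArrayList;
import java.util.List;

public class InputPathClassifierSketch {
    final List<String> incrementalPaths = new ArrayList<>();
    final List<String> nonIncrementalPaths = new ArrayList<>();
    final List<String> nonHoodiePaths = new ArrayList<>();

    void classify(List<String> inputPaths) {
        for (String path : inputPaths) {
            if (!isHoodieTablePath(path)) {
                nonHoodiePaths.add(path);       // plain Hive paths: listed as-is
            } else if (isIncrementalQueryRequested(path)) {
                incrementalPaths.add(path);     // pruned via HoodieCommitMetadata
            } else {
                nonIncrementalPaths.add(path);  // regular snapshot listing
            }
        }
    }

    private boolean isHoodieTablePath(String path) { return true; }            // placeholder
    private boolean isIncrementalQueryRequested(String path) { return false; } // placeholder
}
```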
vinoth chandar
9706f659db [HUDI-508] Standardizing on "Table" instead of "Dataset" across code (#1197)
- Docs were talking about storage types before, cWiki moved to "Table"
 - Most of code already has HoodieTable, HoodieTableMetaClient - correct naming
 - Replacing/renaming uses of "dataset" across code/comments
 - A few usages in comments and of Spark SQL Dataset remain unscathed
2020-01-07 12:52:32 -08:00
Pratyaksh Sharma
dde21e7315 [HUDI-402]: code clean up in test cases 2019-12-31 11:10:49 -08:00
Balaji Varadarajan
9a1f698eef [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset 2019-12-15 21:26:30 -08:00
lamber-ken
ba514cfea0 [MINOR] Remove redundant plus operator (#1097) 2019-12-12 05:42:05 +08:00
lamber-ken
2745b7552f [HUDI-379] Refactor the codes based on new JavadocStyle code style rule (#1079) 2019-12-06 12:59:28 +08:00
lamber-ken
b3e0ebbc4a [checkstyle] Add ConstantName java checkstyle rule (#1066)
* add SimplifyBooleanExpression java checkstyle rule
* collapse empty tags in scalastyle file
2019-12-04 18:59:15 +08:00
谢磊
f9139c0f61 [HUDI-366] Refactor some module codes based on new ImportOrder code style rule (#1055)
[HUDI-366] Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test / hudi-utilities based on new ImportOrder code style rule
2019-11-27 21:32:43 +08:00
谢磊
804e348d0e [HUDI-346] Set allowMultipleEmptyLines to false for EmptyLineSeparator rule (#1025) 2019-11-19 18:44:42 +08:00
vinoth chandar
e4c91ed13f [HUDI-290] Normalize test class name of all test classes (#951) 2019-10-22 20:19:11 -07:00
Udit Mehrotra
12523c379f [HUDI-298] Fix issue with incorrect column mapping causing bad data during on-the-fly merge of Real Time tables (#956)
* Fix issue with incorrect column mapping causing bad data during on-the-fly merge of Real Time tables
2019-10-16 02:05:53 -07:00