lanyuanxiaoyao/hudi - hudi

Author	SHA1	Message	Date
Danny Chan	4bc45a391a	[HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313 )	2020-12-10 20:02:02 +08:00
lw0090	1f0d5c077e	[HUDI-1349] spark sql support overwrite use insert_overwrite_table (#2196 )	2020-12-03 12:26:21 -08:00
lw0090	5f5c15b0d9	[HUDI-892] RealtimeParquetInputFormat skip adding projection columns if there are no log files (#2190 ) * [HUDI-892] for test * [HUDI-892] fix bug generate array from split * [HUDI-892] revert test log	2020-11-02 20:00:12 -08:00
Gary Li	48a58c98a1	[MINOR] fix get classname for hive sync (#2008 )	2020-08-31 16:26:10 -07:00
Prashant Wason	6461927eac	[HUDI-960] Implementation of the HFile base and log file format. (#1804 ) * [HUDI-960] Implementation of the HFile base and log file format. 1. Includes HFileWriter and HFileReader 2. Includes HFileInputFormat for both snapshot and realtime input format for Hive 3. Unit test for new code 4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported) Advantage: HFile file format saves data as binary key-value pairs. This implementation chooses the following values: 1. Key = Hoodie Record Key (as bytes) 2. Value = Avro encoded GenericRecord (as bytes) HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to look up records by key or search in a range of keys without having to read the entire data/log format. Limitations: HFile storage format has certain limitations when used as a general purpose data storage format. 1. Does not have an implemented reader for Presto and SparkSQL 2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on query side due to lack of column pruning Other changes: - Remove databricks/avro from pom - Fix HoodieClientTestUtils from not using scala imports/reflection based conversion etc - Breaking up limitFileSize(), per parquet and hfile base files - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction - Throw UnsupportedException in HFileReader.getRecordKeys() - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise) * Fixing checkstyle Co-authored-by: Vinoth Chandar <vinoth@apache.org>	2020-08-31 08:05:59 -07:00
Udit Mehrotra	8d04268264	[HUDI-1174] Changes for bootstrapped tables to work with presto (#1944 ) The purpose of this pull request is to implement changes required on Hudi side to get Bootstrapped tables integrated with Presto. The testing was done against presto 0.232 and following changes were identified to make it work: Annotation UseRecordReaderFromInputFormat is required on HoodieParquetInputFormat as well, because the reading for bootstrapped tables needs to happen through record reader to be able to perform the merge. On presto side, this annotation is already handled. We need to internally maintain VIRTUAL_COLUMN_NAMES because presto's internal hive version hive-apache-1.2.2 has VirtualColumn as a class, versus the one we depend on in hudi which is an enum. Dependency changes in hudi-presto-bundle to avoid runtime exceptions.	2020-08-12 17:51:31 -07:00
Gary Li	4f74a84607	[HUDI-69] Support Spark Datasource for MOR table - RDD approach (#1848 ) - This PR implements Spark Datasource for MOR table in the RDD approach. - Implemented SnapshotRelation - Implemented HudiMergeOnReadRDD - Implemented separate Iterator to handle merge and unmerge record reader. - Added TestMORDataSource to verify this feature. - Clean up test file name, add tests for mixed query type tests - We can now revert the change made in DefaultSource Co-authored-by: Vinoth Chandar <vchandar@confluent.io>	2020-08-07 00:28:14 -07:00
Balaji Varadarajan	9bcd3221fd	[HUDI-1144] Speedup spark read queries by caching metaclient in HoodieROPathFilter (#1919 )	2020-08-05 09:19:10 -07:00
vinoth chandar	539621bd33	[HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876 ) - [HUDI-418] Bootstrap Index Implementation using HFile with unit-test - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices - [HUDI-421] Bootstrap Write Client with tests - [HUDI-425] Added HoodieDeltaStreamer support - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly - [HUDI-424] Simplify Record reader implementation - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables Co-authored-by: Mehrotra <uditme@amazon.com> Co-authored-by: Vinoth Chandar <vinoth@apache.org> Co-authored-by: Balaji Varadarajan <varadarb@uber.com>	2020-08-03 20:19:21 -07:00
Bhavani Sudha Saktheeswaran	d5b593b7d9	[MINOR] change log.info to log.debug (#1883 )	2020-07-28 09:49:03 -07:00
wenningd	bf1d36fa63	[HUDI-1087] Handle decimal type for realtime record reader with SparkSQL (#1831 ) Co-authored-by: Wenning Ding <wenningd@amazon.com>	2020-07-15 07:30:58 -07:00
Satish Kotha	086853c004	[HUDI-1080] Fix backward compatibility for com.uber inputformats	2020-07-08 15:30:07 -07:00
andreitaleanu	37ea79566d	[HUDI-539] Make HoodieROTablePathFilter implement Configurable (#1784 ) Co-authored-by: Andrei Taleanu <taleanu@adobe.com>	2020-07-03 13:39:53 -07:00
Prashant Wason	2603cfb33e	[HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687 ) Notable changes: 1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format 2. HoodieDataBlock abstraction for creating specific data blocks for base file formats. (e.g. Parquet has HoodieAvroDataBlock) 3. All hardcoded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format 4. HiveSyncTool accepts the base file format as a CLI parameter 5. HoodieDeltaStreamer accepts the base file format as a CLI parameter 6. HoodieSparkSqlWriter accepts the base file format as a parameter	2020-06-25 23:46:55 -07:00
Shen Hong	89e37d5273	[HUDI-908] Add some data types to HoodieTestDataGenerator and fix some bugs. (#1690 )	2020-06-22 08:13:28 -07:00
Gary Li	37838cea60	[HUDI-822] decouple Hudi related logics from HoodieInputFormat (#1592 ) - Refactoring business logic out of InputFormat into Utils helpers.	2020-06-09 06:10:16 -07:00
lw0090	9e07cebece	[HUDI-974] Fix fields out of order in MOR mode when using Hive (#1711 )	2020-06-09 09:22:06 +08:00
Wenning Ding	7d40f19f39	HUDI-515 Resolve API conflict for Hive 2 & Hive 3	2020-06-08 14:18:38 -07:00
Shen Hong	2901f5423a	[HUDI-1002] Ignore case when setting incremental mode in hive query (#1715 )	2020-06-08 19:38:32 +08:00
hj2016	e0a5e0d343	[HUDI-1000] Fix incremental query for COW non-partitioned table with no data (#1708 )	2020-06-08 15:34:42 +08:00
Yajun Luo	a9a97d6af4	[HUDI-934] Add processing logic for the decimal LogicalType (#1677 )	2020-06-02 19:50:55 +08:00
bschell	e21441ad83	Add changes for presto mor queries (#1578 ) Adds the necessary changes to hudi for support of presto querying hudi merge-on-read table's realtime view. Co-authored-by: Brandon Scheller <bschelle@amazon.com>	2020-05-04 11:27:14 -07:00
n3nash	332072bc6d	[HUDI-371] Supporting hive combine input format for realtime tables (#1503 )	2020-04-20 20:40:06 -07:00
Pratyaksh Sharma	6d7ca2cf7e	[HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema (#1427 )	2020-04-12 17:55:26 -07:00
satishkotha	c0f96e0726	[HUDI-687] Stop incremental reader on RO table when there is a pending compaction (#1396 )	2020-04-10 10:45:41 -07:00
Shaofeng Shi	78b3194e82	[HUDI-751] Fix some coding issues reported by FindBugs (#1470 )	2020-03-31 21:19:32 +08:00
Suneel Marthi	fa36082554	[HUDI-746] Reduce build warnings < 10 (#1465 )	2020-03-30 11:46:52 +08:00
vinoth chandar	e057c27603	[HUDI-744] Restructure hudi-common and clean up files under util packages (#1462 ) - Brings more order and cohesion to the classes in hudi-common - Utils classes related to a particular concept (avro, timeline,...) are placed near to the package - common.fs package now contains all the filesystem level classes including wrapper filesystem - bloom.filter package renamed to just bloom - config package contains classes that help store properties - common.fs.inline package contains all the inline filesystem classes/impl - common.table.timeline now consolidates all timeline related classes - common.table.view consolidates all the classes related to filesystem view metadata - common.table.timeline.versioning contains all classes related to versioning of timeline - Fix few unit tests as a result - Moved the test packages around to match the source file move - Rename AvroUtils to TimelineMetadataUtils & minor fixes/typos	2020-03-29 10:58:49 -07:00
Suneel Marthi	8c3001363d	HUDI-479: Eliminate or Minimize use of Guava if possible (#1159 )	2020-03-28 03:11:32 -04:00
vinoth chandar	e3019031d8	[HUDI-539] Make ROPathFilter conf member serializable (#1415 )	2020-03-17 12:52:48 -07:00
bschell	418f9bb2e9	Add constructor to HoodieROTablePathFilter (#1413 ) Allows HoodieROTablePathFilter to accept a configuration for initializing the filesystem. This fixes a bug with Presto's use of this pathfilter. Co-authored-by: Brandon Scheller <bschelle@amazon.com>	2020-03-16 15:19:16 -07:00
Suneel Marthi	99b7e9eb9e	[HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java (#1350 )	2020-03-13 20:28:05 -04:00
Suneel Marthi	24e73816b2	[MINOR] Code Cleanup, remove redundant code (#1337 )	2020-02-15 22:03:29 +08:00
Suneel Marthi	594da28fbf	[HUDI-595] code cleanup, refactoring code out of PR# 1159 (#1302 )	2020-02-04 21:52:03 +08:00
Suneel Marthi	5b7bb142dc	[HUDI-583] Code Cleanup, remove redundant code, and other changes (#1237 )	2020-02-02 18:03:44 +08:00
lamber-ken	c06ec8bfc7	[MINOR] Fix assigning to configuration more times (#1291 )	2020-01-29 17:18:35 -05:00
vinoth chandar	c2c0f6b13d	[HUDI-509] Renaming code in sync with cWiki restructuring (#1212 ) - Storage Type replaced with Table Type (remaining instances) - View types replaced with query types; - ReadOptimized view referred as Snapshot Query - TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views - HoodieDataFile renamed to HoodieBaseFile - Hive Sync tool will register RO tables for MOR with a `_ro` suffix - Datasource/Deltastreamer options renamed accordingly - Support fallback to old config values as well, so migration is painless - Config for controlling _ro suffix addition - Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView	2020-01-16 23:58:47 -08:00
Bhavani Sudha Saktheeswaran	d09eacdc13	[HUDI-25] Optimize HoodieInputformat.listStatus() for faster Hive incremental queries on Hoodie Summary: - InputPathHandler class classifies inputPaths into incremental, non incremental and non hoodie paths. - Incremental queries leverage HoodieCommitMetadata to get partitions that are affected and only lists those partitions as opposed to listing all partitions - listStatus() processes each category separately	2020-01-08 14:53:05 -08:00
vinoth chandar	9706f659db	[HUDI-508] Standardizing on "Table" instead of "Dataset" across code (#1197 ) - Docs were talking about storage types before, cWiki moved to "Table" - Most of code already has HoodieTable, HoodieTableMetaClient - correct naming - Replacing remaining uses of dataset across code/comments - Few usages in comments and use of Spark SQL DataSet remain unscathed	2020-01-07 12:52:32 -08:00
lamber-ken	ab6ae5cebb	[HUDI-482] Fix missing @Override annotation on methods (#1156 )	2019-12-31 11:44:56 +08:00
lamber-ken	ba514cfea0	[MINOR] Remove redundant plus operator (#1097 )	2019-12-12 05:42:05 +08:00
lamber-ken	d447e2d751	[checkstyle] Unify LOG form (#1092 )	2019-12-10 19:23:38 +08:00
lamber-ken	2745b7552f	[HUDI-379] Refactor the codes based on new JavadocStyle code style rule (#1079 )	2019-12-06 12:59:28 +08:00
谢磊	f9139c0f61	[HUDI-366] Refactor some module codes based on new ImportOrder code style rule (#1055 ) [HUDI-366] Refactor hudi-hadoop-mr / hudi-timeline-service / hudi-spark / hudi-integ-test / hudi-utilities based on new ImportOrder code style rule	2019-11-27 21:32:43 +08:00
谢磊	804e348d0e	[HUDI-346] Set allowMultipleEmptyLines to false for EmptyLineSeparator rule (#1025 )	2019-11-19 18:44:42 +08:00
Nishith Agarwal	3a05edab01	- Fixing RT queries for HiveOnSpark that causes race conditions - Adding more comments to understand usage of reader/writer schema	2019-11-16 13:46:47 -08:00
Wenning Ding	b6057c5e0e	[HUDI-314] Fix multi partition keys error when querying a realtime table	2019-11-02 19:49:04 -07:00
Wenning Ding	ee0fd06de7	synchronized lock on conf object instead of class	2019-10-31 21:54:27 -07:00
Wenning Ding	3251d62bd3	[HUDI-313] Fix select count star error when querying a realtime table	2019-10-31 21:54:27 -07:00
Udit Mehrotra	12523c379f	[HUDI-298] Fix issue with incorrect column mapping causing bad data, during on-the-fly merge of Real Time tables (#956 )	2019-10-16 02:05:53 -07:00


105 Commits