Commit Graph

110 Commits

Author SHA1 Message Date
rmahindra123
8fef50e237 [HUDI-2044] Integrate consumers with rocksDB and compression within External Spillable Map (#3318) 2021-07-28 01:31:03 -04:00
Danny Chan
ac75bda929 [HUDI-1969] Support reading logs for MOR Hive rt table (#3033) 2021-07-13 23:43:30 -07:00
pengzhiwei
ca440ccf88 [HUDI-2107] Support Read Log Only MOR Table For Spark (#3193) 2021-07-12 17:31:23 +08:00
wenningd
d412fb2fe6 [HUDI-89] Add configOption & refactor all configs based on that (#2833)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2021-06-30 14:26:30 -07:00
s-sanjay
0fb8556b0d Add ability to provide multi-region (global) data consistency across HMS in different regions (#2542)
[global-hive-sync-tool] Add a global hive sync tool to sync hudi tables across clusters. Add a way to roll back the replicated timestamp if we fail to sync or only partly sync.

Co-authored-by: Jagmeet Bali <jsbali@uber.com>
2021-06-24 20:26:26 -07:00
Wei
7865da1e15 [MINOR] Fix Javadoc wrong references (#3115) 2021-06-18 21:51:54 -07:00
Jintao Guan
b8fe5b91d5 [HUDI-764] [HUDI-765] ORC reader writer Implementation (#2999)
Co-authored-by: Qingyun (Teresa) Kang <kteresa@uber.com>
2021-06-15 15:21:43 -07:00
Danny Chan
c2383ee904 [HUDI-1967] Fix the NPE for MOR Hive rt table query (#3032)
HoodieInputFormatUtils.getTableMetaClientByBasePath returns a map keyed by
the table base path, while HoodieRealtimeInputFormatUtils queries it with
the partition path.
2021-06-05 01:06:34 -07:00
xiarixiaoyao
081061e14b [HUDI-1719] Hive on Spark/MR: incremental query of the MOR table returns an incorrect partition field (#2720) 2021-05-20 11:00:08 -04:00
xiarixiaoyao
6f7ff7e8ca [HUDI-1722] Fix NPE when Hive beeline/spark-sql queries specified fields on a MOR table (#2722) 2021-05-12 20:52:37 +08:00
TeRS-K
be9db2c4f5 [HUDI-1055] Remove hardcoded parquet in tests (#2740)
* Remove hardcoded parquet in tests
* Use DataFileUtils.getInstance
* Renaming DataFileUtils to BaseFileUtils

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2021-05-11 10:01:45 -07:00
jsbali
aa398f77f1 [HUDI-1789] Support reading older snapshots (#2809)
* [HUDI-1789] In HoodieParquetInputFormat we currently default to the latest version of base files.
This PR attempts to add a new jobConf
 `hoodie.%s.consume.snapshot.time`

This new config will allow us to read older snapshots.

-  Reusing hoodie.%s.consume.commit for point in time snapshot queries as well.
-  Adding javadocs and some more tests
2021-05-10 15:26:49 -07:00
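A minimal sketch of how such a job-level property could be set for a Hive/MapReduce read, assuming the templated key from the commit above with a hypothetical table name substituted in:

```java
import org.apache.hadoop.mapred.JobConf;

public class SnapshotReadConfigSketch {
    public static void main(String[] args) {
        JobConf jobConf = new JobConf();
        // "my_hudi_table" is a hypothetical table name substituted into the
        // templated property key described in the commit message above.
        String table = "my_hudi_table";
        // Point the read at an older snapshot instead of defaulting to the
        // latest version of the base files.
        jobConf.set(String.format("hoodie.%s.consume.snapshot.time", table), "20210101120000");
    }
}
```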
xiarixiaoyao
1db904a12e [HUDI-1718] Incremental view query fails for a MOR table with multi-level partitions (#2716) 2021-05-05 00:34:20 -04:00
Raymond Xu
faf3785a2d [HUDI-1811] Fix TestHoodieRealtimeRecordReader (#2873)
Pass basePath with scheme 'file://' to HoodieRealtimeFileSplit
2021-04-30 11:16:55 -07:00
xiarixiaoyao
929eca43fe [HUDI-1817] Fix incorrect partition path when running an incremental query via spark-sql (#2858) 2021-04-30 14:57:52 +08:00
xiarixiaoyao
65844a8d29 [HUDI-1720] Fix RealtimeCompactedRecordReader StackOverflowError (#2721) 2021-04-13 18:23:26 +08:00
xiarixiaoyao
02073235c3 [HUDI-1662] Fix hive date type conversion for mor table (#2634) 2021-03-08 12:16:13 +08:00
satishkotha
7cc75e0be2 [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (#2611) 2021-03-04 17:43:31 -08:00
n3nash
ffcfb58bac [HUDI-1486] Remove inline inflight rollback in hoodie writer (#2359)
1. Refactor rollback and move cleaning failed commits logic into cleaner
2. Introduce hoodie heartbeat to ascertain failed commits
3. Fix test cases
2021-02-19 20:12:22 -08:00
Sivabalan Narayanan
c9fcf964b2 [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (#2534) 2021-02-20 09:54:26 +08:00
satishkotha
0d91c451b0 [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (#2583) 2021-02-19 15:45:43 -08:00
vinoth chandar
5ca0625b27 [HUDI-1308] Harden RFC-15 Implementation based on production testing (#2441)
Addresses leaks, perf degradation observed during testing. These were regressions from the original rfc-15 PoC implementation.

* Pass a single instance of HoodieTableMetadata everywhere
* Fix tests and add config for enabling metrics
 - Removed special casing of assumeDatePartitioning inside FSUtils#getAllPartitionPaths()
 - Consequently, IOException is never thrown and many files had to be adjusted
 - More diligent handling of open file handles in metadata table
 - Added config for controlling reuse of connections
 - Added config for turning off fallback to listing, so we can see tests fail
 - Changed all ipf listing code to cache/amortize the open/close for better performance
 - Timelineserver also reuses connections, for better performance
 - Without timelineserver, when metadata table is opened from executors, reuse is not allowed
 - HoodieMetadataConfig passed into HoodieTableMetadata#create as argument.
 - Fix TestHoodieBackedTableMetadata#testSync
2021-01-19 21:20:28 -08:00
n3nash
749f657856 [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (#2424) 2021-01-14 12:47:50 -08:00
Udit Mehrotra
7ce3ac778e [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths (#2417)
* [HUDI-1479] Use HoodieEngineContext to parallelize fetching of partition paths

* Adding testClass for FileSystemBackedTableMetadata

Co-authored-by: Nishith Agarwal <nagarwal@uber.com>
2021-01-10 21:19:52 -08:00
Gary Li
79ec7b4894 [HUDI-920] Support Incremental query for MOR table (#1938) 2021-01-09 08:02:08 -08:00
rmpifer
1a0579ca7d [HUDI-1312] [RFC-15] Support for metadata listing for snapshot queries through Hive/SparkSQL (#2366)
Co-authored-by: Ryan Pifer <ryanpife@amazon.com>
2021-01-04 07:59:47 -08:00
Udit Mehrotra
4e64226844 [HUDI-1450] Use metadata table for listing in HoodieROTablePathFilter (apache#2326)
[HUDI-1394] [RFC-15] Use metadata table (if present) to get all partition paths (apache#2351)
2021-01-04 07:59:47 -08:00
Gary Li
605b617cfa [HUDI-1434] fix incorrect log file path in HoodieWriteStat (#2300)
* [HUDI-1434] fix incorrect log file path in HoodieWriteStat

* HoodieWriteHandle#close() returns a list of WriteStatus objs

* Handle rolled-over log files and return a WriteStatus per log file written

 - Combined data and delete block logging into a single call
 - Lazily initialize and manage write status based on returned AppendResult
 - Use FSUtils.getFileSize() to set final file size, consistent with other handles
 - Added tests around returned values in AppendResult
 - Added validation of the file sizes returned in write stat

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-12-30 14:22:15 -08:00
Balaji Varadarajan
3ec9270e8e [HUDI-1490] Incremental Query should work even when there are partitions that have no incremental changes (#2371)
* Incremental Query should work even when there are partitions that have no incremental changes

Co-authored-by: Sivabalan Narayanan <sivabala@uber.com>
2020-12-26 12:17:49 -05:00
Danny Chan
4bc45a391a [HUDI-1445] Refactor AbstractHoodieLogRecordScanner to use Builder (#2313) 2020-12-10 20:02:02 +08:00
lw0090
1f0d5c077e [HUDI-1349] spark sql: support overwrite using insert_overwrite_table (#2196) 2020-12-03 12:26:21 -08:00
lw0090
5f5c15b0d9 [HUDI-892] RealtimeParquetInputFormat skips adding projection columns if there are no log files (#2190)
* [HUDI-892] RealtimeParquetInputFormat skips adding projection columns if there are no log files
* [HUDI-892] for test
* [HUDI-892] fix bug generating array from split
* [HUDI-892] revert test log
2020-11-02 20:00:12 -08:00
Raymond Xu
c5e10d668f [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable (#2167)
Remove APIs in `HoodieTestUtils`
- `createCommitFiles`
- `createDataFile`
- `createNewLogFile`
- `createCompactionRequest`

Migrated usages in `TestCleaner#testPendingCompactions`.

Also improved some API names in `HoodieTestTable`.
2020-10-12 14:39:10 +08:00
Raymond Xu
3201665295 [HUDI-995] Use HoodieTestTable in more classes (#2079)
* [HUDI-995] Use HoodieTestTable in more classes

Migrate test data prep logic in
- TestStatsCommand
- TestHoodieROTablePathFilter

Re-implement methods for creating new commit times in HoodieTestUtils and HoodieClientTestHarness
- Move relevant APIs to HoodieTestTable
- Migrate usages

After changing to HoodieTestTable APIs, removed unused deprecated APIs in HoodieTestUtils
2020-09-17 09:29:07 -07:00
Gary Li
48a58c98a1 [MINOR] fix get classname for hive sync (#2008) 2020-08-31 16:26:10 -07:00
Prashant Wason
6461927eac [HUDI-960] Implementation of the HFile base and log file format. (#1804)
* [HUDI-960] Implementation of the HFile base and log file format.

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input format for Hive
3. Unit test for new code
4. IT for using HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage:
HFile file format saves data as binary key-value pairs. This implementation chooses the following values:
1. Key = Hoodie Record Key (as bytes)
2. Value = Avro encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or range of keys. Hence, this base file format is well suited to applications like RFC-15, RFC-08 which will benefit from the ability to lookup records by key or search in a range of keys without having to read the entire data/log format.

Limitations:
HFile storage format has certain limitations when used as a general purpose data storage format.
1. Does not have an implemented reader for Presto and SparkSQL
2. Is not a columnar file format and hence may lead to lower compression levels and greater IO on the query side due to lack of column pruning


Other changes: 
 - Remove databricks/avro from pom
 - Fix HoodieClientTestUtils to not use scala imports/reflection-based conversion, etc.
 - Breaking up limitFileSize(), per parquet and hfile base files
 - Added three new configs for HoodieHFileConfig - prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
 - Throw UnsupportedException in HFileReader.getRecordKeys()
 - Updated HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
2020-08-31 08:05:59 -07:00
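A minimal sketch of the key/value layout described above, assuming plain Avro binary encoding for the value; the class and method names here are illustrative, not the actual HFileWriter API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class HFileKeyValueSketch {
    // Key: the Hoodie record key as bytes.
    static byte[] keyBytes(String recordKey) {
        return recordKey.getBytes(StandardCharsets.UTF_8);
    }

    // Value: the Avro-encoded GenericRecord as bytes.
    static byte[] valueBytes(GenericRecord record, Schema schema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```

Because HFile keeps entries sorted by key, records stored this way can be looked up by record key or key range without scanning the whole file, which is the property the commit calls out for RFC-15/RFC-08 style use cases.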
Udit Mehrotra
8d04268264 [HUDI-1174] Changes for bootstrapped tables to work with presto (#1944)
The purpose of this pull request is to implement the changes required on the Hudi side to get bootstrapped tables integrated with Presto. Testing was done against Presto 0.232 and the following changes were identified to make it work:

The annotation UseRecordReaderFromInputFormat is required on HoodieParquetInputFormat as well, because reading bootstrapped tables needs to happen through the record reader to be able to perform the merge. On the Presto side, this annotation is already handled.

We need to internally maintain VIRTUAL_COLUMN_NAMES because Presto's internal hive version hive-apache-1.2.2 has VirtualColumn as a class, versus the one we depend on in Hudi, which is an enum.

Dependency changes in hudi-presto-bundle to avoid runtime exceptions.
2020-08-12 17:51:31 -07:00
Gary Li
4f74a84607 [HUDI-69] Support Spark Datasource for MOR table - RDD approach (#1848)
- This PR implements Spark Datasource for the MOR table using the RDD approach.
- Implemented SnapshotRelation
- Implemented HudiMergeOnReadRDD
- Implemented separate iterators to handle the merged and unmerged record readers.
- Added TestMORDataSource to verify this feature.
- Clean up test file name, add tests for mixed query type tests
 - We can now revert the change made in DefaultSource

Co-authored-by: Vinoth Chandar <vchandar@confluent.io>
2020-08-07 00:28:14 -07:00
Balaji Varadarajan
9bcd3221fd [HUDI-1144] Speedup spark read queries by caching metaclient in HoodieROPathFilter (#1919) 2020-08-05 09:19:10 -07:00
vinoth chandar
539621bd33 [HUDI-242] Support for RFC-12/Bootstrapping of external datasets to hudi (#1876)
- [HUDI-418] Bootstrap Index Implementation using HFile with unit-test
 - [HUDI-421] FileSystem View Changes to support Bootstrap with unit-tests
 - [HUDI-424] Implement Query Side Integration for querying tables containing bootstrap file slices
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-421] Bootstrap Write Client with tests
 - [HUDI-425] Added HoodieDeltaStreamer support
 - [HUDI-899] Add a knob to change partition-path style while performing metadata bootstrap
 - [HUDI-900] Metadata Bootstrap Key Generator needs to handle complex keys correctly
 - [HUDI-424] Simplify Record reader implementation
 - [HUDI-423] Implement upsert functionality for handling updates to these bootstrap file slices
 - [HUDI-420] Hoodie Demo working with hive and sparkSQL. Also, Hoodie CLI working with bootstrap tables

Co-authored-by: Mehrotra <uditme@amazon.com>
Co-authored-by: Vinoth Chandar <vinoth@apache.org>
Co-authored-by: Balaji Varadarajan <varadarb@uber.com>
2020-08-03 20:19:21 -07:00
Raymond Xu
10e4268792 [HUDI-995] Use Transformations, Assertions and SchemaTestUtil (#1884)
- Consolidate transform functions for tests in Transformations.java
- Consolidate assertion functions for tests in Assertions.java
- Make use of SchemaTestUtil for loading schema from resource
2020-08-01 20:57:18 +08:00
Bhavani Sudha Saktheeswaran
d5b593b7d9 [MINOR] change log.info to log.debug (#1883) 2020-07-28 09:49:03 -07:00
wenningd
bf1d36fa63 [HUDI-1087] Handle decimal type for realtime record reader with SparkSQL (#1831)
Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-07-15 07:30:58 -07:00
Satish Kotha
086853c004 [HUDI-1080] Fix backward compatibility for com.uber inputformats 2020-07-08 15:30:07 -07:00
andreitaleanu
37ea79566d [HUDI-539] Make HoodieROTablePathFilter implement Configurable (#1784)
Co-authored-by: Andrei Taleanu <taleanu@adobe.com>
2020-07-03 13:39:53 -07:00
Prashant Wason
2603cfb33e [HUDI-684] Introduced abstraction for writing and reading different types of base file formats. (#1687)
Notable changes:
    1. HoodieFileWriter and HoodieFileReader abstractions for writer/reader side of a base file format
    2. HoodieDataBlock abstraction for creating data blocks specific to each base file format. (e.g. Parquet has HoodieAvroDataBlock)
    3. All hardcoded references to Parquet / Parquet based classes have been abstracted to call methods which accept a base file format
    4. HiveSyncTool accepts the base file format as a CLI parameter
    5. HoodieDeltaStreamer accepts the base file format as a CLI parameter
    6. HoodieSparkSqlWriter accepts the base file format as a parameter
2020-06-25 23:46:55 -07:00
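A rough sketch of what a pluggable base-file-format reader/writer abstraction can look like; the interface and method names below are hypothetical, not the actual HoodieFileWriter/HoodieFileReader signatures:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.avro.generic.GenericRecord;

// Hypothetical writer-side abstraction: one implementation per base file format
// (e.g. a Parquet-backed or HFile-backed writer).
interface BaseFileFormatWriter extends AutoCloseable {
    void write(String recordKey, GenericRecord record) throws IOException;
    boolean canWrite(); // e.g. false once the configured file size limit is reached
}

// Hypothetical reader-side abstraction over a single base file.
interface BaseFileFormatReader extends AutoCloseable {
    Iterator<GenericRecord> readAll() throws IOException;
    long getTotalRecords();
}
```

Callers such as handles and data blocks would then be written against interfaces of this shape and pick an implementation from the configured base file format, which is the effect the commit describes for the Parquet-specific call sites.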
Shen Hong
89e37d5273 [HUDI-908] Add some data types to HoodieTestDataGenerator and fix some bugs. (#1690) 2020-06-22 08:13:28 -07:00
Satish Kotha
a7fd331624 Add unit test for snapshot reads in hadoop-mr 2020-06-13 10:23:05 -07:00
Gary Li
37838cea60 [HUDI-822] decouple Hudi related logics from HoodieInputFormat (#1592)
- Refactoring business logic out of InputFormat into Utils helpers.
2020-06-09 06:10:16 -07:00
lw0090
9e07cebece [HUDI-974] Fix fields out of order in MOR mode when using Hive (#1711) 2020-06-09 09:22:06 +08:00