Commit Graph

45 Commits

Author SHA1 Message Date
Y Ethan Guo
20781a5fa6 [DOCS] Add commit activity, twitter badgers, and Hudi logo in README (#5336) 2022-04-22 16:51:07 +08:00
Raymond Xu
2e6e302efe [HUDI-3859] Fix spark profiles and utilities-slim dep (#5297) 2022-04-12 15:33:08 -07:00
Raymond Xu
84783b9779 [HUDI-3843] Make flink profiles build with scala-2.11 (#5279) 2022-04-12 08:33:48 -07:00
Y Ethan Guo
f6ff95f97c [MINOR][DOCS] Update hudi-utilities-slim-bundle docs (#5184) 2022-03-30 21:48:54 -07:00
Y Ethan Guo
4fb1a590b1 [HUDI-3700] Add hudi-utilities-slim-bundle excluding hudi-spark-datasource modules (#5176) 2022-03-30 18:08:35 -07:00
Y Ethan Guo
9830005e9b [HUDI-3681] Provision additional hudi-spark-bundle with different versions (#5171) 2022-03-30 17:35:56 -07:00
Alexey Kudinkin
e5a2baeed0 [HUDI-3549] Removing dependency on "spark-avro" (#4955)
Hudi will be taking on a promise for its bundles to stay compatible with Spark minor versions (e.g. 2.4, 3.1, 3.2), meaning that a single build of Hudi (e.g. "hudi-spark3.2-bundle") will be compatible with ALL patch versions in that minor branch (in this case 3.2.0, 3.2.1, etc.)

To achieve that we'll have to remove (and ban) "spark-avro" as a dependency, which on a few occasions was the root cause of incompatibility between consecutive Spark patch versions (most recently 3.2.0 and 3.2.1, due to this PR).

Instead of bundling "spark-avro" as a dependency, we will be copying over some of the classes Hudi depends on and maintaining them alongside the Hudi code-base to make sure we're able to provide the aforementioned guarantee. To work around arising compatibility issues we will be applying local patches to guarantee compatibility of Hudi bundles within the Spark minor version branches.

The following mapping of Hudi modules to Spark minor branches is currently maintained:

- "hudi-spark3" -> 3.2.x
- "hudi-spark3.1.x" -> 3.1.x
- "hudi-spark2" -> 2.4.x

The following class hierarchies (borrowed from "spark-avro") are maintained within these Spark-specific modules to guarantee compatibility with the respective minor version branches:

- AvroSerializer
- AvroDeserializer
- AvroUtils
Each of these classes has been copied from Spark 3.2.1 (for the 3.2.x branch), 3.1.2 (for the 3.1.x branch), and 2.4.4 (for the 2.4.x branch) into its respective module.

The SchemaConverters class, in turn, is shared across all these modules given its relative stability (there are only cosmetic changes from 2.4.4 to 3.2.1).
All of the aforementioned classes have their visibility limited to the corresponding packages (org.apache.spark.sql.avro, org.apache.spark.sql) to make sure the broader code-base does not become dependent on them and instead relies on facades abstracting them.

Additionally, given that Hudi plans on supporting all patch versions of Spark within the aforementioned minor version branches, additional build steps were added to validate that Hudi compiles properly against those versions. Testing, however, is performed against the most recent patch versions of Spark with the help of Azure CI.
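The facade approach described above can be sketched roughly as follows. This is a minimal illustration only, under assumed names: AvroSerializerFacade and Spark32AvroSerializer are hypothetical and not Hudi's actual classes.

```java
// Minimal sketch of the facade idea described above. Names are hypothetical,
// not Hudi's actual API: the broader code base depends only on the facade,
// while each Spark-specific module supplies an implementation backed by the
// classes copied from that Spark minor branch.
interface AvroSerializerFacade {
    byte[] serialize(Object row);
}

// One implementation per supported Spark minor branch (e.g. a 3.2.x module
// wrapping the AvroSerializer copied from Spark 3.2.1).
class Spark32AvroSerializer implements AvroSerializerFacade {
    @Override
    public byte[] serialize(Object row) {
        // In the real module this would delegate to the copied
        // org.apache.spark.sql.avro.AvroSerializer; stubbed out here.
        return new byte[0];
    }
}
```

Since only the facade is visible outside the Spark-specific modules, patch-version incompatibilities in the copied classes stay contained within each module.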

Brief change log:
- Removing spark-avro bundling from Hudi by default
- Scaffolded Spark 3.2.x hierarchy
- Bootstrapped Spark 3.1.x Avro serializer/deserializer hierarchy
- Bootstrapped Spark 2.4.x Avro serializer/deserializer hierarchy
- Moved ExpressionCodeGen, ExpressionPayload into hudi-spark module
- Fixed AvroDeserializer to stay compatible w/ both Spark 3.2.1 and 3.2.0
- Modified bot.yml to build full matrix of supported Spark versions
- Removed "spark-avro" dependency from all modules
- Fixed relocation of spark-avro classes in bundles to assist in running integ-tests.
2022-03-29 14:44:47 -04:00
Yann Byron
0c950181aa [HUDI-3423] upgrade spark to 3.2.1 (#4815) 2022-02-21 16:52:21 -08:00
Raymond Xu
f74cd57320 [HUDI-3195] Fix spark 3 pom (#4554)
- drop 3.0.x profile
- update readme
- update build CI bot.yml
- fix spark 3 bundle name
2022-01-10 19:11:22 -08:00
Raymond Xu
0273f2e65d [MINOR] Update README.md (#4492)
Update Spark 3 build instructions
2022-01-02 20:34:37 -08:00
Alexey Kudinkin
cbcbec4d38 [MINOR] Fixed checkstyle config to be based off Maven root-dir (requires Maven >=3.3.1 to work properly); (#4009)
Updated README
2021-11-16 21:30:16 -05:00
vinoth chandar
ea59a7ff5f [HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409)
Update Azure CI ubuntu from 16.04 to 18.04, as 16.04 will be removed soon

Fixed some consistently failed tests

* fix TestCOWDataSourceStorage TestMORDataSourceStorage
* reset mocks

Also update readme badge



Co-authored-by: Raymond Xu <2701446+xushiyan@users.noreply.github.com>
2021-09-07 09:44:30 -07:00
Raymond Xu
f922837064 [HUDI-1950] Fix Azure CI failure in TestParquetUtils (#2984)
* fix azure pipeline configs

* add pentaho.org in maven repositories

* Make sure file paths with scheme in TestParquetUtils

* add azure build status to README
2021-06-15 03:45:17 -07:00
Harshit Mittal
e692c704da [MINOR] Fix deprecated build link for travis (#2778) 2021-04-07 08:57:10 +08:00
vinoyang
302bd29dab [MINOR] Add clustering to feature list (#2568) 2021-02-13 07:39:14 -08:00
wenningd
fce1453fa6 [HUDI-1040] Make Hudi support Spark 3 (#2208)
* Fix flaky MOR unit test

* Update Spark APIs to make it be compatible with both spark2 & spark3

* Refactor bulk insert v2 part to make Hudi be able to compile with Spark3

* Add spark3 profile to handle fasterxml & spark version

* Create hudi-spark-common module & refactor hudi-spark related modules

Co-authored-by: Wenning Ding <wenningd@amazon.com>
2020-12-09 15:52:23 -08:00
Raymond Xu
111a9753a0 [MINOR] Update README.md (#2010)
- add maven profile to test running commands
- remove -DskipITs for packaging commands
2020-08-24 09:28:29 -07:00
Prashant Wason
b71f25f210 [HUDI-92] Provide reasonable names for Spark DAG stages in HUDI. (#1289) 2020-07-19 10:29:25 -07:00
leesf
492f324bc7 [HUDI-938] Removing incubating/incubator from project (#1658) 2020-05-24 18:28:13 +08:00
vinoth chandar
45acccdb8a [MINOR] Remove incubating from README 2020-05-23 14:51:58 -07:00
Bhavani Sudha Saktheeswaran
5f85c26704 [HUDI-584] Relocate spark-avro dependency by maven-shade-plugin (#1290) 2020-03-04 11:01:49 -08:00
lamber-ken
425e3e6c78 [HUDI-585] Optimize the steps of building with scala-2.12 (#1293) 2020-02-05 23:13:10 +08:00
lamber-ken
56a4e0d56c [MINOR] Fix invalid issue url & quickstart url (#1282) 2020-01-27 16:53:03 +08:00
Bhavani Sudha Saktheeswaran
1e79cbc259 [HUDI-549] update Github README with instructions to build with Scala 2.12 (#1275) 2020-01-22 23:43:50 -08:00
vinoth chandar
c2c0f6b13d [HUDI-509] Renaming code in sync with cWiki restructuring (#1212)
- Storage Type replaced with Table Type (remaining instances)
 - View types replaced with query types;
 - ReadOptimized view referred as Snapshot Query
 - TableFileSystemView sub interfaces renamed to BaseFileOnly and Slice Views
 - HoodieDataFile renamed to HoodieBaseFile
 - Hive Sync tool will register RO tables for MOR with a `_ro` suffix
 - Datasource/Deltastreamer options renamed accordingly
 - Support fallback to old config values as well, so migration is painless
 - Config for controlling _ro suffix addition
 - Renaming DataFile to BaseFile across DTOs, HoodieFileSlice and AbstractTableFileSystemView
2020-01-16 23:58:47 -08:00
Y Ethan Guo
480fc7869d [HUDI-319] Add a new maven profile to generate unified Javadoc for all Java and Scala classes (#1195)
* Add javadoc build command in README, links to javadoc plugin and rename profile.
* Make java version configurable in one place.
2020-01-08 10:38:09 -08:00
lamber-ken
7498ca71cb [MINOR] Add slack invite icon in README (#1108) 2019-12-17 15:15:03 +08:00
leesf
845e261658 [MINOR] Update some urls from http to https in the README file (#1074) 2019-12-03 17:35:27 +08:00
lamber-ken
89f0968097 [DOCS] Update the build source link (#1071) 2019-12-03 09:56:09 +08:00
lamber-ken
ff688107fa [DOCS] Update Hudi Readme (#1058)
- Add build status 
- Clean up layout
2019-12-02 09:25:43 -08:00
vinoth chandar
1483b97018 [DOCS] Change Hudi acronyms to plural 2019-11-10 12:39:58 -08:00
Jeff G
1ce3d891ce [DOCS] Update to align with original Uber whitepaper (#999) 2019-11-10 12:38:13 -08:00
Bhavani Sudha Saktheeswaran
04834817c8 [MINOR] Add features and instructions to build Hudi in README (#992) 2019-11-03 01:48:06 -08:00
Anurag870
c052167c06 [Docs] Update README.md (#955) 2019-10-13 21:02:25 -07:00
Balaji Varadarajan
834c591955 [MINOR] Add incubating to NOTICE and README.md
2019-10-09 21:42:29 -07:00
Balaji Varadarajan
6e8a28bcae HUDI-121 : Address comments during RC2 voting
1. Remove dnl utils jar from git
2. Add LICENSE Headers in missing files
3. Fix NOTICE and LICENSE in all HUDI packages and in top-level
4. Fix License wording in certain HUDI source files
5. Include non java/scala code in RAT licensing check
6. Use whitelist to include dependencies as part of timeline-server bundling
2019-09-30 15:42:15 -07:00
vinoyang
68464c7d02 [HUDI-181] Fix the Bold markdown grammar issue of README file (#808) 2019-07-30 03:47:53 -07:00
vinoth chandar
8c9980f4f5 Update README.md 2019-06-17 18:19:34 -07:00
vinoth chandar
ef0d6f2218 Update site url in README 2019-02-15 21:28:39 -08:00
vinoth chandar
a1c0d0dbad Update README.md
Reflect hudi
2017-12-10 07:50:37 -08:00
Yash Sharma
d6f94b998d Hoodie operability with S3 2017-03-28 05:08:54 -07:00
Vinoth Chandar
40a63fcab4 Shorten README and point to site 2017-01-09 11:30:46 -08:00
vinoth chandar
e91cfe68e7 Update README.md 2016-12-20 16:51:58 -08:00
prazanna
41905b7903 Added contributing section 2016-12-19 13:01:37 -08:00
Vinoth Chandar
81874a8406 Importing Hoodie Client from internal repo
Abbreviated History:
* 25c6991 Removed non-opensource modules
* a62abf3 Removing email from pom.xml
* 0931b68 Misspelt in the copyright
* c1cac7d Preperation for OSS: Added License and rat plugin check. Also added meta information about the project in pom.xml
* 16b07b3 Preparation of OSS - Remove hoodie specific URL from hoodie cli
* fd3e0dd Small code cleanups
* 8aa7e34 Adding a de-duplication command to CLI
* b464842 Adding a de-duplication command to CLI
* 59265b1 RegisterDataset should pass the right zkNodeName after the support for multiple databases added
* b295f70 [maven-release-plugin] prepare for next development iteration
* 1006e4b [maven-release-plugin] prepare release hoodie-0.2.4
* 4c99437 Move to using hdrone release 0.7.4
* 1891939 Auto tuning the buckets needed for plain inserts also  - Off by default for now  - Enhanced an existing unit test
* b4563bd Change HoodieReadClient to use commit metadata for incremental pull
* ee20183 Add full file path onto HoodieWriteStat  - This will become an issue later on for incremental processing use cases  - Tested with cli, that is able to read older HoodieCommitMetadata
* 7dcd5d5 Address skew in cleaner work distribution
* 8d7c15d Fixing bug around partial failures of rollback
* d4ada1d Empty RDD should not throw java.lang.IllegalArgumentException: Positive number of slices required
* 076bea9 Dont clean if there are no partitions to clean
* c014f80 Minor changes to SQLStreamer
* a96d4df Minor changes to SQLStreamer
* bc289cc [maven-release-plugin] prepare for next development iteration
* 4160107 [maven-release-plugin] prepare release hoodie-0.2.3
* 409b07a [maven-release-plugin] prepare for next development iteration
* 3d71514 [maven-release-plugin] prepare release hoodie-0.2.2
* 4969d52 Fix test failures
* ac62609 Implement Review Comments for: Parallelize cleaning and including cleaning time and commit archival time in commit time graphite reporting
* cebe65a Parallelize cleaning and including cleaning time and commit archival time in commit time graphite reporting
* 2e5b372 Migrating to CDH 5.7.2
* 899ae12 Remove filtering of /tmp/hive/hive paths from HoodieInputFormat. This fixes Join with temporary tables with HoodieCombineHiveInputFormat
* 69a68f6 Implement equals and hashCode for HoodieTableMetadata, its used in hash based structures
* 12d29c6 Update hive staging url
* 1c5c88a Copy filterExists to WriteClient
* 76aee67 [maven-release-plugin] prepare for next development iteration
* 1f0a715 [maven-release-plugin] prepare release hoodie-0.2.1
* dbfd1d4 HoodieReadClient and HoodieWriteClient separation
* c39a98b Revamped HoodieRecordPayload API that supports merging of old & new values during update
* 79e5bbd Add a helper to configure SparkConf for SparkSQL on Hoodie tables
* f56f423 [maven-release-plugin] prepare for next development iteration
* 780fc44 [maven-release-plugin] prepare release hoodie-0.2
* 1ea2238 Modifying the git utl
* b0af8dc Depending on hdrone release version
* 7753693 Removing a System.out.println which got in by mistake
* 1f5b019 Adding HBase Config to HoodieClientConfig
* 2fce97f Implement Review comments and merge into master
* f389820 Bunch of API changes
* 909a856 HoodieClientConfig split up and revamp
* c2ad946 Fix TestHoodieClient to not double persist in testFilterExists
* 3ab0da6 Fix breaking test
* 2860542 CR feedback for small inserts turned to updates
* 0dfce57 Small inserts are now turned into upserts
* bb1a8b3 Add filterExist API for Hoodie Records
* d983c24 Implement review comments
* c0bd5d1 Implement HoodieClient.checkExists()
* db078f6 Pick up HoodieTable based on hoodie.properties
* ad023e9 Refactor upsert() using HoodieTable interface
* ee9b9b3 Refactor upsert() using HoodieTable interface
* 2d6fdc9 Adding a utility to generate the percentage of updates in commit
* ea3ad58 Adding additional optimizations to remove similar queries from the perf test (using levenshtein distance)
* 1e443a0 Add test case for the added support for SchemaEvolution during updates
* 1cadcbb Add more logging
* 6163dfe Parquet read of old file should have the right read schema specified
* 29c746a Few fixes in ReduceByKey parallelism, HoodieInputFormat.filterFiles for non-hoodie paths and more logging in upsert schema issues
* 5a33af6 Fixing an issue in HoodieReader, target temp directory not created
* 09a5e8e Adding more logging in HoodieReader
* 1474250 Adding more logging in HoodieReader
* a3b0567 Make targetDb not required in HoodieReader
* e9c08b9 Setting the inputformat as the CombineHiveInputFormat in the HoodieReader
* 61c75d2 Hoodie Query Performance: Add Support for CombineHiveInputFormat and implement CombineFileInputFormat
* 38c6e44 Improvements to Hoodie Reader
* ac7398a Add totalWriteErrors to HoodieCommitMetadata
* fc0536e Change archive location to be under .hoodie
* e313294 Implement Hive Perf comparison for Hoodie and non-Hoodie datasets
* 17cfe2a Fix bug in HoodieInputFormat, where it filters out files from archived commits
* 30de990 Add note about showpartitions command to README
* 8634ffb Add commits showpartitions command to show break down per partition
* 324b24e Adding a CLI command to print file size stats
* 56532ff T484792. Deterministically report metrics during shutdown
* 3571768 Fixes to Hoodie Cleaner. Upgrade HDrone version. Changes to HoodieReader.
* a02c97f Bumping  hdrone-api to 0.7.2
* b29ce67 Bug in RegisterDataset dataset creation
* 5a15a9a Fixing bug in cleaning up partial files
* dbf6669 Comment out predicate pushdown test
*   44ed4d1 Merge branch 'lazyitr-fixes-1'
|\
| * e913d3b Fixing bug in LazyInsertIterable
| * 8a1fecd Wrapping upsert() inside HoodieUpsertException
| * 39cfe39 Fixing bug in LazyInsertIterable  - Return a List<WriteStatus> to handle last record in itr, belonging to a separate file  - Remove insert() related code form UpsertMapFunction
| * 00252e5 Making TestHoodieBloomIndex less flaky
* | 6f2d417 Making TestHoodieBloomIndex less flaky
* | 63ebbdc fs.mkdirs does not honor permission umask passed. Need to use the static method FileSystem.mkdirs for that.
* | f49ef67 Adding more logging to Hoodie Reader
* | 9f5a699 Fixing permission on the base intermediate folder created in HoodieReader
|/
* 70e501f Fixing the drop table before create table in HoodieReader
* 120cda8 Hoodie tools jar should not require jars in the CDH classpath to be available. Needed for HoodieReader to run in Docker.
* 60b59de Adding client configurations. Needed to run the HoodieReader in Docker (where CDH is not installed)
* fece98d Merge conflicts w/ master
* 64e58b0 Auto tuning parallelism in BloomIndex & Upsert()
* 930199e Fixing skew in Index join when new partition paths dont exist yet
* 9a3e511 Adding subpartitioning to scale join in HoodieBloomIndex
* 57512a7 Changing sort key for IndexLookup to (filename, record) to split more evenly
* 3ede14c Major changes to BloomIndex & Upsert DAG
* 1c4071a Implement Dataset creation if a Hoodie dataset was not already registered
* 944f007 Implement Review comments
* 6a5b675 Implement Review Comments
* bfde3a9 Implement review comments
* d195ab3 Implementing Commit Archiving
* 8af656b Exception refactor - part 2
* 697a699 HoodieTableMetadata refactor and Exception refactor
* 7804ca3 Adding HoodieAppendLog (fork of SequenceFile) & Initial Impl of HoodieCommitArchiveLog
* 2db4931 Adjust partitionFileRDD parallelism to max(recordRDD partitions, total partitions)
* 23405c5 Config name changes
* 5e673ea Implementing more CLI commands
* 918cfce Moving to 0.1.1-SNAPSHOT
* afad497 Change the master branch to 0.2-SNAPSHOT
* 832c1a7 Make sure the bloom filter reading and tagging has a parellel factor >= group by parallelism
* 0a6a6d3 Prepare the v0.1 version
* 72cfbe2 The snapshoter should also copy hoodie.properties file
* 3b0ee45 Add one more metric
* 488f1c7 Add switch for cleaning out inflight commits
* a259b6f Adding textutils jar to hoodie build
* 36e3118 Fix Hoodie CLI - ClassNotFound and added more logging to JDBC Incremental pull
* 2c8f554 Fix Predicate pushdown during incremental pull
* 888ec20 Add one more graphite metrics
* a671dfc Ensure files picked for cleaning are part of some valid commit
* ba5cd65 Adding cleaning based on last X commits
* 7dc76d3 Organize config values by category
* 9da6474 Move cleaning logic into HoodieCleaner class
* 7becba9 Change the update metric name
* d32b1f3 Fix some graphite issues
* 365ee14 hot fix a stupid bug I made
* 93eab43 Adding a hoodie.table.type value to hoodie.properties on init
* 075c646 Add the database name to the sync
* 3bae059 Adding HoodieKey as metadata field into Record
* 61513fa Add stats and more cli commands
* b0cb112 New Hoodie CLI Framework. Implement CLI function parity with the current CLI
* aaa1bf8 New Hoodie CLI Framework. Implement CLI function parity with the current CLI
* 3a3db73 New Hoodie CLI Framework. Implement CLI function parity with the current CLI
* c413342 Fail the job if exception during writing old records
* 7304d3d Exclude javax.servlet from hive-jdbc
* 3d65b50 Add the datestr <> '0000-00-00' back to the incremental sql
* 0577661 HoodieIncrementalConfig not used anymore
* 5338004 Fixing multiple minor issues we found during the SQLStreamer demo preperation
* 0744283 Fix the Hive server and Spark Hive client mismatch by setting userClassPathFirst=true and creating a assembly jar with all hadoop related dependencies excluded
* c189dc0 Kickoff hdrone sync after SQLStreamer finishing committing to target hoodie dataset
* 1eb8da0 Check if the .commit file is empty
* f95386a Add support for rollbacking .inflight commit in Admin CLI
* 97595ea Update the record count when upserting
* 49139cd Remove table config and add _SUCCESS tag
* 8500a48 Catch the exception when upserting
*   10bcc19 Merge branch 'sqlload'
|\
| * 10fcc88 More log statements
| *   ca6b71d Merge with master
| |\
| | *   b33db25 Merge remote-tracking branch 'origin/sqlload' into sqlload
| | |\
| | | * 8fca7c6 insert() takes a JavaRDD<HoodieRecord> again
| | * | 63db8c6 Fix test breakage from javax.servlet pom dependency
| | * | b2cff33 insert() takes a JavaRDD<HoodieRecord> again
| | * | 0162930 Minor Fixes
| | * | a0eb0b8 Minor Fixes
| | * | 5853e7c Minor fixed to HoodieSQLStreamer
| | * | 379bbed HoodieSQLStreamer improvements
| | * | 22bf816 Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload
| | * | 4cacde6 Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload
| | * | 5f985f3 Refactor of AvroParquetIO and create proper abstraction for StorageWriter
| | * | 6b90bb0 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer
| | * | ff24ce8 Implementation of HoodieSQLStreamer
| | * | abae08a Implementation of HoodieSQLStreamer
| * | | c2d306d Fixes to HoodieSQLStreamer
| | |/
| |/|
| * | 70bad72 Minor Fixes
| * | 8da6abf Minor Fixes
| * | 6b9d16b Minor fixed to HoodieSQLStreamer
| * | f76f5b8 HoodieSQLStreamer improvements
| * | 5f1425e Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload
| * | 616e2ee Remove setJsonPayload() and other non-generic calls from HoodieRecordPayload
| * | 9e77ef9 Refactor of AvroParquetIO and create proper abstraction for StorageWriter
| * | 14e4812 Refactor to introduce proper abstractions for RawTripPayload and implement HoodieSQLStreamer
| * | 3b05f04 Implementation of HoodieSQLStreamer
| * | 1484c34 Implementation of HoodieSQLStreamer
* | | b3b9754 Standardize UTF-8 for getBytes() calls
| |/
|/|
* | 8cde079 Add graphite metrics to HoodieClient
* | b94afad Add testcase for the snapshot copy
|/
* 8567225 T417977. WriteStatus for failed records
* 11d7cd2 Add code to deflate the HoodieRecord after writing it to storage
* 9edafb4 Add a daily snapshot job
* 2962bf6 Fix the last file non-closed issue
* d995b6b SizeAwareParquetWriter will now have a fixed compression ratio
* 6b5f67f HoodieWrapperFileSystem should initialize the underlying filesystem with default uri
* 2a607c2 Merging conflicts with master
* ac9852d Auto size parquet files to just under block size based on incoming records size
* 3c4c0d0 Remove client code leaks & add parallelism config for sorting
* 1e51e30 Add UpsertHandle
* 685ca1f Add hoodie cli
* ded7f6c CR feedback incorporated
* d532089 Change the return type to a RDD
* 22533c1 Fix bug in cleanup logic by using TaskContext.getPartitionId() in place of unitNumber
* 86532fb Implement insert() using sorting, to align file sizes easily
* 0967e1c Add hook to compare old record with new incoming record
*   f48b048 Merge branch 'sort-based-dag'
|\
| * 3614cec Rename write() -> upsert() and load() -> insert()
* | 65cf631 Parquet version mismatch in HoodieInputFormat
* | 160303b Formatting change
* | 2c079c8 Formatting change
|/
* e4eb658 Fix formatting
* 025114a Add test for HoodieAvroWriteSupport
* 6fd11ef Fix small bug in HoodieCommits & correct doc to reflect exclusivity of findCommitsInRange  - Added simple unit test
* 05659c9 Add tests around HoodieClient apis
* 8d3f73e Fix some small bugs
* 7f1c4bc Modify HoodieInputFormatTest to make it certain that incremental pull is only pulling the required records
* 2b73ba0 Remove direct versioning in pom
* dd5695f Comment change
* f62eef7 Unit test for predicate pushdown
* 9941dad Fixing an issue which results in unsorted commits
* 5e71506 Update README
* 219e103 InputFormat unit tests
* 8f1c7ba Enable cobertura coverage to be run with mvn test
* 01f76e3 Call out self-join limitation in README
* 4284a73 Defaulting to Google Java Style and reformatting existing code
* de2cbda Making sure that incremental does not send duplicate records
* f6a3833 Implement Review comments
* 1de5025 Refactor in HoodieTableMetadata, HoodieInputFormat
* 549ad9a Fixing broken test schemas
* fbb2190 update the unit number
* 9353ba9 Change the io number to 1 for old load data
* e28f0cf Add commit metadata fields to create_table.sql
* d06e93d Pull avroFn & dedupeFn into a single HoodieClientHooks class
* b6d387f Changes to sequence_no/commit metadata addition
* 212d237 Add some benchmark results to the code
* 70d7715 Add commit rollback logic
* 54a4d0f Use FSUtils helper to detemine fileId
* 4b672ad Core classes refactoring
* f705fab Move partitionPath back into HoodieKey
* 39b3ff3 Cleanup Sample job & add a detailed quickstart
* 981c6f7 fix the hoodie-query-meta pom
* 371ab34 Publish hoodie to uber internal artifactory
* b4e83bc improvement on the bloom index tag job
* 779b502 Change to use hadoop's bloom filter
* cfbd9e6 Add bloom filter indexing mechanism
* f519c47 Initial Implementation of storing the client metadata for hoodie queries
* d5eccea Initial Implementation of storing the client metadata for hoodie queries
* ef34482 Pass on the HDrone configuration profile as an argument
* 5578cd3 Implement initial incremental tailing support in InputFormat and provide a seperate module for Hdrone registration to be created as a oozie trigger
* b08e5ff Merge branch 'master' into AddBloomFilterWriteSupport
* 20b7e8e fix a typo
* 4c39407 Quick fix for the HBASE indx duplicates records issue
* 6dca38f Adding code to sync to hive using hdrone
* 55a1d44 Fixes to InputFormat. Created a placeholder OutputFormat.
* beda7ed Revise the globPartitions to avoid the bad partition paths
* 5d889c0 Fix a wrong config
* a60fbdf First version to add load function
* 4b90944 Adding detailed metadata to each commit
* 4a97a6c Changes to backfill script + enabling spark event log
* ada2b79 Discard records without partition path & move parquet writer to snappy
* 954c933 Adding backfill script  - Cleanups & additional cmd line options to job  - Changed iounit logic to special case 2010-2014 again
* 8b5e288 Breaking apart backfill job & single run into two classes
* ebdcbea Handle partial failures in update()
* 4bf6ffe Fixing an issue where file name is not present
* e468bff Fix couple of issues with Hbase indexing and commit ts checks
* 17da30c Changing de-dupe implementation to be a Spark reduceByKey
* 248c725 removed coalescing which was put in there for testing
* 1b3f929 Implement compression when storing large json strings in memory
* 5bada98 Changes to accomodate task failure handling, on top of cleaner
* 66f895a Clean out files generated by previous failed attempts
* 9cbe370 Implementing a rudimentary cleaner & avro conversion rewrite
* 3606658 Adding configs for iounits & reduce parallelism
* 066c2f5 Registering the Hoodie classes with Kryo
* 342eed1 Implementing a rudimentary cleaner
*   0d20d1d Merge branch 'trip-test-run'
|\
| * 6eafdbb Adding de-dupe step before writing/shuffling
* | 34baba7 Packaging hadoop-common with the hadoop-mr InputFormat JAR
|/
* d5856db Merge HoodieInputFormat with existing code. Factor out common logic into hadoop-common. Tune the partitions, spark executors, parquet parameters to be able to run on a single day of input data
* e8885ce Introduce IOUnit to split parallelize inserts
* ab1977a Pushing in a real Spark job that works off real data
* 0c86645 HoodirInputFormat with TestDataSimulator
* 6af483c Initial checkin for HoodieInputFormat
* 99c58f2 Implementing HBase backed index
* 4177529 First major chunk of Hoodie Spark Client Impl
* 29fad70 Benchmark bloom filter file read performance
* 18f52a4 Checking in the simulation code, measuring cost of trip's file-level updates
* 885f444 Adding basic datastructures for Client, key & record.
* 72e7b4d Initial commit
2016-12-16 14:34:42 -08:00