hudi/RELEASE_NOTES.md at faf93b6340388ec5b2558a950e5fd4582be068b1

Vinoth Chandar 0b1a949a87 Release notes for 0.4.4

2018-09-12 23:39:57 +05:30

Release 0.4.4

Highlights

Dependencies are now decoupled from CDH and based on apache versions!
Support for Hive 2 is here!! Use -Dhive11 to build for older Hive versions
Deltastreamer tool reworked to make configs simpler, hardened tests, added Confluent Kafka support
Removed dependency on commons lang3, to ease use with different hadoop/spark versions
Better CLI support and docs for managing async compactions
New CLI commands to manage datasets

Full PR List

@saravsars - Updated jcommander version to fix NPE in HoodieDeltaStreamer tool #443
@n3nash - Removing dependency on apache-commons lang 3, adding necessary classes as needed #444
@n3nash - Small file size handling for inserts into log files. #413
@vinothchandar - Update Gemfile.lock with higher ffi version
@bvaradar - Simplify and fix CLI to schedule and run compactions #447
@n3nash - Fix an intermittently failing test case in TestMergeOnRead due to incorrect prev commit time #448
@bvaradar - CLI to create and desc hoodie table #446
@vinothchandar - Reworking the deltastreamer tool #449
@bvaradar - Docs for describing async compaction and how to operate it #445
@n3nash - Adding check for rolling stats not present in existing timeline to handle backwards compatibility #451
@bvaradar @vinothchandar - Moving all dependencies off cdh and to apache #420
@bvaradar - Reduce minimum delta-commits required for compaction #452
@bvaradar - Use spark Master from environment if set #454

Release 0.4.3

Highlights

Ability to run compactions asynchronously & in parallel with ingestion/writes added!!!
Day-based compaction no longer respects IO budgets, i.e., it is agnostic of them
Adds ability to throttle writes to HBase via the HBaseIndex
(Merge on read) Inserts are sent to log files, if they are indexable.

Full PR List

@n3nash - Adding ability for inserts to be written to log files #400
@n3nash - Fixing bug introduced in rollback for MOR table type with inserts into log files #417
@n3nash - Changing Day based compaction strategy to be IO agnostic #398
@ovj - Changing access level to protected so that subclasses can access it #421
@n3nash - Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled #419
@bvaradar - Async compaction - Single Consolidated PR #404
@bvaradar - BUGFIX - Use Guava Optional (which is Serializable) in CompactionOperation to avoid NotSerializableException #435
@n3nash - Adding another metric to HoodieWriteStat #434
@n3nash - Fixing Null pointer exception in finally block #440
@kaushikd49 - Throttling to limit QPS from HbaseIndex #427

Release 0.4.2

Highlights

Parallelized Parquet writing & input record reads, resulting in up to 2x performance improvement
Better out-of-box configs to support up to 500GB upserts, improved ROPathFilter performance
Added a union mode for RT View, that supports near-real time event ingestion without update semantics
Added a tuning guide with suggestions for oft-encountered problems
New configs for compression ratio, index storage levels

Full PR List

@jianxu - Use hadoopConf in HoodieTableMetaClient and related tests #343
@jianxu - Add more options in HoodieWriteConfig #341
@n3nash - Adding a tool to read/inspect a HoodieLogFile #328
@ovj - Parallelizing parquet write and spark's external read operation. #294
@n3nash - Fixing memory leak due to HoodieLogFileReader holding on to a logblock #346
@kaushikd49 - DeduplicateRecords based on recordKey if global index is used #345
@jianxu - Checking storage level before persisting preppedRecords #358
@n3nash - Adding config for parquet compression ratio #366
@xjodoin - Replace deprecated jackson version #367
@n3nash - Making ExternalSpillableMap generic for any datatype #350
@bvaradar - CodeStyle formatting to conform to basic Checkstyle rules. #360
@vinothchandar - Update release notes for 0.4.1 (post) #371
@bvaradar - Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases #372
@n3nash - Parallelized read-write operations in Hoodie Merge phase #370
@n3nash - Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream #373
@suniluber - Fix for updating duplicate records in same/different files in same pa… #380
@bvaradar - Fixit : Add Support for ordering and limiting results in CLI show commands #383
@n3nash - Adding metrics for MOR and COW #365
@n3nash - Adding a fix/workaround when fs.append() unable to return a valid outputstream #388
@n3nash - Minor fixes for MergeOnRead MVP release readiness #387
@bvaradar - Issue-257: Support union mode in HoodieRealtimeRecordReader for pure insert workloads #379
@n3nash - Enabling global index for MOR #389
@suniluber - Added a new filter function to filter by record keys when reading parquet file #395
@vinothchandar - Improving out of box experience for data source #295
@xjodoin - Fix wrong use of TemporaryFolder junit rule #411

Release 0.4.1

Highlights

Good enhancements for the merge-on-read write path: spillable map for merging, evolvable log format, rollback support
Cloud file systems should now work out-of-box for copy-on-write tables, with configs picked up from SparkContext
The compaction action type is gone; multiple delta commits now lead to a commit upon compaction
API-level changes include: compaction API, new prepped APIs for higher pluggability for advanced clients

Full PR List

@n3nash - Separated rollback as a table operation, implement rollback for MOR #247
@n3nash - Implementing custom payload/merge hooks abstractions for application #275
@vinothchandar - Reformat project & tighten code style guidelines #280
@n3nash - Separating out compaction() API #282
@n3nash - Enable hive sync even if there is no compaction commit #286
@n3nash - Partition compaction strategy #281
@n3nash - Removing compaction action type and associated compaction timeline operations, replace with commit action type #288
@vinothchandar - Multi/Cloud FS Support for Copy-On-Write tables #293
@vinothchandar - Update Gemfile.lock #298
@n3nash - Reducing memory footprint required in HoodieAvroDataBlock and HoodieAppendHandle #290
@jianxu - Add FinalizeWrite in HoodieCreateHandle for COW tables #285
@n3nash - Adding global indexing to HbaseIndex implementation #318
@n3nash - Small File Size correction handling for MOR table type #299
@jianxu - Use FastDateFormat for thread safety #320
@vinothchandar - Fix formatting in HoodieWriteClient #322
@n3nash - Write smaller sized multiple blocks to log file instead of a large one #317
@n3nash - Added support for Disk Spillable Compaction to prevent OOM issues #289
@jianxu - Add new APIs in HoodieReadClient and HoodieWriteClient #327
@jianxu - Handle inflight clean instants during Hoodie instants archiving #332
@n3nash - Introducing HoodieLogFormat V2 with versioning support #331
@n3nash - Re-factoring Compaction as first level API in WriteClient similar to upsert/insert #330

Release 0.4.0

Highlights

Spark datasource API now supported for Copy-On-Write datasets, across all views
BloomIndex can now prune based on key ranges & cut down index tagging time dramatically, for time-prefixed/ordered record keys
Hive sync tool registers RO and RT tables now.
Client application can now specify the partitioner to be used by bulkInsert(), useful for low-level control over initial record placement
Framework for metadata tracking inside IO handles, to implement Spark accumulator-style counters, that are consistent with the timeline
Bug fixes around cleaning, savepoints & upsert's partitioner.
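
To illustrate why key-range pruning helps with time-prefixed or ordered record keys, here is a minimal sketch. It is not Hudi's implementation: the `FileKeyRange` type, the `candidate_files` function, and the sample file ids and keys are all hypothetical. The idea is that if each file records the minimum and maximum record key it holds, a lookup only needs to consult the bloom filters of files whose range could contain the key.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileKeyRange:
    """Hypothetical per-file metadata: smallest and largest record key stored."""
    file_id: str
    min_key: str
    max_key: str


def candidate_files(record_key: str, ranges: List[FileKeyRange]) -> List[str]:
    """Return ids of files whose [min_key, max_key] range could contain record_key.

    Files outside the range are skipped, so their bloom filters never
    need to be checked for this key.
    """
    return [r.file_id for r in ranges if r.min_key <= record_key <= r.max_key]


# Illustrative data: time-prefixed keys give each file a narrow, mostly disjoint range.
ranges = [
    FileKeyRange("file-a", "2018-09-01_000", "2018-09-01_999"),
    FileKeyRange("file-b", "2018-09-02_000", "2018-09-02_999"),
    FileKeyRange("file-c", "2018-09-03_000", "2018-09-03_999"),
]

print(candidate_files("2018-09-02_417", ranges))  # ['file-b']
```

With randomly distributed keys (for example UUIDs), every file's range tends to span the whole key space, so this kind of pruning would eliminate few or no files.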

Full PR List

@gekath - Writes relative paths to .commit files #184
@kaushikd49 - Correct clean bug that causes exception when partitionPaths are empty #202
@vinothchandar - Refactor HoodieTableFileSystemView using FileGroups & FileSlices #201
@prazanna - Savepoint should not create a hole in the commit timeline #207
@jianxu - Fix TimestampBasedKeyGenerator in HoodieDeltaStreamer when DATE_STRING is used #211
@n3nash - Sync Tool registers 2 tables, RO and RT Tables #210
@n3nash - Using FsUtils instead of Files API to extract file extension #213
@vinothchandar - Edits to documentation #219
@n3nash - Enabled deletes in merge_on_read #218
@n3nash - Use HoodieLogFormat for the commit archived log #205
@n3nash - fix for cleaning log files in master branch (mor) #228
@vinothchandar - Adding range based pruning to bloom index #232
@n3nash - Use CompletedFileSystemView instead of CompactedView considering deltacommits too #229
@n3nash - suppressing logs (under 4MB) for jenkins #240
@jianxu - Add nested fields support for MOR tables #234
@n3nash - adding new config to separate shuffle and write parallelism #230
@n3nash - adding ability to read archived files written in log format #252
@ovj - Removing randomization from UpsertPartitioner #253
@ovj - Replacing SortBy with custom partitioner #245
@esmioley - Update deprecated hash function #259
@vinothchandar - Adding canIndexLogFiles(), isImplicitWithStorage(), isGlobal() to HoodieIndex #268
@kaushikd49 - Hoodie Event callbacks #251
@vinothchandar - Spark Data Source (finally) #266

Previous Releases

Refer to GitHub
