
Introducing HoodieLogFormat V2 with versioning support

- HoodieLogFormat V2 supports LogFormat evolution through versioning
  - A LogVersion is associated with a LogBlock, not a LogFile
  - Based on the version of a LogBlock, the appropriate code path is executed
- Implemented lazy reading of Hoodie log blocks, with a memory / I/O tradeoff
- Implemented a reverse pointer to be able to traverse the log in reverse
- Introduced a new MAGIC for backwards compatibility with logs without versions
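The version-dispatch and backwards-compatibility points above can be sketched roughly as follows. This is an illustrative sketch only; the magic byte values and function names are placeholders, not Hudi's actual on-disk constants or reader API.

```python
# Hypothetical sketch: choose a code path based on the magic/version
# found at the start of a log block. Magic values are made up here.
OLD_MAGIC = b"#HUDI#"   # placeholder for the pre-versioning block marker
NEW_MAGIC = b"HUDI!!"   # placeholder for the new V2 marker

def read_block(buf: bytes):
    """Dispatch on the per-block magic and version."""
    if buf.startswith(NEW_MAGIC):
        # Versioned block: the VERSION field follows the magic, so the
        # reader can pick the matching code path as the format evolves.
        version = int.from_bytes(buf[len(NEW_MAGIC):len(NEW_MAGIC) + 4], "big")
        return ("v%d reader" % version, version)
    if buf.startswith(OLD_MAGIC):
        # Old magic, no version field: fall back to the legacy path.
        return ("legacy reader", 1)
    raise ValueError("corrupt block: unknown magic")
```

The new MAGIC lets a reader distinguish versioned blocks from legacy ones without breaking existing logs, since legacy blocks still begin with the old marker.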
Authored by Nishith Agarwal on 2018-02-15 11:01:25 -08:00; committed by vinoth chandar.
parent dfd1979c51
commit 5405a6287b
32 changed files with 2066 additions and 677 deletions


@@ -23,7 +23,7 @@ topnav_dropdowns:
url: /api_docs.html
output: web
- title: Code Structure
-  url: /code_structure.html
+  url: /code_and_design.html
output: web
- title: Roadmap
url: /roadmap.html

docs/code_and_design.md (new file, 38 lines)

@@ -0,0 +1,38 @@
---
title: Code Structure
keywords: usecases
sidebar: mydoc_sidebar
permalink: code_and_design.html
---
## Code & Project Structure
* hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
* hoodie-common : Common code shared between different artifacts of Hoodie
## HoodieLogFormat
The following diagram depicts the LogFormat for Hoodie MergeOnRead. Each log file consists of one or more log blocks, and each log block follows the format below.
| Field | Description |
|-----------------|-------------|
| MAGIC | A magic header that marks the start of a block |
| VERSION | The version of the LogFormat; this defines which code path to use as the format evolves |
| TYPE | The type of the log block |
| HEADER LENGTH | The length of the headers, 0 if no headers |
| HEADER | Metadata needed for a log block, e.g. INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA |
| CONTENT LENGTH | The length of the content of the log block |
| CONTENT | The content of the log block; for a DATA_BLOCK, for example, this is the number of records followed by the actual records, as a byte[] |
| FOOTER LENGTH | The length of the footers, 0 if no footers |
| FOOTER | Metadata needed for a log block, e.g. index entries or a bloom filter for the records in a DATA_BLOCK |
| LOGBLOCK LENGTH | The total number of bytes written for the log block, typically SUM(everything above), stored as a LONG. This acts as a reverse pointer, enabling traversal of the log in reverse. |
{% include image.html file="hoodie_log_format_v2.png" alt="hoodie_log_format_v2.png" %}
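The block layout in the table above, including the trailing LOGBLOCK LENGTH used as a reverse pointer, can be sketched as a round trip. This is a simplified illustration under assumed field widths (4-byte int lengths, 8-byte big-endian LONG trailer) and a made-up magic value; it is not Hudi's actual serializer.

```python
import struct

MAGIC = b"#HUDI#"  # illustrative marker bytes, not the real on-disk magic

def write_block(version, block_type, header, content, footer):
    """Lay out one log block per the table: magic, version, type, then
    length-prefixed header/content/footer, then the total body length
    as a trailing LONG (the reverse pointer)."""
    body = (MAGIC
            + struct.pack(">ii", version, block_type)
            + struct.pack(">i", len(header)) + header
            + struct.pack(">i", len(content)) + content
            + struct.pack(">i", len(footer)) + footer)
    return body + struct.pack(">q", len(body))

def blocks_in_reverse(log):
    """Walk the log back-to-front: read the trailing LONG of the last
    block, jump over the block, repeat until the start of the file."""
    pos = len(log)
    while pos > 0:
        (size,) = struct.unpack(">q", log[pos - 8:pos])
        start = pos - 8 - size
        yield log[start:pos - 8]   # the block body, without its trailer
        pos = start
```

Because each block ends with its own length, a reader can seek to the end of the file and iterate backwards without scanning forward first, which is what makes reverse log traversal cheap.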


@@ -1,16 +0,0 @@
---
title: Code Structure
keywords: usecases
sidebar: mydoc_sidebar
permalink: code_structure.html
---
## Code & Project Structure
* hoodie-client : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
* hoodie-common : Common code shared between different artifacts of Hoodie


@@ -74,6 +74,14 @@ summary: "Here we list all possible configurations and what they mean"
<span style="color:grey">Should hoodie dynamically compute the insertSplitSize based on the last 24 commit's metadata. Turned off by default. </span>
- [approxRecordSize](#approxRecordSize) () <br/>
<span style="color:grey">The average record size. If specified, hoodie will use this and not compute dynamically based on the last 24 commit's metadata. No value set as default. This is critical in computing the insert parallelism and bin-packing inserts into small files. See above.</span>
- [withCompactionLazyBlockReadEnabled](#withCompactionLazyBlockReadEnabled) (true) <br/>
<span style="color:grey">When a CompactedLogScanner merges all log files, this config chooses whether log blocks should be read lazily. Set true for I/O-intensive lazy block reading (low memory usage), or false for memory-intensive immediate block reading (high memory usage)</span>
- [withMaxNumDeltaCommitsBeforeCompaction](#withMaxNumDeltaCommitsBeforeCompaction) (maxNumDeltaCommitsBeforeCompaction = 10) <br/>
<span style="color:grey">Number of max delta commits to keep before triggering an inline compaction</span>
- [withCompactionReverseLogReadEnabled](#withCompactionReverseLogReadEnabled) (false) <br/>
<span style="color:grey">HoodieLogFormatReader reads a logfile in the forward direction starting from pos=0 to pos=file_length. If this config is set to true, the Reader reads the logfile in reverse direction, from pos=file_length to pos=0</span>
- [withMaxMemorySizePerCompactionInBytes](#withMaxMemorySizePerCompactionInBytes) (maxMemorySizePerCompactionInBytes = 1GB) <br/>
<span style="color:grey">HoodieCompactedLogScanner reads log blocks, converts records to HoodieRecords, and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file, which can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the maximum allowable in-memory footprint of the spillable map.</span>
- [withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) <br/>
<span style="color:grey">Hoodie publishes metrics on every commit, clean, rollback etc.</span>
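The lazy-versus-immediate tradeoff controlled by withCompactionLazyBlockReadEnabled can be illustrated with two toy block wrappers. These classes are hypothetical teaching aids, not the actual HoodieLogFormatReader types.

```python
import io

class LazyBlock:
    """Hold only the content's offset/length; read bytes on demand.
    Low memory footprint, but an I/O seek+read on every access."""
    def __init__(self, fh, offset, length):
        self._fh, self._offset, self._length = fh, offset, length

    def content(self):
        self._fh.seek(self._offset)
        return self._fh.read(self._length)

class EagerBlock:
    """Read the content into memory immediately at construction.
    No further I/O, but every block's bytes stay resident."""
    def __init__(self, fh, offset, length):
        fh.seek(offset)
        self._content = fh.read(length)

    def content(self):
        return self._content
```

With lazy reading, a scanner merging many large log files only materializes the block it is currently processing, which is why the lazy mode trades extra I/O for a much smaller memory footprint.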

Binary file not shown (new image, 218 KiB).