Magyar oldal English site

Site map
2024-06-04 17:36:03 (Originally published at: 2017-09-21)

ZLIB + Deflate file format

This writing is a quick reference for those who would like to implement the inflate algorithm on their own or would like to make their compressors produce a bit stream in deflate format.

Overall ZLIB format

Bits are numbered from lowest to highest starting from zero. So 0 is the lowest order bit, 7 is the highest order one.

Size in bytes Name Description
1 compression method and window size (CMF)

This is essentially a magic number. It's 0x78 most often.

  • Bits 0..3: The compression method 8 means the compression method is deflate. 15 is reserved.
  • Bits 4..7: the maximum look behind window size. 7 means 32768 bytes long window, which is the maximum supported size by deflate. The formula is: $2^{8+w}$, where $w$ is the value of those 4 bits.
1 checksum and flags (FLG)
  • Bits 0..4: Must be set so (256*FLG + CMF) is divisible by 31.
  • Bit 5: Setting this bit means the stream contains the checksum of a prefix dictionary.
  • Bits 6..7: Compression level. 0 is the lowest, 3 is the highest.
4 (optional) preset dictionary checksum If the bit 5 in FLG is set, this will be the Alder32 checksum of the data that's initially fed to the decompressor without consuming input. So the compressed data can refer to it. This can be used, for example, to decompress a stream that consists of multiple parts of ZLIB compressed streams. When bit 5 in FLG is not set, this field is not present. This checksum is in big-endian byte order (most significant byte first).
varies compressed data The compressed data itself. We will describe the format of this later.
4 checksum This is the Alder32 checksum at the end of the stream, used to validate the integrity of the decompressed data. Preset dictionaries are not considered in this checksum. Later we will describe how to calculate it. This checksum is in big-endian byte order (most significant byte first).

Deflate compression format

Deflate is a frequently used compression method in implementations of GZIP, ZIP and others. This section discusses the format of this compression format. The deflate algorithm uses a combination of LZ algorithm and Huffman coding.

Before moving forward we should understand how is the data encoded.

The compressed data is a bit stream on the lowest level. Bytes of the compressed stream is consumed one after the other. Bits in these bytes are consumed by consuming the lowest order bit first and the highest order bit last.

The data contains bit fields. Bits in these fields are in little-endian bit order. So the least significant bit will be read first and the most significant one is read last.

Overview of the LZ compressed data

The point of the LZ algorithm is the elimination of duplicate sequences from the input. This means we don't write the same thing twice, instead we write a reference where the is sequence is found. If the sequence is long this can save space.

Let Δ(X, Y) mean "Go back X places and copy Y bytes from there to the output". This allows the decompressor copy sequences to the output that's previously occurred in the output. Consider the following string: ABCDABCDEFGHABCDEFG. At the 5th position we can see that the sequence ABC repeats, so we can replace it by the copy code: ABCDΔ(4,3)DEFGHABCDEFG. Then we can see the sequence ABCDEFG also repeats so instead of writing it down again we can replace it with a copy command too: ABCDΔ(4,3)DEFGHΔ(8,7). And these copy codes can shorten the data greatly when very long sequences repeat.

The sequence and its next occurrence may overlap. Consider the following: ABCABCABCABCABC. This can be reduced to: ABCΔ(3,12). So we tell the decompressor go back 3 positions and copy 12 bytes from there. There are not enough bytes in the output to copy 12 bytes right away, but as the decompressor executes the copy the output is being written so the decompressor will copy these previously written bytes again. And this allows the very efficient storage of repeated data. But this should be a note for implementers: you shouldn't use memory copying functions to copy because that would break this behavior.

Encoding of the LZ compressed data in Deflate

The deflate format uses two alphabets to encode LZ compressed data. One is the literal/length alphabet and the other is the distance alphabet. The literal/length alphabet contains the symbols for literals. And the length part of the copy command. The distance alphabet is a supplementary alphabet used to determine the distance.

When we say symbol we refer to the element of the alphabet. When we say code we refer to the bit pattern that represent the symbol.

The symbols of the literal/length alphabet means the following:

CodeLength baseExtra bits
25730
25840
25950
26060
26170
26280
26390
264100
265111
266131
CodeLength baseExtra bits
267151
268171
269192
270232
271272
272312
273353
274433
275513
276593
CodeLength baseExtra bits
277674
278834
279994
2801154
2811315
2821635
2831955
2842275
2852580

Although these values indeed have a mathematical formula, it's easier to shove them into an array instead of calculating it.

After we know the length we need to read the distance using the distance alphabet. Each symbol in the distance alphabet specifies a distance base and a number of extra bits to read to be added to the distance base just like with lengths.

CodeLength baseExtra bits
010
120
230
340
451
571
692
7132
8173
9253
CodeLength baseExtra bits
10334
11494
12655
13975
141296
151936
162577
173857
185138
197698
CodeLength baseExtra bits
2010259
2115379
22204910
23307310
24409711
25614511
26819312
271228912
281638513
292457713

So for example the Δ(13,12) copy command is encoded by first having the symbol 265 (length base 11), then an extra bit read that will contain 1 to increase it to 12, followed by the distance symbol of 7 from which gives the distance base 13 followed by reading 2 extra bits of 0 to keep it 13.

Overview of Huffman coding

The LZ algorithm already reduced the size of the data. But it can be reduced further by assigning shorter code to symbols that appear often and assign longer codes to symbols that appear rarely. This is what's the Huffman coding is about.

The longest allowed code length in the deflate format is 15. The shortest one is 1. The code length 0 reserved to mark the symbols that doesn't occur in the decompressed data at all. The deflate doesn't store the Huffman codes themselves instead it stores only the lengths of the code, and calculates the actual code from them. To determine the Huffman codes we start from an array of lengths. Consider the following list of lengths: [2, 0, 3, 4, 4, 0, 2, 4, 4, 0, 4, 5, 5]. These are the code lengths corresponding to 13 symbols indexed from 0 to 12. We can see the symbols 1, 5 and 9 are unused because their code length is 0.

The first step is counting the occurrences of each code length and populate an array with the counts. In our example this is: [0, 0, 2, 1, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]. Let's call this array $F$. And refer to the $i$th element of it as $F_i$. So for example $F_3 = 1$, because there are 1 elements with length 3. The 0 length is reserved for unused symbols, so there are no code with 0 length, so $F_0 = 0$ by definition.

The second step is determining the initial codes corresponding to each code length. So another array of 16 elements is created, let's call it $N$. Let's call the $i$th element of it as $N_i$. The formula to calculate this array is: $N_i = 2 \left( N_{i-1} + F_{i-1}\right)$. The $N_0$ is set 0 by definition. So the generation should start from $i = 1$. This array will be in our example: [0, 0, 0, 4, 10, 30, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]

The third step is computing the actual codes. Let's denote the $i$th symbol's code with $S_i$. The operation to calculate the code is: $S_i = N_{F_i}; N_{F_i} := N_{F_i} + 1$. So the $N$ values increase after each assignment as we are doing progress. Using this step we can obtain the array of Huffman codes for each symbol: [0, 0, 4, 10, 11, 0, 1, 12, 13, 0, 14, 30, 31]. If we express them in binary, considering the code lengths, we have these codes: [00, 0, 100, 1010, 1011, 0, 01, 1100, 1101, 0, 1110, 11110, 11111]. A these codes will represent the symbols. Indexes are shown to see better: [00 -> 0, 0, 100 -> 2, 1010 -> 3, 1011 -> 4, 0, 01 -> 6, 1100 -> 7, 1101 -> 8, 0, 1110 -> 10, 11110 -> 11, 11111 -> 12]

Here is an example Python code used to calculate the Huffman codes.


"""Huffman stuff"""

# Example LENGTHS
LENGTHS = [2, 0, 3, 4, 4, 0, 2, 4, 4, 0, 4, 5, 5]

# Maximum Huffman code length is 15
MAX_HUFF = 16

def genhuff(lengths):
    "Generates Huffman codes"

    # Generate zeroed arrays.
    n_lengths = [0] * MAX_HUFF
    next_codes = [0] * MAX_HUFF
    huff_codes = [0] * len(lengths)

    # Count lengths
    for length in lengths:
        n_lengths[length] += 1

    print n_lengths

    # Calculate next code
    n_lengths[0] = 0
    for i in range(1, len(n_lengths)):
        next_codes[i] = 2*(next_codes[i - 1] + n_lengths[i - 1])

    print next_codes

    # Assign Huffman codes
    for i  in range(0, len(huff_codes)):
        if lengths[i] == 0:
            continue
        huff_codes[i] = next_codes[lengths[i]]
        next_codes[lengths[i]] += 1

    # Print it
    print huff_codes
    print '[{}]'.format(','.join(bin(x)[2:].zfill(y) for x, y in zip(huff_codes, lengths)))

    return

genhuff(LENGTHS)

These codes are stored in big-endian bit order in the file. For example the symbol 4 has the code 1011. This means we read the bits 1, 0, 1, 1 in this order, not vice versa unlike any other bit fields. The reason for this is that Huffman code is prefix code. This means no code word is a prefix of other code word. In example above, there is no symbol which has the code 1, 10 or 101. So any sequence of bits will unambiguously map to the symbols denoted by the codes (usually). Without needing to specify length before or an end symbol after each code. Consider the bit sequence 10101010101011111000101000010110111101010, which I randomly typed in. This represent these code words: 1010 1010 1010 11111 00 01 01 00 00 1011 01 1110 1010. Which encode: 3, 3, 3, 12, 0, 6, 6, 0, 0, 4, 6, 10, 4.

Code length alphabet

We only store the lengths of the literal/length and the distance alphabets. This means 315 lengths at most. This is a lot. It can be shortened by encoding with another alphabet: the code length alphabet. It has 19 symbols.

So basically having the sequence, 3, 4, 16 (extra bits: 11), 17 (extra bits: 100) means the code length sequence. 3, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0.

And even this encoding is further compressed using Huffman code which reduces its size even further.

Layout of the compressed data

The compressed data is a series of blocks. Each block can be arbitrary long. The format of the block is the following:

Size in bits Name Description
1 final block flag (BFINAL) When this bit is set it indicates that this block is the last block in the compressed data.
2 block type (BTYPE)

The type of block:

  • 0: No compression the block is just stored. This is used when the data cannot be compressed further.
  • 1: Compression used using fixed Huffman codes. This is used when the data to compress is small.
  • 2: Compress using dynamic Huffman codes. The codes are stored within the block.
  • 3: Invalid mode. This shouldn't appear.
varies the actual block data The layout of this data depends on the type of the block.

Layout of stored blocks

When BTYPE is 0 that means the block is a stored. When this happens the decompressor discards the remaining bits of the current byte it's consuming and resumes reading the streams byte-by-byte. The format of the stored block is:

Size in bytes Name Description
2 length of the data (LEN) 2 byte length of the block. The length is stored in little-endian order. And this means the block length is limited to 65536 bytes.
2 bitwise negation of LEN (NLEN) This is a sort of check word. Its bits must be set the opposite way than LEN.
LEN block data The stored block data. It should be just appended to the output.

After the block is read. The bit based processing resumes starting from the next byte after the block.

Layout of the compressed blocks

The layout of the compressed blocks is the following:

Size in bits Name Description
5 number of symbols in the literal length alphabet (NLIT - 257) There are NLIT symbols in the literal/length alphabet. Note the value of this field is 257 less than the actual number.
5 number of symbols in the distance alphabet (NDIST - 1) There are NDIST symbols in the distance alphabet. Note the value of this field is 1 less than the actual number.
4 number of symbols in the code length alphabet (NCLEN - 4) There are NCLEN symbols in the code length alphabet. Note the value of this field is 4 less than the actual number.
3*NCLEN lengths of the prefix codes used to encode the symbols in the code length alphabet

These are 3 bit long bit fields that describe the prefix code lengths used for the code length alphabet. Using these values the Huffman tree for the code length alphabet can be built.

The index order in this table is weird. The length codes are assigned in the following order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. If NCLEN is smaller than the maximum possible 19, the unset fields are set zero.

varies lengths of the codes for the literal/length alphabet and the distance alphabet There are NLIT + NDIST number of elements in this list encoded by the code length alphabet and whose symbols are encoded using the prefix code described by previous row. First the prefix code lengths for the the literal/length alphabet, then the prefix code lengths for the distance alphabet.
varies the compressed data itself After processing the lengths and building the Huffman tree for the literal/length alphabet and the distance alphabet, the compressed data follows. It's a series of commands that describe how to generate the uncompressed file.

Layout of compressed blocks using a fixed Huffman tree

In this case the block contains the compressed data only. The code lengths for both the literal/length alphabet and the distance alphabet is hard coded. All codes of the distance alphabet is 5 bits long. For the literal/length alphabet the code lengths vary in the following intervals: 8 bit long for the symbols between 0..143, 9 bits between 144..255, 7 bits between 256..279, 8 bits between 280..287. (286 and 287 won't appear within the compressed data, but takes part in the code generation.)

Alder32 checksum

The alder32 checksum is computed the following way: Let's have two integer variables $A$ and $B$. Initialitze them as: $A = 1$, $B = 0$. Let's have $C$ to denote the next byte in the stream. For each byte in the file going from start to end update the two variables the following way: $A := (A + C) \text{ mod } 65521$, $B := (A + B) \text{ mod } 65521$. Once all bytes processed, the checksum will be: $65536 B + A$.

If you want to discuss this post, share it on social media.
Recently updated:

RSS

Logo