(I taught this class at MIT Splash 2018; see the presentation here)
The term gzip refers to both the common UNIX command line utility for file compression as well as the file format that it uses to store compressed data files. The utility itself is based on the DEFLATE algorithm, detailed in RFC 1951, and the file format is detailed in RFC 1952. According to the RFC, the purpose of DEFLATE is to give a lossless compression format that
- Is independent of CPU type, operating system, file system, and character set, and hence can be used for interchange;
- Can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage, and hence can be used in data communications or similar structures such as Unix filters;
- Compresses data with efficiency comparable to the best currently available general-purpose compression methods, and in particular considerably better than the "compress" program;
- Can be implemented readily in a manner not covered by patents, and hence can be practiced freely;
- Is compatible with the file format produced by the current widely used gzip utility, in that conforming decompressors will be able to read data produced by the existing gzip compressor.
We will cover the entire DEFLATE algorithm at a very detailed level to get a good understanding of compression methodology.
All data can be thought of as represented by some alphabet, consisting of characters. As a very trivial example, we can think of English text, and of the (26 character) English alphabet as the alphabet that represents this data. As an extension upon this example, we may consider our alphabet to be the ASCII or Unicode character set, much larger alphabets that in turn allow us to represent different kinds of data. However, given the digital nature of our modern computers, alphabets cannot be directly stored into data files; they must be encoded with binary values to be saved to disk. ASCII and Unicode in fact represent not only alphabets, but also the binary encodings for these alphabets; for example, ASCII uses the encoding 0b1001110 to represent the character N, and the encoding 0b1111011 to represent the character {. Given these encodings, we are able to go from text in the English alphabet to a data file that is stored on the computer. However, one may ask for a given text file whether ASCII gives the optimal binary encoding. Is there a different option that will make the resulting text file smaller in size? As it turns out, there is almost always a better way to encode a specific piece of data, because we are able to make use of the specific frequencies of characters that appear in this data.
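Both claims are easy to check directly (a quick illustration in Python; the pangram is just a stand-in for "some English text"):

```python
from collections import Counter

# The ASCII encodings mentioned above
assert format(ord('N'), '07b') == '1001110'
assert format(ord('{'), '07b') == '1111011'

# Character frequencies are exactly what a better, data-specific
# encoding can exploit
text = "the quick brown fox jumps over the lazy dog"
freqs = Counter(text)
print(freqs.most_common(3))  # the space character dominates
```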
In general, there are two key methods of binary data encoding: fixed-length and variable-length. ASCII is a fixed-length encoding scheme: it uses the same number of bits for every character in the alphabet, providing unambiguous and easily-interpretable data (just read one byte at a time). However, it is easy to see that fixed-length encodings will not perform optimally for any given piece of text. Intuitively, we would want the more frequent characters to have shorter codes and the less frequent characters to have longer codes, so that the resulting data file would be smaller on disk. Variable-length encodings leverage the actual frequencies of the characters in the data to be encoded in order to choose better encodings. The result, however, is that the encoding is highly specific to the given body of text, and must be specified along with the file in order to make it interpretable.
One key point to note about variable-length encodings is that they must have the prefix property: no encoding can also be the prefix of a different, longer encoding, as this would make the resulting data ambiguous.
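To see why, consider the following sketch (with made-up codes, not drawn from any real scheme): a non-prefix-free code produces ambiguous bit strings, while a prefix-free code over the same alphabet decodes greedily and unambiguously.

```python
# A non-prefix-free code: 'a' -> 0, 'b' -> 01, 'c' -> 1
# The bit string "01" could mean 'b', or 'a' followed by 'c' -- ambiguous.
codes = {'a': '0', 'b': '01', 'c': '1'}
assert codes['b'] == codes['a'] + codes['c']

# A prefix-free code over the same alphabet decodes unambiguously
prefix_codes = {'a': '0', 'b': '10', 'c': '11'}

def decode(bits, codes):
    # Invert the code table, then peel off one codeword at a time;
    # the prefix property guarantees each match is the only possible one
    inverse = {v: k for k, v in codes.items()}
    out, cur = [], ''
    for bit in bits:
        cur += bit
        if cur in inverse:
            out.append(inverse[cur])
            cur = ''
    return ''.join(out)

assert decode('01011', prefix_codes) == 'abc'
```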
As it turns out, the problem of finding an optimal binary encoding for an alphabet, given the data to be encoded, was solved by David A. Huffman in 1952. His Huffman encoding scheme provides an efficient method for generating a binary encoding based on frequencies of characters in a given text. Read about the full details of the scheme here and here.
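The classic greedy construction (a sketch of the idea only, not the exact procedure DEFLATE uses) repeatedly merges the two least frequent subtrees; the frequencies below are the standard textbook example, not data from this article:

```python
import heapq

def huffman_codes(freqs):
    """Greedy Huffman construction: repeatedly merge the two least
    frequent subtrees. Returns a symbol -> bit-string mapping."""
    # Each heap entry: (frequency, tiebreaker, {symbol: code-so-far});
    # the tiebreaker keeps heapq from ever comparing the dicts
    heap = [(f, i, {sym: ''}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend 0 to every code in the left subtree, 1 in the right
        merged = {s: '0' + c for s, c in left.items()}
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5})
# The most frequent symbol ends up with the shortest code
assert len(codes['a']) == 1
```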
The above discussion makes a crucial simplifying assumption, which allows for the optimality claim. In particular, it assumes that we are interested in a single-character binary encoding for the alphabet. However, if we remove this restriction, we are able to make far more efficient encodings, because specific sequences of characters may repeat several times throughout the text (for example, in English, it is often beneficial to create a single code for the sequence "qu" rather than for "q" and "u" separately). This problem was also solved to optimality (in an information-theoretical sense: the expected encoded length tends to the entropy bound) by Abraham Lempel, Jacob Ziv, and Terry Welch in 1984 in their LZW compression algorithm. There is in fact an entire family of LZ compression techniques; two earlier versions were LZ77 and LZ78, published by Lempel and Ziv in 1977 and 1978, respectively. Read about LZW encoding here (an exceptional resource) and here.
Interestingly, there were some patent controversies surrounding the LZW compression algorithm throughout the late 1980s and 1990s (see bullet point 4 in the DEFLATE goals, listed above). Thus, as compression techniques were fleshed out into standard utilities, the computing community quickly ended up using the LZ77-based DEFLATE algorithm for general-purpose data compression. The LZ77 algorithm is fairly similar to the LZW compression algorithm, but it keeps a sliding window for the purpose of dictionary building, rather than allowing matches from the entire preceding text. You can read about the entire LZ compression family, and the comparisons between different variations, here.
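A toy version of the LZ77 idea makes the sliding window concrete (a deliberately naive sketch: the brute-force match search is quadratic, whereas real implementations like zlib use hash chains, but the window, minimum match, and maximum match mirror DEFLATE's parameters):

```python
def lz77_compress(data, window=32768, min_match=3, max_match=258):
    """Toy LZ77: emit either a literal character or a (length, distance)
    pair pointing back into the sliding window."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        # Search the window for the longest match starting at position i
        for j in range(start, i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1   # note: matches may overlap position i
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_len, best_dist))
            i += best_len
        else:
            out.append(data[i])  # no good match: emit a literal
            i += 1
    return out
```

Because a match is allowed to run past its own starting point, a short distance with a long length encodes a repeating pattern, which is exactly the trick DEFLATE relies on for periodic data.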
The DEFLATE algorithm provides a specification that uses LZ77 and Huffman encoding to compress data. It is widely used, most prominently in the standard gzip utility.
The DEFLATE algorithm consists roughly of 3 steps: first, the original data is encoded using the LZ77 algorithm. Next, the LZ77-encoded data is encoded using Huffman coding in order to optimize the representation of this data. Finally, the actual descriptions of the Huffman codes are themselves Huffman coded to optimize the storage of these Huffman codes. The algorithm also defines three alphabets, the so-called "literal," "distance," and "code-length" alphabets, to represent specific portions of the data, as explained below.
As mentioned above, in the first step, the original text data is encoded using the LZ77 algorithm described in the previous section. In order to represent this encoding, DEFLATE uses a special pair of predefined alphabets. To understand them, we first have to remember the output of LZ77: the final, compressed data will consist of three types of values: literal byte values (chars), lengths, and distances; either a literal byte value appears directly at a given spot, or a pair of the form (length, distance) indicates that length bytes should be repeated, starting from distance bytes previous. The DEFLATE algorithm combines the first two of these value types, literal (byte value) and length, into a single "literal" alphabet of 286 characters, specified in the Alphabets section below. It also defines a second "distance" alphabet (also below) of 30 characters to represent the distance values. Thus, the DEFLATE algorithm first runs LZ77 and saves the encoded data in the given alphabets. As an example, the file
```
hello, world
hello, world
(… 20 copies of the line in all …)
hello, world
```

would be encoded as

```
'h' 'e' 'l' 'l' 'o' ',' ' ' 'w' 'o' 'r' 'l' 'd' '\n' (247, 13) EOB
```
As we see, LZ77 takes massive advantage of the repetition in the data: it encodes the string only once in its literal form, and then simply indicates that the previous 13 characters should be repeated over and over to fill out the rest of the file.
Note that the first 256 values in the literal alphabet directly map to byte (ASCII) values, while the values from 257-285 represent lengths, each followed by some number of extra bits that select the exact length from a range (length codes 257-264 and 285 have 0 extra bits). See the implementation details below for the actual alphabet values.
From the previous step, we have a file of data written in two interleaving alphabets. Naively, we could directly write this data down in the direct fixed-length binary encoding of the alphabet characters. However, the entire goal of DEFLATE is to minimize space usage, so we will make use of our Huffman coding knowledge to further decrease the space usage. Instead of using fixed-length binary encoding, which would necessitate 9 bits for every literal value, for example, we Huffman encode the literal and distance alphabets based on the document from the previous step, and save the data in this encoded form.
This encoding brings up a new issue, however: we must actually save and transmit the Huffman codes themselves, as they will change with every new document (or block, as we will see later). Keeping this in mind, we move on to step 2.5.
Directly saving every single Huffman code for every character of the alphabet would be incredibly space-inefficient, and would render meaningless any space savings we were able to make through LZ77. Because of this, DEFLATE uses a specific ordering convention for its Huffman codes. In particular,
- All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent;
- Shorter codes lexicographically precede longer codes.
For example, if we are encoding A, B, C, D as 00, 011, 1, 010, we can keep the same code lengths (which are the only part that actually represents the relative frequencies of the characters, and thus the space savings) but follow the given rules by instead using the codes 10, 110, 0, 111. The purpose of this rule is that we can specify the entire Huffman tree simply by giving the consecutive code lengths. In our example, it is enough to say (2, 3, 1, 3) to specify the entire Huffman coding (try it out: there is only one way to follow the given rules and use the given code lengths). In particular, we can use the following code to take a set of lengths and convert them to a Huffman tree (see the repo at the end for the full function, build_tree()).
```python
# Canonical Huffman code construction, following RFC 1951, section 3.2.2
# (a sketch; build_tree() in the repo differs in details)
def lengths_to_codes(lengths):
    max_len = max(lengths)
    # Step 1: count the number of codes of each bit length
    bl_count = [0] * (max_len + 1)
    for length in lengths:
        if length > 0:
            bl_count[length] += 1
    # Step 2: find the smallest code value for each bit length
    code = 0
    next_code = [0] * (max_len + 1)
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Step 3: assign lexicographically consecutive codes to the
    # symbols of each bit length, in symbol order
    codes = {}
    for symbol, length in enumerate(lengths):
        if length > 0:
            codes[symbol] = format(next_code[length], '0%db' % length)
            next_code[length] += 1
    return codes

# lengths_to_codes([2, 3, 1, 3]) recovers the codes 10, 110, 0, 111
```
As you can see, we are able to directly take the given code lengths and regenerate the Huffman codes from them, taking advantage of the prefix property of Huffman codes.
Using this convention, we are able to compress the Huffman codes by applying a run-length encoding (RLE) to them (essentially, instead of giving a list of code lengths in order, we give a list of runs of code lengths). For example, the RLE of (5, 5, 5, 5, 6, 6, 6, 6, 6, 6) is represented as ((5, 4), (6, 6)), because 5 is repeated 4 times and 6 is repeated 6 times.
In order to actually represent this run-length encoding of the Huffman trees, we will make use of the code-length alphabet mentioned at the beginning. This alphabet allows us to relatively concisely represent run-length encodings, because it allows us to represent the code lengths from 0-15 directly, and then give repeating sections using special characters (e.g. code 16 plus 2 extra bits tells us to repeat the previous code 3-6 times); see the Alphabets section below for the exact alphabet values.
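This mapping can be sketched as follows (a simplified, greedy version with a hypothetical helper name; the symbol values and extra-bit ranges match the code-length table below, but real encoders may split runs differently):

```python
def encode_lengths(lengths):
    """Encode a list of code lengths using the code-length alphabet:
    0-15 literal; 16 = repeat previous length 3-6 times (2 extra bits);
    17 and 18 = runs of zeros, 3-10 and 11-138 times."""
    out, i = [], 0
    while i < len(lengths):
        run = 1
        while i + run < len(lengths) and lengths[i + run] == lengths[i]:
            run += 1
        if lengths[i] == 0 and run >= 3:
            n = min(run, 138)
            if n <= 10:
                out.append((17, n - 3))   # 3 extra bits encode n - 3
            else:
                out.append((18, n - 11))  # 7 extra bits encode n - 11
            i += n
        elif run >= 4:
            out.append(lengths[i])        # one literal copy of the length
            n = min(run - 1, 6)
            out.append((16, n - 3))       # 2 extra bits encode n - 3
            i += 1 + n
        else:
            out.append(lengths[i])
            i += 1
    return out
```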
From the previous section, we have two Huffman trees represented as RLE sequences in the code-length alphabet. If we squint at this situation, we may realize it looks exactly like the beginning of Step 2: a large body of text in some given alphabet! So, we repeat our previous strategy a second time: we Huffman encode the code-length alphabet based on the frequencies in the RLE representations of the previous Huffman trees.
Finally, we need to specify this Huffman tree for the code-length alphabet. However, we don't want to get stuck in an infinite loop here, so we stop after this second encoding. This time, we directly record the code lengths for the code-length alphabet with fixed-length 3-bit values (note that this effectively restricts the Huffman codes for the code-length alphabet to 7 bits, the largest value a 3-bit field can hold). There is one final tricky optimization at this step, to squeeze out any extra space usage. In particular, this representation of the tree is given in the order (16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15), rather than in order from 0-18. This is because the authors of the DEFLATE method decided (based on analysis) that the characters were in general most likely to be used in this order, and so if, for example, the relatively unlikely code lengths of 2, 14, 1, and 15 bits are not used at all in the literal and distance Huffman trees, we can simply specify the lengths of the codes up through 13, and stop. Note that in general, the shortest and longest code lengths are the least likely, because they indicate that there are outlier characters in the document, whereas most documents are fairly well distributed amongst the alphabet.
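The trimming trick can be sketched like this (a hypothetical helper; the order is the one given above, and the minimum of 4 entries comes from the 4-bit HCLEN field in the block header, which stores the count minus 4):

```python
# The fixed transmission order for the code-length tree's code lengths
CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12,
            3, 13, 2, 14, 1, 15]

def serialize_cl_lengths(cl_lengths):
    """Write the 19 code lengths (given as a symbol -> length mapping)
    in the permuted order, trimming unused (zero) entries from the end.
    Each surviving entry is then stored as a fixed 3-bit value."""
    permuted = [cl_lengths.get(sym, 0) for sym in CL_ORDER]
    while len(permuted) > 4 and permuted[-1] == 0:
        permuted.pop()
    return permuted
```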
As we saw in the section above, the DEFLATE algorithm gives us the Huffman coding for the code-length alphabet, which is in turn used to represent the Huffman codings for the literal and distance alphabets. Finally, the original data is fully compressed using LZ77 compression, encoded in the literal and distance alphabets with the specified Huffman codes.
There are some implementation details that we glossed over before for clarity of the algorithm; we will specify these below.
RFC 1951 defines three different alphabets, used for different steps of the compression algorithm. These alphabets are given below (taken directly from the RFC).
| Code | Extra Bits | Represented Lengths |
| --- | --- | --- |
| 0-255 | 0 | Literal byte value |
| 256 | 0 | End of Block (EOB) |
| 257 | 0 | 3 |
| 258 | 0 | 4 |
| 259 | 0 | 5 |
| 260 | 0 | 6 |
| 261 | 0 | 7 |
| 262 | 0 | 8 |
| 263 | 0 | 9 |
| 264 | 0 | 10 |
| 265 | 1 | 11,12 |
| 266 | 1 | 13,14 |
| 267 | 1 | 15,16 |
| 268 | 1 | 17,18 |
| 269 | 2 | 19-22 |
| 270 | 2 | 23-26 |
| 271 | 2 | 27-30 |
| 272 | 2 | 31-34 |
| 273 | 3 | 35-42 |
| 274 | 3 | 43-50 |
| 275 | 3 | 51-58 |
| 276 | 3 | 59-66 |
| 277 | 4 | 67-82 |
| 278 | 4 | 83-98 |
| 279 | 4 | 99-114 |
| 280 | 4 | 115-130 |
| 281 | 5 | 131-162 |
| 282 | 5 | 163-194 |
| 283 | 5 | 195-226 |
| 284 | 5 | 227-257 |
| 285 | 0 | 258 |
| Code | Extra Bits | Represented Distances |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 0 | 2 |
| 2 | 0 | 3 |
| 3 | 0 | 4 |
| 4 | 1 | 5,6 |
| 5 | 1 | 7,8 |
| 6 | 2 | 9-12 |
| 7 | 2 | 13-16 |
| 8 | 3 | 17-24 |
| 9 | 3 | 25-32 |
| 10 | 4 | 33-48 |
| 11 | 4 | 49-64 |
| 12 | 5 | 65-96 |
| 13 | 5 | 97-128 |
| 14 | 6 | 129-192 |
| 15 | 6 | 193-256 |
| 16 | 7 | 257-384 |
| 17 | 7 | 385-512 |
| 18 | 8 | 513-768 |
| 19 | 8 | 769-1024 |
| 20 | 9 | 1025-1536 |
| 21 | 9 | 1537-2048 |
| 22 | 10 | 2049-3072 |
| 23 | 10 | 3073-4096 |
| 24 | 11 | 4097-6144 |
| 25 | 11 | 6145-8192 |
| 26 | 12 | 8193-12288 |
| 27 | 12 | 12289-16384 |
| 28 | 13 | 16385-24576 |
| 29 | 13 | 24577-32768 |
| Code | Extra Bits | Meaning |
| --- | --- | --- |
| 0-15 | 0 | Code length from 0-15 |
| 16 | 2 | Repeat previous code 3-6 times |
| 17 | 3 | Repeat code length 0 for 3-10 times |
| 18 | 7 | Repeat code length 0 for 11-138 times |
All 3 alphabets consist of some basic set of characters ("codes"), each of which is augmented with "extra bits": a specific number of extra bits that occur after the character to pick out its exact meaning from a given range. For example, if we see the literal code 276, we then have to read the following 3 bits to determine which of the lengths 59-66 it represents (000 is 59, etc.).
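Concretely, decoding a length works out to a base value plus the extra-bit value (the base and extra-bit tables below are transcribed from the length table above; the helper name is just for illustration):

```python
# Base length and extra-bit count for length codes 257-285,
# transcribed from the literal/length table above
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]

def decode_length(code, extra_value):
    """Map a length code (257-285) plus the value of its extra bits
    to the actual match length."""
    return LENGTH_BASE[code - 257] + extra_value

# Code 276 has 3 extra bits and covers lengths 59-66
assert decode_length(276, 0b000) == 59
assert decode_length(276, 0b111) == 66
```

Distance codes decode the same way with their own base/extra tables.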
Note that the fact that the code-length alphabet only has literal characters for the values ranging from 0-15 limits our Huffman codes to be only of these lengths, 0-15. This complicates the actual Huffman coding scheme used by DEFLATE, because it cannot always directly use the optimal coding, but has to take into account an upper limit on code lengths. Fortunately, we do not have to consider this difficulty when implementing a decompressor, which simply has to follow the values specified in the zipped file.
The data is not actually represented as one single, monolithic encoding, but is instead broken up into several (mostly) independently encoded blocks. Each block has its own Huffman encodings, based on the frequencies of the characters within that block. The blocks are not entirely independent, because their LZ77 encodings can refer back to previous blocks with the distance parameter. As a limit, however, the LZ77 encoding is limited to pointing back at most 32 KB. Each block consists first of the block type, then the Huffman tree specifications (if applicable), and then the actual LZ77-encoded data.
It wasn't mentioned before, but there are actually 3 types of blocks. The first, btype = 00, represents incompressible data, and it simply directly contains the uncompressed data after the block header (block type). The second, btype = 01, represents data encoded with fixed Huffman codes. The DEFLATE specification gives some general-purpose literal and distance Huffman codes that can be used if the data would do worse by specifying new Huffman trees rather than just using these predefined ones (generally the case in smaller files, where the length of the Huffman tree specification would be comparable to the space saved by using Huffman coding in the first place). In this block type, there are no Huffman tree representations, and we proceed directly to the LZ77-encoded data. Finally, the third is btype = 10, which represents data encoded with dynamic Huffman codes. This is exactly the case discussed above, and it includes the block header, the code-length Huffman representation, the literal and distance Huffman representations, and finally, the LZ77-encoded data.
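The block header itself is tiny: a 1-bit "final block" flag followed by the 2-bit btype, both packed least-significant-bit first. A sketch (the helper name is hypothetical):

```python
def parse_block_header(bits):
    """Parse a DEFLATE block header from an iterable of bits, in the
    LSB-first order they appear in the stream. Returns (is_final, btype)."""
    it = iter(bits)
    bfinal = next(it)                   # 1 = last block in the stream
    btype = next(it) | (next(it) << 1)  # 2 bits, least significant first
    return bool(bfinal), btype

# Final block with btype = 10 (dynamic Huffman): since the field is
# packed LSB-first, the bits arrive as 1 (bfinal), then 0, then 1
assert parse_block_header([1, 0, 1]) == (True, 0b10)
```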
As RFC 1951 states,
- Data elements are packed into bytes starting with the least-significant bit of the byte;
- Data elements other than Huffman codes are packed starting with the least-significant bit of the data element;
- Huffman codes are packed starting with the most-significant bit of the code.
In particular, the bytes are packed in little-endian order, and the bits within each byte are packed in LSB-to-MSB order, except for the Huffman codes, which must be read in the opposite order so that we can consume them bit by bit and walk down the Huffman tree. This is a very complicated data representation scheme, but it can be abstracted away using code along the lines of the below (assuming a little-endian machine such as Intel, so that bytes are read in little-endian order):
```python
# A sketch of the stream abstraction (see the repo for the full version)
class deflate_stream:
    def __init__(self, data):
        self.data = data  # the raw bytes of the compressed stream
        self.byte = 0     # index of the byte currently being read
        self.pos = 0      # position of the next bit within that byte

    def read_bit(self):
        # Bits come out of each byte starting at the least-significant bit
        bit = (self.data[self.byte] >> self.pos) & 1
        self.pos += 1
        if self.pos == 8:  # exhausted this byte; move to the next one
            self.pos = 0
            self.byte += 1
        return bit

def read_bits(stream, n, huffman=False):
    # Non-Huffman data elements are packed LSB-first, while Huffman
    # codes are packed MSB-first, so they are assembled in opposite orders
    value = 0
    for i in range(n):
        bit = stream.read_bit()
        if huffman:
            value = (value << 1) | bit  # shift in from the bottom (MSB first)
        else:
            value |= bit << i           # place at bit position i (LSB first)
    return value
```
As we see, deflate_stream allows us to read the bytes of the file in one at a time and decode the bits in order (by masking with pos). Further, read_bits allows us to read the bits in either order, depending on whether the data we are reading represents a Huffman code or not. In this way, we can abstract away the fairly confusing data format.
See my repo here to see a full implementation of a DEFLATEcompliant decompressor. By reviewing the steps taken to decompress data, you can check your understanding of the entire algorithm.