On this page:
13.1 The algorithm
13.2 Data design
13.3 Methods
8.6

Assignment 13: Huffman Coding

A character encoding is a way of representing an alphabet (set of characters) with some kind of code. In computer science, that code is often made of single bits, 0s and 1s, since that is how all data is ultimately stored.

Unicode’s industry-standard encoding is what lets us have emojis. :-)

For example, if you had the following encoding:

Letter

 

Code

a

 

0

b

 

1,0

c

 

1,1

Then "abc" would be encoded as the five bits 01011, and 1110 would be decoded as "cb".

There are far fewer short codes than long codes. Why is this?

For this assignment, you are going to implement Huffman coding, which is a strategy for encoding an alphabet when you know how relatively popular the letters are. For example, a, e, i, o, and u are heavily used in English, whereas j, q, x, and z are not. It makes sense to try to use shorter codes to represent the popular letters, and let the less-common letters be stuck with longer codes.

While there are various properties that make Huffman codings interesting, the one that is most helpful is that it is a prefix code. This means that no encoding of one character is a prefix of any other character’s encoding. The example given above is a prefix code. The following, however, is not:

Letter

 

Code

e

 

0

f

 

1

g

 

01

What would 01 decode to, ef or g? Prefix codes help eliminate these types of ambiguities.

13.1 The algorithm

The best way to explain the Huffman coding algorithm is to illustrate it.

For example, say you begin with the following letters:

"a", "b", "c", "d", "e", "f"

Where each letter has the following relative frequencies:

12, 45, 5, 13, 9, 16

To begin, you create 6 leaves, where each leaf stores the letter and its relative frequency.
image
You then sort them in ascending order by frequency.
image
Then, you combine the two smallest leaves into a node, whose value is now the sum of its subtrees. The first leaf you place under the red branch, and the second leaf under the blue. You place this node inside of our forest (list of trees) such that it is still sorted.
image
and repeat...
image
and repeat...
image
and repeat...
image
...until there is just one tree.
image

This process is guaranteed to produce a prefix code. Why? Could you choose red as 1, and blue as 0? Could you choose differently at every level of the tree?

To encode a character, follow the path from the root to the leaf with that letter, where a red line represents a 0, and a blue a 1. So, encoding "eba" with the tree above would produce 11010100. Decoding a sequence of 0s and 1s is (essentially) the inverse operation, so decoding 101111 would yield "df".

Composing the encoding and decoding operations in either order should produce what is effectively the identity function (on either strings or bit-sequences), but they aren’t quite inverse functions. Why not?

13.2 Data design
13.3 Methods