Assignment 13: Huffman Coding

8.6

Assignment 13: Huffman Coding

A character encoding is a way of representing an alphabet (set of characters) with some kind of code. In computer science, that code is often made of single bits, 0s and 1s, since that is how all data is ultimately stored.

Unicode’s industry-standard encoding is what lets us have emojis. :-)

For example, if you had the following encoding:

Letter		Code
a		0
b		1,0
c		1,1

Then "abc" would be encoded as the five bits 01011, and 1110 would be decoded as "cb".

There are far fewer short codes than long codes. Why is this?

For this assignment, you are going to implement Huffman coding, which is a strategy for encoding an alphabet when you know how relatively popular the letters are. For example, a, e, i, o, and u are heavily used in English, whereas j, q, x, and z are not. It makes sense to try to use shorter codes to represent the popular letters, and let the less-common letters be stuck with longer codes.

While there are various properties that make Huffman codings interesting, the one that is most helpful is that it is a prefix code. This means that no encoding of one character is a prefix of any other character’s encoding. The example given above is a prefix code. The following, however, is not:

Letter		Code
e		0
f		1
g		01

What would 01 decode to, ef or g? Prefix codes help eliminate these types of ambiguities.

13.1 The algorithm

The best way to explain the Huffman coding algorithm is to illustrate it.

For example, say you begin with the following letters:

"a", "b", "c", "d", "e", "f"

Where each letter has the following relative frequencies:

12, 45, 5, 13, 9, 16

To begin, you create 6 leaves, where each leaf stores the letter and its relative frequency.

You then sort them in ascending order by frequency.

Then, you combine the two smallest leaves into a node, whose value is now the sum of its subtrees. The first leaf you place under the red branch, and the second leaf under the blue. You place this node inside of our forest (list of trees) such that it is still sorted.

and repeat...

...until there is just one tree.

This process is guaranteed to produce a prefix code. Why? Could you choose red as 1, and blue as 0? Could you choose differently at every level of the tree?

To encode a character, follow the path from the root to the leaf with that letter, where a red line represents a 0, and a blue a 1. So, encoding "eba" with the tree above would produce 11010100. Decoding a sequence of 0s and 1s is (essentially) the inverse operation, so decoding 101111 would yield "df".

Composing the encoding and decoding operations in either order should produce what is effectively the identity function (on either strings or bit-sequences), but they aren’t quite inverse functions. Why not?

13.2 Data design

Design the Huffman class, which should be constructed with two ArrayLists, one of strings and the other of integers (in that order), representing the letters of the alphabet in question and their relative frequencies. You may assume the integers are always positive and that the strings are always of size 1 (and never equal to "?"), but you should throw an IllegalArgumentException if the lists are of different lengths or if the length of the list of strings is less than 2 (after all, there wouldn’t be much point of encoding a 1-letter alphabet).
Your forest should be represented by an ArrayList of some kind. You may need a helper class; you may need multiple ArrayLists; the design is up to you.
To achieve a consistent encoding, the leaves and nodes must be sorted consistently. When sorting the initial leaves, you should use a stable sorting algorithm. This means that if two different elements have the same value for the purposes of sorting, their relative order in the original list should be preserved. (For instance, if two letters have the same frequency, then they have the same value for sorting even though they are not the same letter.)
Similarly, when you insert a new node into the forest, you should ensure that it appears as deep in the list as possible, so that if it ties with something else already in the list, the existing value "wins" and should come first. You should be sure to carefully test these properties of your sort, and you should not have to attempt to re-sort the entire list when inserting a new node.

Hint: To achieve this, design the following methods:

// sort a list of ITrees in ascending order by frequency void sort(ArrayList<ITree> trees) {}

// insert tree into an array list of trees that are sorted along the range [0, maxIndexPossible) // 0 <= maxIndexPossible <= sortedTrees.size() // after this method, the list is sorted along the range [0, maxIndexPossible] void insert(ITree tree, ArrayList<ITree> sortedTrees, int maxIndexPossible) {}

// insert the tree into this completely sorted list so that the final list is also completely sorted void insert(ITree tree, ArrayList<ITree> sortedTrees) {}

13.3 Methods

Design the encode method, which takes a string and returns an ArrayList<Boolean>, where 0 is represented by false and 1 by true. If the string contains any character not in the alphabet, an IllegalArgumentException should be thrown with the exact error "Tried to encode <the character in question> but that is not part of the language.", where "<the character in question>" is replaced with the first character in the string that is not in the alphabet.
Design the decode method, which takes an ArrayList<Boolean>, where 0 is represented by false and 1 by true, and returns the string represented by this encoding. If at the end of the list of booleans you run out of booleans before you can reach a terminal node in the tree, you should end the string with a single "?". For example, in the above tree, the list of false, true, false, would decode to "b?".

contents ← prev up next →

	General Course Information
	Assignments
	Documentation
	Installation & New Project Instructions

	Assignment 1: Designing complex data
	Assignment 2: Delegation
	Assignment 3: Tree-Shaped Data
	Assignment 4: Accumulators
	Assignment 5: Abstracting over Data Definitions
	Assignment 6: Customizing Constructors
	Assignment 7: Mastermind
	Assignment 8: Abstraction
	Assignment 9: Generics
	Assignment 10: Fibonacci
	Assignment 11: Course Registration
	Assignment 12: Who’s On Deque?
	Assignment 13: Huffman Coding
	Assignment 14: Traversal
	Assignment 15: Arrays
	Assignment 16: Bloom Filters
	Assignment 17: Optionals
	Assignment 18: Streams
	Assignment 19: Chess