Huffman Coding

Problem

Definition (encoding)

an encoding \(c\) is a mapping from alphabet set \(\Sigma\) to binary string of finite length.

\[ c: \Sigma \rightarrow \cup_{i\ge 1} \left\{ 0, 1 \right\} \]

For instance, the encoding of a word of letters \(w = (\sigma_1, \sigma_2, \ldots, \sigma_n)\) is \(c(w) = c(\sigma_1)c(\sigma_2)\ldots c(\sigma_n)\)

Definition (unique decoding property)

We say that code \(c\) has unique decoding property iff for every two distinct finite length strings \(s_1 \ne s_2\), we have \(c(s_1) \ne c(s_2)\).

Definition (prefix code)

A code \(c\) for alphabet \(\Sigma\) is a prefix code iff for all \(\sigma, \sigma ^\prime \in \Sigma\), such that \(\sigma \ne \sigma ^\prime\): \(c(\sigma)\) is not a prefix of \(c(\sigma ^\prime)\)

For instance, the code \(\left\{ a:00, b:0010 \right\}\) is not a prefix code.

Claim 1

If \(c\) is a prefix code, then it has unique decoding property nd efficient decoding algorithm.

We want to encode as short as possible and be able to decode correctly and efficiently.

For instance, we can use five binary digits to encode \(26\) letters since \(2^5 = 32 > 26\). But this is wasteful since the letter \(x\) appears much less frequently that \(e\). A solution is to use short encoding for frequent letters and long encoding for infrequent letters.

Let the frequency of letter \(a\) in document \(T\) of length \(n\) be

\[ f(\sigma) = \frac{\text{# times $a$ appears in T}}{n} \]

The total code length of the text \(\left\vert c(T) \right\vert\) is

\[ \left\vert c(T) \right\vert = \Sigma_{\sigma \in T} \left\vert c(\sigma) \right\vert f(\sigma) n \]

We want to find optimal prefix encoding \(c\) to minimize the average code length

\[ \min \, \frac{1}{n} \left\vert c(T) \right\vert = \min \, \Sigma_{\sigma \in T} \left\vert c(\sigma) \right\vert f(\sigma) \]

Analysis

Claim 2

There is a one-to-one correspondence between a prefix code and a binary tree. Every leaf corresponds to a letter. Left edge means \(0\) and right edge means \(1\).

Fig. 32 A binary code tree

What is the cost of this binary code tree?

Properties
  • code length of a letter \(c(\sigma)\) is the depth of leaf labelled \(c(\sigma)\) in the tree, denoted \(d(\sigma)\).

  • total cost is \(\mathrm{cost}(T) = \Sigma_{\sigma \in T} f(\sigma) d(\sigma)\)

Now we have converted the code problem to a tree problem to minimize the weighted depth.

Definition (full binary tree)

A tree is a full binary tree if every non-leaf vertex has exactly 2 children.

Claim 3

If \(T\) is an optimal tree, it is a full binary tree.

Claim 4

Suppose there are two letters \(x, y\), with lowest frequency \(f(x) \le f(y) \le f(z)\) for any \(z \ne x, y\). There exists an optimal binary full tree \(T\), where \(x,y\) are siblings, and they lie at the deepest level of the tree.

Claim 5

Suppose in a tree \(T\) there are two sibling letters \(x, y\), with frequencies \(f(x), f(y)\) and parent \(z\). Then for the three \(T^{\prime} = T^ \backslash \left\{ x, y \right\} \cup \left\{ z \right\}\), we have

\[\mathrm{cost}(T ) = \mathrm{cost}(T ^\prime) + f(x_1) + f(x_2)\]

The above claims suggest that we can put infrequent letters at the lower levels.

Algorithm

Suppose \(\left\vert \Sigma \right\vert > 2\). We use a recursive algorithm \(\mathrm{BuildTree}(\Sigma)\) to build an optimal three \(T\) for alpha bet \(\Sigma\).

  • When \(\left\vert \Sigma \right\vert = 2\), assign two leaves to the root.

  • When \(\left\vert \Sigma \right\vert > 2\)

    • Let \(x, y\) be 2 letters with the smallest frequencies (breaking ties arbitrarily). Consider a new letter \(z\) with frequency \(f(z) = f(x) + f(y)\) and a new alphabet set \(\Sigma ^\prime = \left( \Sigma \backslash \left\{ x,y \right\} \right) \cup \left\{ z \right\}\)

    • Assign \(T ^\prime \leftarrow \mathrm{BuildTree}(\Sigma ^\prime)\)

    • Repeat until obtain a final tree \(T\) by adding \(x, y\) as child vertices of \(z\) and deleting the label of \(z\)

Correctness Proof

Proof by induction

  • Induction base: When \(\left\vert \Sigma \right\vert = 2\), the algorithm computes a correct solution.

  • Induction step: When \(\left\vert \Sigma \right\vert \ge 2\), suppose \(T ^\prime\) is an optimal tree for \(\Sigma ^\prime\), then \(T\) by adding \(x,y\) as siblings lying at the lowest level in \(T ^\prime\) is the optimal solution for \(T = \left( \Sigma \cup \left\{ x,y \right\} \right) \backslash \left\{ z \right\}\).

Complexity

  • Number of recursive calls: at most \(\left\vert \Sigma \right\vert\)

    • Time in each call to find the two letters with the smallest frequencies: \(\le \left\vert \Sigma \right\vert\)

Total: \(O(\left\vert \Sigma \right\vert ^2)\)

Review

This algorithm code letter by letter. If some words, phrase, or sentences appear frequently, it is better to code them directly.