Originally published in Rebol Forces.

Reb-IT!
Author: Ole Friis (ole_f@post3.tele.dk)
Date: Sep, 2000. Updated July 2001.
Part: 1 | 2 | 3 | 4 | 5 | 6

Overview

If you have ever wondered how it is possible to compress files to much less than their original size, these articles might be of interest to you. They will guide you through the making of an actually pretty good compression program, written entirely in Rebol.

Of course, you could just use Rebol's COMPRESS and DECOMPRESS commands directly if all you want is to perform compression and decompression. But where's the fun in that? We want to know how the internals of a compression scheme can work.

In this first article, we'll have a look at a technique called Huffman encoding. This is just the first of a series of techniques that we will put together at the end of this article series.

Huffman encoding

The idea presented here was developed by a guy called David Huffman, and is therefore called Huffman encoding. The basic idea is this: If a certain character appears considerably more often than any other character in a file, then why waste as much space on each occurrence of this character as on the others? We would gain some space if we used fewer bits on this character, and in return used a few more bits on the other characters. You might wonder if it is possible at all to use less space on some characters and instead use more space on others, but as you will see, it is.

Let's take an example: Suppose we have an alphabet containing only these 4 characters: A, B, C, and D. If we were to encode these in a file, we would use 2 bits for each character: A would become 00, B would become 01, C would become 10, and D would become 11. So, the encoding of ABBA would become 00010100. But now suppose that we encounter a huge number of A's and very few C's and D's. Then it would make a lot of sense to re-arrange the codes this way: A becomes 0, B becomes 10, C becomes 110, and D becomes 111. Then, the encoding of ABBA would become 010100, and the encoding of ADD would become 0111111.
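To make this concrete, here is a minimal sketch (not part of the compression program we build below; the codes block and the encode-abcd function are just made up for this illustration) that applies the variable-length code above by simple lookup and string concatenation:

codes: ["A" "0" "B" "10" "C" "110" "D" "111"]

encode-abcd: func [
    "Encodes a string over the A-D alphabet with the codes above"
    text [string!]
    /local result
][
    result: copy ""
    foreach c text [append result select codes to-string c]
    result
]

print encode-abcd "ABBA"   ; prints 010100
print encode-abcd "ADD"    ; prints 0111111

Notice that none of the four codes is a prefix of another. This is what makes the bit string decodable without any separators between characters, and it is exactly the property that the Huffman tree construction below guarantees.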

As you can see, we can easily make up texts for which this new representation scheme requires more bits than the "standard" representation, but the point is exactly that such texts are very rare, since C's and D's are rare.

Generating the Huffman codes

I will not go into much detail here about why things work as they do - if you're interested in that, then you'd better buy a book on the subject. An excellent book is "Introduction to Data Compression" by Khalid Sayood. This book also explains a lot more than just the Huffman encoding technique.

Anyway, let's see how we can make up Huffman codes for our example above. Let's say that 60% of all characters are A's, 20% are B's, 10% are C's, and 10% are D's. Then we first list the characters and their probabilities like this:

A     B     C     D
60    20    10    10

Then we find the two nodes with the lowest probabilities; in our case, they are the C and D nodes. We combine these into a new node with probability equal to the sum of the two old nodes:

A     B     C     D
60    20     \   /
              \ /
              20

We repeat the procedure, merging the two nodes marked "20" in the above:

A     B     C     D
60     \     \   /
        \     \ /
         \     +
          \   /
           40

Now, all that is left to do is to merge the two remaining nodes:

A     B     C     D
 \     \     \   /
  \     \     \ /
   \     \     +
    \     \   /
     \     +
      \   /
       \ /
       100

So, now we have a single node with probability 100%. As you can see, by going from the "100" node down the lines (and through the "+" nodes), we can reach all the letters A, B, C, and D. From each node, mark the two lines going towards the letters with 0 and 1, respectively:

A     B     C     D
 \     \     \   /
  \     \  0  \ / 1
   \  0  \     +
    \     \   / 1
  0  \     +
      \   /
       \ / 1
       100

I hope that, despite the ugly ASCII graphics, you can see that going from the bottom node to C forces us to go through the lines labeled 1, 1, and 0, respectively. Hooray, that is the very same code that we assigned to C in the example before.
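As a quick sanity check, the average code length under these probabilities is 0.6*1 + 0.2*2 + 0.1*3 + 0.1*3 = 1.6 bits per character, compared with the flat 2 bits per character of the "standard" representation - a saving of 20% on texts with this character distribution.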

Of course, real-life examples are not restricted to just 4 letters, but the above technique works equally well for larger alphabets.

Constructing a Huffman tree in Rebol

Implementing the above in Rebol should be pretty straightforward. Here's my suggestion:

construct-huffman: func [
    "Constructs a Huffman tree, using the given statistics"
    stats [block!] "The statistics"
    /local probs symbol-table node1 node2 new-node temp-list
][
    ; First, make a flat list of all the characters
    probs: make block! 256
    for i 1 256 1 [
        append/only probs reduce [pick stats i i - 1 none]
    ]

    ; Then construct the tree, joining two nodes each time
    symbol-table: copy probs
    sort probs
    while [1 < length? probs][
        ; Pick the two nodes with least probability
        node1: first probs
        node2: second probs
        remove/part probs 2
        ; Construct a father to node1 and node2
        new-node: reduce [(first node1) + (first node2) node1 node2 none]
        change/only (back tail node1) new-node
        change/only (back tail node2) new-node
        ; Insert the new node correctly in the "probs" list
        temp-list: probs
        while [all [not tail? temp-list (first temp-list) < new-node]][
            temp-list: next temp-list
        ]
        insert/only temp-list new-node
    ]
    ; Return the top element of the Huffman tree and the
    ; original flattened tree.
    reduce [first probs symbol-table]
]

This function expects a block containing 256 elements. The i'th element in this block should be a number proportional to the probability that the character with value i-1 occurs. This is used as the basis for building the Huffman tree.
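In practice you would derive these numbers by counting character frequencies in representative data. Below is a small helper of my own (make-stats is a hypothetical name, not part of this article's code) that builds such a block from a sample string; every count starts at 1, as in the Examples section below, so that every character ends up with a code:

make-stats: func [
    "Counts character frequencies, producing a 256-entry stats block"
    str [string!]
    /local stats index
][
    stats: make block! 256
    insert/dup stats 1 256
    foreach c str [
        index: 1 + to-integer c
        poke stats index 1 + pick stats index
    ]
    stats
]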

Each internal node in the tree consists of 4 entries: The statistics associated with the given node, the first child, the second child, and the parent. Leaf nodes contain 3 entries: The statistics associated with the node, the number that this node represents, and the parent.

All of this allows us to traverse the Huffman tree both from the top to the bottom and the other way.
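To make this layout concrete, here are a few trivial accessor functions (my own shorthand, not used by the code in this article):

node-prob:   func [node [block!]] [first node]
node-parent: func [node [block!]] [last node]
leaf?:       func [node [block!]] [3 = length? node]
leaf-char:   func [node [block!]] [to-char second node]

For example, leaf? tells whether a node is a leaf simply by checking whether it has 3 entries rather than 4.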

The complete Huffman tree as returned from this function consists of two elements: the top node of the tree and the original flattened tree (all the leaf nodes of the tree). The former is practical when we have a Huffman code and need to find the associated character, and the latter is practical when we have a character and need to find the associated Huffman code. Just below you will see why.

Encoding a character

Using the flattened list from the construct-huffman function, it is easy to construct Huffman codes by going backwards in the Huffman tree:

encode-huffman-char: func [
    "Huffman-encodes a character"
    symbol [char!] "Character to encode"
    tree [block!] "The Huffman tree to use"
    /local node code parent
][
    node: pick (second tree) (to-integer symbol) + 1
    code: copy ""
    while [found? parent: last node][
        insert code either node == third parent ["0"] ["1"]
        node: parent
    ]
    code
]

As you can see, we represent a sequence of bits as a string containing 0's and 1's. This is not as efficient as if we could describe bit-streams directly in Rebol, but I don't know if this is possible. If you know a better way of doing this, please let me know.
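One possible improvement: Rebol's enbase and debase natives support base 2, so the string of 0's and 1's can at least be packed into real bytes before hitting the disk. Here is a sketch under the assumption that we pad the bit string with zeros up to a whole number of bytes; pack-bits and unpack-bits are hypothetical names of mine, and a real file format would also need to store the unpadded bit count:

pack-bits: func [
    "Packs a string of 0's and 1's into a binary value"
    bits [string!]
    /local padded pad
][
    padded: copy bits
    ; Pad with 0-bits up to a multiple of 8
    pad: remainder (8 - remainder (length? bits) 8) 8
    insert/dup tail padded #"0" pad
    debase/base padded 2
]

unpack-bits: func [
    "Unpacks a binary value back into a string of 0's and 1's"
    bin [binary!]
][
    enbase/base bin 2
]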

Encoding a string

Now that we have the above function for encoding single characters, we can of course use it multiple times to encode a string:

encode-huffman: func [
    "Huffman-encodes a string"
    str [string!] "String to encode"
    tree [block!] "The Huffman tree to use"
    /local result
][
    result: copy ""
    while [not tail? str][
        append result encode-huffman-char first str tree
        str: next str
    ]
    result
]

Decoding a character

Decoding a character is just as easy as encoding a character: Start at the top of the Huffman tree, then dive into the tree, choosing the first or second child of each node according to the next character in the Huffman code:

decode-huffman-char: func [
    "Huffman-decodes a character"
    code [string!] "The Huffman code to decipher (will be altered)"
    tree [block!] "The Huffman tree to use"
    /local node
][
    node: first tree
    until [
        either (first code) = #"1" [node: second node] [node: third node]
        remove code
        (length? node) = 3
    ]
    to-char second node
]

This function actually deletes the contents of the code as its characters are being used. This might seem stupid, but it is smart when we want to use this function to decode whole Huffman-encoded strings instead of just one character.
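For example, assuming the tree built in the Examples section below (where, as the transcripts there show, "b" encodes to the nine bits 110011100 and "c" to 110011011), decoding one character consumes exactly its own bits and leaves the rest untouched:

>> code: copy "110011100110011011"
== "110011100110011011"
>> decode-huffman-char code tree
== #"b"
>> code
== "110011011"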

Decoding a string

Using the function above, decoding several encoded characters is a breeze:

decode-huffman: func [
    "Huffman-decodes a string"
    code [string!] "The Huffman code to decipher"
    tree [block!] "The Huffman tree to use"
    /local result
][
    code: copy code
    result: copy ""
    while [not tail? code][
        append result decode-huffman-char code tree
    ]
    result
]

Examples

Let's just have a quick look at some examples. Suppose that we know that in a text we want to encode, the letter "a" appears considerably more often than any other character. First, we construct a statistics block:

>> stats: []
== []
>> insert/dup stats 1 256
== []
>> poke stats 98 300
== [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1...
>> tree: construct-huffman stats
== [[555 [255 [127 [63 [31 [15 [7 [3 [1 255 [...]] [2 [1 0 [...]]
[1 1 [...]] [...]] [...]] [4 [2 [1 2 [...]] [1 3 [...]] [...]] [2...

(By the way, the character "a" has ASCII value 97, so the 98th entry of stats should be big - the 0th character corresponds to the first entry in stats, the 1st character corresponds to the second entry, and so on.)

Now let's construct some Huffman codes:

>> encode-huffman "aabca" tree
== "001100111001100110110"
>> encode-huffman "bbcde" tree
== "110011100110011100110011011110011010110011001"

Going in the other direction is also easy:

>> decode-huffman "001100111001100110110" tree
== "aabca"
>> decode-huffman "110011100110011100110011011110011010110011001" tree
== "bbcde"

Now, by encoding a whole string of characters into a bitstream of Huffman codes, we could write this bitstream to a file instead of the original string, and if certain characters appear with high probability, we would save some space.
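Here is a minimal sketch of that last step, reusing the hypothetical pack-bits and unpack-bits helpers from earlier (the file name %encoded.bin is just an example; a complete file format would also have to store the bit count and the statistics, so that the decoder can rebuild the same tree):

bits: encode-huffman "aabca" tree
write/binary %encoded.bin pack-bits bits

; Reading back: drop the pad bits before decoding
packed: read/binary %encoded.bin
decode-huffman copy/part unpack-bits packed length? bits tree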

Perhaps you do not yet see the big advantage of the Huffman scheme, but in the next compression articles we will see how to transform a text file into a character stream containing a lot of small values (and very few large values). Using Huffman encoding on such a character stream can compress it a lot. So, look forward to the forthcoming articles.

Comments

I know that a lot can probably be done to refine the functions listed above. If you happen to write nicer versions of the functions, please send them to me at ole_f@post3.tele.dk, and then I will use them instead of these functions (if I do like your functions, at least ;-)