Compression, part 2: Move-to-Front encoding

Author: Ole Friis (ole_f@post3.tele.dk)
Date: Jul, 2001
Part: 2 of 6

Another encoding technique?
The idea
Implementation
Examples

Another encoding technique?

In the last article, we saw an encoding technique, Huffman encoding, which is good for compressing a series of values where certain values appear very frequently. This time we will have a look at another encoding technique which actually does not compress anything at all. "What's the big idea in that?", you may rightly wonder.

The technique described here, called Move-to-Front encoding, takes as input a string and gives as output a series of numbers, one for each character in the input string. If the input string has lots of sequences containing the same letter in a row, then the output series will have a lot of small values, and not so many big values.

This means: If we have an input string with lots of sequences containing the same letter in a row, then we can first apply Move-to-Front encoding. This gives us a new series which contains lots of small numbers. Then we can apply Huffman encoding, in which we choose to use less space on small numbers than on bigger numbers. The end result should then be a shorter sequence than the original one.

In the next article, we will look at a technique which exactly converts normal text into text where there are many sequences containing the same letter. By putting all of these three techniques together, we will actually get a good compression program.

The idea

In fact, the idea behind Move-to-Front encoding is extremely simple. We start by creating a list of all possible characters that may occur in an input stream. If we are allowed to use the letters a, b, c, d, and e, the list could look like this:

0 1 2 3 4
a b c d e

Notice that a is at position 0 in the list, while c is at position 2. Very simple. The idea is that while we encode an input text, one character at a time, we will use the above list to decide what to output, after which we will change the list before we look at the next character in the input list. An example will show you completely what I mean. Let's say that our input text is "ccdcabb":

First input character: "c"
The list is: 0 1 2 3 4
             a b c d e

We see that c is at position 2 in the list, so we output "2". At the same time, we move c to position 0, pushing a and b one place to the right. Now the situation is this:

Next input character: "c" (the second c in "ccdcabb")
The list is: 0 1 2 3 4
             c a b d e

Using the same procedure, we output "0", since c is now at position 0. Again we move c to the front, but since it's already there, this doesn't change the list. We continue for a little while with the next character from the input, namely d:

Next input character: "d"
The list is: 0 1 2 3 4
             c a b d e

This will output "3", since d is at position 3, and now we will move d to the front, pushing c, a, and b to the right. We can now look at the next character, c:

Next input character: "c"
The list is: 0 1 2 3 4
             d c a b e

This, of course, gives an output of "1", which means that the first 4 letters from our input, "ccdc", gives an output of "2031". You can try to continue yourself with this technique, encoding the remaining "abb" of the input.

As you can see from the above, if the input looks like "abbbbbbc", then the sequence of b's will result in lots of 0's in the output, since b will be moved to the front of our list when the first b is encountered.

Implementation

Implementing Move-to-Front encoding in Rebol is very easy:

encode-mtf: func [
  "Encodes a string using Move-to-Front"
  str   [string!] "The string to encode"
  table [block!]  "The initial table to use"
  /local result index
][
  table: copy table
  result: make block! (length? str)
  foreach letter str [
    index: find/case table letter
    append result (index? index) - 1
    remove index
    insert table letter
  ]
  result
]

The above is pretty straightforward: First, we copy the initial table, since we are going to mess it up. Then, for each letter in the input string, we find its position in the table and add this position (which is the INDEX? value minus one, since Rebol starts its positions at 1 and we want them to start at 0) and move the letter in the table to the front.

We haven't discussed at all if we can actually reverse the Move-to-Front algorithm, but it is quite easy to see that by looking at each number the encoding gives, we can pick our letter from our table and thereafter do the moving to front in the same way as the encoding does:

decode-mtf: func [
  "Decodes a string using Move-to-Front"
  indices [block!] "The encoded version of the string"
  table   [block!] "The initial table to use"
  /local result letter
][
  table: copy table
  result: make string! (length? indices)
  foreach index indices [
    letter: pick table (index + 1)
    append result letter
    remove skip table index
    insert table letter
  ]
  result
]

I'll leave it to you to find out in detail what the above does. It's pretty easy to figure out, though.

Examples

To use the two functions above, we need an initial table. Let's make one that resembles the table from the walkthrough of the encoding idea above:

>> table: [#"a" #"b" #"c" #"d" #"e"]
== [#"a" #"b" #"c" #"d" #"e"]

Now we can use this table to encode and decode our "ccdcabb" string from above:

>> encode-mtf "ccdcabb" table
== [2 0 3 1 2 3 0]
>> decode-mtf [2 0 3 1 2 3 0] table
== "ccdcabb"

This might not look very amazing, but later you will see that this encoding technique is one of the cornerstones of our compression program. Though, you will have to wait a little while before you can really feel the satisfaction!

Rebol Forces

Zine Building a Server Engine Metaphors Recursion, Iteration and Algebra A nice juicy Pair! Writing your own Protocol Handler Rebol and the Shell Series from the Ground Up Compression Dialecting and the Console View FAQ Links