Compression, part 3: Burrows-Wheeler transform

Author: Ole Friis (ole_f@post3.tele.dk)
Date: Jul, 2001
Part: 3 of 6

A little more advanced
Motivation
The basic idea
Implementation of the encoding
Decoding
Implementation of the decoding
Examples

A little more advanced

This compression article will perhaps feel a little harder to get to grips with than the first two, since the technique covered this time, the Burrows-Wheeler transform, might not be entirely easy to grasp (especially the decoding part). However, I will try to be as pedagogical as I can and try to explain the concepts in a simple way. But in any case, please don't be scared if you don't understand everything in this article - as long as you get the idea of why the Burrows-Wheeler transform can help us create a compression program, you can always read the details another time. This article will without doubt be the most advanced in this compression series, and none of the other articles assume that you will understand everything in this article.

So, are you feeling OK again? The dizziness has almost gone? In that case... LET'S JUMP INTO IT!

Motivation

As explained in the 2nd compression article, we already have two out of three techniques that, when combined, will be able to compress data for us. The two techniques already known are:

1: Huffman encoding: Given texts in which certain letters appear more often than other letters, we can compress this text, using less space on the often occuring letters and more space on the rest of them.

2: Move-to-Front encoding: Given texts in which series of the same letter occur often (like in "aabccccddd"), we can construct texts in which very small values appear more often than big values.

Now we would like the 3rd component:

3: Burrows-Wheeler transform: Given normal, English texts, we can construct texts in which series of the same letter appear often.

With the Burrows-Wheeler transform, the compression algorithm is this: Given a normal English text, we use the Burrows-Wheeler transform on this. This gives a new text which is suitable for a Move-to-Front encoding (since it has lots of sequences of the same letters). This gives yet another new text, which is suitable for Huffman encoding (since this text has lots of very small values). As you can see, the only step that actually compresses anything is the Huffman encoding, the two other steps are there simply to ensure that the Huffman encoding is able to compress the data nicely.

We would also like to go the other way - that is, given a compressed text, we would like to decompress it into the original text. We have already seen that this is possible with both Huffman encoding and Move-to-Front encoding. It is also possible with the Burrows-Wheeler transform, though you will most probably find it a lot less obvious than with the other two methods.

The basic idea

When you write English, you might not notice it, but quite often the letter "w" is followed by an "h", like in "what", "which", "where", and so on. In a similar way, often "." is followed by " " (space). Put simply, given a letter, we can give a qualified guess as to what the next letter is. This can be used to give good compression, and indeed a lot of compression programs work this way. The idea behind the Burrows-Wheeler transform is similar: Quite often, given a letter, we can give a qualified guess as to which letter came BEFORE.

If this is the case (as it is with normal English - you'll see that later), then we could do the following trick: Given a text, for example "Hello there", we could write up 11 strings after each other. These strings would be

01: The original string

02: The original string rotated 1 letter to the right

03: The original string rotated 2 letters to the right <br>...

11: The original string rotated 10 letters to the right

This would give these strings:

01: Hello there
02: eHello Ther
03: reHello The
...
11: ello thereH

So far, so good. Now, what if we sort these strings? This would give us the following strings (with a case-sensitive sort, which is why "H" comes before "e"):

01:  thereHello
02: Hello there
03: eHello ther
04: ello thereH
05: ereHello th
06: hereHello t
07: llo thereHe
08: lo thereHel
09: o thereHell
10: reHello the
11: thereHello

Notice that the first vertical row is a sorted list of letters, namely " Heeehllort". What we now do is take the LAST vertical row, which is "oerHhtelle ". This is one of the outputs of the Burrows-Wheeler transform. The other output is the number 2, since this is the position of our original text, "Hello there", in the sorted list.

So what happened? Since we believe that what comes before "h" is almost always "w", what comes before " " is often ".", and so on, we would also be lead to believe that the list of letters that come BEFORE a sorted list of letters will consist of blocks of the same letters. In our case with a very small text this does not show very clearly, but this property will become more evident when we look at larger texts.

You might wonder why the output from the Burrows-Wheeler transform is both this weird string (the last vertical row in the list of sorted, rotated strings) and the position in the list of sorted, rotated strings where we find our original, unrotated string. The reason is that this is necessary if we ever want to automatically go from the output back to our original text.

Notice, now, that the Burrows-Wheeler transform does not compress the text - instead, the text is not compressed, and we also get an extra index number, which takes up more space. However, as we will see, the text is very suitable for Move-to-Front and Huffman encoding, so the extra index number is worth the effort.

Implementation of the encoding

This is my suggestion for a Burrows-Wheeler encoding function:

encode-bw: func [
  "Encodes a string using the Burrows-Wheeler transform"
  str [string!] "The string to encode"
  /local str-length permutations transformed-string
    transformed-info i
][
  str-length: length? str
 
  ; Build a representation of the permutations
  permutations: make block! str-length
  for i 1 str-length 1 [append permutations i]
 
  ; Sort the permutations in lexicographic order
  sort/compare permutations func [n1 n2 /local c1 c2] [
    loop str-length [
      if n1 < 1 [n1: str-length]
      if n2 < 1 [n2: str-length]
      c1: pick str either n1 = 1 [1][str-length - n1 + 2]
      c2: pick str either n2 = 1 [1][str-length - n2 + 2]
      if c1 < c2 [return true]
      if c1 > c2 [return false]
      n1: n1 - 1
      n2: n2 - 1
    ]
    return true
  ]
 
  ; Now, let's create the new string
  transformed-string: make string! str-length
  transformed-info: (index? find permutations 1)
 
  foreach permutation permutations [
    append transformed-string pick str (
       str-length - permutation + 1
    )
  ]
  reduce [transformed-string transformed-info]
]

This might look very cryptic. We use one little trick: Instead of actually storing all the rotated strings, instead we store the numbers that these rotated strings would have in the original list we would make - string 1 was rotated 0 times, string 2 was rotated 1 time, string 3 was rotated 4 times, and so on. So we simply build the list [1 2 3 ...]. From each number, we could easily construct the rotated string. After we make this list of "pseudo-strings", we sort it. The comparison function is probably the most advanced part in the above. From two numbers, say 9 and 12, that represent two strings in the list we try to sort, in this case the original string rotated 8 times and the original string rotated 11 times, we can use the numbers and the original string to compare these two strings. Don't kill yourself if you don't understand it, please.

After this, it's a simple matter to extract the new string we want, along with the number needed for decoding the string again.

Decoding

Let's take a deep breath. Hope you're still with me, even though you might have skipped some of the above - which is OK for now. You can always read it through another time, when your brain is more likely to be in the mood for it.

You may be puzzled about one thing: How the h**k are we going to reconstruct the original text, given the outputs of the encoding function above? That this is possible at all is far from intuitive. However, it is possible. The Rebol function (which you can see later in this article) looks very simple, but the logic behind it is hard to grasp the first time you get it presented.

By sorting our encoded string, we can get the first vertical row of the string list above - and since our encoded string is the last vertical row of the list above, from the string "oerHhtelle " we can get the following part of the list of sorted, shifted strings:

01:           o
02: H         e
03: e         r
04: e         H
05: e         h
06: h         t
07: l         e
08: l         l
09: o         l
10: r         e
11: t

What we want to do now is to find, for each letter in the last vertical row, exactly which letter in the first vertical row this corresponds to. The first letter in the last vertical row, namely "o", is easy: It can only be the "o" from line 9, since this is the only "o" in the first vertical row.

The next letter, "e", is a bit trickier, and the following is the part that you may want to read through a few times: The "e" that we're dealing with is the first "e" in the first vertical row, i.e. the "e" from line 3. This is because what comes just after our "e" is a letter that comes before any other letter following any of the other "e"s in the first vertical row. Since the first vertical row represents the first letters in our sorted list of rotated strings, this would mean that the rotated string with just our "e" in front would go before any other rotated string with another "e" in front.

So, you ask, how do we know that the letter following our "e" comes before any other letters following any other "e"? Simple: Just as the first vertical row is sorted by the letters themselves, you can think of the last vertical row as sorted by THE LETTERS THAT COME JUST AFTER THE LETTERS THEMSELVES! (Sorry, I have no bold style in Makespec, so I need to use capital letters...)

See? See? No? The important part to get is the sorting criterium for the last vertical row, i.e. that they are sorted according to the letters following them.

With the same logic, the last "e" from line 7 is the same as the first "e" from line 4 (since this is the 2nd "e" in the sorted list), and the last "e" from line 10 is the same as the first "e" from line 5.

So, putting these relations in our table, we get the following:

01:           o -> line 9
02: H         e -> line 3
03: e         r -> line 10
04: e         H -> line 2
05: e         h -> line 6
06: h         t -> line 11
07: l         e -> line 4
08: l         l -> line 7
09: o         l -> line 8
10: r         e -> line 5
11: t           -> line 1

This table also shows that e.g. the strings in line 1 and line 9 differ with one letter of rotation - take string 1, rotate it once to the right, and you get to string 9. If your brain isn't total jelly after the arguments above, you should be able to see how we can use this to get the original string:

The last letter is clearly "e", since this is the last letter in line 2. The letter just before this "e" is the first letter in line 3, namely "r". So the last two letters of the decoded string are "re". We now go to line 10, where we get an "e", making it "ere". And so on.

Implementation of the decoding

If you have understood the above, you might get puzzled as to why the following doesn't look exactly like the described decoding process. Explanations follow later.

decode-bw: func [
  "Decodes a string using the Burrows-Wheeler transform"
  str   [string!]  "The encoded version of the string"
  index [integer!] "The index to the original text"
  /local
  str-length first-row found-table t letter-index
  old-offset new-offset decoded-string
][
  str-length: length? str
 
  ; Find t, which is the relation between str and the permutations
  first-row: sort/case copy str
  insert/dup (found-table: make block! 256) 0 256
  t: make block! str-length
  foreach letter str [
    letter-index: (to-integer letter) + 1
    old-offset: pick found-table letter-index
    new-offset: index? find/case (skip first-row old-offset) letter
    append t new-offset
    poke found-table letter-index new-offset
  ]
 
  ; Now the rest is easy
  decoded-string: make string! str-length
  insert decoded-string pick str index
  loop str-length - 1 [
    insert decoded-string pick str (index: pick t index)
  ]
  decoded-string
]

The principle of the above is that the block t is the number information that we got from our example before (the list of numbers to the right). The "found-table" is used to speed up the process of finding for example the 100th occurence of "e" in a big text - when we have found the 1st "e", we remember the position of this, and when we look for the 2nd "e", we start a new search for "e" just after the position of the first "e", and so on. So, when we want the 100th "e", we already have the position of the 99th "e", and finding the next is fast. Very simple.

When we have the t block, reconstructing the original text is easy, as is seen from the concluding loop.

Examples

Hopefully, the code above will encode and decode our examples above in the same way as we did. Let's try:

>> encode-bw "Hello there"
== ["oerHhtelle " 2]
>> decode-bw "oerHhtelle " 2
== "Hello there"

(Phew!) Go ahead and try with other texts too. It is Especially interesting to encode big English texts and see how the Burrows-Wheeler transform actually groups letters of the same kind in a fascinatingly effective way. (Doing this will probably reveal that the Burrows-Wheeler encoding function is not the fastest encoding function in the world, but more on that in a forthcoming article...)

In the next article we'll almost finish our compression program, looking at how to nicely merge our 3 encodings/decodings into a fully-functioning program. From now on, it's mostly trivial :-)

Rebol Forces

Zine Building a Server Engine Metaphors Recursion, Iteration and Algebra A nice juicy Pair! Writing your own Protocol Handler Rebol and the Shell Series from the Ground Up Compression Dialecting and the Console View FAQ Links