Compression, part 4: The needed glue
Author: Ole Friis (firstname.lastname@example.org)
Date: Jul, 2001
Part: 4 of 6
- The Bare Necessities
- Binary conversion
- Kickstarting Move-to-Front
- Finding the Huffman probabilities
- The file format
- Almost final comments
The Bare Necessities
With the previous 3 articles, we have described in detail (and implemented) the three different components of our compression program. However, before we can just stick the whole thing together in a series of function calls, there are some pragmatic stuff that we need to cover. That's exactly what we'll do in this article.
One problem with our Huffman routines that is clear from the first article is that it operates on strings representing binary numbers, in the form "0111001". This is, to put it mildly, not a very compressed form. What we want to do is convert these into real binary form.
This is very easy:
encode-bitstream: func [
"Encodes a string of 1's and 0's to a binary string"
s [string!] "The string of 1's and 0's"
/local res byte add-this
res: copy ""
while [(add-this <> 0) and (not tail? s)] [
if #"1" = first s [byte: byte + add-this]
add-this: to-integer (add-this / 2)
s: next s
append res to-char byte
if tail? s [return res]
decode-bitstream: func [
"Decodes a binary string into a string of 1's and 0's"
s [string!] "The binary string"
/local res next-bit
res: copy ""
while [not tail? s] [
append res either 0 = and~ next-bit to-integer first s ["0"]["1"]
next-bit: to-integer (next-bit / 2)
if next-bit = 0 [next-bit: 128 s: next s]
You don't have to understand the above - it's just a lot of bit-fiddling. What we do is use Rebol's string! datatype to store our series of bits. An example:
>> a: encode-bitstream "11111111000000001111000000001111"
>> length? a
>> decode-bitstream a
Surely, the variable a is not meant for reading by human beings. What we care about is that blocks of 8 characters in our "0 and 1 string" get concatenated into a single (probably unreadable) character, and that we can get the other way again.
The Move-to-Front algorithm works fine as it is now, but we can quickly do a little optimization. Instead of giving the routine a starting table which just starts with the character with value 0, then the character with value 1, etc., we can arrange this table so that we already have some characters in front that will most surely appear in English texts.
If we do this, the Move-to-Front algorithm can start right away by giving small numbers as output, instead of first using some time to get the table right, with often-used characters up front. As a starting table for the Move-to-Front algorithm, we can use this:
mtf-table: copy used-letters: [
#"." #"s" #"r" #"g" #"m" #"w" #"a" #"l" #"y" #"," #"e"
#"o" #"t" #" " #"n" #"k" #"W" #"d" #"h" #"i" #"f" #"x"
for i 0 255 1 [
letter: to-char i
if not found? find/case used-letters letter [
append mtf-table letter
You'll easily be able to think about ways to create better starting tables, but let's stick with this one for now.
Apart from this, we have a small problem with the Move-to-Front functions: The encoder returns a block of integers, and the decoder expects the same block of integers. Our Huffman encoder, on the other hand, expects values of the string! datatype. However, this is easily solved with a "foreach" loop, converting the values by use of "to-char" and "to-integer".
Finding the Huffman probabilities
One important point before we can make the whole compression process work in a satisfying way is to make the Huffman encoding work good. For this to happen, we need to generate a good "probability list", which means that we need a good estimation of the probabilities of the characters given to the Huffman encoder. With this "probability list", the Huffman encoding can use less space on the often occuring characters and more space on the more seldomly occuring characters.
So, how can we get such a list? Pretty straight forward: For a number of English texts, we run them first through the Burrows-Wheeler transform, then through the Move-to-Front algorithm. Then we just keep track of how often each character occurs in the output of this process, and we have our list.
The file format
We have to settle on a file format. My suggestion is this:
4 bytes: Length of decoded text.
4 bytes: The index value needed for the Burrows-Wheeler decoding.
? bytes: Compressed version of the text.
Why do we need the length of the decoded text? Well, this is necessary because the "encode-bitstream" and "decode-bitstream" functions above could, when combined, decode a little more characters than were actually encoded, if the length of the original string (of 0's and 1's) is not a multiple of 8. There is a possibility that this error cannot be found in any other way, since the Huffman decoding might succeed anyway (giving one or more extra characters).
We could have chosen to include more information in the header of the file, for example:
4 bytes: Id for our new file format.
4 bytes: Checksum, to ensure that the file is intact.
Others might be of interest, too, but we'll stick with the length of the decoded text and the index value.
For creating those 4 bytes from a Rebol integer, we need a simple conversion:
encode-integer: func [
"Converts a Rebol integer to a 4-character string"
i [integer!] "The integer to encode"
join "" reduce [
to-char (i / 16777216) ; 16777216 = power 2 24
to-char (i / 65536) // 256 ; 65536 = power 2 16
to-char (i / 256) // 256
to-char i // 256
decode-integer: func [
"Converts a 4-character string to a Rebol integer"
s [string!] "The 4-character string to decode"
to-integer ((to-integer first s) * 16777216) + ; 16777216 = power 2 24
((to-integer second s) * 65536) + ; 65536 = power 2 16
((to-integer third s) * 256) +
to-integer fourth s
Another thing we have to do is change our decode-huffman function so that it only decodes the correct amount of characters. This is very, very easy:
decode-huffman: func [
"Huffman-decodes a string"
code [string!] "The Huffman code to decipher"
tree [block!] "The Huffman tree to use"
count [integer!] "Number of characters to decipher"
code: copy code
result: copy ""
loop count [
append result decode-huffman-char code tree
(In fact, this made the function a little neater.)
Almost final comments
By now, creating the finished program should be a task for any given slave. However, being the fine individual that I am, I will give you the full implementation in the next article. However, please don't think that this is the end of the compression articles: There will be a concluding article stuffed with discussions about this and other compression techniques and some exercises for you to (hopefully) enjoy. So, please, stay with us a little longer.