Compression, part 6: Discussions
Author: Ole Friis (firstname.lastname@example.org)
Date: Jul, 2001
Part: 6 of 6
- Efficiency of the compression
- Problems with our approach
- Other approaches to encoding
- Problems etc.
Efficiency of the compression
In fact, the compression ratio we get from our compression program is not bad at all. As an example, I compressed the Makespec version of the first compression article. The results were:
Original text file size: 11750 bytes
Compressed size: 4369 bytes
Size from Rebol's COMPRESS function: 4292 bytes
OK, so we didn't quite beat Rebol's COMPRESS function, but we still got respectable compression. One reason we're not beating the COMPRESS function might be that the Makespec version of the article is not exactly "plain English": it contains code snippets, ASCII drawings, and formatting directives. So I removed all of this and tried again:
Original text file size: 7510 bytes
Compressed size: 2953 bytes
Size from Rebol's COMPRESS function: 3017 bytes
Here we are just a little better than Rebol's COMPRESS function. The picture is not quite as bright as it may seem, though, because Rebol's COMPRESS function does something we don't: it makes sure that the output is ASCII text containing no unreadable characters, whereas our output does contain such characters.
However, considering the simplicity of our algorithms, the above results are good. Fiddling with various parameters in the algorithms might give better results, and it might be the case that small tricks could be used some places to push the performance a little further.
Problems with our approach
As you may have experienced first-hand if you have tried to compress files with our compression function, the compression algorithm is quite slow. Speeding up some parts might be possible - the reason I haven't pursued this is that I felt reader-friendliness should come before efficiency.
Also, remember that our compression program is implemented in Rebol, while most other compression programs are written in C and compiled with very efficient compilers. I could have done these articles in C, but it just wouldn't have been as much fun, and the resulting program listing would have been quite a lot bigger and harder to get to grips with.
Another problem is that our program needs the whole input available before it can start compressing. This is a basic requirement when doing Burrows-Wheeler transforming, since it has to sort the rotated strings. This prevents us from using our algorithms to encode a live video stream, for example, and it means that our compression scheme needs a lot of memory. Other compression approaches can work with one byte at a time, working their way through the input and spitting out output along the way.
Don't forget, though, that our approach to compression is very clean, compact, (relatively) easy to understand, and platform independent. Plus, it gives acceptable compression ratios.
Other approaches to encoding
The Burrows-Wheeler transformation used as the basis for our compression program is just one of many approaches to compression - and actually a pretty new approach, since the first compression program based on this transformation was written in 1994. Of course the world had compression programs before 1994, too.
I will not go into detail about other compression algorithms here, but just scratch the surface.
There is a whole class of algorithms that use the fact that when we get a letter from the input, we can say that certain characters most likely will follow. This is called "predictive coding".
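As an illustration of the idea (sketched in Python rather than Rebol, and only a toy model, not a real coder), an order-1 predictor can count which character tends to follow each character, and then "predict" the most frequent follower:

```python
from collections import Counter, defaultdict

def build_order1_model(text):
    # For each character, count which characters follow it.
    model = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        model[current][following] += 1
    return model

def most_likely_next(model, ch):
    # The prediction: the character seen most often after ch.
    followers = model.get(ch)
    return followers.most_common(1)[0][0] if followers else None

model = build_order1_model("the theory of the thing")
print(most_likely_next(model, "t"))  # 'h' always follows 't' in this text
```

A real predictive coder would go one step further and spend very few bits on characters the model predicted correctly, and more bits on surprises.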
Another approach is the so-called dictionary techniques. One possibility is that the encoder has a dictionary of often-used character sequences (for example including common words like "what", "where", etc.), and then instead of outputting each character separately, it tries to match blocks of the input with dictionary entries and then outputs the positions in the dictionary where the text blocks are.
Instead of a static dictionary (which might work well for one special kind of input, for example English texts or Rebol programs, but might work very, very badly with any other type of input), the encoder can build the dictionary "on the fly" as it finds repeating patterns in the input. The so-called LZ77 and LZ78 algorithms both do this, but in different ways.
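To make the "dictionary built on the fly" idea concrete, here is a toy sketch in Python (not Rebol) in the spirit of LZ77 - strictly speaking the LZSS variant, which emits either a literal character or a back-reference into the recent input; a real implementation would pack the tokens into bits:

```python
def lz77_encode(data, window=255):
    # Emit (offset, length, None) for a back-reference into the last
    # `window` characters, or (0, 0, char) for a literal.
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        start = max(0, i - window)
        for off in range(1, i - start + 1):
            length = 0
            while i + length < len(data) and data[i + length - off] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        if best_len >= 3:  # only worth referencing longer matches
            out.append((best_off, best_len, None))
            i += best_len
        else:
            out.append((0, 0, data[i]))
            i += 1
    return out

def lz77_decode(tokens):
    out = []
    for off, length, ch in tokens:
        if ch is not None:
            out.append(ch)
        else:
            for _ in range(length):
                out.append(out[-off])  # copy from earlier output
    return "".join(out)

tokens = lz77_encode("abcabcabcabc")
restored = lz77_decode(tokens)
```

Here the repeating "abc" pattern collapses to three literals plus one back-reference, which is exactly the kind of self-built dictionary the LZ family exploits.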
Rebol's COMPRESS function is, as I remember being told, based on the source code of the ZIP compression program, which in turn is based on the LZ77 algorithm.
Instead of Huffman encoding, we could have used a more advanced technique called arithmetic coding. It normally gives slightly better compression ratios (assuming you start with the right "probability table", as in the Huffman technique), but it is also a little more difficult to implement. Where the Huffman technique outputs whole 0's and 1's, arithmetic coding can in effect output "parts of 0's and 1's". If that sounds cryptic, don't think too hard about it.
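The "parts of 0's and 1's" remark can be made slightly more concrete. Huffman must assign each symbol a whole number of bits, but the information content of a symbol with probability p is -log2(p) bits, which is rarely a whole number. A quick Python check (a sketch of the arithmetic, not of an arithmetic coder):

```python
import math

def entropy_bits(probs):
    # Average information content in bits per symbol.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A skewed two-symbol alphabet: Huffman is forced to spend a whole
# bit per symbol, but the true cost is under half a bit.
print(round(entropy_bits([0.9, 0.1]), 3))  # about 0.469 bits/symbol
```

Arithmetic coding approaches this bound by encoding the whole message as one long binary fraction, so a very probable symbol can effectively cost a fraction of a bit.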
As I mentioned in the first article, a really good book covering lots of topics in data compression is "Introduction to Data Compression" by Khalid Sayood, 2nd edition, ISBN 1-55860-558-4. So, if you would like to dive deeper into what this is all about, this is the book to get.
You might also be able to find some information on the internet, but I don't know of good places to start.
Problems etc.
The following is a bunch of exercises for you to spend a little time on. Please notice that these exercises are not just there to keep you busy; they are meant to inspire you and suggest further enhancements to the compression engine. So even if you have no intention of actually doing any of the exercises right now, please read through the text below and think for at least a couple of seconds after each exercise.
Exercise 1: In the decode-bw function, the "found-table" can be initialized so that each entry points to the first occurrence of the corresponding character (if it occurs at all). This can be done using only one iteration through "first-row". Doing this will speed up the decoding a little. Do this!
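In Python terms (as a hint of the shape of the solution, not the Rebol answer itself), building such a first-occurrence table in a single pass might look like:

```python
def first_occurrences(first_row):
    # Map each character to the index of its first occurrence,
    # using exactly one pass over first_row.
    found = {}
    for index, ch in enumerate(first_row):
        if ch not in found:
            found[ch] = index
    return found

print(first_occurrences("aabcb"))  # {'a': 0, 'b': 2, 'c': 3}
```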
Exercise 2 (only if Exercise 1 has been completed): After this, we can eliminate the "find/case" in the loop that builds t. The search is unnecessary because we know it would terminate at the first character anyway (and why do we know that?). Change this!
Exercise 3: The Burrows-Wheeler transformation uses the fact that certain letters have a higher probability than others of occurring BEFORE certain other letters. Suggest an easy way to change the BEFORE to AFTER (one line of Rebol code is by far enough). See if it makes any difference to the size of compressed texts.
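As a hint of one possible interpretation (sketched in Python; the exercise itself wants a one-line Rebol change), reversing the input before the transformation turns "what precedes each character" into "what follows it":

```python
def bwt(s):
    # Naive Burrows-Wheeler transform with a sentinel character.
    s = s + "\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

forward = bwt("banana")          # output grouped by preceding context
backward = bwt("banana"[::-1])   # output grouped by following context
```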
Exercise 4: The starting table for the Move-to-Front algorithm is still rather poor, because right after the most frequently used letters of the alphabet, we continue directly with the character with value 0, then the character with value 1, etc. Think of better ways to do this (an experimental approach might be good), and test what performance gain you get.
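To experiment along these lines, here is a Python sketch of Move-to-Front with a pluggable starting table (the "tuned" ordering below is just an assumption for illustration, hand-picked for the sample text):

```python
def mtf_encode(data, initial_table):
    table = list(initial_table)
    out = []
    for ch in data:
        index = table.index(ch)
        out.append(index)
        # Move the character to the front of the table.
        table.insert(0, table.pop(index))
    return out

text = "banana"
all_chars = [chr(c) for c in range(256)]
naive = mtf_encode(text, all_chars)
tuned = mtf_encode(text, list("anb") + sorted(set(all_chars) - set("anb")))
print(sum(naive), sum(tuned))  # smaller indices mean cheaper Huffman output
```

A starting table sorted by expected character frequency keeps the early MTF indices small, which is exactly what the following Huffman stage rewards.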
Exercise 5: Construct your own "probability table" for the Huffman tree generation, as described in the 4th compression article. Measure the performance changes.
Exercise 6: Instead of first reading in the Probabilities.r file and then constructing a Huffman tree, we might be able to store the Huffman tree in some representation in a file and then (almost) directly read this into memory, saving the initialization time spent on constructing the Huffman tree. Suggest ways to do that.
Exercise 7: As noted above, our compression algorithm has at least two disadvantages: it is slow, and it requires all of the input to be available when the compression starts. A suggestion for improving both points is to divide the input into blocks of a certain size and then perform our compression algorithm on each block separately. It might be a good idea to let each block continue with the Move-to-Front table that the preceding block left behind. This approach would probably cut down a little on the time spent, since many small blocks should be faster to process than one big block (this might not be intuitively clear, I know), and at the same time we only need the next block to be available before we can continue compressing. Try to implement this, and measure the gains/losses in both speed and compression ratio.
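A sketch of the block-splitting driver in Python, using zlib as a stand-in for the article's own compressor (the per-block compressor is exactly the part you would swap out; the block size is an assumption worth experimenting with):

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size; try different values

def compress_blocks(data, block_size=BLOCK_SIZE):
    # Compress each fixed-size block independently, so compression
    # can start as soon as the first block is available.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [zlib.compress(block) for block in blocks]

def decompress_blocks(compressed_blocks):
    return b"".join(zlib.decompress(block) for block in compressed_blocks)

original = b"to be or not to be " * 1000
assert decompress_blocks(compress_blocks(original)) == original
```

Note that compressing blocks independently usually costs some compression ratio, since repetitions spanning block boundaries are no longer visible - which is one reason carrying the Move-to-Front table across blocks, as the exercise suggests, is attractive.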
So, that's it. Disappointed? Please don't be. I never promised you a practical substitute for Rebol's COMPRESS command - and, since Rebol's COMPRESS command does its job beautifully, I have no reason to promise you that. However, I hope that you have enjoyed learning about the various techniques that can be applied in a compression program, and that most of what I have written has been understandable.
If you have any comments (flames, roses, bug fixes or improvement suggestions for the code, etc.), please mail me at email@example.com. Thank you for staying with me so long!