Project lab
The lab part of the course is a small project, where you are supposed to implement a couple of compression methods and test these on various types of data. You should also do entropy estimations for the different data.
The work should be done in groups of 1-2 students. Examination is by written report.
Entropy estimation
Make random models for the different test files and estimate entropies for them. You should at least estimate H(Xi), H(Xi|Xi-1) and H(Xi|Xi-1,Xi-2) for all sources.
Source coding
Implement two of the following compression methods and test them by coding all the test files. At least one of the implemented methods should utilize the memory of the sources. You should of course write both coders and decoders for the chosen methods.
- Huffman coding (static or adaptive code tree)
- Arithmetic coding (static or adaptive probability model)
- Lempel-Ziv coding (LZ77, LZSS, LZ78, LZW or another variant)
- Burrows-Wheeler block transform (requires that you also implement Huffman coding or arithmetic coding as the actual source coder)
Compare your results with the estimated entropies.
You are free to choose any programming language for your implementations. Matlab will work nicely for entropy estimation and for simple source coding, but might be too slow for some coding methods, especially those that involve searching (LZ) or sorting (BWT).
Test data
There are many files to use as test data for your coders. Treat the files as sequences of 8-bit bytes, ie don’t make any assumptions of the actual contents of the files.
The Canterbury corpus
A standard test set of 11 files of different types and sizes from The Canterbury corpus.
File | Category | Size in bytes |
---|---|---|
alice29.txt | English text | 152089 |
asyoulik.txt | Shakespeare | 125179 |
cp.html | HTML source | 24603 |
fields.c | C source | 11150 |
grammar.lsp | LISP source | 3721 |
kennedy.xls | Excel Spreadsheet | 1029744 |
lcet10.txt | Technical writing | 426754 |
plrabn12.txt | Poetry | 481861 |
ptt5 | CCITT fax test set | 513216 |
sum | SPARC Executable | 38240 |
xargs.1 | GNU manual page | 4227 |
You can download the files here.
Large corpus
A standard test set of 3 relatively large files of different types and sizes from The Canterbury corpus.
File | Category | Size in bytes |
---|---|---|
E.coli | Complete genome of the E. Coli bacterium | 4638690 |
bible.txt | The King James version of the bible | 4047392 |
world192.txt | The CIA world fact book 1992 | 2473400 |
You can download the files here.
Report
The project is examined by a written report, where you describe how you have solved the problem and what results you achieve. Discuss how file type and file size influences the compression ratio. Include all program code files.
Give the name, person number and email address of all group members.
You can either leave a paper copy of the report directly to Harald, or submit it in electronic form, in PDF format. Send electronic reports by email to Harald.
Lab results will be reported to LADOK at approximately the same time as the exam results. If you are late in submitting your report, you will have to wait until June or August for your points.