Good Library for data compression / decompression? LZO? LZW? LZMA? Zip? Others?

niteris

Member
I didn't see anything in the forums related to data compression, so I wanted to start a thread here. I'm trying to find a simple compression algorithm so I can store small movies (570-pixel frames) on the internal flash memory without having to move them to the SD card. Zipping up the video files shows that they can reach a 5:1 compression ratio, so I'm hoping that one of these simple libraries will work well.

LZO:

FastLZ:

LZF:
 
I can't get liblzf to compile (yet).

I have fastlz installed as a library and added an example which does the same test as in minilzo: compress 128 kB of zeros and then decompress it. The example times how long it takes to do 10000 compress and 10000 decompress steps. I did the same test for minilzo.

For minilzo on a Teensy 4.0
repeat 10000 times
compressed 131072 bytes into 593 bytes
time = 4642
decompressed 593 bytes back into 131072 bytes
time = 9448


fastlz has two levels of compression, one faster than the other.
On a Teensy 4.0
LEVEL 1
nc = 1512, time = 10413ms
no = 131072, time = 6813ms

LEVEL 2
nc = 525, time = 10260ms
no = 131072, time = 6590ms

If my tests are valid (i.e. the compiler hasn't optimized something away), minilzo is the clear winner as far as speed is concerned, but if you really need to compress the data as much as possible, fastlz at level 2 is probably best.
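
For anyone who wants to reproduce this kind of test, the outline below shows roughly what the FastLZ timing loop can look like. This is only a sketch, not the attached example; the buffer sizes and variable names are mine, and it assumes the library exposes the stock FastLZ API (fastlz_compress_level() / fastlz_decompress()) through a fastlz.h header.

Code:
#include <fastlz.h>  // assumes the library ships the stock FastLZ header

const int IN_LEN = 131072;                    // 128 kB test block (all zeros)
static uint8_t inbuf[IN_LEN];
static uint8_t outbuf[IN_LEN + IN_LEN / 16];  // FastLZ wants >= 5% output headroom
static uint8_t check[IN_LEN];

void setup() {
  Serial.begin(115200);
  while (!Serial) {}

  // Compress the same block 10000 times and report the total time.
  uint32_t t0 = millis();
  int nc = 0;
  for (int i = 0; i < 10000; i++) {
    nc = fastlz_compress_level(2, inbuf, IN_LEN, outbuf);  // level 1 or 2
  }
  Serial.printf("nc = %d, time = %lums\n", nc, millis() - t0);

  // Decompress 10000 times and report the total time.
  t0 = millis();
  int no = 0;
  for (int i = 0; i < 10000; i++) {
    no = fastlz_decompress(outbuf, nc, check, IN_LEN);
  }
  Serial.printf("no = %d, time = %lums\n", no, millis() - t0);
}

void loop() {}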

Pete
View attachment FastLZ-master.zip
 
All of the good compression algorithms are adaptive, so compressing an all-zero file is not very informative. The best compression routines for linear files and still images look for progressively larger patterns in the data, and then encode these patterns in short, file-specific codes. The best compression routines for movies (and for other data that can be described as strings of frames) know that they are dealing with movies, so they record the occasional frame (compressed like a still image) and then record successive frames by recording only incremental differences.
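
As an aside, the "record only the incremental differences" idea is easy to sketch: XOR each frame against the previous one so that unchanged pixels become zeros, which a run-length or LZ-style coder then squeezes very well. This is illustrative only; the function and buffer names are made up.

Code:
// XOR the current frame against the previous one. Unchanged pixels become
// zero, so the result compresses far better than the raw frame does.
void deltaFrame(const uint8_t *prev, const uint8_t *cur, uint8_t *out, size_t n) {
  for (size_t i = 0; i < n; i++) {
    out[i] = prev[i] ^ cur[i];
  }
}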

If you want to do comparative tests of different routines, run the routines on some of your real data.
 
How would you use these to compress a CSV file? Would you compress it line by line - read a line in, compress it, and write it to a new file?

Thoughts?

My idea is that I will be acquiring a lot of raw CAN bus data while the engine is running, and when the vehicle is off I want to compress the data to save SD card space and/or send it wirelessly.

Any thoughts?

Thanks

Bruce
 
Did you mean CSV?
The three compression methods in msg #1 all require that you provide the whole file. You could do it line by line (i.e. each line is a "file") but it would not provide very good compression.
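
For what it's worth, a block-at-a-time approach is a middle ground: read the CSV in fixed-size chunks from the SD card, compress each chunk, and write it out with a small length prefix so it can be decompressed later. Bigger blocks give the compressor more repetition to work with, so they do noticeably better than single lines. The following is only a sketch under my own assumptions (the block size, the 2-byte framing, and the fastlz_compress_level() call from the stock FastLZ API):

Code:
#include <SD.h>
#include <fastlz.h>

// Compress an SD file in fixed-size blocks so the whole file never has to
// fit in RAM.
const size_t BLOCK = 4096;
static uint8_t inbuf[BLOCK];
static uint8_t outbuf[BLOCK + BLOCK / 16 + 66];  // FastLZ wants >= 5% headroom

void compressFileInBlocks(const char *src, const char *dst) {
  File in = SD.open(src, FILE_READ);
  File out = SD.open(dst, FILE_WRITE);  // FILE_WRITE appends - start with a fresh output file
  while (in.available()) {
    int n = in.read(inbuf, BLOCK);
    // Note: I believe FastLZ wants at least 16 bytes of input, so a very
    // short final block would need padding or special handling.
    int c = fastlz_compress_level(2, inbuf, n, outbuf);
    out.write((uint8_t)(c >> 8));    // 2-byte length prefix so a
    out.write((uint8_t)(c & 0xFF));  // decompressor can find block boundaries
    out.write(outbuf, c);
  }
  in.close();
  out.close();
}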

Pete
 
The problem with that is that the file is too big to fit into the Teensy 4.1 RAM, so how can you provide the whole file when it is on the SD card?

Thanks
 
The LZW method runs through the to-be-compressed data only once, generating the output as it goes. An implementation that appears to be file --> file is coded with these primitives:
  • initialize
  • get next byte
  • output next 12-bit code
  • finalize
That's it (not that each of those is altogether trivial). That is, there's no need to have the whole file in memory. See https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
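
To make the file --> file structure concrete, here is a rough outline of a streaming LZW compressor. It is my own sketch, not taken from any particular library: the helper names, the 12-bit packing, and the use of SD File objects are all illustrative.

Code:
#include <SD.h>
#include <map>
#include <string>

// "output next 12-bit code": pack codes MSB-first through a small bit buffer.
static uint32_t bitbuf = 0;
static int bitcount = 0;

void putCode(File &out, uint16_t code) {
  bitbuf = (bitbuf << 12) | (code & 0xFFF);
  bitcount += 12;
  while (bitcount >= 8) {
    out.write((uint8_t)(bitbuf >> (bitcount - 8)));
    bitcount -= 8;
  }
}

// "finalize": flush any remaining bits, padded out to a whole byte.
void flushCodes(File &out) {
  if (bitcount > 0) out.write((uint8_t)(bitbuf << (8 - bitcount)));
  bitbuf = 0;
  bitcount = 0;
}

// Streaming LZW compression: one byte in at a time, codes out as they are
// found, so the whole file never has to be in RAM. The table is capped at
// 4096 entries, which is what bounds the memory use.
void lzwCompressFile(File &in, File &out) {
  std::map<std::string, uint16_t> dict;  // "initialize": codes 0-255 are single bytes
  for (int i = 0; i < 256; i++) dict[std::string(1, (char)i)] = i;
  uint16_t nextCode = 256;

  std::string w;
  while (in.available()) {               // "get next byte"
    char c = (char)in.read();
    std::string wc = w + c;
    if (dict.count(wc)) {
      w = wc;                            // keep extending the current match
    } else {
      putCode(out, dict[w]);             // emit the code for the longest match
      if (nextCode < 4096) dict[wc] = nextCode++;  // learn the new string
      w = std::string(1, c);
    }
  }
  if (!w.empty()) putCode(out, dict[w]);
  flushCodes(out);
}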
 
It depends what you mean by "done line by line." Most of the output of an LZW compression is a string of 12-bit codes that capture the input, but there are also one or more 4096-entry tables that tell what the codes are. The early codes in the table just encode single ASCII characters, but later codes are used for discovered ASCII strings and, most important, discovered strings of earlier codes. That compression-of-compression is where its power comes from. The earlier you force LZW to wrap things up and generate the code table, the more opportunities for compression are lost. So you can feed data into LZW line by line, but it's a bad idea to force it to give you output until you need the output.

One way to think about this is that compression algorithms try to provide shortest-possible encodings of data blocks. If the data are truly random, then the data are their own shortest-possible encoding (that's the definition of random). The more non-randomness that can be found (for example, non-random digrams & trigrams in natural-language text), the shorter the description can be. It's fine to know what the non-randomness is in advance (for example, the shortest description of a million-byte all-zero file is "this is a million zeroes"), but LZW does not prejudge what it will see, so it will look at English text and (without knowing it's English) encode digrams as it finds them (["th"], ["e "]), then bigger sequences ([["th"]["e "]], [" can be"]), and so on.

These examples show why the only informative tests of compression algorithms use real data. If a compression routine were written with the idea that it might be especially likely to see long sequences of the same character repeated, then it might take special steps to handle that situation efficiently, and if that is what your data are like, then that routine might be best for you. In an earlier post, I mentioned the interesting case of movies, where a frame can often be best described by just listing the (few) ways in which it differs from the previous one.
 
Yeah, I was going to experiment with various buffer sizes to see what the trade-off is.

Another question: when I compress the CSV files and store them on an SD card, can those files be opened by a Windows application, or will I need to re-encode them with the Arduino again?

Also how much of the file needs to be brought in prior to being able to decompress it?

Thanks

Bruce
 
. . . when I compress the CSV files and store them on an SD card, can those files be opened by a Windows application, or will I need to re-encode them with the Arduino again?

That's implementation-dependent. Two standard LZW routines should produce the same output, but one or both implementors might have diverged, thinking that the output was never going to be decoded by any decoder but the one he/she provided. For example, it's natural to put the code table at the end of the output, but an implementor might choose to shuffle the file around, so that the code table was at the beginning. That would speed up repeated decompression, if that were anticipated.

Also how much of the file needs to be brought in prior to being able to decompress it?
The decompressor needs the code table, and it may well be at the end of the file. A space-conscious decompressor might read through the whole file to get the code table, then go back and start working through the codes. Again, it's implementation-dependent.
 