An efficient compression algorithm for short text strings

Question

I'm searching for an algorithm to compress small text strings (50-1000 bytes), i.e. URLs. Which algorithm works best for this?

User · Accepted Answer

Check out Smaz: "Smaz is a simple compression library suitable for compressing very short strings."

User · Answer

I don't have code to hand, but I always liked the approach of building a 2D lookup table of size 256 * 256 chars (RFC 1978, PPP Predictor Compression Protocol). To compress a string you loop over each char and use the lookup table to get the "predicted" next char, using the current and previous char as indexes into the table. If there is a match you write a single 1 bit; otherwise you write a 0 bit followed by the char, and update the lookup table with the current char. This approach basically maintains a dynamic (and crude) lookup table of the most probable next character in the data stream.

You can start with a zeroed lookup table, but obviously it works best on very short strings if it is initialised with the most likely character for each character pair, for example for the English language. So long as the initial lookup table is the same for compression and decompression, you don't need to emit it into the compressed data.

This algorithm doesn't give a brilliant compression ratio, but it is incredibly frugal with memory and CPU resources and can also work on a continuous stream of data: the decompressor maintains its own copy of the lookup table as it decompresses, so the lookup table adjusts to the type of data being compressed.
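A minimal Python sketch of the scheme as described in this answer. It is not the RFC 1978 wire format, which hashes the context and groups the flag bits differently. The function names, the 2-byte length prefix (so the decoder knows where the padding starts) and the optional `table` argument for a primed table are all assumptions made for illustration.

```python
TABLE_SIZE = 256 * 256  # 2D table flattened: entry [a][b] lives at a * 256 + b


def predictor_compress(data: bytes, table=None) -> bytes:
    # Work on a copy so the caller's (possibly primed) table is not modified.
    table = bytearray(table) if table is not None else bytearray(TABLE_SIZE)
    bits = []
    a = b = 0  # the two previous bytes; zero before the start of the input
    for c in data:
        idx = a * 256 + b
        if table[idx] == c:
            bits.append(1)  # prediction hit: one bit
        else:
            bits.append(0)  # miss: 0 bit, then the literal byte (MSB first)
            bits.extend((c >> i) & 1 for i in range(7, -1, -1))
            table[idx] = c  # learn the new prediction
        a, b = b, c
    bits.extend([0] * (-len(bits) % 8))  # pad to a whole number of bytes
    packed = bytes(
        int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)
    )
    # Assumption: inputs are shorter than 65536 bytes, so 2 bytes hold the length.
    return len(data).to_bytes(2, "big") + packed


def predictor_decompress(blob: bytes, table=None) -> bytes:
    table = bytearray(table) if table is not None else bytearray(TABLE_SIZE)
    n = int.from_bytes(blob[:2], "big")
    bits = [(byte >> i) & 1 for byte in blob[2:] for i in range(7, -1, -1)]
    pos = 0
    out = bytearray()
    a = b = 0
    for _ in range(n):
        idx = a * 256 + b
        if bits[pos] == 1:
            c = table[idx]  # hit: take the predicted byte
            pos += 1
        else:
            c = 0
            for bit in bits[pos + 1:pos + 9]:  # miss: read the literal byte
                c = (c << 1) | bit
            pos += 9
            table[idx] = c  # mirror the compressor's table update
        out.append(c)
        a, b = b, c
    return bytes(out)
```

Passing the same pre-filled `table` to both functions corresponds to the "initialised with the most likely character" variant; with no table, both sides start from zeros.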

User · Answer

You might want to take a look at the Standard Compression Scheme for Unicode. SQL Server 2008 R2 uses it internally and can achieve up to 50% compression.

User · Answer

Huffman has a static cost, the Huffman table, so I disagree it's a good choice.

There are adaptive versions which do away with this, but the compression rate may suffer. Actually, the question you should ask is "what algorithm to compress text strings with these characteristics". For instance, if long repetitions are expected, simple Run-Length Encoding might be enough. If you can guarantee that only English words, spaces, punctuation and the occasional digits will be present, then Huffman with a pre-defined Huffman table might yield good results.

Generally, algorithms of the Lempel-Ziv family have very good compression and performance, and libraries for them abound. I'd go with that.

With the information that what's being compressed are URLs, I'd suggest that, before compressing (with whatever algorithm is easily available), you CODIFY them. URLs follow well-defined patterns, and some parts of them are highly predictable. By making use of this knowledge, you can codify the URLs into something smaller to begin with, and ideas behind Huffman encoding can help you here.

For example, translating the URL into a bit stream, you could replace "http" with the bit 1, and anything else with the bit 0 followed by the actual protocol (or use a table to get other common protocols, like https, ftp, file). The "://" can be dropped altogether, as long as you can mark the end of the protocol. Etc. Go read about the URL format, and think about how URLs can be codified to take less space.
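A rough Python sketch of the codifying step for the protocol part only. For simplicity it spends a whole byte on the protocol code rather than single bits, which is a departure from the bit-stream idea above. The `SCHEMES` table and both function names are hypothetical.

```python
# Hypothetical table of common protocols; code = index + 1, code 0 = "not listed".
SCHEMES = ["http", "https", "ftp", "file"]


def codify_url(url: str) -> bytes:
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("expected a URL of the form scheme://rest")
    if scheme in SCHEMES:
        # Known protocol: one byte replaces both the protocol name and "://".
        head = bytes([SCHEMES.index(scheme) + 1])
    else:
        # Unlisted protocol: code 0, the literal name, then ":" as end marker.
        head = b"\x00" + scheme.encode("ascii") + b":"
    return head + rest.encode("utf-8")


def decodify_url(blob: bytes) -> str:
    code = blob[0]
    if code:
        scheme, rest = SCHEMES[code - 1], blob[1:]
    else:
        raw_scheme, _, rest = blob[1:].partition(b":")
        scheme = raw_scheme.decode("ascii")
    return scheme + "://" + rest.decode("utf-8")
```

With this table, `codify_url("http://example.com/")` is 6 bytes shorter than the input, and the result can still be fed to a general-purpose compressor afterwards. The same idea extends to other predictable parts such as "www." or common top-level domains.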

User · Answer

If you are talking about actually compressing the text, not just shortening it, then Deflate/gzip (gzip being a wrapper around Deflate) and zip work well for smaller files and text. Other algorithms, like bzip2, are highly efficient for larger files.

Wikipedia has a list of compression times (look for the comparison of efficiency):

Name         Text           Binaries        Raw images
-----------  -------------  --------------  -------------
7-zip        19% in 18.8s   27% in  59.6s   50% in 36.4s
bzip2        20% in  4.7s   37% in  32.8s   51% in 20.0s
rar (2.01)   23% in 30.0s   36% in 275.4s   58% in 52.7s
advzip       24% in 21.1s   37% in  70.6s   57% in 41.6s
gzip         25% in  4.2s   39% in  23.1s   60% in  5.4s
zip          25% in  4.3s   39% in  23.3s   60% in  5.7s
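For strings as short as those in the question, the container around the Deflate stream matters. A small Python sketch, using the standard `zlib` module, that compares the three framings it can produce for the same input; the helper name `deflate_sizes` is made up for this example.

```python
import zlib


def deflate_sizes(data: bytes, level: int = 9) -> dict:
    """Return the compressed size of `data` under each Deflate framing."""
    sizes = {}
    # wbits selects the framing: -15 = raw Deflate, 15 = zlib, 31 = gzip.
    for name, wbits in (("raw deflate", -15), ("zlib", 15), ("gzip", 31)):
        comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
        sizes[name] = len(comp.compress(data) + comp.flush())
    return sizes


if __name__ == "__main__":
    print(deflate_sizes(b"http://stackoverflow.com/questions/tagged/compression"))
```

The zlib framing adds 6 bytes (2-byte header plus Adler-32 checksum) and the minimal gzip framing adds 18 bytes (10-byte header plus CRC-32 and length) on top of the raw stream, which is a noticeable share of a 50-byte URL.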

User · Answer

Any algorithm/library that supports a preset dictionary, e.g. zlib.

This way you can prime the compressor with the same kind of text that is likely to appear in the input. If the files are similar in some way (e.g. all URLs, all C programs, all StackOverflow posts, all ASCII-art drawings) then certain substrings will appear in most or all of the input files.

Every compression algorithm will save space if the same substring is repeated multiple times in one input file (e.g. "the" in English text or "int" in C code).

But in the case of URLs certain strings (e.g. "http://www.", ".com", ".html", ".aspx") will typically appear once in each input file. So you need to share them between files somehow rather than having one compressed occurrence per file. Placing them in a preset dictionary will achieve this.
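A short sketch of the preset-dictionary approach using Python's `zlib` module and its `zdict` parameter. The dictionary contents below are only a guess at what might be common in URLs; in practice you would build it from a sample of your own data. The function names are hypothetical, and raw Deflate (`wbits=-15`) is chosen here to avoid header and checksum bytes.

```python
import zlib

# Hypothetical dictionary of substrings assumed to be common in the URLs.
# zlib finds matches near the END of the dictionary most cheaply, so the
# most frequent strings go last.
URL_DICT = b".aspx.html.org.net.com/index/questions/https://http://www."


def compress_url(url: str) -> bytes:
    comp = zlib.compressobj(level=9, wbits=-15, zdict=URL_DICT)
    return comp.compress(url.encode("utf-8")) + comp.flush()


def decompress_url(blob: bytes) -> str:
    # The decompressor must be given exactly the same dictionary.
    decomp = zlib.decompressobj(wbits=-15, zdict=URL_DICT)
    return (decomp.decompress(blob) + decomp.flush()).decode("utf-8")


if __name__ == "__main__":
    url = "http://www.example.com/index.html"
    packed = compress_url(url)
    print(len(url), "->", len(packed))
    assert decompress_url(packed) == url
```

The dictionary is never stored in the output, so both sides have to ship with the same bytes; changing it later makes previously compressed strings unreadable.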

User · Answer

Huffman coding generally works okay for this.
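For short strings the code table cannot be shipped with every message, so one option is a fixed table that both sides derive from the same sample text. A minimal Python sketch under that assumption; `SAMPLE`, the function names and the trailing sentinel bit are all illustrative choices, not part of any library.

```python
import heapq
from collections import Counter

# Hypothetical training text; both sides must use exactly the same bytes.
SAMPLE = b"http://www.example.com/index.html https://stackoverflow.com/questions/"


def build_codes(sample: bytes) -> dict:
    freq = Counter(range(256))  # count 1 for every byte so all stay encodable
    freq.update(sample)
    # Heap entries: (weight, unique tiebreak, {symbol: code built so far}).
    heap = [(w, sym, {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    tiebreak = 256
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def huffman_encode(data: bytes, codes: dict) -> bytes:
    bits = "".join(codes[b] for b in data) + "1"  # sentinel marks the end
    bits += "0" * (-len(bits) % 8)                # pad to whole bytes
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def huffman_decode(blob: bytes, codes: dict) -> bytes:
    bits = bin(int.from_bytes(blob, "big"))[2:].zfill(len(blob) * 8)
    bits = bits[:bits.rindex("1")]                # drop sentinel and padding
    decode = {code: sym for sym, code in codes.items()}
    out, cur = bytearray(), ""
    for bit in bits:
        cur += bit
        if cur in decode:                         # codes are prefix-free
            out.append(decode[cur])
            cur = ""
    return bytes(out)
```

Bytes that are frequent in the sample get shorter codes than bytes that never appear in it, so how well this works depends entirely on how closely the sample resembles the real input.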
