In computer science, data compression is the process of encoding information using fewer bits (or other information-bearing units) than the original representation, by means of specific encoding schemes. For example, this article could be encoded with fewer bits if we adopt the convention that the word "compression" is encoded as "CP!". Compression only works when both the sender and receiver of the message have agreed on the encoding scheme.
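The shared-convention idea above can be sketched as a tiny substitution coder. This is an illustrative toy, not a real compression scheme; the codebook and names are hypothetical:

```python
# Both sender and receiver must share this codebook (the agreed convention).
CODEBOOK = {"compression": "CP!"}
DECODEBOOK = {code: word for word, code in CODEBOOK.items()}

def encode(text):
    # Substitute each long word with its agreed short code.
    for word, code in CODEBOOK.items():
        text = text.replace(word, code)
    return text

def decode(text):
    # Reverse the substitution to recover the original message.
    for code, word in DECODEBOOK.items():
        text = text.replace(code, word)
    return text

msg = "data compression saves space"
assert decode(encode(msg)) == msg      # lossless round trip
assert len(encode(msg)) < len(msg)     # fewer characters transmitted
```

Without the codebook, the receiver cannot interpret "CP!", which is why both parties must agree on the scheme in advance.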
A popular encoding scheme used on almost all modern operating systems is the ZIP file format. It can be used to reduce the size of an attachment to an e-mail, facilitating its transmission. However, both the sender and receiver must be aware of this format, and must use the appropriate encoding/decoding program.
Compression is possible because most real-world data is highly redundant, or is represented less concisely in its human-interpretable form than it could be. Compression is important because it helps reduce the consumption of expensive resources, such as disk space or connection bandwidth. However, compression requires information processing power, which can also be expensive. Therefore, many data compression schemes have been designed for various purposes. Some schemes are reversible, so that the original data can be reconstructed exactly (lossless compression), while others accept some loss of data in order to achieve higher compression (lossy compression).
One very simple means of compression, for example, is run-length encoding, wherein long runs of consecutive identical data values are replaced by a short code giving the data value and the length of the run. This is an example of lossless data compression. It is often used to make better use of disk space on office computers, or of connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).
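Run-length encoding as described above can be sketched in a few lines. This is a minimal illustration, not a production implementation:

```python
def rle_encode(data):
    """Replace each run of identical values with a (value, run_length) pair."""
    encoded = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        encoded.append((data[i], j - i))
        i = j
    return encoded

def rle_decode(encoded):
    """Expand each (value, run_length) pair back into a run."""
    return [value for value, count in encoded for _ in range(count)]

data = list("AAAABBBCCD")
packed = rle_encode(data)
assert packed == [("A", 4), ("B", 3), ("C", 2), ("D", 1)]
assert rle_decode(packed) == data   # lossless: original recovered exactly
```

Note that RLE only pays off on data with long runs; on data with no repetition, the (value, length) pairs make the output larger than the input.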
In other kinds of data, such as sounds and pictures, a small loss of quality can be tolerated without losing the essential nature of the data, so lossy data compression methods can be used. These frequently offer a range of compression levels, letting the user choose between highly compressed data with noticeable loss of quality and higher-quality data with less compression. In particular, compression of images and sounds can take advantage of limitations of the human sensory system to compress data in ways that are lossy, but nearly indistinguishable from the original by the human eye or ear. This is used on CD-ROM and DVD, as well as in digital cameras.
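One simple form of lossy compression is quantization: storing each sample at a coarser precision than the original. The sketch below, an illustrative example rather than any standard codec, requantizes 8-bit samples down to 4 bits, so each value needs half as many bits but can only be restored approximately:

```python
def quantize(samples, bits):
    """Map 8-bit samples (0-255) onto a coarser grid of 2**bits levels."""
    step = 1 << (8 - bits)
    return [s // step for s in samples]

def dequantize(levels, bits):
    """Reconstruct approximate samples at the midpoint of each quantization bin."""
    step = 1 << (8 - bits)
    return [level * step + step // 2 for level in levels]

samples = [0, 37, 90, 200, 255]
levels = quantize(samples, 4)          # 16 levels instead of 256
restored = dequantize(levels, 4)
# Lossy: the restored values approximate, but do not equal, the originals.
assert restored != samples
assert all(abs(a - b) <= 8 for a, b in zip(samples, restored))
```

The error per sample is bounded by half the quantization step, which is the sense in which a "small loss of quality" is traded for fewer bits.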
Compression of sounds is generally called audio compression, where methods of psychoacoustics are used to remove inaudible components of the signal and make compression more efficient. Audio compression of this kind is therefore lossy compression. Different audio compression standards are listed under audio codecs. This is used in internet telephony, for example.
The theoretical background of compression is provided by information theory and algorithmic information theory. Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference and particularly with the maximum likelihood principle.
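A central result from information theory is that the Shannon entropy of a source gives a lower bound, in bits per symbol, on the average code length any lossless compressor can achieve under that source model. A small sketch:

```python
from collections import Counter
from math import log2

def entropy(text):
    """Shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A maximally redundant string: one symbol, zero bits needed per symbol.
assert entropy("aaaaaaaa") == 0.0

# Four equally likely symbols: log2(4) = 2 bits per symbol are needed.
assert abs(entropy("abcd") - 2.0) < 1e-9
```

This connects directly to the statistical-inference view mentioned above: the better a compressor's probability model matches the data, the closer its output approaches the entropy bound.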
Many data compression systems are best viewed as a multi-stage pipeline, typically combining a modeling stage with an entropy-coding stage.
The Lempel-Ziv (LZ) compression methods are the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, though compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) was patented by Unisys until June 2003, and is used in GIF images. This patent is the main reason for GIF's increasing obsolescence. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods use a table-based compression model in which table entries are substituted for redundant data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). Among current LZ-based schemes, the best-performing is LZX, which is no longer actively developed, although RAR and ACE are now coming close. LZX was purchased by Microsoft, slightly reduced in potency, and used in the CAB format.
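The table-based model described above, where the dictionary is built dynamically from earlier input, can be illustrated with a minimal LZW sketch (integer codes only, no bit packing):

```python
def lzw_compress(data):
    """Emit an integer code for the longest prefix already in the table,
    then add that prefix plus the next character as a new table entry."""
    table = {chr(i): i for i in range(256)}   # start with all single bytes
    next_code = 256
    current = ""
    out = []
    for ch in data:
        if current + ch in table:
            current += ch                     # extend the known prefix
        else:
            out.append(table[current])
            table[current + ch] = next_code   # learn a new phrase
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

def lzw_decompress(codes):
    """Rebuild the same table from the code stream; no table is transmitted."""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        # The special case handles a code referring to the entry being built.
        entry = table[code] if code in table else prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)

text = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
assert lzw_decompress(codes) == text   # lossless round trip
assert len(codes) < len(text)          # repeated phrases shrink the output
```

Because both sides grow the table deterministically from the data itself, the dictionary never needs to be transmitted, which is what makes this family of methods practical.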
Data compression topics
- algorithmic complexity theory
- information entropy
- image compression
- multimedia compression
- minimum description length
- universal codes
lossless data compression
- Burrows-Wheeler transform
- dictionary coders
- entropy encoding
- prediction by partial matching
- run-length encoding
lossy data compression
- discrete cosine transform
- fractal compression
- JPEG 2000 (image compression using wavelets, then quantization, then Huffman coding)