76

What compression tools are available in Ubuntu that can benefit from a multi-core CPU?

Luis Alvarado
  • 216,643

9 Answers

87

Well, the keyword was parallel. After searching for compression tools that also run in parallel, I found the following:

PXZ - Parallel XZ is a compression utility that takes advantage of running LZMA compression of different parts of an input file on multiple cores and processors simultaneously. Its primary goal is to utilize all resources to speed up compression time with minimal possible influence on compression ratio.

sudo apt-get install pxz

PLZIP - Lzip is a lossless data compressor based on the LZMA algorithm, with very safe integrity checking and a user interface similar to the one of gzip or bzip2. Lzip decompresses almost as fast as gzip and compresses better than bzip2, which makes it well suited for software distribution and data archiving.

Plzip is a massively parallel (multi-threaded) version of lzip using the lzip file format; the files produced by plzip are fully compatible with lzip.

Plzip is intended for faster compression/decompression of big files on multiprocessor machines, which makes it specially well suited for distribution of big software files and large scale data archiving. On files big enough, plzip can use hundreds of processors.

sudo apt-get install plzip

PIGZ - pigz, which stands for Parallel Implementation of GZip, is a fully functional replacement for gzip that takes advantage of multiple processors and multiple cores when compressing data.

sudo apt-get install pigz

PBZIP2 - pbzip2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 (ie: anything compressed with pbzip2 can be decompressed with bzip2).

sudo apt-get install pbzip2

LRZIP - A multithreaded compression program that can achieve very high compression ratios and speed when used with large files. It uses the combined compression algorithms of zpaq and lzma for maximum compression, lzo for maximum speed, and the long range redundancy reduction of rzip. It is designed to scale with increases in RAM size, improving compression further. A choice of either size or speed optimizations allows for either better compression than even lzma can provide, or better speed than gzip, but with bzip2 sized compression levels.

sudo apt-get install lrzip
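As a rough illustration of how these tools slot into a tar workflow, here is a minimal sketch. The `pick_compressor` helper, the scratch directory and the sample file are hypothetical names made up for this example. It assumes the tool reads stdin and writes stdout when given `-c`, which holds for pigz, pbzip2, plzip and gzip; lrzip works differently and is left out.

```shell
#!/bin/sh
# pick_compressor (hypothetical helper): print the first parallel
# compressor found on PATH, falling back to single-threaded gzip.
pick_compressor() {
    for tool in pigz pbzip2 plzip; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo gzip
}

# Demo input: a scratch directory holding one text file.
workdir=$(mktemp -d)
mkdir "$workdir/source"
seq 1 20000 > "$workdir/source/numbers.txt"

compressor=$(pick_compressor)

# tar writes the archive to stdout; the compressor reads it from stdin
# and writes the compressed stream to stdout (-c).
tar -cf - -C "$workdir" source | "$compressor" -c > "$workdir/source.tar.compressed"

echo "compressed with $compressor:"
ls -l "$workdir/source.tar.compressed"
```

To unpack, the same tool can be run with `-dc` and piped back into `tar -xf -`.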

A small Compression Benchmark (Using the test Oli created):

ORIGINAL FILE SIZE - 100 MB
PBZIP2 - 101 MB (1% Bigger)
PXZ - 101 MB (1% Bigger)
PLZIP - 102 MB (1% Bigger)
LRZIP - 101 MB (1% Bigger)
PIGZ - 101 MB (1% Bigger)

A small Compression Benchmark (Using a Text file):

ORIGINAL FILE SIZE - 70 KB Text File
PBZIP2 - 16.1 KB (23%)
PXZ - 15.4 KB (22%)
PLZIP - 15.5 KB (22.1%)
LRZIP - 15.3 KB (21.8%)
PIGZ - 17.4 KB (24.8%)

Luis Alvarado
  • 216,643
36

There are two main tools: lbzip2 and pbzip2. They're essentially different implementations of bzip2 compressors. I've compared them (the output is a tidied-up version, but you should be able to run the commands):

cd /dev/shm  # we do all of this in RAM!
dd if=/dev/urandom of=bigfile bs=1024 count=102400

$ lbzip2 -zk bigfile 
Time: 0m3.596s
Size: 105335428 

$ pbzip2 -zk bigfile
Time: 0m5.738s
Size: 10532460

lbzip2 appears to be the winner on random data. It's slightly less compressed but much quicker. YMMV.
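If you want to repeat the comparison on your own machine, something along these lines might do. It is only a sketch: the `SIZE_KIB` variable, the temporary directory and the addition of plain bzip2 as a baseline are my assumptions, and `date +%s` only resolves whole seconds, so raise `SIZE_KIB` to 102400 to match the 100 MiB test above.

```shell
#!/bin/sh
# Size of the random test file in KiB (assumed default: 10 MiB).
SIZE_KIB=${SIZE_KIB:-10240}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
dd if=/dev/urandom of=bigfile bs=1024 count="$SIZE_KIB" 2>/dev/null

for tool in lbzip2 pbzip2 bzip2; do
    # Skip tools that are not installed instead of failing.
    command -v "$tool" >/dev/null 2>&1 || continue

    start=$(date +%s)
    # -z compress, -c write to stdout so the input file is left untouched.
    "$tool" -zc bigfile > "bigfile.$tool"
    end=$(date +%s)

    echo "$tool: $((end - start))s, $(wc -c < "bigfile.$tool") bytes"
done
```

For finer timing, wrap the compression command in `time` instead.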

Oli
  • 299,380
25

Update:

XZ Utils has supported multi-threaded compression since v5.2.0; it was originally mistakenly documented as multi-threaded decompression.

For example: tar -cf - source | xz --threads=0 > destination.tar.xz
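A slightly fuller round trip, as a sketch only: the scratch directory and file names are invented for the example, and it assumes xz 5.2.0 or newer. Note that xz splits its input into blocks to use several threads, so a file as small as this one will still be compressed by a single thread.

```shell
#!/bin/sh
# Scratch area with a small sample directory to archive.
workdir=$(mktemp -d)
mkdir "$workdir/source" "$workdir/restore"
seq 1 50000 > "$workdir/source/numbers.txt"

# Compress: tar writes to stdout, xz uses as many threads as there are cores.
tar -cf - -C "$workdir" source | xz --threads=0 > "$workdir/destination.tar.xz"

# Check the integrity of the compressed file, then extract it elsewhere.
xz --test "$workdir/destination.tar.xz"
xz -dc "$workdir/destination.tar.xz" | tar -xf - -C "$workdir/restore"

# Compare the restored file with the original.
if cmp -s "$workdir/source/numbers.txt" "$workdir/restore/source/numbers.txt"; then
    echo "round trip OK"
fi
```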

Exil
  • 545
11

In addition to the nice summary above (thanks Luis), these days folks might also want to consider PIXZ, which according to its README (Source: https://github.com/vasi/pixz -- I haven't verified the claims myself) has some advantages over PXZ.

[Compared to PIXZ, PXZ has these advantages and disadvantages:]

    * Simpler code
    * Uses OpenMP instead of pthreads
    * Uses streams instead of blocks, not indexable
    * Uses temp files and doesn't combine them until the whole file is compressed, high disk/memory usage

In other words, PIXZ is supposedly more memory and disk efficient, and has an optional indexing feature that speeds up decompression of individual components of compressed tar files.

nturner
  • 329
  • 3
  • 9
8

Zstandard has supported multi-threading since v1.2.0. It is a very fast compressor and decompressor intended to replace gzip, and at its highest levels it can also compress as efficiently as (if not better than) LZMA2/XZ.

You have to use one of these releases, or compile the latest version from source to get these benefits. Luckily it doesn't pull in a lot of dependencies.

There was also a 3rd party pzstd in v1.1.0 of zstd.
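A quick way to try it, sketched under the assumption that a zstd build with multi-threading (v1.2.0 or newer) is on your PATH; the sample file and scratch directory are made up for the example. `-T0` asks zstd to detect the number of cores itself.

```shell
#!/bin/sh
# Scratch file to compress.
workdir=$(mktemp -d)
seq 1 100000 > "$workdir/sample.txt"

if command -v zstd >/dev/null 2>&1; then
    # -T0: use as many worker threads as detected cores
    # -q: quiet, -f: overwrite existing output, -o: output file name
    zstd -T0 -q -f "$workdir/sample.txt" -o "$workdir/sample.txt.zst"

    # Decompress to stdout and compare against the original.
    if zstd -dc "$workdir/sample.txt.zst" | cmp -s - "$workdir/sample.txt"; then
        echo "zstd round trip OK"
    fi
else
    echo "zstd is not installed; skipping"
fi
```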

Pablo Bianchi
  • 17,371
LiveWireBT
  • 29,597
5

lzop may also be a viable option, although it's single-threaded.

It uses the very fast Lempel-Ziv-Oberhumer (LZO) compression algorithm, which in my observation is 5-6 times faster than gzip.

Note: Although it's not multi-threaded yet, it will probably outperform pigz on 1-4 core systems. That's why I decided to post this even if it doesn't directly answer your question. Try it; it may solve your CPU bottleneck problem while using only one CPU and compressing a little worse. I have often found it to be a better solution than, e.g., pigz.

ce4
  • 181
4

This is not really an answer, but I think it is relevant enough to share my benchmarks comparing the speed of gzip and pigz on real hardware in a real-life scenario. pigz is the multithreaded evolution of gzip, and it is what I have personally chosen to use from now on.

Metadata:

  • Hardware used: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (4c/8t) + Nvme SSD
  • GNU/Linux distribution: Xubuntu 17.10 (artful)
  • gzip version: 1.6
  • pigz version: 2.4
  • The file being compressed is a 9.25 GiB SQL dump

gzip quick

time gzip -1kN ./db_dump.sql

real    1m22,271s
user    1m17,738s
sys     0m3,330s

gzip best

time gzip -9kN ./db_dump.sql 

real    10m6,709s
user    10m2,710s
sys     0m3,828s

pigz quick

time pigz -1kMN ./db_dump.sql 

real    0m26,610s
user    1m55,389s
sys     0m6,175s

pigz best (no zopfli)

time pigz -9kMN ./db_dump.sql 

real    1m54,383s
user    14m30,435s
sys     0m5,562s

pigz + zopfli algorithm

time pigz -11kMN ./db_dump.sql 

real    171m33,501s
user    1321m36,144s
sys     0m29,780s

Bottom line: I would not recommend the zopfli algorithm, since the compression took a tremendous amount of time for a not-that-significant amount of disk space saved.

Resulting file sizes:

  • bests: 1309M
  • quicks: 1680M
  • zopfli: 1180M
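
To put the numbers above side by side, the ratios can be worked out with a few lines of shell. This is just arithmetic on the "real" times and file sizes reported in this answer; the variable names and the `ratio` helper are mine.

```shell
#!/bin/sh
# Wall-clock ("real") times from the runs above, converted to seconds.
gzip_quick=82.271      # 1m22,271s
gzip_best=606.709      # 10m6,709s
pigz_quick=26.610      # 0m26,610s
pigz_best=114.383      # 1m54,383s
pigz_zopfli=10293.501  # 171m33,501s

# ratio A B: print A / B with one decimal place.
# LC_ALL=C keeps the decimal point a "." regardless of locale.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'
}

echo "pigz -1 vs gzip -1 speedup:   $(ratio "$gzip_quick" "$pigz_quick")x"
echo "pigz -9 vs gzip -9 speedup:   $(ratio "$gzip_best" "$pigz_best")x"
echo "zopfli vs pigz -9 slowdown:   $(ratio "$pigz_zopfli" "$pigz_best")x"

# Space saved by zopfli relative to the -9 output (1309M -> 1180M).
LC_ALL=C awk 'BEGIN { printf "zopfli size saving vs -9: %.1f%%\n", (1309 - 1180) / 1309 * 100 }'
```

In other words, pigz was about 3x faster than gzip at level 1 and about 5x faster at level 9, while zopfli took roughly 90 times as long as `pigz -9` to shave just under 10% off the file.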
helvete
  • 141
3

The LZMA2 compressor of p7zip uses both cores on my system.

David Foerster
  • 36,890
  • 56
  • 97
  • 151
0

Relevant Arch Wiki entry: https://wiki.archlinux.org/index.php/Makepkg#Utilizing_multiple_cores_on_compression

# lzma compression
xz --threads=0

# drop-in parallel gzip replacement
# -p/--processes flag can be used to employ fewer cores
pigz

# drop-in parallel bzip2 replacement
# -p# flag can be used to employ fewer cores
# (note: no space between the -p and number of cores)
pbzip2

# modern zstd compression
# is used to build Arch packages by default
# since sometime in 2020
zstd --threads=0

murla
  • 1
  • 2