Which Checksum Tool on Linux is Faster?
It is common practice to calculate checksums for files to verify their integrity. For large files, the checksum computation is slow. I wondered why it is so slow and whether another tool would do better. In this post, I try three common tools, md5sum, sha1sum and crc32, to compute checksums on a relatively large file, to see which checksum tool on Linux is faster and to help us choose among them.
The file to be checksummed is a 15GB text file:
$ ls -lha wiki.txt
-rw-r--r-- 1 zma zma 15G Jun 14 10:28 wiki.txt
The performance
Now, let’s see how the three tools perform when computing the checksum of the file.
sha1sum speed
$ time sha1sum wiki.txt
251dcb5c08c6a2fabd258f2c8a9b95e15c0cc098 wiki.txt
real 1m21.143s
user 0m21.647s
sys 0m4.668s
crc32 speed
$ time crc32 wiki.txt
0080f7a1
real 1m21.051s
user 0m16.194s
sys 0m4.890s
md5sum speed
$ time md5sum wiki.txt
e2e649030c795ffa9f33a99bcb39dde7 wiki.txt
real 1m27.392s
user 0m25.563s
sys 0m3.936s
Summary
From the results, crc32 is the fastest, but it is just a tiny bit faster than sha1sum and md5sum. md5sum is the slowest, but only a little slower.
Why is there not much difference? To compute the checksums, the tools need to read the file and do the computation. Now, let’s check how much time is needed just to read the file content.
$ time dd if=wiki.txt of=/dev/null bs=8192
1953039+1 records in
1953039+1 records out
15999296457 bytes (16 GB) copied, 80.4203 s, 199 MB/s
real 1m20.447s
user 0m0.202s
sys 0m7.091s
The I/O read speed is around 200MB/s. That’s not bad for a single magnetic disk.
So, almost all of the time is spent reading the file content. The algorithms and the tools themselves are not yet the limitation. The disk I/O speed is.
The conclusion: use whichever tool works best for you (you may need to be aware of collisions for these algorithms; check Simard’s comment) without worrying much about the speed on a relatively modern computer, although the checksum still takes time. If you want higher speed, improve your I/O speed first, until the CPU becomes the bottleneck (CPU usage reaches 100%).
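One way to read the `time` output above is to compare the CPU time (user + sys) against the wall-clock time (real). The following is a minimal sketch of that arithmetic, not part of the original measurements. The helper names `to_seconds` and `cpu_share` are made up for illustration, and it assumes a POSIX shell with `awk` and timings shorter than one hour.

```shell
# Convert a value printed by bash's `time`, such as 1m21.143s, to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the share of wall-clock time spent on the CPU, in percent.
# Usage: cpu_share REAL USER SYS
cpu_share() {
    awk -v r="$(to_seconds "$1")" \
        -v u="$(to_seconds "$2")" \
        -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.1f\n", (u + s) / r * 100 }'
}

# Timings copied from the runs above.
echo "sha1sum: $(cpu_share 1m21.143s 0m21.647s 0m4.668s)%"
echo "crc32:   $(cpu_share 1m21.051s 0m16.194s 0m4.890s)%"
echo "md5sum:  $(cpu_share 1m27.392s 0m25.563s 0m3.936s)%"
```

With these numbers, each tool keeps the CPU busy for roughly a quarter to a third of the run, which fits the picture of the tools mostly waiting for the disk.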
What if I/O was not the bottleneck
Pádraig comments that we can avoid the I/O and measure the computational cost. I slightly changed the suggested command to checksum a file under /dev/shm/, as crc32 does not accept input from STDIN. The system is the same one on which I did the previous tests. It could only support a 3GB file at the time I did this test. The results are as follows.
[zma@host:/dev/shm]$ head -c 3G /dev/zero >test
[zma@host:/dev/shm]$ for chk in crc32 md5sum sha1sum ; do echo $chk; time $chk test; done
crc32
480bbe37
real 0m3.411s
user 0m2.931s
sys 0m0.482s
md5sum
c698c87fb53058d493492b61f4c74189 test
real 0m5.103s
user 0m4.697s
sys 0m0.409s
sha1sum
6e7f6dca8def40df0b21f58e11c1a41c3e000285 test
real 0m4.451s
user 0m4.082s
sys 0m0.372s
To summarize the speed, taking md5sum’s speed as the baseline:
md5sum: 1.00x
crc32: 1.50x
sha1sum: 1.15x
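These ratios come from dividing the md5sum wall-clock time by each tool's wall-clock time. A small sketch of that arithmetic, assuming `awk` is available; the function name `speedup` is only illustrative:

```shell
# Print how many times faster a run is than the baseline,
# given both wall-clock times in seconds.
# Usage: speedup BASELINE_SECONDS TOOL_SECONDS
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.2fx\n", base / t }'
}

# "real" times from the /dev/shm run above; md5sum is the baseline.
echo "md5sum:  $(speedup 5.103 5.103)"
echo "crc32:   $(speedup 5.103 3.411)"
echo "sha1sum: $(speedup 5.103 4.451)"
```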
crc32 is the fastest here. It is a Perl 5 program using Archive::Zip::computeCRC32() to compute the crc32.
The throughput here for md5sum is above 600MB/s. That is a rate an SSD or a RAID of SSDs can reach. On the system I tested, if the I/O were much faster, the computation would likely account for much of the time spent.
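The throughput figure follows from the file size and the elapsed time. Here is a minimal sketch of the calculation, assuming `awk`; `throughput_mib` is a made-up helper name, and it reports MiB/s (2^20 bytes per MiB), since `head -c 3G` writes 3 × 1024^3 bytes.

```shell
# Print throughput in MiB/s, rounded to the nearest integer.
# Usage: throughput_mib BYTES SECONDS
throughput_mib() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / (1024 * 1024) }'
}

size=$((3 * 1024 * 1024 * 1024))   # size of the 3G test file in bytes

# "real" times from the /dev/shm run above.
echo "crc32:   $(throughput_mib "$size" 3.411) MiB/s"
echo "md5sum:  $(throughput_mib "$size" 5.103) MiB/s"
echo "sha1sum: $(throughput_mib "$size" 4.451) MiB/s"
```

The same helper can be pointed at the byte count and time that `dd` reports to compare the disk's read rate with each tool's compute rate.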
CPU model and versions of checksum tools used
Here are the CPU model and versions of the checksum tools used during the test.
$ lscpu | grep "Model name"
Model name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
$ md5sum --version
md5sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper, Scott Miller, and David Madore.
$ sha1sum --version
sha1sum (GNU coreutils) 8.23
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper, Scott Miller, and David Madore.
$ rpm -qf `which crc32`
perl-Archive-Zip-1.46-1.fc22.noarch
On your system the bottleneck is the disk, though increasingly this is moving up with the advance of SSDs. In that case the computational overhead becomes significant. You can quantify that on your system by avoiding the disk with something like:
for chk in crc32 md5sum sha1sum; do time head -c 1G /dev/zero | $chk; done
Note that sha1sum and md5sum use system-specific instructions for significant speedups on systems configured --with-openssl (as is the default on Arch, Fedora, CentOS 7 and Gentoo at least).
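One way to check whether a given build is dynamically linked against OpenSSL's libcrypto is to inspect the binary with `ldd`. This is only a rough sketch: the function name `links_libcrypto` is made up here, and a "no" does not prove that no optimized code path exists, since a build may get acceleration in other ways.

```shell
# Report whether a tool is dynamically linked against libcrypto.
# Prints "yes", "no", or "unknown" (tool or ldd not found).
links_libcrypto() {
    bin=$(command -v "$1") || { echo "unknown"; return 0; }
    command -v ldd >/dev/null 2>&1 || { echo "unknown"; return 0; }
    if ldd "$bin" 2>/dev/null | grep -q 'libcrypto'; then
        echo "yes"
    else
        echo "no"
    fi
}

for tool in md5sum sha1sum; do
    echo "$tool: $(links_libcrypto "$tool")"
done
```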
Hi Pádraig,
That’s a good point.
I did some tests on the same system (Fedora 22 x86-64) by doing checksums on a file under /dev/shm/. You can find the results at http://www.systutorials.com/136737/which-checksum-tool-on-linux-is-faster/#what-if-i.2Fo-was-not-the-bottleneck . crc32 turns out to be the fastest one.
Hi Eric,
I hope that you know that collisions exist in crc32, md5sum and even sha-0 checksums. But not yet for sha-1 which you actually used. Since I found these collision problems, I only use sha1sum and better (sha224, sha256, sha384 or sha512) for my verifications when I can.
http://preshing.com/20110504/hash-collision-probabilities/
http://www.mathstat.dal.ca/~selinger/md5collision/
https://en.wikipedia.org/wiki/SHA-0
Nice and informative website by the way.
Hi Simard,
Thanks!
Although this post is mainly about speed, taking collisions into consideration is a good point. I will add a note in the post mentioning your comment.
‘sum -s filename’ is significantly faster than all of these.
uni@box:~$ ls -lh kali-linux-1.0.3-i386.iso
-rwxrwxrwx 1 uni uni 2.3G Jun 22 2013 kali-linux-1.0.3-i386.iso
uni@box:~$ time crc32 kali-linux-1.0.3-i386.iso
bd3a7323
real 0m12.701s
user 0m5.263s
sys 0m1.033s
uni@box:~$ time sum kali-linux-1.0.3-i386.iso
11559 2387392
real 0m4.270s
user 0m3.986s
sys 0m0.280s
uni@box:~$ time sum -s kali-linux-1.0.3-i386.iso
47724 4774784 kali-linux-1.0.3-i386.iso
real 0m1.241s
user 0m0.972s
sys 0m0.268s
uni@box:~$ sum --version|head -1
sum (GNU coreutils) 8.21
uni@box:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1674.878
BogoMIPS: 6600.22
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Nice to know these numbers, desromic!
`sum -s` seems much faster than `crc32`. Note that `sum -s` computes the simple System V checksum rather than a CRC; `cksum` is the coreutils tool that computes a CRC.
One note: the CRC algorithms, and simple sums like this one, have the same problem of being “useless as secure indicator of intentional manipulation of the data”, as discussed in
Simard’s comment http://www.systutorials.com/136737/which-checksum-tool-on-linux-is-faster/#comment-76996 and also in the discussion at http://www.derkeiler.com/Newsgroups/sci.crypt/2003-07/1451.html .
For many years I have found md5sum to consistently be faster than sha1sum, so I was very surprised when I read this article.
I just tried it again on a file of size 295G and got this:
md5sum
real 10m20.952s
the same file for
sha1sum
real 15m15.332s
This is consistent with what I seem to always see.
Thanks for sharing the numbers. My inference is that it depends on the machine used. The CPU, memory and disks taken together matter (assuming good enough optimizations are already applied in the implementation). It is okay to assume the disk I/O (e.g. SSD) is fast enough to keep the CPU and memory busy. Then the main factors are CPU and memory. The md5sum digest size is smaller and likely puts less pressure on the memory and memory bus systems. However, modern CPU architectures and software implementations seem to have better optimizations for sha1sum (SHA) computation.
I no longer have the machine on which I did the original test. But would you like to run the tests from the post on the machine you used, to see how it performs, and also share the machine details with us? Thanks.