A Closer Look at Rsync
It’s not often that academia analyzes the Unix tools we use every day. But rsync is one fortunate exception, as Andrew Tridgell not only wrote rsync while pursuing his PhD, but also published a short and accessible paper outlining its inner workings. While I’d highly encourage reading the entire paper, I took away one major tl;dr from the rsync algorithm: be lazy. This lesson will be applicable anytime I write performance-conscious code.
Algorithm Overview
Before diving into the benefits of being lazy, we must understand rsync’s goals and implementation. rsync is an algorithm for updating a file on machine B to be identical to a file on machine A. Tridgell optimizes for the case when the connection between machine A and machine B is high-latency and when the two files are similar.

To copy file a from machine A to file b on machine B, the rsync algorithm conducts the following high-level steps.
1. Machine B splits file b into non-overlapping blocks of size S bytes.
2. For each block, B calculates a weak checksum and a strong checksum.
3. Machine B sends the checksums to machine A.
4. Machine A searches through all possible blocks of size S in file a to find blocks whose weak and strong checksums equal those of one of the blocks on B.
5. Machine A uses the knowledge of equivalent blocks to send machine B a sequence of instructions for constructing a copy of a. It only sends B data which B does not have, so if a block in a is a match for a block in b, machine A will not transmit that block to machine B.
If you’re interested in greater detail, the paper further explains the algorithm, details the underlying math, and provides an empirical analysis.
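To make steps 1 through 3 concrete, here is a minimal Python sketch of the signature generation machine B might perform. The function names are my own invention, and I’ve substituted MD5 for the MD4 strong checksum the paper uses; the weak checksum follows the paper’s definition s(k, l) = a(k, l) + 2^16 · b(k, l), with both sums taken modulo 2^16.

```python
import hashlib

M = 1 << 16  # the paper takes both sums modulo 2^16

def weak_parts(block):
    """The two halves (a, b) of the paper's weak checksum."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def weak_checksum(block):
    """s(k, l) = a(k, l) + 2^16 * b(k, l)."""
    a, b = weak_parts(block)
    return a + (b << 16)

def block_signatures(data, block_size):
    """Steps 1-3: split file b into non-overlapping blocks of block_size
    bytes and pair each block's weak checksum with its strong checksum."""
    return [
        (weak_checksum(data[i:i + block_size]),
         hashlib.md5(data[i:i + block_size]).digest())  # MD5 standing in for MD4
        for i in range(0, len(data), block_size)
    ]

# With the example from the next section: the blocks of b"2345678901"
# at S = 2 are b"23", b"45", b"67", b"89", and b"01".
signatures = block_signatures(b"2345678901", block_size=2)
```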
Be Lazy
In order for the rsync algorithm to operate efficiently, it must quickly compare all possible blocks of size S in file a to the set of non-overlapping blocks of size S in file b. To visualize, imagine file a contains 1234567890, file b contains 2345678901, and S is two characters.
From step 1 of the algorithm, the blocks from file b are 23, 45, 67, 89, and 01. To accurately compare file a to file b, we not only need to compare 12 on file a to 23 on file b, but also 23, 34, 45… on file a to 23 on file b. However, we don’t want to do an expensive calculation for each possible block on a. Rather, we want to perform a calculation which allows us to take our checksum for block (i, j) and, with a small amount of work, find the checksum of block (i+1, j+1).
Tridgell accomplishes this through the weak checksum calculated in steps 2 and 4 of the algorithm. As he discusses in the paper, given weak-checksum(i, j) and the byte values at positions i and j+1, we need to do little additional work to find weak-checksum(i+1, j+1). Because of this efficiency, the algorithm can very quickly find the weak checksum for all possible blocks in a.
However, the weak checksum is not perfect: in certain scenarios it will give false positives, saying two unequal blocks are equal. To be absolutely certain two blocks are equal, we need to calculate and compare the strong checksum. But the strong checksum is more expensive to calculate, and the value of strong-checksum(i, j) gives us no insight into the value of strong-checksum(i+1, j+1).
Tridgell cleverly works around this by being lazy: rsync does the hard work of calculating and comparing the strong checksums only if the weak checksums are equal. By first applying a cheap approximation that can yield false positives, the algorithm carefully avoids doing expensive operations until they are absolutely necessary. This technique is applicable to almost any situation involving expensive computation.
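Putting the pieces together (again reusing the sketches above, with hypothetical names of my own), the search on machine A might look like the following; the expensive strong checksum runs only inside the branch guarded by the cheap weak one.

```python
import hashlib
from collections import defaultdict

def find_matches(data, signatures, block_size):
    """Step 4, done lazily: index b's blocks by weak checksum, then slide a
    window across file a, computing MD5 only when a weak checksum matches."""
    by_weak = defaultdict(list)
    for index, (weak, strong) in enumerate(signatures):
        by_weak[weak].append((index, strong))

    matches = []
    a, b = weak_parts(data[:block_size])
    for offset in range(len(data) - block_size + 1):
        for index, strong in by_weak.get(a + (b << 16), ()):
            # The expensive comparison runs only on a weak-checksum hit,
            # weeding out the weak checksum's occasional false positives.
            if hashlib.md5(data[offset:offset + block_size]).digest() == strong:
                matches.append((offset, index))
                break
        if offset + block_size < len(data):
            a, b = roll(a, b, data[offset], data[offset + block_size], block_size)
    return matches

# Finds b's blocks 23, 45, 67, and 89 inside file a; 01 has no match.
print(find_matches(b"1234567890", signatures, block_size=2))
```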