A Closer Look at Rsync

2017-12-29

It’s not often that academia analyzes the unix tools we use everyday. But rsync is one fortunate exception as Andrew Tridgell not only wrote rsync while pursuing his PhD, but also published a short and accessible paper outlining its inner workings. While I’d highly encourage reading the entire paper, I took away one major tl:dr; from the rsync algorithm: be lazy. This lesson will be applicable anytime I write performance conscious code.

Algorithm Overview

Before diving into the benefits of being lazy, we must understand rsync’s goals and implementation. rsync is an algorithm for updating a file on machine A to be identical to a file on machine B. Tridgell optimizes for when the connection between machine A and machine B is high-latency and when the two files are similar.

To copy file a from machine A to file b on machine B, the rsync algorithm conducts the following high-level steps.

Machine B splits file b into non-overlapping blocks of size S bytes.
For each block, B calculates a weak checksum and a strong checksum.
Machine B sends the checksums to machine A.
Machine A searches through all possible blocks of size S in file a to find blocks which have weak and strong checksums equal to one of the blocks on B.
Machine A uses the knowledge of equivalent blocks to send machine B a sequence of instructions for constructing a copy of A. It only sends B data which it does not have, so if a block in a is a match for a block on b, machine A will not transmit the block to machine B.

If you’re interested in greater detail, the paper further explains the algorithm, as well as detailing the underlying math and providing empirical analysis.

Be Lazy

via GIPHY

In order for the rsync algorithm to operate efficiently, it must quickly compare all possible blocks of size S in file a to the set of non-overlapping blocks of size S in file b. To visualize, imagine file a contains 1234567890, file b contains 2345678901 and S is two characters. From step 1 of the algorithm, the blocks from file b are 23, 45, 67, 89, and 01. To accurately compare file a to file b, we not only need to compare 12 on file a to 23 on file b, but also 23, 34, 45… on file a to 23 on file b. However, we don’t want to do an expensive calculation for each possible block on a. Rather, we want to a perform a calculation which allows us to take our checksum for block i,j and with a small amount of work, find the checksum of block i+1,j+1.

Tridgell accomplishes this through the weak checksum calculated in steps 2 and 4 of the algorithm. As he discusses in the paper, given weak-checksum(i,j) and values i,j+1, we need to do little additional work to find weak-checksum(i+1,j+1). Because of this efficiency, the algorithm can very quickly find the weak checksum for all possible blocks in a.

However, the weak checksum is not perfect. In certain scenarios, it will give false positives and say two unequal blocks are equal. To be absolutely certain two blocks are equal, we need to calculate and compare the strong checksum. However, the strong checksum is more expensive to calculate, and the value of strong-checksum(i,j) gives us no insight into the value of strong-checksum(i+1,j+1).

Tridgell cleverly works around this by being lazy, as rsync does the hard work of calculating and comparing the strong checksums only if the weak checksums are equal. The algorithm carefully avoids doing expensive operations until they are absolutely necessary by utilizing a cheap approximation, with the potential for false positives. This technique is applicable to almost any situation involving expensive computation.