Reducing Network Traffic in Subversion 1.8
This post discusses Apache Subversion features which are available in the development codebase at the time of writing but which have not yet been published in an official release and may change prior to such a release.
I’ve made neither apologies about nor attempts to hide the fact that I adore Subversion’s sparse checkouts functionality. The moment that feature became available, I reorganized my local Subversion-versioned projects from a scattered mess of thirty or so disparate trunk and branch working copies into a single working copy per project, rooted at the project’s root directory and sparsely populated from there.
Now, I’ll be honest. The immediate benefit (for me, at least) was mostly a soft one – an intangible improvement not so much to what Subversion was doing for me, but to what I was doing in between Subversion operations. My workspace was better organized. I spent less time trying to remember which working copies held what. I was able to get a better big-picture view of what was happening in the portions of the project tree that I cared about simply by running a single update that covered all those areas. These aren’t things you can measure with a benchmarking tool or a regression test suite – they are measured by the increased percentage of exposed surface area of the floor of my otherwise cluttered brain.
The Subversion developers debuted sparse checkouts in Subversion 1.5, and have improved the feature in subsequent releases. For example, with Subversion 1.6 came the ability to exclude – that is, to “de-telescope” – existing working copy members. You no longer had to scrap and rebuild working copies which contained subtrees that no longer interested you. Improvements were also made to Subversion’s merge and merge tracking functionality (the latter of which was also introduced in Subversion 1.5) so that they behaved as expected where sparsely populated working copies were involved.
Subversion 1.7 brought another useful improvement to Subversion in general. The redesign of the working copy administration area – both in terms of its storage mechanism and its programming interface – was undertaken partly to help inspire new feature development. WC-NG (as the new working copy concept is called) was a major rewrite of the way working copy metadata is organized and accessed, and as was hoped for, a couple of its characteristics caught my attention. First, WC-NG moves all of the cached pristine text-base copies of your versioned files into a single location, indexing them by their SHA-1 checksums. For the sake of optimizing disk usage, where there are duplicate copies of the same content, only a single copy is kept. I’ve probably done more development of Subversion’s repository and repository access functionality than its client-side behaviors, so there was a certain symmetry here that was obvious to me: Subversion repositories also track file content using an index keyed on SHA-1 checksums, and also try to keep only a single copy of a given “representation” of file content. Secondly, WC-NG doesn’t preemptively purge pristine text-bases which are no longer strictly needed (because they aren’t associated with the current working copy state). Rather, these pristine versions accumulate over time and users can run svn cleanup to force Subversion to reconcile the stored pristine text-bases with the current working copy state and discard the unnecessary ones.
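To make those two characteristics concrete, here is a minimal Python sketch of a content-addressed store. It is a toy model, not WC-NG's actual code or on-disk format: the `PristineStore` class and its `add`, `get`, and `cleanup` methods are hypothetical names invented for illustration. It shows duplicate content collapsing to one entry keyed by SHA-1, and unreferenced entries lingering until an explicit cleanup pass.

```python
import hashlib


class PristineStore:
    """Toy content-addressed store: one blob per distinct SHA-1 digest.

    Hypothetical illustration only; not Subversion's real implementation.
    """

    def __init__(self):
        self._blobs = {}  # hex SHA-1 digest -> file content (bytes)

    def add(self, content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        # Identical content hashes to the same key, so it is stored once.
        self._blobs.setdefault(digest, content)
        return digest

    def get(self, digest: str):
        """Return the stored content, or None if the digest is unknown."""
        return self._blobs.get(digest)

    def cleanup(self, referenced) -> int:
        """Discard blobs whose digest is not in `referenced`.

        Returns the number of blobs removed.
        """
        keep = set(referenced)
        stale = [d for d in self._blobs if d not in keep]
        for d in stale:
            del self._blobs[d]
        return len(stale)

    def __len__(self):
        return len(self._blobs)


store = PristineStore()
a = store.add(b"int main(void) { return 0; }\n")
b = store.add(b"int main(void) { return 0; }\n")  # same content elsewhere in the tree
c = store.add(b"an older revision of some file\n")  # nothing refers to this any more

# Nothing is purged until cleanup is requested explicitly.
removed = store.cleanup({a})
```

Here `a` and `b` are the same digest, so the store holds two blobs before the cleanup and one after it.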
Sensing an opportunity, near the end of the Subversion 1.7 release cycle, I taught Subversion’s mod_dav_svn Apache server module to inform clients asking for updates (or checkouts, switches, merges, diffs, etc.) of the SHA-1 checksum of any file content it was recommending or providing to the client. (The server has transmitted MD5 checksums of this content since before Subversion 1.0, but we’ve been trying as a project to convert to SHA-1.) I wasn’t sure just yet what the client would do with that information, but it seemed valuable to me and to other Subversion developers who were sensing the same.
After Subversion 1.7 was released, I was finally able to fill in that blank. Subversion 1.8 is projected to ship with a single HTTP repository access module, libsvn_ra_serf. The Serf approach to HTTP communications differs from our previous Neon-based approach in that it favors using many small, pipelined requests instead of fewer, synchronous, monolithic ones. Checkouts and updates performed using libsvn_ra_serf ask the server for an update report containing a list of things the client needs to fetch from the server – a shopping list, if you will – in order to complete the update process. In Subversion 1.7, I’d taught the server to include in that shopping list the SHA-1 checksum (the UPC code, if we’re continuing the analogy) of each item required. Now, in the Subversion trunk (aka “1.8-dev”) codebase, I’ve taught libsvn_ra_serf to take that UPC code, ask WC-NG if there are any files in the “pristine pantry” which carry that code, and – if so – source the file directly from the local pristine store rather than asking the server for yet another copy of it. (After my initial implementation, I also taught libsvn_ra_serf to handle MD5 checksums the same way, effectively allowing this optimization to work against even the oldest of supported Subversion servers.)
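The pantry check described above might be sketched roughly as follows. This is a simplified Python model under stated assumptions, not libsvn_ra_serf's actual logic: `apply_update`, the `(path, digest)` report format, and the `fetch` callable standing in for an HTTP request are all hypothetical.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()


def apply_update(report, pristine, fetch):
    """Install each item on the shopping list, fetching only unknown content.

    report   -- iterable of (path, sha1 hex digest) pairs (hypothetical format)
    pristine -- dict mapping sha1 hex digest -> bytes, the "pristine pantry"
    fetch    -- callable(path) -> bytes, standing in for a network request
    Returns (working_files, fetched_paths).
    """
    working_files = {}
    fetched = []
    for path, digest in report:
        content = pristine.get(digest)
        if content is None:
            # Not in the pantry: go to the "server" and verify what came back.
            content = fetch(path)
            if sha1_hex(content) != digest:
                raise ValueError(f"checksum mismatch for {path}")
            pristine[digest] = content
            fetched.append(path)
        working_files[path] = content
    return working_files, fetched


# A stand-in "server" holding two files.
server = {
    "trunk/README": b"hello\n",
    "trunk/main.c": b"int main(void) { return 0; }\n",
}
report = [(path, sha1_hex(content)) for path, content in server.items()]

# The pantry already holds README's content from some earlier operation.
pristine = {sha1_hex(b"hello\n"): b"hello\n"}
files, fetched = apply_update(report, pristine, server.__getitem__)
```

Only `trunk/main.c` ends up in `fetched`; the README content is sourced locally because its checksum is already in the pantry.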
What does all this mean? Speed.
When using a Subversion 1.8 client to communicate via HTTP, circumstances may permit you to download fewer – perhaps even zero – files over the network. For example, if you temporarily backdate your working copy to an older revision, then update again to the revision you were previously at, Subversion shouldn’t have to fetch any file’s content over the wire at all. Unless you’ve run svn cleanup since backdating, your working copy administrative area still holds the pristine versions of files in that younger revision. Similarly, if someone commits a rename of a file which you have locally in your working copy, Subversion shouldn’t have to fetch the contents of the moved file over the wire again – a copy of those contents already resides in your local pristine text-base store.
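The backdating case can be simulated with a small Python model. Everything here is an assumption made for illustration: the `REVISIONS` table, the `WorkingCopy` class, and its `fetches` counter are hypothetical and only mimic the behavior described above.

```python
import hashlib

# Hypothetical repository history: revision number -> {path: content}.
REVISIONS = {
    1: {"a.txt": b"alpha v1\n", "b.txt": b"beta\n"},
    2: {"a.txt": b"alpha v2\n", "b.txt": b"beta\n"},
}


class WorkingCopy:
    def __init__(self):
        self.pristine = {}  # sha1 hex digest -> content
        self.files = {}     # path -> digest at the current revision
        self.fetches = 0    # simulated over-the-wire file transfers

    def update(self, revision):
        new_files = {}
        for path, content in REVISIONS[revision].items():
            digest = hashlib.sha1(content).hexdigest()  # as reported by the server
            if digest not in self.pristine:
                self.pristine[digest] = content  # simulated download
                self.fetches += 1
            new_files[path] = digest
        self.files = new_files

    def cleanup(self):
        # Keep only pristines referenced by the current working copy state.
        live = set(self.files.values())
        self.pristine = {d: c for d, c in self.pristine.items() if d in live}


wc = WorkingCopy()
wc.update(2)  # initial checkout
wc.update(1)  # backdate

before = wc.fetches
wc.update(2)  # return to the younger revision
round_trip_fetches = wc.fetches - before

wc.update(1)
wc.cleanup()  # discards the now-unreferenced "alpha v2" pristine
before = wc.fetches
wc.update(2)
after_cleanup_fetches = wc.fetches - before
```

In this model the round trip costs no fetches (`round_trip_fetches` is 0), while running the cleanup in between forces one file to be downloaded again (`after_cleanup_fetches` is 1).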
Now, I started this post yammering on about sparse checkouts. Why? What’s the connection? Because it’s under sparse checkouts that I think this small change to Subversion’s behavior really has a chance to shine. Since the pristine data store is per-working-copy, it stands to reason that the more sections of your repository tree which use it, the better the chances of finding something therein that would otherwise have to be fetched across the network. With a sparsely populated working copy, any time I need to “subscribe to”, say, a new branch in my project, Subversion’s new don’t-grab-file-content-I-already-have functionality has the entire existing pristine store at its disposal. Subversion should have to fetch only those files in that branch for which no exact copy exists on any other branch or tag in my working copy. This is very unlike a freshly checked-out working copy devoted solely to a single branch, where the pristine store is completely empty until after the checkout, forcing Subversion to download every single file in the checkout set across the network. What’s more, should I ever need to exclude a branch or some other specific subtree and then restore it later, that restoration process has access to the pristine contents of that subtree prior to its exclusion plus that of all the other items elsewhere in the working copy from which to pull data before resorting to wire transfer.
Take for example a relatively simple yet common situation. You’re working on your project’s trunk, and you decide that you need to create a branch and do some of your development with a bit more isolation, but you still want to track the trunk too. In the past, most folks would:
- Create the branch, probably using svn copy URL NEW-URL.
- Check out a new working copy of the branch, yawning as all those files were transferred over the wire. (Skilled Subversioneers would short-cut this by making a local OS-level copy of their working copy directory and then using svn switch to make that copy track their new branch.)
- Start working on the branch.
- Occasionally synchronize the branch with any changes that occurred on the trunk, with any file deltas again being transferred across the wire.
Using sparse checkouts and the pristine storage sourcing improvements, things run similarly but without so much network traffic:
- Create the branch, probably using svn copy URL NEW-URL.
- Update to include the new branch in the sparsely populated working copy. The content of the new branch exactly matches that of the trunk, so no file content is transferred over the wire at all!
- Start working on the branch.
- Occasionally synchronize the branch with any changes that occurred on the trunk, though now with file deltas being transferred across the wire only for files changed on the branch.
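The difference between the two workflows comes down to what is already in the pristine store when the branch arrives. A rough Python comparison, using made-up file contents and a hypothetical `files_to_fetch` helper (none of this is Subversion's actual code):

```python
import hashlib


def digest(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()


def files_to_fetch(subtree, pristine):
    """Return the paths in `subtree` whose content is absent from `pristine`.

    subtree  -- dict mapping path -> content on the server
    pristine -- set of sha1 hex digests already held locally
    """
    return [path for path, content in subtree.items()
            if digest(content) not in pristine]


trunk = {
    "trunk/Makefile": b"all:\n\tcc main.c\n",
    "trunk/main.c": b"int main(void) { return 0; }\n",
}
# A freshly copied branch: identical contents under new paths.
branch = {path.replace("trunk/", "branches/feature/", 1): content
          for path, content in trunk.items()}

# Old workflow: a brand-new working copy starts with an empty pristine store.
fresh_checkout = files_to_fetch(branch, pristine=set())

# New workflow: the sparse working copy already holds trunk's pristines.
sparse_pristine = {digest(content) for content in trunk.values()}
sparse_update = files_to_fetch(branch, pristine=sparse_pristine)

# Later, a colleague commits a change to one file on the branch.
branch["branches/feature/main.c"] = b"int main(void) { return 1; }\n"
after_edit = files_to_fetch(branch, pristine=sparse_pristine)
```

The fresh checkout has to fetch both files, the sparse update fetches none, and after the branch diverges only the changed `main.c` needs to cross the wire.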
Of course, Subversion still uses the same path-based approach to access privileges that it always has. It’s not the case that just because a user may be able to read a file’s contents on a given branch that he or she is now permitted to know that the same file exists in some other unreadable area of the repository. All I’ve done is help the client avoid re-fetching file content it already holds in the local pristine cache.
As the Apache Subversion community looks to begin winding down development activities on what will be Subversion 1.8, I’d encourage you to have a look around the project’s public-facing materials (such as our Roadmap page and the 1.8 Release Notes, both of which are works-in-progress) to see if there’s a feature or enhancement that interests you. Perhaps you could help us test out some of this new stuff and ensure that when it does see the light of day in an official release, it’s the best it can be.