Atomic Coding with Subversion

Better programming languages, more extensive libraries, and faster hardware are the most visible drivers of increased programmer productivity. People are obviously more productive in Ruby than in Fortran, with CPAN than with the C standard library, and on a Dual-Core Opteron with 1GB RAM than on punched-card big iron.

There are also many factors improving productivity that are less in-your-face. For example, in the 2000s any two programs on any two machines can communicate using TCP/IP, HTTP and XML. In the 1980s one would need to code application-specific data encodings and network protocols, and add explicit support for every possible network. People got so used to the Internet that they don't think much about it any more. Likewise, back in the old days programmers didn't know about refactoring, unit testing, test-driven development, YAGNI, or even Object-Oriented Programming.

Another productivity factor is revision control software, and with it atomic coding. There are a few people who still don't use revision control; some even used to be pretty well known for that. Back when everybody used CVS, I viewed revision control as a necessary evil - CVS was getting in the way more often than helping. It all changed with Subversion. It's not perfect for every possible development model, but for personal repositories and small teams it's simply awesome.

This post is about two things - Subversion basics for those who aren't using it much yet, and Atomic Coding. If you know SVN well, just jump to the Atomic Coding section, as it's a very important concept that will help your productivity stay ahead of Yannis's Law.

Subversion basics

Subversion supports multiple protocols, but the sanest one is HTTPS. With HTTPS, anonymous access requires no authentication, and you don't even need any SVN software - wget -r --no-parent or any browser is good enough. There are guides for quickly setting up an SVN repository with HTTPS for Debian, Ubuntu, Gentoo, FreeBSD, and maybe even Windows.

The first important SVN commands you need are svn help and svn help some_command. They're pretty useful when you forget the details of some command.

When you use Subversion you first need to check out with svn co URL. Then you can edit checked-out files, add new ones with svn add FILES (this works for directories too), move them around (svn mv FROM TO or svn mv FILES DESTDIR), and occasionally delete them (svn rm FILE). After you're done you can commit. svn ci -m "Message" will commit all changes in the current directory and its subdirectories. Very often you want to commit less than that, for example svn ci -m "Some quick change in README" README while leaving the rest of the code uncommitted.

You can look at changes since the last committed version with svn diff, or against some older revision with svn diff -r REVISION. To look at changes from another perspective (a list of added/deleted/modified/not-in-repository files) use svn status. To get your copy of the repository in sync with the master repository run svn up. To get a log of changes affecting the current directory and its subdirectories run svn log.

A few more things. If you keep PDF files in the repository, you don't want to see binaries in the output of svn diff. SVN tries to determine what's binary and what's text (it matters only for diff viewing, it won't "convert line endings" in your binaries like some broken revision control systems), but PDFs often start with some text commands before going binary, and that confuses SVN. To tell SVN they're really binaries, run svn propset svn:mime-type application/octet-stream *.pdf. I never had such problems with other binary formats like pictures.

Every now and then you'll need to revert one file (svn revert FILE) or everything in the directory (svn revert -R .). If you added something but didn't commit it, revert the file instead of svn rming it.

If you want to script SVN, many commands accept the --xml option, and will output machine-readable markup instead of the default human-readable plain text. Another useful switch is -v to increase verbosity. I think it's much easier to run SVN from the command line and parse XML than to use language-specific SVN libraries.

In SVN, branching and tagging are done by copying (svn cp), and merging by svn merge, but most of the time I work on just the main branch and merge with plain Unix diff and patch.
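The "run SVN from the command line and parse XML" approach could look roughly like the sketch below. It uses REXML rather than magic/xml, and the sample log, the author names, and the parse_svn_log helper are all made up for illustration; in real use you would feed it the output of svn log --xml -v instead of the embedded string.

```ruby
require 'rexml/document'
require 'time'

# Made-up sample in the shape that `svn log --xml -v` prints.
# In real use, read the command's output instead of this constant.
SAMPLE_LOG = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <log>
  <logentry revision="3">
  <author>alice</author>
  <date>2006-10-01T12:40:00.000000Z</date>
  <paths>
  <path action="M">/trunk/README</path>
  <path action="A">/trunk/kitten.rb</path>
  </paths>
  <msg>Unit test for Kitten#meow</msg>
  </logentry>
  <logentry revision="2">
  <author>bob</author>
  <date>2006-10-01T12:05:00.000000Z</date>
  <paths>
  <path action="M">/trunk/README</path>
  </paths>
  <msg>Some quick change in README</msg>
  </logentry>
  </log>
XML

# Hypothetical helper: turns the XML log into an array of plain hashes.
def parse_svn_log(xml)
  doc = REXML::Document.new(xml)
  doc.root.get_elements('logentry').map do |entry|
    {
      revision: entry.attributes['revision'].to_i,
      author:   entry.elements['author']&.text, # may be absent for anonymous commits
      date:     Time.parse(entry.elements['date'].text),
      paths:    entry.get_elements('paths/path').map(&:text),
      message:  entry.elements['msg']&.text
    }
  end
end

parse_svn_log(SAMPLE_LOG).each do |e|
  puts "r#{e[:revision]} by #{e[:author]}: #{e[:paths].size} path(s) - #{e[:message]}"
end
```

Once the log is an array of hashes, the rest is ordinary Ruby, which is the main appeal of going through --xml rather than scraping the plain-text output.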

Atomic Coding

But I wasn't going to write just another SVN tutorial. I want to say something about Atomic Coding. The idea is basically committing (to the branch you're working on - usually the main branch) as soon as you have a complete change, no matter how small. You've written a single unit test? Commit. Made one unit test pass that didn't? Commit. Changed the README file? Commit. Fixed a typo in some comments? Commit.

Breaking your work into small pieces is one of the most fundamental ideas in productivity ever. Programming has Unit Testing, project management has Getting Things Done's Next Actions and so on - they're all about breaking big and complex things into small and simple ones. It's pretty straightforward to apply the same principle to repository management, but it's much easier to talk about something when it has a name, so let's refer to it as "Atomic Coding".

So why do Atomic Coding? First, because committing is so easy. Just say svn ci -m "Unit test for Kitten#meow" and you're done. Many people insist on meaningful commit messages, but if you commit very often you're going to end up with messages like "Some comments added" or "Minor code cleanup". Don't feel bad about them - repository management is there to help you, not to oppress you.

The most low-level benefit is that you can use svn commands to do something useful. When you do Atomic Coding, svn revert will revert to the last working state without losing any useful modifications, and svn diff will tell you what you are doing right now.

On the medium level, by keeping your code up to date, you will be able to get away from coding and get back to it much more easily. Interruptions happen many times a day - phone calls, instant messaging, people passing by, lunch time, and so on. Atomic Coding will save you a few minutes on every interruption, and that's going to add up to a huge productivity boost.

On the high level, Atomic Coding works really great with Unit Testing, Test-Driven Development, Getting Things Done and so on. They reinforce each other. Programming is really much more effective if you break it into small pieces instead of trying to do everything at once, and all your habits should support this instead of trying to get you away from it. If you want to see Atomic Coding in action, Bitscribe has a cool screencast about it.

I used vague words like "small" and "atomic", but let's get more specific. Ask yourself:
What's the typical (median) time between your commits?
If the answer is anything more than 1-2 hours, you're not doing Atomic Coding. It's often difficult to guess the right answer, so I wrote a short script that extracts it from an SVN repository (the script isn't that short, but that's mostly because of pretty-printing).

require 'enumerator'
require 'magic_xml'
require 'time'

class Numeric
    # Pretty-print a number of seconds, e.g. 2293 -> "38m13s"
    def time_pp
        s = self.to_i
        return "#{s}s" if s < 60

        m = (s / 60).to_i
        s -= m*60
        return "#{m}m#{s}s" if m < 60

        h = (m / 60).to_i
        m -= h*60
        return "#{h}h #{m}m#{s}s" if h < 24

        d = (h / 24).to_i
        h -= d*24
        return "#{d}d #{h}h #{m}m#{s}s"
    end
end

# Read the output of svn log --xml -v from standard input
log = XML.parse(STDIN)

# For every author: commit times and number of paths touched per commit
summaries_by_author = Hash.new{|ht,k| ht[k] = {:dates => [], :sizes => []}}

log.descendants(:logentry) {|e|
    summaries_by_author[e[:@author]][:dates] << Time.parse(e[:@date])
    summaries_by_author[e[:@author]][:sizes] << e.descendants(:path).size
}

summaries_by_author.to_a.sort.each{|author, summary|
    # svn log lists newest commits first, so a-b is the gap in seconds
    dates = summary[:dates].enum_for(:each_cons, 2).map{|a,b| a-b}.sort
    sizes = summary[:sizes].sort

    puts "Activity of #{author}:"
    if dates.empty?
        # A single commit has no gaps to measure
        puts "Time between commits distribution: not enough commits"
    else
        puts "Time between commits distribution:"
        puts "* 10% - #{dates[dates.size/10].time_pp}"
        puts "* 25% - #{dates[dates.size/4].time_pp}"
        puts "* median - #{dates[dates.size/2].time_pp}"
        puts "* 75% - #{dates[dates.size*3/4].time_pp}"
        puts "* 90% - #{dates[dates.size*9/10].time_pp}"
    end
    puts "Median number of affected files: #{sizes[sizes.size/2]}"

    # Histogram: how many commits touched 1 file, 2 files, ...
    sizes_summary = Hash.new(0)
    sizes.each{|sz| sizes_summary[sz] += 1}
    sizes_summary.to_a.sort.each{|k, v|
        puts "* #{k} file#{(k == 1) ? '' : 's'} - #{v} time#{(v == 1) ? '' : 's'}"
    }
    puts ""
}

To run it do svn log --xml -v | svn_log_summary.rb (it requires magic/xml). The results for me are:

Activity of taw:
Time between commits distribution:
  • 10% - 2m40s
  • 25% - 11m17s
  • median - 38m13s
  • 75% - 4h 45m34s
  • 90% - 1d 1h 19m45s
Median number of affected files: 2
  • 1 file - 520 times
  • 2 files - 287 times
  • 3 files - 156 times
  • 4 files - 84 times
  • 5 files - 47 times
  • 6 files - 33 times
  • 7 files - 14 times
  • 8 files - 22 times
  • 9 files - 8 times
  • 10 files - 7 times
  • 102 files - 1 time
  • 107 files - 1 time
  • 127 files - 1 time
  • 198 files - 1 time
  • 1274 files - 1 time
  • 2743 files - 1 time
So half of the commits came less than 38m13s after the previous commit, and a quarter came less than 11m17s after the previous one. Breaks of a few hours most likely represent getting away from coding, as it's very rare for me to code for hours without committing. Most commits touched just a few files, and the big ones are most likely imports or automated changes (like the "Tabs replaced by spaces" 46-file commit), not results of long coding sessions. It took me some time to get used to Atomic Coding, but just like with Unit Testing and Getting Things Done - I'm never going back.
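If you don't have magic/xml at hand, the core arithmetic behind the percentile lines is easy to check with the standard library alone. This is only a sketch on made-up timestamps; commit_times, commit_gaps and percentile are hypothetical names, not part of the script above, though the indexing rule (element at position size * fraction of the sorted gaps) is the same one the script uses.

```ruby
require 'time'

# Hypothetical commit timestamps, newest first (the order svn log uses).
commit_times = %w[
  2006-10-01T15:00:00Z
  2006-10-01T14:50:00Z
  2006-10-01T14:10:00Z
  2006-10-01T12:10:00Z
  2006-10-01T12:05:00Z
].map { |t| Time.parse(t) }

# Seconds between each commit and the one before it, sorted ascending.
def commit_gaps(times)
  times.sort.each_cons(2).map { |earlier, later| later - earlier }.sort
end

# Pick the element at position size * numerator / denominator
# (integer division), e.g. 1/2 for the median, 1/4 for the 25% mark.
def percentile(sorted, numerator, denominator)
  sorted[sorted.size * numerator / denominator]
end

gaps = commit_gaps(commit_times)   # 5, 10, 40 and 120 minutes, in seconds
median = percentile(gaps, 1, 2)

puts "median gap: #{(median / 60).round} minutes"   # prints "median gap: 40 minutes"
```

With only four gaps the "median" is simply the third-smallest one; on a real repository with hundreds of commits the rounding in that index stops mattering.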


