The best kittens, technology, and video games blog in the world.

Wednesday, February 21, 2007

Atomic Coding with Subversion

[Photo: Artemis chewing sweet William... by Dr. Hemmert, from flickr (CC-ND)]

Better programming languages, more extensive libraries, and faster hardware are the most visible drivers of programmer productivity. People are obviously more productive in Ruby than in Fortran, with CPAN than with the C standard library, and on a dual-core Opteron with 1GB of RAM than on punched-card big iron.

Many other factors improve productivity less visibly. For example, in the 2000s any two programs on any two machines can communicate using TCP/IP, HTTP and XML. In the 1980s one would need to code application-specific data encodings and network protocols, and add explicit support for every possible network. People have gotten so used to the Internet that they don't think much about it any more. Back in the old days programmers didn't know about refactoring, unit testing, test-driven development, YAGNI, or even Object-Oriented Programming.

Another productivity factor is revision control software and atomic coding. A few people still don't use revision control, and some of them even used to be pretty well known for it. Back when everybody used CVS, I viewed revision control as a necessary evil: CVS got in the way more often than it helped. It all changed with Subversion. It's not perfect for every possible development model, but for personal repositories and small teams it's simply awesome.

This post is about two things: Subversion basics for those who aren't using it much yet, and Atomic Coding. If you know SVN well, just jump to the Atomic Coding section, as it's a very important concept that will help your productivity stay ahead of Yannis's Law.

Subversion basics

Subversion supports multiple protocols, but the sanest one is HTTPS. With HTTPS, anonymous access requires no authorization, and you don't even need any SVN software: wget -r --no-parent or any browser is good enough. There are guides for quickly setting up an SVN repository with HTTPS for Debian, Ubuntu, Gentoo, FreeBSD, and maybe even Windows.

The first important SVN commands you need are svn help and svn help some_command. They're pretty useful when you forget the details of some command.

When you use Subversion you first need to check out with svn co URL. Then you can edit checked-out files, add new ones with svn add FILES (this works for directories too), move them around (svn mv FROM TO or svn mv FILES DESTDIR), and occasionally delete them (svn rm FILE).

After you're done you can commit. svn ci -m "Message" will commit all changes in the current directory and its subdirectories. Very often you want to commit less than that, for example svn ci -m "Some quick change in README" README while leaving the rest of the code uncommitted.

You can look at the changes since the last commit with svn diff, or at the changes since some older revision with svn diff -r REVISION. To look at changes from another perspective (a list of added/deleted/modified/not-in-repository files) use svn status.

A few more things. If you keep PDF files in the repository, you don't want to see binaries in the output of svn diff. SVN tries to determine what's binary and what's text (it matters only for diff viewing; it won't "convert line endings" in your binaries like some broken revision control systems), but PDFs often start with some text commands before going binary, and that confuses SVN. To tell SVN they're really binaries, run svn propset svn:mime-type application/octet-stream *.pdf. I never had such problems with other binary formats like pictures.

To get your copy of the repository in sync with the master repository run svn up. To get a log of changes affecting the current directory and its subdirectories run svn log.
Every now and then you'll need to revert one file (svn revert FILE) or everything in the directory (svn revert -R .). If you added something but didn't commit it, revert the file instead of svn rming it.

If you want to script SVN, many commands (svn log, svn status and svn info among them) accept the --xml switch and will output machine-readable markup instead of the default human-readable plain text. Another useful switch is -v, which increases verbosity. I think it's much easier to run SVN from the command line and parse XML than to use language-specific SVN libraries.

In SVN, branching and tagging are done by copying (svn cp), and merging by svn merge, but most of the time I work on just the main branch and merge with plain Unix diff and patch.
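To make the "run SVN from the command line and parse XML" idea concrete, here is a minimal sketch that groups working-copy paths by their state. It assumes Ruby's bundled REXML library instead of magic/xml, and the names paths_by_status and SAMPLE_STATUS_XML are made up for this illustration. The sample input is hand-written in the shape that svn status --xml produces, not captured from a real run.

```ruby
require 'rexml/document'

# Group working-copy paths by state ("modified", "added", "unversioned", ...)
# given the XML text printed by `svn status --xml`.
# paths_by_status is a hypothetical helper name, not part of any SVN tooling.
def paths_by_status(xml_text)
  doc = REXML::Document.new(xml_text)
  grouped = Hash.new { |hash, key| hash[key] = [] }

  # Each <entry path="..."> carries a <wc-status item="..."> child.
  # Entries nested under <changelist> are ignored in this sketch.
  doc.elements.each('status/target/entry') do |entry|
    wc_status = entry.elements['wc-status']
    next unless wc_status
    grouped[wc_status.attributes['item']] << entry.attributes['path']
  end
  grouped
end

# Hand-written sample input, for illustration only.
SAMPLE_STATUS_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <status>
    <target path=".">
      <entry path="README">
        <wc-status props="none" item="modified" revision="41"/>
      </entry>
      <entry path="notes.txt">
        <wc-status props="none" item="unversioned"/>
      </entry>
      <entry path="lib/kitten.rb">
        <wc-status props="none" item="added" revision="-1"/>
      </entry>
    </target>
  </status>
XML

paths_by_status(SAMPLE_STATUS_XML).each do |state, paths|
  puts "#{state}: #{paths.join(', ')}"
end
```

In real use the XML would come from the command itself, for example by passing the output of `svn status --xml` to the helper instead of the sample constant.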

Atomic Coding

But I wasn't going to write just another SVN tutorial. I want to say something about Atomic Coding. The idea is basically committing (to the branch you're working on, usually the main branch) as soon as you have a complete change, no matter how small. You've written a single unit test? Commit. Made one failing unit test pass? Commit. Changed the README file? Commit. Fixed a typo in some comments? Commit.

Breaking your work into small pieces is one of the most fundamental ideas in productivity ever. Programming has Unit Testing, project management has Getting Things Done's Next Actions, and so on: they're all about breaking big and complex things into small and simple ones. It's pretty straightforward to apply the same principle to repository management, but it's much easier to talk about something when it has a name, so let's refer to it as "Atomic Coding".

So why do Atomic Coding? First, because committing is so easy. Just say svn ci -m "Unit test for Kitten#meow" and you're done. Many people insist on meaningful commit messages, but if you commit very often you're going to end up with messages like "Some comments added" or "Minor code cleanup". Don't feel bad about them: repository management is there to help you, not to oppress you.

The most low-level benefit is that the svn commands become directly useful. When you do Atomic Coding, svn revert takes you back to the last working state without losing any useful modifications, and svn diff tells you what you are doing right now.

On the medium level, by keeping your code up to date, you will be able to get away from coding and get back to it much more easily. Interruptions happen many times a day: phone calls, instant messaging, people passing by, lunch time, and so on. Atomic Coding will save you a few minutes on every interruption, and it's going to add up to a huge productivity boost.

On the high level, Atomic Coding works really well with Unit Testing, Test-Driven Development, Getting Things Done and so on. They reinforce each other. Programming is really much more effective if you break it into small pieces instead of trying to do everything at once, and all your habits should support this instead of trying to get you away from it. If you want to see Atomic Coding in action, Bitscribe has a cool screencast about it.

I used vague words like "small" and "atomic", but let's get more specific. Ask yourself:
What's the typical (median) time between your commits?
If the answer is anything more than 1-2 hours, you're not doing Atomic Coding. It's often difficult to get the right answer by guessing, so I wrote a short script that extracts it from an SVN repository (the script isn't that short, but that's mostly because of the pretty-printing).

require 'enumerator'
require 'magic_xml'
require 'time'

class Numeric
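    # Pretty-print a duration given in seconds, e.g. 2293 -> "38m13s".
    def time_pp
        s = self.to_i
        # Print the truncated integer: differences between Times are Floats.
        return "#{s}s" if s < 60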

        m = (s / 60).to_i
        s -= m*60
        return "#{m}m#{s}s" if m < 60

        h = (m / 60).to_i
        m -= h*60
        return "#{h}h #{m}m#{s}s" if h < 24

        d = (h / 24).to_i
        h -= d*24
        return "#{d}d #{h}h #{m}m#{s}s"
    end
end

log = XML.parse(STDIN)

summaries_by_author = Hash.new{|ht,k| ht[k] = {:dates => [], :sizes => []}}

log.descendants(:logentry) {|e|
    summaries_by_author[e[:@author]][:dates] << Time.parse(e[:@date])
    summaries_by_author[e[:@author]][:sizes] << e.descendants(:path).size
}

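summaries_by_author.to_a.sort.each{|author, summary|
    # svn log normally lists the newest commit first, so a-b is the
    # (positive) number of seconds between two consecutive commits.
    dates = summary[:dates].enum_for(:each_cons, 2).map{|a,b| a-b}.sort
    sizes = summary[:sizes].sort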

    puts "Activity of #{author}:"
    # An author with a single commit has no intervals to report.
    unless dates.empty?
        puts "Time between commits distribution:"
        puts "* 10% - #{dates[dates.size/10].time_pp}"
        puts "* 25% - #{dates[dates.size/4].time_pp}"
        puts "* median - #{dates[dates.size/2].time_pp}"
        puts "* 75% - #{dates[dates.size*3/4].time_pp}"
        puts "* 90% - #{dates[dates.size*9/10].time_pp}"
    end
    puts "Median number of affected files: #{sizes[sizes.size/2]}"

    sizes_summary = Hash.new(0)
    sizes.each{|sz| sizes_summary[sz] += 1}
    sizes_summary.to_a.sort.each{|k,v|
        puts "* #{k} file#{(k == 1) ? '' : 's'} - #{v} time#{(v == 1) ? '' : 's'}"
    }
}

To run it do svn log --xml -v | svn_log_summary.rb (it requires magic/xml). The results for me are:

Activity of taw:
Time between commits distribution:
  • 10% - 2m40s
  • 25% - 11m17s
  • median - 38m13s
  • 75% - 4h 45m34s
  • 90% - 1d 1h 19m45s
Median number of affected files: 2
  • 1 file - 520 times
  • 2 files - 287 times
  • 3 files - 156 times
  • 4 files - 84 times
  • 5 files - 47 times
  • 6 files - 33 times
  • 7 files - 14 times
  • 8 files - 22 times
  • 9 files - 8 times
  • 10 files - 7 times
...
  • 102 files - 1 time
  • 107 files - 1 time
  • 127 files - 1 time
  • 198 files - 1 time
  • 1274 files - 1 time
  • 2743 files - 1 time

So half of the commits came less than 38m13s after the previous commit, and a quarter came less than 11m17s after the previous one. Breaks of a few hours most likely represent getting away from coding, as it's very rare for me to code for hours without committing. Most commits touched just a few files, and the big ones are most likely imports or automated changes (like the 46-file "Tabs replaced by spaces" commit), not results of long coding sessions. It took me some time to get used to Atomic Coding, but just like with Unit Testing and Getting Things Done, I'm never going back.
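
For anyone who wants to check the arithmetic by hand, here is a small worked sketch of the same percentile convention the script uses (the element at index size/2 of the sorted gaps is the median). The helper names commit_gaps and percentile and the five timestamps are invented for illustration. Unlike the script, this version does not group commits by author.

```ruby
require 'time'

# Gaps in seconds between consecutive commits, sorted ascending.
# Input order does not matter because the timestamps are sorted first.
def commit_gaps(timestamps)
  times = timestamps.map { |t| Time.parse(t) }.sort
  times.each_cons(2).map { |earlier, later| later - earlier }.sort
end

# Element at the given fraction of a sorted list, using the index
# sorted.size * numerator / denominator (integer division).
# Only meaningful for fractions below 1; returns nil for an empty list.
def percentile(sorted, numerator, denominator)
  return nil if sorted.empty?
  sorted[sorted.size * numerator / denominator]
end

# Hypothetical commit times, invented for illustration.
SAMPLE_COMMITS = [
  "2007-02-21T10:00:00Z",
  "2007-02-21T10:05:00Z",
  "2007-02-21T10:45:00Z",
  "2007-02-21T13:45:00Z",
  "2007-02-22T13:45:00Z",
]

gaps = commit_gaps(SAMPLE_COMMITS)
puts "25%:    #{percentile(gaps, 1, 4).to_i} seconds"
puts "median: #{percentile(gaps, 1, 2).to_i} seconds"
puts "90%:    #{percentile(gaps, 9, 10).to_i} seconds"
```

With five commits there are four gaps (5 minutes, 40 minutes, 3 hours and 1 day), so index 4/2 = 2 picks the 3-hour gap as the median. With an even number of gaps this convention takes the upper of the two middle values rather than averaging them.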

14 comments:

Свилен Иванов said...

My reason for frequent (non-atomic) commits is to avoid merging with the rest of the team. I've noticed that committing small pieces in a highly concurrent environment (say >5 people) is preferable.

Anonymous said...

I agree with atomic commits 100%.

BTW, you got "co" and "ci" mixed up in the following paragraph:

After you're done you can commit. svn co -m "Message" will commit all changes in current directory and its subdirectories. Very often you want to commit less than that, for example svn co -m "Some quick change in README" README while leaving rest of the code uncommitted.

Anonymous said...

So the final logical step to this, and it is one that I have been waiting for for years, is to have the "Save" function on your editor do an automatic commit, popping up a dialog box that lets you enter an optional checkin comment.

- Elroy Jetson

taw said...

Anonymous: Thanks for spotting the typo, I fixed it.

Anonymous: It cannot practically work like that, because you want to test your code before committing and you need to save the file to test it. But most editors can already do single-click SVN commits.

Unknown said...

You have a mismatched <code> tag after "When you do Atomic Coding"... Opera renders it funny.

When you do Atomic Coding, <code>svn revert<code>

Should be:

When you do Atomic Coding, <code>svn revert</code>

Thought you might like to know. :)

taw said...

Peter: Thanks. I need to script up some syntax sanity checker for Blogger, because Blogger software is not doing its job well enough.

Anonymous said...

I don't use subversion (I'm currently a darcs addict) but I sort of do "atomic commits" as you describe. The difference is I don't always commit right that second. I often do commit runs where I create 20 individual commits over my current directory in the space of 5 or 10 minutes.

But I definitely agree that an individual commit should be as small and atomic as possible.

What really helps is a script I wrote (and emacs integration) that lets me commit just part of a file. So if I'm working on a major feature and I happen to see an unrelated bug in the same file I can fix the bug and commit the change immediately without worrying about my half-implemented feature.

http://porkrind.org/commit-patch/

I wrote a little tutorial for how it works here:

http://porkrind.org/missives/commit-patch-managing-your-mess

Caveat: It doesn't support svn yet, but it does support cvs, darcs and mercurial so it should be easy to extend. I'll happily accept a patch from anyone who wants to add svn support.

-David

Anonymous said...

svn is not that great... darcs' interactive patch recording is WAY better to create small, atomic patches.

Anonymous said...

It would only work in a very small team. In a large team (for the sake of clarity, let's say more than five people) you do want to keep your commits as small as possible, but not smaller than required to fix a bug or to implement a small feature or a logically complete part of a larger feature. Plus, the time between commits will be mostly spent on following the process - code reviews, commit approvals, mandatory (sometimes very lengthy) test builds, regressions, etc.

taw said...

George: My advice - if every commit requires "a code review, commit approval, mandatory lengthy test builds, regressions, etc.", and you cannot fix that (this sounds like a serious lack of trust issue), then just get yourself a local repository (or a branch in the central one) where you can commit whenever you feel like committing, use atomic coding there, and only do the paperwork on merging.

You're still going to lose productivity, only a lot less so.

Unknown said...

I agree with your atomic commit opinion _completely_ and that's in part because I follow that strategy religiously each day.

However, like others posting here, I have to let you know that your editing of this article is horrendous. You have more spelling and grammar errors than I can count... and that is really, really annoying, no matter how right you are about other things...

Anonymous said...

Good post.

I can agree with atomic commits in some occasions, but I usually prefer complete-and-tested-feature-commits, just because my team is not very small and because in our project usually the hurry-to-commit is a large source of bugs.

By the way, I don't mind grammar errors very much, just because my English is worse .. !!

Anonymous said...

Subversion? Bleh.

Try a distributed system like git, mercurial, or darcs... Then you'll experience the true potential of code management -- seamless branches, throwaway branches, merges which preserve history, ...

Anonymous said...

I can think of better systems than our current version control systems. Specifically I think an integrated VIM type system of undo/redo version control would be better. Namely there would be absolutely no commits. Or if there were commits they would be more like bookmarks. Instead the whole environment (editor+ide or whatever) would continuously be making a log of everything you do so that all changes could be undone or redone to any degree. The best type of system to accomplish this would be something with persistent storage like an image based system.