Wednesday, March 28, 2007

Hashing archive contents

[Image: lemony cupcakes by chotda from flickr (CC-NC-ND)]

I'm slowly uploading stuff to the new server after the crash. The thing that slows me down most is trying to get things right immediately instead of doing a quick reupload first and fixing things later. I know that's not the way to get things done fast, but it's more fun.

I want packages to be automatically built and uploaded with a single command (and later, nightly from crontab). Some packages are pretty big (jrpg mostly), so doing this naively would mean unnecessarily reuploading a few hundred MBs every night. The Rakefile needs some way of knowing whether the new package is any different from the old one. Unfortunately packages (.tar.gz, .tar.bz2, .zip etc.) with identical contents are not necessarily bitwise identical. In fact, most of the time they're not, and they don't even have identical file sizes. So I wrote a library to hash archive contents, which will hopefully save a lot of unnecessary uploads. The library is pretty simple, so I'm just pasting it here instead of packaging it, releasing it on RubyForge etc., at least for now.
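require 'digest/sha1'
# Older Rubies shipped a top-level 'sha1' library; newer ones only have Digest::SHA1
SHA1 = Digest::SHA1 unless defined?(SHA1)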
require 'tmpdir'

class Array
 def random
     self[rand(size)]
 end
end

class String
 def digest
     SHA1.hexdigest(self)
 end
 def self.random(len = 32)
     path_characters = ("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"]
     (0...len).map{ path_characters.random }.join
 end
end

class File
 def self.digest(file_name)
     SHA1.hexdigest(File.read(file_name))
 end
end

class Archive
 def self.finalizer(dir)
     Proc.new{
         system "rm", "-rf", dir
     }
 end
 # file_name must be absolute
 def initialize(file_name, type=nil)
     @file_name = file_name
     type = guess_type_by_extension if type == nil
     @type = type
     @unpacked = false
 end
 # It's not particularly secure
 # Unfortunately tempfile only creates files, not directories
 def dir
     return @dir if @dir
     while true
         @dir = Dir::tmpdir + "/ahash-" + String.random
         Dir.mkdir @dir rescue redo
         ObjectSpace.define_finalizer(self, Archive.finalizer(@dir))
         return @dir
     end
 end
 def guess_type_by_extension
     case @file_name
     when /(\.tgz|\.tar\.gz)\Z/
         :tar_gz
     when /(\.tar\.bz2)\Z/
         :tar_bz2
     when /(\.tar)\Z/
         :tar
     when /(\.zip)\Z/
         :zip
     else
         nil
     end
 end
 def unpack
     return if @unpacked
     Dir.chdir(dir) {
         case @type
         when :tar_gz
             system "tar", "-xzf", @file_name
         when :tar_bz2
             system "tar", "-xjf", @file_name
         when :tar
             system "tar", "-xf", @file_name
         when :zip
             system "unzip", "-q", @file_name
         else
             raise "Don't know how to unpack archives of type #{@type}"
         end
     }
     @unpacked = true
 end
 def quick_hash
     unpack
     @quick_hash ||= Dir.chdir(dir) {
         Dir["**/*"].map{|file_name|
             if File.directory?(file_name)
                 ['dir', file_name]
             else
                 ['file', file_name, File.size(file_name)]
             end
         }.sort.inspect.digest
     }
 end
 def slow_hash
     unpack
     @slow_hash ||= Dir.chdir(dir) {
         Dir["**/*"].map{|file_name|
             if File.directory?(file_name)
                 ['dir', file_name]
             else
                 ['file', file_name, File.size(file_name), File.digest(file_name)]
             end
         }.sort.inspect.digest
     }
 end
end
Some details:
  • Array#random picks a random array element
  • String.random(len) returns a random string of len characters (letters, digits and underscore), 32 by default
  • String#digest returns SHA1 hash of string in hex format
  • File.digest(file_name) returns hex SHA1 hash of contents of file file_name
  • Archive.new(file_name, type) creates Archive object
  • Archive.new(file_name) creates Archive object and guesses its type (:tar_gz, :tar_bz2, :tar, :zip) based on file extension
  • Archive#guess_type_by_extension guesses Archive's type by looking at file extension. (internal function)
  • Archive#dir when first run creates temporary directory in /tmp (or system-specific place for temporary files), registers finalizer which rm -rfs this directory, and returns path to the newly created directory. When run afterwards simply returns the saved path. (internal function)
  • Archive#unpack unpacks contents of the archive to the temporary directory. (internal function)
  • Archive#quick_hash returns a quick hash, based only on list of files and their sizes, not contents.
  • Archive#slow_hash returns a reliable but possibly slower hash, based on file list and their contents.
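To show how these hashes might plug into the Rakefile, here is a minimal sketch, assuming the hash of the last uploaded package is kept in a small state file next to it. The helpers upload_needed? and record_upload, the upload_package placeholder and the file names are all hypothetical and not part of the library.

```ruby
# Hypothetical helpers, not part of the library above.
# The state file holds the content hash of the last uploaded package.

# True when there is no recorded hash yet, or when it differs from content_hash.
def upload_needed?(content_hash, state_file)
  return true unless File.exist?(state_file)
  File.read(state_file).strip != content_hash
end

# Remember the hash of the package that was just uploaded.
def record_upload(content_hash, state_file)
  File.write(state_file, content_hash + "\n")
end

# Intended use from a Rake task (upload_package is a placeholder,
# and the file names are made up for illustration):
#
#   hash = Archive.new(File.expand_path("jrpg.tar.gz")).slow_hash
#   if upload_needed?(hash, "jrpg.tar.gz.sha1")
#     upload_package("jrpg.tar.gz")
#     record_upload(hash, "jrpg.tar.gz.sha1")
#   end
```

Recording the hash only after the upload succeeds means a failed upload gets retried on the next run.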
I don't think the speed difference between Archive#quick_hash and Archive#slow_hash is that big, as unpacking and hashing take a comparable amount of time. On the other hand, Archive#quick_hash could easily be computed from the archive listing alone (like tar -tvzf), without doing the unpacking, which would make a major difference.
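The listing-only idea could look roughly like the sketch below. Instead of parsing tar -tvzf output, it reads the tar headers with Gem::Package::TarReader, which ships with Ruby, so this is a swapped-in technique and not what the library does. The helpers listing_hash and tar_gz_listing_hash are hypothetical names. The gzip stream still has to be decompressed, but nothing is written to disk. The digest is not guaranteed to match Archive#quick_hash, for example when an archive has no explicit directory entries or contains dotfiles, which Dir["**/*"] skips.

```ruby
require 'digest/sha1'
require 'rubygems/package'
require 'zlib'

# Hypothetical helper, not part of the library above.
# Builds the same kind of [type, name, size] list as Archive#quick_hash,
# but from tar headers only.
def listing_hash(tar_io)
  entries = []
  Gem::Package::TarReader.new(tar_io).each do |entry|
    # Normalize "./foo/" to "foo" so it resembles Dir["**/*"] output
    name = entry.full_name.sub(%r{\A\./}, "").chomp("/")
    next if name.empty? || name == "."
    if entry.directory?
      entries << ['dir', name]
    else
      entries << ['file', name, entry.header.size]
    end
  end
  Digest::SHA1.hexdigest(entries.sort.inspect)
end

# Same thing for a .tar.gz / .tgz file on disk.
def tar_gz_listing_hash(file_name)
  Zlib::GzipReader.open(file_name) { |gz| listing_hash(gz) }
end
```

Like Archive#quick_hash, this only notices changes that alter the file list or a file's size, so an edit that keeps the size the same would go undetected.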
