The best kittens, technology, and video games blog in the world.

Wednesday, March 28, 2007

Hashing archive contents

lemony cupcakes by chotda from flickr (CC-NC-ND)

I'm slowly uploading stuff to the new server after the crash.

The thing that slows me down most is trying to get things right immediately instead of doing a quick reupload first and fixing them later. I know that's not the way to get things done fast, but it's more fun.

I want packages to be automatically built and uploaded with a single command (and later, nightly from crontab). Some packages are pretty big (jrpg mostly), so doing this naively would mean unnecessarily reuploading a few hundred MBs every night. The Rakefile needs some way of knowing whether the new package is any different from the old one.

Unfortunately packages (.tar.gz, .tar.bz2, .zip etc.) with identical contents are not necessarily bitwise identical. In fact, most of the time they're not, and don't even have identical filesizes. So I wrote a library to hash archive contents, which will hopefully save a lot of unnecessary uploads.
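One reason for the mismatch is that the gzip header stores a modification time, so compressing the same bytes at two different moments gives different output. Here is a minimal sketch of that effect using Ruby's zlib bindings; the helper name gzip_with_mtime and the sample payload are made up for illustration.

```ruby
require 'zlib'
require 'stringio'
require 'digest/sha1'

# Hypothetical helper: gzip the data, stamping the given time into the header.
def gzip_with_mtime(data, mtime)
  io = StringIO.new("".b)
  gz = Zlib::GzipWriter.new(io)
  gz.mtime = mtime   # has to be set before the first write
  gz.write(data)
  gz.finish          # ends the gzip stream without closing the StringIO
  io.string
end

payload = "identical contents\n" * 100
first   = gzip_with_mtime(payload, Time.at(1_175_000_000))
second  = gzip_with_mtime(payload, Time.at(1_175_086_400))

# Compare the hashes of the compressed bytes, then of the unpacked contents.
puts Digest::SHA1.hexdigest(first) == Digest::SHA1.hexdigest(second)
puts Zlib::GzipReader.new(StringIO.new(first)).read ==
     Zlib::GzipReader.new(StringIO.new(second)).read
```

The timestamp is only one source of differences. File order inside the archive, stored owners and permissions, and compression settings can all change the bytes too, which is why hashing the unpacked contents is the safer comparison.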

The library is pretty simple, so I'm just pasting it here instead of packaging, releasing on RubyForge etc., at least for now.

require 'digest/sha1'
require 'tmpdir'

class Array
  # Pick a random element
  def random
    self[rand(size)]
  end
end

class String
  # Hex SHA1 of the string
  def digest
    Digest::SHA1.hexdigest(self)
  end

  # Random string of len characters, safe to use in a path
  def self.random(len = 32)
    path_characters = ("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"]
    (0...len).map{ path_characters.random }.join
  end
end

class File
  # Hex SHA1 of the file's contents
  def self.digest(file_name)
    Digest::SHA1.hexdigest(File.read(file_name))
  end
end

class Archive
  # The finalizer must not capture self, so it is built in a class method
  def self.finalizer(dir)
    Proc.new{
      system "rm", "-rf", dir
    }
  end

  # file_name must be absolute
  def initialize(file_name, type=nil)
    @file_name = file_name
    type = guess_type_by_extension if type == nil
    @type = type
    @unpacked = false
  end

  # It's not particularly secure
  # Unfortunately tempfile only creates files, not directories
  def dir
    return @dir if @dir
    while true
      @dir = Dir.tmpdir + "/ahash-" + String.random
      # Try another random name if this one is taken
      Dir.mkdir @dir rescue redo
      ObjectSpace.define_finalizer(self, Archive.finalizer(@dir))
      return @dir
    end
  end

  def guess_type_by_extension
    case @file_name
    when /(\.tgz|\.tar\.gz)\Z/
      :tar_gz
    when /(\.tar\.bz2)\Z/
      :tar_bz2
    when /(\.tar)\Z/
      :tar
    when /(\.zip)\Z/
      :zip
    else
      nil
    end
  end

  def unpack
    return if @unpacked
    Dir.chdir(dir) {
      case @type
      when :tar_gz
        system "tar", "-xzf", @file_name
      when :tar_bz2
        system "tar", "-xjf", @file_name
      when :tar
        system "tar", "-xf", @file_name
      when :zip
        system "unzip", "-q", @file_name
      else
        raise "Don't know how to unpack archives of type #{@type}"
      end
    }
    @unpacked = true
  end

  # Hash of file names and sizes only
  def quick_hash
    unpack
    @quick_hash ||= Dir.chdir(dir) {
      Dir["**/*"].map{|file_name|
        if File.directory?(file_name)
          ['dir', file_name]
        else
          ['file', file_name, File.size(file_name)]
        end
      }.sort.inspect.digest
    }
  end

  # Hash of file names, sizes and contents
  def slow_hash
    unpack
    @slow_hash ||= Dir.chdir(dir) {
      Dir["**/*"].map{|file_name|
        if File.directory?(file_name)
          ['dir', file_name]
        else
          ['file', file_name, File.size(file_name), File.digest(file_name)]
        end
      }.sort.inspect.digest
    }
  end
end
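Archive#dir builds its temporary directory by hand. Ruby versions released after this post have Dir.mktmpdir in the tmpdir library, which creates a uniquely named directory and, in its block form, removes it when the block returns. A minimal sketch of that alternative, not what the library above does; the helper name with_scratch_dir is made up:

```ruby
require 'tmpdir'

# Hypothetical helper: yield a fresh scratch directory and return the
# block's result. The directory is deleted when the block finishes.
def with_scratch_dir(prefix = "ahash-")
  Dir.mktmpdir(prefix) { |dir| yield dir }
end

path = with_scratch_dir do |dir|
  File.write(File.join(dir, "probe.txt"), "hello")
  dir   # hand the path back so it can be inspected afterwards
end

puts File.exist?(path)   # checks whether the directory survived the block
```

The tradeoff is lifetime: the block form ties the directory to a block, while the finalizer approach ties it to the lifetime of the Archive object.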
Some details:
  • Array#random picks a random array element
  • String.random(len) returns a random string of len characters (32 by default) made of letters, digits and underscores
  • String#digest returns SHA1 hash of string in hex format
  • File.digest(file_name) returns hex SHA1 hash of contents of file file_name
  • Archive.new(file_name, type) creates Archive object
  • Archive.new(file_name) creates Archive object and guesses its type (:tar_gz, :tar_bz2, :tar, :zip) based on file extension
  • Archive#guess_type_by_extension guesses Archive's type by looking at file extension. (internal function)
  • Archive#dir when first run creates temporary directory in /tmp (or system-specific place for temporary files), registers finalizer which rm -rfs this directory, and returns path to the newly created directory. When run afterwards simply returns the saved path. (internal function)
  • Archive#unpack unpacks contents of the archive to the temporary directory. (internal function)
  • Archive#quick_hash returns a quick hash, based only on list of files and their sizes, not contents.
  • Archive#slow_hash returns a reliable but possibly slower hash, based on file list and their contents.
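With a content hash in hand, the upload decision in the Rakefile comes down to comparing it against the hash recorded at the previous upload. A rough sketch under assumptions: the helper upload_if_changed and the idea of keeping the last hash in a small state file are illustrative, not part of the library.

```ruby
# Hypothetical helper: run the block only when content_hash differs from the
# hash stored in state_file by the previous successful upload.
def upload_if_changed(content_hash, state_file)
  previous = File.exist?(state_file) ? File.read(state_file).strip : nil
  return false if previous == content_hash

  yield   # the actual upload goes here
  # Record the hash only after the upload block finished without raising.
  File.open(state_file, "w") { |f| f.puts(content_hash) }
  true
end

# Possible use with the Archive class (file names are made up):
#   upload_if_changed(Archive.new("/abs/path/jrpg.tar.gz").slow_hash, "jrpg.sha1") do
#     # upload command of your choice
#   end
```

Because the hash is written only after the block returns, a failed upload gets retried on the next run.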
I don't think the speed difference between Archive#quick_hash and Archive#slow_hash is that big, as unpacking and hashing take a comparable amount of time. On the other hand, Archive#quick_hash could easily be computed from the archive listing alone (like tar -tvzf), without unpacking anything, which would make a major difference.
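Here is one way the listing-only variant might look. Instead of parsing tar -tvzf output, this sketch swaps in the tar reader bundled with RubyGems (Gem::Package::TarReader) to walk the entries of a .tar.gz stream and collect names and sizes, without writing anything to disk. The helper name listing_hash is made up. It covers only directories and regular files, so its result is not guaranteed to match Archive#quick_hash for every archive.

```ruby
require 'rubygems/package'
require 'zlib'
require 'digest/sha1'

# Hypothetical helper: hash the names and sizes listed in a .tar.gz stream.
# Symlinks and other special entry types are ignored in this sketch.
def listing_hash(tar_gz_io)
  entries = []
  gz = Zlib::GzipReader.new(tar_gz_io)
  Gem::Package::TarReader.new(gz).each do |entry|
    name = entry.full_name.chomp("/")   # tar usually lists directories as "dir/"
    if entry.directory?
      entries << ['dir', name]
    elsif entry.file?
      entries << ['file', name, entry.header.size]
    end
  end
  # Same shape as quick_hash: sorted list, inspected, then SHA1.
  Digest::SHA1.hexdigest(entries.sort.inspect)
ensure
  gz.finish if gz   # release zlib state but leave the caller's IO open
end

# Possible use (file name is made up):
#   File.open("package.tar.gz", "rb") { |f| puts listing_hash(f) }
```

The file data still has to be decompressed to reach the next header, but nothing is written to a temporary directory and no separate pass over the unpacked files is needed.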
