I'm slowly uploading stuff to the new server after the crash. The thing that slows me down most is trying to get things right immediately instead of doing a quick reupload first and fixing things later. I know that's not the way to get things done fast, but it's more fun.

I want packages to be automatically built and uploaded with a single command (and later, nightly from crontab). Some packages are pretty big (jrpg mostly), so doing this naively would mean unnecessarily reuploading a few hundred MBs every night. The Rakefile needs some way of knowing whether the new package is any different from the old one. Unfortunately packages (.tar.gz, .tar.bz2, .zip etc.) with identical contents are not necessarily bitwise identical. In fact, most of the time they're not, and they don't even have identical file sizes.

So I wrote a library to hash archive contents, which will hopefully save a lot of unnecessary uploads. The library is pretty simple, so I'm just pasting it here instead of packaging it, releasing it on RubyForge etc., at least for now.
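require 'digest/sha1' # the standalone 'sha1' library is gone from current Ruby
require 'tmpdir'

class Array
  # Picks a random element
  def random
    self[rand(size)]
  end
end

class String
  # Hex SHA1 of the string
  def digest
    Digest::SHA1.hexdigest(self)
  end

  # Random string of characters that are safe in a path
  def self.random(len = 32)
    path_characters = ("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"]
    (0...len).map{ path_characters.random }.join
  end
end

class File
  # Hex SHA1 of the file's contents
  def self.digest(file_name)
    Digest::SHA1.hexdigest(File.read(file_name))
  end
end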
class Archive
  def self.finalizer(dir)
    Proc.new{
      system "rm", "-rf", dir
    }
  end

  # file_name must be absolute
  def initialize(file_name, type=nil)
    @file_name = file_name
    type = guess_type_by_extension if type == nil
    @type = type
    @unpacked = false
  end

  # It's not particularly secure
  # Unfortunately tempfile only creates files, not directories
  def dir
    return @dir if @dir
    while true
      @dir = Dir::tmpdir + "/ahash-" + String.random
      Dir.mkdir @dir rescue redo
      ObjectSpace.define_finalizer(self, Archive.finalizer(@dir))
      return @dir
    end
  end

  def guess_type_by_extension
    case @file_name
    when /(\.tgz|\.tar\.gz)\Z/
      :tar_gz
    when /(\.tar\.bz2)\Z/
      :tar_bz2
    when /(\.tar)\Z/
      :tar
    when /(\.zip)\Z/
      :zip
    else
      nil
    end
  end

  def unpack
    return if @unpacked
    Dir.chdir(dir) {
      case @type
      when :tar_gz
        system "tar", "-xzf", @file_name
      when :tar_bz2
        system "tar", "-xjf", @file_name
      when :tar
        system "tar", "-xf", @file_name
      when :zip
        system "unzip", "-q", @file_name
      else
        raise "Don't know how to unpack archives of type #{@type}"
      end
    }
    @unpacked = true
  end

  def quick_hash
    unpack
    @quick_hash ||= Dir.chdir(dir) {
      Dir["**/*"].map{|file_name|
        if File.directory?(file_name)
          ['dir', file_name]
        else
          ['file', file_name, File.size(file_name)]
        end
      }.sort.inspect.digest
    }
  end

  def slow_hash
    unpack
    @slow_hash ||= Dir.chdir(dir) {
      Dir["**/*"].map{|file_name|
        if File.directory?(file_name)
          ['dir', file_name]
        else
          ['file', file_name, File.size(file_name), File.digest(file_name)]
        end
      }.sort.inspect.digest
    }
  end
end
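The comment on Archive#dir admits that picking a random directory name by hand is not particularly secure. Current Ruby versions ship Dir.mktmpdir in the same tmpdir library, which creates a uniquely named directory readable only by its owner. A minimal sketch of how the temporary directory could be handled that way; the ScratchDir class and its method names are hypothetical and not part of the library above:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper, not part of the library above: owns one temporary
# directory and removes it when the object is garbage collected.
class ScratchDir
  attr_reader :path

  # Built in a class method so the proc does not capture the instance,
  # which would keep the instance from ever being collected.
  def self.finalizer(path)
    proc { FileUtils.rm_rf(path) }
  end

  def initialize(prefix = "ahash-")
    # Without a block, Dir.mktmpdir creates the directory and returns its path.
    @path = Dir.mktmpdir(prefix)
    ObjectSpace.define_finalizer(self, self.class.finalizer(@path))
  end
end

scratch = ScratchDir.new
puts scratch.path

# Block form: the directory is removed as soon as the block returns.
Dir.mktmpdir("ahash-") do |dir|
  File.write(File.join(dir, "probe.txt"), "hello")
end
```

The block form is probably the simpler choice when the unpacked files are only needed inside one method call; the finalizer variant mirrors what Archive#dir does.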
Some details:
Array#random picks a random array element.
String.random returns a random string (32 characters by default) made of letters, digits and underscores.
String#digest returns the SHA1 hash of the string in hex format.
File.digest(file_name) returns the hex SHA1 hash of the contents of the file file_name.
Archive.new(file_name, type) creates an Archive object.
Archive.new(file_name) creates an Archive object and guesses its type (:tar_gz, :tar_bz2, :tar, :zip) based on the file extension.
Archive#guess_type_by_extension guesses the Archive's type by looking at the file extension. (internal function)
Archive#dir when first run creates a temporary directory in /tmp (or the system-specific place for temporary files), registers a finalizer which rm -rfs this directory, and returns the path to the newly created directory. When run afterwards it simply returns the saved path. (internal function)
Archive#unpack unpacks the contents of the archive to the temporary directory. (internal function)
Archive#quick_hash returns a quick hash, based only on the list of files and their sizes, not their contents.
Archive#slow_hash returns a reliable but possibly slower hash, based on the file list and the files' contents.
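The post does not show how the Rakefile uses these hashes to skip an upload. One possible sketch, assuming the hash of the last uploaded package is kept in a small state file; upload_needed?, record_upload and the .sha1 file name are all hypothetical, and the hash strings stand in for values that would come from Archive#slow_hash:

```ruby
require 'tmpdir'

# Hypothetical helpers; in a real Rakefile the hash would come from
# something like Archive.new(path).slow_hash.

# True when no hash has been recorded yet or the recorded one differs.
def upload_needed?(content_hash, state_file)
  return true unless File.exist?(state_file)
  File.read(state_file).strip != content_hash
end

# Remember the hash of the package that was just uploaded.
def record_upload(content_hash, state_file)
  File.write(state_file, content_hash + "\n")
end

Dir.mktmpdir do |dir|
  state_file = File.join(dir, "jrpg.tar.gz.sha1")
  new_hash = "0123abcd" # placeholder for a real content hash

  puts upload_needed?(new_hash, state_file)    # true: nothing recorded yet
  record_upload(new_hash, state_file)
  puts upload_needed?(new_hash, state_file)    # false: same contents as last time
  puts upload_needed?("4567ef01", state_file)  # true: contents changed
end
```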
I'm not sure the speed difference between Archive#quick_hash and Archive#slow_hash is that big, as unpacking and hashing take a comparable amount of time. On the other hand, Archive#quick_hash could easily be computed from the archive listing alone (like tar -tvzf), without doing any unpacking, which would make a major difference.
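A rough sketch of that listing-only idea for .tar.gz files, using the tar reader that ships with RubyGems instead of shelling out to tar. The names listing_hash and build_tar_gz are hypothetical, and build_tar_gz exists only to produce two small test archives in memory. The demo also shows the problem from the start of the post: the same files packed with different gzip timestamps give different bytes but the same listing hash.

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'
require 'digest/sha1'

# Hypothetical: hash the listing (kind, name, size) of a .tar.gz stream
# by reading tar headers only, without unpacking anything to disk.
def listing_hash(tar_gz_io)
  entries = []
  Zlib::GzipReader.wrap(tar_gz_io) do |gz|
    Gem::Package::TarReader.new(gz).each do |entry|
      name = entry.full_name.chomp("/")
      if entry.directory?
        entries << ['dir', name]
      else
        entries << ['file', name, entry.header.size]
      end
    end
  end
  Digest::SHA1.hexdigest(entries.sort.inspect)
end

# Hypothetical test helper: pack a {name => data} hash into an in-memory
# .tar.gz whose gzip header carries the given timestamp.
def build_tar_gz(files, gzip_mtime)
  out = StringIO.new(String.new)
  gz = Zlib::GzipWriter.new(out)
  gz.mtime = gzip_mtime
  Gem::Package::TarWriter.new(gz) do |tar|
    files.each do |name, data|
      tar.add_file_simple(name, 0644, data.bytesize) { |io| io.write(data) }
    end
  end
  gz.finish # flush the gzip stream but leave the StringIO usable
  out.string
end

files = { "README" => "hello\n", "src/main.rb" => "puts 1\n" }
a = build_tar_gz(files, 1_000_000_000)
b = build_tar_gz(files, 1_000_000_060)

puts a == b                                                          # false
puts listing_hash(StringIO.new(a)) == listing_hash(StringIO.new(b)) # true
```

The value will not necessarily match Archive#quick_hash for the same archive, since a tar file need not contain an entry for every directory that unpacking creates. Like quick_hash, it also ignores file contents.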