Thursday, March 29, 2007

Segfaulting own programs for fun and profit

Lassanya en verano / Lassanya in summer time by Mòni from flickr (CC-NC-ND)Most people hate segfaults - and no wonder, getting one usually indicates a serious and difficult to debug problem, very likely also a security hole. People tend to be grateful that their language doesn't have segfaults.

I'm going to propose something seemingly insane - segfaulting your own programs on purpose. It will be necessary to switch to a low-level point of view, but if you stay with me, you'll get a great new tool.

Magical complex numbers

The program I'm going to show simply prints complex numbers. These numbers are magical - in addition to the usual re and im fields they contain a as_string field which always magically contains textual representation of them. You don't have to follow any special API - as_string updates itself automagically on demand. From user point of view the only way complex_t differs from a run-of-the-mill complex type are special allocation and deallocation routines (nothing unusual in C APIs), and the magical as_string field.

/* magic_complex.h */
typedef struct {
double re, im;
char *as_string;
} complex_t;

complex_t *complex_new();
void complex_free(complex_t *complex);
/* main.c */
#include <stdio.h>
#include "magic_complex.h"

int main()
{
complex_t *a = complex_new();
a->re = 5.0;
a->im = -2.0;

printf("%s\n", a->as_string);
printf("%s\n", a->as_string);

a->re = 3.0;
printf("%s\n", a->as_string);

return 0;
}

Compiling the program and running it produces the expected output. But how ?

$ ./main
5.000000-2.000000i
5.000000-2.000000i
3.000000-2.000000i

What really happens in hardware

Before I show the implementation of magical complex numbers, let's take a look at hardware. Back in the early days of x86, "real" memory model was used. Reading memory cell 0x1234 meant physical hardware cell 0x1234. If program accessed memory which wasn't physically there, program stopped and a special operating system procedure was called to decide what to do next.

On modern x86 the situation is different. Programs don't access memory directly. Instead, they access virtual memory, which is organized in 4096-byte "pages". So reading memory cell 0xb7fcb123 actually means reading cell 0x123 of whatever physical memory corresponds to virtual memory page 0xb7fcb***.

It's possible for a page to point nowhere, or to have restricted access (typically read only). If such page is accessed, a special operating system procedure is called to decide what next. Many such invalid accesses are handled within the operating system, and the operation which caused them reruns. If the operating system doesn't know what to do, it delivers Segmentation Fault signal to the program.

Many invalid accesses can be fixed by the operating system, by making the page active or increasing its permissions. For example if programs use too much memory, some of their memory gets swapped to disk, and corresponding pages are marked as inactive. If they're accessed, the operating system gets informed of access violation, reads the page from disk, and reruns the operation which caused the problem.

Another example are mmap-ed files. Instead of reading and writing files, programs can mmap them. mmap associates some pages of memory with contents of some file. When program accesses this memory, it's read from the disk. mmaping is easier than reading files (file contents appear as if they were in memory), often faster (as reading is only done on-demand), and takes less physical memory, because if multiple programs mmap the same file, the operating system can load it just once. This is much more common than it seems at first - all executables and shared libraries are mmaped, so if you run 15 copies of bash and 540 programs use libc, code of both is present in the physical memory only once.

The operating system performs even more interesting tricks. When a program forks, instead of duplicating all its memory the operating system only marks it as read-only. When either parent or child writes to one of such pages, the operating system copies them and marks as read/write. This way memory is used only when necessary.

When the operating system doesn't know what to do with an access violation, it signals segmentation fault. Most programs upon getting a segmentation fault simply an print error message and exit. But there are so many cool things one can do with virtual memory, it would be a shame not to use them.

Magical complex number library

The library consists of two parts. One part (laziness.c) handles ugly low-level issues and hides them behind a pretty interface. The other part (magic_complex.c) implements magic complex numbers.

The interface to the laziness library looks like that:
/* laziness.h */
#define PAGE_SIZE 4096
#define LAZY_SIZE PAGE_SIZE

typedef void (*fault_handler_t)(void*, void*);

void *lazy_alloc_rw(void);
void *lazy_alloc_ro(fault_handler_t fault_handler, void *data);
void *lazy_alloc_none(fault_handler_t fault_handler, void *data);

void lazy_make_rw(void *obj);
void lazy_make_ro(void *obj, fault_handler_t fault_handler, void *data);
void lazy_make_none(void *obj, fault_handler_t fault_handler, void *data);

void lazy_free(void *obj);

lazy_alloc_* functions allocate pages of memory with read-write, read-only, or no-access permissions. lazy_make_* functions change access permissions to existing memory. lazy_free frees such memory. Read-only and no-access versions of lazy_alloc_* and lazy_make_* also register a fault_handler argument, which will be run if such memory is accessed illegally. Always exactly 4096 bytes are allocated (or whatever is your hardware page size). It wouldn't be difficult to allocate multiples of 4096, but I wanted to keep the code simple.

Finally, the magic complex numbers library:
/* magic_complex.c */
#include <stdio.h>
#include "laziness.h"
#include "magic_complex.h"

void complex_main_fault(complex_t *complex, void *data);
void complex_string_fault(char *as_string, complex_t *complex);

void complex_main_fault(complex_t *complex, void *data)
{
lazy_make_rw(complex);
lazy_make_none(complex->as_string, complex_string_fault, complex);
}

void complex_string_fault(char *as_string, complex_t *complex)
{
lazy_make_ro(complex, complex_main_fault, 0);
lazy_make_rw(as_string);
snprintf(as_string, LAZY_SIZE, "%f%+fi", complex->re, complex->im);
as_string[LAZY_SIZE-1] = 0;
}

complex_t *complex_new()
{
complex_t *complex = lazy_alloc_rw();
complex->as_string = lazy_alloc_none(complex_string_fault, complex);
return complex;
}

void complex_free(complex_t *complex)
{
lazy_free(complex->as_string);
lazy_free(complex);
}
Some explanations are needed. When new number is allocated (complex_new), the library allocated two hardware pages - one read/write for the object itself, and one no-access for the magical as_string.

When as_string is used, complex_string_fault runs. It marks the string read/write, and the object read-only. Then it generates as_string using snprintf. As the object is read-only now, we can be sure the string doesn't go out of sync.

When the read-only object is written to, complex_main_fault is run, the object page is marked read/write, the as_string page is marked no-access, and so we can be sure that when as_string is requested it will be generated anew.

Laziness library

Low-level part of the library is bigger and more complex, so I'll describe it part by part.
/* laziness.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <signal.h>
#include <unistd.h>

#include "laziness.h"

struct fault_handlers_list
{
void *obj;
fault_handler_t fault_handler;
void *data;
struct fault_handlers_list *next;
};

struct fault_handlers_list *fault_handlers_list_root;

void remove_fault_handler(void *obj)
{
struct fault_handlers_list **prev = &fault_handlers_list_root;
struct fault_handlers_list *node;
while(*prev)
{
node = *prev;
if(node->obj == obj)
{
*prev = node->next;
free(node);
return;
}
prev = &(node->next);
}
/* No fault handler for obj */
}

void install_fault_handler(void *obj, fault_handler_t fault_handler, void *data)
{

struct fault_handlers_list *node = malloc(sizeof(struct fault_handlers_list));
node->obj = obj;
node->fault_handler = fault_handler;
node->data = data;
node->next = fault_handlers_list_root;
fault_handlers_list_root = node;
}

install_fault_handler and remove_fault_handler maintain a list of fault handlers, in form of a linked list. It's of course extremely lame - in practice a hash table would be used, or a few bytes of the object would be reserved for fault handler. It's also extremely simple, so let's ignore performance for now.

void *lazy_alloc_rw(void)
{
void *obj = mmap(0, PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
return obj;
}

void *lazy_alloc_ro(fault_handler_t fault_handler, void *data)
{
void *obj = mmap(0, PAGE_SIZE, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if(fault_handler)
install_fault_handler(obj, fault_handler, data);
return obj;
}

void *lazy_alloc_none(fault_handler_t fault_handler, void *data)
{
void *obj = mmap(0, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if(fault_handler)
install_fault_handler(obj, fault_handler, data);
return obj;
}

lazy_alloc_* routines mmap an anonymous (that is - not associated with any file) page of memory with appropriate permissions and register a fault handler if any was given.

void lazy_make_rw(void *obj)
{
remove_fault_handler(obj);
mprotect(obj, PAGE_SIZE, PROT_READ|PROT_WRITE);
}

void lazy_make_ro(void *obj, fault_handler_t fault_handler, void *data)
{
remove_fault_handler(obj);
mprotect(obj, PAGE_SIZE, PROT_READ);
if(fault_handler)
install_fault_handler(obj, fault_handler, data);
}

void lazy_make_none(void *obj, fault_handler_t fault_handler, void *data)
{
remove_fault_handler(obj);
mprotect(obj, PAGE_SIZE, PROT_NONE);
if(fault_handler)
install_fault_handler(obj, fault_handler, data);
}

void lazy_free(void *obj)
{
remove_fault_handler(obj);
munmap(obj, PAGE_SIZE);
}

lazy_make_* functions remove the old fault handler, change access permissions, and install a new fault handler. lazy_free removes the old handler and unmaps the memory page.

And the last part:
void lazy_segfault_handler(int signum, siginfo_t *siginfo, void *something)
{
void *obj = (void*)((unsigned int)(siginfo->si_addr) & ~(PAGE_SIZE-1));
struct fault_handlers_list *node = fault_handlers_list_root;

while(node)
{
if(obj == node->obj)
{
node->fault_handler(obj, node->data);
return;
}
node = node->next;
}
fprintf(stderr, "Segmentation fault: %p\n", siginfo->si_addr);
exit(1);
}

void lazy_init() __attribute__((constructor));
void lazy_init()
{
struct sigaction segfault_handler;
segfault_handler.sa_sigaction = lazy_segfault_handler;
segfault_handler.sa_flags = SA_SIGINFO;
sigemptyset(&segfault_handler.sa_mask);

sigaction(SIGSEGV, &segfault_handler, 0);
}

lazy_init function is marked as constructor, so it runs automatically when the program starts. I wrote about it in the post about debuggers. It tell the operating system to run lazy_segfault_handler whenever a segmentation fault occurs.

lazy_segfault_handler finds and runs the fault handler. When it returns, the operating system automatically returns the operation which faulted. If none is found, it prints an error message and exits.

Let's take a look at main.c again. What happens inside.
/* main.c */
#include <stdio.h>
#include "magic_complex.h"

int main()
{
/* a allocated as read/write */
/* a->as_string allocated as no-access */
complex_t *a = complex_new();

/* a is read/write, nothing special happens */
a->re = 5.0;
a->im = -2.0;

/* a->as_string is no-access, segfault */
/* a becomes read-only */
/* a->as_string becomes read-write */
/* a->as_string generated */
printf("%s\n", a->as_string);

/* a->as_string is ok, nothing special happens */
printf("%s\n", a->as_string);

/* a is read only, segfault */
/* a becomes read-write */
/* a->as_string becomes no-access */
a->re = 3.0;

/* a->as_string is no-access, segfault */
/* a becomes read-only */
/* a->as_string becomes read-write */
/* a->as_string generated */
printf("%s\n", a->as_string);

return 0;
}

Why ?

So now that you see how such things can be done, there's a why question left. Obviously, nobody is going to use the magic complex library in any real program.

Here are a few examples where segfaults may be useful:
  • We can make mmap work for http connections, not only local files.
  • Some compilers make program reside the stack automatically if it grows too big.
  • Some compilers implement the tail call optimization by waiting for the stack to grow too big, and then compressing it.
  • Lazy data structures are possible, without user even knowing it ! A plain C list can be produced on-demand.
  • Some data structures exist in both C world and other world at the same time and need to be kept in sync. For example in GNOME programs arrays are represented by both glib's GArray and Ruby's RArray. By smart segfaulting it's possible to always keep only one of them active.
  • It also works for complex data structures. If the array contains other translatable objects, the magic segfault will make them appear as Ruby objects in Ruby code and as glib objects in glib code.
The example library always allocates 4kB of memory at time. It's trivial to extend it to multiples of 4kB. Resizable objects and objects which take less than a full page are possible, but more complicated. Thread-safety wasn't even attempted. Doing strange things with segfaults can confuse optimizing compilers.

Happy segfaulting.

Wednesday, March 28, 2007

Hashing archive contents

lemony cupcakes by chotda from flickr (CC-NC-ND)I'm slowly uploading stuff to the new server after crash. The thing that slows me down most is trying to get things right immediately instead of doing quick reupload first, and fixing them later. I know that's not the way to get things done fast, but it's more fun. I want packages to be automatically built and uploaded on a single command (and later, nightly from crontab). Some packages are pretty big (jrpg mostly), so doing this naively would mean unnecessarily reuploading a few hundred MBs every night. The Rakefile needs some way of knowing wheather the new package is any different from the old one. Unfortunately packages (.tar.gz, .tar.bz2, .zip etc.) with identical contents are not necessarily bitwise identical. In fact, most of the time they're not, and don't even have identical filesizes. So I wrote a library to hash archive contents, which will hopefully save a lot of unnecessary uploads. The library is pretty simple, so I'm just pasting it here instead of packaging, releasing on RubyForge etc., at least for now.
require 'sha1'
require 'tmpdir'

class Array
 def random
     self[rand(size)]
 end
end

class String
 def digest
     SHA1.hexdigest(self)
 end
 def self.random(len = 32)
     path_characters = ("a".."z").to_a + ("A".."Z").to_a + ("0".."9").to_a + ["_"]
     (0...len).map{ path_characters.random }.join
 end
end

class File
 def self.digest(file_name)
     SHA1.hexdigest(File.read(file_name))
 end
end

class Archive
 def self.finalizer(dir)
     Proc.new{
         system "rm", "-rf", dir
     }
 end
 # file_name must be absolute
 def initialize(file_name, type=nil)
     @file_name = file_name
     type = guess_type_by_extension if type == nil
     @type = type
     @unpacked = false
 end
 # It's not particularly secure
 # Unfortunately tempfile only creates files, not directories
 def dir
     return @dir if @dir
     while true
         @dir = Dir::tmpdir + "/ahash-" + String.random
         Dir.mkdir @dir rescue redo
         ObjectSpace.define_finalizer(self, Archive.finalizer(@dir))
         return @dir
     end
 end
 def guess_type_by_extension
     case @file_name
     when /(\.tgz|\.tar\.gz)\Z/
         :tar_gz
     when /(\.tar\.bz2)\Z/
         :tar_bz2
     when /(\.tar)\Z/
         :tar
     when /(\.zip)\Z/
         :zip
     else
         nil
     end
 end
 def unpack
     return if @unpacked
     Dir.chdir(dir) {
         case @type
         when :tar_gz
             system "tar", "-xzf", @file_name
         when :tar_bz2
             system "tar", "-xjf", @file_name
         when :tar
             system "tar", "-xf", @file_name
         when :zip
             system "unzip", "-q", @file_name
         else
             raise "Don't know how to unpack archives of type #{@type}"
         end
     }
     @unpacked = true
 end
 def quick_hash
     unpack
     @quick_hash ||= Dir.chdir(dir) {
         Dir["**/*"].map{|file_name|
             if File.directory?(file_name)
                 ['dir', file_name]
             else
                 ['file', file_name, File.size(file_name)]
             end
         }.sort.inspect.digest
     }
 end
 def slow_hash
     unpack
     @slow_hash ||= Dir.chdir(dir) {
         Dir["**/*"].map{|file_name|
             if File.directory?(file_name)
                 ['dir', file_name]
             else
                 ['file', file_name, File.size(file_name), File.digest(file_name)]
             end
         }.sort.inspect.digest
     }
 end
end
Some details:
  • Array#random picks a random array element
  • String.random picks a random array element
  • String#digest returns SHA1 hash of string in hex format
  • File.digest(file_name) returns hex SHA1 hash of contents of file file_name
  • Archive.new(file_name, type) creates Archive object
  • Archive.new(file_name) creates Archive object and guesses its type (:tar_gz, :tar_bz2, :tar, :zip) based on file extension
  • Archive#guess_type_by_extension guesses Archive's type by looking at file extension. (internal function)
  • Archive#dir when first run creates temporary directory in /tmp (or system-specific place for temporary files), registers finalizer which rm -rfs this directory, and returns path to the newly created directory. When run afterwards simply returns the saved path. (internal function)
  • Archive#unpack unpacks contents of the archive to the temporary directory. (internal function)
  • Archive#quick_hash returns a quick hash, based only on list of files and their sizes, not contents.
  • Archive#slow_hash returns a reliable but possibly slower hash, based on file list and their contents.
I don't think speed difference between Archive#quick_hash and Archive#slow_hash is that big, as unpacking and hashing take comparable amount of time. On the other hand Archive#quick_hash could easily be computed based on only archive listing (like tar -tvzf), without doing the unpacking, what would make a major difference.

Tuesday, March 27, 2007

Hosting at zabor.org is down

sparklin' drops of spring by Steve took it from flickr (CC-NC-SA)Server I used to host most of my projects - zabor.org, is down for a few days now. At first I thought the problem is temporary, but it seems it's going to stay down for good. All data that was there is on my box, but it's somewhat messy, and it will take a few hours before I get it all back. Anyway, a new website is coming. It's going to be faster, better, more automatic, and with real backup. I want to get to the point where I can simply say rake website and all my packages will be checked out, built, tested, packaged, and uploaded together with updated HTML files. I'd also like to have add some WWW::Mechanize or firewatir-based website testing to make sure everything works as expected. Testing by hand is so 1990s. Well, at least it's a good excuse to learn firewatir. New hosting servers are: Thanks to balrog kun for providing hosting zabor.org, and to pio and anubis for new hosting servers. Extra hosting servers are of course welcome :-) To make this post a little less boring, here's fragrement of the Rakefile which builds the website, which is then uploaded to all servers:
website_root = "/home/taw/website"

PROJECTS = %w[jrpg jrpg2 magic_xml magic_help rlisp]

PROJECTS.each do |project|
   #desc "Create #{website_root}/#{project} directory"
   file "#{website_root}/#{project}" do
       mkdir_p "#{website_root}/#{project}"
   end

   desc "Remove old website files for #{project}"
   task :"clean_#{project}" do
       rm_rf "#{website_root}/#{project}", :verbose => true
   end

   desc "Fix permissions of #{project}'s website"
   task :"pub_#{project}" do
       sh "pub #{website_root}/#{project}"
   end

   desc "Build #{project}'s index.html"
   task :"#{project}_index" do
       sh "eruby <#{project}/index.rhtml >#{website_root}/#{project}/index.html"
   end
end

desc "Build everything that can be built automatically (for testing)"
task :test_build => [:jrpg_package, :jrpg, :jrpg2, :rlisp]

desc "Clean everything"
task :clean => PROJECTS.map{|project| :"clean_#{project}"}

def package(glob, dir="packages")
   return glob.map{|g| package(g, dir)} if glob.is_a? Array
   Dir.glob("#{dir}/#{glob}").sort[-1]
end

def rake_remote(dir, *actions)
   Dir.chdir(dir) { sh "rake", *actions }
end

# JRPG
desc "Build jrpg website (doesn't build package automatically)"
task :jrpg => [:clean_jrpg, "#{website_root}/jrpg", :jrpg_index, :jrpg_files, :pub_jrpg]

desc "Copy jrpg image and package files to website"
task :jrpg_files do
   files = Dir["jrpg/jrpg_new_screenshot*.{jpg,png}"] +
           package(["jrpg-20*-*.tar.gz", "jrpg-20*-*.zip", "jrpg-windows-20*.zip"])
   cp files, "#{website_root}/jrpg"
end

# jrpg packages are not built automatically
# It is not possible, as Windows package needs to be build by hand
desc "Build jrpg packgae"
task :jrpg_package do
   rake_remote "../jrpg/", "package"
   mv package(["jrpg-*-*.tar.gz", "jrpg-*-*.zip"], "../jrpg"), "packages/"
end

# JRPG 2
desc "Build jrpg2 website and packgae"
task :jrpg2 => [:clean_jrpg2, "#{website_root}/jrpg2", :jrpg2_index, :jrpg2_files, :pub_jrpg2]

desc "Copy jrpg2 image and package files to website"
task :jrpg2_files do
   files = Dir["jrpg2/jrpg2_*.{jpg,png}"] + package(["jrpg2-*.tar.gz", "jrpg2-*.zip"])
   cp files, "#{website_root}/jrpg2/"
end

desc "Build jrpg2 package"
task :jrpg2_package do
   rake_remote "../jrpg/", "package2"
   mv package(["jrpg2-*-*.tar.gz", "jrpg2-*-*.zip"], "../jrpg"), "packages/"
end

# RLisp
desc "Build RLisp website (doesn't build package automatically)"
task :rlisp => [:clean_rlisp, "#{website_root}/rlisp", :rlisp_index, :rlisp_files, :pub_rlisp]

desc "Copy RLisp package files to website"
task :rlisp_files do
   cp package(["rlisp-*.tar.gz", "rlisp-*.zip"]), "#{website_root}/rlisp/"
end

desc "Build RLisp package (NOT TESTED YET)"
task :rlisp_package do
   rake_remote "../rlisp/", "package"
   mv package(["rlisp-*-*.tar.gz", "../rlisp/rlisp-*-*.zip"], "../rlisp"), "packages/"
end

Thursday, March 22, 2007

How to code debuggers

Yawn... by _Xti_ from flickr (CC-NC)Coding low-level infrastructure like kernels, compilers, and linkers can be very scary, and most programmers stay as far away from them as they can. And the scariest of all are debuggers, which rip apart warm flesh of innocent programs, and use the dark side of the force to control them. Every decent Computer Science course includes coding at least a toy compiler, most also toy interpretters, virtual machines, and disassemblers, but how many people wrote even the most toyish debugger ? (by debugger I mean any tool for low-level analysis of running programs, wheather it single steps, traces execution, or does something else)

Most debuggers are written in C, so touching their code is pretty much out of the question. It's hard enough without having to use 30 year old and about 32 time less productive tools.

Why would anybody want to write a debugger ? Many people work almost exclusively with high-level languages running on virtual machines, and they probably don't need to - high-level languages provide a lot of reflection and run-time program modification abilities that writing custom debugging tools won't be necessary. Many people are satisfied by available debuggers, or at least find them flexible enough to write custom debugging tools around them. strace plus perl one-liners certainly saved people years of debugging time.

And then there are people who have to do low-level coding, and find perl scripting around existing tools insufficient, but because debuggers seem so scary they simply learned to live with bad tools.

How debuggers work

Debuggers are based on a few simple techniques. I'll talk about Linux and other modern Unices here like Solaris and FreeBSD. On other systems the details differ, but principles are similar.
  • System call ptrace lets processes control other processes.
  • Binaries in ELF format include a lot of useful information.
  • Calls to library functions get resolved only when the program runs. It's very easy to make them point to our functions instead.
  • Compilers emit a lot of useful debugging information in DWARF format.
  • /proc/ contains a lot of information about running programs.
That's about it. Sometimes these techniques are insufficient, for example when debugging kernel bugs. Then there are fancier tricks like kernel modules and running on a virtual machine, but it's very rarely needed.

ptrace

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
Most debugging tools are based on ptrace. ptrace is extremely useful but almost unknown syscall.

To start tracing some process simply include <sys/ptrace.h> in your C program, and attach with ptrace(PTRACE_ATTACH, process_pid, 0, 0).

Then you can read and write its registers:
void *get_instruction_pointer(pid_t pid)
{
return (void *)ptrace(PTRACE_PEEKUSR, pid, 4 * EIP, 0);
}

void set_instruction_pointer(pid_t pid, void *addr)
{
ptrace(PTRACE_POKEUSR, pid, 4 * EIP, (long)addr);
}
and data:
void *get_stack_pointer(pid_t pid)
{
return (void *)ptrace(PTRACE_PEEKUSER, pid, 4 * UESP, 0);
}

void *get_return_addr(pid_t pid, void *stack_pointer)
{
return (void *)ptrace(PTRACE_PEEKTEXT, pid, stack_pointer, 0);
}
For "historical reasons" PEEK means read, and POKE means write. Available requests are:
  • PTRACE_PEEKDATA - read program data
  • PTRACE_POKEDATA - write program data
  • PTRACE_PEEKTEXT - read program code
  • PTRACE_POKETEXT - write program code
  • PTRACE_PEEKUSR - read process information (struct user)
  • PTRACE_POKEUSR - write process information (struct user)
  • PTRACE_GETREGS, PTRACE_GETFPREGS - read general/floating point registers
  • PTRACE_SETREGS, PTRACE_SETFPREGS - write general/floating point registers
  • PTRACE_SINGLESTEP - let program execute single instruction. There's hardware support for this.
  • PTRACE_SYSCALL - trace program until the next syscall
  • PTRACE_CONT - let program run until it gets a signal
  • PTRACE_DETACH - detach from program
  • and a few more.
Under Linux code and data are not separated, so PTRACE_PEEKDATA equals PTRACE_PEEKTEXT and PTRACE_POKEDATA equals PTRACE_POKETEXT. Registers can be read/written either through PEEKUSR/POKEUSR or through GETREGS/SETREGS, whichever is more convenient.

ptrace is very low level. Reads and writes only do one (probably aligned) machine word at time. It's not difficult to wrap these routines in some more convenient interface.

So you know how to manipulate programs. The only missing part is breakpoints. Breakpoints are actually pretty easy:
  • Compute location of the break point
  • Read byte that was there and save it.
  • Put 0xCC byte there (int3 opcode). If program is ptraced, executing int3 will freeze it.
Disabling breakpoits is even easier - just write back the old instruction. Running breakpointed code is a little more tricky:
  • Disable breakpoint (write back old instruction)
  • Run single step (PTRACE_SINGLESTEP)
  • Enable breakpoint (write int3)
  • Continue (PTRACE_CONT)
As we can only read and write full words at time, what actually happens is that we read 4 bytes, modify one of them, and write back 4 bytes (one with breakpoint, and 3 around it).

ptrace in action

There are many ptrace-based tools, like strace for tracing syscalls, ltrace for tracing shared library calls, and gdb uses ptrace too.

ptraceing process can be ptraced, so let's see what happens inside a tracing process. strace -o metastrace strace true traces strace as it traces true and saves the trace to metastrace file.

Here's a fragment:

ptrace(PTRACE_SYSCALL, 7973, 0x1, SIG_0) = 0
wait4(-1, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL, NULL) = 7973

--- SIGCHLD (Child exited) @ 0 (0) ---
ptrace(PTRACE_PEEKUSER, 7973, 4*ORIG_EAX, [0x5]) = 0
ptrace(PTRACE_PEEKUSER, 7973, 4*EAX, [0xffffffda]) = 0
ptrace(PTRACE_PEEKUSER, 7973, 4*EBX, [0xb7f2ab3c]) = 0
ptrace(PTRACE_PEEKUSER, 7973, 4*ECX, [0]) = 0
ptrace(PTRACE_PEEKUSER, 7973, 4*EDX, [0]) = 0

ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab3c, [0x62696c2f]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab40, [0x736c742f]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab44, [0x3836692f]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab48, [0x6d632f36]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab4c, [0x6c2f766f]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab50, [0x2e636269]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab54, [0x362e6f73]) = 0
ptrace(PTRACE_PEEKDATA, 7973, 0xb7f2ab58, [0x62696c00]) = 0

write(2, "open(\"/lib/tls/i686/cmov/libc.so"..., 45) = 45
ptrace(PTRACE_SYSCALL, 7973, 0x1, SIG_0) = 0
wait4(-1, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL, NULL) = 7973

--- SIGCHLD (Child exited) @ 0 (0) ---
ptrace(PTRACE_PEEKUSER, 7973, 4*ORIG_EAX, [0x5]) = 0
ptrace(PTRACE_PEEKUSER, 7973, 4*EAX, [0x3]) = 0

write(2, ") = 3\n", 6) = 6
ptrace(PTRACE_SYSCALL, 7973, 0x1, SIG_0) = 0
wait4(-1, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL, NULL) = 7973

In highlighted regions:
  • Parent requests ptrace to wait for system calls of its child and waits for something to happen.
  • Something happens - child tries to run some system call, and gets virtual signal SIGTRAP. Parent gets SIGCHLD (incorrectly described at "Child exited")
  • Parent reads syscall number (5 = open), and arguments - 0xb7f2ab3c is pointer to the path, 0 is O_RDONLY, and the next 0 is mode.
  • Parent reads string located at 0xb7f2ab3c (the path), until it finds a 0.
  • Parent reports that system call happened.
  • Parent waits for system call to return by PTRACE_SYSCALL and wait4.
  • Parent reads return value from syscall - 3
  • Parent reports that return value of the system call.
  • Parent tells ptrace to keep tracking child's system calls and lets the child continue.
That wasn't bad, was it ?

ELF format

Binaries on most modern Unices use ELF format (106-page specification). A lot more can be done with ELF binaries than just executing them. For one we can check which shared library the program uses.
$ ldd /usr/bin/ruby
linux-gate.so.1 => (0xffffe000)
libruby1.8.so.1.8 => /usr/lib/libruby1.8.so.1.8 (0xb7e3e000)
libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7e27000)
libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7e23000)
libcrypt.so.1 => /lib/tls/i686/cmov/libcrypt.so.1 (0xb7df5000)
libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7dce000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c8d000)
/lib/ld-linux.so.2 (0xb7f29000)
/lib/ld-linux.so is dynamic loader, which looks for other libraries in the path. linux-gate is a pseudolibrary used for communication with kernel. libdl.so is a dynamic library loader and lets the program load more libraries at runtime. More about that later - what's interesting is that all this information is contained in the ELF binary in easily accessible format.

Two programs most useful for getting information from binaries are objdump and readelf. To get default a basic overview simply run readelf -W --all binary or objdump --all-headers binary.

Here's a fragrent of readelf -W --all /usr/bin/perl's output, listing functions which Perl interpretter gets from shared libraries.

Relocation section '.rel.plt' at offset 0x1688c contains 231 entries:
Offset Info Type Sym.Value Sym. Name
08142114 00001d07 R_386_JUMP_SLOT 00000000 readlink
08142118 00003907 R_386_JUMP_SLOT 00000000 nl_langinfo
0814211c 00003a07 R_386_JUMP_SLOT 00000000 mkdir
08142120 00004507 R_386_JUMP_SLOT 00000000 pthread_getspecific
08142124 00004807 R_386_JUMP_SLOT 00000000 cos
08142128 00005807 R_386_JUMP_SLOT 00000000 fgetc
0814212c 00006007 R_386_JUMP_SLOT 00000000 chown
08142130 00006907 R_386_JUMP_SLOT 00000000 getgrnam_r
08142134 00006e07 R_386_JUMP_SLOT 00000000 __strtod_internal
08142138 00007107 R_386_JUMP_SLOT 00000000 setservent
0814213c 00007a07 R_386_JUMP_SLOT 00000000 rename
08142140 00008b07 R_386_JUMP_SLOT 00000000 ferror
08142144 00009207 R_386_JUMP_SLOT 00000000 sigaction
08142148 00009b07 R_386_JUMP_SLOT 00000000 getgrent_r
0814214c 00009d07 R_386_JUMP_SLOT 00000000 execl
08142150 0000a207 R_386_JUMP_SLOT 00000000 vsprintf
08142154 0000a307 R_386_JUMP_SLOT 00000000 strchr
08142158 0000b607 R_386_JUMP_SLOT 00000000 fdopen
Here's a partial list of functions provided by libjpeg, found by readelf -all /usr/lib/libjpeg.so:
Symbol table '.dynsym' contains 141 entries:
Num: Value Size Type Bind Vis Ndx Name
12: 00008581 664 FUNC GLOBAL DEFAULT 10 jpeg_gen_optimal_table
13: 00002e6b 652 FUNC GLOBAL DEFAULT 10 jpeg_copy_critical_parame
14: 00005ba0 537 FUNC GLOBAL DEFAULT 10 jinit_c_prep_controller
16: 0000d1be 115 FUNC GLOBAL DEFAULT 10 jinit_input_controller
17: 0000c150 159 FUNC GLOBAL DEFAULT 10 jpeg_read_scanlines
18: 0001b414 35 FUNC GLOBAL DEFAULT 10 jpeg_get_small
19: 000146b4 1343 FUNC GLOBAL DEFAULT 10 jpeg_idct_float
20: 000032c1 969 FUNC GLOBAL DEFAULT 10 jpeg_set_colorspace
21: 00003280 65 FUNC GLOBAL DEFAULT 10 jpeg_quality_scaling
22: 000037fd 760 FUNC GLOBAL DEFAULT 10 jpeg_simple_progression
23: 0000b280 655 FUNC GLOBAL DEFAULT 10 jpeg_fdct_ifast
24: 00010487 1045 FUNC GLOBAL DEFAULT 10 jpeg_make_d_derived_tbl
25: 000160e3 59 FUNC GLOBAL DEFAULT 10 jpeg_idct_1x1
26: 0000fd59 309 FUNC GLOBAL DEFAULT 10 jpeg_huff_decode
27: 00014bf4 2667 FUNC GLOBAL DEFAULT 10 jpeg_idct_islow
28: 0000cd9e 919 FUNC GLOBAL DEFAULT 10 jinit_master_decompress
29: 0000cab6 744 FUNC GLOBAL DEFAULT 10 jpeg_calc_output_dimensio
30: 00003cbe 108 FUNC GLOBAL DEFAULT 10 jpeg_set_linear_quality
Using it on C++ libraries is more difficult, as C++ "mangles" identifiers. Because in C++ multiple functions with the same name are allowed, as long as they use arguments of different types, the compiler "mangles" identifiers to include type information, and it ends up like this: _ZN13nsCOMPtr_base18assign_with_AddRefEP11nsISupports. readelf doesn't seem to provide any demangling, but objdump -C -R /usr/bin/firefox will tell us that the needed function is nsCOMPtr_base::assign_with_AddRef(nsISupports*).

objdump also includes a disassembler, so you can find implementation of geteuid by looking at output of objdump -d /lib/libc.so.6:
0008c750 <geteuid>:
8c750: 55 push %ebp
8c751: 89 e5 mov %esp,%ebp
8c753: b8 c9 00 00 00 mov $0xc9,%eax
8c758: cd 80 int $0x80
8c75a: 5d pop %ebp
8c75b: c3 ret
For those that want to know - int 0x80 enters the kernel, and %eax contains syscall number (0xc9 is obviously one for geteuid). At least that's how it used to work. The "real" libc in /lib/tls/i686/cmov/libc.so.6 uses call *%gs:0x10 instead of int $0x80. (that's related to linux-gate.so.1 pseudolibrary)

Shared libraries

Let's get back to ldd:
$ ldd /usr/bin/ruby
linux-gate.so.1 => (0xffffe000)
libruby1.8.so.1.8 => /usr/lib/libruby1.8.so.1.8 (0xb7e3e000)
libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7e27000)
libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7e23000)
libcrypt.so.1 => /lib/tls/i686/cmov/libcrypt.so.1 (0xb7df5000)
libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7dce000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c8d000)
/lib/ld-linux.so.2 (0xb7f29000)
Only the dynamic linker is specified by full path. Everything else is resolved at runtime. Like binaries are looked for in PATH, libraries are searched in directories in LD_LIBRARY_PATH. So we can trivially make program use our libraries by changing LD_LIBRARY_PATH (like most techniques presented here, it doesn't work with setuid programs, so don't bother).

Even more effective is LD_PRELOAD, which simply lets us preload another library before the program runs. It can be any shared library, but usually we want to

Let's fool /bin/date into thinking it's noon (UTC) of July 21, 1995.

The library to do so it truly trivial:
#include <time.h>

int clock_gettime(clockid_t clk_id, struct timespec *tp)
{
tp->tv_sec = 806328000;
tp->tv_nsec = 0;
return 0;
}
We need to tell compiler we're compiling shared library, not a standalone binary by -fPIC -shared flags:
gcc -Wall -W -g -fPIC -shared magic_clock.c -o magic_clock.so
And:
$ date
Thu Mar 22 08:12:36 CET 2007
$ LD_PRELOAD=./magic_clock.so date
Fri Jul 21 14:00:00 CEST 1995
It even remembered to change to summer time.

Usually we want to have the original functionality available. I think it's enough hello worlds for this post, so let's do something useful, like getting statistics for hash lookups in Ruby.

Hash lookups in Ruby go through st_lookup function. We're interested in distribution of sizes of hash tables used. This kind of data is useful for profiling - maybe it will provide hints wheather replacing Ruby hashes by Judy is worthwhile. Or maybe we just want to know. Anyway, here's the debugging tool:
#define _GNU_SOURCE 1
#include <stdio.h>
#include <dlfcn.h>

/* Copied from ruby's st.h - it's the easiest way */
typedef unsigned long st_data_t;
struct st_table {
struct st_hash_type *type;
int num_bins;
int num_entries;
struct st_table_entry **bins;
};

int statistics[128];

int (*real_st_lookup) (struct st_table *, st_data_t, st_data_t *);

int st_lookup(struct st_table *table, st_data_t key, st_data_t *value)
{
int size = table->num_entries;
if(size > 15)
size = 15 + (size / 16);
if(size > 127)
size = 127;
statistics[size]++;

return real_st_lookup(table, key, value);
}

void st_lookup_stats_init() __attribute__ ((constructor));
void st_lookup_stats_init()
{
int i;
for(i=0; i<128; i++)
statistics[i] = 0;

real_st_lookup = dlsym(RTLD_NEXT, "st_lookup");
}

void st_lookup_stats_fini() __attribute__ ((destructor));
void st_lookup_stats_fini()
{
int i, total=0;

for(i=0; i<128; i++)
total += statistics[i];

printf("Hash lookup statistics by hash size:\n");
for(i=0; i<16; i++)
printf("%d-element hashes: %d (%.1f%%)\n", i, statistics[i], 100.0 * statistics[i] / total);
for(i=16; i<127; i++)
printf("%d..%d-element hashes: %d (%.1f%%)\n", (i-15)*16, (i-15)*16+15, statistics[i], 100.0 * statistics[i] / total);
printf("1792+-element hashes: %d (%.1f%%)\n", statistics[127], 100.0 * statistics[127] / total);
}
Compile it with:
gcc -Wall -W -g -fPIC -shared -ldl -lruby1.8 st_lookup_stats.c -o st_lookup_stats.so
and run with:
$ LD_PRELOAD=./st_lookup_stats.so ruby -e 'p 42'
42
Hash lookup statistics by hash size:
0-element hashes: 164 (2.2%)
1-element hashes: 1059 (14.3%)
2-element hashes: 20 (0.3%)
3-element hashes: 18 (0.2%)
4-element hashes: 17 (0.2%)
5-element hashes: 13 (0.2%)
6-element hashes: 17 (0.2%)
7-element hashes: 12 (0.2%)
8-element hashes: 19 (0.3%)
9-element hashes: 192 (2.6%)
10-element hashes: 10 (0.1%)
11-element hashes: 14 (0.2%)
12-element hashes: 10 (0.1%)
13-element hashes: 12 (0.2%)
14-element hashes: 12 (0.2%)
15-element hashes: 12 (0.2%)
16..31-element hashes: 185 (2.5%)
32..47-element hashes: 158 (2.1%)
48..63-element hashes: 171 (2.3%)
64..79-element hashes: 171 (2.3%)
80..95-element hashes: 126 (1.7%)
96..111-element hashes: 171 (2.3%)
112..127-element hashes: 92 (1.2%)
128..143-element hashes: 48 (0.7%)
144..159-element hashes: 103 (1.4%)
160..175-element hashes: 49 (0.7%)
176..191-element hashes: 95 (1.3%)
192..207-element hashes: 60 (0.8%)
[...]
1760..1775-element hashes: 0 (0.0%)
1776..1791-element hashes: 0 (0.0%)
1792+-element hashes: 0 (0.0%)
OK, now the explanations. int statistics[128]; keeps our statistics and st_lookup_stats_fini() function prints them out.

In ELF it's possible to mark some functions as constructors and destructors.

void st_lookup_stats_init() __attribute__ ((constructor)); tells gcc to mark st_lookup_stats_init as constructor, so it runs when the library gets loaded.

void st_lookup_stats_fini() __attribute__ ((destructor)); tells gcc to mark st_lookup_stats_fini as destructor, so it runs when the program exits.

void st_lookup_stats_init() zeroes statistics and find out location of the real st_lookup: real_st_lookup = dlsym(RTLD_NEXT, "st_lookup");

Our fake st_lookup records hash size and calls the genuine st_lookup.

st_data_t and struct st_table are copied from Ruby sources because i was too lazy to do it the "right way", and it doesn't matter anyway.

Debug information

gdb can control program by ptrace, but it still needs to know about its internals somehow. When gcc is run with -g flag, it saves debug information in DWARF format (117-page specification of DWARF-2).

Unfortunately most programs are distributed without this information, so we need to recompile with -g to get it. For all ELF files (programs, shared libraries, .o object files etc.) that contain debugging information, we can get it with readelf -W -w binary. Some of the useful data not available in other ways are types of function arguments, structure members and their offsets, and file/line information for everything.

Unfortunately parsing either binary DWARF or output of readelf is rather complicated. I haven't tried it out yet, but there seems to be a program for converting DWARF-2 information to XML. The XML it produces seems rather low-level, but it's XML, so it won't be hard to get useful information out of it.

/proc

/proc filesystem contains a lot of information about running processes. Most of this information can be extracted by ptrace, but /proc interface is far more convenient.

Some files useful for debugging:
  • /proc/<pid>/cmdline - process command line arguments
  • /proc/<pid>/cwd - symlink to current working directory of the process
  • /proc/<pid>/environ - process environment
  • /proc/<pid>/exe - symlink to executable
  • /proc/<pid>/fd/* - open file descriptors
  • /proc/<pid>/maps - map of process memory
  • /proc/<pid>/mem - entire process memory
That's pretty much all you need to know to start. Happy debugger coding.

Wednesday, March 21, 2007

Big O analysis considered harmful

Jack in Sink (the early years) by grebo guru from flickr (CC-ND)My previous post wasn't supposed to be about performance - just about x86 assembly. Somehow what grabbed most attention was criticism of the big O analysis. Comments here and on reddit seem to show that most people didn't really get my point.
All models are wrong. Some are useful.
Models are just tools that make reasoning about something easier - in many cases to make it possible at all. Real world systems are extremely complex. Models distort reality to make it more amenable to analysis. Classical mechanics is one such successful model. We know it is wrong, as it ignores quantum effects, and relativity. Most of the time, many other physical effect are ignored - like rotations of Earth possibility of object properties changing with temperature, and in some cases even friction. Most models in economics assume that capital is homogeneous, and bioinformatics even sometimes pretends that molecules are rigid. As long as distortions introduced by the model are not particularly relevant for our purposes, the model is fine. But even models known to seriously distort the very feature we're interested in can be useful. To find molecules likely to have medical effects, millions of molecules must be checked. Good models of molecular interactions are just too slow, so simple models in which molecules are rigid or have limited flexibility are used first to get a rough approximation, what lets us focus on more promising candidates. Sometimes a great molecule will be overlooked this way, and we simply have to accept that. To find a good developer, you don't offer a trial period to every candidate to evaluate them - it's quite accurate model, but way too expensive. So cheaper models like interviews and CVs are used, even though they distort reality a lot. There's nothing wrong with models being wrong. We need wrong models. There's just no way we'd perform a randomized double-blind study on a few thousand people every time we want to check wheather some molecule is medically useful. A lot of simpler models - computer simulations, chemical experiments, cell culture and animal tests - are used first. Very often the simple models give incorrent answers. Many drugs they found safe and effective were later rejected in clinical trials (or worse - after being introduced to the market). Most likely many drugs that would be safe and effective in humans were rejected because of such simple tests. We need such models like we need maps. Depending on our purpose, we need different models like we need different kinds of maps.
The map is not the territory.

Performance models

It's very easy to get used to a model - to keep using it even when it's totally inapplicable. A great example of a misguided performance model is Amdahl's Law. According to this law, programs have part that can be parallelized (1 − α) and part that has to be serial (α). So by throwing P processors at the problem, we get running time of:
T = α + (1 − α)/P
So if the serial part takes 10% of time on single processor, throwing whopping 256 processors at the problem gives only meager 9.66x speedup. The assumption which makes the model completely useless is serial part being constant. But the more processors we have, the bigger problems we deal with, and almost invariably, the smaller the serial part becomes. "Embarrassingly parallel" problems for which speedup is almost proportional to number of processors thrown at the problem are extremely common. Model which better matches reality is Gustafson's Law, where speedup is:
S = α + (1 − α) P
and throwing 256 processors at a problem with 10% serial time achieves 230.5x speedup ! Big O analysis on random access machine is just one of many models of performance. If used right, it can be useful. Unfortunately on modern hardware, it's almost completely useless for predicting performance. Worse than that - it is harmful, as very widespread belief in big O analysis slows down adoption of more accurate models.

Big O analysis

Here's a quick sketch of big O analysis in action.
  • Write down the algorithm in terms of basic operations.
  • Select relevant variable or variables. Usually it's input size n.
  • Analyze how number of basic operations depends on variables like n. We're only interested in asymptotic upper bound, ignoring the "constant factor".
  • The result is supposed to tell us how fast is the algorithm in the worst case. Algorithms with better O-class are supposed to be faster, "except for maybe small n".
The problem is the very notion of "basic operation". Big O analysis assumes that different basic operations take the same amount of time modulo a constant factor, which is small, and definitely independent of n. Unfortunately, as will be demonstrated, "the same" basic operation can easily take more than million times longer, depending on factors completely ignored by the big O analysis. x = data[i]; can as easily mean 1 ns copy from L1 cache to register or 8 000 000 ns to seek disk head, read whole 4kB page from swap file to memory, propagate the data through (optional L3), L2, and L1 caches, and finally copy it to the register. Modern hardware has multi-level memory. For small problem sizes all data fits L1 or L2 cache. For moderate sizes memory needs to be used. For big problem sizes disks are used. Even bigger than that it's tapes or distributed storage or something even more fancy. There is no n beyond which all memory access can be assumed to take uniform time, and up to which we can ignore all evidence against big O analysis as "small n effect". Even when we know at what level we're operating, counting operations is still useless - memories have high enough throughput, but very high latency. So accessing them the right way (typically long sequential reads) can be many orders of magnitude faster than accessing them the wrong way. The following table shows how much throughput DDR2 ram and 7200 rpm hard disks provide with sequential access (best case) and reading 4 bytes from random locations (worst case).
Memory levelLatencyWorst case throughputBest case throughputBest to worst ratio
Memory (DDR2-533 SDRAM 4–4–4)15 ns254 MB/s8.5 GB/s34:1
Hard drive (7200 rpm)4.17 ms7.7kB/s80 MB/s87500:1
Big O analysis won't tell us whether x = data[i]; accesses L1, L2, L3, main memory, disk, or some sort of distributed storage. And within particular memory level, it won't tell us wheather memory access patterns are of slow or fast kind (what for hard disks that makes 87500:1 difference !). Such effects are relatively benign for problems fitting L1 or L2 cache, and become far more important for larger n. Big O analysis is reasonably accurate only for small n !

Experiment

I wanted to run a simple experiment that would show performance of sequential and random access. The benchmark simply mmaps a big file full of random data, and adds its elements. In one version elements are accessed sequentially, in the other version "pseudorandomly". Pseudorandom access is just a few instructions, most expensive of them being j % size. Real (wall clock) time is measured, not user+sys. If less than 10s passed, program reruns the benchmark. This is done to improve measurement quality for very small n. The code is compiled with -O6 -march=athlon-xp -fomit-frame-pointer, and is optimized pretty well by the compiler. The machine has 512MB memory. Sequential version:

#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
   char *file = "./huge_file";
   size_t max_size = 2560LL*1024*1024; /* 2.5GB */
   int fd;
   int *data;
   unsigned int size = 256;
   struct timeval start_time, end_time;
   unsigned int i;
   int res;
   double time_elapsed;
   int iterations;

   fd = open(file, O_RDONLY|O_LARGEFILE);
   if(fd == -1)
   {
       fprintf(stderr, "open failed: %s\n", strerror(errno));
       return 1;
   }

   data = mmap(NULL, max_size, PROT_READ, MAP_PRIVATE, fd, 0);
   if(data == MAP_FAILED)
   {
       fprintf(stderr, "mmap failed: %s\n", strerror(errno));
       return 1;
   }
   /* Tells kernel to expect sequential access */
   madvise(data, max_size, MADV_SEQUENTIAL|MADV_WILLNEED);

   while(size * sizeof(int) <= max_size)
   {
       gettimeofday(&start_time, 0);
       iterations = 0;

       /* At least one full iteration */
       do {
           iterations++;
           res = 0;
           for(i=0; i<size; i++)
               res += data[i];
           gettimeofday(&end_time, 0);
           time_elapsed = (end_time.tv_sec  - start_time.tv_sec)
                        + (end_time.tv_usec - start_time.tv_usec) * 1e-6;
       } while(time_elapsed < 10.0);

       printf("n=%d t/op=%g us t/op/n=%g ns res=%d (%f/%d)\n",
              size,
              time_elapsed*1000000/iterations,
              time_elapsed*1000000000/iterations/size,
              res,
              time_elapsed,
              iterations);

       if(size * sizeof(int) == max_size)
          break;
       size *= 1.5;
       if(size * sizeof(int) > max_size)
           size = max_size / sizeof(int);
   }

   return 0;
}
Pseudorandom version:

#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
   char *file = "./huge_file";
   size_t max_size = 2560LL*1024*1024; /* 2.5GB */
   int fd;
   int *data;
   unsigned int size = 256;
   struct timeval start_time, end_time;
   unsigned int i;
   unsigned int j;
   int res;
   double time_elapsed;
   int iterations;

   fd = open(file, O_RDONLY|O_LARGEFILE);
   if(fd == -1)
   {
       fprintf(stderr, "open failed: %s\n", strerror(errno));
       return 1;
   }

   data = mmap(NULL, max_size, PROT_READ, MAP_PRIVATE, fd, 0);
   if(data == MAP_FAILED)
   {
       fprintf(stderr, "mmap failed: %s\n", strerror(errno));
       return 1;
   }
   /* Tells kernel to expect sequential access */
   madvise(data, max_size, MADV_RANDOM|MADV_WILLNEED);

   while(size * sizeof(int) <= max_size)
   {
       gettimeofday(&start_time, 0);
       iterations = 0;

       /* At least one full iteration */
       do {
           iterations++;
           res = 0;
           /* j is a fast pseudorandom number */
           j = 87230330;

           for(i=0; i<size; i++)
           {
               j += 2024939473;
               res += data[j % size];
           }
           gettimeofday(&end_time, 0);
           time_elapsed = (end_time.tv_sec  - start_time.tv_sec)
                        + (end_time.tv_usec - start_time.tv_usec) * 1e-6;
       } while(time_elapsed < 10.0);

       printf("n=%d t/op=%g us t/op/n=%g ns res=%d (%f/%d)\n",
              size,
              time_elapsed*1000000/iterations,
              time_elapsed*1000000000/iterations/size,
              res,
              time_elapsed,
              iterations);

       if(size * sizeof(int) == max_size)
          break;
       size *= 1.5;
       if(size * sizeof(int) > max_size)
           size = max_size / sizeof(int);
   }

   return 0;
}
Both programs perform exactly the same number of memory loads. According to big O analysis, their performance should differ only by a small constant factor. The following table shows time/operation.
nMemory sizeSequential time/operationPseudorandom time/operation
2561 kB3.4 ns31.8 ns
3841.5 kB2.9 ns31.5 ns
5762.2 kB2.7 ns31.3 ns
8643.4 kB2.4 ns31.1 ns
12965.1 kB2.3 ns31.0 ns
19447.6 kB2.2 ns31.0 ns
291611.4 kB2.2 ns30.9 ns
437417.1 kB2.1 ns30.9 ns
656125.6 kB2.1 ns30.9 ns
984138.4 kB2.2 ns31.0 ns
1476157.7 kB2.4 ns30.9 ns
2214186.5 kB2.7 ns31.0 ns
33211129.7 kB6.3 ns45.3 ns
49816194.6 kB10.6 ns182.5 ns
74724291.9 kB12.0 ns142.9 ns
112086437.8 kB10.6 ns189.6 ns
168129656.8 kB10.7 ns199.9 ns
252193985.1 kB10.6 ns232.1 ns
3782891.4 MB12.1 ns241.1 ns
5674332.2 MB10.9 ns240.9 ns
8511493.2 MB10.7 ns258.5 ns
12767234.9 MB10.7 ns260.7 ns
19150847.3 MB11.6 ns261.2 ns
287262611.0 MB10.8 ns265.0 ns
430893916.4 MB10.7 ns272.7 ns
646340824.7 MB10.8 ns297.0 ns
969511237.0 MB10.8 ns400.8 ns
1454266855.5 MB10.8 ns439.5 ns
2181400283.2 MB11.3 ns476.6 ns
32721003124.8 MB13.5 ns445.2 ns
49081504187.2 MB19.7 ns334.0 ns
73622256280.8 MB28.4 ns502.0 ns
110433384421.3 MB156.4 ns1028.4 ns
165650076631.9 MB422.6 ns150079.2 ns
248475114947.9 MB415.8 ns
3727126711.4 GB418.1 ns
5590690062.1 GB457.2 ns
6710886402.5 GB481.3 ns
In sequential algorithm, for very low n all data fits the cache, and time per operation is roughly constant. For moderate n the algorithm seems to be limited by memory transfer rates, but time per operation is still roughly constant. For high n disk transfer rates are the limitting factor, and time per operation reaches the third plateau. Overall time per operation raised 200 times, so if our performance analysis missed that, it's pretty much useless. On the other hand in practice it's very atypical to use a single algorithm for such a wide range of data sizes, and as long as we stay in one plateau, "constant time per operatioan" assumption can be at least a decent first approximation. Time in sequential case in median of three runs. Now, the random algorithm. For very low n all data fits the cache, and access times are dominated by computations of pseudorandom index j. For moderate n, time is dominated by memory latency, not memory throughput ! If memory latency wasn't a problem, time per operation would still be about 30 ns slower than in the sequential case. The real problem starts when the data no longer fits memory and disk needs to be used. I wasn't able to wait for n=165650076 case to finish. Instead, I had to freeze program, attach gdb, dump registers, and reverse engineer their values to find i (i was optimized away, but j is in %ecx, so it wasn't a big problem). That's 400x slower than sequential algorithm operating on problem of the same size, and 75 000 times slower than naive "constant time per operation" assumption would suggest. Some additional tests show that 150us is not the worst it gets. For bigger n seek times seem to get even longer (average seek times for disks are in a few ms range). Big O analysis won't make a come-back with bigger n. On the contrary, its predictions only get less and less accurate. Try running a quicksort on a few TBs of data if you don't believe me. That many orders of magnitude are far more important than an log(n) factor, or even a n factor (at least if it's worst case, not average case). Here are assembly decodes of innermost loops, just to show nothing funny is going on. Sequential:

80485f0:       03 1c 87                add    (%edi,%eax,4),%ebx
80485f3:       40                      inc    %eax
80485f4:       39 f0                   cmp    %esi,%eax
80485f6:       75 f8                   jne    80485f0
Pseudorandom:

80485f6:       81 c1 d1 1f b2 78       add    $0x78b21fd1,%ecx
80485fc:       31 d2                   xor    %edx,%edx
80485fe:       89 c8                   mov    %ecx,%eax
8048600:       f7 f3                   div    %ebx
8048602:       03 7c 95 00             add    0x0(%ebp,%edx,4),%edi
8048606:       39 ce                   cmp    %ecx,%esi
8048608:       75 ec                   jne    80485f6

The right way

Memory hierarchy is not the only problem with big O analysis. Another problem is overuse of worst case running times, which in many cases are far from typical. The only way of performance analysis that is pretty much guaranteed to work is getting actual load data and running benchmarks with them. Sometimes artificially constructed benchmarks and theoretical performance models (including the big O analysis) are appropriate, but don't blindly assume that's true in your case. As saying "benchmark with real data" is easy, but not very useful, here are A few bits of advice:
  • Benchmark with real data, or as realistic as you can.
  • Non-uniform access patterns are faster than uniform. One million accesses to a single memory cell guarantee it's going to be in L1 cache. One million accesses to different memory cells each mean RAM. That's why accessing top levels of search trees is cheaper than accessing lower levels.
  • Sequential is faster than random, much faster.
  • Keep data that is accessed together close to each other.
  • Keep data of different "temperatures" far away. Keeping very frequently and very rarely accessed data on the same cache line means cache space is wasted. This is particularly important with main memory caching of disk files, where cache line size is something like 4kB. A great illustration of this principle are B+ Trees, which keep all data (rarely accessed) on leaf level, so more indexes (frequently accessed) fit memory or caches. B-trees keeps data mixed with indexes. That's a huge waste of cache, as a lot of rarely accessed data must be hold in memory, and fewer frequently accessed indexes fit.
  • Reuse standard library data structures and algorithms - they're likely to be reasonably well-tuned. Often they perform worse than what you could code, usually by trying to be more generic than they need to be, or by not being optimized for your hardware. More often than not, they're better than what you'd do in reasonable amount of time, and in any case fast enough.
  • Learning basics of assembly, cache architecture, and other low-level stuff certainly won't hurt more than learning J2EE, and may even be more useful.
  • Did I mention you should get real data ?

Saturday, March 17, 2007

Modern x86 assembly

Upside-bun by greefus groinks from flickr (CC-SA)Nobody writes programs in assembly any more. Bits of assembly are used in compilers, JIT compilers, kernels, and for some performance-critical code like codecs. Few people write such code. You may want to try it anyway, as the best programmers write compilers, at least according to Steve Yegge. Knowing assembly and low-level details of your architecture is still important. Not for writing programs or even small pieces of code in it. It is important because it lets you understand what's really going on. High-level programming languages provide highly abstract views, and like all complex abstractions, they leak a lot. Without going below the abstract view, it's going to be very difficult to reason about performance or security, or to fix things when they break. Many people have irrational fear of the low level. Some (mostly in academia) go as far as claiming that worst-case "Big O analysis" is an adequate performance evaluation tool. If you're one of them, I'm sorry to burst your bubble. I'll show you a few simple examples of Big O analysis giving completely wrong results later on.

Ancient x86 assembly

Back in early 1990s, widely available compilers produced seriously suboptimal code (unoptimized and 16-bit), and it was necessary to use inline assembly to get anywhere close to decent performance. You might remember that in these times calling a function z = foo(x,y) was done by pushing items on hardware stack, calling a specific memory address, and getting the results back from ax register:
push x
push y
call foo
mov z, ax
or to use 32 bits and more modern notation something like:
pushl 0x4(%ebp) <x>
pushl 0x8(%ebp) <y>
call 8012EF0 <foo>
move 0xc(%ebp) <z>, %eax
Even though the basics of function calls stayed the same - arguments are placed on the hardware stack, and return value is in %eax register, code created by modern compiler looks nothing like that.

Alignment

In ancient times, memory access was just memory access. Each load and each store costed the same. Nowadays, cost of loads and stores differs a lot. There are multiple levels of caches, and it's very important for memory to be "aligned" - computers work much faster when the last bits of memory they access are all 0s. Let's write a C function for allocating precisely aligned memory.
#include <stdio.h>
#include <stdlib.h>

/*
* Returns memory, address of which ends in 1 and align_bits 0s
* Warning: Leaks memory
*/
void *aligned_malloc(int size, int align_bits)
{
   char *data = malloc(size + (2 << align_bits));
   unsigned int alignment_mask = (1 << align_bits) - 1;

   /* Guarantee that lowest align_bits of address are all 0 */
   data = (char*)(((unsigned int)data + alignment_mask) & ~alignment_mask);

   /* Guarantee that the next bit of address is 1 */
   if (((unsigned int)data & (1 << align_bits)) == 0)
       data += (1 << align_bits);
   return (void*)data;
}

int main(int argc, char **argv)
{
   double *x;
   double *y;
   double *z;
   int alignment_bits;

   if(argc != 2) {
       fprintf(stderr, "Usage: %s <alignment bits>\n", argv[0]);
       return 1;
   }
   alignment_bits = atoi(argv[1]);

   x = (double*)aligned_malloc(sizeof(double), alignment_bits);
   y = (double*)aligned_malloc(sizeof(double), alignment_bits);
   z = (double*)aligned_malloc(sizeof(double), alignment_bits);

   printf("x    = %p..%p\n", x, ((char*)(x+1))-1);
   printf("y    = %p..%p\n", y, ((char*)(y+1))-1);
   printf("z    = %p..%p\n", z, ((char*)(z+1))-1);

   return 0;
}
It seems to work just fine. Also notice that big blocks of memory were allocated in a different place (0xb7d3****) than small ones (0x804****).
$ ./alignment_test 0
x    = 0x804a009..0x804a010
y    = 0x804a019..0x804a020
z    = 0x804a029..0x804a030
$ ./alignment_test 1
x    = 0x804a00a..0x804a011
y    = 0x804a01a..0x804a021
z    = 0x804a02a..0x804a031
$ ./alignment_test 2
x    = 0x804a00c..0x804a013
y    = 0x804a024..0x804a02b
z    = 0x804a03c..0x804a043
$ ./alignment_test 3
x    = 0x804a008..0x804a00f
y    = 0x804a028..0x804a02f
z    = 0x804a048..0x804a04f
$ ./alignment_test 4
x    = 0x804a010..0x804a017
y    = 0x804a050..0x804a057
z    = 0x804a070..0x804a077
$ ./alignment_test 16
x    = 0xb7d70000..0xb7d70007
y    = 0xb7d50000..0xb7d50007
z    = 0xb7d30000..0xb7d30007
So let's test the actual performance.
#include <stdio.h>
#include <stdlib.h>

/*
* Returns memory, address of which ends in 1 and align_bits 0s
* Warning: Leaks memory
*/
void *aligned_malloc(int size, int align_bits)
{
   char *data = malloc(size + (2 << align_bits));
   unsigned int alignment_mask = (1 << align_bits) - 1;

   /* Guarantee that lowest align_bits of address are all 0 */
   data = (char*)(((unsigned int)data + alignment_mask) & ~alignment_mask);

   /* Guarantee that the next bit of address is 1 */
   if (((unsigned int)data & (1 << align_bits)) == 0)
       data += (1 << align_bits);
   return (void*)data;
}

int main(int argc, char **argv)
{
   double *x;
   double *y;
   double *z;
   int alignment_bits;
   int i;

   if(argc != 3) {
       fprintf(stderr, "Usage: %s <alignment bits> <iterations>\n", argv[0]);
       return 1;
   }
   alignment_bits = atoi(argv[1]);

   x = (double*)aligned_malloc(sizeof(double), alignment_bits);
   y = (double*)aligned_malloc(sizeof(double), alignment_bits);
   z = (double*)aligned_malloc(sizeof(double), alignment_bits);

   *x = 1.0;
   *y = 2.0;
   *z =-3.0;

   for(i=atoi(argv[2]); i != 0; i--)
   {
       *y += *x;
       *z -= *y;
       *x += *z;
   }

   return 0;
}

$ ./alignment_benchmark 0 500000000
real    0m19.363s
user    0m14.241s
sys     0m0.136s

$ ./alignment_benchmark 1 500000000
real    0m19.734s
user    0m14.005s
sys     0m0.152s

$ ./alignment_benchmark 2 500000000

real    0m19.802s
user    0m14.441s
sys     0m0.144s

$ ./alignment_benchmark 3 500000000

real    0m11.570s
user    0m9.285s
sys     0m0.092s

$ ./alignment_benchmark 4 500000000

real    0m13.542s
user    0m9.049s
sys     0m0.076s

$ ./alignment_benchmark 16 500000000

real    0m43.321s
user    0m32.066s
sys     0m0.304s
And indeed, 8/16-byte alignment makes the code much faster than 1/2/4-byte alignment. On the other hand performance of 64kB-alignment is terrible. It is actually caused by caching issues, not alignment issues. Notice that from high-level C point of view, the operations in all benchmarks are identical, so the program should take the same amount of time to run, except for maybe taking a bit longer to malloc more memory (actually malloc time in ./alignment_benchmark 16 500000000 is negligible - the effect is caused by caching alone). So much for C being "close to the hardware".

Caching

Compilers handle alignment magically, so normally you don't need to bother. The only reason it was possible to see the difference was casting pointers to unsigned int and messing with bare bits. What compilers don't handle magically is caching. To see how important it is, let's compare Big O analysis enthusiasts' favourite data structure - linked list, with real programmers' favourite - plain array. The test will be iterating over allocated list/array and adding everything in it. First, the theory. Both data structures are obviously O(n). Number of memory loads is supposedly two times higher for linked lists than for plain arrays. In theory it doesn't make a difference whether we iterate sequentially or randomly.
#include <stdio.h>
#include <stdlib.h>

struct list_node {
   int value;
   struct list_node * next;
};

int main(int argc, char **argv)
{
   struct list_node *list_head = 0, *tmp;
   int *array = 0;
   int test;
   int size;
   int iterations;
   int i;
   int result = 0;

   if(argc != 4) {
       fprintf(stderr, "Usage: %s <data structure> <iterations> <elements>\n", argv[0]);
       return 1;
   }

   test = atoi(argv[1]);
   size = atoi(argv[3]);

   switch(test){
   case 0:
       array = malloc(sizeof(int) * size);
       for(i = 0; i<size; i++) {
           array[i] = i;
       }
   break;

   case 1:
       for(i = 0; i<size; i++) {
           tmp = list_head;
           list_head = malloc(sizeof(struct list_node));
           list_head->value = i;
           list_head->next = tmp;
       }
   break;
   }

   for(iterations = atoi(argv[2]); iterations!=0; iterations--)
   {
       switch(test) {
       case 0:
           for(i = 0; i<size; i++) {
               result += array[i];
           }
       break;

       case 1:
           for(tmp = list_head; tmp; tmp=tmp->next) {
               result += tmp->value;
           }
       break;
       }
   }
   printf("Result is %d\n", result);

   return 0;
}
This benchmark lets us change number of iterations and size of data structure. According to the theory, difference between linked lists and plain arrays shouldn't matter much, and as they're both obviously O(n), size shouldn't matter at all, as long as number of iterations is halved when size is doubled. Reality shows a diametrically different picture.
$ time ./lists_benchmark 0 1000000 100 >/dev/null
real    0m0.192s
user    0m0.180s
sys     0m0.012s

$ time ./lists_benchmark 0 100000 1000 >/dev/null
real    0m0.188s
user    0m0.168s
sys     0m0.012s

$ time ./lists_benchmark 0 10000 10000 >/dev/null
real    0m0.189s
user    0m0.184s
sys     0m0.000s

$ time ./lists_benchmark 0 1000 100000 >/dev/null
real    0m0.960s
user    0m0.936s
sys     0m0.008s

$ time ./lists_benchmark 0 100 1000000 >/dev/null
real    0m1.023s
user    0m1.000s
sys     0m0.012s

$ time ./lists_benchmark 0 10 10000000 >/dev/null
real    0m1.169s
user    0m1.064s
sys     0m0.100s

$ time ./lists_benchmark 1 1000000 100 >/dev/null
real    0m0.280s
user    0m0.272s
sys     0m0.004s

$ time ./lists_benchmark 1 100000 1000 >/dev/null
real    0m0.258s
user    0m0.252s
sys     0m0.004s

$ time ./lists_benchmark 1 10000 10000 >/dev/null
real    0m5.794s
user    0m5.636s
sys     0m0.052s

$ time ./lists_benchmark 1 1000 100000 >/dev/null
real    0m6.019s
user    0m5.860s
sys     0m0.088s

$ time ./lists_benchmark 1 100 1000000 >/dev/null
real    0m6.216s
user    0m5.944s
sys     0m0.120s

$ time ./lists_benchmark 1 10 10000000 >/dev/null
real    0m7.596s
user    0m7.036s
sys     0m0.440s
Big surprise - the algorithms are not O(n) at all, and in fact behave differently with increased data size. Linked lists start 50% worse than plain arrays (pretty much what the theory predicted), but end up 7x slower. The difference between O(n) and observed behaviour is 6x for plain arrays and 26x for linked lists. Yay! What went wrong ? The Big O analysis made a fundamental mistake of ignoring memory hierarchy. Cost of memory access varies from very low (things guaranteed to be in innermost cache, like top of the stack) to huge (memory paged to disk).

Alignment magic

How compilers magically deal with alignment issues ? First, malloc always returns properly aligned memory. On modern x86s this means 16-byte alignment. 4 bytes is enough for integers, 8 bytes is enough for double, but SSE variables need 16 byte alignment, and malloc doesn't know how the memory will be used.
#include <stdio.h>
#include <stdlib.h>

struct packet {
   char a;
   char b;
   int c;
   int d;
   double e;
   char f;
   double g;
};

int main(int argc, char **argv)
{
   struct packet packet;
   printf("a = %p|next free=%p\n", &packet.a, 1+&packet.a);
   printf("b = %p|next free=%p\n", &packet.b, 1+&packet.b);
   printf("c = %p|next free=%p\n", &packet.c, 1+&packet.c);
   printf("d = %p|next free=%p\n", &packet.d, 1+&packet.d);
   printf("e = %p|next free=%p\n", &packet.e, 1+&packet.e);
   printf("f = %p|next free=%p\n", &packet.f, 1+&packet.f);
   printf("g = %p|next free=%p\n", &packet.g, 1+&packet.g);
   return 0;
}
a = 0xbf9cd7dc|next free=0xbf9cd7dd
b = 0xbf9cd7dd|next free=0xbf9cd7de
c = 0xbf9cd7e0|next free=0xbf9cd7e4
d = 0xbf9cd7e4|next free=0xbf9cd7e8
e = 0xbf9cd7e8|next free=0xbf9cd7f0
f = 0xbf9cd7f0|next free=0xbf9cd7f1
g = 0xbf9cd7f4|next free=0xbf9cd7fc
The compiler created holes in data structure to make sure ints and doubles are 4-byte aligned, but didn't bother aligning chars. That's not correct behaviour - doubles should definitely be 8-byte aligned ! I guess it's for backwards compatibility - on older x86 4-byte alignment might suffice, and changing C ABI now would be too painful. What about C being close to hardware again ? The whole data structure was also only 4-byte aligned on the stack. On the other hand as alignment of local variables doesn't cause any compatibility issues, compiler freely reorganized them and made sure doubles on stack are 8-byte aligned:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   double a;
   int b;
   double c;
   int d;

   printf("a = %p|next free=%p\n", &a, 1+&a);
   printf("b = %p|next free=%p\n", &b, 1+&b);
   printf("c = %p|next free=%p\n", &c, 1+&c);
   printf("d = %p|next free=%p\n", &d, 1+&d);
   return 0;
}

a = 0xbf8a1ec8|next free=0xbf8a1ed0
b = 0xbf8a1ed4|next free=0xbf8a1ed8
c = 0xbf8a1ec0|next free=0xbf8a1ec8
d = 0xbf8a1ed0|next free=0xbf8a1ed4
So the order is: double c; double a; int d; int b;. If you're an assembly old timer, you're probably asking how the heck compiler can make sure functions are aligned on stack, especially function that take variable number of arguments (unlike in C++, in C fun() takes variable number of arguments, to declare a no-argument function in C one uses fun(void)).
#include <stdio.h>
#include <stdlib.h>

double *fun() {
   double a;
   return &a;
}

int main(int argc, char **argv)
{
   printf("a3 = %p\n", fun(1, 2, 3));
   printf("a2 = %p\n", fun(4, 5));
   printf("a1 = %p\n", fun(6));
   printf("a0 = %p\n", fun());

   return 0;
}
fun doesn't do anything funny with stack pointer %esp (for clarity of generated assembly, code compiled with -fomit-frame-pointer):
08048380 <fun>:
8048380:       83 ec 14                sub    $0x14,%esp
8048383:       8d 44 24 08             lea    0x8(%esp),%eax
8048387:       83 c4 14                add    $0x14,%esp
804838a:       c3                      ret
Yet doubles are not only perfectly aligned, but all have the same offset.
$ ./align_fun
a3 = 0xbfa27068
a2 = 0xbfa27068
a1 = 0xbfa27068
a0 = 0xbfa27068
How is it possible ? We'd expect call to fun(1, 2, 3) to take 12 bytes more on stack than fun(). What happens is a slight modification of ABI. The first thing main function does is aligning the stack pointer to 16 bytes.
08048390 <main>:
8048390:       8d 4c 24 04             lea    0x4(%esp),%ecx
8048394:       83 e4 f0                and    $0xfffffff0,%esp
8048397:       ff 71 fc                pushl  0xfffffffc(%ecx)
804839a:       51                      push   %ecx
Every function assumes %esp is properly aligned when it's called, and makes sure %esp stays aligned by pushing it a bit more if necessary. One way to do so would be generating extra push instructions, but it's actually more efficient to manipulate %esp directly, and treat stack like a plain array. Somewhat better compiler would subtract from %esp (x86 stack grows backwards), move arguments to top of the stack, call the function, and then add the same number to %esp. Notice there's no good reason to keep subtracting and adding %esp. It's usually enough to make extra space on stack when the function starts (subtract something from %esp), and free that space before the function returns (add the same thing to %esp). A single subtraction allocates space for local variables, arguments, and makes sure the stack is aligned. Notice how calls to fun really look like - no messing with the stack pointer, just moving things to the right places.
08048390 <main>:
8048390:       8d 4c 24 04             lea    0x4(%esp),%ecx
8048394:       83 e4 f0                and    $0xfffffff0,%esp
8048397:       ff 71 fc                pushl  0xfffffffc(%ecx)
804839a:       51                      push   %ecx
804839b:       83 ec 18                sub    $0x18,%esp
804839e:       c7 44 24 08 03 00 00 00 movl   $0x3,0x8(%esp)
80483a6:       c7 44 24 04 02 00 00 00 movl   $0x2,0x4(%esp)
80483ae:       c7 04 24 01 00 00 00    movl   $0x1,(%esp)
80483b5:       e8 c6 ff ff ff          call   8048380 <fun>
80483ba:       c7 04 24 fc 84 04 08    movl   $0x80484fc,(%esp)
80483c1:       89 44 24 04             mov    %eax,0x4(%esp)
80483c5:       e8 f6 fe ff ff          call   80482c0 <printf@plt>
80483ca:       c7 44 24 04 05 00 00 00 movl   $0x5,0x4(%esp)
80483d2:       c7 04 24 04 00 00 00    movl   $0x4,(%esp)
80483d9:       e8 a2 ff ff ff          call   8048380 <fun>
80483de:       c7 04 24 05 85 04 08    movl   $0x8048505,(%esp)
80483e5:       89 44 24 04             mov    %eax,0x4(%esp)
80483e9:       e8 d2 fe ff ff          call   80482c0 <printf@plt>
80483ee:       c7 04 24 06 00 00 00    movl   $0x6,(%esp)
80483f5:       e8 86 ff ff ff          call   8048380 <fun>
80483fa:       c7 04 24 0e 85 04 08    movl   $0x804850e,(%esp)
8048401:       89 44 24 04             mov    %eax,0x4(%esp)
8048405:       e8 b6 fe ff ff          call   80482c0 <printf@plt>
804840a:       e8 71 ff ff ff          call   8048380 <fun>
804840f:       c7 04 24 17 85 04 08    movl   $0x8048517,(%esp)
8048416:       89 44 24 04             mov    %eax,0x4(%esp)
804841a:       e8 a1 fe ff ff          call   80482c0 <printf@plt>
804841f:       83 c4 18                add    $0x18,%esp
8048422:       31 c0                   xor    %eax,%eax
8048424:       59                      pop    %ecx
8048425:       8d 61 fc                lea    0xfffffffc(%ecx),%esp
8048428:       c3                      ret

PLT

There's one more thing - call to fun called a constant place, but call to printf actually called printf@plt. The program doesn't have any idea where printf will reside, it changes from one versions of libc to another. The program doesn't even know where libc will reside. strace /bin/true clearly shows that programs tries to mmap libc without asking for a specific location. The operating system just happens to put it at address 0xb7e60000:

open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
mmap2(NULL, 1312164, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e60000
strace'ing some other program (in this case Ruby) shows libc allocated somewhere else:
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
mmap2(NULL, 1312164, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7cd3000
So how does it work ? Address of the actual printf function is located at 0x8049620 (at least in this case), and printf@plt jumps there:
080482c0 <printf@plt>:
80482c0:       ff 25 20 96 04 08       jmp    *0x8049620
Places like 0x8049620 are filled in when the program starts (actually when they're first needed, but that's a minor technical detail). Why won't we simply modify the code at start time ? Keeping code strictly read only lets the operating system hold only one copy of code in memory, even when multiple programs use it. If we modified the code, operating system would need to keep in memory a separate copy of libc for every process that uses libc. That would be horribly wasteful. Actually a few highly hackful solutions are possible to modify the code and preserve sharing by making the operating system more aware of how linking works, but they'd require significant changes to Unix architecture.