taw's blog: Modern x86 assembly

Nobody writes programs in assembly any more. Bits of assembly are used in compilers, JIT compilers, kernels, and for some performance-critical code like codecs. Few people write such code. You may want to try it anyway, as the best programmers write compilers, at least according to Steve Yegge. Knowing assembly and low-level details of your architecture is still important. Not for writing programs or even small pieces of code in it. It is important because it lets you understand what's really going on. High-level programming languages provide highly abstract views, and like all complex abstractions, they leak a lot. Without going below the abstract view, it's going to be very difficult to reason about performance or security, or to fix things when they break. Many people have irrational fear of the low level. Some (mostly in academia) go as far as claiming that worst-case "Big O analysis" is an adequate performance evaluation tool. If you're one of them, I'm sorry to burst your bubble. I'll show you a few simple examples of Big O analysis giving completely wrong results later on.

Ancient x86 assembly

Back in early 1990s, widely available compilers produced seriously suboptimal code (unoptimized and 16-bit), and it was necessary to use inline assembly to get anywhere close to decent performance. You might remember that in these times calling a function z = foo(x,y) was done by pushing items on hardware stack, calling a specific memory address, and getting the results back from ax register:

push x
push y
call foo
mov z, ax

or to use 32 bits and more modern notation something like:

pushl 0x4(%ebp) <x>
pushl 0x8(%ebp) <y>
call 8012EF0 <foo>
move 0xc(%ebp) <z>, %eax

Even though the basics of function calls stayed the same - arguments are placed on the hardware stack, and return value is in %eax register, code created by modern compiler looks nothing like that.

Alignment

In ancient times, memory access was just memory access. Each load and each store costed the same. Nowadays, cost of loads and stores differs a lot. There are multiple levels of caches, and it's very important for memory to be "aligned" - computers work much faster when the last bits of memory they access are all 0s. Let's write a C function for allocating precisely aligned memory.

#include <stdio.h>
#include <stdlib.h>

/*
* Returns memory, address of which ends in 1 and align_bits 0s
* Warning: Leaks memory
*/
void *aligned_malloc(int size, int align_bits)
{
   char *data = malloc(size + (2 << align_bits));
   unsigned int alignment_mask = (1 << align_bits) - 1;

   /* Guarantee that lowest align_bits of address are all 0 */
   data = (char*)(((unsigned int)data + alignment_mask) & ~alignment_mask);

   /* Guarantee that the next bit of address is 1 */
   if (((unsigned int)data & (1 << align_bits)) == 0)
       data += (1 << align_bits);
   return (void*)data;
}

int main(int argc, char **argv)
{
   double *x;
   double *y;
   double *z;
   int alignment_bits;

   if(argc != 2) {
       fprintf(stderr, "Usage: %s <alignment bits>\n", argv[0]);
       return 1;
   }
   alignment_bits = atoi(argv[1]);

   x = (double*)aligned_malloc(sizeof(double), alignment_bits);
   y = (double*)aligned_malloc(sizeof(double), alignment_bits);
   z = (double*)aligned_malloc(sizeof(double), alignment_bits);

   printf("x    = %p..%p\n", x, ((char*)(x+1))-1);
   printf("y    = %p..%p\n", y, ((char*)(y+1))-1);
   printf("z    = %p..%p\n", z, ((char*)(z+1))-1);

   return 0;
}

It seems to work just fine. Also notice that big blocks of memory were allocated in a different place (0xb7d3****) than small ones (0x804****).

$ ./alignment_test 0
x    = 0x804a009..0x804a010
y    = 0x804a019..0x804a020
z    = 0x804a029..0x804a030
$ ./alignment_test 1
x    = 0x804a00a..0x804a011
y    = 0x804a01a..0x804a021
z    = 0x804a02a..0x804a031
$ ./alignment_test 2
x    = 0x804a00c..0x804a013
y    = 0x804a024..0x804a02b
z    = 0x804a03c..0x804a043
$ ./alignment_test 3
x    = 0x804a008..0x804a00f
y    = 0x804a028..0x804a02f
z    = 0x804a048..0x804a04f
$ ./alignment_test 4
x    = 0x804a010..0x804a017
y    = 0x804a050..0x804a057
z    = 0x804a070..0x804a077
$ ./alignment_test 16
x    = 0xb7d70000..0xb7d70007
y    = 0xb7d50000..0xb7d50007
z    = 0xb7d30000..0xb7d30007

So let's test the actual performance.

#include <stdio.h>
#include <stdlib.h>

/*
* Returns memory, address of which ends in 1 and align_bits 0s
* Warning: Leaks memory
*/
void *aligned_malloc(int size, int align_bits)
{
   char *data = malloc(size + (2 << align_bits));
   unsigned int alignment_mask = (1 << align_bits) - 1;

   /* Guarantee that lowest align_bits of address are all 0 */
   data = (char*)(((unsigned int)data + alignment_mask) & ~alignment_mask);

   /* Guarantee that the next bit of address is 1 */
   if (((unsigned int)data & (1 << align_bits)) == 0)
       data += (1 << align_bits);
   return (void*)data;
}

int main(int argc, char **argv)
{
   double *x;
   double *y;
   double *z;
   int alignment_bits;
   int i;

   if(argc != 3) {
       fprintf(stderr, "Usage: %s <alignment bits> <iterations>\n", argv[0]);
       return 1;
   }
   alignment_bits = atoi(argv[1]);

   x = (double*)aligned_malloc(sizeof(double), alignment_bits);
   y = (double*)aligned_malloc(sizeof(double), alignment_bits);
   z = (double*)aligned_malloc(sizeof(double), alignment_bits);

   *x = 1.0;
   *y = 2.0;
   *z =-3.0;

   for(i=atoi(argv[2]); i != 0; i--)
   {
       *y += *x;
       *z -= *y;
       *x += *z;
   }

   return 0;
}


$ ./alignment_benchmark 0 500000000
real    0m19.363s
user    0m14.241s
sys     0m0.136s

$ ./alignment_benchmark 1 500000000
real    0m19.734s
user    0m14.005s
sys     0m0.152s

$ ./alignment_benchmark 2 500000000

real    0m19.802s
user    0m14.441s
sys     0m0.144s

$ ./alignment_benchmark 3 500000000

real    0m11.570s
user    0m9.285s
sys     0m0.092s

$ ./alignment_benchmark 4 500000000

real    0m13.542s
user    0m9.049s
sys     0m0.076s

$ ./alignment_benchmark 16 500000000

real    0m43.321s
user    0m32.066s
sys     0m0.304s

And indeed, 8/16-byte alignment makes the code much faster than 1/2/4-byte alignment. On the other hand performance of 64kB-alignment is terrible. It is actually caused by caching issues, not alignment issues. Notice that from high-level C point of view, the operations in all benchmarks are identical, so the program should take the same amount of time to run, except for maybe taking a bit longer to malloc more memory (actually malloc time in ./alignment_benchmark 16 500000000 is negligible - the effect is caused by caching alone). So much for C being "close to the hardware".

Caching

Compilers handle alignment magically, so normally you don't need to bother. The only reason it was possible to see the difference was casting pointers to unsigned int and messing with bare bits. What compilers don't handle magically is caching. To see how important it is, let's compare Big O analysis enthusiasts' favourite data structure - linked list, with real programmers' favourite - plain array. The test will be iterating over allocated list/array and adding everything in it. First, the theory. Both data structures are obviously O(n). Number of memory loads is supposedly two times higher for linked lists than for plain arrays. In theory it doesn't make a difference whether we iterate sequentially or randomly.

#include <stdio.h>
#include <stdlib.h>

struct list_node {
   int value;
   struct list_node * next;
};

int main(int argc, char **argv)
{
   struct list_node *list_head = 0, *tmp;
   int *array = 0;
   int test;
   int size;
   int iterations;
   int i;
   int result = 0;

   if(argc != 4) {
       fprintf(stderr, "Usage: %s <data structure> <iterations> <elements>\n", argv[0]);
       return 1;
   }

   test = atoi(argv[1]);
   size = atoi(argv[3]);

   switch(test){
   case 0:
       array = malloc(sizeof(int) * size);
       for(i = 0; i<size; i++) {
           array[i] = i;
       }
   break;

   case 1:
       for(i = 0; i<size; i++) {
           tmp = list_head;
           list_head = malloc(sizeof(struct list_node));
           list_head->value = i;
           list_head->next = tmp;
       }
   break;
   }

   for(iterations = atoi(argv[2]); iterations!=0; iterations--)
   {
       switch(test) {
       case 0:
           for(i = 0; i<size; i++) {
               result += array[i];
           }
       break;

       case 1:
           for(tmp = list_head; tmp; tmp=tmp->next) {
               result += tmp->value;
           }
       break;
       }
   }
   printf("Result is %d\n", result);

   return 0;
}

This benchmark lets us change number of iterations and size of data structure. According to the theory, difference between linked lists and plain arrays shouldn't matter much, and as they're both obviously O(n), size shouldn't matter at all, as long as number of iterations is halved when size is doubled. Reality shows a diametrically different picture.

$ time ./lists_benchmark 0 1000000 100 >/dev/null
real    0m0.192s
user    0m0.180s
sys     0m0.012s

$ time ./lists_benchmark 0 100000 1000 >/dev/null
real    0m0.188s
user    0m0.168s
sys     0m0.012s

$ time ./lists_benchmark 0 10000 10000 >/dev/null
real    0m0.189s
user    0m0.184s
sys     0m0.000s

$ time ./lists_benchmark 0 1000 100000 >/dev/null
real    0m0.960s
user    0m0.936s
sys     0m0.008s

$ time ./lists_benchmark 0 100 1000000 >/dev/null
real    0m1.023s
user    0m1.000s
sys     0m0.012s

$ time ./lists_benchmark 0 10 10000000 >/dev/null
real    0m1.169s
user    0m1.064s
sys     0m0.100s

$ time ./lists_benchmark 1 1000000 100 >/dev/null
real    0m0.280s
user    0m0.272s
sys     0m0.004s

$ time ./lists_benchmark 1 100000 1000 >/dev/null
real    0m0.258s
user    0m0.252s
sys     0m0.004s

$ time ./lists_benchmark 1 10000 10000 >/dev/null
real    0m5.794s
user    0m5.636s
sys     0m0.052s

$ time ./lists_benchmark 1 1000 100000 >/dev/null
real    0m6.019s
user    0m5.860s
sys     0m0.088s

$ time ./lists_benchmark 1 100 1000000 >/dev/null
real    0m6.216s
user    0m5.944s
sys     0m0.120s

$ time ./lists_benchmark 1 10 10000000 >/dev/null
real    0m7.596s
user    0m7.036s
sys     0m0.440s

Big surprise - the algorithms are not O(n) at all, and in fact behave differently with increased data size. Linked lists start 50% worse than plain arrays (pretty much what the theory predicted), but end up 7x slower. The difference between O(n) and observed behaviour is 6x for plain arrays and 26x for linked lists. Yay! What went wrong ? The Big O analysis made a fundamental mistake of ignoring memory hierarchy. Cost of memory access varies from very low (things guaranteed to be in innermost cache, like top of the stack) to huge (memory paged to disk).

Alignment magic

How compilers magically deal with alignment issues ? First, malloc always returns properly aligned memory. On modern x86s this means 16-byte alignment. 4 bytes is enough for integers, 8 bytes is enough for double, but SSE variables need 16 byte alignment, and malloc doesn't know how the memory will be used.

#include <stdio.h>
#include <stdlib.h>

struct packet {
   char a;
   char b;
   int c;
   int d;
   double e;
   char f;
   double g;
};

int main(int argc, char **argv)
{
   struct packet packet;
   printf("a = %p|next free=%p\n", &packet.a, 1+&packet.a);
   printf("b = %p|next free=%p\n", &packet.b, 1+&packet.b);
   printf("c = %p|next free=%p\n", &packet.c, 1+&packet.c);
   printf("d = %p|next free=%p\n", &packet.d, 1+&packet.d);
   printf("e = %p|next free=%p\n", &packet.e, 1+&packet.e);
   printf("f = %p|next free=%p\n", &packet.f, 1+&packet.f);
   printf("g = %p|next free=%p\n", &packet.g, 1+&packet.g);
   return 0;
}

a = 0xbf9cd7dc|next free=0xbf9cd7dd
b = 0xbf9cd7dd|next free=0xbf9cd7de
c = 0xbf9cd7e0|next free=0xbf9cd7e4
d = 0xbf9cd7e4|next free=0xbf9cd7e8
e = 0xbf9cd7e8|next free=0xbf9cd7f0
f = 0xbf9cd7f0|next free=0xbf9cd7f1
g = 0xbf9cd7f4|next free=0xbf9cd7fc

The compiler created holes in data structure to make sure ints and doubles are 4-byte aligned, but didn't bother aligning chars. That's not correct behaviour - doubles should definitely be 8-byte aligned ! I guess it's for backwards compatibility - on older x86 4-byte alignment might suffice, and changing C ABI now would be too painful. What about C being close to hardware again ? The whole data structure was also only 4-byte aligned on the stack. On the other hand as alignment of local variables doesn't cause any compatibility issues, compiler freely reorganized them and made sure doubles on stack are 8-byte aligned:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   double a;
   int b;
   double c;
   int d;

   printf("a = %p|next free=%p\n", &a, 1+&a);
   printf("b = %p|next free=%p\n", &b, 1+&b);
   printf("c = %p|next free=%p\n", &c, 1+&c);
   printf("d = %p|next free=%p\n", &d, 1+&d);
   return 0;
}

a = 0xbf8a1ec8|next free=0xbf8a1ed0
b = 0xbf8a1ed4|next free=0xbf8a1ed8
c = 0xbf8a1ec0|next free=0xbf8a1ec8
d = 0xbf8a1ed0|next free=0xbf8a1ed4

So the order is: double c; double a; int d; int b;. If you're an assembly old timer, you're probably asking how the heck compiler can make sure functions are aligned on stack, especially function that take variable number of arguments (unlike in C++, in C fun() takes variable number of arguments, to declare a no-argument function in C one uses fun(void)).

#include <stdio.h>
#include <stdlib.h>

double *fun() {
   double a;
   return &a;
}

int main(int argc, char **argv)
{
   printf("a3 = %p\n", fun(1, 2, 3));
   printf("a2 = %p\n", fun(4, 5));
   printf("a1 = %p\n", fun(6));
   printf("a0 = %p\n", fun());

   return 0;
}

fun doesn't do anything funny with stack pointer %esp (for clarity of generated assembly, code compiled with -fomit-frame-pointer):

08048380 <fun>:
8048380:       83 ec 14                sub    $0x14,%esp
8048383:       8d 44 24 08             lea    0x8(%esp),%eax
8048387:       83 c4 14                add    $0x14,%esp
804838a:       c3                      ret

Yet doubles are not only perfectly aligned, but all have the same offset.

$ ./align_fun
a3 = 0xbfa27068
a2 = 0xbfa27068
a1 = 0xbfa27068
a0 = 0xbfa27068

How is it possible ? We'd expect call to fun(1, 2, 3) to take 12 bytes more on stack than fun(). What happens is a slight modification of ABI. The first thing main function does is aligning the stack pointer to 16 bytes.

08048390 <main>:
8048390:       8d 4c 24 04             lea    0x4(%esp),%ecx
8048394:       83 e4 f0                and    $0xfffffff0,%esp
8048397:       ff 71 fc                pushl  0xfffffffc(%ecx)
804839a:       51                      push   %ecx

Every function assumes %esp is properly aligned when it's called, and makes sure %esp stays aligned by pushing it a bit more if necessary. One way to do so would be generating extra push instructions, but it's actually more efficient to manipulate %esp directly, and treat stack like a plain array. Somewhat better compiler would subtract from %esp (x86 stack grows backwards), move arguments to top of the stack, call the function, and then add the same number to %esp. Notice there's no good reason to keep subtracting and adding %esp. It's usually enough to make extra space on stack when the function starts (subtract something from %esp), and free that space before the function returns (add the same thing to %esp). A single subtraction allocates space for local variables, arguments, and makes sure the stack is aligned. Notice how calls to fun really look like - no messing with the stack pointer, just moving things to the right places.

08048390 <main>:
8048390:       8d 4c 24 04             lea    0x4(%esp),%ecx
8048394:       83 e4 f0                and    $0xfffffff0,%esp
8048397:       ff 71 fc                pushl  0xfffffffc(%ecx)
804839a:       51                      push   %ecx
804839b:       83 ec 18                sub    $0x18,%esp
804839e:       c7 44 24 08 03 00 00 00 movl   $0x3,0x8(%esp)
80483a6:       c7 44 24 04 02 00 00 00 movl   $0x2,0x4(%esp)
80483ae:       c7 04 24 01 00 00 00    movl   $0x1,(%esp)
80483b5:       e8 c6 ff ff ff          call   8048380 <fun>
80483ba:       c7 04 24 fc 84 04 08    movl   $0x80484fc,(%esp)
80483c1:       89 44 24 04             mov    %eax,0x4(%esp)
80483c5:       e8 f6 fe ff ff          call   80482c0 <printf@plt>
80483ca:       c7 44 24 04 05 00 00 00 movl   $0x5,0x4(%esp)
80483d2:       c7 04 24 04 00 00 00    movl   $0x4,(%esp)
80483d9:       e8 a2 ff ff ff          call   8048380 <fun>
80483de:       c7 04 24 05 85 04 08    movl   $0x8048505,(%esp)
80483e5:       89 44 24 04             mov    %eax,0x4(%esp)
80483e9:       e8 d2 fe ff ff          call   80482c0 <printf@plt>
80483ee:       c7 04 24 06 00 00 00    movl   $0x6,(%esp)
80483f5:       e8 86 ff ff ff          call   8048380 <fun>
80483fa:       c7 04 24 0e 85 04 08    movl   $0x804850e,(%esp)
8048401:       89 44 24 04             mov    %eax,0x4(%esp)
8048405:       e8 b6 fe ff ff          call   80482c0 <printf@plt>
804840a:       e8 71 ff ff ff          call   8048380 <fun>
804840f:       c7 04 24 17 85 04 08    movl   $0x8048517,(%esp)
8048416:       89 44 24 04             mov    %eax,0x4(%esp)
804841a:       e8 a1 fe ff ff          call   80482c0 <printf@plt>
804841f:       83 c4 18                add    $0x18,%esp
8048422:       31 c0                   xor    %eax,%eax
8048424:       59                      pop    %ecx
8048425:       8d 61 fc                lea    0xfffffffc(%ecx),%esp
8048428:       c3                      ret

PLT

There's one more thing - call to fun called a constant place, but call to printf actually called printf@plt. The program doesn't have any idea where printf will reside, it changes from one versions of libc to another. The program doesn't even know where libc will reside. strace /bin/true clearly shows that programs tries to mmap libc without asking for a specific location. The operating system just happens to put it at address 0xb7e60000:


open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
mmap2(NULL, 1312164, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e60000

strace'ing some other program (in this case Ruby) shows libc allocated somewhere else:

open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
mmap2(NULL, 1312164, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7cd3000

So how does it work ? Address of the actual printf function is located at 0x8049620 (at least in this case), and printf@plt jumps there:

080482c0 <printf@plt>:
80482c0:       ff 25 20 96 04 08       jmp    *0x8049620

Places like 0x8049620 are filled in when the program starts (actually when they're first needed, but that's a minor technical detail). Why won't we simply modify the code at start time ? Keeping code strictly read only lets the operating system hold only one copy of code in memory, even when multiple programs use it. If we modified the code, operating system would need to keep in memory a separate copy of libc for every process that uses libc. That would be horribly wasteful. Actually a few highly hackful solutions are possible to modify the code and preserve sharing by making the operating system more aware of how linking works, but they'd require significant changes to Unix architecture.

15 comments:

Luke said...: Big Oh notation is used for comparing algorithms of different orders of magnitude because it is used to describe the asymptotic behavior of functions. You cannot discredit big oh evaluation based on a constant factor between two O(n) algorithms. I suggest you read the wikipedia article to understand how big oh is actually used.; 18:41
Unknown said...: Plain arrays are O(n)???

I thought they were O(1).; 19:15
Anonymous said...: (1) The difference between the linked lists and the arrays was not a constant factor; as n grows larger, linked lists get worse faster.

(2) Did you even read the article/code? Iterating through an array takes 0(n) time because you have to read every element of the array. Some operations on arrays take O(1) time but iteration through the entire array is not one of them.; 06:01
HawkERB said...: This comment has been removed by the author.; 10:10
Anonymous said...: Strictly speaking, nobody can tell whether it's O(n) or not by experiment alone. That's because O(n) only means that the running time in question divided by n will approach a constant as n approaches infinity. In particular, the running time may do whatever it likes for small n and the observed behavior is entirely within O(n).

Of course, this means that O(n) can only give an idea how fast an algorithm is, it doesn't allow exact time predictions. But in the present example, neither does step counting do! Due to caching issues, the operations

result += tmp->value;
tmp = tmp->next;

cannot be assumed to take the same amount of time every time they're executed. In other words, the possibility itself of prediciting running time by looking at the source code is threatened.; 11:13
Unknown said...: A binary tree is also O(n) when you're iterating over all its leaves.

This is also true for hash maps.

This is also true for doubly linked lists.

This is also true for Btrees!

Everything is O(n)!; 01:20
Anonymous said...: loads of people still write programs in assembly, its just that its difficult, and can only be properly done by truly skilled programmer.; 20:11
Anonymous said...: Oddly enough, found you via the War Nerd... excellent reference/post. I'm adding an RSS feed to your blog.; 00:04
taw said...: Anonymous: Yes, I'm a big fan of the War Nerd, even after he failed rather spectacularly with his Mumbai attacks predictions.; 01:48
Yuhong Bao said...: On section "Ancient x86 assembly", MSVC on Windows still uses this method, particularly because the __stdcall convention where callee pops the stack is very common in Windows.; 04:46
Yuhong Bao said...: "Notice how calls to fun really look like - no messing with the stack pointer, just moving things to the right places."
In contrast, on MSVC, the only difference between a __cdecl and __stdcall function is that a __stdcall function uses a RET n instruction inside the callee to add to the stack pointer, where a __cdecl function just uses a plain RET instruction and MSVC adds a ADD ESP, n (or add $n, %esp) instruction after the CALL instruction.; 04:56
Unknown said...: I have two small objections to your approach and (some) drawn conclusions.

Asymptotic function behaviour describes algorithms, not hardware. Of course it doesn't take memory hierarchy into account.

Secondly, C is close to the hardware. Unix 'time' on the other hand, isn't. This would be significantly cleared if you gathered statistics in your code instead of using a command like 'time' - taking initialization time into account is one of the most basic mistakes in a measurement/profiling process.

Cheers; 17:39
Unknown said...: Cleared = clearer. Damn fingers.; 17:40
taw said...: agit8d: Initialization of simple Unix process is on order of fractions of milisecond - less than typical gettimeofday(2) resolution. Measured time is many seconds. You seem to be confusing C with JVM.

Assuming memory access has unit cost is like assuming that cats are uniform frictionless spheres. It might be justified occasionally, but burden of proof is on the assumer if result is claimed to have any real world significance.

I thought my post should dispel ideas that C is close to hardware...; 17:53
Anonymous said...: I don't think it's fair to claim C isn't close to hardware because it doesn't align a double to an 8 byte boundary. Not only is the actual size of double is implementation defined, with the C standard only guaranteeing that it is at least as big as a float, but memory packing is entirely up to the compiler. The C language leaves a lot details up to the compiler writers, such as memory packing. Some compilers even allow you to align to boundaries using directives, while others do not.

Basically what you are saying about C should be directed at your implementation. Maybe your using an x86 version of gcc on linux? That's pretty common, but that doesn't even fully implement the latest dialect of C. It's unfair to make claims about C based on one implementation's performance.; 06:36

taw's blog

Saturday, March 17, 2007

Modern x86 assembly

Ancient x86 assembly

Alignment

Caching

Alignment magic

PLT

15 comments:

About Me

Popular Posts

Blog Archive

Labels

My software

Creative Commons