The best kittens, technology, and video games blog in the world.

Monday, June 19, 2006

Docking assembly


Ligand-Receptor docking: Canis lupus familiaris molecule docked into Triticum aestivum bun's binding pocket

What a funny day it was, playing with x86 FPU assembly :-D

The problem was inspired by docking, or molecular force field computations. The thing is, there is a commonly held belief among bioinformaticians that the exponential function is very expensive, so they go to great lengths to avoid it even when the physics suggests it might be a good idea to use it. Unfortunately the x86 FPU provides plenty of evidence for this belief, as it can only evaluate the exponential function at extended (80-bit) precision, even though we'd much rather use single (32-bit) precision. Of course such excessive precision makes it very, very slow, so nobody uses it.

Did you know that you can lower the precision of some other operations on your FPU? :-D

#include <fpu_control.h>

void set_single_precision() {
    int cw = 0;
    _FPU_GETCW(cw);
    cw = (cw & ~0x300) | 0x000;
    _FPU_SETCW(cw);
}

void set_double_precision() {
    int cw = 0;
    _FPU_GETCW(cw);
    cw = (cw & ~0x300) | 0x200;
    _FPU_SETCW(cw);
}

void set_extended_precision() {
    int cw = 0;
    _FPU_GETCW(cw);
    cw = (cw & ~0x300) | 0x300;
    _FPU_SETCW(cw);
}

Anyway, it only affects division and square root computation, which run about twice as fast at the lowest precision - pretty much nothing else speeds up.
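As a usage sketch (Linux/x86-specific, guarded with an #ifdef; `set_precision` is a made-up helper that collapses the three functions above into one):

```c
#if defined(__i386__) || defined(__x86_64__)
#include <fpu_control.h>

/* Hypothetical helper: set the x87 precision-control field directly.
 * bits: 0x000 = single, 0x200 = double, 0x300 = extended (the Linux default). */
static void set_precision(int bits) {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~0x300) | bits;
    _FPU_SETCW(cw);
}
#endif
```

You'd drop to single precision around a division/sqrt-heavy loop, then restore extended precision afterwards so the rest of the program behaves as usual.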

Now we could simply wait for the hardware guys to provide us with faster hardware, or we could become evil and implement fast exponentiation in fast assembly :-D

The way the x86 FPU implements the normal exp function is:
  • split x into its integer part and fractional part
  • evaluate 2^(fractional part of x)
  • multiply the result by 2^(integer part of x), simply by adding to the floating point exponent

fld %st(0)        # Duplicate
frndint           # Round
fsubr %st, %st(1) # Subtract
fxch %st(1)       # Exchange top two values on the stack

f2xm1             # 2^(fractional part) - 1
fadd %st(3), %st  # Add 1, so we get 2^(fractional part)
fscale            # Multiply by 2^(integer part)
fstp %st(1)       # Clean-up

The first and the last steps are very fast; unfortunately the middle one is horribly slow.
So how about we throw it away and replace it with some approximation?
The only reason we can reasonably do that is that the argument is already guaranteed to be a fractional part, that is -1/2 to +1/2 (surprisingly not 0 to 1, as frndint rounds to the nearest integer, not down). Anyway, even a very simple 2-term Taylor expansion is pretty effective. Oh, and in addition to x86 FPU assembly, let's try interfacing it with gcc by some inline asm magic :-D

// The worst relative error is smaller than 1%
inline float fast_exp2_t2(float x)
{
    float res;
    asm (
        "fld %%st(0)         \n\t" // x x 0.5
        "frndint             \n\t" // r(x) x 0.5
        "fsubr %%st, %%st(1) \n\t" // r(x) f(x)=y 0.5
        "fxch %%st(1)        \n\t" // y r(x) 0.5

        // Instead of doing f2xm1 ...
        "fldln2              \n\t" // ln2 y r(x) 0.5
        "fmulp               \n\t" // ln2*y r(x) 0.5
        "fld %%st(0)         \n\t" // ln2*y ln2*y r(x) 0.5
        "fmul %%st(3)        \n\t" // ln2*y/2 ln2*y r(x) 0.5
        "fld1                \n\t" // 1 ln2*y/2 ln2*y r(x) 0.5
        "faddp               \n\t" // (1+ln2*y/2) ln2*y r(x) 0.5
        "fmulp               \n\t" // (1+ln2*y/2)*ln2*y r(x) 0.5

        "fld1                \n\t" // 1 (1+ln2*y/2)*ln2*y r(x) 0.5
        "faddp               \n\t" // 1+(1+ln2*y/2)*ln2*y = ~2^y r(x) 0.5
        "fscale              \n\t" // 2^y * 2^r(x) r(x) 0.5
        "fstp %%st(1)        \n\t" // result 0.5
        :"=t"(res)
        :"0"(x), "u"(0.5)
    );
    return res;
}

gcc inlines that, so it's almost as effective as if it actually knew what it was doing. Anyway, the results were really striking: the normal exp function was 160% slower than the traditional bioinfo trickery, but with this low-precision exponential the slowdown was only 35%!

Of course it's all pretty much a toy and probably has about zero actual uses. At the same time, it's really cool and evil, isn't it? :-D
