My previous post wasn't supposed to be about performance - just about x86 assembly. Somehow what grabbed the most attention was the criticism of big O analysis. Comments here and on reddit seem to show that most people didn't really get my point.
All models are wrong. Some are useful.

Models are just tools that make reasoning about something easier - in many cases, that make it possible at all. Real-world systems are extremely complex. Models distort reality to make it more amenable to analysis. Classical mechanics is one such successful model. We know it is wrong, as it ignores quantum effects and relativity. Most of the time many other physical effects are ignored too - like the rotation of the Earth, the possibility of object properties changing with temperature, and in some cases even friction. Most models in economics assume that capital is homogeneous, and bioinformatics sometimes even pretends that molecules are rigid. As long as the distortions introduced by the model are not particularly relevant for our purposes, the model is fine.

But even models known to seriously distort the very feature we're interested in can be useful. To find molecules likely to have medical effects, millions of molecules must be checked. Good models of molecular interactions are just too slow, so simple models in which molecules are rigid or have limited flexibility are used first to get a rough approximation, which lets us focus on the more promising candidates. Sometimes a great molecule will be overlooked this way, and we simply have to accept that. To find a good developer, you don't offer a trial period to every candidate to evaluate them - it's quite an accurate model, but way too expensive. So cheaper models like interviews and CVs are used, even though they distort reality a lot.

There's nothing wrong with models being wrong. We need wrong models. There's just no way we'd perform a randomized double-blind study on a few thousand people every time we want to check whether some molecule is medically useful. A lot of simpler models - computer simulations, chemical experiments, cell cultures and animal tests - are used first. Very often the simple models give incorrect answers. Many drugs they found safe and effective were later rejected in clinical trials (or worse - after being introduced to the market). Most likely many drugs that would have been safe and effective in humans were rejected because of such simple tests. We need such models like we need maps. Depending on our purpose, we need different models like we need different kinds of maps.
The map is not the territory.
Performance models
It's very easy to get used to a model - to keep using it even when it's totally inapplicable. A great example of a misguided performance model is Amdahl's Law. According to this law, a program has a part that can be parallelized (a fraction 1 − α of its running time) and a part that has to stay serial (a fraction α). So by throwing P processors at the problem, we get a normalized running time of:
T = α + (1 − α)/P
So if the serial part takes 10% of the time on a single processor, throwing a whopping 256 processors at the problem gives only a meager 9.66x speedup.
The assumption that makes the model completely useless is that the serial part stays constant. But the more processors we have, the bigger the problems we deal with, and almost invariably, the smaller the serial fraction becomes. "Embarrassingly parallel" problems, for which speedup is almost proportional to the number of processors thrown at them, are extremely common.
A model that better matches reality is Gustafson's Law, where the speedup is:
S = α + (1 − α) P
and throwing 256 processors at a problem with 10% serial time achieves a 230.5x speedup!
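To double-check the arithmetic, here's a small C sketch (my own, not part of the original benchmarks) that evaluates both formulas for α = 0.1 and P = 256:

#include <stdio.h>

int main()
{
    double alpha = 0.1;  /* serial fraction */
    double P = 256.0;    /* number of processors */

    /* Amdahl: normalized running time T = alpha + (1 - alpha)/P,
       so speedup = 1/T */
    double amdahl = 1.0 / (alpha + (1.0 - alpha) / P);

    /* Gustafson: speedup S = alpha + (1 - alpha)*P */
    double gustafson = alpha + (1.0 - alpha) * P;

    printf("Amdahl:    %.2fx\n", amdahl);     /* about 9.66x */
    printf("Gustafson: %.2fx\n", gustafson);  /* 230.50x */
    return 0;
}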
Big O analysis on a random access machine is just one of many models of performance. If used right, it can be useful. Unfortunately, on modern hardware it's almost completely useless for predicting performance. Worse than that - it is harmful, as the widespread belief in big O analysis slows down the adoption of more accurate models.
Big O analysis
Here's a quick sketch of big O analysis in action:
- Write down the algorithm in terms of basic operations.
- Select the relevant variable or variables. Usually it's the input size n.
- Analyze how the number of basic operations depends on variables like n. We're only interested in the asymptotic upper bound, ignoring the "constant factor".
- The result is supposed to tell us how fast the algorithm is in the worst case. Algorithms with a better O-class are supposed to be faster, "except for maybe small n".
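As a concrete example of this procedure (a hypothetical one of mine, not from the original post), count the basic operations of a function that sums an array:

/* Per iteration: one load, one add, one increment, one compare -
 * about 4 "basic operations", so roughly 4n in total, which big O
 * analysis rounds to O(n). */
int sum(int *data, unsigned int n)
{
    int res = 0;
    unsigned int i;
    for(i = 0; i < n; i++)
        res += data[i];   /* the load x = data[i] plus an add */
    return res;
}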
The problem is that a single basic operation like
x = data[i];
can mean a 1 ns copy from the L1 cache to a register just as easily as 8,000,000 ns to seek the disk head, read a whole 4 kB page from the swap file into memory, propagate the data through the (optional) L3, L2, and L1 caches, and finally copy it to the register.
Modern hardware has multi-level memory. For small problem sizes all data fits in the L1 or L2 cache. For moderate sizes main memory has to be used. For big problem sizes disks are used. Beyond that it's tapes, distributed storage, or something even fancier. There is no n beyond which all memory accesses can be assumed to take uniform time, and below which we can dismiss all evidence against big O analysis as a "small n effect".
Even when we know which level we're operating at, counting operations is still useless - memories have high enough throughput, but very high latency. So accessing them the right way (typically long sequential reads) can be many orders of magnitude faster than accessing them the wrong way.
The following table shows how much throughput DDR2 RAM and a 7200 rpm hard disk provide with sequential access (best case) and when reading 4 bytes from random locations (worst case).
Memory level | Latency | Worst-case throughput | Best-case throughput | Best-to-worst ratio |
---|---|---|---|---|
Memory (DDR2-533 SDRAM 4–4–4) | 15 ns | 254 MB/s | 8.5 GB/s | 34:1 |
Hard drive (7200 rpm) | 4.17 ms | 7.7 kB/s | 80 MB/s | 87500:1 |
Big O analysis won't tell us whether
x = data[i];
accesses L1, L2, L3, main memory, disk, or some sort of distributed storage. And within a particular memory level, it won't tell us whether the memory access patterns are of the slow or the fast kind (which for hard disks makes an 87500:1 difference!).
Such effects are relatively benign for problems fitting in the L1 or L2 cache, and become far more important for larger n. Big O analysis is reasonably accurate only for small n!
Experiment
I wanted to run a simple experiment that would show the performance of sequential and random access. The benchmark simply mmaps a big file full of random data and adds up its elements. In one version the elements are accessed sequentially, in the other "pseudorandomly". Pseudorandom access is just a few instructions, the most expensive of them being j % size. Real (wall clock) time is measured, not user+sys. If less than 10 s have passed, the program reruns the benchmark; this is done to improve measurement quality for very small n. The code is compiled with -O6 -march=athlon-xp -fomit-frame-pointer, and is optimized pretty well by the compiler. The machine has 512 MB of memory.
Sequential version:
#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <unistd.h>
int main()
{
char *file = "./huge_file";
size_t max_size = 2560LL*1024*1024; /* 2.5GB */
int fd;
int *data;
unsigned int size = 256;
struct timeval start_time, end_time;
unsigned int i;
int res;
double time_elapsed;
int iterations;
fd = open(file, O_RDONLY|O_LARGEFILE);
if(fd == -1)
{
fprintf(stderr, "open failed: %s\n", strerror(errno));
return 1;
}
data = mmap(NULL, max_size, PROT_READ, MAP_PRIVATE, fd, 0);
if(data == MAP_FAILED)
{
fprintf(stderr, "mmap failed: %s\n", strerror(errno));
return 1;
}
/* Tells kernel to expect sequential access */
madvise(data, max_size, MADV_SEQUENTIAL|MADV_WILLNEED);
while(size * sizeof(int) <= max_size)
{
gettimeofday(&start_time, 0);
iterations = 0;
/* At least one full iteration */
do {
iterations++;
res = 0;
for(i=0; i<size; i++)
res += data[i];
gettimeofday(&end_time, 0);
time_elapsed = (end_time.tv_sec - start_time.tv_sec)
+ (end_time.tv_usec - start_time.tv_usec) * 1e-6;
} while(time_elapsed < 10.0);
printf("n=%d t/op=%g us t/op/n=%g ns res=%d (%f/%d)\n",
size,
time_elapsed*1000000/iterations,
time_elapsed*1000000000/iterations/size,
res,
time_elapsed,
iterations);
if(size * sizeof(int) == max_size)
break;
size *= 1.5;
if(size * sizeof(int) > max_size)
size = max_size / sizeof(int);
}
return 0;
}
Pseudorandom version:
#define _GNU_SOURCE 1
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <unistd.h>
int main()
{
char *file = "./huge_file";
size_t max_size = 2560LL*1024*1024; /* 2.5GB */
int fd;
int *data;
unsigned int size = 256;
struct timeval start_time, end_time;
unsigned int i;
unsigned int j;
int res;
double time_elapsed;
int iterations;
fd = open(file, O_RDONLY|O_LARGEFILE);
if(fd == -1)
{
fprintf(stderr, "open failed: %s\n", strerror(errno));
return 1;
}
data = mmap(NULL, max_size, PROT_READ, MAP_PRIVATE, fd, 0);
if(data == MAP_FAILED)
{
fprintf(stderr, "mmap failed: %s\n", strerror(errno));
return 1;
}
/* Tells kernel to expect random access */
madvise(data, max_size, MADV_RANDOM|MADV_WILLNEED);
while(size * sizeof(int) <= max_size)
{
gettimeofday(&start_time, 0);
iterations = 0;
/* At least one full iteration */
do {
iterations++;
res = 0;
/* j is a fast pseudorandom number */
j = 87230330;
for(i=0; i<size; i++)
{
j += 2024939473;
res += data[j % size];
}
gettimeofday(&end_time, 0);
time_elapsed = (end_time.tv_sec - start_time.tv_sec)
+ (end_time.tv_usec - start_time.tv_usec) * 1e-6;
} while(time_elapsed < 10.0);
printf("n=%d t/op=%g us t/op/n=%g ns res=%d (%f/%d)\n",
size,
time_elapsed*1000000/iterations,
time_elapsed*1000000000/iterations/size,
res,
time_elapsed,
iterations);
if(size * sizeof(int) == max_size)
break;
size *= 1.5;
if(size * sizeof(int) > max_size)
size = max_size / sizeof(int);
}
return 0;
}
Both programs perform exactly the same number of memory loads. According to big O analysis, their performance should differ only by a small constant factor.
The following table shows time/operation.
n | Memory size | Sequential time/operation | Pseudorandom time/operation |
---|---|---|---|
256 | 1 kB | 3.4 ns | 31.8 ns |
384 | 1.5 kB | 2.9 ns | 31.5 ns |
576 | 2.2 kB | 2.7 ns | 31.3 ns |
864 | 3.4 kB | 2.4 ns | 31.1 ns |
1296 | 5.1 kB | 2.3 ns | 31.0 ns |
1944 | 7.6 kB | 2.2 ns | 31.0 ns |
2916 | 11.4 kB | 2.2 ns | 30.9 ns |
4374 | 17.1 kB | 2.1 ns | 30.9 ns |
6561 | 25.6 kB | 2.1 ns | 30.9 ns |
9841 | 38.4 kB | 2.2 ns | 31.0 ns |
14761 | 57.7 kB | 2.4 ns | 30.9 ns |
22141 | 86.5 kB | 2.7 ns | 31.0 ns |
33211 | 129.7 kB | 6.3 ns | 45.3 ns |
49816 | 194.6 kB | 10.6 ns | 182.5 ns |
74724 | 291.9 kB | 12.0 ns | 142.9 ns |
112086 | 437.8 kB | 10.6 ns | 189.6 ns |
168129 | 656.8 kB | 10.7 ns | 199.9 ns |
252193 | 985.1 kB | 10.6 ns | 232.1 ns |
378289 | 1.4 MB | 12.1 ns | 241.1 ns |
567433 | 2.2 MB | 10.9 ns | 240.9 ns |
851149 | 3.2 MB | 10.7 ns | 258.5 ns |
1276723 | 4.9 MB | 10.7 ns | 260.7 ns |
1915084 | 7.3 MB | 11.6 ns | 261.2 ns |
2872626 | 11.0 MB | 10.8 ns | 265.0 ns |
4308939 | 16.4 MB | 10.7 ns | 272.7 ns |
6463408 | 24.7 MB | 10.8 ns | 297.0 ns |
9695112 | 37.0 MB | 10.8 ns | 400.8 ns |
14542668 | 55.5 MB | 10.8 ns | 439.5 ns |
21814002 | 83.2 MB | 11.3 ns | 476.6 ns |
32721003 | 124.8 MB | 13.5 ns | 445.2 ns |
49081504 | 187.2 MB | 19.7 ns | 334.0 ns |
73622256 | 280.8 MB | 28.4 ns | 502.0 ns |
110433384 | 421.3 MB | 156.4 ns | 1028.4 ns |
165650076 | 631.9 MB | 422.6 ns | 150079.2 ns |
248475114 | 947.9 MB | 415.8 ns | |
372712671 | 1.4 GB | 418.1 ns | |
559069006 | 2.1 GB | 457.2 ns | |
671088640 | 2.5 GB | 481.3 ns | |
For reference, here are the inner loops the compiler generated. Sequential:
80485f0:  03 1c 87            add    (%edi,%eax,4),%ebx    # res += data[i]
80485f3:  40                  inc    %eax                  # i++
80485f4:  39 f0               cmp    %esi,%eax             # loop while i != size
80485f6:  75 f8               jne    80485f0
Pseudorandom:
80485f6:  81 c1 d1 1f b2 78   add    $0x78b21fd1,%ecx      # j += 2024939473
80485fc:  31 d2               xor    %edx,%edx             # clear edx before the division
80485fe:  89 c8               mov    %ecx,%eax             # eax = j
8048600:  f7 f3               div    %ebx                  # edx = j % size (the expensive part)
8048602:  03 7c 95 00         add    0x0(%ebp,%edx,4),%edi # res += data[j % size]
8048606:  39 ce               cmp    %ecx,%esi             # loop exit test
8048608:  75 ec               jne    80485f6
The right way
Memory hierarchy is not the only problem with big O analysis. Another problem is the overuse of worst-case running times, which in many cases are far from typical. The only approach to performance analysis that is pretty much guaranteed to work is getting actual load data and running benchmarks with it. Sometimes artificially constructed benchmarks and theoretical performance models (including big O analysis) are appropriate, but don't blindly assume that's true in your case. Since saying "benchmark with real data" is easy but not very useful, here are a few bits of advice:
- Benchmark with real data, or data as realistic as you can get.
- Non-uniform access patterns are faster than uniform ones. One million accesses to a single memory cell guarantee it stays in the L1 cache. One million accesses, each to a different memory cell, mean going to RAM. That's why accessing the top levels of a search tree is cheaper than accessing the lower levels.
- Sequential is faster than random, much faster.
- Keep data that is accessed together close to each other.
- Keep data of different "temperatures" apart. Keeping very frequently and very rarely accessed data on the same cache line means cache space is wasted. This is particularly important when main memory is used as a cache for disk files, where the effective "cache line" size is something like 4 kB. A great illustration of this principle is B+ trees, which keep all data (rarely accessed) at the leaf level, so more index nodes (frequently accessed) fit in memory or caches. B-trees keep data mixed with the index. That's a huge waste of cache, as a lot of rarely accessed data must be held in memory, and fewer frequently accessed index nodes fit. (See the sketch after this list.)
- Reuse standard library data structures and algorithms - they're likely to be reasonably well tuned. Often they perform worse than what you could write yourself, usually because they try to be more generic than they need to be, or because they're not optimized for your hardware. But more often than not, they're better than what you'd write in a reasonable amount of time, and in any case fast enough.
- Learning the basics of assembly, cache architecture, and other low-level stuff certainly won't hurt more than learning J2EE, and may even be more useful.
- Did I mention you should get real data?
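To illustrate the "temperature" advice from the list above, here's a rough C sketch (a hypothetical record layout of mine, not from the benchmark) of splitting hot and cold fields so the frequently scanned data stays densely packed:

#include <stddef.h>

/* Mixed layout: every scan of the hot fields drags the rarely used
   blob through the cache too, wasting most of each cache line. */
struct account_mixed {
    int  id;            /* hot: checked on every scan */
    int  balance;       /* hot */
    char address[248];  /* cold: needed only when printing statements */
};

/* Split layout: hot fields packed together, cold data kept in a
   parallel array indexed the same way. */
struct account_hot  { int id; int balance; };
struct account_cold { char address[248]; };

/* Scanning the split layout touches ~8 bytes per record instead of 256. */
long total_balance(const struct account_hot *hot, size_t n)
{
    long sum = 0;
    size_t i;
    for(i = 0; i < n; i++)
        sum += hot[i].balance;
    return sum;
}

The same idea, applied to ~4 kB disk pages instead of cache lines, is what makes B+ trees friendlier to the memory hierarchy than plain B-trees.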
12 comments:
I don't even feel like discussing this anymore =)
It's not surprising that your example, which reads megabytes of data but does almost nothing with the data it reads, is critically affected by cache locality.
But many programs are not like this, and in other contexts Big-O notation is often useful. For example, a Lisp programmer will find that using Big-O to count stack frames, conses, and bignum operations will give a pretty good picture of how a function will perform.
I'm coming in halfway into the conversation, but anyway...
I was always under the impression that Big-O was very dependent on the cost metric used. 'All' you've demonstrated is that the usual metric of a single memory access isn't useful in many cases, and provided your own. Perhaps not quite to the letter of textbook Big-O, but certainly within the spirit.
Inflammatory headlines considered harmful.
http://img77.imageshack.us/my.php?image=shituw3.png
It is pretty obvious you have no clue what you're talking about. Everyone knows what Big O applies to. It is a simple tool and you use it. All you're doing is proving that your OS's disk cache strategy sucks for anything non-linear.
If you look at that graph you'll notice they map back to each other with an exponential difference (it is a log/log plot).
But it is clear you have no clue what you are talking about, because you chose something so naive and you seem to be unaware that in big O we drop constants. This means things like wavelets, O(n), are potentially faster than FFTs, O(n log n), but aren't, due to very high constant start-up costs. Everyone knows the weaknesses of big O; what you've done is nothing new, and people have already done it.
The fact that you felt it was necessary to do this further proves that you have no clue what you're doing, because otherwise you would've accepted and been aware of the assumptions of big O analysis in the first place.
What you've really done here is prove that you are incapable of real analysis and of understanding how metrics and analysis work.
Excellent article. The big-o lovers just can't stand that there's actual engineering to software and not just puzzle questions and big-o notation.
Great article and all, but no... just, no. Big O analysis is useful, if you know how to use it properly. Big O is NOT for measuring performance - Big O is intended to measure scalability, independent of the hardware you're running the code on. Given that that is its objective, if you actually read up on how to use it, and WHEN it can be applied, then it works pretty darn well.
It is great of you to remind us that modern computer architecture is an important factor in performance, and to provide illustrative test code.
However, the title is too much. Most of us know the limits of big O, but most of us would still apply it.
Once I experienced a performance issue with a linked list on an embedded system, likely caused by cache misses, but still, changing the list implementation from O(n^2) to O(n) got the performance to an acceptable level without addressing the underlying cache miss issue.
Without commenting on your rather useless article, I'd like to advise you that there's a bug in your code.
madvise's advice arguments are not allowed to be OR'd together (e.g. MADV_SEQUENTIAL|MADV_WILLNEED). On a Linux platform this results in 2 | 3 = 3, which equals MADV_WILLNEED.
Therefore the page cache read ahead won't be set to sequential but to normal, with just MADV_WILLNEED requested.
cls: Huh, it seemed to work at the time ;-) I'll keep proper madvise use in mind for the future, thanks.