The aggregate cache bandwidth of two high-end $1386 quad-core Xeons almost reaches the main memory bandwidth of a $300 graphics card. The cache bandwidth of an eight-core Power7 blade (the cheapest system I could find started at $9788) actually exceeds the main memory bandwidth of the graphics card by 40%. A cache-conscious version of the GPU implementation might make it 5x faster (according to this blog post, the GTX 285 has 1.5 TB/s of cache bandwidth). I'm currently trying to write one, testing on my 2 GHz Core 2 Duo and a Radeon HD 4850, which has 40% of the GTX 285's memory bandwidth.
My results as of now
- ATI Radeon HD 4850, 67.9 GBps memory bandwidth, 384 GBps L2, 480 GBps L1, 1000 mul-add SP GFLOPS, hobbled by the lack of local mem and image support in AMD's OpenCL drivers, resulting in most reads coming from main memory
- Core 2 Duo E6400, 6.4 GBps memory bandwidth, 18 GBps L2, 90.5 GBps L1, 37.5 mul-add SP GFLOPS, hobbled by the lack of cache bandwidth and compute performance (but hey, maybe there's some way to unlock the last 16 GFLOPS)
Table of current results (8 FLOPS per 32 bytes memory read)
The results from the paper are for 500x500 input images. The paper reported
runtimes in seconds; I calculated the bandwidth numbers by dividing 281 GB
by the reported runtimes.
The OpenCL results are measured without the context and kernel creation
and release times, as those are only relevant for a single-run
workload (and a 0.7 s one-time overhead on a 1 s operation would
skew the numbers). The OpenCL results do include the data write and read times.
GLSL times are similarly measured without global setup overhead.
359 GBps (90 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (1024x1024 input)
337 GBps (84 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (500x500 input)
278 GBps (70 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (640x640 input)
[276 GBps (69 GFLOPS) the paper's optimized impl on 8-core 3 GHz Power7]
259 GBps (65 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (512x512 input)
243 GBps (61 GFLOPS) optimized OpenCL GPU on Radeon HD 4850 (500x500 input)
240 GBps (60 GFLOPS) 2x2-blocked GLSL on Radeon HD 4850 (250x250 input)
[230 GBps (58 GFLOPS) the paper's GTX 285 implementation using local mem]
204 GBps (51 GFLOPS) naive GLSL implementation on Radeon HD 4850
[159 GBps (40 GFLOPS) the paper's GTX 285 implementation]
[150 GBps (38 GFLOPS) the paper's optimized impl on 2x4-core 2.93 GHz Nehalem]
86 GBps (22 GFLOPS) naive OpenCL GPU implementation (1D workgroup)
84 GBps (21 GFLOPS) optimized OpenCL CPU on Core 2 Duo E6400
45 GBps (11 GFLOPS) manually optimized OpenMP SSE implementation
25 GBps (6 GFLOPS) naive OpenCL GPU implementation (2D workgroup)
5 GBps (1 GFLOPS) naive OpenMP SSE implementation
5 GBps (1 GFLOPS) naive scalar C implementation
1.1 GBps (0.3 GFLOPS) naive OCaml implementation
0.03 GBps (0.008 GFLOPS) naive pure-Python implementation
The GPU data transfer overhead scales linearly with problem size, but the computational complexity of the algorithm grows quadratically. So the effect of the overhead is more visible at small problem sizes, as you can see above. The CPU doesn't suffer from the data transfer and setup overheads, so below some problem size the CPU is faster. With the current results for the GLSL implementation and the server CPUs, that point is somewhere around a 200x200 input image. For my Core 2, it's around 150x150. For single-image runs (with all the OpenGL overhead factored into the runtime), the GLSL-vs-Power7 cutoff is at around 600x600.
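To make the crossover reasoning concrete, here's a toy cost model in Python. All the constants (setup time, transfer rate, sustained bandwidths) are made-up illustrative numbers, not measurements; the shape of the model is the point: GPU time is a fixed setup cost plus a transfer term linear in the pixel count plus a compute term quadratic in the pixel count, while the CPU only pays the compute term.

```python
# Toy CPU-vs-GPU crossover model. Every constant is illustrative,
# not measured; only the shape of the curves comes from the text above.

def traffic_bytes(n):
    # Total memory traffic for an n x n input, scaled from the
    # 281 GB estimate for a 500x500 image (works out to 4.5 * n^4 bytes).
    return 4.5 * n**4

def gpu_time(n, setup=0.05, transfer_bps=1e9, gpu_bps=300e9):
    # Fixed setup + transfer (linear in pixels) + compute (quadratic in pixels).
    return setup + 16 * n * n / transfer_bps + traffic_bytes(n) / gpu_bps

def cpu_time(n, cpu_bps=80e9):
    # No transfer or setup overhead, just a slower sustained bandwidth.
    return traffic_bytes(n) / cpu_bps

def crossover(max_n=2000):
    # Smallest input size at which the GPU pulls ahead.
    for n in range(1, max_n):
        if gpu_time(n) < cpu_time(n):
            return n
    return None
```

With these made-up constants the crossover lands in the same couple-hundred-pixels ballpark as the measured numbers, but that's tuning, not prediction.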
CPU performance went up from a 5.5 GBps naive parallelized SSE implementation to an 84 GBps optimized OpenCL version (93% of L1 bandwidth, 56% of peak GFLOPS). The fastest manual SSE code I managed topped out at 43 GBps. GPU performance went from an 86 GBps naive implementation to a 337 GBps optimized version (70% of L1 bandwidth, 8% of peak GFLOPS). Note that the naive CPU implementation is written in C, parallelized and vectorized. If you write it in pure Python instead, the performance is around 0.03 GBps, or 180x slower than the C baseline. For another baseline, a naive OCaml implementation manages 1.1 GBps. And the OCaml and Python implementations don't get faster from reordering the memory accesses, because they're bottlenecked by computation, not memory access.
But in vectorized C, this algorithm is all about optimizing your memory accesses. The CPU got a 6x boost from changing the loop nesting (4x from the order optimization, the rest from it letting each core use its own L1 cache, which made parallelization more effective) and another 2x boost from reusing already-loaded values (again reducing average read latency). The GPU seems to be mostly bound by the memory access pattern (and the lack of local memory and global memory caching on the Radeon 4xxx series). On the GPU I got a 2.2x performance increase from updating 8 values at a time and reading the current block of used values into a local array. Heck, a more optimal OpenCL access pattern with _more_ reads and computation (but, sadly, _wrong_ reads) nearly doubles the GPU performance. A mostly lock-step-iterating version that does more reads got me an extra 20 GBps on OpenCL (kinda freaky, that). GLSL does use the caches, so it's a good bit faster.
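The loop-nesting and value-reuse idea can be shown in miniature (plain Python standing in for the C code, with tiny made-up sizes; the speedup only materializes in the cache-bound C version, but the transformation is the same). The naive order computes one output at a time, re-reading the whole overlap for each; the reordered version loads each input value once and updates every output that uses it:

```python
# Cross-correlation of two images over a range of shifts, two loop orders.

def correlate_naive(a, b, shifts):
    # One output at a time: walks the entire overlap region per output.
    h, w = len(a), len(a[0])
    out = [[0.0] * shifts for _ in range(shifts)]
    for dy in range(shifts):
        for dx in range(shifts):
            s = 0.0
            for y in range(h - dy):
                for x in range(w - dx):
                    s += a[y][x] * b[y + dy][x + dx]
            out[dy][dx] = s
    return out

def correlate_reordered(a, b, shifts):
    # Input-driven order: each a[y][x] is loaded once and scattered
    # into all the outputs that use it, so adjacent outputs share reads.
    h, w = len(a), len(a[0])
    out = [[0.0] * shifts for _ in range(shifts)]
    for y in range(h):
        for x in range(w):
            v = a[y][x]                      # loaded once, reused below
            for dy in range(min(shifts, h - y)):
                row = b[y + dy]
                for dx in range(min(shifts, w - x)):
                    out[dy][dx] += v * row[x + dx]
    return out
```

Both compute the same sums; only the access order (and hence the cache behaviour in the C version) differs.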
The AMD OpenCL implementation in Stream SDK 2.1 doesn't support images on the HD 4850, and global memory accesses aren't cached (if I understood correctly; the generated assembly has texture fetches with the UNCACHED keyword, so probably!), which might be the cause of the low bandwidth. Guess I should buy an HD 5xxx / Nvidia card and test the performance with images (I hear they have real local memory as well :P). Testing with a faster CPU would be nice as well... Not that I have money for any of those, so pfft.
Some interesting things from my POV: Slow CPU code is slow. It's possible to reach at least 93% of L1 bandwidth on the CPU, though even that doesn't max out the ALU (56% of peak GFLOPS). And it's still kinda slow even when you do reach it. OpenCL is pretty nice for writing high-performance CPU code; it takes the pain out of SSE and parallelization. The setup is a pain though, and there's a 0.1 s binary program loading overhead. GPUs are kinda fast, even when hobbled by incomplete driver support. I'm stuck at 8% of peak GFLOPS because I can't read from memory fast enough (and the GLSL implementation doesn't reuse loaded values to compute several output values in parallel, so it's doing cache reads instead of register reads. Fixing that might well help.)
The algorithm used in the benchmark is massively amenable to optimization: it has an embarrassingly parallel map pass and a completely associative reduce pass, it's completely vectorizable, each output value is basically a single N-wide mul with a horizontal add, there's a large amount of shared reads between adjacent output values, and it benefits a lot from caching and prefetching. Any numbers you see about a benchmark like that should be taken with a huge grain of salt; it's just too easy to get 2x performance differences with small optimizations. The difference between my scalar implementation and the parallelized SSE implementation was 15%. The difference between the parallelized SSE implementation and the optimized implementation was 1700%. The GPU performance went from a pessimized version that ran at 20 GBps to a regular version at 85 GBps to an optimized version at 240 GBps to a GLSL version at 337 GBps. And the OpenCL GPU number is very sensitive to the data access pattern, the problem size and the size of the OpenCL workgroup: I'm getting 240 GBps with 64 workers on 512x512 images, 190 GBps with 32 workers, 180 GBps with 128 workers, 30 GBps on 500x500 images (which rises to 150 GBps by changing the stride size), 150 GBps on 256x256 images, etc. The GLSL version is much more stable with regard to input size, though it too suffers from lower performance at smaller input sizes.
The CPU is less prone to wild swings in throughput from small changes, which is nice. The only way to get decent performance on this algorithm seems to be memory access optimization, SSE, parallelization and collecting several values at a time. So you pretty much need to write it in C / C++ with OpenMP and SSE intrinsics, or OpenCL. And if you're writing it in OpenCL, why not get the GPU in the mix as well...
GPUs would be so much better if they didn't hard-lock when encountering an infinite loop. Over the past few days I've done Alt-SysRq-REISUB more times than I can remember. The OpenCL GPU implementation on the HD 48xx is ... finicky. Small changes in code can lead to either the GPU hard-locking, a 10x performance drop, or just plain incorrect behaviour. I'd like to see either GPUs with CPU levels of usability or CPUs with GPU levels of performance. Or both. I'm not picky.
Based on my initial experience, I'd say that if your algorithm is amenable to GPUs (i.e. it does lots of single-precision FP math or needs high memory bandwidth), it's not too difficult to have a naive GPU implementation that performs better than a naive CPU implementation, especially if your CPU implementation is in a slow language. In my case, optimizing the CPU and GPU implementations happened in parallel and the optimizations were quite often portable between the different implementations. The CPU was easier to optimize for and had a much better development experience (i.e. no lockups), but my cheapo CPU didn't have the performance to match the GPU at any point. The CPU-GPU performance delta went from 15x between naive implementations (40x with GLSL) to 4x between the optimized implementations.
In terms of price-performance, a $100 GPU and a $100 quad-core CPU (hmm, should you include the memory price in the CPU prices?) should perform about the same on this algorithm when using OpenCL. With GLSL, the GPU would be around 50% faster. A 2.9 GHz Athlon II X4 should get something like 220 GBps (247 GBps L1 bandwidth) and a 3 GHz quad-core Core i7 around 175 GBps (192 GBps L1 bandwidth). At the $200 price point, the gap between the GPU and the CPU will likely increase: you get around 2x the GPU L1 bandwidth, but only 50% extra CPU L1 bandwidth (with a 6-core Phenom; opting for a higher-clocked quad-core CPU would likely show very little benefit, you'd only get +20% out of going from 2.9 GHz to 3.5 GHz, for example). But lacking the hardware to test, this is mere speculation.
Update #0: Source code.
Update #1: GPU at 86 GBps. Managed to drop the CPU runtime from 49 s (5.5 GBps) to 30 s (9 GBps) by making the inner loop update the values in blocks. The intuition is that you get better cache locality if you're updating a bunch of values that use data from the same area. It also returns bad values for some input sizes, so it probably doesn't work right.
Update #2: The OpenCL GPU kernel is kinda hard to optimize; most of my naive tries ended up hurting performance. But I did manage to get the performance up by 40% using a similar block-based approach as on the CPU (using a local cache array instead of relying implicitly on L1). Currently at 118 GBps with roughly the same optimization effort as the CPU implementation. I don't really know much of anything about optimizing OpenCL though, so I think the number could go considerably higher. On the CPU, there's maybe another 10 GBps squeezable out; I'll have to see about that too.
Update #3: GPU at 165 GBps (note that this is on an HD 4850 that has 40% of the memory bandwidth of the GTX 285; if the results scale linearly, the GTX 285 would get 391 GBps). CPU at 27 GBps, but the CPU results are wrong, so I need to fix them to see if that's a real score.
Update #4: Fixed the optimized CPU version. It's at 22 GBps, but could probably go to 27 GBps with little work. The upper limit for the CPU version is 42 GBps where all reads come from L1. The L2 limit is less than half the L1 limit, so the optimized version is already beating pure L2 bandwidth.
Update #5: According to this article on iXBT Labs, the L2 bandwidth of the HD 4850 is 384 GBps and the L1 bandwidth is 480 GBps, so those would be the limits for the GPU score. If the CPU scores are any indication, 400 GBps lies within the realms of possibility.
Update #6: Wow, a less math-heavy version of the OpenCL kernel absolutely flies on the CPU. 33 GBps. That's equivalent to 8 cycles per cache line. If the Nehalem was doing 10 cycles per cache line, it might go 20% faster with this version and reach 187 GBps. Anyone with an 8-core 2.93 GHz Nehalem want to give it a go (:?
The difference between the GPU version and the CPU version is that the CPU version does the dot product with an accumulator vector and does the horizontal add at the very end of the loop. This is because a float4 dot product is a mul + horizontal add + float add on the CPU, and that's going to take 6 cycles or so, whereas 'acc += v0 * v1' executes in one cycle. The GPU does the dot product in a single cycle and the add in another, so it's just 2 cycles. I tried running the CPU version on the GPU, but it rubs it the wrong way and results in the GPU hanging (or just running at 0.1 GBps through a benchmark designed for a thousand times that). And the ATI Linux drivers don't do the Windows 7 thing of resetting the GPU if it goes unresponsive for several seconds.
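The accumulator-vs-horizontal-add difference in miniature, with Python lists standing in for float4 registers (the cycle counts above are estimates from the text; nothing here measures them):

```python
# Dot product over a stream of float4s, two ways.

def dot_horizontal(v0s, v1s):
    # GPU-style: a full dot() (mul + horizontal add) on every iteration.
    total = 0.0
    for v0, v1 in zip(v0s, v1s):
        total += sum(x * y for x, y in zip(v0, v1))
    return total

def dot_accumulator(v0s, v1s):
    # CPU-style: acc += v0 * v1 per iteration (one SSE mul-add),
    # with a single horizontal add at the very end of the loop.
    acc = [0.0, 0.0, 0.0, 0.0]
    for v0, v1 in zip(v0s, v1s):
        acc = [a + x * y for a, x, y in zip(acc, v0, v1)]
    return sum(acc)
```

Both produce the same result; the accumulator version just moves the expensive horizontal add out of the loop body.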
There's also a weird data alignment issue (or something) on the GPU: 496x496 images run at 152 GBps, 500x500 images run at 145 GBps, 512x512 runs at 165 GBps. The CPU doesn't care all that much.
Update #7: While trying to optimize the GPU kernel, I managed to get the CPU kernel up to 42 GBps by refactoring a bit. That's nuts. That's a four-year-old 2.12 GHz Core 2 Duo beating the paper's two-thread 2.93 GHz Nehalem version by 13%. Maybe I should write a paper titled "Believe it or Not! 2 GHz Core 2 CPUs can Match 3 GHz Nehalems and POWER7s for FLOP-intensive Applications!"
Update #8: Did a few bogus GPU kernels to figure out where the realistic limits are. If the 160 GBps kernel did all its reads from L1, it could achieve 270 GBps.
Update #9: GPU kernel at 182 GBps, achieved by breaking each row into its own extra work unit. Fixed the algorithm as well: the paper was offsetting both images on Y, but the idea is to keep one image static and move the other image over it. Anyhow, that had no effect on performance. It's funny how an extra 15 GBps on the GPU is kinda meh, but a 7 GBps increase on the CPU is WHOAH AWESOMESAUCE. Hurr durr, still lacking the L1 breakthrough. If the new kernel did all its reads from L1, it would do 350 GBps.
Update #10: Oh right, += isn't atomic. Now I get to rewrite things.
Update #11: Actually, += not being atomic wasn't the problem. The problem was that the HD 4850 OpenCL implementation only gives you 16k of local mem. And that's across all the workgroups. And I can't get it to work. Oh well. Anyhow, optimized work group size and got 190 GBps from the GPU (including data transfer to-from main mem), and the kernel bandwidth broke 200 GBps! Woop!
Update #12: CPU kernel at *drumroll* 78 GBps! Insane! That's equivalent to a single quad-core Nehalem socket. From a previous-gen chip with 66% of the clock-speed, half as many cores and quarter of the cache. Applying the same optimizations to the GPU kernel also got a nice 30 GBps performance boost, bringing it up to 235 GBps.
Update #13: Tuned the GPU kernel a bit and got 250 GBps at a 640x640 problem size. The kernel is doing 260 GBps internally. Still need to eke out 16 GBps and it'd match the paper's Power7 numbers. Man, that thing is a beast. Or maybe they were using a version that was more optimized for it than for the Nehalem. My CPU kernel should do, hmm, maybe 500 GBps on the Nehalem, if it scales in a linear fashion. Who knows. [Update: 92% of an 8-core Nehalem's 370 GBps L1 bandwidth would be 340 GBps.]
Update #14: Fixed the kernels, as I had optimized correctness away. 6 GBps drop for the CPU, GPU speed dropped 10-20 GBps.
Update #15: CPU kernel at 84 GBps, GPU kernel at 220 GBps. The CPU kernel is now about as fast as the naive GPU kernel, but still 3x slower than the optimized one. And the CPU kernel produces different results from the GPU kernel and the reference implementation because it does the additions in a different order.
Update #16: GPU kernel at 207 GBps (500x500), 245 GBps (512x512), 274 GBps (640x640), with internal kernel bandwidth reaching 285 GBps on 640x640. The 640x640 bandwidth is about the same as the Power7 number in the paper, yay!
Update #17: 500x500 kernel at 243 GBps. Also, I was previously removing the context creation and kernel build times from the OpenCL timings, but not the context and kernel release times. That meant the results weren't really descriptive of real multi-image run performance, as both creation and release are done only once regardless of how many images you process. The creation overhead is 0.4 s and the release overhead is around 0.3 s, which will skew the numbers for a test that takes ~1 s to run. To make the times more accurate for multi-image runs, I'm now deducting both the creation and release overheads from the OpenCL runtimes. The runtimes do include the data read and write overheads, as those are image-specific. (Hey, now all my OpenCL implementations have 0.3 s shorter times, yay?)
Update #18: Wrote a GLSL version of the GPU kernel to see if it might be able to use the texture caches better. Short answer: Yes. 204 GBps on a naive kernel.
Update #19: GLSL version at 337 GBps for 500x500 images (350 GBps on 640x640). 70% of L1 bandwidth. Cached memory reads sure are awesome. The performance improvement came from reading in 2x2 blocks. The reason why it's faster is that a texture memory read also loads the adjacent values to cache. It might also be possible to use a similar load-reusing algorithm as I have on the CPU and the OpenCL implementations and make the GLSL version faster. But GLSL is painful for general computation, so maybe not.
By my calculations they achieved 160 GBps read bandwidth from the GTX 285 (advertised memory bandwidth 159 GBps), 150 GBps from the Nehalem (advertised peak 2x22 GBps) and 276 GBps from the eight-core Power7 (advertised peak 100 GBps). The algorithm is doing 1 cycle of computation (a SSE mul-add) per 32 bytes of memory read, so it's pretty much a pure bandwidth benchmark. As the total problem size is 8.25 MB, it snugly fits inside the 8.38 MB L3 caches of the CPUs, so maybe the CPU L3 caches are made out of infinite bandwidth magic?
But, uh, the L3 latency on the Nehalem is 39 cycles, which'd give it 37.5 GBps bandwidth at 8 cores. 150 GBps is 10 cycles per cache line. The L2 latency on the Nehalem is 11 cycles. Plus you actually have to do computation for at least four cycles per cache line, though that's going to be masked by prefetch. How can it get 150 GBps?
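The conversion between aggregate bandwidth and cycles per 64-byte cache line used in these estimates is simple arithmetic (clock and core count as quoted for the paper's Nehalem; note that a naive 39-cycle-latency model gives ~38 GBps with these inputs, in the same neighbourhood as the 37.5 GBps figure above):

```python
def cycles_per_line(bandwidth_Bps, clock_hz, cores, line_bytes=64):
    # Core-cycles that elapse per cache line at a given aggregate bandwidth.
    lines_per_sec = bandwidth_Bps / line_bytes
    return cores * clock_hz / lines_per_sec

def bandwidth_at_latency(latency_cycles, clock_hz, cores, line_bytes=64):
    # Aggregate bandwidth if every line costs its full load latency
    # (i.e. no overlap or prefetch at all).
    return cores * clock_hz / latency_cycles * line_bytes
```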
The paper had a table that compared the different stages of optimization. A single-threaded scalar version ran in 30 s, compared to a single-threaded SSE version that ran in 12 s. A 16-threaded SSE version took 1.8 seconds, which is 6.7x faster than the single-threaded one. Pretty nice with 8 cores. But this is a memory benchmark. How does a memory benchmark get 7x faster by increasing the thread count? That's only going to happen if each core has its own copy of the memory block it's working on. The same thing happened on the Power7 implementation: SIMD + OpenMP resulted in a 14x performance boost(!) So maybe their algorithm is really good at caching, who knows. There's no compilable source code in the paper so...
I did a quick reimplementation of the algorithm in C++ and got 69 s for the scalar version and 58 s for the SSE version on my Core 2 Duo 2.12 GHz. Parallelizing it by tossing a '#pragma omp parallel for' above the top-level loop brought the SSE execution time down to 49 s. Why so slow? The 8 MB of input images don't fit in the CPU cache, turning the program from a cache bandwidth benchmark into a memory bandwidth one. By quick calculation, it's plowing through the benchmark at 5.6 GBps.
[Update: A more cache-optimal memory access pattern got the CPU up to 84 GBps, so it's definitely possible to do most reads from L1.]
Now, I also tested the algorithm on uninitialized allocations, and the results were very strange. On uninitialized data, the SSE version completed in 13 seconds and the parallel version in 7.5 seconds, for a bandwidth of 37.5 GBps. That's a 7-cycle latency. L2 latency is 14 cycles. It's actually reading from L1.
So how can two 4 MB images fit into 32 kB of L1 cache? The easy answer: they don't. I think what we're seeing here is the Linux VM subsystem doing optimistic allocation. That is, when you allocate a block of memory, all the 4 kiB pages in that block map to The Empty Page at first; the pages are only actually created when you write to them. So as all the pages of a 4 MB image actually map to the same empty page of RAM, the benchmark acts as if the whole image is in L1.
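You can poke at the demand-zero behaviour from Python: an anonymous mapping reads back as zeros before anything is written to it, because no distinct physical pages exist yet. (Whether the kernel backs those reads with a single shared zero page is a Linux implementation detail this snippet can't observe; it only shows the zero-fill-on-demand semantics.)

```python
import mmap

SIZE = 4 * 1024 * 1024        # one "image" worth of memory
buf = mmap.mmap(-1, SIZE)     # anonymous mapping; no pages written yet
untouched = buf[SIZE - 1]     # a read from a never-written page sees zero
buf[0] = 1                    # the first write forces a real page allocation
```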
I next wanted to check out the GPU numbers and wrote an OpenCL version of the algorithm. Which also gave me an excuse to learn OpenCL. I downloaded the ATI Stream SDK and a couple hours later had a working OpenCL version. There's no Linux installer in the package, so you get to do a strange dance setting LD_LIBRARY_PATH and ATISTREAMSDKROOT environment variables.
Writing the kernel was simple. I pretty much copy-pasted the C++ version of the algorithm into the OpenCL kernel (only two things differ between the versions: vec4 turned into float4, and offset loop counters are now job ids). The state setup you have to do to execute the kernel was quite tedious though, weighing in at 30 lines of code.
The nice thing about OpenCL is that you can also run the kernels on the CPU. And, well, the CPU build of the OpenCL kernel actually ran faster than my parallelized SSE version, which is pretty awesome. The bad thing about OpenCL is that it's like C and OpenGL when it comes to error reporting: it fails silently and you have to milk the errors out. And when you manage to do an infinite loop, the GPU hangs and you have to cold-boot the computer. Additionally, when you're running something time-consuming on the GPU, it freezes compiz for the duration of the computation and you're left twiddling your thumbs. It would be really REALLY nice if the GPU could do pre-emptive multitasking. I've had to Magic-SysRq-boot my computer 4 times today. It's so Windows 95.
But what about the runtimes for the GPU? My first version took 3.5 seconds on a Radeon HD 4850: 14 times as fast as the CPU implementation and twice as fast as the bogus uninitialized version, though you also have to account for a 0.6 s kernel compilation and setup overhead. Caching a copy of the built kernel binary to disk and reusing it brought the GPU overhead down to 0.2 s. The CPU OpenCL overhead was 0.2 s with compilation, 0.1 s with a binary kernel.
The HD 4850 has a memory bandwidth of 67 GBps, but the benchmark achieves 91 GBps. Did I make an error in my bandwidth calculations? Or might it cache more of the input image than the CPU does? Time to write a small memory benchmark... Yeah: 10.3 seconds to do 69 billion 16-byte reads from a 4 MB buffer is 106 GBps. Guess it caches better than the CPU (which managed 7.6 GBps).
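Double-checking that number (it comes out at ~107 GBps with these inputs, so the 106 GBps above is presumably just rounded somewhere along the way):

```python
def measured_bandwidth(reads, bytes_per_read, seconds):
    # Bandwidth implied by a timed read loop.
    return reads * bytes_per_read / seconds

# The GPU memory benchmark above: 69e9 16-byte reads in 10.3 s.
bw = measured_bandwidth(69e9, 16, 10.3)
```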
By spending the same amount of time optimizing the CPU and GPU implementations, I managed to drop the realistic-case CPU runtime further, to 12 s (23 GBps), by making the inner loop update the values in blocks. The intuition is that you get better cache locality if you're updating a bunch of values that use data from the same area. This kind of thing could let the Nehalem keep its reads mostly in L2, which might well get it close to 150 GBps. [Update: As noted above, you can keep the reads mostly in L1, which might get the dual Nehalem to ~340 GBps, if the Core 2 results scale.]
Applying the same optimization to the GPU version brought its bandwidth up to 165 GBps. That's 10% faster than the $2600 Nehalem duo. On a $100 graphics card. And that's not even close to the 384 GBps L2 bandwidth limit for the HD 4850, so there is probably quite a bit of headroom there. [Update: Well, if you could use the cache with the HD 4xxx OpenCL implementation...]
To wrap up: Benchmarking a cache-optimized algorithm against an algorithm that reads from the main memory is kinda pointless. Or was the goal of the paper to say "Hey, we've got an awesome optimizing compiler that can reorder loops to be very cache-optimal"? Also, I should remember to initialize buffers when benchmarking. And a $100 graphics card trounces $2600 worth of CPU power in this benchmark. Finally, that $100 graphics card sure could use some damn hardware support for pre-emptive multitasking.
I put the sources on GitHub, go check 'em out.
Estimated bandwidth use for the benchmark:
float_sz = 4
out_size = 250 * 250 = 62500
reads_per_out = 2 * (500-0.5*250) * (500-0.5*250) * 4*float_sz
= 2 * 375*375 * 4*float_sz
writes_per_out = float_sz
total_reads = out_size * reads_per_out
total_writes = out_size * writes_per_out
total_bw = out_size * (reads_per_out + writes_per_out)
= 62500 * (4500000 + 4)
= 281 GB
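The same estimate as executable arithmetic (the 4*float_sz factor assumes each pixel is read as a 16-byte float4, as in the figures above):

```python
float_sz = 4                                   # bytes per float
out_size = 250 * 250                           # number of output values
# 2 images, ~375x375 average overlap, one float4 per pixel.
reads_per_out = 2 * 375 * 375 * 4 * float_sz   # bytes read per output value
writes_per_out = float_sz                      # bytes written per output value
total_bw = out_size * (reads_per_out + writes_per_out)
```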