Parallelism in modern CPUs
In recent years, CPUs have gained performance mainly by increasing parallelism on all levels of execution. The advent of x86 CPUs with multiple cores established true thread-level parallelism in the PC world. Solving the "multi-core problem", the question of how to distribute workloads between different threads (ideally semi-automatically) in order to use all that parallel hardware goodness, quickly became and remains one of the predominant goals of current hardware and software research.

Years before the whole multi-core issue started, CPUs already exploited another kind of parallelism to increase performance: instruction-level parallelism (ILP), via pipelining, superscalar and out-of-order execution, and other advanced techniques. The nice thing about this kind of parallelism is that your average Joe Developer doesn't have to think about it; it's all handled by smart compilers and smart processors.
But then in 1997 Intel introduced the P55C and with it a new 64-bit-wide SIMD instruction set called MMX ("multimedia extension"; after all, "multimedia" was "the cloud" of the 90s). MMX made available a level of parallelism new to most PC developers: data-level parallelism (DLP). Contrary to ILP, however, DLP requires specifically designed code to be of any use. This was true for MMX and it remains true for its successors, such as Intel's 256-bit-wide AVX2. Just as with multithreading, programmers need to understand how to use these capabilities properly in order to exploit all that floating-point horsepower.
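If you are curious what your own machine offers, .NET's System.Numerics API can tell you: Vector.IsHardwareAccelerated reports whether the JIT uses SIMD registers at all, and Vector<float>.Count gives the number of floats per register (8 on a 256-bit AVX2 machine). A minimal sketch (SimdProbe is just a made-up name for this example):

using System;
using System.Numerics;

static class SimdProbe
{
    static void Main()
    {
        // Vector<T> is sized to the widest SIMD registers the JIT supports:
        // 4 floats with 128-bit SSE, 8 floats with 256-bit AVX2.
        Console.WriteLine("Hardware accelerated: " + Vector.IsHardwareAccelerated);
        Console.WriteLine("floats per vector:    " + Vector<float>.Count);
    }
}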
From SISD to SIMD
Whenever I learn new concepts, I like to first think of examples and generalize afterwards (inductive reasoning). In my experience that's true for most people, so let us start with the "Hello, world!" of array programming: Suppose you have two floating-point arrays, say xs and ys of length m, and, for some reason, you want to add those arrays component-wise and store the result in an array zs. The conventional "scalar" or SISD (Single Instruction Single Data, see Flynn's taxonomy) way of doing this looks like this (C# syntax):

for (var i = 0; i < m; ++i) { zs[i] = xs[i] + ys[i]; }
Fetch two floats, add them up, and store the result in zs[i]. Easy.
For this specific example, adding SIMD (Single Instruction Multiple Data) is almost trivial. For simplicity, let us further assume that m is evenly divisible by n, the vector lane width. Now, instead of performing one addition per iteration, we add up xs and ys in chunks of n items each; in pseudo-C# with an imagined array range expression syntax:
for (var i = 0; i < m; i += n) { zs[i:(i+n-1)] = xs[i:(i+n-1)] + ys[i:(i+n-1)]; }
We fetch n items from xs and n items from ys using vector load instructions, add them up using a single vector add, and store the n results in zs using a vector store. Compared to the scalar version, we decode fewer instructions and perform n times the work in each iteration. Think of each SIMD register as a window, n values wide, into an arbitrarily long stream (array) of scalar values. Each SIMD instruction modifies n scalar values of that stream at once (that's where the speed-up comes from) before the window moves on to the next chunk of n values.
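For reference, here is one way the loop above might look as real, runnable C#, using .NET's System.Numerics.Vector<T> type instead of the imagined range syntax; Vector<float>.Count plays the role of n, and a scalar tail loop even drops the assumption that m is evenly divisible by n. The helper name AddVectorized is made up for this sketch:

using System.Numerics;

static void AddVectorized(float[] xs, float[] ys, float[] zs)
{
    var n = Vector<float>.Count;   // lane width, e.g. 8 floats with AVX2
    var i = 0;
    // Main SIMD loop: two vector loads, one vector add, one vector store per iteration.
    for (; i <= xs.Length - n; i += n)
    {
        var sum = new Vector<float>(xs, i) + new Vector<float>(ys, i);
        sum.CopyTo(zs, i);
    }
    // Scalar tail for the remaining elements when m is not a multiple of n.
    for (; i < xs.Length; ++i)
        zs[i] = xs[i] + ys[i];
}

When Vector.IsHardwareAccelerated is true, the JIT should compile the vector operations above down to the corresponding SSE or AVX instructions.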
So SIMD really is just a form of array processing where you think in terms of applying one operation (single instruction) to a lot of different elements (multiple data): loop unrolling on steroids. And that's really all you need to know about SIMD.
Yet this example is perhaps the most obvious data-parallel case one could think of. It gets a whole lot more interesting once you add OOP, and the data layout this paradigm promotes, to the mix. We will look into the issue of vectorizing typical OOP code in the next part of this series.