## Saturday, June 13, 2015

### SIMD Fundamentals. Part II: AoS, SoA, Gather/Scatter - Oh my!

Last time we looked at a very basic example of a data parallel problem: adding two arrays. Unfortunately, data parallelism is not always so easy to extract from existing code bases and often requires considerable effort. Today we will therefore address a slightly more complicated problem and move step by step from a pure scalar to a fully vectorized version.

## The problem and its SISD solution

Imagine we wanted to compute the squared ℓ²-norm of a set of m 3-vectors u in R³, what is really just the dot product of each vector x with itself: ‖x‖² = x² + y² + z²

In good OO fashion, you'd probably first define a Vec3f data type, in its simplest form like so (F# syntax)

Our vector field u can then simply be represented by an array of these structs and the code that performs the computation of the dot products is

Load the three components, square them, add up the squares and store the result in dp:
For SSE2, that's a total of three MULSS and two ADDSS instructions, so the one dot product we get out is worth 5 FLOPs. Now we would of course like to make use of our shiny SIMD hardware to accelerate the computation. There are a few options of how we can approach this:

## Pseudo-SIMD: Bolting SIMD on top of AoS

Our current data layout is an "array of structures" (AoS). As we shall see further below, AoS isn't a particular good choice for vectorization and SIMDifying such code won't yield peak performance. Yet it can be done and depending on the used SIMD instruction set it may even provide some speed-up.

Here's the idea: Instead of loading the three scalar components into separate scalar registers, we store all of them in a single SIMD register. In case of SSE2, n = 4 for single precision floats, so one of the elements in the register is a dummy value that we can ignore.  Now we can multiply the SIMD register with itself to get the squared components (plus the ignored fourth) one. Then we need to horizontally add up the three squares that we are interested in; in pseudo-F#:

Graphically:

Although we were able to replace three MULSS with a single MULPS and therefore express 5 FLOPs through only 3 arithmetic instructions, there are multiple issues with this idea:
1. We need three scalar loads plus shuffling operations to place the three components in a single SIMD register.
2. We effectively only use 3 of the 4 vector lanes, as we operate on Vec3f, not Vec4f structs. This problem becomes even more severe with wider vector paths.
3. We use vector operations only for multiplication, but not for addition.
4. Horizontal operations like adding the components are comparatively slow, require shuffling the components around and extracting a single scalar (the result).
5. We still only compute one dot product per iteration.
Another example for this kind of pseudo-vectorization can be found in Microsoft's SIMD sample pack (ray tracer sample). Yet, for all the reasons mentioned above, you should avoid this approach whenever possible.

## On-the-fly vectorization

If you recall our "Hello, world!" example from Part I, then you may also remember that vectorizing our loop meant that we computed n results per iteration instead of one. And that is what we have to achieve for our current dot-product example as well: We need to compute n dot products in each iteration. How can we accomplish this?

Imagine our data (array of Vec3f structs) to form a 3×m matrix (three rows, m columns). Each column represents a Vec3f, each row the x, y or z components of all Vec3fs. In the previous section, we tried to vectorize along the columns by parallelizing the computation of a single dot product—and largely failed.

The reason is that the data parallelism in this example can really be found along the rows of our imagined matrix, as each dot product can be computed independently and thus in parallel from each other. Furthermore m, the number of Vec3fs, is typically much larger then n and so we don't face problems in utilizing the full SIMD width. The wider the SIMD vector is, the more dot products we can compute per iteration.

As with our example from last time, the vectorized version of the algorithm is actually very similar to the plain scalar one. That's why vectorizing a data-parallel problem isn't really hard once you get the idea. The only difference is that we don't handle individual floats, but chunks of n floats:
1. Load n x values and square them
2. Load n y values and square them
3. Load n z values and square them
4. Add the n x, y and z values
5. Store the n resulting dot products
In pseudo-F# the algorithm is now

Graphically:

That good part about this version is that it uses vector arithmetic instructions throughout, performing n times the work of its scalar counterparts, performing a total of 20 FLOPs each iteration.

The bad part is the one labeled "scalar loads & shuffling" in the picture above: Before we can compute the n results, we have to gather n x, y and z values, but our data is still laid out in memory as an AoS, i.e. ordered [xyzxyzxyzxyzxyz...]. Loading logically successive [x0 x1 x2 x3 ...] values thus requires indexed/indirect load instructions (vector gather). SIMD instruction sets without gather/scatter support, like SSE2, have to load the data using n conventional scalar loads and appropriately place them in the vector registers, hurting performance considerably.

## Approaching nirvana: Structure of arrays (SoA)

To avoid this drawback, maybe we should just store the data in a SIMD-friendly way, as a Structure of Arrays (SoA): Instead of modeling the vector field u as a number of Vec3f structs, we consider it to be an object of three float arrays:

This is indeed very similar to switching from a column-major to row-major matrix layout, because it changes our data layout from [xyzxyzxyz...]. to [xxxxx...yyyyy...zzzzz...].

Observe how this implementation differs from the previous one only in that we index the components instead of the vectors:

And yet it saves us a lot of loads and shuffling, resulting in pure, 100% vectorized SIMD code:
As this version really only replaces the scalar instructions with vector equivalents but executes 20 FLOPs in each iteration, we should now indeed get about 4x the performance of the scalar version and a much more favourable arithmetic-to-branch/load/store ratio.

In fact, once the hard part of the work is done (switching from AoS to an SoA data layout), many compilers can even automatically vectorize loops operating on that kind of data.

## Conclusions

Vectorizing OO code needs some getting used to, the SoA view of the world may be a bit alien to non-APL programmers. If you can foresee the need for vectorizing your code at some point in the future, it may be wise to use an SoA layout from the get-go instead of having to rewrite half of your code later on. Experience certainly helps in identifying data-parallel problems; but as a rule of thumb good candidates for extracting data parallelism are those problems where thinking in terms of operations that apply to a whole array/field of items comes naturally.

In part III we are going to discuss concrete implementations of the above concepts using F#, RyuJIT and System.Numerics.Vectors and compare the performance of the different versions—once they sorted out this issue.