Friday, April 11, 2014

NativeInterop is live on NuGet

Today I released a first version of my NativeInterop package on NuGet. You can find a description of the purpose and usage of the package as well as the source code on BitBucket.

The motivation to build and release this package grew out of two practical issues I encountered when building performance-critical C# code:

  1. Creating a native, high-performance generic array data structure in C# seems to be impossible (see A minimalistic native 64 bit array ...).
  2. Reading structured binary data from a byte[] requires some ugly hacks to get both decent performance and genericity (see e.g. Reading Unmanaged Data Into Structures).
The reason for both of these issues is that C# lacks an "unmanaged" type constraint, and thus you cannot express something like
static unsafe T Read<T>(T* p) {
    return *p;
}
But in F#, you can; you'd simply encode this as
open NativeInterop
let pVal = ptr
where ptr is of type nativeptr<'T> and 'T is constrained to unmanaged types.
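
For comparison, C++ can state a similar restriction at compile time. The following is only a minimal sketch and not part of NativeInterop: `read_unaligned` and `Header` are hypothetical names, and `std::is_trivially_copyable` is used as a rough stand-in for an "unmanaged" constraint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not from NativeInterop): copies a T out of a raw byte buffer.
// The static_assert plays the role that an "unmanaged" constraint would play in C#.
template <typename T>
T read_unaligned(const unsigned char* bytes, std::size_t offset = 0) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be trivially copyable to be read from raw bytes");
    assert(bytes != nullptr);
    T value;
    std::memcpy(&value, bytes + offset, sizeof(T));  // also valid for unaligned sources
    return value;
}

// Hypothetical record layout, used only for illustration.
struct Header {
    std::uint32_t magic;
    std::uint16_t version;
    std::uint16_t flags;
};
```

With this in place, `read_unaligned<Header>(buffer, 3)` would copy a `Header` starting at byte 3 of `buffer`, while instantiating the template with a type that is not trivially copyable (say, `std::string`) fails to compile.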

The performance offered by the NativeInterop package should be on par with non-generic unsafe C# code. The NativeInterop package also contains an implementation of NativeArray64, but this time without using inline IL. It turned out that in newer versions of the .NET Framework, the AGUs (address generation units) are utilized correctly for the address (offset) computation (instead of emitting IMUL instructions): calling NativePtr.set<'T>/get<'T>/set64<'T>/get64<'T> (or NativePtr.Get/Set or IntPtr.Get/Set or NativePtr.get_Item/set_Item, respectively) should all result in the generation of a single mov instruction.
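
To make the address computation concrete: an element access boils down to base + index * sizeof(T) with a 64-bit index. The sketch below shows the idea in C++; `NativeArray64Sketch` is a hypothetical name and is not the NativeArray64 type from the package. Whether a given compiler turns the access into a single mov depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>

// Hypothetical C++ analogue of a natively allocated array with 64-bit indexing.
template <typename T>
class NativeArray64Sketch {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain data types can live in raw native memory");

public:
    explicit NativeArray64Sketch(std::int64_t length)
        : data_(nullptr), length_(length) {
        assert(length > 0);
        // Zero-initialized block outside of any garbage-collected heap.
        data_ = static_cast<T*>(std::calloc(static_cast<std::size_t>(length), sizeof(T)));
        assert(data_ != nullptr);
    }
    ~NativeArray64Sketch() { std::free(data_); }

    NativeArray64Sketch(const NativeArray64Sketch&) = delete;
    NativeArray64Sketch& operator=(const NativeArray64Sketch&) = delete;

    // data_[index] is exactly base + index * sizeof(T).
    T get(std::int64_t index) const {
        assert(index >= 0 && index < length_);
        return data_[index];
    }
    void set(std::int64_t index, T value) {
        assert(index >= 0 && index < length_);
        data_[index] = value;
    }
    std::int64_t length() const { return length_; }

private:
    T* data_;
    std::int64_t length_;
};
```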

Monday, April 7, 2014

A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NET

Not being able to exploit the SIMD capabilities of today's processors is a major obstacle when implementing high-performance (e.g. numerical) applications in C# (or any other CLI language). While there is Mono.Simd, there is no solution for applications running on top of Microsoft's own runtime (CLR), despite popular demand ... until now!

With BUILD 2014, Microsoft released a new preview version of the next generation JIT compiler "RyuJIT" that, combined with a special SIMD library that can be installed via NuGet, supports SIMD intrinsics (only SSE2 for now, but AVX is in the works).

Finally! I couldn't wait to try out the new bits; thus I modified the C# version of my existing XRaySimulator* to make use of SSE2 by implementing a simple packet ray tracing technique, i.e. instead of tracing individual rays, this version traces bundles of 2x2 (SSE2) or 4x2 (AVX) rays. Because the rays in a bundle are largely "coherent", they typically hit the same objects (cache hit rate!).

The contenders

Currently there are a total of six different variants of the XRaySimulator:
  •  "C#": This is the baseline, scalar managed implementation.
  •  "C# adj. trav.": A further optimized version that exploits the fact that once a ray is inside a volume (finite element) mesh, it must hit a face of an adjacent element (hexahedron).
  •  "C#/SSE2": Like "C#", but using 2x2 (X-)ray packets; doesn't use "adjacency traversal" due to the high branching factor.
  •  "C++": A C++11 reimplementation of "C#"; I tried to stay as close as possible to "C#" while still using at least half-way decent, idiomatic C++.
  •  "C++ adj. trav.": Corresponds to "C# adj. trav."
  •  "C++/AVX": Vectorized version of "C++" using 4x2 ray bundles thanks to AVX.
Note that most of these implementations are to be considered "quick-and-dirty, yet somehow working hacks..." If you don't mind the ugliness, though, you may follow the links to BitBucket and have a look at the code (Visual Studio 2013 projects; you also need the latest Roslyn CTP in order to compile the C#/SSE2 branch).

Performance analysis

So, who wins? The following figure shows the performance of the different versions in million rays per second (MRay/s) rendering an FE model consisting of 28672 hexahedral elements (344064 triangles) at a resolution of 6400 x 4800 pixels on an Intel Core i7-2600K (3.4 - 4.2 GHz) with 32 GB DDR3 RAM running under Windows 8.1 Pro:

As expected, "C++/AVX" blasts away the rest of the pack with a stunning 13 MRay/s. And while "C++/AVX" delivers a speed-up of 5.2 over "C++", "C#/SSE2" only improves by a factor of 2.5 compared to "C#" and displays only insignificant performance gains over the optimized scalar version "C# adj. trav."

Now, given that SSE2 uses only 128-bit-wide vector registers compared to AVX's generous 256 bits, and given the generally much more aggressive optimizer of the Visual C++ compiler, it's not exactly surprising to see an obvious performance difference between the "C++/AVX" and "C#/SSE2" cases. Yet, I still would have expected the speed-up of "C#/SSE2" to reach a value a little closer to 4x instead of 2.5x. What's going on there?

According to Visual Studio's built-in profiler all of the implementations spend the majority of their time in the intersection routine of the AABB (axis-aligned bounding box) - which is a good thing, because this intersection test is very fast compared to a triangle intersection test. Thus the quality of the generated machine code for this method/function is critical for the overall performance of the renderer.
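
For readers unfamiliar with this kind of routine, here is a minimal, portable sketch of a packet-vs-AABB "slab" test in C++. It is not the XRaySimulator code: all names (`RayPacketSketch`, `AabbSketch`, `intersects_packet`) are hypothetical, the packet is fixed at 2x2 = 4 rays, and plain per-lane loops stand in for the SIMD instructions a vectorized version would use.

```cpp
#include <algorithm>
#include <array>
#include <limits>

constexpr int kLanes = 4;                    // one 2x2 ray packet
using Lane = std::array<float, kLanes>;      // one value per ray in the packet

// Structure-of-arrays layout: index [axis][ray], so each Lane maps onto one vector register.
struct RayPacketSketch {
    Lane origin[3];   // ray origins, per axis
    Lane invDir[3];   // reciprocal ray directions, per axis
};

struct AabbSketch {
    float min[3];
    float max[3];
};

// Slab test: intersect the parametric ray interval with the three axis-aligned slabs.
// Returns one hit flag per ray; a SIMD version would return a lane mask instead.
std::array<bool, kLanes> intersects_packet(const AabbSketch& box, const RayPacketSketch& rays) {
    Lane tNear, tFar;
    tNear.fill(0.0f);                                    // only hits in front of the origin
    tFar.fill(std::numeric_limits<float>::infinity());

    for (int axis = 0; axis < 3; ++axis) {
        for (int i = 0; i < kLanes; ++i) {               // this loop is the "vector" dimension
            const float t0 = (box.min[axis] - rays.origin[axis][i]) * rays.invDir[axis][i];
            const float t1 = (box.max[axis] - rays.origin[axis][i]) * rays.invDir[axis][i];
            tNear[i] = std::max(tNear[i], std::min(t0, t1));
            tFar[i]  = std::min(tFar[i],  std::max(t0, t1));
        }
    }

    std::array<bool, kLanes> hit{};
    for (int i = 0; i < kLanes; ++i) {
        hit[i] = tNear[i] <= tFar[i];                    // non-empty interval means the ray hits
    }
    return hit;
}
```

The inner loop contains only multiplies, subtractions, and min/max operations without branches, which is why this test maps well onto SSE2 or AVX and why the quality of the generated code for it matters so much.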

The source code of the C#/SSE2 version looks like this:
[Embedded C#/SSE2 source listing not available]

And here's the source for the C++/AVX version:
[Embedded C++/AVX source listing not available]

(Note: The C++ code uses a hard-coded vector lane width of 8 floats.)

Almost identical; yet, if you compare what both RyuJIT and Visual C++ make of these sources, you'll first notice that the machine code emitted by RyuJIT is much more convoluted and thus longer.

Most of that "additional stuff" that's going on in the C#/RyuJIT version seems to be related to null-pointer checks (both AABB and RayPack are classes and thus reference types). Still, I wonder if all those load/store operations are truly necessary.

Preliminary conclusions

It seems like Microsoft has finally awakened and is making the long overdue investments in .NET performance. Thanks Google and Apple! Although RyuJIT will still require a lot of optimizations, in particular with respect to the generated SIMD code, Redmond's latest moves are promising. A next generation JIT, SIMD support, AOT compilation using the Visual C++ optimizer backend... What will come next? GPGPU support? Large arrays? A decent, modern, performant desktop UI framework? True first-class support for F#?

The future is bright!

*XRaySimulator is a visualization tool that renders X-ray-like images of finite element models. It uses a modified ray tracing algorithm to compute the energy absorption within each intersected element based on the element's material properties. A BVH (bounding volume hierarchy) is used to speed up the intersection computation.
Details (German):

**The C# versions of XRaySimulator on BitBucket currently don't support saving the rendered image to a file. In older versions, I used to use Tao.DevIL, but that only works on x86 and the preview releases of RyuJIT only emit x64 machine code. The C++ versions use a custom TGA output filter.