Microsoft's JIT team just released CTP 5 of "RyuJIT", its next-generation JIT compiler for .NET. I just got around to testing it with the SIMD version of my X-ray simulator. The updated benchmark results look like this:
Yes, performance actually dropped with the latest release. Again, the AABB ray intersection test is the bottleneck and this time the generated machine code is even worse compared to CTP3:
Let's just hope this is a temporary regression.
Friday, November 7, 2014
Wednesday, July 16, 2014
Methods for Reading Structured Binary Data: Benchmarks & Comparisons
In my previous post I demonstrated how one can use the NativeInterop library to easily implement file I/O for structured binary data formats, STL in this particular case. In that example, we used NativeInterop.Stream.ReadUnmanagedStructRange to read a sequence of bytes from a file as an array of STL triangle structs.
When implementing ReadUnmanagedStructRange I had these design goals in mind:
- Ease-of-use (no user-side tinkering with unsafe code)
- Portability (after all, NativeInterop is a portable class library)
- Genericity (should work with any unmanaged data type)
- Efficiency
Choices
Implementing ReadUnmanagedStructRange (and its dual, WriteUnmanagedStructRange) essentially requires a method to convert a stream of bytes to a (managed) array of the target type (which is supposed to be unmanaged/blittable). In the context of reading binary STL, we would therefore need to somehow convert a byte[] to an STLTriangle[].

In a language like C, which guarantees neither type nor memory safety, we would simply treat the raw byte array (represented as a byte*) as an STLTriangle* and be done. And while that approach is possible in C# as well, it requires unsafe code at every point we want to access the STL data. Put differently: As there is no way to change the "runtime type" of a .NET array (certainly for good reasons!), the solution to our problem boils down to either...
- "Manually" construct objects of the target type from the byte[] data and store them in the result array, or...
- Unsafely copy the contents of the raw byte[] to an STLTriangle[] ("memcpy"), without explicitly or implicitly creating intermediate/temporary objects
The first, object-constructing option is covered by these methods:

- BinaryReader.ReadXXX: read individual fields directly from the stream
- BitConverter.ToXXX: convert small chunks of the byte[] to individual fields of the target data type
- Buffer.BlockCopy: read multiple fields of homogeneous type at once
- Marshal.PtrToStructure: interpret larger chunks of the byte[] as complete STLTriangle structs
For the second, memcpy-style option, the candidates are:

- Simple, unsafe C# code (*dst++ = *src++)
- Marshal.Copy
- memcpy (via P/Invoke)
- cpblk CIL instruction
- NativeInterop.Buffer.Copy using a custom blocked memcpy implementation (ReadUnmanagedStructRange uses this function under the hood)
BinaryReader.ReadXXX
Probably the most straightforward approach for parsing a binary file is using a BinaryReader to get the data for the individual fields of the target data structure and then create the data structure from that data. An implementation for reading an STL file might look like the following bit of code:

```csharp
public static void BinaryReaderRead(byte[] triBytes, STLTriangle[] tris)
{
    using (var br = new BinaryReader(new MemoryStream(triBytes)))
    {
        for (int i = 0; i < tris.Length; ++i)
        {
            var normX = br.ReadSingle();
            var normY = br.ReadSingle();
            var normZ = br.ReadSingle();
            var aX = br.ReadSingle();
            var aY = br.ReadSingle();
            var aZ = br.ReadSingle();
            var bX = br.ReadSingle();
            var bY = br.ReadSingle();
            var bZ = br.ReadSingle();
            var cX = br.ReadSingle();
            var cY = br.ReadSingle();
            var cZ = br.ReadSingle();
            var abc = br.ReadUInt16();

            tris[i] = new STLTriangle(
                new STLVector(normX, normY, normZ),
                new STLVector(aX, aY, aZ),
                new STLVector(bX, bY, bZ),
                new STLVector(cX, cY, cZ),
                abc);
        }
    }
}
```
Note that I'm using a MemoryStream here to simulate reading from a Stream object while avoiding disk I/O.
BitConverter.ToXXX
It will hardly surprise my dear readers when I reveal upfront that the BinaryReader approach is not exactly the most efficient one. The next optimization step could therefore be to read the whole file into a flat byte[] at once and extract the data using BitConverter:

```csharp
public static void BitConverterTo(byte[] triBytes, STLTriangle[] tris)
{
    for (int i = 0; i < tris.Length; ++i)
    {
        var offset = i * STLTriangle.Size;
        var normX = BitConverter.ToSingle(triBytes, offset);
        var normY = BitConverter.ToSingle(triBytes, offset + 4);
        var normZ = BitConverter.ToSingle(triBytes, offset + 8);
        var aX = BitConverter.ToSingle(triBytes, offset + 12);
        var aY = BitConverter.ToSingle(triBytes, offset + 16);
        var aZ = BitConverter.ToSingle(triBytes, offset + 20);
        var bX = BitConverter.ToSingle(triBytes, offset + 24);
        var bY = BitConverter.ToSingle(triBytes, offset + 28);
        var bZ = BitConverter.ToSingle(triBytes, offset + 32);
        var cX = BitConverter.ToSingle(triBytes, offset + 36);
        var cY = BitConverter.ToSingle(triBytes, offset + 40);
        var cZ = BitConverter.ToSingle(triBytes, offset + 44);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);

        tris[i] = new STLTriangle(
            new STLVector(normX, normY, normZ),
            new STLVector(aX, aY, aZ),
            new STLVector(bX, bY, bZ),
            new STLVector(cX, cY, cZ),
            abc);
    }
}
```
Here we convert chunks of four bytes to the single precision floating point fields of an STLTriangle, representing the components of the vertices and the normal vector. The last component is the 2-byte attribute count field.
Buffer.BlockCopy
We can further refine the previous approach by extracting not single floats from the byte[], but all vertex/normal data for each single triangle at once using Buffer.BlockCopy:

```csharp
public static void BufferBlockCopy(byte[] triBytes, STLTriangle[] tris)
{
    var coords = new float[12];
    for (int i = 0; i < tris.Length; ++i)
    {
        var offset = i * STLTriangle.Size;
        System.Buffer.BlockCopy(triBytes, offset, coords, 0, 48);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);
        tris[i] = new STLTriangle(
            new STLVector(coords[0], coords[1], coords[2]),
            new STLVector(coords[3], coords[4], coords[5]),
            new STLVector(coords[6], coords[7], coords[8]),
            new STLVector(coords[9], coords[10], coords[11]),
            abc);
    }
}
```
Unfortunately, Buffer.BlockCopy can only handle arrays whose elements are of primitive type (in this case, we copy from a byte[] to a float[]). We can therefore not use this approach to copy all bytes at once from the byte[] to an STLTriangle[].
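Both properties of Buffer.BlockCopy—that it counts in bytes and reinterprets them, and that it rejects arrays of non-primitive element type at runtime—can be seen in a small stand-alone snippet (my own illustration, not from the original post):

```csharp
using System;

// BlockCopy counts in bytes and reinterprets the raw bytes;
// here: 8 bytes -> two little-endian floats.
byte[] bytes = { 0, 0, 128, 63, 0, 0, 0, 64 }; // 1.0f and 2.0f
float[] floats = new float[2];
Buffer.BlockCopy(bytes, 0, floats, 0, 8);
Console.WriteLine(floats[0]); // 1
Console.WriteLine(floats[1]); // 2

// Arrays of non-primitive element type are rejected at runtime,
// even for simple value types like DateTime.
bool threw = false;
try
{
    Buffer.BlockCopy(bytes, 0, new DateTime[1], 0, 8);
}
catch (ArgumentException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```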
Marshal.PtrToStructure
The Marshal class provides a method PtrToStructure (or alternatively in generic form), which essentially reads an (unmanaged) struct from an arbitrary memory address:

```csharp
public static unsafe void MarshalPtrToStructureGeneric(byte[] triBytes, STLTriangle[] tris)
{
    fixed (byte* pBytes = &triBytes[0])
    {
        for (int i = 0; i < tris.Length; ++i)
        {
            var offset = i * STLTriangle.Size;
            tris[i] = Marshal.PtrToStructure<STLTriangle>(new IntPtr(pBytes + offset));
        }
    }
}
```
*dst++ = *src++
Essentially the same effect as with Marshal.PtrToStructure can be achieved with a little bit of unsafe C#:

```csharp
public static unsafe void UnsafePointerArith(byte[] triBytes, STLTriangle[] tris)
{
    fixed (byte* pbuffer = triBytes)
    fixed (STLTriangle* ptris = tris)
    {
        var pSrc = (STLTriangle*)pbuffer;
        var pDst = ptris;
        for (int i = 0; i < tris.Length; ++i)
        {
            *pDst++ = *pSrc++;
        }
    }
}
```
Here we treat the (fixed) byte input array as an STLTriangle*, which we can then dereference to store each triangle's data in the result array of type STLTriangle[]. Note that you cannot implement this kind of operation generically in C#, as C# does not allow dereferencing pointers to objects of arbitrary type.
Marshal.Copy
The aforementioned unsafe C# approach is already a form of memcpy, albeit an unoptimized one (we copy blocks of 50 bytes in each iteration); furthermore, it's specialized for copying from byte[] to STLTriangle[]. Marshal.Copy, on the other hand, is a little more generic, as it can copy any array of primitive elements to an arbitrary memory location (and vice versa):

```csharp
public static unsafe void ConvertBufferMarshalCopy(byte[] triBytes, STLTriangle[] tris)
{
    fixed (STLTriangle* pTris = &tris[0])
    {
        Marshal.Copy(triBytes, 0, new IntPtr(pTris), triBytes.Length);
    }
}
```
memcpy (P/Invoke)
As we already concluded that we essentially need an efficient way to copy one block of memory to some other location in memory, why not simply use the system-provided, highly optimized memcpy implementation?

```csharp
[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);

public static unsafe void ConvertBufferMemcpy(byte[] triBytes, STLTriangle[] tris)
{
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        memcpy(new IntPtr(dst), new IntPtr(src), new UIntPtr((uint)triBytes.Length));
    }
}
```
Easy. Of course, this introduces a dependency on Microsoft's C runtime library msvcrt.dll and is thus not platform independent.
cpblk
Interestingly, the CIL provides a specialized instruction just for copying blocks of memory, namely cpblk. A few years back, Alexander Mutel used cpblk in his nice performance comparison of different memcpy methods for .NET. To my knowledge, this instruction is currently not directly accessible from either C# or F# (not sure about C++/CLI, though). Yet, we can generate the necessary instructions using a bit of runtime code generation; here I'm using Kevin Montrose's excellent Sigil library:

```csharp
// parameters: src, dst, length (bytes)
static Action<IntPtr, IntPtr, uint> CpBlk = GenerateCpBlk();

static Action<IntPtr, IntPtr, uint> GenerateCpBlk()
{
    var emitter = Sigil.Emit<Action<IntPtr, IntPtr, uint>>.NewDynamicMethod("CopyBlockIL");
    // emit IL
    emitter.LoadArgument(1); // dest
    emitter.LoadArgument(0); // src
    emitter.LoadArgument(2); // len
    emitter.CopyBlock();
    emitter.Return();
    // compile to delegate
    return emitter.CreateDelegate();
}

public static unsafe void ConvertBufferCpBlk(byte[] triBytes, STLTriangle[] tris)
{
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        CpBlk(new IntPtr(src), new IntPtr(dst), (uint)triBytes.Length);
    }
}
```
While it's a bit clunky that code generation is required to get access to cpblk, the nice thing about it is that it's platform-independent: any CLI-compliant implementation must provide it. There is even hope that the next version of F# will come with an updated version of its pointer intrinsics library that will also feature access to cpblk (cf. Additional intrinsics for the NativePtr module and the corresponding pull request).
NativeInterop.Buffer.Copy
My own little NativeInterop helper library also has tools for moving around stuff in memory. For instance, the Buffer module contains a method Copy that copies an input array of type T to an array of type U. Both T and U must be unmanaged types (note: neither the C# compiler nor the CLR can enforce this constraint!). Under the covers, Buffer.Copy copies the input bytes to the result array using a custom block copy implementation. Using Buffer.Copy is simple:

```csharp
public static void ConvertBufferBufferConvert(byte[] triBytes, STLTriangle[] tris)
{
    NativeInterop.Buffer.Copy(triBytes, tris);
}
```
Unfortunately, I couldn't use any of the other mem-copy methods to implement Buffer.Copy, as they are either not generic enough (Marshal.Copy, Buffer.BlockCopy), introduce dependencies on native libraries (memcpy via P/Invoke), or require runtime code generation or unsafe C# code (cpblk, *dst++ = *src++), neither of which is available in Portable Class Libraries like NativeInterop.
I therefore experimented with different memcpy algorithms (different block sizes, different degrees of loop unrolling, with or without considering alignment...) and—for the time being—settled on a comparatively simple, aligned qword copying mechanism that offers decent performance on x64 with the current x64 JIT.
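The actual NativeInterop implementation isn't reproduced in this post; the following stand-alone sketch (my own, with made-up names) only illustrates the basic shape of such a qword copy—8 bytes per iteration plus a byte-wise tail—and, unlike the real version, deliberately ignores alignment:

```csharp
using System;
using System.Linq;

// Sketch of a qword (8-byte) block copy with a byte-wise tail.
// Note: unlike the aligned copy described in the text, this ignores alignment.
unsafe static void QwordCopy(byte* src, byte* dst, int count)
{
    int qwords = count / 8;
    ulong* s = (ulong*)src;
    ulong* d = (ulong*)dst;
    for (int i = 0; i < qwords; ++i)
        d[i] = s[i];                    // 8 bytes per iteration
    for (int i = qwords * 8; i < count; ++i)
        dst[i] = src[i];                // remaining 0..7 tail bytes
}

var input = new byte[50];               // one "STL triangle" worth of bytes
new Random(42).NextBytes(input);
var output = new byte[50];

unsafe
{
    fixed (byte* pSrc = input)
    fixed (byte* pDst = output)
    {
        QwordCopy(pSrc, pDst, input.Length);
    }
}

Console.WriteLine(input.SequenceEqual(output)); // True
```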
Multi-Threading
For some of the methods, we will also look at a multi-threaded variant. Essentially, all of those versions look similar to this (here: memcpy; note that memcpy takes the destination as its first argument):

```csharp
public static unsafe void ConvertBufferMemcpyMultiThreaded(byte[] triBytes, STLTriangle[] tris)
{
    var threadcount = Environment.ProcessorCount;
    var chunksize = triBytes.Length / threadcount;
    var remainder = triBytes.Length % threadcount;

    fixed (byte* pBytes = &triBytes[0])
    fixed (STLTriangle* pTris = &tris[0])
    {
        var tasks = new Action[threadcount];
        for (var i = 0; i < threadcount - 1; ++i)
        {
            var offset = i * chunksize;
            var newSrc = new IntPtr(pBytes + offset);
            var newDst = new IntPtr((byte*)pTris + offset);
            tasks[i] = () => memcpy(newDst, newSrc, new UIntPtr((uint)chunksize));
        }

        var finalOffset = (threadcount - 1) * chunksize;
        var finalSrc = new IntPtr(pBytes + finalOffset);
        var finalDst = new IntPtr((byte*)pTris + finalOffset);
        tasks[threadcount - 1] = () => memcpy(finalDst, finalSrc, new UIntPtr((uint)(chunksize + remainder)));

        Parallel.Invoke(tasks);
    }
}
```
Given that copying (large) chunks of memory should be a bandwidth-limited problem, multi-threading shouldn't help much, at least for efficient implementations. Methods with a lot of overhead might profit a little more.
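A quick back-of-the-envelope calculation (my own, based on the 25 GiB/s peak transfer rate quoted in the benchmark setup below) makes this concrete: the memory bus dictates a hard time floor for the largest problem size that no number of threads can beat.

```csharp
using System;

// Lower bound for a bandwidth-limited copy: every byte is read once and
// written once, so 2^20 triangles x 50 bytes (~50 MB) crosses the bus twice.
double peakBytesPerSec = 25.0 * 1024 * 1024 * 1024; // 25 GiB/s dual-channel peak
double bytesTouched = 2.0 * (1 << 20) * 50;         // read + write
double floorMs = bytesTouched / peakBytesPerSec * 1000;
Console.WriteLine($"{floorMs:F1} ms");              // ~3.9 ms, regardless of thread count
```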
Benchmark
As a benchmark problem, I chose to create fake STL data for 2^6 up to 2^20 triangles. As each triangle struct is 50 bytes, the total amount of data transferred (read + write) varies between approx. 6 kB and approx. 100 MB and should thus stress all cache levels.

Each method described above ran up to 10 000 times with a time-out of 5 s for each of the 15 problem sizes. To minimize interference from other processes, I set the process priority to "high" and kicked off a garbage collection before each round.
System.Diagnostics.Stopwatch, which I used to measure transfer times, has a resolution of 300 ns on my system. For very small working sets and the fastest methods, that's already too coarse. Of course, I could have measured the total time for, say, 10 000 runs and just computed the average. Yet, it turns out that even with sufficient warm-up, there are still enough outliers in the measured timings to skew the result. After some experimenting, I decided to use the mean of the values below the 10th percentile instead. That seems to be more stable than relying on the timings of individual runs and also excludes extreme outliers. Still, I wouldn't trust the measurements below 2^8 triangles (25 kB) too much.
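Expressed in code, the statistic is simply the mean of the fastest roughly 10% of the samples. This is my own minimal sketch of the idea, not the actual benchmark harness:

```csharp
using System;
using System.Linq;

// "Mean of the values below the 10th percentile": average only the fastest
// ~10% of samples to suppress outliers from other system activity.
static double LowTailMean(double[] timings)
{
    int k = Math.Max(1, timings.Length / 10); // keep at least one sample
    return timings.OrderBy(t => t).Take(k).Average();
}

// Fabricated example timings: the outliers (9.0, 50.0) would wreck a plain mean.
var samples = new[] { 1.0, 1.1, 1.0, 1.2, 9.0, 1.0, 1.1, 1.0, 1.3, 50.0 };
Console.WriteLine(LowTailMean(samples)); // 1 (fastest sample of the ten)
```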
I ran the benchmarks on an Intel Core i7-2600K @ 3.4 - 4.2 GHz and 32 GiB of DDR3-1600 RAM (PC3-12800, dual channel; peak theoretical transfer rate: 25 GiB/s) under Windows 8.1 Pro. Both the CLR's current x64 JIT compiler "JIT64" and RyuJIT (CTP 4) got the opportunity to show off their performance in the benchmarks.
Results & Analysis
First, let's have a look at the results from JIT64 (click for full-size):

Ignoring the multithreaded versions for now (dashed lines), it is immediately clear that memcpy (P/Invoke) offers great overall performance for both the smallest and the largest data sets. Marshal.Copy and cpblk come in as a close second. Unsafe C# (*dst++ = *src++) offers stable, but comparatively poor performance.
NativeInterop's Buffer.Copy doesn't even come close to the fastest methods for small data sets, but offers comparable performance for larger sets. Something in its implementation is generating way more overhead than necessary... That "something" turns out to be the allocation of GCHandles for fixing the managed arrays: F# 3.1 doesn't support a "fixed" statement like C# does. To check whether that truly could be the source of the overhead, I implemented a version of the cpblk method that uses GCHandle.Alloc instead of fixed and—lo and behold—this slowed cpblk down to approximately the same speed as Buffer.Copy (light blue).
Unsurprisingly, the multithreaded versions come with quite some overhead, so they can't compete below 500 kB. Only for problem sizes that fit into Sandy Bridge's L3 cache do we see some nice scaling. Problem sizes beyond the L3 cache size are again limited by the system's DRAM bandwidth and don't profit from multi-threading.

The performance of all the other (non-memcpy-like) methods is abysmal at best. We'd best just skip them...
How does this picture change once we switch to an (early!) preview of the upcoming RyuJIT compiler? As it turns out, not by much; yet something interesting is still happening:
Suddenly, "*dst++ = *src++" has become one of the fastest implementations. What's going on here? Let's have a look at the generated assembly for *dst++ = *src++; first let's see what JIT64 produces:
lea rdx,[r9+r8]
lea rax,[r9+32h]
mov rcx, r9
mov r9, rax
mov rax, qword ptr [rcx]
mov qword ptr [rsp+20h], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rsp+28h], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rsp+30h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rsp+38h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rsp+40h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rsp+48h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rsp+50h], ax
lea rcx, [rsp+20h]
mov rax, qword ptr [rcx]
mov qword ptr [rdx], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rdx+8], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rdx+10h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rdx+18h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rdx+20h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rdx+28h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rdx+30h], ax
inc r11d
cmp r11d, r10d
jl 00007FFE57865410
Hm, so it first moves data from [rcx] (an offset into the original byte[]) to a temporary at [rsp+20h] and then copies that to the destination [rdx] (an offset into the result STLTriangle[]). That's obviously one unnecessary step that RyuJIT could improve upon (there are 28 mov instructions, where only 12 qword movs + 2 word movs would be required to move 50 bytes from one memory location to another). So, what does RyuJIT really do?
lea r9, [rcx+32h]
lea r10, [rax+32h]
movdqu xmm0, xmmword ptr [rax]
movdqu xmmword ptr [rcx], xmm0
movdqu xmm0, xmmword ptr [rax+10h]
movdqu xmmword ptr [rcx+10h], xmm0
movdqu xmm0, xmmword ptr [rax+20h]
movdqu xmmword ptr [rcx+20h], xmm0
mov r11w, word ptr [rax+30h]
mov word ptr [rcx+30h], r11w
inc r8d
cmp edx, r8d
mov rax, r10
mov rcx, r9
jg 00007FFE57884CE8
We see that RyuJIT not only removes the unrequired intermediate load/store ops, but in addition also issues (unaligned) SIMD mov instructions to copy 16 bytes at a time. This optimization also works for NativeInterop.Buffer.Copy: Once I increased the block size to at least 16 bytes, performance became identical to that of *dst++ = *src++ in the RyuJIT case (aside from the overhead for small problem sizes due to GCHandle allocations).
Conclusion
When your only concern is reading from/writing to a file on a comparatively slow disk, all of the above shouldn't bother you much: most of the methods will simply be "fast enough". As soon as you start to move around data that's already in memory (e.g. cached) or that resides on a very fast SSD, however, choosing the right method becomes critical.

While NativeInterop.Buffer.Copy isn't the fastest of the bunch (yet), its performance is competitive for medium to large problem sizes and in any case orders of magnitude faster than the usual BinaryReader approach. If you want convenience, portability and genericity while maintaining reasonable performance, NativeInterop provides a good all-round solution, I believe. If you want raw speed, though, use memcpy (or whatever may be available on your platform).
Method | Convenience | Portability | Safety | Efficiency | Genericity |
---|---|---|---|---|---|
BinaryReader.ReadXXX | + | + | + | - | - |
BitConverter.ToXXX | - | + | o | - | - |
Buffer.BlockCopy | - | + | o | - | - |
Marshal.PtrToStructure | - | + | o | - | + |
*dst++ = *src++ | o | + | - | o | o |
Marshal.Copy | + | + | - | + | o |
memcpy | o | - | - | + | + |
cpblk | - | + | - | + | + |
Buffer.Convert/Copy, Stream.ReadUnmanagedStructRange | + | + | - | o/+ | + |
Labels: .NET, C#, F#, NativeInterop, Performance, unsafe
Monday, May 19, 2014
Binary STL I/O using NativeInterop.Stream
There are essentially three common ways for reading/writing structured binary data of some user defined type T (read: "files of records") from/to files in C# and .NET in general:
Sadly, C# cannot constrain type parameters to unmanaged types and, as a consequence, does not support dereferencing generic pointer types. (In C# there is only the struct constraint, but that is not sufficient as a struct might contain fields of reference type.) What that means is that option #3 from above only works for concrete types.
Because F# can constrain a type parameter to be an unmanaged type, I used F# to build the NativeInterop library that—at its core—exposes the possiblity to (unsafely!) dereference generic pointers in C#. This in turn enabled the implementation of some generic extension methods (NativeInterop ≥ v1.4.0) for System.IO.Stream that provide both easy* to use and efficient** methods for reading and writing structured binary files.
A word of warning: The F# unmanaged constrain does not surface in the NativeInterop API if used from C# (and probably VB.NET). Thus the user of NativeInterop has to make sure, that his data types are truly unmanaged! If that is not the case NativeInterop may produce arbitrary garbage!
To represent the triangle information, we use the following user-defined unmanaged type(s):
We leave the normals at zero: Most software will derive the surface normals automatically correctly, if the order in which the vertices of each face are specified follow the right hand rule (i.e. vertices enumerated counter-clockwise when looking at the face from the outside).
And here we have it, our, ahem, "beautiful" tetrahedron rendered in MeshLab:
Note how all the surface normals are sticking outward.
*Easy, because the user of the library doesn't have to fiddle with unsafe code.
**Efficient, because under the hood it boils down to a generic version of option #3 with zero marshalling overhead.
- Use System.IO.BinaryReader/BinaryWriter and its' ReadXXX/Write methods to read/write individual fields of the data type T in question.
- Use one of the System.Runtime.InteropServices.Marshal.PtrToStructure/StructureToPtr methods to covert between (pinned) byte arrays (which can be written to and read from System.IO.Stream) and non-generic value types or "formatted class types" via .NET's COM marshalling infrastructure.
- Use a little bit of unsafe code and cast a byte* pointing at the first entry of a pinned byte buffer to a T* so structs of that type T can be written/read by simply dereferencing a pointer.
Sadly, C# cannot constrain type parameters to unmanaged types and, as a consequence, does not support dereferencing generic pointer types. (In C# there is only the struct constraint, but that is not sufficient as a struct might contain fields of reference type.) What that means is that option #3 from above only works for concrete types.
Because F# can constrain a type parameter to be an unmanaged type, I used F# to build the NativeInterop library that—at its core—exposes the possiblity to (unsafely!) dereference generic pointers in C#. This in turn enabled the implementation of some generic extension methods (NativeInterop ≥ v1.4.0) for System.IO.Stream that provide both easy* to use and efficient** methods for reading and writing structured binary files.
A word of warning: The F# unmanaged constrain does not surface in the NativeInterop API if used from C# (and probably VB.NET). Thus the user of NativeInterop has to make sure, that his data types are truly unmanaged! If that is not the case NativeInterop may produce arbitrary garbage!
Example: Writing and Reading Binary STL files
Let's look at a simple example of how to use the aforementioned library methods to handle a simple structured binary file format: STL files essentially contain a description of a triangle mesh (triangulated surface). The exact format of the binary version (there is also an ASCII variant) is described on Wikipedia:
- An 80 byte ASCII header, which may contain some descriptive string
- The number of triangles in the mesh as an unsigned 32 bit integer
- For each triangle there is one record of the following format:
- A normal vector made up of three single precision floating point numbers
- Three vertices, each made up of three single precision floating point numbers
- A field called "Attribute byte count" (16 bit unsigned integer); this should usually be zero, but some software may interpret this as color information.
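Putting those numbers together: each packed triangle record occupies 50 bytes, so the total size of a binary STL file is fully determined by the triangle count. A quick sanity-check sketch (the helper name is ours):

```csharp
using System;

// 80 byte header + 4 byte triangle count + 50 bytes per triangle record
// (12 floats at 4 bytes each, plus the 2 byte attribute count).
long ExpectedStlFileSize(uint triangleCount)
    => 80L + 4L + 50L * triangleCount;

// A tetrahedron (4 triangles) should produce a 284 byte file.
Console.WriteLine(ExpectedStlFileSize(4)); // 284
```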
Modeling STL Records in C#
In this example, we will only explicitly model the triangle records. The header information is so simple that it's easier to just directly write out an 80 byte ASCII string. To represent the triangle information, we use the following user-defined unmanaged type(s):
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct STLVector
{
    public readonly float X;
    public readonly float Y;
    public readonly float Z;

    public STLVector(float x, float y, float z)
    {
        this.X = x;
        this.Y = y;
        this.Z = z;
    }
}

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct STLTriangle
{
    // 4 * 3 * 4 byte + 2 byte = 50 byte
    public readonly STLVector Normal;
    public readonly STLVector A;
    public readonly STLVector B;
    public readonly STLVector C;
    public readonly ushort AttributeByteCount;

    public STLTriangle(
        STLVector normalVec,
        STLVector vertex1,
        STLVector vertex2,
        STLVector vertex3,
        ushort attr = 0)
    {
        Normal = normalVec;
        A = vertex1;
        B = vertex2;
        C = vertex3;
        AttributeByteCount = attr;
    }
}
Defining a Test Geometry
For testing purposes, we can now define a super-simple triangle mesh, a single tetrahedron:

// tetrahedron, vertex order: right hand rule
var mesh = new STLTriangle[]
{
    new STLTriangle(new STLVector(0, 0, 0), new STLVector(0, 0, 0), new STLVector(0, 1, 0), new STLVector(1, 0, 0)),
    new STLTriangle(new STLVector(0, 0, 0), new STLVector(0, 0, 0), new STLVector(0, 0, 1), new STLVector(0, 1, 0)),
    new STLTriangle(new STLVector(0, 0, 0), new STLVector(0, 0, 0), new STLVector(0, 0, 1), new STLVector(1, 0, 0)),
    new STLTriangle(new STLVector(0, 0, 0), new STLVector(0, 1, 0), new STLVector(0, 0, 1), new STLVector(1, 0, 0)),
};
We leave the normals at zero: Most software will derive the surface normals correctly on its own, as long as the vertices of each face are specified following the right hand rule (i.e. enumerated counter-clockwise when looking at the face from the outside).
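For readers who want to check the winding: with counter-clockwise vertices A, B, C (seen from the outside), the cross product (B - A) × (C - A) points outward. A small sketch using System.Numerics (our own illustration, not part of the library):

```csharp
using System;
using System.Numerics;

// Outward face normal for counter-clockwise vertex order (right hand rule).
Vector3 FaceNormal(Vector3 a, Vector3 b, Vector3 c)
    => Vector3.Normalize(Vector3.Cross(b - a, c - a));

// The first face of the tetrahedron above lies in the z = 0 plane;
// since the fourth vertex sits at +Z, its outward normal must point along -Z.
var n = FaceNormal(new Vector3(0, 0, 0), new Vector3(0, 1, 0), new Vector3(1, 0, 0));
Console.WriteLine(n); // <0, 0, -1>
```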
Writing STL
Now it's really straightforward to generate a valid STL file from the available data:

using (var bw = new BinaryWriter(File.OpenWrite(filename), Encoding.ASCII))
{
    // Encode the header string as ASCII and put it in an 80 byte buffer
    var headerString = "Tetrahedron";
    var header = new byte[80];
    Encoding.ASCII.GetBytes(headerString, 0, headerString.Length, header, 0);
    bw.Write(header);

    // write #triangles
    bw.Write(mesh.Length);

    // use extension method from NativeInterop.Stream to write out the mesh data
    bw.BaseStream.WriteUnmanagedStructRange(mesh);
}
And here we have it, our, ahem, "beautiful" tetrahedron rendered in MeshLab:
Note how all the surface normals are sticking outward.
Reading STL
By using the ReadUnmanagedStructRange<T> extension method, reading binary STL data is just as simple:

string header;
STLTriangle[] mesh;

using (var br = new BinaryReader(File.OpenRead(filename), Encoding.ASCII))
{
    header = Encoding.ASCII.GetString(br.ReadBytes(80));
    var triCount = br.ReadUInt32();
    mesh = br.BaseStream.ReadUnmanagedStructRange<STLTriangle>((int)triCount);
}
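The plain BinaryWriter/BinaryReader part of this round trip (the 80 byte header and the triangle count) can be exercised without NativeInterop at all; a minimal in-memory sketch:

```csharp
using System;
using System.IO;
using System.Text;

var ms = new MemoryStream();

// Write: 80 byte ASCII header (zero-padded) followed by the uint32 triangle count.
using (var bw = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true))
{
    var headerBytes = new byte[80];
    var text = "Tetrahedron";
    Encoding.ASCII.GetBytes(text, 0, text.Length, headerBytes, 0);
    bw.Write(headerBytes);
    bw.Write(4u); // triangle count
}

// Read it back.
ms.Position = 0;
using (var br = new BinaryReader(ms, Encoding.ASCII, leaveOpen: true))
{
    var header = Encoding.ASCII.GetString(br.ReadBytes(80)).TrimEnd('\0');
    var triCount = br.ReadUInt32();
    Console.WriteLine($"{header}: {triCount} triangles"); // Tetrahedron: 4 triangles
}
```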
Conclusion
Reading and writing simple structured binary data is easy using NativeInterop.Stream. Get it now from NuGet! For reporting bugs or suggesting new features, please use the BitBucket issue tracker.

*Easy, because the user of the library doesn't have to fiddle with unsafe code.
**Efficient, because under the hood it boils down to a generic version of option #3 with zero marshalling overhead.
Friday, April 11, 2014
NativeInterop is live on NuGet
Today I released a first version of my NativeInterop package on NuGet. You can find a description of the purpose and usage of the package as well as the source code on BitBucket.
The motivation to build and release this package really evolved from two practical issues I encountered when building performance critical C# code:
- Creating a native, high-performance generic array data structure in C# seems to be impossible (see A minimalistic native 64 bit array ...).
- Reading structured binary data from a byte[] requires some ugly hacks to get both decent performance and genericity (see e.g. Reading Unmanaged Data Into Structures).
In C#, you cannot write a generic read like this (T* does not compile for a generic T):

static unsafe T Read<T>(T* p) { return *p; }

But in F#, you can; you'd simply encode this as
open NativeInterop

let pVal = NativePtr.read ptr

where ptr is of type nativeptr<'T> and 'T is constrained to unmanaged types.
The performance offered by the NativeInterop package should be on par with non-generic unsafe C# code. The NativeInterop package also contains an implementation of NativeArray64, but this time without using inline IL. It turned out that in newer versions of the .NET framework, the AGUs are utilized correctly for the address (offset) computation (instead of emitting IMUL instructions): Calling NativePtr.set<'T>/get<'T>/set64<'T>/get64<'T> (or NativePtr.Get/Set or IntPtr.Get/Set or NativePtr.get_Item/set_Item, respectively) should all result in the generation of a single mov instruction.
Monday, April 7, 2014
A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NET
Not being able to exploit today's processors' SIMD capabilities is a major handicap when implementing high-performance (e.g. numerical) applications in C# (or any other CLI language). While there is Mono.Simd, there has been no solution for applications running on top of Microsoft's own runtime (CLR), despite popular demand ... until now!
With BUILD 2014, Microsoft released a new preview version of the next generation JIT compiler "RyuJIT" that, combined with a special SIMD library that can be installed via NuGet, supports SIMD intrinsics (only SSE2 for now, but AVX is in the works).
Finally! I couldn't wait to try out the new bits; thus I modified the C# version of my existing XRaySimulator* to make use of SSE2 by implementing a simple packet ray tracing technique, i.e. instead of tracing individual rays, this version traces bundles of 2x2 (SSE2) or 4x2 (AVX) rays. Because the rays are largely "coherent", they typically hit the same objects (cache hit rate!).
The contenders
Currently there are a total of six different variants of the XRaySimulator:
- "C#": This is the baseline, scalar managed implementation.
- "C# adj. trav.": A further optimized version that exploits the fact that once a ray is inside a volume (finite element) mesh, it must hit a face of an adjacent element (hexahedron).
- "C#/SSE2": Like "C#", but using 2x2 (X-)ray packets; doesn't use "adjacency traversal" due to the high branching factor
- "C++": A C++11 reimplementation of "C#"; I tried to stay as close as possible to "C#" while still using at least half-way decent, idiomatic C++.
- "C++ adj. trav.": Corresponds to "C# adj. trav."
- "C++/AVX": Vectorized version of "C++" using 4x2 ray bundles thanks to AVX
Performance analysis
So, who wins? The following figure shows the performance of the different versions in million rays per second (MRay/s) rendering an FE model consisting of 28672 hexahedral elements (344064 triangles) at a resolution of 6400 x 4800 pixels on an Intel Core i7-2600K (3.4 - 4.2 GHz) with 32 GB DDR4 RAM running under Windows 8.1 Pro:

As expected "C++/AVX" blasts away the rest of the pack with a stunning 13 MRay/s. And while "C++/AVX" delivers a speed-up of 5.2 over "C++", "C#/SSE2" only improves by a factor of 2.5 compared to "C#" and displays only insignificant performance gains over the optimized scalar version "C# adj. trav."

Now, given that SSE2 uses only 128-bit-wide vector lanes compared to AVX's generous 256 bit and the generally much more aggressive optimizer of the Visual C++ compiler, it's not exactly surprising to see an obvious performance difference between the "C++/AVX" and "C#/SSE2" case. Yet, I still would have expected the speed-up of "C#/SSE2" to reach a value a little closer to 4x instead of 2.5. What's going on there?
According to Visual Studio's built-in profiler all of the implementations spend the majority of their time in the intersection routine of the AABB (axis-aligned bounding box) - which is a good thing, because this intersection test is very fast compared to a triangle intersection test. Thus the quality of the generated machine code for this method/function is critical for the overall performance of the renderer.
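For context (this is our own generic illustration, not the post's actual code): a typical scalar slab test for the ray/AABB intersection looks like the following; the SSE2/AVX versions evaluate the same min/max arithmetic for a whole packet of rays per instruction.

```csharp
using System;

// Scalar ray/AABB slab test. invDir = 1/direction is precomputed;
// IEEE infinities make axis-parallel rays fall out correctly.
bool RayHitsAabb(
    float ox, float oy, float oz,       // ray origin
    float ix, float iy, float iz,       // inverse ray direction
    float lox, float loy, float loz,    // AABB min corner
    float hix, float hiy, float hiz)    // AABB max corner
{
    float t1 = (lox - ox) * ix, t2 = (hix - ox) * ix;
    float tmin = Math.Min(t1, t2), tmax = Math.Max(t1, t2);

    t1 = (loy - oy) * iy; t2 = (hiy - oy) * iy;
    tmin = Math.Max(tmin, Math.Min(t1, t2));
    tmax = Math.Min(tmax, Math.Max(t1, t2));

    t1 = (loz - oz) * iz; t2 = (hiz - oz) * iz;
    tmin = Math.Max(tmin, Math.Min(t1, t2));
    tmax = Math.Min(tmax, Math.Max(t1, t2));

    // Hit if the slab intervals overlap in front of the ray origin.
    return tmax >= Math.Max(tmin, 0f);
}

float inf = float.PositiveInfinity;

// Ray along +X from the origin: hits a box straddling the X axis ...
Console.WriteLine(RayHitsAabb(0, 0, 0, 1, inf, inf, 1, -1, -1, 2, 1, 1)); // True
// ... but misses a box offset in Z.
Console.WriteLine(RayHitsAabb(0, 0, 0, 1, inf, inf, 1, -1, 2, 2, 1, 3)); // False
```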
The source code of the C#/SSE2 version looks like this:
[embedded code snippet]
And here's the source for the C++/AVX version:
[embedded code snippet]
(Note: The C++ code uses a hard-coded vector lane width of 8 floats.)
Almost identical; yet, if you compare what both RyuJIT and Visual C++ make of these sources, you'll first notice that the machine code emitted by RyuJIT is much more convoluted and thus longer:
Most of that "additional stuff" that's going on in the C#/RyuJIT version seems to be related to null-pointer checks (both AABB and RayPack are classes and thus reference types). Still, I wonder if all those load/store operations are truly necessary.
Preliminary conclusions
It seems like Microsoft has finally awakened and is making the long overdue investments in .NET performance. Thanks, Google and Apple! Although RyuJIT will still require a lot of optimization, in particular with respect to the generated SIMD code, Redmond's latest moves are promising. A next generation JIT, SIMD support, AOT compilation using the Visual C++ optimizer backend... What will come next? GPGPU support? Large arrays? A decent, modern, performant desktop UI framework? True first-class support for F#?

The future is bright!
*XRaySimulator is a visualization tool that renders X-ray-like images of finite element models. It uses a modified ray tracing algorithm to compute the energy absorption within each intersected element based on the element's material properties. A BVH (bounding volume hierarchy) is used to speed-up the intersection computation.
Details (German): http://www.uni-ulm.de/fileadmin/website_uni_ulm/uzwr/projekte/p10-2.pdf
**The C# versions of XRaySimulator on BitBucket currently don't support saving the rendered image to a file. In older versions, I used to use Tao.DevIL, but that only works on x86 and the preview releases of RyuJIT only emit x64 machine code. The C++ versions use a custom TGA output filter.