Frank’s Scratchpad: Random thoughts about coding, computers, games, science & the net.<br />
<h1>
Running TensorFlow natively on Windows 10 (2016-10-04)</h1>
<a href="https://www.tensorflow.org/" target="_blank">TensorFlow</a> is a library for evaluating numerical expressions of high-rank arrays (a.k.a. “multi-dimensional arrays” or “tensors”, which sometimes may actually represent <a href="https://en.wikipedia.org/wiki/Tensor" target="_blank">tensors in the mathematical sense</a>), a capability that is crucial for many scientific computing tasks. TensorFlow, however, specifically targets machine learning tasks, in particular ‘<a href="http://neuralnetworksanddeeplearning.com/chap6.html" target="_blank">deep learning</a>,’ whose practical viability critically depends on highly efficient multi-dimensional algebra routines and equally efficient high-dimensional gradient calculation. While TensorFlow’s core is implemented in C++, it comes with a Python API that enables interactive experimentation with the library.<br />
<br />
Unfortunately, TensorFlow—or rather: its build system—hasn’t yet been ported to Windows (but the developers <a href="https://github.com/tensorflow/tensorflow/issues/17" target="_blank">are working on it</a>). Until then, one can get by using Docker containers or running full-blown Linux VMs. With the introduction of the <a href="https://msdn.microsoft.com/en-us/commandline/wsl/faq?f=255&MSPPError=-2147217396" target="_blank">WSL (Windows Subsystem for Linux)</a> as part of the Windows 10 Anniversary Update, however, it has become possible to run the Linux version of TensorFlow on Windows in its Ubuntu user space (CPU only, sadly). WSL is still in beta, so some quirks are to be expected. Follow the instructions below to set up a working IPython-TensorFlow environment on Windows 10 Pro:<br />
<h2>
Step 1: Activate WSL and “Bash on Ubuntu on Windows”</h2>
First you need to activate the Linux subsystem and install the Ubuntu user land. To begin, open Windows Settings (Windows + I) and click on “Windows Update, recovery, backup” (Fig. 1).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQffOXNYXlmjkQGTTaQYo617QsrfDtklwJdc3MXMKdb6duWNsLwnTmLaw7zMww1JQU9hfUI8kpGlUjCgRbIQLT3bx7JaUXEXpAUC5QJTnkWN6WYN5z5wBYzz31idHj0gpqphrHe3MlM8E/s1600/step1-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQffOXNYXlmjkQGTTaQYo617QsrfDtklwJdc3MXMKdb6duWNsLwnTmLaw7zMww1JQU9hfUI8kpGlUjCgRbIQLT3bx7JaUXEXpAUC5QJTnkWN6WYN5z5wBYzz31idHj0gpqphrHe3MlM8E/s640/step1-1.png" width="618" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: Click on “Update” ...</td></tr>
</tbody></table>
Click on “Use developer features” and enable “developer mode” (Fig. 2). This might take a while, and you may have to restart your machine afterwards.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1XZugc8baNJ3yF_lDPZNvML576w_frpJ3QUcT58qgQccgB5twgGDSk8CGauw7VAkzQF16WZBiaQcpuOHhrwCp7_I4DP3vqOB4qvKuHBd7GTHpToJA2LQ8pr0tIVWSG6DTjaP22YlEWew/s1600/step1-2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1XZugc8baNJ3yF_lDPZNvML576w_frpJ3QUcT58qgQccgB5twgGDSk8CGauw7VAkzQF16WZBiaQcpuOHhrwCp7_I4DP3vqOB4qvKuHBd7GTHpToJA2LQ8pr0tIVWSG6DTjaP22YlEWew/s640/step1-2.png" width="618" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: ... to enable developer mode.</td></tr>
</tbody></table>
Now we need to activate the Linux subsystem. Open the (classic) Control Panel and navigate to “Programs” → “Turn Windows features on or off” (Fig. 3 & 4).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieTrb0YiF1eSHit5boMMcHRlIXbX3PHAuV0-XsOv-nHo1ju7pgO-W38J6iAHPpLkEkv0DQ4Ai1LBV45wiuE-YbeCJPwnU-Qukzc31INP2TMIZqc9swhpvukWXU0GwThOTqHjDzBZ3RLDU/s1600/step1-3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieTrb0YiF1eSHit5boMMcHRlIXbX3PHAuV0-XsOv-nHo1ju7pgO-W38J6iAHPpLkEkv0DQ4Ai1LBV45wiuE-YbeCJPwnU-Qukzc31INP2TMIZqc9swhpvukWXU0GwThOTqHjDzBZ3RLDU/s640/step1-3.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3</td></tr>
</tbody></table>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsJCuePl8x47kYXzanIbYc-m9yk9TLW97Q08TETe-hgexwL9KgnkgIJdkBuy86umDKGveq28Kg4UViZ6unTXQX1m-WrvWeZdnR9ug3E8FHu9JPZWEeZTmIHmuB3PB0effo-Pi6FNNdUNE/s1600/step1-4.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsJCuePl8x47kYXzanIbYc-m9yk9TLW97Q08TETe-hgexwL9KgnkgIJdkBuy86umDKGveq28Kg4UViZ6unTXQX1m-WrvWeZdnR9ug3E8FHu9JPZWEeZTmIHmuB3PB0effo-Pi6FNNdUNE/s640/step1-4.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4</td></tr>
</tbody></table>
Select “Windows Subsystem for Linux (Beta)” in the dialog and click OK. You'll probably have to restart your machine again (Fig. 5).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8yTBISUvDDfwjDmAg7tGAOzRCiE-MqBNj3_g2Jc4GA6z7FmmASz00zcwUBsmPxfxtHQ-e-EActYwmS3KrELSwjXfS_l2Qi-OaKxVhoVJ-lo6KP1hFZsTHYH24fIVJHKU5VJSAbg4T5Ro/s1600/step1-5.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="353" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8yTBISUvDDfwjDmAg7tGAOzRCiE-MqBNj3_g2Jc4GA6z7FmmASz00zcwUBsmPxfxtHQ-e-EActYwmS3KrELSwjXfS_l2Qi-OaKxVhoVJ-lo6KP1hFZsTHYH24fIVJHKU5VJSAbg4T5Ro/s400/step1-5.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 5</td></tr>
</tbody></table>
Now you should be able to run “bash” (either from the Start menu or from a cmd prompt), which guides you through the rest of the installation process (cf. the <a href="https://msdn.microsoft.com/en-us/commandline/wsl/install_guide" target="_blank">WSL Installation Guide</a>). Afterwards, there should be a new Start menu entry “Bash on Ubuntu on Windows” (Fig. 6).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOKtnLowUPuWSBBIm6G5niAHLsM_-OIE4RYLCzbukbb03m66OWCIOJB-dvzRadl3XD8oS2OrY_2tCiiyTYrwtLcnB-Y8jfT2nWE6wslpf-UPzDPA1uc0R34bLlXiybHCepZubkGQWZ0hg/s1600/step2-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOKtnLowUPuWSBBIm6G5niAHLsM_-OIE4RYLCzbukbb03m66OWCIOJB-dvzRadl3XD8oS2OrY_2tCiiyTYrwtLcnB-Y8jfT2nWE6wslpf-UPzDPA1uc0R34bLlXiybHCepZubkGQWZ0hg/s640/step2-1.png" width="268" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 6</td></tr>
</tbody></table>
<h2>
Step 2 (optional): Install mintty for WSL</h2>
When you use the aforementioned shortcut, a bash shell starts in a conventional cmd.exe console host. While that is perfectly usable, I personally much prefer <a href="https://mintty.github.io/" target="_blank">mintty</a> (<a href="https://cygwin.com/" target="_blank">Cygwin's</a> default terminal emulator). Luckily, there's already a <a href="https://github.com/mintty/wsltty" target="_blank">version for WSL</a> available: just <a href="https://github.com/mintty/wsltty/releases" target="_blank">download the installer</a>, run it, et voilà, ready to go. The installer also adds a handy "WSL in Mintty Here" entry to the Explorer context menu, which opens a bash session in the current path.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM9Jkj6LtD8AWejNAaORIa0CIdAWmDwVpmf0lxoMrPG37_SCZRO5fsDvQIXmdinHPpd51NR8hfboVXlxIRmoR3k-qklUpMFwodXNafH09PJgqjdGfTUIGis4HcoaAYkUFbG7PmXiGYLro/s1600/step2-2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="377" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM9Jkj6LtD8AWejNAaORIa0CIdAWmDwVpmf0lxoMrPG37_SCZRO5fsDvQIXmdinHPpd51NR8hfboVXlxIRmoR3k-qklUpMFwodXNafH09PJgqjdGfTUIGis4HcoaAYkUFbG7PmXiGYLro/s640/step2-2.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 7: mintty hosting a bash shell</td></tr>
</tbody></table>
<h2>
Step 3: Install Anaconda</h2>
<a href="https://www.continuum.io/downloads" target="_blank">Anaconda</a> from Continuum Analytics is <i>the</i> computational science Python distribution. While we could also simply use the default Python distribution from the Ubuntu repositories, Anaconda comes with <a href="https://software.intel.com/en-us/intel-mkl" target="_blank">Intel's MKL</a> and thus offers a substantial performance boost (not to mention its potent conda package manager). Start a bash shell and download and run the Anaconda installer with the following commands:<br />
<br />
$ cd ~<br />
$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh<br />
$ chmod +x Anaconda3-4.2.0-Linux-x86_64.sh<br />
$ ./Anaconda3-4.2.0-Linux-x86_64.sh<br />
<br />
Note that this installs Anaconda into your WSL home directory. You could install it "system-wide" using sudo, but as WSL environments are per-Windows-user anyway, there isn't much point in doing so. At some point the installer will ask you whether it should add Anaconda to your Linux PATH, effectively making it the default Python. Confirm by entering "yes" (Fig. 8).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnsPfcwD49MjLvpn8Ayg0i4hlcWpN5HifeenyzlOgcY17hydNI2e87DrWf26OT61MZLWOiecMapLHVTB_kklx3rCe6_X12AUJ9CTHhxE10AK8pmXoOeNn-j6q9sUlW7MRlJUsFfwgvJlQ/s1600/step3-1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnsPfcwD49MjLvpn8Ayg0i4hlcWpN5HifeenyzlOgcY17hydNI2e87DrWf26OT61MZLWOiecMapLHVTB_kklx3rCe6_X12AUJ9CTHhxE10AK8pmXoOeNn-j6q9sUlW7MRlJUsFfwgvJlQ/s640/step3-1.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 8: YES!!!</td></tr>
</tbody></table>
You may have to start a new bash session for the PATH change to take effect. Alternatively, you can "source" (reload/re-execute) .bashrc via<br />
<br />
$ . ~/.bashrc<br />
<br />
Now you should be able to run<br />
<br />
$ ipython<br />
<br />
and see a message like this:<br />
<br />
Python 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul 2 2016, 17:53:06)<br />
Type "copyright", "credits" or "license" for more information.<br />
<br />
IPython 5.1.0 -- An enhanced Interactive Python.<br />
? -> Introduction and overview of IPython's features.<br />
%quickref -> Quick reference.<br />
help -> Python's own help system.<br />
object? -> Details about 'object', use 'object??' for extra details.<br />
<br />
In [1]:<br />
<br />
Yet, when you enter the command (IPython magic)<br />
<br />
In [1]: %pylab<br />
<br />
IPython will throw a PyQt4 error at us:<br />
<br />
---> 31 from .qt_compat import QtCore, QtGui, QtWidgets, _getSaveFileName, __version__<br />
32 from matplotlib.backends.qt_editor.formsubplottool import UiSubplotTool<br />
33<br />
<br />
/home/niemeyer/anaconda3/lib/python3.5/site-packages/matplotlib/backends/qt_compat.py in &lt;module&gt;()<br />
135 # have been changed in the above if block<br />
136 if QT_API in [QT_API_PYQT, QT_API_PYQTv2]: # PyQt4 API<br />
--> 137 from PyQt4 import QtCore, QtGui<br />
138<br />
139 try:<br />
<br />
ImportError: No module named 'PyQt4'<br />
<h2>
Step 4: Fix Matplotlib PyQt4 Error</h2>
The above error is already <a href="https://github.com/ContinuumIO/anaconda-issues/issues/1068" target="_blank">known to the Anaconda developers</a>. Sadly, the proposed solutions, like explicitly selecting the Qt5Agg backend or downgrading to Qt4, didn't work for me. What <i>did</i> work was switching to the TkAgg backend. For that you need to create a new text file<br />
<br />
$ vi ~/.config/matplotlib/matplotlibrc<br />
<br />
(use nano if you can't handle vi...) and add the following line<br />
<br />
backend : TkAgg<br />
<br />
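If you prefer to script this step instead of editing the file by hand, here's a minimal sketch (standard library only; the helper name is my own) that creates the file in matplotlib's per-user config location:

```python
import os

def write_matplotlibrc(config_dir=None):
    """Create a minimal matplotlibrc that selects the TkAgg backend.

    config_dir defaults to matplotlib's per-user config location,
    ~/.config/matplotlib, but can be overridden (e.g. for testing).
    """
    if config_dir is None:
        config_dir = os.path.expanduser("~/.config/matplotlib")
    os.makedirs(config_dir, exist_ok=True)
    rc_path = os.path.join(config_dir, "matplotlibrc")
    with open(rc_path, "w") as f:
        f.write("backend : TkAgg\n")
    return rc_path
```

Note that this overwrites any existing matplotlibrc, so only use it on a fresh installation.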
When you now start IPython again, executing %pylab should work fine ...<br />
<br />
In [1]: %pylab<br />
Using matplotlib backend: TkAgg<br />
Populating the interactive namespace from numpy and matplotlib<br />
<br />
... only to run into the next error when trying to create a little test plot:<br />
<br />
In [2]: x = linspace(0, 10, 1000)<br />
In [3]: plot(x, x**2)<br />
OMP: Error #100: Fatal system error detected.<br />
OMP: System error #22: Invalid argument<br />
<h2>
Step 5: Work Around OpenMP Error</h2>
The previous error is <a href="https://github.com/Microsoft/BashOnWindows/issues/785" target="_blank">also already known</a>, though not yet fixed. To work around this bug(?), edit your .bashrc<br />
<br />
$ vi ~/.bashrc<br />
<br />
and add the line<br />
<br />
export KMP_AFFINITY=disabled<br />
<br />
to the end of the file. Run<br />
<br />
$ . ~/.bashrc<br />
<br />
again and re-try %pylab and plotting. This time, IPython will reward us with a new error message:<br />
<br />
-> 1868 self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)<br />
1869 if useTk:<br />
1870 self._loadtk()<br />
<br />
TclError: no display name and no $DISPLAY environment variable<br />
<h2>
Step 6: Install X11 Server and set $DISPLAY</h2>
Matplotlib needs an <a href="https://en.wikipedia.org/wiki/X_Window_System" target="_blank">X server</a> to draw its plot windows. Nowadays I recommend <a href="https://sourceforge.net/projects/vcxsrv/" target="_blank">VcXsrv</a>, which is easy to install and just works out of the box. You could use Cygwin/X or Xming instead, but at least the former <a href="http://superuser.com/questions/898781/putty-cygwin-x11-forwarding-cant-open-display-error" target="_blank">requires some fiddling with its settings</a> to work with WSL.<br />
<br />
After having installed and started your X server of choice, edit your .bashrc again to add the following line<br />
<br />
export DISPLAY=:0.0<br />
<br />
Again,<br />
<br />
$ . ~/.bashrc <br />
<br />
Now, IPython/matplotlib should finally work.<br />
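Both workarounds rely on environment variables from .bashrc, which only newly started bash sessions inherit. A small stdlib sketch (the check_env helper is my own invention) lets you verify from inside Python that the interpreter actually sees them:

```python
import os

def check_env(expected):
    """Compare environment variables against their expected values.

    Returns a dict mapping each variable name to True/False, so you can
    see at a glance which .bashrc exports reached this process.
    """
    return {name: os.environ.get(name) == value
            for name, value in expected.items()}

# the two variables this guide adds to .bashrc
print(check_env({"KMP_AFFINITY": "disabled", "DISPLAY": ":0.0"}))
```

If either entry comes back False, start a fresh bash session (or source .bashrc) before launching IPython.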
<h2>
Step 7: Install TensorFlow</h2>
The TensorFlow installation itself is pretty straightforward: execute<br />
<br />
$ conda install -c conda-forge tensorflow<br />
<br />
Alongside raw TensorFlow, you may also want to install a deep learning library like <a href="https://keras.io/" target="_blank">Keras</a>, which is easily installed via pip:<br />
<br />
$ pip install keras<br />
<br />
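With both packages installed, you can sanity-check that they are discoverable without paying TensorFlow's import cost, using the standard library's importlib machinery (the is_installed helper below is my own sketch):

```python
from importlib import util

def is_installed(package):
    """Check whether a top-level package is importable, without actually
    importing it (importing TensorFlow just to probe for it is slow)."""
    return util.find_spec(package) is not None

for pkg in ("tensorflow", "keras"):
    print(pkg, "found" if is_installed(pkg) else "missing")
```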
When you now start IPython again and enter<br />
<br />
import keras<br />
<br />
you should get a little "using TensorFlow backend" message, indicating your successful installation of TensorFlow on Windows!<br />
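If that message doesn't show up, Keras may be configured for a different backend. Keras records its choice in ~/.keras/keras.json, which the following sketch reads (the helper and its behavior for a missing file are my own assumptions, not an official Keras API):

```python
import json
import os

def keras_backend(config_path=None):
    """Read the backend name from Keras' config file.

    Keras records its backend choice in ~/.keras/keras.json (created on
    first import). Returns None if the file doesn't exist yet.
    """
    if config_path is None:
        config_path = os.path.expanduser("~/.keras/keras.json")
    if not os.path.exists(config_path):
        return None
    with open(config_path) as f:
        return json.load(f).get("backend")

print(keras_backend())
```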
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTiDgntNxI32kEHuegUS3bWJXkpEz5YgYQUoRGE_9jpF8-ibvyHj8UW0YTteS_eS02nHI1abW5GSuzdy3Sem7ck1BtK_1bQOQ42YsPdlJyKGUumGAeKnB2xWF1hDHDsIxtzEy7ZahF89w/s1600/step8.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiTiDgntNxI32kEHuegUS3bWJXkpEz5YgYQUoRGE_9jpF8-ibvyHj8UW0YTteS_eS02nHI1abW5GSuzdy3Sem7ck1BtK_1bQOQ42YsPdlJyKGUumGAeKnB2xWF1hDHDsIxtzEy7ZahF89w/s640/step8.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 9: When installed correctly, Keras defaults to TensorFlow as its backend</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPTTBWLCdFg6NxT03jUudjkYLslCO320qJqcuESAJU4NHYy1ZFCSkyOPSBv7YGZKheSgptH_ozmg6mxvksAu3Tzv1YBJe7HDq2Z-yhrIa-vChTY4_j0db1rnygHilACaC9LFubDSRPs3Q/s1600/step8-2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPTTBWLCdFg6NxT03jUudjkYLslCO320qJqcuESAJU4NHYy1ZFCSkyOPSBv7YGZKheSgptH_ozmg6mxvksAu3Tzv1YBJe7HDq2Z-yhrIa-vChTY4_j0db1rnygHilACaC9LFubDSRPs3Q/s640/step8-2.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 10: Training a (small) CNN using TensorFlow on WSL</td></tr>
</tbody></table>
<br />
<h1>
SIMD Fundamentals. Part III: Implementation & Benchmarks (2015-08-28)</h1>
<i>This is the third part of my series on SIMD Fundamentals, targeted at .NET developers. You may read the other two parts here:</i>
<br />
<ul>
<li><i><a href="http://frankniemeyer.blogspot.de/2015/06/simd-fundamentals-part-i-from-sisd-to.html">Part I: From SISD to SIMD</a></i></li>
<li><i><a href="http://frankniemeyer.blogspot.de/2015/06/simd-fundamentals-part-ii-aos-soa.html">Part II: AoS, SoA, Gather/Scatter - Oh my!</a></i></li>
</ul>
While the previous two parts were more focused on theory and concepts, we are now going to get our hands dirty and write some code to see how different approaches to SIMD processing compare in practice, both from a performance and an ease-of-implementation point of view.
<br />
<h1>
Prelude</h1>
In <a href="http://frankniemeyer.blogspot.de/2015/06/simd-fundamentals-part-ii-aos-soa.html">Part II</a> I used F# (pseudo-)code to illustrate the difference between the AoS (array of structures) and SoA (structure of arrays) approaches. That's why I thought using F# for implementing the benchmarks might be a good idea. So I installed Visual Studio 2015, created a new F# console application project and installed <a href="https://www.nuget.org/packages/System.Numerics.Vectors/">System.Numerics.Vectors</a> in its most recent 4.1.0 incarnation via NuGet. Yet, when I tried to use <a href="https://msdn.microsoft.com/en-us/library/dn858385%28v=vs.111%29.aspx">System.Numerics.Vector<T></a>, IntelliSense wanted to convince me <i>there was no such thing:</i><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdWRAOe28IssE3d-UZ6dAhEmFG75H3XGJQRHGdjowJdr04wIA8mCMJ3eV9lgpEks0WdvQVSTwwPJZzB6UQCd3SSW0BOo0HZJLW_7E1H7cgeMCEPmgGJIMv1x5LsZ6KRJvaRLm6AzDcSlk/s1600/novector.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdWRAOe28IssE3d-UZ6dAhEmFG75H3XGJQRHGdjowJdr04wIA8mCMJ3eV9lgpEks0WdvQVSTwwPJZzB6UQCd3SSW0BOo0HZJLW_7E1H7cgeMCEPmgGJIMv1x5LsZ6KRJvaRLm6AzDcSlk/s1600/novector.png" /></a></div>
Maybe just a problem with the F# language service? I tried to run this little hello world sample<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/6298593f26f888d9f468.js"></script>
but that didn't work either, because it references the wrong version of System.Numerics.Vectors:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/9344736a00e9fd0b6b57.js"></script>
I didn't have any luck manually replacing the "C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.Vectors.dll" reference with the one delivered by the System.Numerics.Vectors NuGet package either.
<p>For now, I thus resorted to using C# as an implementation language instead.</p>
<h1>
Managed Implementations</h1>
For the upcoming benchmarks we will use the same problem we stated in the <a href="http://frankniemeyer.blogspot.de/2015/06/simd-fundamentals-part-ii-aos-soa.html">previous post</a>: compute the squared L2 norm of a set of 3-vectors. As discussed there, we want to compare five possible approaches:
<br />
<ul>
<li>Scalar AoS</li>
<li>AoS vectorized across each 3-vector</li>
<li>On-the-fly vectorization (converting AoS to SoA)</li>
<li>Scalar SoA</li>
<li>Vectorized SoA</li>
</ul>
To avoid any memory bottlenecks and keep everything cached, we will use a comparatively small set of vectors. I found that an array of 1024 to 2048 Vec3fs yielded peak performance. Note that this has the nice side effect of being a multiple of the AVX vector lane width (32 bytes or 8 single-precision floating-point values). To get a reliable estimate of the average performance, we repeat the procedure a large number of times: for the benchmarks, all procedures had to compute a total of 32 · 2^30 (approximately 32 billion) dot products.
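Before diving into the benchmark code itself, the two data layouts can be made concrete with a short Python/NumPy sketch (illustration only, not part of the benchmark): the AoS traversal computes one small dot product per vector, while the SoA version becomes a single fused expression over whole component arrays, which is exactly what packed SIMD multiplies and adds implement in registers.

```python
import numpy as np

m = 2048  # a multiple of the 8-float AVX lane width, as in the benchmark
rng = np.random.default_rng(0)

# AoS: one (m, 3) array, each row is a Vec3f
aos = rng.standard_normal((m, 3), dtype=np.float32)

# SoA: three contiguous component arrays
x, y, z = aos[:, 0].copy(), aos[:, 1].copy(), aos[:, 2].copy()

# AoS-style traversal: one small dot product per vector
dp_aos = np.array([v @ v for v in aos], dtype=np.float32)

# SoA-style: a single fused expression over the whole component arrays
dp_soa = x * x + y * y + z * z

assert np.allclose(dp_aos, dp_soa, rtol=1e-5)
```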
<p>For further details, you can consult the full benchmark code <a href="https://bitbucket.org/frank_niemeyer/aos-vs.-soa">here</a>. In case you want to run it yourself and change some parameters, make sure that vectorLen is a multiple of laneWidth, as the remaining code assumes.</p>
<p>All of the following benchmark code was compiled and run on an Intel Core i5-4570 (Haswell) with 16 GB of DDR3 SDRAM on Windows 10 Pro. The managed implementation was developed using Visual Studio 2015 on .NET 4.6 and targeted x64 (release build, "RyuJIT").</p>
<h2>
Scalar AoS</h2>
This version is probably the easiest to understand and, as long as you don't intend to vectorize, a pretty reasonable one: We have an array of Vector3 structures (here we simply use the one provided by System.Numerics.Vectors) and compute the resulting dot products vector by vector:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/9686afe1f2312827dd74.js"></script>
The outer loop over j is there to ensure the total number of computed dot products is 32 billion. For the inner loop, the JIT compiler generates the following machine code:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/62860fa87432a047d402.js"></script>
As expected, it uses scalar AVX instructions to compute the dot products (VMULSS, VADDSS).
<br />
<h2>
AoS vectorized across each 3-vector</h2>
In this version, we still compute a single dot product per iteration, but we compute the squares of the components at once and then determine the (horizontal) sum of those squared components:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/e136caf93b03f964020b.js"></script>
When using Vector3.Dot, the compiler emits code that uses the DPPS (SSE 4.1) or VDPPS (AVX) instructions:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/f15d173dc31e955e81c8.js"></script>
For some reason, it's loading the data <i>twice</i>, first into XMM0 and then into XMM1 (VMOVSS, VMOVSD, VSHUFPS).
<br />
<h2>
On-the-fly vectorization</h2>
Because AoS isn't a data layout well suited for vectorization, last time we came up with the idea of reordering the data on the fly. Vector gather instructions would help with that, but System.Numerics.Vector<T> only supports loading consecutive elements for now. The only managed solution I could come up with is to first manually gather the required data into temporary arrays and then create the vector instances from these temporary data structures:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/bb74e23c059ffd433edb.js"></script>
That works, in principle: the compiler can now emit VMULPS and VADDPS instructions to compute 8 dot products at once. Yet, because the JIT compiler doesn't employ VGATHERDPS, all this gathering becomes quite cumbersome:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/75fe6aeea8e969d6cbc1.js"></script>
As you can probably imagine, this code isn't exactly a candidate for the world's fastest dot3 product implementation...
<br />
<h2>
Scalar SoA</h2>
Let's move on to a proper SoA data layout. The scalar version is similar to the scalar AoS version, only that we now index the components instead of the array of vectors:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/7dad43ee87445ae6cf46.js"></script>
The resulting machine code is likewise similar to scalar AoS:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/77ef3f5981375a8caf10.js"></script>
<h2>
Vectorized SoA</h2>
The SoA layout makes it easy to use vector arithmetic to compute 8 dot products at once:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/41d78f5220ddcb68e452.js"></script>
This is what the compiler makes of it:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/34b464202857f1f8bf68.js"></script>
Nice, but sprinkled with range (?) checks. I also wonder why it emits VMOVUPD instead of VMOVUPS instructions.
<br />
<h1>
Unmanaged Implementations</h1>
After implementing and running the above variants in C#, I figured it would be useful to have something to compare the results to. Thus, I ported the benchmark code to C++ to see what the Visual C++ optimizer, its auto-vectorizer and SIMD intrinsics can do and how close we can get to the theoretical peak performance of the Haswell CPU. For the "native" implementation I used the Visual C++ 2015 compiler with the following flags:
<br />
<pre>/GS- /GL /W3 /Gy /Zi /Gm- /Ox /Ob2 /Zc:inline /fp:fast /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Ot</pre>
<h2>
Scalar AoS</h2>
Again, the code for this version is pretty straightforward:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/7fa84d49bab32c6e85c5.js"></script>
In case you wonder about the inner loop construction: while (i--) turned out to result in slightly faster code than a more traditional for loop.<p>No surprises regarding the machine code, either:</p>
<script src="https://gist.github.com/FrankNiemeyer/b08b94b0f3983b145887.js"></script>
<h2>
AoS vectorized across each 3-vector</h2>
Let's use the DPPS instruction via intrinsics:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/90df575a33b4189e8ac2.js"></script>
Notice the little trick of directly loading four consecutive floats instead of scalar loads and shuffling. Strictly speaking, this might go wrong for the last element of the array if it reads past allocated memory... In reality you'd handle that special case separately (or simply allocate a few more bytes). The corresponding machine code is really compact:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/50f69897e4a8477ba035.js"></script>
<h2>
On-the-fly vectorization</h2>
In contrast to the managed version, we can now employ AVX's vector gather instructions to load eight values of each of the three components into YMM registers:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/1e727dc50bfa15fde6b8.js"></script>
_mm256_fmadd_ps results in FMA3 (fused multiply add) instructions, combining multiplication and addition/accumulation in one instruction:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/43738881a99edfebfbda.js"></script>
<h2>
Scalar SoA</h2>
<script src="https://gist.github.com/FrankNiemeyer/4098009c9f198743152e.js"></script>
<script src="https://gist.github.com/FrankNiemeyer/a7283577301609ba61f1.js"></script>
<h2>
Auto-vectorized SoA</h2>
In order for the auto-vectorizer to kick in, we need to use a for-loop for the inner iteration:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/ca7a02c5adb25e0c4e99.js"></script>
Now the compiler even generates FMA instructions:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/e2f0e124450a71c549f7.js"></script>
<h2>
Vectorized SoA</h2>
Of course we can also vectorize manually by using intrinsics:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/dae98dad57199c1abfdb.js"></script>
<script src="https://gist.github.com/FrankNiemeyer/01a7b01e7c6be521ec09.js"></script>
<h2>
Vectorized SoA using FMA</h2>
This is the same as above, but it additionally makes use of FMA:
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/1f0d4d02d9dcb9e9afa0.js"></script>
<script src="https://gist.github.com/FrankNiemeyer/a9cd6b26f01b9a60c4a8.js"></script>
<h1>
Results</h1>
The following figure displays the performance in GFLOP/s of the different versions. The dashed line is at 51.2 GFLOP/s, the theoretical peak performance of a single Haswell core (single precision):
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ9NulP06bHaxfumiCynprrfBnZXZ-xTcWSfGja03upo_PUEsF5dojaGNWTspkj66PzCIb_ulDvQ5XE1Yx1QfgU_PktBaueMI6zvv20byIKWVToXbosZ-ymClaX-EjapKP41pg45yDxZ0/s1600/benchmark_results.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="416" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJ9NulP06bHaxfumiCynprrfBnZXZ-xTcWSfGja03upo_PUEsF5dojaGNWTspkj66PzCIb_ulDvQ5XE1Yx1QfgU_PktBaueMI6zvv20byIKWVToXbosZ-ymClaX-EjapKP41pg45yDxZ0/s640/benchmark_results.png" width="640" /></a></div>
First of all, neither the AoS variants nor the scalar SoA version come even close to the vectorized SoA versions. Second, any attempt at accelerating the original AoS version failed (C#) or only provided insignificant performance gains (C++). Even vector gather can't save the day and in fact further impairs performance. In any event, the gains don't justify the more complicated code.
<p>If you really need the performance SIMD can provide, you <i>have</i> to switch to a SoA layout: While Visual C++'s auto-vectorizer may relieve you of writing SIMD intrinsics directly, it still requires SIMD-friendly—that is: SoA—code. As long as it works, it provides the most accessible way of writing high-performance code. The second-best way, from a usability standpoint, is probably C# and System.Numerics.Vectors, which enables (explicit) SIMD programming via a comparably easy-to-use interface.</p>
<p>Yet, the plot above also shows that none of the managed solutions is really able to keep up with any of the vectorized C++ versions. One reason for that is the inferior code generation of the JIT compiler compared to the C++ optimizer. Others are more intrinsic to the managed programming model (null-pointer checks, range checks). System.Numerics.Vectors is also far from complete: for instance, there is no support for FMA or scatter/gather operations. A "Vector8" type could help treat a float[] as a sequence of AVX-sized chunks.</p>
<h1>Conclusions</h1>
Want speed? Use a vectorized SoA approach. Want more speed? Use C++. That's what I learned from this little investigation. That, and—until we have an AI-driven optimizing compiler smart enough to understand what we are trying to achieve and to automatically transform our code into its most efficient form—that writing high-performance code will remain both an art <i>and</i> hard work for the time being.
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-43008224295179280372015-06-13T14:57:00.000+02:002015-09-17T16:55:26.781+02:00SIMD Fundamentals. Part II: AoS, SoA, Gather/Scatter - Oh my!<a href="http://frankniemeyer.blogspot.com/2015/06/simd-fundamentals-part-i-from-sisd-to.html">Last time</a> we looked at a very basic example of a data parallel problem: adding two arrays. Unfortunately, data parallelism is not always so easy to extract from existing code bases and often requires considerable effort. Today we will therefore address a slightly more complicated problem and move step by step from a pure scalar to a fully vectorized version.<br />
<h2>
The problem and its SISD solution</h2>
Imagine we wanted to compute the squared ℓ²-norm of a set of <i>m</i>
3-vectors <i><b>u</b></i> in R³, which is really just the dot product of each vector <i><b>x</b></i> with itself: ‖<i><b>x</b></i>‖² = <i>x</i>² + <i>y</i>² + <i>z</i>²<br />
<br />
In good OO fashion, you'd probably first define a Vec3f data type, in its simplest form like so (F# syntax)
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/e1654ddfcd957f7d59fa.js"></script>
Our vector field <i><b>u</b></i> can then simply be represented by an array of these structs and the code that performs the computation of the dot products is
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/3c7cf3c57a2cce1f5f20.js"></script>
Load the three components, square them, add up the squares and store the result in dp:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWx3rfQvhRY42S4sciTPf53lFAMzL9fFedtWpNwXHufC0RhCEnT_fPfH-kenHCJQBVOtRcgAhz14uCBDDAGpxISLWUFLctGA0cxB66JK_5VEi2vCgdkHLgOix_JGgFRrd3pa9U8x_bjJQ/s1600/aos-sisd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="274" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWx3rfQvhRY42S4sciTPf53lFAMzL9fFedtWpNwXHufC0RhCEnT_fPfH-kenHCJQBVOtRcgAhz14uCBDDAGpxISLWUFLctGA0cxB66JK_5VEi2vCgdkHLgOix_JGgFRrd3pa9U8x_bjJQ/s640/aos-sisd.png" width="640" /></a></div>
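For concreteness, the same scalar computation rendered as a C++ sketch (the names are my own; the F# version is in the gist above):

```cpp
struct Vec3f { float x, y, z; };

// Scalar AoS version: one dot product per iteration. On SSE2 the body
// compiles to three MULSS and two ADDSS, i.e. 5 FLOPs per result.
void squared_norms_aos(const Vec3f* u, float* dp, int m) {
    for (int i = 0; i < m; ++i)
        dp[i] = u[i].x * u[i].x + u[i].y * u[i].y + u[i].z * u[i].z;
}
```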
<div class="separator" style="clear: both; text-align: left;">
For SSE2, that's a total of three MULSS and two ADDSS instructions, so the one dot product we get out is worth 5 FLOPs. Now we would of course like to make use of our shiny SIMD hardware to accelerate the computation. There are a few options of how we can approach this:</div>
<h2 class="separator" style="clear: both; text-align: left;">
Pseudo-SIMD: Bolting SIMD on top of AoS</h2>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div class="separator" style="clear: both; text-align: left;">
Our current data layout is an "array of structures" (AoS). As we shall
see further below, AoS isn't a particularly good choice for vectorization,
and SIMDifying such code won't yield peak performance. Yet it can be
done, and depending on the SIMD instruction set used it may even provide
some speed-up.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Here's the idea: Instead of loading the three scalar components into separate scalar registers, we store all of them in a single SIMD register. In case of SSE2, <i>n</i> = 4 for single precision floats, so one of the elements in the register is a dummy value that we can ignore. Now we can multiply the SIMD register with itself to get the squared components (plus the ignored fourth one). Then we need to horizontally add up the three squares that we are interested in; in pseudo-F#:</div><br/><br/><script src="https://gist.github.com/FrankNiemeyer/6978e902f084ada7bb54.js"></script>
Graphically:
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEIzCxQdRqI40a5bjbBjQ9XJO1qnETtr-kSNSr5e7L6IUjdYQHchYDWyU1N2djbuWYiCkNmYd1pEPedoMTVKKyi87vOijrfHJMvef-j7MFKZUOuEb5vke9bMDdZH7GmY8GJiXDRJitppY/s1600/aos-simd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEIzCxQdRqI40a5bjbBjQ9XJO1qnETtr-kSNSr5e7L6IUjdYQHchYDWyU1N2djbuWYiCkNmYd1pEPedoMTVKKyi87vOijrfHJMvef-j7MFKZUOuEb5vke9bMDdZH7GmY8GJiXDRJitppY/s640/aos-simd.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
Although we were able to replace three MULSS with a single MULPS and therefore express 5 FLOPs through only 3 arithmetic instructions, there are multiple issues with this idea: </div>
<ol>
<li>We need three scalar loads plus shuffling operations to place the three components in a single SIMD register. </li>
<li>We effectively only use 3 of the 4 vector lanes, as we operate on Vec3f, not Vec4f structs. This problem becomes even more severe with wider vector paths.</li>
<li>We use vector operations only for multiplication, but not for addition. </li>
<li>Horizontal operations like adding the components are comparatively slow, require shuffling the components around and extracting a single scalar (the result).</li>
<li>We still only compute one dot product per iteration.</li>
</ol>
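To see issues 1–4 in code, here is a sketch that emulates the idea with a 4-element array standing in for one SSE register (scalar emulation for illustration, not actual intrinsics; the names are my own):

```cpp
#include <array>

struct Vec3f { float x, y, z; };

// One emulated SSE register: 4 single-precision lanes.
using Lane4 = std::array<float, 4>;

// Pseudo-SIMD over AoS: three scalar loads plus a wasted dummy lane,
// one packed multiply, then a slow horizontal reduction that only uses
// 3 of the 4 products -- and still only one dot product comes out.
float squared_norm_pseudo_simd(const Vec3f& v) {
    Lane4 r = {v.x, v.y, v.z, 0.0f};      // scalar loads + shuffling
    for (int lane = 0; lane < 4; ++lane)  // corresponds to one MULPS
        r[lane] *= r[lane];
    return r[0] + r[1] + r[2];            // horizontal add + extract
}
```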
<div class="separator" style="clear: both; text-align: left;">
Another example of this kind of pseudo-vectorization can be found in <a href="https://code.msdn.microsoft.com/SIMD-Sample-f2c8c35a">Microsoft's SIMD sample pack</a> (ray tracer sample). Yet, for all the reasons mentioned above, you should avoid this approach whenever possible.</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<h2>
On-the-fly vectorization</h2>
If you recall our "Hello, world!" example from <a href="http://frankniemeyer.blogspot.de/2015/06/simd-fundamentals-part-i-from-sisd-to.html">Part I</a>, then you may also remember that vectorizing our loop meant that we computed <i>n</i> results per iteration instead of one. And that is what we have to achieve for our current dot-product example as well: We need to compute <i>n</i> dot products in each iteration. How can we accomplish this?<br />
<br />
Imagine our data (array of Vec3f structs) to form a 3×m matrix (three rows, <i>m</i> columns). Each column represents a Vec3f, each row the <i>x</i>, <i>y</i> or <i>z</i> components of <i>all </i>Vec3fs. In the previous section, we tried to vectorize along the columns by parallelizing the computation of a single dot product—and largely failed.<br />
<br />
The reason is that the data parallelism in this example is really found along the <i>rows</i> of our imagined matrix, as <i>each dot product</i> can be computed independently of the others and thus in parallel. Furthermore, <i>m</i>, the number of Vec3fs, is typically much larger than <i>n</i>, so we face no problems utilizing the full SIMD width. The wider the SIMD vector, the more dot products we can compute per iteration.<br />
<br />
As with our example from last time, the vectorized version of the algorithm is actually very similar to the plain scalar one. That's why vectorizing a data-parallel problem isn't really hard once you get the idea. The only difference is that we don't handle individual floats, but chunks of <i>n</i> floats:<br />
<ol>
<li>Load <i>n x</i> values and square them</li>
<li>Load <i>n y</i> values and square them</li>
<li>Load <i>n z</i> values and square them</li>
<li>Add the <i>n x</i>, <i>y</i> and <i>z</i> values</li>
<li>Store the <i>n</i> resulting dot products</li>
</ol>
In pseudo-F# the algorithm is now
<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/91e4a92a76fc5425b2fa.js"></script>
Graphically:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiQQEAtHBqKn5PQNgfkG4OHr36IHLh6-aE7R3RNgBSQvrUYI0Dm9xWE-XQyxRH5tI0010_Wum6szCf6p-MM6l_ygJaaNOtF1P3Caw2J34_x5cTdbHZBRqQYh6j8aCpJrZvqWHrNBoebHs/s1600/otf-soa-simd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiQQEAtHBqKn5PQNgfkG4OHr36IHLh6-aE7R3RNgBSQvrUYI0Dm9xWE-XQyxRH5tI0010_Wum6szCf6p-MM6l_ygJaaNOtF1P3Caw2J34_x5cTdbHZBRqQYh6j8aCpJrZvqWHrNBoebHs/s640/otf-soa-simd.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The good part about this version is that it uses vector arithmetic instructions throughout, performing <i>n</i> times the work of its scalar counterpart: a total of 20 FLOPs each iteration (for <i>n</i> = 4).<br />
<br />
The bad part is the one labeled "scalar loads & shuffling" in the picture above: Before we can compute the <i>n</i> results, we have to gather<i> n</i> <i>x</i>, <i>y</i> and <i>z</i> values, but our data is still laid out in memory as an AoS, i.e. ordered [xyzxyzxyzxyzxyz...]. Loading logically successive [x0 x1 x2 x3 ...] values thus requires indexed/indirect load instructions (<i>vector</i> <i>gather</i>). SIMD instruction sets without gather/scatter support, like SSE2, have to load the data using <i>n</i> conventional scalar loads and appropriately place them in the vector registers, hurting performance considerably.<br />
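What that gathering step amounts to can be sketched in plain C++ (a scalar emulation of a vector gather; names and signatures are my own):

```cpp
// AoS memory: [x0 y0 z0 x1 y1 z1 ...]. Collecting n consecutive x (or
// y, or z) values means strided, indexed loads. Without hardware gather
// (e.g. on SSE2), each lane costs a scalar load plus an insert/shuffle.
constexpr int kLanes = 4;  // vector lane width n

void gather_xyz(const float* aos, int base,
                float xs[kLanes], float ys[kLanes], float zs[kLanes]) {
    for (int lane = 0; lane < kLanes; ++lane) {
        const float* v = aos + 3 * (base + lane);  // stride of 3 floats
        xs[lane] = v[0];
        ys[lane] = v[1];
        zs[lane] = v[2];
    }
}
```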
<h2>
Approaching nirvana: Structure of arrays (SoA)</h2>
To avoid this drawback, maybe we should just store the data in a SIMD-friendly way, as a <i>Structure of Arrays</i> (SoA): Instead of modeling the vector field <i><b>u</b></i> as a number of Vec3f structs, we consider it to be an object of three float arrays:<br /><br/>
<script src="https://gist.github.com/FrankNiemeyer/bb1faf8a47f0a6374bbe.js"></script>
This is indeed very similar to switching from a column-major to a row-major matrix layout, because it changes our data layout from [xyzxyzxyz...] to [xxxxx...yyyyy...zzzzz...].<br />
<br />
Observe how this implementation differs from the previous one only in that we index the components instead of the vectors:<br />
<br/><script src="https://gist.github.com/FrankNiemeyer/893436328a0a3b9a4aad.js"></script>
And yet it saves us a lot of loads and shuffling, resulting in pure, 100% vectorized SIMD code:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj47nsK5TO4M6Yu6xug1Qjs7kp5B-yHqAUQPnIjfXSK1NERpD1yGZSxQpaiuPqBDUavBKGE83LIulQztGLHl6EY0BMXTzvuYsvFLqOITL0iueebMZsbDEzbpqfnHC3S6A9jY-Q77T7eDC0/s1600/soa-simd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj47nsK5TO4M6Yu6xug1Qjs7kp5B-yHqAUQPnIjfXSK1NERpD1yGZSxQpaiuPqBDUavBKGE83LIulQztGLHl6EY0BMXTzvuYsvFLqOITL0iueebMZsbDEzbpqfnHC3S6A9jY-Q77T7eDC0/s640/soa-simd.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVzQYa4_i7T4vfCvHp0oFrlDvEz4GJDM58xP7e7bc2X5K6gGO3U6GY1KvopVN8h_FhuzUdo5jRsqPErYAEpccwD2BLVm-9CxAqtNDfTKDdKbaBh-iy0fP0ncfCMKOAIVdUW9JAgU2VkfI/s1600/soa-simd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;">
</a></div>
As this version really just replaces the scalar instructions with vector equivalents, executing 20 FLOPs in each iteration, we should now indeed get about 4x the performance of the scalar version and a much more favourable arithmetic-to-branch/load/store ratio.<br />
<br />
In fact, once the hard part of the work is done (switching from AoS to an SoA data layout), many compilers can even automatically vectorize loops operating on that kind of data.<br />
<h2>
Conclusions</h2>
Vectorizing OO code takes some getting used to; the SoA view of the world may seem a bit alien to non-<a href="http://en.wikipedia.org/wiki/APL_%28programming_language%29">APL</a> programmers. If you can foresee the need for vectorizing your code at some point in the future, it may be wise to use an SoA layout from the get-go instead of having to rewrite half of your code later on. Experience certainly helps in identifying data-parallel problems; but as a rule of thumb, good candidates for extracting data parallelism are those problems where thinking in terms of operations that apply to a whole array/field of items comes naturally. <br />
<br />
In part III we are going to discuss concrete implementations of the above concepts using F#, RyuJIT and System.Numerics.Vectors and compare the performance of the different versions—once they sorted out <a href="https://github.com/dotnet/corefx/issues/313">this issue</a>.<br />
<script src="https://raw.githubusercontent.com/moski/gist-Blogger/master/public/gistLoader.js" type="text/javascript"></script>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-65985377560591254342015-06-07T22:14:00.000+02:002015-06-07T23:17:16.263+02:00SIMD Fundamentals. Part I: From SISD to SIMDFor a long time, .NET developers didn't have to care about SIMD
programming, as it was simply not available to them (except maybe for Mono's
Mono.Simd). Only recently Microsoft introduced a new JIT compiler <a href="http://blogs.msdn.com/b/dotnet/archive/2013/09/30/ryujit-the-next-generation-jit-compiler.aspx">RyuJIT</a>
for .NET that, in conjunction with a special library
<a href="https://www.nuget.org/packages/System.Numerics.Vectors/">System.Numerics.Vectors</a>, offers access to the SIMD hardware of modern
CPUs. The goal of this three-part series of articles is therefore to introduce the fundamental ideas of SIMD to .NET developers who want to make use of all the power today's CPUs provide.<br />
<h2>
Parallelism in modern CPUs </h2>
In recent years CPUs gained performance mainly by increasing parallelism on all levels of execution. The advent of x86-CPUs with multiple cores established true <i>thread level parallelism </i>in the PC world. Solving the "multi-core problem", the question of how to distribute workloads between different threads in order to use all that parallel hardware goodness—ideally (semi-)automatically, suddenly became and continues to be one of <i>the</i> predominant goals of current hard- and software related research.<br />
<br />
Years before the whole multi-core issue started, CPUs already utilized another kind of parallelism to increase performance: <i>Instruction level parallelism</i> (ILP) was exploited via pipelining, super-scalar and out-of-order execution and other advanced techniques. The nice thing about this kind of parallelism is that your average Joe Developer doesn't have to think about it; it's all handled by smart compilers and smart processors.<br />
<br />
But then in 1997 Intel introduced the P55C and with it a new 64-bit-wide SIMD instruction set called MMX ("multimedia extension"; after all, "multimedia" was "the cloud" of the 90s). MMX made available a level of parallelism new to most PC developers: <i>data level parallelism</i> (DLP). Contrary to ILP, however, DLP requires specifically designed code to do any good. This was true for MMX and it remains true for its successors like Intel's 256-bit-wide <a href="http://en.wikipedia.org/wiki/Advanced_Vector_Extensions#AVX2">AVX2</a>. Just as with multi-threading, programmers need to understand how to use those capabilities properly in order to exploit the tremendous amounts of floating point horsepower.<br />
<h2>
From SISD to SIMD</h2>
Whenever I learn new concepts, I like to first think of examples and generalize afterwards (inductive reasoning). In my experience that's true for most people, so let us therefore start with the "Hello, world!" of array programming: Suppose you have two floating point arrays, say xs and ys of length <i>m</i> and, for some reason, you want to add those arrays component-wise and store the result in an array zs. The conventional "scalar" or SISD (Single Instruction Single Data, see <a href="http://en.wikipedia.org/wiki/Flynn%27s_taxonomy">Flynn's taxonomy</a>) way of doing this looks like this (C# syntax)<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">for</span> (<span style="color: blue;">var</span> i = 0; i < m; ++i) {
zs[i] = xs[i] + ys[i];
}</pre>
<br />
Fetch two floats, add them up and store the result in zs[i]. Easy:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgW27Fz6wWTZpL20DRzUid08FdmZXghg37LyZZDIPlZOWckYK89igU46g2moWI9k-s_XmlTJEJWpXScQgGAJ2pAjdpo8fsaMHmUrZoTqgGwQS-ZJYvL7FiVOpIaR_ga3dRCWBnOclCKx1g/s1600/sisd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="276" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgW27Fz6wWTZpL20DRzUid08FdmZXghg37LyZZDIPlZOWckYK89igU46g2moWI9k-s_XmlTJEJWpXScQgGAJ2pAjdpo8fsaMHmUrZoTqgGwQS-ZJYvL7FiVOpIaR_ga3dRCWBnOclCKx1g/s640/sisd.png" width="640" /></a></div>
For this specific example, adding SIMD (Single Instruction Multiple Data) is almost trivial: For simplicity, let us further assume that <i>m</i> is evenly divisible by <i>n</i>, the vector lane width. Now instead of performing one addition per iteration, we add up xs and ys in chunks of <i>n</i> items, in pseudo-C# with an imagined array range expression syntax<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">for</span> (<span style="color: blue;">var</span> i = 0; i < m; i += n) {
zs[i:(i+n-1)] = xs[i:(i+n-1)] + ys[i:(i+n-1)];
}</pre>
<br />
We fetch <i>n</i> items from xs and <i>n</i> items from ys using vector load instructions, add them up using a single vector add, and store the <i>n</i> results in zs. Compared to the scalar version, we decode fewer instructions and perform <i>n</i> times the work in each iteration. Think of each SIMD register as a window, <i>n</i> values wide, into an arbitrarily long stream (array) of scalar values. Each SIMD instruction modifies <i>n</i> scalar values of that stream <i>at once</i> (that's where the speed-up comes from) before the window moves on to the next chunk of <i>n</i> values:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG335OkdXiF_OC02dJtDqVGTpnFAUIqiFtYYofNQMiNZbr_04a1E570QKx_vuBpg3-e7_2NkDgixmfKZGbCw58SCcacFoWLSxXIm0m3wi95WmFY44LUXCTtxI0JFzUzg-b_q2tkgksChI/s1600/simd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiG335OkdXiF_OC02dJtDqVGTpnFAUIqiFtYYofNQMiNZbr_04a1E570QKx_vuBpg3-e7_2NkDgixmfKZGbCw58SCcacFoWLSxXIm0m3wi95WmFY44LUXCTtxI0JFzUzg-b_q2tkgksChI/s640/simd.png" width="640" /></a></div>
So SIMD really is just a form of array processing where you think in terms of applying one operation (single instruction) to a lot of different elements (multiple data): loop-unrolling on steroids. And that's basically all you need to know about SIMD.<br />
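Spelled out as a C++ sketch (with the inner lane loop standing in for single packed instructions, and still assuming m is evenly divisible by n), the chunked loop looks like this:

```cpp
constexpr int kN = 4;  // lane width n, e.g. SSE single precision

// Vectorized array add: each outer iteration is one "window" of n
// values and would compile down to two vector loads, one vector add
// and one vector store. Assumes m is evenly divisible by kN.
void add_arrays(const float* xs, const float* ys, float* zs, int m) {
    for (int i = 0; i < m; i += kN)
        for (int lane = 0; lane < kN; ++lane)
            zs[i + lane] = xs[i + lane] + ys[i + lane];
}
```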
<br />
Yet, this example is perhaps the most obvious data parallel case one could think of. It gets a whole lot more interesting once you add OOP and the data layout this paradigm propagates to the mix. We will look into the issue of vectorizing typical OOP code in the next part of this series.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-9414376765918098002015-06-04T00:44:00.003+02:002015-06-06T17:16:28.814+02:00NativeInterop 2.4.0NativeInterop v2.4.0 is ready and awaiting your download from <a href="https://www.nuget.org/packages/NativeInterop/">NuGet</a>. This version brings a revised version of Buffer.Copy&lt;T&gt; tuned for .NET 4.6/RyuJIT: By using a 16-byte block copy, the JITter can generate movq (x86) or movdqu (x64) instructions for further improved performance. Enjoy!<br />
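The blocked-copy idea can be sketched like this (my own illustration, not NativeInterop's actual implementation): copying fixed 16-byte chunks lets the compiler lower each chunk to a single wide move instead of a library call.

```cpp
#include <cstddef>
#include <cstring>

// Copy in 16-byte blocks; a constant-size memcpy is typically compiled
// to one unaligned 128-bit move (e.g. movdqu) per block.
void block_copy16(const unsigned char* src, unsigned char* dst,
                  std::size_t len) {
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16)
        std::memcpy(dst + i, src + i, 16);
    for (; i < len; ++i)  // remaining tail, bytewise
        dst[i] = src[i];
}
```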
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyR4LM1fjIwlW6yS1egioBasHzMowEaEVyqexfbV90oZDxYKnXIHtwWCGlhk_2a2XPhQjq8zE65TIL5llsYF4Ryr57FGpbeioUz7lxA9KUKEZr6E_ED9xUSxKaBrAbkPKbiIJKPMks7Ws/s1600/native-interop-2.4.0.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyR4LM1fjIwlW6yS1egioBasHzMowEaEVyqexfbV90oZDxYKnXIHtwWCGlhk_2a2XPhQjq8zE65TIL5llsYF4Ryr57FGpbeioUz7lxA9KUKEZr6E_ED9xUSxKaBrAbkPKbiIJKPMks7Ws/s1600/native-interop-2.4.0.png" /></a></div>
<br />
Now I just have to figure out where the larger performance drop for large data sizes comes from compared to the other methods. memcpy somehow reaches approx. 20 GB/s beyond the L3 cache.<br />
<br />
Btw., I have a further post on SIMD programming with .NET almost ready, but I can't publish it yet due to <a href="https://github.com/dotnet/corefx/issues/313#issuecomment-107379684">problems</a> with the current version of System.Numerics.Vectors in combination with .NET 4.6/VS 2015 RC. Hope that'll get fixed soon!<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-62070511406484473922014-11-07T10:52:00.000+01:002014-11-07T10:53:24.619+01:00A quick look at RyuJIT CTP 5Microsoft's JIT team just released CTP 5 of "RyuJIT", its next generation JIT compiler for .NET. I just got around to test it with the SIMD version of my <a href="http://frankniemeyer.blogspot.de/2014/04/a-first-look-at-ruyjit-ctp3-and-simd.html" target="_blank">X-ray simulator</a>. The updated benchmark results look like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVRzQg3zsODaIN7xx-Rnlsjfv3JwkqaGnTrQhJLgHXN5sa4e9_nstD-ogbZFPbYHxOtQPFVDANpYe-Z377xs9ye_IFEt-7lmEaHgLcwxXF-xhAou4fA0_60bTSZtN-BaBe36vuRKGDvZ0/s1600/ctp5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVRzQg3zsODaIN7xx-Rnlsjfv3JwkqaGnTrQhJLgHXN5sa4e9_nstD-ogbZFPbYHxOtQPFVDANpYe-Z377xs9ye_IFEt-7lmEaHgLcwxXF-xhAou4fA0_60bTSZtN-BaBe36vuRKGDvZ0/s1600/ctp5.png" height="396" width="640" /></a></div>
<br />
Yes, performance actually <i>dropped</i> with the latest release. Again, the AABB ray intersection test is the bottleneck and this time the generated machine code is even worse compared to CTP3:<br />
<br />
<ul>
<li><a href="https://gist.github.com/anonymous/52e4bac6b5cbecd0a7cc" target="_blank">AABB intersection test, C#/SSE2 via RyuJIT CTP5</a></li>
<li><a href="https://gist.github.com/anonymous/10020560" target="_blank">AABB intersection test, C#/SSE2 via RyuJIT CTP3</a></li>
</ul>
<br />
Let's just hope this is a temporary regression.
Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-2523394567180576305.post-13007523204109087562014-07-16T14:50:00.002+02:002014-07-16T22:31:30.266+02:00Methods for Reading Structured Binary Data: Benchmarks & ComparisonsIn <a href="http://frankniemeyer.blogspot.com/2014/05/binary-stl-io-using-nativeinteropstream.html" target="_blank">my previous post</a> I demonstrated how one can use the <a href="https://www.nuget.org/packages/NativeInterop/" target="_blank">NativeInterop</a> library to easily implement file I/O for structured binary data formats, STL in this particular case. In this example, we used NativeInterop.Stream.ReadUnmanagedStructRange to read a sequence of bytes from a file as an array of STL triangle structs.<br />
<br />
When implementing ReadUnmanagedStructRange I had these design goals in mind:<br />
<ul>
<li>Ease-of-use (no user-side tinkering with unsafe code)</li>
<li>Portability (after all, NativeInterop is a portable class library)</li>
<li>Genericity (should work with any unmanaged data type) </li>
<li>Efficiency</li>
</ul>
Today's post concerns how I tried to fulfill the first three bullet points without sacrificing too much of the last one ("efficiency"/"performance").<br />
<h3>
Choices</h3>
Implementing ReadUnmanagedStructRange (and its dual, WriteUnmanagedStructRange) essentially requires a method to convert a stream of bytes to a (managed) array of the target type (which is supposed to be unmanaged/blittable). In the context of reading binary STL, we would therefore need to somehow convert a byte[] to an STLTriangle[].<br />
<br />
In a language like C that guarantees neither type nor memory safety, we would simply treat the raw byte array (represented as a byte*) as an STLTriangle* and be done. And while that approach is possible in C# as well, it requires unsafe code at every point we want to access the STL data. Put differently: As there is no way to change the "runtime type" of a .NET array (certainly for good reasons!), the solution to our problem boils down to either...<br />
<ol>
<li>"Manually" construct objects of the target type from the byte[] data or store them in the result array, or...</li>
<li><b>Unsafely</b> copy the contents of the raw byte[] to an STLTriangle[] ("memcpy"), without explicitly or implicitly creating intermediate/temporary objects</li>
</ol>
Both options can be implemented in .NET in multiple ways; for option (1) that is<br />
<ul>
<li>BinaryReader.ReadXXX: read individual fields directly from the stream</li>
<li>BitConverter.ToXXX: convert small chunks of the byte[] to individual fields of the target data type</li>
<li>Buffer.BlockCopy: read multiple fields of homogeneous type at once</li>
<li>Marshal.PtrToStructure&lt;T&gt;: interpret larger chunks of the byte[] as complete STLTriangle structs</li>
</ul>
For option (2) ("memcpy" approach) we will consider<br />
<ul>
<li>Simple, unsafe C# code (*dst++ = *src++) </li>
<li>Marshal.Copy</li>
<li>memcpy (via P/Invoke)</li>
<li>cpblk CIL instruction</li>
<li>NativeInterop.Buffer.Copy using a custom blocked memcpy implementation (ReadUnmanagedStructRange uses this function under the hood) </li>
</ul>
For our sample scenario, we assume that the STL data is already available in-memory as a byte[]; otherwise we would only be measuring I/O transfer speed. We also don't want to measure GC performance; therefore the following implementations will use a pre-allocated result array ("tris") to store the re-interpreted STLTriangle data.<br />
<h3>
BinaryReader.ReadXXX</h3>
Probably the most straight-forward approach for parsing a binary file is using a BinaryReader to get the data for individual fields of the target data structure and then create the data structure from that data. An implementation for reading an STL file might look like the following bit of code:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">void</span> BinaryReaderRead(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> br = <span style="color: blue;">new</span> <span style="color: #2b91af;">BinaryReader</span>(<span style="color: blue;">new</span> <span style="color: #2b91af;">MemoryStream</span>(triBytes))) {
<span style="color: blue;">for</span> (<span style="color: blue;">int</span> i = 0; i < tris.Length; ++i) {
<span style="color: blue;">var</span> normX = br.ReadSingle();
<span style="color: blue;">var</span> normY = br.ReadSingle();
<span style="color: blue;">var</span> normZ = br.ReadSingle();
<span style="color: blue;">var</span> aX = br.ReadSingle();
<span style="color: blue;">var</span> aY = br.ReadSingle();
<span style="color: blue;">var</span> aZ = br.ReadSingle();
<span style="color: blue;">var</span> bX = br.ReadSingle();
<span style="color: blue;">var</span> bY = br.ReadSingle();
<span style="color: blue;">var</span> bZ = br.ReadSingle();
<span style="color: blue;">var</span> cX = br.ReadSingle();
<span style="color: blue;">var</span> cY = br.ReadSingle();
<span style="color: blue;">var</span> cZ = br.ReadSingle();
<span style="color: blue;">var</span> abc = br.ReadUInt16();
tris[i] = <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(normX, normY, normZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(aX, aY, aZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(bX, bY, bZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(cX, cY, cZ),
abc);
}
}
}</pre>
<br />
Note that I'm using a MemoryStream here to simulate reading from a Stream object while avoiding disk I/O.
<br />
<h3>
BitConverter.ToXXX</h3>
I guess it will hardly surprise my dear readers when I reveal upfront that the BinaryReader approach is not exactly the most efficient one. The next optimization step could therefore be to read the whole file into a flat byte[] at once and extract the data using BitConverter:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">void</span> BitConverterTo(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">for</span> (<span style="color: blue;">int</span> i = 0; i < tris.Length; ++i) {
<span style="color: blue;">var</span> offset = i * <span style="color: #2b91af;">STLTriangle</span>.Size;
<span style="color: blue;">var</span> normX = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset);
<span style="color: blue;">var</span> normY = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 4);
<span style="color: blue;">var</span> normZ = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 8);
<span style="color: blue;">var</span> aX = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 12);
<span style="color: blue;">var</span> aY = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 16);
<span style="color: blue;">var</span> aZ = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 20);
<span style="color: blue;">var</span> bX = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 24);
<span style="color: blue;">var</span> bY = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 28);
<span style="color: blue;">var</span> bZ = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 32);
<span style="color: blue;">var</span> cX = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 36);
<span style="color: blue;">var</span> cY = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 40);
<span style="color: blue;">var</span> cZ = <span style="color: #2b91af;">BitConverter</span>.ToSingle(triBytes, offset + 44);
<span style="color: blue;">var</span> abc = <span style="color: #2b91af;">BitConverter</span>.ToUInt16(triBytes, offset + 48);
tris[i] = <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(normX, normY, normZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(aX, aY, aZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(bX, bY, bZ),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(cX, cY, cZ),
abc);
}
}</pre>
<br />
Here we convert chunks of four bytes into the single precision floating point fields of an STLTriangle, representing the components of the vertices and the normal vector. The last field is the 2-byte attribute byte count.<br />
<h3>
Buffer.BlockCopy</h3>
We can further refine the previous approach by extracting not single floats from the byte[], but all vertex/normal data for each single triangle at once using Buffer.BlockCopy:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">void</span> BufferBlockCopy(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">var</span> coords = <span style="color: blue;">new</span> <span style="color: blue;">float</span>[12];
<span style="color: blue;">for</span> (<span style="color: blue;">int</span> i = 0; i < tris.Length; ++i) {
<span style="color: blue;">var</span> offset = i * <span style="color: #2b91af;">STLTriangle</span>.Size;
System.<span style="color: #2b91af;">Buffer</span>.BlockCopy(triBytes, offset, coords, 0, 48);
<span style="color: blue;">var</span> abc = <span style="color: #2b91af;">BitConverter</span>.ToUInt16(triBytes, offset + 48);
tris[i] = <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(coords[0], coords[1], coords[2]),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(coords[3], coords[4], coords[5]),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(coords[6], coords[7], coords[8]),
<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(coords[9], coords[10], coords[11]),
abc);
}
}</pre>
Unfortunately Buffer.BlockCopy can only handle arrays whose elements are of primitive type (in this case, we copy from a byte[] to a float[]). We can therefore <i>not</i> use this approach to copy all bytes at once from the byte[] to an STLTriangle[].<br />
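To illustrate that restriction, here is a small self-contained sketch (type and variable names are invented for illustration): copying between arrays of primitive element types works, while a struct-element array is rejected at runtime with an ArgumentException.

```csharp
using System;

static class BlockCopyDemo
{
    // A non-primitive (though blittable) element type, defined here only for the demo.
    struct Pair { public float A, B; }

    static void Main()
    {
        var bytes = BitConverter.GetBytes(1.5f);   // 4 bytes encoding the float 1.5
        var floats = new float[1];

        // byte[] -> float[]: both element types are primitive, so this works.
        Buffer.BlockCopy(bytes, 0, floats, 0, 4);
        Console.WriteLine(floats[0]);              // prints 1.5

        // byte[] -> Pair[]: Pair is not a primitive type, so BlockCopy refuses.
        var pairs = new Pair[1];
        try
        {
            Buffer.BlockCopy(bytes, 0, pairs, 0, 4);
        }
        catch (ArgumentException)
        {
            Console.WriteLine("BlockCopy rejects non-primitive element types");
        }
    }
}
```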
<h3>
Marshal.PtrToStructure&lt;T&gt;</h3>
The Marshal class provides a method PtrToStructure (or, alternatively, its generic form PtrToStructure&lt;T&gt;), which essentially reads an (unmanaged) struct from an arbitrary memory address:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> MarshalPtrToStructureGeneric(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">var</span> triType = <span style="color: blue;">typeof</span>(<span style="color: #2b91af;">STLTriangle</span>);
<span style="color: blue;">fixed</span> (<span style="color: blue;">byte</span>* pBytes = &triBytes[0])
{
<span style="color: blue;">for</span> (<span style="color: blue;">int</span> i = 0; i < tris.Length; ++i) {
<span style="color: blue;">var</span> offset = i * <span style="color: #2b91af;">STLTriangle</span>.Size;
tris[i] = <span style="color: #2b91af;">Marshal</span>.PtrToStructure<<span style="color: #2b91af;">STLTriangle</span>>(<span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(pBytes + offset));
}
}
}</pre>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h3>
*dst++ = *src++ </h3>
Essentially the same effect as with Marshal.PtrToStructure can be achieved with a little bit of unsafe C#:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> UnsafePointerArith(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">fixed</span> (<span style="color: blue;">byte</span>* pbuffer = triBytes)
<span style="color: blue;">fixed</span> (<span style="color: #2b91af;">STLTriangle</span>* ptris = tris)
{
<span style="color: blue;">var</span> pSrc = (<span style="color: #2b91af;">STLTriangle</span>*)pbuffer;
<span style="color: blue;">var</span> pDst = ptris;
<span style="color: blue;">for</span> (<span style="color: blue;">int</span> i = 0; i < tris.Length; ++i) {
*pDst++ = *pSrc++;
}
}
}</pre>
<br />
Here we treat the (fixed) byte input array as an STLTriangle*, which we can then dereference to store each triangle's data in the result array of type STLTriangle[]. Note that you cannot implement this kind of operation generically in C#, as C# does not allow dereferencing pointers to objects of arbitrary type.<br />
<h3>
Marshal.Copy</h3>
The aforementioned unsafe C# approach is already a form of memcpy, albeit an unoptimized one (we copy blocks of 50 bytes at each iteration); furthermore, it's specialized for copying from byte[] to STLTriangle[]. Marshal.Copy, on the other hand, is a little more generic, as it can copy any array of <i>primitive</i> elements to an arbitrary memory location (and vice versa):<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> ConvertBufferMarshalCopy(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">fixed</span> (<span style="color: #2b91af;">STLTriangle</span>* pTris = &tris[0]) {
<span style="color: #2b91af;">Marshal</span>.Copy(triBytes, 0, <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(pTris), triBytes.Length);
}
}</pre>
<h3>
memcpy (P/Invoke)</h3>
We've already concluded that we essentially need an efficient way to copy one block of memory to some other location in memory, so why not simply <i>use</i> the system-provided, highly optimized memcpy implementation?<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;">[<span style="color: #2b91af;">DllImport</span>(<span style="color: #a31515;">"msvcrt.dll"</span>, EntryPoint = <span style="color: #a31515;">"memcpy"</span>, CallingConvention = <span style="color: #2b91af;">CallingConvention</span>.Cdecl, SetLastError = <span style="color: blue;">false</span>)]
<span style="color: blue;">static</span> <span style="color: blue;">extern</span> <span style="color: #2b91af;">IntPtr</span> memcpy(<span style="color: #2b91af;">IntPtr</span> dest, <span style="color: #2b91af;">IntPtr</span> src, <span style="color: #2b91af;">UIntPtr</span> count);
<span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> ConvertBufferMemcpy(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">fixed</span> (<span style="color: blue;">byte</span>* src = triBytes)
<span style="color: blue;">fixed</span> (<span style="color: #2b91af;">STLTriangle</span>* dst = tris)
{
memcpy(<span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(dst), <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(src), <span style="color: blue;">new</span> <span style="color: #2b91af;">UIntPtr</span>((<span style="color: blue;">uint</span>)triBytes.Length));
}
}</pre>
<br />
Easy. Of course, this introduces a dependency on Microsoft's C runtime library msvcrt.dll and is thus not platform independent.<br />
<h3>
cpblk</h3>
Interestingly, the CIL provides a specialized instruction just for copying blocks of memory, namely <a href="http://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes.cpblk%28v=vs.110%29.ASPX" target="_blank">cpblk</a>. A few years back, Alexander Mutel used cpblk in his <a href="http://code4k.blogspot.de/2010/10/high-performance-memcpy-gotchas-in-c.html" target="_blank">nice performance comparison</a> of different
memcpy methods for .NET. To my knowledge, this instruction is currently not directly accessible from either C# or F# (not sure about C++/CLI, though). Yet, we can generate the necessary instructions using a bit of runtime code generation; here I'm using <a href="http://kevinmontrose.com/" target="_blank">Kevin Montrose's</a> excellent <a href="https://www.nuget.org/packages/Sigil/" target="_blank">Sigil</a> library:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: green;">// parameters: src, dst, length (bytes)</span>
<span style="color: blue;">static</span> <span style="color: #2b91af;">Action</span><<span style="color: #2b91af;">IntPtr</span>, <span style="color: #2b91af;">IntPtr</span>, <span style="color: blue;">uint</span>> CpBlk = GenerateCpBlk();
<span style="color: blue;">static</span> <span style="color: #2b91af;">Action</span><<span style="color: #2b91af;">IntPtr</span>, <span style="color: #2b91af;">IntPtr</span>, <span style="color: blue;">uint</span>> GenerateCpBlk() {
<span style="color: blue;">var</span> emitter = Sigil.<span style="color: #2b91af;">Emit</span><<span style="color: #2b91af;">Action</span><<span style="color: #2b91af;">IntPtr</span>, <span style="color: #2b91af;">IntPtr</span>, <span style="color: blue;">uint</span>>>.NewDynamicMethod(<span style="color: #a31515;">"CopyBlockIL"</span>);
<span style="color: green;">// emit IL</span>
emitter.LoadArgument(1); <span style="color: green;">// dest</span>
emitter.LoadArgument(0); <span style="color: green;">// src</span>
emitter.LoadArgument(2); <span style="color: green;">// len</span>
emitter.CopyBlock();
emitter.Return();
<span style="color: green;">// compile to delegate</span>
<span style="color: blue;">return</span> emitter.CreateDelegate();
}
<span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> ConvertBufferCpBlk(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">fixed</span> (<span style="color: blue;">byte</span>* src = triBytes)
<span style="color: blue;">fixed</span> (<span style="color: #2b91af;">STLTriangle</span>* dst = tris)
{
CpBlk(<span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(src), <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(dst), (<span style="color: blue;">uint</span>)triBytes.Length);
}
}</pre>
<br />
While it's a bit clunky that code generation is required to get access to cpblk, the nice thing about it is that it's platform-independent: any CLI-compliant implementation must provide it. There is even hope that the next version of F# will come with an updated version of its pointer intrinsics library that will also feature access to cpblk (cf. <a href="https://fslang.uservoice.com/forums/245727-f-language/suggestions/5670328-additional-intrinsics-for-the-nativeptr-module" target="_blank">Additional intrinsics for the NativePtr module</a> and the corresponding <a href="https://visualfsharp.codeplex.com/SourceControl/network/forks/jackpappas/fsharpcontrib?branch=native-interop" target="_blank">pull request</a>).<br />
<h3>
NativeInterop.Buffer.Copy&lt;T, U&gt;</h3>
My own little <a href="https://www.nuget.org/packages/NativeInterop/" target="_blank">NativeInterop</a> helper library also has tools for moving around stuff in memory. For instance, the Buffer module contains a method Copy&lt;T, U&gt; that copies an input array of type T to an array of type U. Both T and U must be unmanaged types (Note: Neither the C# compiler nor the CLR can enforce this constraint!). Under the covers, Buffer.Copy&lt;T, U&gt;(T[], U[]) creates an empty result array and then copies the input bytes to the result array using a custom block copy implementation. Using Buffer.Copy&lt;T, U&gt; is simple:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">void</span> ConvertBufferBufferConvert(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
NativeInterop.<span style="color: #2b91af;">Buffer</span>.Copy(triBytes, tris);
}</pre>
<br />
Unfortunately, I couldn't use any of the other mem-copy methods to implement Buffer.Copy, as they are either not generic enough (Marshal.Copy, Buffer.BlockCopy), introduce dependencies on native libraries (memcpy via P/Invoke), or require runtime code generation or unsafe C# code, neither of which is available for Portable Class Libraries like NativeInterop (cpblk, *dst++ = *src++).<br />
<br />
I therefore experimented with different memcpy algorithms (different block sizes, different degrees of loop-unrolling, with or without considering alignment...) and—for the time being—settled on a comparatively simple, aligned qword copying mechanism that offers decent performance on x64 with the current x64 JIT.<br />
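The core qword-copying idea can be sketched as follows. This is a minimal sketch, not the actual NativeInterop implementation (which additionally considers alignment and is written in F#): copy eight bytes at a time and handle the 0..7 remaining tail bytes individually.

```csharp
// Compile with /unsafe. Sketch only, not the NativeInterop implementation.
using System;

static class QwordCopyDemo
{
    // Copies `count` bytes from src to dst, one qword (8 bytes) at a time,
    // then finishes the remaining 0..7 bytes byte-wise.
    public static unsafe void QwordCopy(byte* src, byte* dst, int count)
    {
        var pSrc = (ulong*)src;
        var pDst = (ulong*)dst;
        for (int i = 0; i < count / 8; ++i)
            *pDst++ = *pSrc++;

        // tail: whatever is left after the last full qword
        var pSrcTail = (byte*)pSrc;
        var pDstTail = (byte*)pDst;
        for (int i = 0; i < count % 8; ++i)
            *pDstTail++ = *pSrcTail++;
    }

    static unsafe void Main()
    {
        var src = new byte[50];                       // one STLTriangle's worth of bytes
        for (var i = 0; i < src.Length; ++i) src[i] = (byte)i;
        var dst = new byte[50];
        fixed (byte* pSrc = src)
        fixed (byte* pDst = dst)
        {
            QwordCopy(pSrc, pDst, src.Length);
        }
        Console.WriteLine(dst[49]);                   // prints 49
    }
}
```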
<h3>
Multi-Threading</h3>
For some of the methods we will also look at a multi-threaded variant. Essentially, all of those versions look similar to this (here: memcpy):<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">public</span> <span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: blue;">void</span> ConvertBufferMemcpyMultiThreaded(<span style="color: blue;">byte</span>[] triBytes, <span style="color: #2b91af;">STLTriangle</span>[] tris) {
<span style="color: blue;">var</span> threadcount = <span style="color: #2b91af;">Environment</span>.ProcessorCount;
<span style="color: blue;">var</span> chunksize = triBytes.Length / threadcount;
<span style="color: blue;">var</span> remainder = triBytes.Length % threadcount;
<span style="color: blue;">fixed</span> (<span style="color: blue;">byte</span>* pBytes = &triBytes[0])
<span style="color: blue;">fixed</span> (<span style="color: #2b91af;">STLTriangle</span>* pTris = &tris[0])
{
<span style="color: blue;">var</span> tasks = <span style="color: blue;">new</span> <span style="color: #2b91af;">Action</span>[threadcount];
<span style="color: blue;">for</span> (<span style="color: blue;">var</span> i = 0; i < threadcount - 1; ++i) {
<span style="color: blue;">var</span> offset = i * chunksize;
<span style="color: blue;">var</span> newSrc = <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(pBytes + offset);
<span style="color: blue;">var</span> newDst = <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>((<span style="color: blue;">byte</span>*)pTris + offset);
tasks[i] = () => memcpy(newDst, newSrc, <span style="color: blue;">new</span> <span style="color: #2b91af;">UIntPtr</span>((<span style="color: blue;">uint</span>)chunksize));
}
<span style="color: blue;">var</span> finalOffset = (threadcount - 1) * chunksize;
<span style="color: blue;">var</span> finalSrc = <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>(pBytes + finalOffset);
<span style="color: blue;">var</span> finalDst = <span style="color: blue;">new</span> <span style="color: #2b91af;">IntPtr</span>((<span style="color: blue;">byte</span>*)pTris + finalOffset);
tasks[threadcount - 1] = () => memcpy(finalDst, finalSrc, <span style="color: blue;">new</span> <span style="color: #2b91af;">UIntPtr</span>((<span style="color: blue;">uint</span>)(chunksize + remainder)));
<span style="color: #2b91af;">Parallel</span>.Invoke(tasks);
}
}</pre>
<br />
Given that copying (large) chunks of memory should be a bandwidth-limited problem, multi-threading shouldn't help much, at least for <i>efficient</i> implementations. Methods with a lot of overhead might profit a little more.<br />
<h3>
Benchmark</h3>
As a benchmark problem, I chose to create fake STL data from 2^6 up to 2^20 triangles. As each triangle struct is 50 bytes, the total amount of data transferred (read + write) varies between approx. 6 kB up to approx. 100 MB and should thus stress all cache levels.<br />
<br />
Each method described above ran up to 10 000 times with a time-out of 5 s for each of the 15 problem sizes. To minimize interference from other processes, I set the process priority to "high" and kicked off a garbage collection before each round.<br />
<br />
System.Diagnostics.Stopwatch, which I used to measure transfer times, has a resolution of 300 ns on my system. For very small working sets and the fastest methods, that's already too coarse. Of course, I could have measured the total time for, say, 10 000 runs and just computed the average. Yet it turns out that, even with sufficient warm-up, there are still enough outliers in the measured timings to skew the result. After some experimenting, I decided to use the mean of the values below the 10th percentile instead. That seems to be more stable than relying on the timings of individual runs and also excludes extreme outliers. Still, I wouldn't trust the measurements for sizes below 2^8 triangles (25 kB) too much.<br />
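The aggregation step can be sketched like this (an assumed reconstruction, not the actual benchmark code): sort the samples and average the lowest 10%.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Timing
{
    // Mean of the samples at or below the 10th percentile: more stable than
    // trusting a single fastest run, and immune to slow outliers.
    public static double RobustMean(IEnumerable<double> timings)
    {
        var sorted = timings.OrderBy(t => t).ToArray();
        var count = Math.Max(1, sorted.Length / 10);  // lowest 10% of samples
        return sorted.Take(count).Average();
    }
}
```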
<br />
I ran the benchmarks on an Intel Core i7-2600K @ 3.4 - 4.2 GHz with 32 GiB of DDR3-1600 RAM (PC3-12800, dual channel; peak theoretical transfer rate: 25 GiB/s) under Windows 8.1 Pro. Both the CLR's current x64 JIT compiler "JIT64" and <a href="http://blogs.msdn.com/b/dotnet/archive/2013/09/30/ryujit-the-next-generation-jit-compiler.aspx" target="_blank">RyuJIT</a> (CTP 4) received the opportunity to show off their performance in the benchmarks.<br />
<h3>
Results & Analysis</h3>
First let's have a look at the results from JIT64 (click for full-size):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwtIjL1-0D8E9ut8q8uuLodMhWJNZG8WA4fIl4Ip3TqIjtKkBxR-pFrh6ndkBswwxl5wjakKx3paHf36iHxEn2p-fAxI9AK0iJ_U0YujzoGVdZvVvlS2Ca-P6m7Q2xhv8ikPia8Lz5Ars/s1600/jit64.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwtIjL1-0D8E9ut8q8uuLodMhWJNZG8WA4fIl4Ip3TqIjtKkBxR-pFrh6ndkBswwxl5wjakKx3paHf36iHxEn2p-fAxI9AK0iJ_U0YujzoGVdZvVvlS2Ca-P6m7Q2xhv8ikPia8Lz5Ars/s1600/jit64.png" height="500" width="640" /></a></div>
<br />
Ignoring the multithreaded versions for now (dashed lines), it is immediately clear that memcpy (P/Invoke) offers great overall performance for both the smallest and the largest data sets. Marshal.Copy and cpblk are a close second. Unsafe C# (*dst++ = *src++) offers stable but comparatively poor performance.<br />
<br />
NativeInterop's Buffer.Copy doesn't even come close to the fastest methods for small data sets, but offers comparable performance for larger sets. Something in its implementation is generating way more overhead than necessary... That "something" turns out to be the allocation of GCHandles for fixing the managed arrays: F# 3.1 doesn't support a "fixed" statement as C# does. To check whether that truly could be the source of the overhead, I implemented a version of the cpblk method that uses GCHandle.Alloc instead of fixed and—lo and behold—this slowed down cpblk to approximately the same speed as Buffer.Copy (light blue).<br />
<br />
Unsurprisingly, the multithreaded versions come with quite some overhead, so they can't compete for < 500 kB. Only for problem sizes that fit into Sandy Bridge's L3 cache do we see some nice scaling. Problem sizes beyond the L3 cache size are again limited by the system's DRAM bandwidth and don't profit from multi-threading.<br />
<br />
The performance of all other (non-memcpy-like) methods is abysmal at best. We best just skip them...<br />
<br />
How does this picture change once we switch to an (early!) preview of the upcoming RyuJIT compiler? As it turns out, not by much; yet something interesting is still happening:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKTNQ2kK9RvodH0u8BNf5lrg1cFci6XfaiiOn4HRlCecJwxKTRMvDtUzckRFqmOzOM1XMjy9E1iKWd0272MO-ckQodzvAQ7hSWp_GKevmaMfhxGlN60BVEAou0zaLuDVa6ikg2QLbbmrQ/s1600/ryujitctp4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKTNQ2kK9RvodH0u8BNf5lrg1cFci6XfaiiOn4HRlCecJwxKTRMvDtUzckRFqmOzOM1XMjy9E1iKWd0272MO-ckQodzvAQ7hSWp_GKevmaMfhxGlN60BVEAou0zaLuDVa6ikg2QLbbmrQ/s1600/ryujitctp4.png" height="502" width="640" /></a></div>
<br />
Suddenly, "*dst++ = *src++" has become one of the fastest implementations. What's going on here? Let's have a look at the generated assembly for *dst++ = *src++; first let's see what JIT64 produces:<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;">lea rdx, [r9+r8]
lea rax, [r9+32h]
mov rcx, r9
mov r9, rax
mov rax, qword ptr [rcx]
mov qword ptr [rsp+20h], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rsp+28h], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rsp+30h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rsp+38h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rsp+40h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rsp+48h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rsp+50h], ax
lea rcx, [rsp+20h]
mov rax, qword ptr [rcx]
mov qword ptr [rdx], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rdx+8], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rdx+10h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rdx+18h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rdx+20h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rdx+28h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rdx+30h], ax
inc r11d
cmp r11d, r10d
jl 00007FFE57865410</pre>
<br />
Hm, so it first moves data from [rcx] (an offset into the original byte[]) to a temporary at [rsp+20h] and then copies that to the destination [rdx] (an offset into the result STLTriangle[]). That's obviously one step that's not necessary and where RyuJIT can improve (there are 28 mov instructions, but only 12 qword movs + 2 word movs are required to move 50 bytes from one memory location to another). So, what does RyuJIT really do?<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;">lea r9, [rcx+32h]
lea r10, [rax+32h]
movdqu xmm0, xmmword ptr [rax]
movdqu xmmword ptr [rcx], xmm0
movdqu xmm0, xmmword ptr [rax+10h]
movdqu xmmword ptr [rcx+10h], xmm0
movdqu xmm0, xmmword ptr [rax+20h]
movdqu xmmword ptr [rcx+20h], xmm0
mov r11w, word ptr [rax+30h]
mov word ptr [rcx+30h], r11w
inc r8d
cmp edx, r8d
mov rax, r10
mov rcx, r9
jg 00007FFE57884CE8</pre>
<br />
We see that RyuJIT is able to not only remove the unnecessary intermediate load/store ops, but in addition it also issues (unaligned) SIMD mov instructions to copy 16 bytes at a time. This optimization also works for NativeInterop.Buffer.Copy&lt;T, U&gt;: once I increased the block size to at least 16 bytes, performance became identical to that of *dst++ = *src++ in the RyuJIT case (aside from the overhead for small problem sizes due to GCHandle allocations).<br />
<h3>
Conclusion</h3>
When your only concern is reading from/writing to a file on a comparatively slow disk, none of the above should bother you much: most of the methods will simply be "fast enough". As soon as you start to move around data that's already in memory (e.g. cached), or that resides on a very fast SSD, choosing the right method becomes critical, however.<br />
<br />
While NativeInterop.Buffer.Copy isn't the fastest of the bunch (yet), its performance is competitive for medium to large problem sizes and in any case orders of magnitude faster than the usual BinaryReader approach. If you want convenience, portability and genericity while maintaining reasonable performance, NativeInterop provides a good all-round solution, I believe. If you want raw speed, though, use memcpy (or whatever may be available on your platform).<br />
<br />
<table border="0" style="border: 1px solid black; padding: 5px;">
<tbody>
<tr>
<th>Method</th>
<th>Convenience</th>
<th>Portability</th>
<th>Safety</th>
<th>Efficiency</th>
<th>Genericity</th>
</tr>
<tr>
<td>BinaryReader.ReadXXX</td>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>BitConverter.ToXXX</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">o</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>Buffer.BlockCopy</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">o</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td>Marshal.PtrToStructure&lt;T&gt;</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">o</td>
<td align="center">-</td>
<td align="center">+</td>
</tr>
<tr>
<td>*dst++ = *src++</td>
<td align="center">o</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">o</td>
<td align="center">o</td>
</tr>
<tr>
<td>Marshal.Copy</td>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">o</td>
</tr>
<tr>
<td>memcpy</td>
<td align="center">o</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">+</td>
</tr>
<tr>
<td>cpblk</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">+</td>
<td align="center">+</td>
</tr>
<tr>
<td>Buffer.Convert/Copy, Stream.ReadUnmanagedStructRange&lt;T, U&gt;</td>
<td align="center">+</td>
<td align="center">+</td>
<td align="center">-</td>
<td align="center">o/+</td>
<td align="center">+</td>
</tr>
</tbody></table>
<h2>
Binary STL I/O using NativeInterop.Stream</h2>
There are essentially three common ways for reading/writing structured binary data of some user-defined type <span style="font-family: Consolas,monospace; font-size: 14;">T</span> (read: "files of records") from/to files in C# and .NET in general:<br />
<ol>
<li>Use <span style="font-family: Consolas,monospace; font-size: 14;">System.IO.BinaryReader/BinaryWriter</span>
and its <span style="font-family: Consolas,monospace; font-size: 14;">ReadXXX/Write</span> methods to read/write individual fields of the data type T in question.</li>
<li>Use one of the <span style="font-family: Consolas,monospace; font-size: 14;">System.Runtime.InteropServices.Marshal.PtrToStructure/StructureToPtr</span> methods to convert between (pinned) byte arrays (which can be written to and read from <span style="font-family: Consolas,monospace; font-size: 14;">System.IO.Stream</span>) and <a href="http://msdn.microsoft.com/en-US/library/2zhzfk83%28v=vs.110%29.aspx" target="_blank">non-generic value types or "formatted class types"</a> via .NET's COM marshalling infrastructure.</li>
<li>Use a little bit of unsafe code and cast a <span style="font-family: Consolas,monospace; font-size: 14;">byte*</span> pointing at the first entry of a pinned byte buffer to a <span style="font-family: Consolas,monospace; font-size: 14;">T*</span> so structs of that type <span style="font-family: Consolas,monospace; font-size: 14;">T</span> can be written/read by simply dereferencing a pointer.</li>
</ol>
Generally, the whole task becomes a lot easier when <span style="font-family: Consolas,monospace; font-size: 14;">T</span> is an <i>unmanaged</i> type (required for option #3 to work properly). Unmanaged types are either primitive types or value types (<span style="font-family: Consolas,monospace; font-size: 14;">struct</span> in C#) composed only of unmanaged types. See the recursion in the definition? What that essentially boils down to is that an unmanaged type may be composed of arbitrarily nested structs, as long as no reference type is involved at any level.<br />
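In code, the recursion looks like this (all three types are invented for illustration):

```csharp
// Unmanaged: only primitive fields.
struct Inner
{
    public float X;
    public int Count;
}

// Still unmanaged: nested structs are fine, as long as every field,
// at every level of nesting, is itself unmanaged.
struct Outer
{
    public Inner First;
    public Inner Second;
    public byte Flag;
}

// NOT unmanaged: string is a reference type, so this struct satisfies
// C#'s `struct` constraint but is unusable for raw byte-level I/O.
struct Managed
{
    public Inner Payload;
    public string Name;
}
```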
<br />
Sadly, C# cannot constrain type parameters to unmanaged types and, as a consequence, does not support dereferencing generic pointer types. (In C# there is only the <span style="font-family: Consolas,monospace; font-size: 14;">struct</span> constraint, but that is not sufficient as a struct might contain fields of reference type.) What that means is that option #3 from above only works for concrete types.<br />
<br />
Because F# can constrain a type parameter to be an unmanaged type, I used F# to build the <a href="http://frankniemeyer.blogspot.com/2014/04/nativeinterop-is-live-on-nuget.html" target="_blank">NativeInterop library</a> that—at its core—exposes the possibility to (unsafely!) dereference generic pointers in C#. This in turn enabled the implementation of some <i>generic</i> extension methods (NativeInterop ≥ v1.4.0) for <span style="font-family: Consolas,monospace; font-size: 14;">System.IO.Stream</span> that provide methods for reading and writing structured binary files that are both easy* to use <i>and</i> efficient**.<br />
<br />
<b>A word of warning:</b> The F# unmanaged constraint does not surface in the NativeInterop API if used from C# (and probably VB.NET). Thus the user of NativeInterop has to make sure that their data types are truly unmanaged! <i>If that is not the case, NativeInterop may produce arbitrary garbage!</i><br />
<h2>
Example: Writing and Reading Binary STL files</h2>
Let's look at a simple example of how to use the aforementioned library methods to handle a simple structured binary file format: STL files essentially contain a description of a triangle mesh (triangulated surface). The exact format of the binary version (there is also an ASCII variant) is described on <a href="http://en.wikipedia.org/wiki/STL_%28file_format%29#Binary_STL" target="_blank">Wikipedia</a>:<br />
<ul>
<li>An 80 byte ASCII header, which may contain some descriptive string</li>
<li>The number of triangles in the mesh as an unsigned 32 bit integer </li>
<li>For each triangle there is one record of the following format:</li>
<ul>
<li>A normal vector made up of three single precision floating point numbers</li>
<li>Three vertices, each made up of three single precision floating point numbers</li>
<li>A field called "Attribute byte count" (16 bit unsigned integer); this should usually be zero, but some software may interpret this as color information.</li>
</ul>
</ul>
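The layout above fixes the file size completely: 80 header bytes, a 4 byte count, and 50 bytes per triangle record. A quick sanity check (the helper name is mine, not part of any library):

```csharp
// Expected size in bytes of a binary STL file with the given triangle count.
static long ExpectedStlSize(uint triangleCount)
{
    const int headerBytes = 80;                                       // ASCII header
    const int countBytes = sizeof(uint);                              // triangle count
    const int triangleBytes = 4 * 3 * sizeof(float) + sizeof(ushort); // 48 + 2 = 50
    return headerBytes + countBytes + (long)triangleCount * triangleBytes;
}
```

The 4-triangle tetrahedron used below, for example, should produce a file of exactly 80 + 4 + 4 * 50 = 284 bytes, which is an easy way to verify the writer.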
<h3>
Modeling STL Records in C#</h3>
In this example, we will only explicitly model the triangle records. The header information is so simple that it's easier to just write out an 80 byte ASCII string directly.<br />
<br />
To represent the triangle information, we use the following user-defined unmanaged type(s):<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;">[<span style="color: #2b91af;">StructLayout</span>(<span style="color: #2b91af;">LayoutKind</span>.Sequential, Pack = 1)]
<span style="color: blue;">struct</span> <span style="color: #2b91af;">STLVector</span>
{
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: blue;">float</span> X;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: blue;">float</span> Y;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: blue;">float</span> Z;
    <span style="color: blue;">public</span> STLVector(<span style="color: blue;">float</span> x, <span style="color: blue;">float</span> y, <span style="color: blue;">float</span> z) {
        <span style="color: blue;">this</span>.X = x;
        <span style="color: blue;">this</span>.Y = y;
        <span style="color: blue;">this</span>.Z = z;
    }
}
[<span style="color: #2b91af;">StructLayout</span>(<span style="color: #2b91af;">LayoutKind</span>.Sequential, Pack = 1)]
<span style="color: blue;">struct</span> <span style="color: #2b91af;">STLTriangle</span>
{
    <span style="color: green;">// 4 * 3 * 4 byte + 2 byte = 50 byte</span>
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: #2b91af;">STLVector</span> Normal;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: #2b91af;">STLVector</span> A;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: #2b91af;">STLVector</span> B;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: #2b91af;">STLVector</span> C;
    <span style="color: blue;">public</span> <span style="color: blue;">readonly</span> <span style="color: blue;">ushort</span> AttributeByteCount;
    <span style="color: blue;">public</span> STLTriangle(
        <span style="color: #2b91af;">STLVector</span> normalVec,
        <span style="color: #2b91af;">STLVector</span> vertex1,
        <span style="color: #2b91af;">STLVector</span> vertex2,
        <span style="color: #2b91af;">STLVector</span> vertex3,
        <span style="color: blue;">ushort</span> attr = 0)
    {
        Normal = normalVec;
        A = vertex1;
        B = vertex2;
        C = vertex3;
        AttributeByteCount = attr;
    }
}
</pre>
<br />
<h3>
Defining a Test Geometry</h3>
For testing purposes, we can now define a super-simple triangle mesh, a single tetrahedron:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: green;">// tetrahedron, vertex order: right hand rule</span>
<span style="color: blue;">var</span> mesh = <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>[] {
    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 1, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(1, 0, 0)),
    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 1),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 1, 0)),
    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 1),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(1, 0, 0)),
    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLTriangle</span>(<span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 1, 0),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(0, 0, 1),
                    <span style="color: blue;">new</span> <span style="color: #2b91af;">STLVector</span>(1, 0, 0)),
};</pre>
<br />
We leave the normals at zero: most software will automatically derive correct surface normals, provided the vertices of each face are specified following the right-hand rule (i.e. enumerated counter-clockwise when looking at the face from the outside).<br />
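The normal a viewer infers from that winding is just the cross product of two edge vectors; a scalar sketch (mine, not part of the STL code above; it uses the STLVector type defined earlier):

```csharp
// Face normal (unnormalized) from three vertices given in
// counter-clockwise (right-hand-rule) order.
static STLVector FaceNormal(STLVector a, STLVector b, STLVector c)
{
    // Edge vectors from the first vertex.
    float ux = b.X - a.X, uy = b.Y - a.Y, uz = b.Z - a.Z;
    float vx = c.X - a.X, vy = c.Y - a.Y, vz = c.Z - a.Z;
    // Cross product u x v.
    return new STLVector(uy * vz - uz * vy,
                         uz * vx - ux * vz,
                         ux * vy - uy * vx);
}
```

For the base face (0,0,0), (0,1,0), (1,0,0) this yields (0, 0, -1), pointing away from the apex (0,0,1), i.e. outward.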
<h3>
Writing STL</h3>
Now it's really straightforward to generate a valid STL file from the available data:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">using</span> (<span style="color: blue;">var</span> bw = <span style="color: blue;">new</span> <span style="color: #2b91af;">BinaryWriter</span>(<span style="color: #2b91af;">File</span>.OpenWrite(filename), <span style="color: #2b91af;">Encoding</span>.ASCII)) {
    <span style="color: green;">// Encode the header string as ASCII and put it into an 80 byte buffer</span>
    <span style="color: blue;">var</span> headerString = <span style="color: #a31515;">"Tetrahedron"</span>;
    <span style="color: blue;">var</span> header = <span style="color: blue;">new</span> <span style="color: blue;">byte</span>[80];
    <span style="color: #2b91af;">Encoding</span>.ASCII.GetBytes(headerString, 0, headerString.Length, header, 0);
    bw.Write(header);
    <span style="color: green;">// write #triangles (the format specifies an unsigned 32 bit integer)</span>
    bw.Write((<span style="color: blue;">uint</span>)mesh.Length);
    <span style="color: green;">// use extension method from NativeInterop.Stream to write out the mesh data</span>
    bw.BaseStream.WriteUnmanagedStructRange(mesh);
}</pre>
<br />
And here we have it, our, ahem, "beautiful" tetrahedron rendered in <a href="http://meshlab.sourceforge.net/" target="_blank">MeshLab</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxH_iRu9Ex60mW-9xsRe3M5HSiI33AZLVxoD4upPkbedLzDzxir2o7HsblYxYqW0aKt8ZRxP3QR5F4S_HBMuVSLuAX7Ou5mmiHkvEUi9HOMKQxaicQxspje0vMiKRpKWHb9xv1qs16A8I/s1600/tet.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxH_iRu9Ex60mW-9xsRe3M5HSiI33AZLVxoD4upPkbedLzDzxir2o7HsblYxYqW0aKt8ZRxP3QR5F4S_HBMuVSLuAX7Ou5mmiHkvEUi9HOMKQxaicQxspje0vMiKRpKWHb9xv1qs16A8I/s1600/tet.png" /></a></div>
<br />
Note how all the surface normals are sticking outward.<br />
<h3>
Reading STL</h3>
By using the <span style="background: white; color: black; font-family: Consolas; font-size: 14;">ReadUnmanagedStructRange<<span style="color: #2b91af;">T</span>></span>
extension method, reading binary STL data is just as simple:<br />
<br />
<pre style="background: white; color: black; font-family: Consolas; font-size: 13;"><span style="color: blue;">string</span> header;
<span style="color: #2b91af;">STLTriangle</span>[] mesh;
<span style="color: blue;">using</span> (<span style="color: blue;">var</span> br = <span style="color: blue;">new</span> <span style="color: #2b91af;">BinaryReader</span>(<span style="color: #2b91af;">File</span>.OpenRead(filename), <span style="color: #2b91af;">Encoding</span>.ASCII)) {
    header = <span style="color: #2b91af;">Encoding</span>.ASCII.GetString(br.ReadBytes(80));
    <span style="color: blue;">var</span> triCount = br.ReadUInt32();
    mesh = br.BaseStream.ReadUnmanagedStructRange<<span style="color: #2b91af;">STLTriangle</span>>((<span style="color: blue;">int</span>)triCount);
}</pre>
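As a final sanity check we can compare the original mesh with the read-back copy (called readMesh here to avoid clashing with the reader snippet's variable name):

```csharp
// Compare the written mesh with the read-back copy, triangle by triangle.
// ValueType.Equals performs a field-wise comparison for these plain value types.
bool identical = mesh.Length == readMesh.Length;
for (int i = 0; identical && i < mesh.Length; i++)
    identical = mesh[i].Equals(readMesh[i]);
Console.WriteLine(identical ? "Round-trip OK" : "Mismatch!");
```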
<h2>
Conclusion</h2>
Reading and writing simple structured binary data is easy using NativeInterop.Stream. <a href="https://www.nuget.org/packages/NativeInterop/" target="_blank">Get it now from NuGet!</a> For reporting bugs or suggesting new features, please use the <a href="https://bitbucket.org/frank_niemeyer/nativeinterop/issues?status=new&status=open" target="_blank">BitBucket issue tracker</a>.<br />
<br />
<span style="font-size: small;">*Easy, because the user of the library doesn't have to fiddle with unsafe code.</span><br />
<span style="font-size: small;">**Efficient, because under the hood it boils down to a generic version of option #3 with zero marshalling overhead.</span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-7868845404719953172014-04-11T14:42:00.002+02:002014-05-21T13:32:14.997+02:00NativeInterop is live on NuGetToday I released a first version of my <a href="https://www.nuget.org/packages/NativeInterop/" target="_blank">NativeInterop package on NuGet</a>. You can find a description of the purpose and usage of the package as well as the <a href="https://bitbucket.org/frank_niemeyer/nativeinterop" target="_blank">source code on BitBucket</a>.<br />
<br />
The motivation to build and release this package evolved from two practical issues I encountered when writing performance-critical C# code:<br />
<br />
<ol>
<li>Creating a native, high-performance generic array data structure in C# seems to be impossible (see <a href="http://frankniemeyer.blogspot.com/2010/04/minimalistic-native-64-bit-array.html" target="_blank">A minimalistic native 64 bit array ...</a>).</li>
<li>Reading structured binary data from a byte[] requires some ugly hacks to get both decent performance and genericity (see e.g. <a href="http://www.codeproject.com/Articles/25896/Reading-Unmanaged-Data-Into-Structures" target="_blank">Reading Unmanaged Data Into Structures</a>).</li>
</ol>
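For reference, the "ugly hack" behind issue #2 usually looks something like this: pin the buffer, then marshal a struct out of it. This is a generic sketch of that well-known pattern, not code from this package:

```csharp
using System;
using System.Runtime.InteropServices;

// Read one T out of a byte[] by pinning the buffer and marshalling.
// Works, but pays for a GC pin plus reflection-based marshalling on every call;
// note that "where T : struct" cannot rule out structs with reference-type fields.
static T ReadStruct<T>(byte[] buffer, int offset) where T : struct
{
    GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        IntPtr p = handle.AddrOfPinnedObject() + offset;
        return (T)Marshal.PtrToStructure(p, typeof(T));
    }
    finally
    {
        handle.Free();   // always unpin, even if marshalling throws
    }
}
```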
The reason for both of these issues is that C# lacks an "unmanaged" type constraint, and thus you cannot express something like<br />
<pre style="background: white; color: black; font-family: Consolas,monospace; font-size: 13;"><span style="color: blue;">static</span> <span style="color: blue;">unsafe</span> <span style="color: #2b91af;">T</span> Read<<span style="color: #2b91af;">T</span>>(<span style="color: #2b91af;">T</span>* p) {
<span style="color: blue;">return</span> *p;
}</pre>
But in F#, you can; you'd simply encode this as<br />
<pre style="background: white; color: black; font-family: Consolas,monospace; font-size: 13;"><span style="color: blue;">open</span> NativeInterop
<span style="color: blue;">let</span> pVal = NativePtr.read ptr</pre>
where ptr is of type nativeptr<'T> and 'T is constrained to unmanaged types.<br />
<br />
The performance offered by the NativeInterop package should be on par with non-generic unsafe C# code. The NativeInterop package also contains an implementation of <a href="http://frankniemeyer.blogspot.com/2010/04/minimalistic-native-64-bit-array.html" target="_blank">NativeArray64&lt;T&gt;</a>, but this time without using inline IL. It turned out that in newer versions of the .NET framework, the AGUs are utilized correctly for the address (offset) computation (instead of emitting IMUL instructions): Calling NativePtr.set<'T>/get<'T>/set64<'T>/get64<'T> (or NativePtr.Get&lt;T&gt;/Set&lt;T&gt; or IntPtr.Get&lt;T&gt;/Set&lt;T&gt; or NativePtr&lt;T&gt;.get_Item/set_Item, respectively) should all result in the generation of a single mov instruction.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-71346973369390780332014-04-07T18:16:00.002+02:002014-05-21T13:32:31.800+02:00A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NETNot being able to exploit today's processors' SIMD processing capabilities is a major culprit when implementing high-performance (e.g. numerical) applications in C# (or any other CLI language). While there is <a href="http://tirania.org/blog/archive/2008/Nov-03.html" target="_blank">Mono.Simd</a>, there is no solution for applications running on top of Microsoft's own runtime (CLR), <a href="http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/2212443-c-and-simd" target="_blank">despite popular demand</a> ... until now!<br />
<br />
With BUILD 2014, Microsoft <a href="http://blogs.msdn.com/b/clrcodegeneration/archive/2014/04/03/ryujit-ctp3-how-to-use-simd.aspx" target="_blank">released a new preview version</a> of the next generation JIT compiler "RyuJIT" that, combined with a <a href="http://www.nuget.org/packages/Microsoft.Bcl.Simd" target="_blank">special SIMD library</a> that can be installed via NuGet, supports SIMD intrinsics (only SSE2 for now, but AVX is in the works).<br />
<br />
Finally! I couldn't wait to try out the new bits, so I modified the C# version of my existing XRaySimulator* to make use of SSE2 by implementing a simple packet ray tracing technique: instead of tracing individual rays, this version traces bundles of 2x2 (SSE2) or 4x2 (AVX) rays. Because the rays in a bundle are largely "coherent", they typically hit the same objects (which improves cache hit rates).<br />
<h3>
The contenders</h3>
Currently there are a total of six different variants of the XRaySimulator:<br />
<ul>
<li> <a href="https://bitbucket.org/frank_niemeyer/xraysimulator-managed/src/ade0e3cd52adc9c6556ace135f49fcaa9faf94ca/?at=Scalar-NoAdjTrav" target="_blank">"C#"</a>: This is the baseline, scalar managed implementation.</li>
<li> <a href="https://bitbucket.org/frank_niemeyer/xraysimulator-managed/src/12996fe451e6695156de80ae1cb5d884c924b7a6/?at=default" target="_blank">"C# adj. trav."</a>: A further optimized version that exploits the fact that once a ray is inside a volume (finite element) mesh, it must hit a face of an adjacent element (hexahedron).</li>
<li> <a href="https://bitbucket.org/frank_niemeyer/xraysimulator-managed/src/e3685bc9b217cba689ffcce32bd0390db6dfb653/?at=SIMD-NoWrap" target="_blank">"C#/SSE2"</a>: Like "C#", but using 2x2 (X-)ray packets; doesn't use "adjacency traversal" due to the high branching factor</li>
<li> "C++": A C++11 reimplementation of "C#"; I tried to stay as close as possible to "C#" while still using at least half-way decent, idiomatic C++.</li>
<li> <a href="https://bitbucket.org/frank_niemeyer/xraysimulator/src/ee7ab422f8b703839c22bb31c824018324b8b1ff/?at=default" target="_blank">"C++ adj. trav."</a>: Corresponds to "C# adj. trav."</li>
<li> <a href="https://bitbucket.org/frank_niemeyer/xraysimulator/src/f65d5a14ef7ddc2e8d41714c95475db3d534c19b/?at=AVX" target="_blank">"C++/AVX"</a>: Vectorized version of "C++" using 4x2 ray bundles thanks to AVX</li>
</ul>
Note that most of these implementations are to be considered "quick-and-dirty, yet somehow working hacks..." If you don't mind the ugliness, though, you may follow the links to BitBucket and have a look at the code (Visual Studio 2013 projects; you also need the latest <a href="http://msdn.microsoft.com/en-us/vstudio/roslyn.aspx" target="_blank">Roslyn CTP</a> in order to compile the C#/SSE2 branch).<br />
<h3>
Performance analysis </h3>
So, who wins? The following figure shows the performance of the different versions in million rays per second (MRay/s) rendering an FE model consisting of 28672 hexahedral elements (344064 triangles) at a resolution of 6400 x 4800 pixels on an Intel Core i7-2600K (3.4 - 4.2 GHz) with 32 GB DDR3 RAM running under Windows 8.1 Pro:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimMPi_78E5sVGci21pf6gFRl7DTeWrOIa8xWSr9jMYC9G5DAn1HIkRrM4HXuVwS0Fx6ppjS9GDbGWRy3Yj6el83c6nCBNYixWsfiqQsVYa2RV76YYwiGkQUBrztc4T0bEQXJk89oNCzYI/s1600/xraybench.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimMPi_78E5sVGci21pf6gFRl7DTeWrOIa8xWSr9jMYC9G5DAn1HIkRrM4HXuVwS0Fx6ppjS9GDbGWRy3Yj6el83c6nCBNYixWsfiqQsVYa2RV76YYwiGkQUBrztc4T0bEQXJk89oNCzYI/s1600/xraybench.png" /></a></div>
<br />
As expected "C++/AVX" blasts away the rest of the pack with a stunning 13 MRay/s. And while "C++/AVX" delivers a speed-up of 5.2 over "C++", "C#/SSE2" only improves by a factor of 2.5 compared to "C#" and displays only insignificant performance gains over the optimized scalar version "C# adj. trav."<br />
<br />
Now, given that SSE2 uses only 128-bit-wide vector lanes compared to AVX's generous 256 bit and the generally much more aggressive optimizer of the Visual C++ compiler, it's not exactly surprising to see an obvious performance difference between the "C++/AVX" and "C#/SSE2" case. Yet, I still would have expected the speed-up of "C#/SSE2" to reach a value a little closer to 4x instead of 2.5. What's going on there?<br />
<br />
According to Visual Studio's built-in profiler, all of the implementations spend the majority of their time in the AABB (axis-aligned bounding box) intersection routine. That is actually a good thing, because this intersection test is very cheap compared to a triangle intersection test. The quality of the machine code generated for this method/function is therefore critical for the overall performance of the renderer.<br />
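For context, the scalar ray/AABB test in question is typically a "slab test" along these lines (a generic sketch of the standard algorithm, not the repository's exact code):

```csharp
using System;

// Slab test: clip the ray parameter interval against the three axis-aligned
// slab pairs; the ray hits the box iff the interval stays non-empty.
// The invD* parameters hold the precomputed reciprocal of the ray direction.
static bool RayHitsAabb(
    float ox, float oy, float oz,       // ray origin
    float invDx, float invDy, float invDz,
    float minX, float minY, float minZ, // box min corner
    float maxX, float maxY, float maxZ) // box max corner
{
    float t1 = (minX - ox) * invDx, t2 = (maxX - ox) * invDx;
    float tMin = Math.Min(t1, t2), tMax = Math.Max(t1, t2);

    t1 = (minY - oy) * invDy; t2 = (maxY - oy) * invDy;
    tMin = Math.Max(tMin, Math.Min(t1, t2));
    tMax = Math.Min(tMax, Math.Max(t1, t2));

    t1 = (minZ - oz) * invDz; t2 = (maxZ - oz) * invDz;
    tMin = Math.Max(tMin, Math.Min(t1, t2));
    tMax = Math.Min(tMax, Math.Max(t1, t2));

    return tMax >= Math.Max(tMin, 0.0f); // intersection in front of the origin
}
```

The min/max dance is branch-free, which is exactly why the SIMD variants can evaluate this arithmetic for 4 (SSE2) or 8 (AVX) rays at once.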
<br />
The source code of the C#/SSE2 version looks like this:
<br />
<div class="gistLoad" data-id="10020539" id="gist-10020539">
Loading ....</div>
<br />
And here's the source for the C++/AVX version:
<br />
<div class="gistLoad" data-id="10020502" id="gist-10020502">
Loading ....</div>
<br />
(Note: The C++ code uses a hard-coded vector lane width of 8 floats.)<br />
<br />
Almost identical; yet, if you compare what both RyuJIT and Visual C++ make of these sources, you'll first notice that the machine code emitted by RyuJIT is much more convoluted and thus longer: <br />
<ul>
<li><a href="https://gist.github.com/anonymous/10020560" target="_blank">C#/SSE2 via RyuJIT CTP3</a></li>
<li><a href="https://gist.github.com/anonymous/10020578" target="_blank">C++/AVX via Visual C++ 2013</a></li>
</ul>
Most of that "additional stuff" going on in the C#/RyuJIT version seems to be related to null-pointer checks (both AABB and RayPack are classes and thus reference types). Still, I wonder if all those load/store operations are truly necessary.<br />
<h3>
Preliminary conclusions</h3>
It seems like Microsoft has finally awakened and is making the long overdue investments in .NET performance. Thanks, Google and Apple! Although RyuJIT still requires a lot of optimization, in particular with respect to the generated SIMD code, Redmond's latest moves are promising. A next generation JIT, SIMD support, <a href="http://msdn.microsoft.com/en-US/vstudio/dotnetnative" target="_blank">AOT compilation using the Visual C++ optimizer backend</a>... What will come next? GPGPU support? Large arrays? A decent, modern, performant desktop UI framework? True first-class support for F#?<br />
<br />
The future is bright!<br />
<br />
<span style="color: #666666;"><span style="font-size: small;">*XRaySimulator is a visualization tool that renders X-ray-like images of finite element models. It uses a modified ray tracing algorithm to compute the energy absorption within each intersected element based on the element's material properties. A BVH (bounding volume hierarchy) is used to speed-up the intersection computation.</span></span><br />
<span style="color: #666666;"><span style="font-size: small;">Details (German): <a href="http://www.uni-ulm.de/fileadmin/website_uni_ulm/uzwr/projekte/p10-2.pdf">http://www.uni-ulm.de/fileadmin/website_uni_ulm/uzwr/projekte/p10-2.pdf</a></span></span><br />
<span style="font-size: x-small;"><span style="color: #666666;"><span style="font-size: small;"><br />**The C# versions of XRaySimulator on BitBucket currently don't support saving the rendered image to a file. In older versions, I used to use Tao.DevIL, but that only works on x86 and the preview releases of RyuJIT only emit x64 machine code. The C++ versions use a custom TGA output filter.</span></span></span>
<script src="https://raw.github.com/moski/gist-Blogger/master/public/gistLoader.js" type="text/javascript"></script>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-64674307207065376222010-04-25T01:08:00.001+02:002014-05-21T13:32:41.922+02:00A minimalistic native 64 bit array implementation for .NET (updated code)If you ever felt the need to process huge amounts of data with an algorithm implemented on .NET/the CLR, you've surely run into the 2^31-items limit of the CLR's current array implementation, which only supports Int32 array indices (this also affects other collections like List&lt;T&gt;, as those use standard arrays for storage internally).<br />
You can try to circumvent this limitation by implementing your own array-like data structure: either emulate contiguous storage with a collection of standard .NET arrays (partition the data into chunks of at most 2^31 items each), or use native APIs and some evil pointer arithmetic to get maximum performance.<br />
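The first, fully managed approach can be sketched in a few lines (a hypothetical ChunkedArray64 class, unrelated to the native implementation below):

```csharp
// A 64-bit-indexable array emulated with fixed-size managed chunks.
// Entirely safe code, at the price of one extra indirection per access.
sealed class ChunkedArray64<T>
{
    const int ChunkShift = 20;             // 2^20 items per chunk
    const int ChunkSize = 1 << ChunkShift;
    const long ChunkMask = ChunkSize - 1;

    readonly T[][] chunks;
    public readonly long Length;

    public ChunkedArray64(long length)
    {
        Length = length;
        long chunkCount = (length + ChunkSize - 1) >> ChunkShift;
        chunks = new T[chunkCount][];
        for (long i = 0; i < chunkCount; i++)
            chunks[i] = new T[ChunkSize];  // last chunk deliberately over-allocated
    }

    public T this[long index]
    {
        get { return chunks[index >> ChunkShift][index & ChunkMask]; }
        set { chunks[index >> ChunkShift][index & ChunkMask] = value; }
    }
}
```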
A while ago I tried to implement the latter approach in C#. That isn't a big deal in itself: just a few unsafe blocks for the pointer arithmetic and a call to Marshal.AllocHGlobal to allocate memory on the unmanaged heap. However, when I tried to make that custom collection generic, I ran into an unsolvable problem:<br />
<div class="csharpcode">
<pre class="alt"><span class="kwrd">public</span> <span class="kwrd">unsafe</span> T <span class="kwrd">this</span>[<span class="kwrd">long</span> index]</pre>
<pre>{</pre>
<pre class="alt"> <span class="kwrd">get</span></pre>
<pre> { </pre>
<pre class="alt"> <span class="kwrd">return</span> *((T*)pBase + index);</pre>
<pre> }</pre>
<pre class="alt"> <span class="kwrd">set</span></pre>
<pre> {</pre>
<pre class="alt"> *((T*)pBase + index) = <span class="kwrd">value</span>;</pre>
<pre> }</pre>
<pre class="alt">}</pre>
</div>
<style type="text/css">
.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, "Courier New", courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }</style>
<br />
This code does not compile. The reason is that there is no way to tell the C# compiler that T shall be constrained to unmanaged types.<br />
Interestingly, F# 2.0 <i>does</i> feature such a constraint! This is how a minimalistic F# implementation of such a native 64 bit array could look:<br />
<div id="codeSnippetWrapper">
<div class="csharpcode">
<pre class="alt"><span class="kwrd">namespace</span> NativeTools</pre>
<pre> </pre>
<pre class="alt">#nowarn <span class="str">"9"</span></pre>
<pre>#nowarn <span class="str">"42"</span></pre>
<pre class="alt"> </pre>
<pre>open System</pre>
<pre class="alt">open System.Runtime</pre>
<pre>open Microsoft.FSharp.NativeInterop</pre>
<pre class="alt"> </pre>
<pre><span class="kwrd">module internal</span> PointerArithmetic =</pre>
<pre class="alt"> [<CompiledName(<span class="str">"AddIntPtrToIntPtr"</span>)>]</pre>
<pre> [<Unverifiable>]</pre>
<pre class="alt"> <span class="kwrd">let inline</span> addNativeInt (x: nativeptr<'T>) (n: nativeint) : nativeptr<'T> = </pre>
<pre> (NativePtr.toNativeInt x) + n * (# <span class="str">"sizeof !0"</span> type('T) : nativeint #) |> NativePtr.ofNativeInt</pre>
<pre class="alt"> </pre>
<pre>    <span class="rem">// "reinterpret_cast&lt;IntPtr&gt;(x)"... EVIL!</span></pre>
<pre class="alt"> [<CompiledName(<span class="str">"Int64ToIntPtr"</span>)>]</pre>
<pre> [<Unverifiable>]</pre>
<pre class="alt"> <span class="kwrd">let inline</span> int64ToNativeint (x: int64) = (# <span class="str">""</span> x : nativeint #)</pre>
<pre> </pre>
<pre class="alt"> [<CompiledName(<span class="str">"AddInt64ToIntPtr"</span>)>]</pre>
<pre> [<Unverifiable>]</pre>
<pre class="alt"> <span class="kwrd">let inline</span> addInt64 (x: nativeptr<'a>) (o: int64) : nativeptr<'a> = addNativeInt x (int64ToNativeint o)</pre>
<pre> </pre>
<pre class="alt">[<Sealed>]</pre>
<pre><span class="kwrd">type</span> NativeArray64<'T <span class="kwrd">when</span> 'T: unmanaged>(length: int64) =</pre>
<pre class="alt"> <span class="kwrd">let</span> itemSize: int64 = (int64)(InteropServices.Marshal.SizeOf(<span class="kwrd">typeof</span><'T>))</pre>
<pre> <span class="kwrd">let mutable</span> isDisposed = <span class="kwrd">false</span></pre>
<pre class="alt"> <span class="kwrd">let</span> allocatedBytes = length * itemSize</pre>
<pre> <span class="kwrd">let</span> blob = InteropServices.Marshal.AllocHGlobal(nativeint allocatedBytes)</pre>
<pre class="alt"> <span class="kwrd">let</span> pBlobBase: nativeptr<'T> = NativePtr.ofNativeInt blob</pre>
<pre> <span class="kwrd">let</span> disposeLock = <span class="kwrd">new</span> Object()</pre>
<pre class="alt"> </pre>
<pre> <span class="kwrd">member</span> this.Length = length</pre>
<pre class="alt"> <span class="kwrd">member</span> this.BaseAddress = pBlobBase</pre>
<pre> <span class="kwrd">member</span> this.ItemSize = itemSize</pre>
<pre class="alt"> <span class="kwrd">member</span> this.IsDisposed = isDisposed</pre>
<pre> <span class="kwrd">member</span> this.AllocatedBytes = allocatedBytes</pre>
<pre class="alt"> </pre>
<pre> <span class="kwrd">member</span> <span class="kwrd">private</span> this.Free () =</pre>
<pre class="alt"> lock disposeLock (<span class="kwrd">fun</span> () -></pre>
<pre> <span class="kwrd">if</span> isDisposed</pre>
<pre class="alt"> <span class="kwrd">then</span> ()</pre>
<pre> <span class="kwrd">else</span> InteropServices.Marshal.FreeHGlobal blob</pre>
<pre class="alt"> isDisposed <- <span class="kwrd">true</span></pre>
<pre> )</pre>
<pre class="alt"> </pre>
<pre> <span class="kwrd">member</span> this.Item</pre>
<pre class="alt"> <span class="kwrd">with</span> <span class="kwrd">get</span> (idx: int64) =</pre>
<pre> NativePtr.read (PointerArithmetic.addInt64 pBlobBase idx) </pre>
<pre class="alt">        <span class="kwrd">and</span> <span class="kwrd">set</span> (idx: int64) (value: 'T) =</pre>
<pre> NativePtr.write (PointerArithmetic.addInt64 pBlobBase idx) value</pre>
<pre class="alt"> </pre>
<pre> <span class="kwrd">member</span> <span class="kwrd">private</span> this.Items = seq {</pre>
<pre class="alt"> <span class="kwrd">for</span> i <span class="kwrd">in</span> 0L .. length - 1L <span class="kwrd">do</span></pre>
<pre> <span class="kwrd">yield</span> this.[i]</pre>
<pre class="alt"> }</pre>
<pre> </pre>
<pre class="alt"> override this.Finalize () = this.Free()</pre>
<pre> </pre>
<pre class="alt"> <span class="kwrd">interface</span> IDisposable <span class="kwrd">with</span></pre>
<pre> member this.Dispose () =</pre>
<pre class="alt"> GC.SuppressFinalize this</pre>
<pre> this.Free()</pre>
<pre class="alt"> </pre>
<pre> <span class="kwrd">interface</span> Collections.Generic.IEnumerable<'T> with</pre>
<pre class="alt"> <span class="kwrd">member</span> this.GetEnumerator () : Collections.Generic.IEnumerator<'T> =</pre>
<pre> this.Items.GetEnumerator()</pre>
<pre class="alt"> <span class="kwrd">member</span> this.GetEnumerator () : Collections.IEnumerator =</pre>
<pre> this.Items.GetEnumerator() :> Collections.IEnumerator</pre>
</div>
</div>
<br />
<b>UPDATE 2010-04-25:</b> Removed a few bugs.<br />
<br />
You can use this data structure in your C# code like a normal array:<br />
<div id="codeSnippetWrapper">
<div id="codeSnippet" style="background-color: #f4f4f4; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;">
<pre style="background-color: white; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;">var length = 8L * 1024L * 1024L * 1024L;</pre>
<pre style="background-color: #f4f4f4; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"><span style="color: green;">// allocate a byte-array of 8 GiB</span></pre>
<!--CRLF-->
<br />
<pre style="background-color: white; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"><span style="color: blue;">using</span> (<span style="color: blue;">var</span> arr = <span style="color: blue;">new</span> NativeTools.NativeArray64<<span style="color: blue;">byte</span>>(length))</pre>
<!--CRLF-->
<br />
<pre style="background-color: #f4f4f4; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;">{</pre>
<!--CRLF-->
<br />
<pre style="background-color: white; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"> arr[0] = 123;</pre>
<!--CRLF-->
<br />
<pre style="background-color: #f4f4f4; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"> arr[length-1] = 222;</pre>
<!--CRLF-->
<br />
<pre style="background-color: white; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"> Console.WriteLine(<span style="color: #006080;">"Allocated "</span> + arr.AllocatedBytes);</pre>
<!--CRLF-->
<br />
<pre style="background-color: #f4f4f4; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;">}</pre>
<!--CRLF-->
<br />
<pre style="background-color: white; border-bottom-style: none; border-left-style: none; border-right-style: none; border-top-style: none; color: black; direction: ltr; font-family: consolas, 'Courier New', courier, monospace; font-size: 10pt; line-height: 12pt; margin: 0em; overflow: visible; padding-bottom: 0px; padding-left: 0px; padding-right: 0px; padding-top: 0px; text-align: left; width: 100%;"><span style="color: green;">// auto-disposed ...</span></pre>
<!--CRLF--></div>
</div>
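The wrapper behind this can be sketched roughly as follows (illustrative code, not the actual NativeTools.NativeArray64<T> implementation, and the name NativeArray64Sketch is made up): unmanaged memory is allocated via Marshal.AllocHGlobal, indexed through a 64-bit indexer, and freed deterministically via IDisposable.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative sketch of a 64-bit-indexable byte array backed by
// unmanaged memory (not the actual NativeTools.NativeArray64<T> code).
// Unmanaged allocation sidesteps the CLR's managed object size limit.
public sealed class NativeArray64Sketch : IDisposable
{
    private IntPtr buffer;
    public long AllocatedBytes { get; private set; }

    public NativeArray64Sketch(long length)
    {
        buffer = Marshal.AllocHGlobal(new IntPtr(length));
        AllocatedBytes = length;
    }

    // The indexer takes a long, so indices beyond int.MaxValue work.
    public byte this[long i]
    {
        get { return Marshal.ReadByte(new IntPtr(buffer.ToInt64() + i)); }
        set { Marshal.WriteByte(new IntPtr(buffer.ToInt64() + i), value); }
    }

    public void Dispose()
    {
        if (buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(buffer);
            buffer = IntPtr.Zero;
        }
    }
}
```

The real class is generic over the element type; a sketch for `byte` keeps the pointer arithmetic trivial (element size 1).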
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-50703861941803246382009-04-11T19:10:00.013+02:002009-04-12T02:24:51.439+02:00Generic .NET Math — First BenchmarksAs promised, I did some first benchmarking of <a href="http://frankniemeyer.blogspot.com/2009/04/generic-arithmetic-for-net.html">different methods to implement generic arithmetic for .NET.</a>
I implemented a simple vector class using the Operator<T> class found in the <a href="http://www.yoda.arachsys.com/csharp/miscutil/">MiscUtil</a> library, another generic vector class using the approach described in my last posting (INumeric/calculators), and compared both to a completely non-generic vector implementation.
The benchmarking method solved the following recursive problem for i = 100,000,000 on a Core 2 Quad Q9550:<blockquote>
sum<sub>i</sub> := sum<sub>i-1</sub> + &lang;u<sub>i</sub>, v&rang; + &lang;u<sub>i</sub>, t&rang;
<br/>
u<sub>i</sub> := u<sub>i-1</sub> * sum<sub>i</sub>
</blockquote>
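Written out in non-generic form, the recurrence corresponds to something like the following (a sketch with illustrative names; the actual benchmark code is not shown in the post, and v and t are assumed to be fixed vectors):

```csharp
using System;

// Sketch of the benchmark recurrence (illustrative, not the original code):
// sum_i = sum_{i-1} + <u_i, v> + <u_i, t>;  u_i = u_{i-1} * sum_i
static class BenchKernel
{
    static double Dot(double[] a, double[] b)
    {
        double s = 0.0;
        for (int k = 0; k < a.Length; ++k) s += a[k] * b[k];
        return s;
    }

    // Runs the recurrence for 'iterations' steps, mutating u in place,
    // and returns the final sum.
    public static double Run(double[] u, double[] v, double[] t, int iterations)
    {
        double sum = 0.0;
        for (int i = 0; i < iterations; ++i)
        {
            sum += Dot(u, v) + Dot(u, t);
            for (int k = 0; k < u.Length; ++k) u[k] *= sum; // scale u by sum_i
        }
        return sum;
    }
}
```

The generic variants replace the `+` and `*` inside `Dot` and the scaling loop with Operator<T> calls or INumeric/calculator calls, which is exactly where the per-element overhead shows up.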
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQx_8H1whiIQOtUO54I3fgzhyphenhyphenu7I76lOnKl1aaCFg5j5Z07IKK9CE9LplDVKzk0ks3GmRX_AhfzHJwuzRhjmRSr23oyrNJlVnSlVak051iUfSw_BwJ-9PAb7LOQBBiX8EO7OL2S32uLpY/s1600-h/perf.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 369px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQx_8H1whiIQOtUO54I3fgzhyphenhyphenu7I76lOnKl1aaCFg5j5Z07IKK9CE9LplDVKzk0ks3GmRX_AhfzHJwuzRhjmRSr23oyrNJlVnSlVak051iUfSw_BwJ-9PAb7LOQBBiX8EO7OL2S32uLpY/s400/perf.png" alt="" id="BLOGGER_PHOTO_ID_5323482703021645266" border="0" /></a>
As the figure above shows, the non-generic version is far superior to the other implementations, especially in Long Mode (x64). The x86 version of the interface constraint approach (INumeric/Calculator) is only a tiny bit faster than MiscUtil’s Operator<T> class using runtime code generation. For x64 however, it’s about twice as fast.
<br/><br/>
<span style="font-weight: bold;">References:</span>
<ol>
<li><a href="http://www.codeproject.com/KB/cs/genericnumerics.aspx">Rüdiger Klaehn: Using generics for calculations</a></li>
<li><a href="http://www.codeproject.com/KB/cs/genericoperators.aspx">Keith Farmer: Operator Overloading with Generics</a></li>
<li><a href="http://www.dogspots.com/DSBE/post/2009/02/17/GenericArithmetic.aspx">Bill Fugina: Arithmetic in Generic Classes</a></li>
<li><a href="http://rogeralsing.com/2008/02/27/linq-expressions-calculating-with-generics/">Roger Alsing: Linq Expressions - Calculating with generics</a></li>
</ol>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-2523394567180576305.post-28599263165867798632009-04-10T23:43:00.059+02:002009-04-11T14:42:43.982+02:00Generic Arithmetic for .NETTo me, one of the major design flaws and sources of frustration in .NET’s generics system is the absence of support for some sort of common arithmetic interface for numeric types (float, double, int, etc.).
<br /><br />
But why should anyone care about such things? Well, the problem becomes obvious
as soon as you try to implement—say—a generic vector class:
<br /><br />
<span style="color: rgb(102, 102, 102);font-size:85%;" >(Note: I'll only show the relevant parts of the code here to avoid clutter and confusion.)</span>
<!-- code formatted by http://manoli.net/csharpformat/ -->
<br /><br />
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span><span class="rem">// a vector in R^n</span></pre><pre><span class="lnum"> 2: </span><span class="kwrd">class</span> Vector<T>
</pre><pre class="alt"><span class="lnum"> 3: </span>{
</pre><pre><span class="lnum"> 4: </span> T[] components;
</pre><pre class="alt"><span class="lnum"> 5: </span>
</pre><pre><span class="lnum"> 6: </span> <span class="rem">// ctors, properties etc. ...</span>
</pre><pre class="alt"><span class="lnum"> 7: </span>
</pre><pre><span class="lnum"> 8: </span> <span class="rem">// operator definition, here: vector addition</span>
</pre><pre class="alt"><span class="lnum"> 9: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> Vector<T> <span class="kwrd">operator</span> +(Vector<T> lhs,
</pre><pre><span class="lnum"> 10: </span> Vector<T> rhs)
</pre><pre class="alt"><span class="lnum"> 11: </span> {
</pre><pre><span class="lnum"> 12: </span> Vector<T> res = <span class="kwrd">new</span> Vector<T>(lhs.Length);
</pre><pre class="alt"><span class="lnum"> 13: </span>
</pre><pre><span class="lnum"> 14: </span> <span class="kwrd">for</span> (<span class="kwrd">int</span> i = 0; i < res.Length; ++i)
</pre><pre class="alt"><span class="lnum"> 15: </span> {
</pre><pre><span class="lnum"> 16: </span> <span class="rem">// compiler error!</span>
</pre><pre class="alt"><span class="lnum"> 17: </span> res[i] = lhs[i] + rhs[i];
</pre><pre><span class="lnum"> 18: </span> }
</pre><pre class="alt"><span class="lnum"> 19: </span>
</pre><pre><span class="lnum"> 20: </span> <span class="kwrd">return</span> res;
</pre><pre class="alt"><span class="lnum"> 21: </span> }
</pre><pre><span class="lnum"> 22: </span>}</pre>
</div>
<br />
The reason for the compiler error is that there’s simply not enough information available to the compiler to verify that T supports op_Addition. Now, if the CLI designers had included some interface definition like
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span><span class="kwrd">interface</span> INumeric<T></pre>
<pre><span class="lnum"> 2: </span>{</pre>
<pre class="alt"><span class="lnum"> 3: </span> T Add(T rhs);</pre>
<pre><span class="lnum"> 4: </span> T Sub(T rhs);</pre>
<pre class="alt"><span class="lnum"> 5: </span> T Mul(T rhs);</pre>
<pre><span class="lnum"> 6: </span> T Div(T rhs);</pre>
<pre class="alt"><span class="lnum"> 7: </span> <span class="rem">// etc.</span></pre>
<pre><span class="lnum"> 8: </span>}</pre></div>
<br />
... (or similar) and if they had implemented this interface for every numeric primitive type, we could simply use an interface constraint to solve our problem:
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span><span class="rem">// a vector in R^n</span></pre>
<pre><span class="lnum"> 2: </span><span class="kwrd">class</span> Vector<T> <span class="kwrd">where</span> T: INumeric<T></pre>
<pre class="alt"><span class="lnum"> 3: </span>{</pre>
<pre><span class="lnum"> 4: </span> T[] components;</pre>
<pre class="alt"><span class="lnum"> 5: </span> </pre>
<pre><span class="lnum"> 6: </span> <span class="rem">// ctors, properties ...</span></pre>
<pre class="alt"><span class="lnum"> 7: </span> </pre>
<pre><span class="lnum"> 8: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> Vector<T> <span class="kwrd">operator</span> +(Vector<T> lhs, Vector<T> rhs)</pre>
<pre class="alt"><span class="lnum"> 9: </span> {</pre>
<pre><span class="lnum"> 10: </span> Vector<T> res = <span class="kwrd">new</span> Vector<T>(lhs.Length);</pre>
<pre class="alt"><span class="lnum"> 11: </span> </pre>
<pre><span class="lnum"> 12: </span> <span class="kwrd">for</span> (<span class="kwrd">int</span> i = 0; i < res.Length; ++i)</pre>
<pre class="alt"><span class="lnum"> 13: </span> {</pre>
<pre><span class="lnum"> 14: </span> res[i] = lhs[i].Add(rhs[i]); <span class="rem">// OK, T implements INumeric<T></span></pre>
<pre class="alt"><span class="lnum"> 15: </span> }</pre>
<pre><span class="lnum"> 16: </span> </pre>
<pre class="alt"><span class="lnum"> 17: </span> <span class="kwrd">return</span> res;</pre>
<pre><span class="lnum"> 18: </span> }</pre>
<pre class="alt"><span class="lnum"> 19: </span>}</pre>
</div>
<br />
(Note: Operators are always defined as static methods in .NET; since static methods can never be virtual, they can be neither abstract nor part of an interface definition, which is why we have to resort to ordinary method calls here.)
<br /><br />
Unfortunately, such an interface does not exist. Several people have already tried to circumvent this limitation. <a href="http://www.codeproject.com/KB/cs/genericnumerics.aspx">One</a> of the <a href="http://weblog.ikvm.net/permalink.aspx?guid=94EED5A8-6C9C-4D1D-A001-321344A2D2F6">first</a> approaches uses special numerical interfaces that implement arithmetic operations for concrete types. This technique works reasonably well, but it has the disadvantage of forcing the user to explicitly provide an implementation of such an interface (or at least the name of a type implementing it).
<br /><br />
More recently, a <a href="http://rogeralsing.com/2008/02/27/linq-expressions-calculating-with-generics/">new solution</a> came up that uses runtime code generation, building and compiling expression trees on the fly (or using LCG for older .NET versions). This is indeed a very elegant solution to the problem. Using an implementation of this concept, like the one found in the <a href="http://www.yoda.arachsys.com/csharp/miscutil/">MiscUtil</a> library, you can simply write something like
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span>T sum = Operator<T>.Add(lhs, rhs);</pre>
</div>
<br />
... inside your generic math algorithms.
<br /><br />
The appropriate "Add" method for type T is generated on-the-fly upon the first call and cached for later reuse. The drawback of this method is the need for delegate invocations, which tend to be a lot more expensive than interface calls (at least my first benchmarks indicate that, especially on x64; curiously, the effect is much less severe on x86).
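The core of the technique can be sketched in a few lines (a hypothetical minimal Op<T>, not MiscUtil’s actual Operator<T> implementation): build the lambda "(a, b) => a + b" for T once, compile it to a delegate, and cache it in a static field of the closed generic type.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical minimal version of the expression-tree trick
// (not MiscUtil's actual Operator<T> code).
public static class Op<T>
{
    // One compiled delegate per closed generic type, built lazily by the
    // CLR's type initializer and reused for every subsequent call.
    public static readonly Func<T, T, T> Add = BuildAdd();

    private static Func<T, T, T> BuildAdd()
    {
        ParameterExpression a = Expression.Parameter(typeof(T), "a");
        ParameterExpression b = Expression.Parameter(typeof(T), "b");
        // Expression.Add resolves T's '+' operator; this throws at
        // type-initialization time if T defines no such operator.
        return Expression.Lambda<Func<T, T, T>>(Expression.Add(a, b), a, b).Compile();
    }
}
```

A call like Op<int>.Add(2, 3) then dispatches through the cached delegate rather than through reflection, which is what makes the approach fast enough to be usable, delegate-invocation overhead aside.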
<br /><br />
Interestingly, the latest <a href="http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/">F# CTP</a> also provides generic vector and matrix classes (namespace Microsoft.FSharp.Math in assembly FSharp.PowerPack.dll). You can instantiate e.g. generic vectors like this:
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span>Vector<<span class="kwrd">float</span>> v = VectorModule.Generic.create<<span class="kwrd">float</span>>(3, 0f);</pre>
</div>
<br />
This returns a 3-element vector of floats initialized to 0.0f.
So how did they do that? As far as I understand, their approach is very similar to the first method I described (using interfaces to define "calculator" types). To avoid having the user explicitly specify the operators’ implementation they use some kind of “type registry”, mapping each known numeric type to an appropriate implementation of an interface that defines basic arithmetic operations.
<br /><br />
I’ve tried to do a quick n’ dirty implementation of this idea in C# which resulted in something like this (I haven’t yet checked if this actually runs...):
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span><span class="kwrd">namespace</span> GenericMath</pre>
<pre><span class="lnum"> 2: </span>{</pre>
<pre class="alt"><span class="lnum"> 3: </span> <span class="kwrd">public</span> <span class="kwrd">interface</span> INumeric<T></pre>
<pre><span class="lnum"> 4: </span> {</pre>
<pre class="alt"><span class="lnum"> 5: </span> T Add(T lhs, T rhs);</pre>
<pre><span class="lnum"> 6: </span> T Sub(T lhs, T rhs);</pre>
<pre class="alt"><span class="lnum"> 7: </span> T Mul(T lhs, T rhs);</pre>
<pre><span class="lnum"> 8: </span> T Div(T lhs, T rhs);</pre>
<pre class="alt"><span class="lnum"> 9: </span> T Abs(T x);</pre>
<pre><span class="lnum"> 10: </span> T Zero { get; }</pre>
<pre class="alt"><span class="lnum"> 11: </span> T One { get; }</pre>
<pre><span class="lnum"> 12: </span> }</pre>
<pre class="alt"><span class="lnum"> 13: </span> </pre>
<pre><span class="lnum"> 14: </span> <span class="kwrd">class</span> FloatNumerics: INumeric<<span class="kwrd">float</span>></pre>
<pre class="alt"><span class="lnum"> 15: </span> {</pre>
<pre><span class="lnum"> 16: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Add(<span class="kwrd">float</span> lhs, <span class="kwrd">float</span> rhs)</pre>
<pre class="alt"><span class="lnum"> 17: </span> {</pre>
<pre><span class="lnum"> 18: </span> <span class="kwrd">return</span> lhs + rhs;</pre>
<pre class="alt"><span class="lnum"> 19: </span> }</pre>
<pre><span class="lnum"> 20: </span> </pre>
<pre class="alt"><span class="lnum"> 21: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Sub(<span class="kwrd">float</span> lhs, <span class="kwrd">float</span> rhs)</pre>
<pre><span class="lnum"> 22: </span> {</pre>
<pre class="alt"><span class="lnum"> 23: </span> <span class="kwrd">return</span> lhs - rhs;</pre>
<pre><span class="lnum"> 24: </span> }</pre>
<pre class="alt"><span class="lnum"> 25: </span> </pre>
<pre><span class="lnum"> 26: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Mul(<span class="kwrd">float</span> lhs, <span class="kwrd">float</span> rhs)</pre>
<pre class="alt"><span class="lnum"> 27: </span> {</pre>
<pre><span class="lnum"> 28: </span> <span class="kwrd">return</span> lhs * rhs;</pre>
<pre class="alt"><span class="lnum"> 29: </span> }</pre>
<pre><span class="lnum"> 30: </span> </pre>
<pre class="alt"><span class="lnum"> 31: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Div(<span class="kwrd">float</span> lhs, <span class="kwrd">float</span> rhs)</pre>
<pre><span class="lnum"> 32: </span> {</pre>
<pre class="alt"><span class="lnum"> 33: </span> <span class="kwrd">return</span> lhs / rhs;</pre>
<pre><span class="lnum"> 34: </span> }</pre>
<pre class="alt"><span class="lnum"> 35: </span> </pre>
<pre><span class="lnum"> 36: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Abs(<span class="kwrd">float</span> x)</pre>
<pre class="alt"><span class="lnum"> 37: </span> {</pre>
<pre><span class="lnum"> 38: </span> <span class="kwrd">return</span> Math.Abs(x);</pre>
<pre class="alt"><span class="lnum"> 39: </span> }</pre>
<pre><span class="lnum"> 40: </span> </pre>
<pre class="alt"><span class="lnum"> 41: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> Zero</pre>
<pre><span class="lnum"> 42: </span> {</pre>
<pre class="alt"><span class="lnum"> 43: </span> get { <span class="kwrd">return</span> 0f; }</pre>
<pre><span class="lnum"> 44: </span> }</pre>
<pre class="alt"><span class="lnum"> 45: </span> </pre>
<pre><span class="lnum"> 46: </span> <span class="kwrd">public</span> <span class="kwrd">float</span> One</pre>
<pre class="alt"><span class="lnum"> 47: </span> {</pre>
<pre><span class="lnum"> 48: </span> get { <span class="kwrd">return</span> 1f; }</pre>
<pre class="alt"><span class="lnum"> 49: </span> }</pre>
<pre><span class="lnum"> 50: </span> }</pre>
<pre class="alt"><span class="lnum"> 51: </span> </pre>
<pre><span class="lnum"> 52: </span> <span class="rem">// implementations for other types ...</span></pre>
<pre class="alt"><span class="lnum"> 53: </span> </pre>
<pre><span class="lnum"> 54: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> <span class="kwrd">class</span> ArithmeticAssociations</pre>
<pre class="alt"><span class="lnum"> 55: </span> {</pre>
<pre><span class="lnum"> 56: </span> <span class="kwrd">static</span> <span class="kwrd">readonly</span> IDictionary<Type, <span class="kwrd">object</span>> operationsDict = <span class="kwrd">new</span> Dictionary<Type, <span class="kwrd">object</span>></pre>
<pre class="alt"><span class="lnum"> 57: </span> {</pre>
<pre><span class="lnum"> 58: </span> { <span class="kwrd">typeof</span>(<span class="kwrd">float</span>), <span class="kwrd">new</span> FloatNumerics() },</pre>
<pre class="alt"><span class="lnum"> 59: </span> { <span class="kwrd">typeof</span>(<span class="kwrd">double</span>), <span class="kwrd">new</span> DoubleNumerics() },</pre>
<pre><span class="lnum"> 60: </span> { <span class="kwrd">typeof</span>(<span class="kwrd">int</span>), <span class="kwrd">new</span> Int32Numerics() }</pre>
<pre class="alt"><span class="lnum"> 61: </span> };</pre>
<pre><span class="lnum"> 62: </span> </pre>
<pre class="alt"><span class="lnum"> 63: </span> <span class="kwrd">static</span> <span class="kwrd">string</span> numericIfaceName = <span class="kwrd">typeof</span>(INumeric<>).FullName;</pre>
<pre><span class="lnum"> 64: </span> <span class="kwrd">static</span> <span class="kwrd">string</span> numericIfaceNameShort = <span class="kwrd">typeof</span>(INumeric<>).Name;</pre>
<pre class="alt"><span class="lnum"> 65: </span> </pre>
<pre><span class="lnum"> 66: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> <span class="kwrd">bool</span> UnregisterType<T>()</pre>
<pre class="alt"><span class="lnum"> 67: </span> {</pre>
<pre><span class="lnum"> 68: </span> <span class="kwrd">return</span> operationsDict.Remove(<span class="kwrd">typeof</span>(T));</pre>
<pre class="alt"><span class="lnum"> 69: </span> }</pre>
<pre><span class="lnum"> 70: </span> </pre>
<pre class="alt"><span class="lnum"> 71: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> <span class="kwrd">void</span> RegisterType<T>(INumeric<T> arithmetics)</pre>
<pre><span class="lnum"> 72: </span> {</pre>
<pre class="alt"><span class="lnum"> 73: </span> operationsDict[<span class="kwrd">typeof</span>(T)] = arithmetics;</pre>
<pre><span class="lnum"> 74: </span> }</pre>
<pre class="alt"><span class="lnum"> 75: </span> </pre>
<pre><span class="lnum"> 76: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> INumeric<T> TryGetNumericOps<T>()</pre>
<pre class="alt"><span class="lnum"> 77: </span> {</pre>
<pre><span class="lnum"> 78: </span> Type t = <span class="kwrd">typeof</span>(T);</pre>
<pre class="alt"><span class="lnum"> 79: </span> </pre>
<pre><span class="lnum"> 80: </span> <span class="kwrd">if</span> (!operationsDict.ContainsKey(t))</pre>
<pre class="alt"><span class="lnum"> 81: </span> {</pre>
<pre><span class="lnum"> 82: </span> <span class="kwrd">throw</span> <span class="kwrd">new</span> ArgumentException(<span class="str">"No implementation of "</span> </pre>
<pre class="alt"><span class="lnum"> 83: </span> + numericIfaceNameShort + <span class="str">"<"</span> + t.Name + <span class="str">"> registered."</span>);</pre>
<pre><span class="lnum"> 84: </span> }</pre>
<pre class="alt"><span class="lnum"> 85: </span> </pre>
<pre><span class="lnum"> 86: </span> Type opsType = operationsDict[t].GetType();</pre>
<pre class="alt"><span class="lnum"> 87: </span> </pre>
<pre><span class="lnum"> 88: </span> <span class="kwrd">if</span> (opsType.GetInterface(numericIfaceName) == <span class="kwrd">null</span>)</pre>
<pre class="alt"><span class="lnum"> 89: </span> {</pre>
<pre><span class="lnum"> 90: </span> <span class="kwrd">throw</span> <span class="kwrd">new</span> ArgumentException(<span class="str">"Arithmetic object associated with "</span> </pre>
<pre class="alt"><span class="lnum"> 91: </span> + t.Name + <span class="str">" does not implement "</span> + numericIfaceNameShort);</pre>
<pre><span class="lnum"> 92: </span> }</pre>
<pre class="alt"><span class="lnum"> 93: </span> </pre>
<pre><span class="lnum"> 94: </span> <span class="kwrd">return</span> (INumeric<T>)operationsDict[t];</pre>
<pre class="alt"><span class="lnum"> 95: </span> }</pre>
<pre><span class="lnum"> 96: </span> }</pre>
<pre class="alt"><span class="lnum"> 97: </span> </pre>
<pre><span class="lnum"> 98: </span> <span class="rem">// a sample class that uses INumeric + ArithmeticAssociations</span></pre>
<pre class="alt"><span class="lnum"> 99: </span> <span class="kwrd">public</span> <span class="kwrd">sealed</span> <span class="kwrd">class</span> Vector<T></pre>
<pre><span class="lnum"> 100: </span> {</pre>
<pre class="alt"><span class="lnum"> 101: </span> <span class="kwrd">static</span> INumeric<T> ops = ArithmeticAssociations.TryGetNumericOps<T>();</pre>
<pre><span class="lnum"> 102: </span> </pre>
<pre class="alt"><span class="lnum"> 103: </span> T[] components;</pre>
<pre><span class="lnum"> 104: </span> </pre>
<pre class="alt"><span class="lnum"> 105: </span> <span class="kwrd">public</span> T <span class="kwrd">this</span>[<span class="kwrd">int</span> i]</pre>
<pre><span class="lnum"> 106: </span> {</pre>
<pre class="alt"><span class="lnum"> 107: </span> get { <span class="kwrd">return</span> components[i]; }</pre>
<pre><span class="lnum"> 108: </span> set { components[i] = <span class="kwrd">value</span>; }</pre>
<pre class="alt"><span class="lnum"> 109: </span> }</pre>
<pre><span class="lnum"> 110: </span> </pre>
<pre class="alt"><span class="lnum"> 111: </span> <span class="kwrd">public</span> <span class="kwrd">int</span> Length { get { <span class="kwrd">return</span> components.Length; } }</pre>
<pre><span class="lnum"> 112: </span> </pre>
<pre class="alt"><span class="lnum"> 113: </span> <span class="kwrd">public</span> Vector(<span class="kwrd">int</span> size)</pre>
<pre><span class="lnum"> 114: </span> {</pre>
<pre class="alt"><span class="lnum"> 115: </span> components = <span class="kwrd">new</span> T[size];</pre>
<pre><span class="lnum"> 116: </span> }</pre>
<pre class="alt"><span class="lnum"> 117: </span> </pre>
<pre><span class="lnum"> 118: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> T <span class="kwrd">operator</span> *(Vector<T> lhs, Vector<T> rhs)</pre>
<pre class="alt"><span class="lnum"> 119: </span> {</pre>
<pre><span class="lnum"> 120: </span> T res = ops.Zero;</pre>
<pre class="alt"><span class="lnum"> 121: </span> </pre>
<pre><span class="lnum"> 122: </span> <span class="kwrd">for</span> (<span class="kwrd">int</span> i = 0; i < lhs.Length; ++i)</pre>
<pre class="alt"><span class="lnum"> 123: </span> {</pre>
<pre><span class="lnum"> 124: </span> res = ops.Add(res, ops.Mul(lhs[i], rhs[i]));</pre>
<pre class="alt"><span class="lnum"> 125: </span> }</pre>
<pre><span class="lnum"> 126: </span> </pre>
<pre class="alt"><span class="lnum"> 127: </span> <span class="kwrd">return</span> res;</pre>
<pre><span class="lnum"> 128: </span> }</pre>
<pre class="alt"><span class="lnum"> 129: </span> </pre>
<pre><span class="lnum"> 130: </span> <span class="kwrd">public</span> <span class="kwrd">static</span> Vector<T> <span class="kwrd">operator</span> *(T lambda, Vector<T> vec)</pre>
<pre class="alt"><span class="lnum"> 131: </span> {</pre>
<pre><span class="lnum"> 132: </span> Vector<T> result = <span class="kwrd">new</span> Vector<T>(vec.Length);</pre>
<pre class="alt"><span class="lnum"> 133: </span> </pre>
<pre><span class="lnum"> 134: </span> <span class="kwrd">for</span> (<span class="kwrd">int</span> i = 0; i < result.Length; ++i)</pre>
<pre class="alt"><span class="lnum"> 135: </span> {</pre>
<pre><span class="lnum"> 136: </span> result[i] = ops.Mul(vec[i], lambda);</pre>
<pre class="alt"><span class="lnum"> 137: </span> }</pre>
<pre><span class="lnum"> 138: </span> </pre>
<pre class="alt"><span class="lnum"> 139: </span> <span class="kwrd">return</span> result;</pre>
<pre><span class="lnum"> 140: </span> }</pre>
<pre class="alt"><span class="lnum"> 141: </span> </pre>
<pre><span class="lnum"> 142: </span> <span class="rem">// other properties, methods and operators ...</span></pre>
<pre class="alt"><span class="lnum"> 143: </span> }</pre>
<pre><span class="lnum"> 144: </span> </pre>
<pre class="alt"><span class="lnum"> 145: </span> <span class="rem">// ...</span></pre>
<pre><span class="lnum"> 146: </span>}</pre>
</div>
<br />
Objects of type Vector<T> can now be created with this simple line:
<br /><br />
<!-- code formatted by http://manoli.net/csharpformat/ -->
<div class="csharpcode">
<pre class="alt"><span class="lnum"> 1: </span>Vector<<span class="kwrd">float</span>> v = <span class="kwrd">new</span> Vector<<span class="kwrd">float</span>>(3);</pre>
</div>
<br />
As you can see, this is totally transparent to the user (no need to specify a "Calculator" type parameter), and it doesn’t have the potential performance problems of the delegate/Func method.
<br /><br />
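To see the pattern in isolation, here is a condensed, self-contained variant of the listing above (INum, FloatNum, Registry and Vec are shortened illustrative names; only Add, Mul and Zero are kept):

```csharp
using System;
using System.Collections.Generic;

// Condensed version of the registry idea: a per-type "calculator" object
// is looked up once per closed generic type and cached in a static field.
interface INum<T> { T Add(T a, T b); T Mul(T a, T b); T Zero { get; } }

sealed class FloatNum : INum<float>
{
    public float Add(float a, float b) { return a + b; }
    public float Mul(float a, float b) { return a * b; }
    public float Zero { get { return 0f; } }
}

static class Registry
{
    static readonly Dictionary<Type, object> map =
        new Dictionary<Type, object> { { typeof(float), new FloatNum() } };

    // Throws KeyNotFoundException for unregistered types.
    public static INum<T> Get<T>() { return (INum<T>)map[typeof(T)]; }
}

sealed class Vec<T>
{
    static readonly INum<T> ops = Registry.Get<T>(); // resolved once per T
    readonly T[] c;
    public Vec(params T[] items) { c = items; }

    // Dot product expressed entirely through the registered calculator.
    public static T Dot(Vec<T> x, Vec<T> y)
    {
        T s = ops.Zero;
        for (int i = 0; i < x.c.Length; ++i)
            s = ops.Add(s, ops.Mul(x.c[i], y.c[i]));
        return s;
    }
}
```

Note that the lookup cost is paid only once per instantiated Vec<T>; the per-element work goes through plain interface calls, which is what the benchmarks in the follow-up posting measure.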
Maybe someone finds this useful and can even improve the idea. I also plan to do a performance comparison of the different methods soon.<br/><br/>
<span style="font-weight:bold;">Update 2009-04-11:</span>
<br />I found another <a href="http://www.dogspots.com/DSBE/post/2009/02/17/GenericArithmetic.aspx">nice blog entry</a> that covers the same topic and also presents a solution similar to the one showed above.Unknownnoreply@blogger.com0