18 April 2017

General-purpose computing on GPU in .NET world – Part 2


CUDAfy does a seemingly very simple thing: during compilation it converts selected C# code to OpenCL C or CUDA C. In simplified terms, your C# code is translated into CUDA C or OpenCL C, compiled, and executed on the device. In reality it's much more complicated than that, but the basic examples look very friendly.

PS Part 1 is available here.

Which devices are supported by CUDAfy?

  • NVIDIA graphics cards with CUDA support
  • Devices that support the OpenCL standard – graphics cards (e.g. Radeon), APUs, etc.
  • The built-in emulator, which is a huge advantage
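From code, we can check what CUDAfy actually sees on a given machine. A minimal sketch (choosing OpenCL as the target type here is just an example):

// Requires: using System; using Cudafy; using Cudafy.Host;
// Print every OpenCL-capable device CUDAfy can detect.
foreach (GPGPUProperties prop in CudafyHost.GetDeviceProperties(eGPUType.OpenCL))
{
    Console.WriteLine("{0}: {1}", prop.DeviceId, prop.Name);
}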

The CUDAfy creators boast support for many platforms, such as Windows and Linux, and for practically all modern graphics cards and processors. Theoretically, it would even be possible to use CUDAfy on a tablet, because the Google Nexus 10 has a Tegra system that supports OpenCL.

But the reality isn't quite so rosy. CUDAfy provides much broader support for CUDA than for OpenCL (this is clearly visible in the samples attached to the package). This results from a few factors: the differences between CUDA and OpenCL, and the different versions of each standard. Because of that, we should develop projects with a particular architecture or device in mind.

You can read about differences between CUDA and OpenCL here.
You can download CUDAfy from here.
The step-by-step installation process is explained in the manual.

CUDAfy package

It contains a few elements:

  1. Cudafy .NET Library
    • Cudafy Translator (Convert .NET code to CUDA C)
    • Cudafy Library (CUDA support for .NET)
    • Cudafy Host (Host device wrapper)
    • Cudafy Math (FFT, BLAS, RAND, SPARSE)
  2. Cudafy Module Viewer
  3. Cudafy Command Line Tool

The most important elements are:

Cudafy Translator – after compilation, it converts .NET code to CUDA C or OpenCL C code. It used to work on top of .NET Reflector, but now it uses ILSpy,

CudafyHost – it's an interface that enables communication between the program and the GPGPU device,

Cudafy by Example – a set of examples based on the "CUDA By Example" book. They were created mostly for CUDA, but most of them also work on OpenCL,

CudafyExamples – a more advanced set of examples, most of which work only on CUDA.

Why exactly do we need CUDAfy?

For parallel computing! I'll try to explain this using a trivial example of adding arrays. Imagine two one-dimensional arrays, each one hundred elements long. We want to sum their values.
The simplest solution is to write a loop and add the values one by one, as in the sketch below.

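In plain C#, that might look like this (a minimal sketch; filling the arrays is left out):

int[] a = new int[100];
int[] b = new int[100];
int[] c = new int[100];
// ... fill a and b with data ...

// Add the arrays element by element on the CPU.
for (int i = 0; i < c.Length; i++)
{
    c[i] = a[i] + b[i];
}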
It can be assumed that the primary operation is to add two values from the arrays and then save the result to a third array. It's easy to count that there are one hundred such operations. Of course, we can use the processor to parallelise them, but the number of threads limits us. Using GPGPU, we can perform all one hundred operations at the same time.

Sounds great! But first we need to transfer the data to the device, and afterwards receive the result back from it.
We have to transfer the arrays from RAM to the device, execute the kernel on the supplied data set, and then copy the result back to RAM. So we can distinguish three stages: copying the input data to the device, executing the kernel, and copying the result back.

Basic program in CUDAfy

Follow these simple steps to use CUDAfy:

  1. Add references to Cudafy.NET.dll (it’s in the package).
  2. Add using directives for all three namespaces: Cudafy, Cudafy.Host and Cudafy.Translator.
  3. Add a GThread parameter to the function that performs computations on the GPU. Thanks to it, it's possible to obtain information about the current thread and its ID, as well as about the block and grid the thread belongs to. This parameter also gives access to thread synchronisation and local shared memory.
  4. Place the [Cudafy] attribute above the computing functions.
  5. Call CudafyTranslator.Cudafy() in the host code (the part of the application that executes on the CPU). It returns an instance of a Cudafy module.
  6. Load the module (from step 5) onto the GPGPU instance on which the computations will be performed.

A simple kernel

The easiest example is the execution of a kernel by itself, without any computations. It's possible to do it using the code below:
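A minimal sketch of that listing (the class name is arbitrary, and choosing OpenCL as the target here is an assumption):

using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public class SimpleKernelProgram
{
    [Cudafy]
    public static void thekernel()
    {
    }

    public static void Main()
    {
        // Select the device type and the language to translate into (assumed values).
        CudafyModes.Target = eGPUType.OpenCL;
        CudafyTranslator.Language = eLanguage.OpenCL;

        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);

        // Translate the [Cudafy]-marked code and load it onto the device.
        CudafyModule km = CudafyTranslator.Cudafy();
        gpu.LoadModule(km);

        // Execute the empty kernel on the device.
        gpu.Launch().thekernel();
    }
}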


The first lines select the device the host will be using and the language into which the C# code will be translated.

CudafyModule km = CudafyTranslator.Cudafy();

This creates a CUDAfy module and translates the code into OpenCL C. In the described example, the only method translated is the one marked with the [Cudafy] attribute – thekernel.

gpu.Launch().thekernel();

The Launch method starts the execution of the function on the preselected GPGPU device.

[Cudafy]
public static void thekernel()
{
}

Cudafy Viewer

If we'd like to check what the code generated by Cudafy looks like, we can use Cudafy Viewer. We can also check whether a given machine has devices supporting CUDA or OpenCL.

After compilation, the files created by Cudafy should appear in the bin folder.

We can view them using Cudafy Viewer. In this case, it's an empty function in OpenCL C.

Knowing the basics, we can handle a more advanced program:
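A sketch of the host code, assuming the struct layout below and a kernel named AddVectors (both names are assumptions; the kernel itself is shown later in this article):

// Requires: using Cudafy; using Cudafy.Host; using Cudafy.Translator;

// A struct of blittable fields – the kind of type CUDAfy can handle.
[Cudafy]
public struct VectorStruct
{
    public float x;
    public float y;
    public float z;
}

public const int N = 100;

public static void Execute()
{
    CudafyModule km = CudafyTranslator.Cudafy();
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    gpu.LoadModule(km);

    VectorStruct[] a = new VectorStruct[N];
    VectorStruct[] b = new VectorStruct[N];
    VectorStruct[] c = new VectorStruct[N];
    // ... fill a and b with sample data ...

    // Allocate device memory for the inputs and the result.
    VectorStruct[] dev_a = gpu.Allocate<VectorStruct>(N);
    VectorStruct[] dev_b = gpu.Allocate<VectorStruct>(N);
    VectorStruct[] dev_c = gpu.Allocate(c);

    // Copy the input arrays to device memory.
    gpu.CopyToDevice(a, dev_a);
    gpu.CopyToDevice(b, dev_b);

    // One grid of N blocks, one thread per block.
    gpu.Launch(N, 1).AddVectors(dev_a, dev_b, dev_c);

    // Copy the result back to RAM and release device memory.
    gpu.CopyFromDevice(dev_c, c);
    gpu.FreeAll();
}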


This example isn't complicated. Arrays of three-dimensional vectors are created, and then, at every index, vectors a and b are added, giving vector c as the result.

A structure represents each vector – this kind of type is acceptable to CUDAfy. However, we should remember that CUDAfy mainly supports blittable types like int, byte, float, etc. Only these kinds of types may appear in code elements marked with the [Cudafy] attribute. If System.Boolean, which isn't supported, is used, CUDAfy will throw an exception. Additionally, CUDAfy supports structures like the one above and one-dimensional arrays. It's possible to use classes, but only with CUDA and with considerable restrictions.

After filling the arrays with sample data, we have to allocate space for them in graphics card memory:

VectorStruct[] dev_a = gpu.Allocate<VectorStruct>(N);

Then we have to declare space in memory for the result array in the same way:

VectorStruct[] dev_c = gpu.Allocate(c);

The next step is to copy data to device memory:

gpu.CopyToDevice(a, dev_a);

Kernel execution itself is a bit more complicated.
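A sketch of that call (the kernel name AddVectors matches the sketches above and is an assumption):

gpu.Launch(N, 1).AddVectors(dev_a, dev_b, dev_c);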
The processed arrays are one-dimensional and have a size of N. In this case, each addition is performed by its own single-thread block. The work could also be split differently between blocks and threads, especially if the arrays were so long that they exceeded the maximum number of threads allowed in a block.

One grid with N blocks (as many as there are elements in the summed arrays) is created. Pointers to the arrays' addresses in GPU memory – dev_a, dev_b and the result array dev_c – are passed to the kernel. If this program is run under a debugger, checking their contents may be a bit misleading: we would see empty arrays. That's because they are only addresses in GPGPU memory, and that is where the actual data should be looked for.

After the kernel has executed on all threads, we should copy the result from GPGPU memory back to RAM.

gpu.CopyFromDevice<VectorStruct>(dev_c, c);

The last step in the host code is to release the memory on the device.

gpu.FreeAll();

The kernel code itself looks like this:
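A sketch of the kernel, matching the launch above (the name and exact body are assumptions):

[Cudafy]
public static void AddVectors(GThread thread, VectorStruct[] a, VectorStruct[] b, VectorStruct[] c)
{
    // One thread per block, so the block index selects the element to process.
    int tid = thread.blockIdx.x;
    if (tid < N)
    {
        c[tid].x = a[tid].x + b[tid].x;
        c[tid].y = a[tid].y + b[tid].y;
        c[tid].z = a[tid].z + b[tid].z;
    }
}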


A GThread and the three arrays of vectors are passed to it. Using GThread properties, we can determine the position of a specific thread within a block. Knowing this position, we can determine which piece of data this thread is responsible for. In the above example, the indexing is one-dimensional and only the x value is used.

But blocks can be three-dimensional, and a thread's position within a block can then be described by x, y and z values. This can be useful for large input data. An analogous situation occurs for blocks within a grid.

Calling km.TrySerialize(); serialises the CUDAfy module, and it is still possible to view the code converted to OpenCL C. This code will be more complicated, but also very similar to the code written in .NET.

Implementation

The implementation itself is an element we can't forget about. As I mentioned before, it's best to write applications with the target hardware planned in advance.

Work with CUDAfy can be divided into two big elements:

  • Code translation into CUDA C or OpenCL C.
  • Loading of translated modules and communication between host and GPU device.

The process of working on a project, from the beginning through to deployment, can also be divided into two stages: working on the development machine and running in the deployed environment. The difference is that the deployed application won't translate C# code – it'll just load ready modules, e.g. from XML files. A sketch of both sides follows the lists below.

Using CUDAfy on the development machine

  • Code translation into CUDA C or OpenCL C.
  • Loading of translated modules and communication between host and GPU device.

Using CUDAfy on the machine on which the application will be deployed

  • Loading of translated modules and communication between host and GPU device.
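A minimal sketch of both sides (the file name is an assumption; the serialised module file is XML underneath):

// On the development machine: translate once and save the module to a file.
CudafyModule km = CudafyTranslator.Cudafy();
km.Serialize("kernels.cdfy");

// On the deployment machine: load the ready module instead of translating.
CudafyModule loaded = CudafyModule.Deserialize("kernels.cdfy");
GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
gpu.LoadModule(loaded);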

If you haven't already, read the first part of this article here.



Author
Paweł Kondzior
Software Developer

Software Engineer specialised in .NET development. He has been working as a developer since 2011, gaining most of his experience in web applications based on MVC. He spends his free time as a glider pilot and amateur American football player.