04 April 2017

General Purpose Computing on GPU in .NET World – Part 1


While developing applications in .NET, we don’t usually consider using the graphics card to speed up our projects. From my point of view, one of the main reasons is that there is usually no need to. Let’s not kid ourselves: most of the code that falls under optimisation consists of more or less complicated CRUD operations and problems like query optimisation.

It’s not always like that. In .NET alone we have a lot of ways to parallelise computations on the processor. However, we don’t usually go a step further and use the graphics card for that. One of the problems is a lack of technology knowledge, combined with the fact that the code needs to be maintained and may at some point have to be handed over to somebody else. The time required to learn to write reasonably good OpenCL or CUDA C can exceed the project deadline, and the benefits of using those technologies may not be worth it.

Things have changed a lot recently. The CUDAfy library has been created, and it helps .NET developers with exactly this task. Do you want to find out more about it? Read the full article.

Processor and GPU

First, let’s talk about the differences between the CPU and the GPU – and let’s do it in a simple way.

The difference is quite visible at first glance: a GPU has several dozen or even several hundred more cores than a CPU. But that doesn’t mean it’s simply faster. Both units are designed for different goals and differ significantly. Put simply, a CPU has to perform complicated operations on a small amount of data, while a GPU has to carry out simple operations on a large amount of data.

When talking about GPGPU, the term SIMD (Single Instruction, Multiple Data) comes up frequently. It means that we perform exactly the same instruction on many cores at once. This differs from a CPU, where every core may perform a different action. GPU cores also run at much lower clock speeds than CPU cores. There are also major differences in memory, but let’s talk about that later.
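.NET itself exposes the same idea on the CPU through System.Numerics.Vector&lt;T&gt;. The sketch below is my own minimal illustration (not GPU code): a single vector addition applies one instruction to several array elements at once.

```csharp
using System;
using System.Numerics;

static class SimdDemo
{
    // Adds two arrays using hardware SIMD where possible: each Vector<int>
    // operation processes Vector<int>.Count elements with one instruction.
    public static int[] Add(int[] a, int[] b)
    {
        var sum = new int[a.Length];
        int i = 0;

        // Vectorised part: one addition per Vector<int>.Count elements.
        for (; i <= a.Length - Vector<int>.Count; i += Vector<int>.Count)
            (new Vector<int>(a, i) + new Vector<int>(b, i)).CopyTo(sum, i);

        // Scalar tail for any remaining elements.
        for (; i < a.Length; i++)
            sum[i] = a[i] + b[i];

        return sum;
    }

    static void Main()
    {
        int[] a = { 1, 2, 3, 4, 5, 6, 7, 8 };
        int[] b = { 10, 20, 30, 40, 50, 60, 70, 80 };
        Console.WriteLine(string.Join(",", Add(a, b))); // 11,22,33,44,55,66,77,88
    }
}
```

The same one-instruction-many-elements principle, scaled up to hundreds of cores, is what the GPU models below build on.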

OpenCL and CUDA

I won’t go into the details of how to write OpenCL or CUDA C. However, we can’t skip this topic entirely: you need to understand the basics of these technologies to go further. Then you will be able “to write code in C# and run it on the graphics card” or “to write code in OpenCL/CUDA and run it from managed code”.


CUDA
  • Closed-source GPGPU programming technology from NVIDIA
  • It’s not just API, but also a set of tools and the name for whole architecture
  • It’s intended only for Nvidia devices
  • Tools and SDK are available for free


OpenCL
  • Free and open standard for parallel computing
  • Supports different kinds of devices – CPU, GPU, DSP
  • Tools and SDK are being delivered by a provider of a particular device

The main principle in both technologies is that every program, even one using the graphics card, has to be started on the CPU. The host code then communicates with the code launched on the device.

I will refer to both technologies and show some differences between them. The two models are very similar, because they share a similar hierarchical and scalable structure. The hierarchy itself is almost identical – the differences are primarily in naming, which can be a little confusing.

CUDAfy.Net was created on top of CUDA technology; OpenCL support was implemented only later. Because of that, NVIDIA terminology is used both in the documentation and in the library, which may be misleading when using OpenCL.

Let’s begin from the bottom of the structure:

  • On the lowest level, we have a thread.
  • Threads are grouped into indexed blocks. Every block can have one (x), two (x, y) or three (x, y, z) dimensions.

Blocks are executed by multiprocessors (MPs). A block always runs on a single multiprocessor and can’t be split across several of them. The specifics of the device are important here – e.g. if a multiprocessor has 768 threads available, the size of a three-dimensional block can’t exceed 768. In other words,
x * y * z <= 768.

At the highest level of the structure we have the grid, which is handled by a single graphics card. We place blocks on the grid in the same way as we place threads in a block. The grid can also be one-, two- or three-dimensional.

In short, we have a grid, which contains blocks (x,y,z) and blocks, which contain threads (x,y,z).
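The practical consequence of this hierarchy is that every thread can compute a unique global index from its block and thread coordinates. A plain C# sketch of the usual one-dimensional index arithmetic (the names and numbers are mine):

```csharp
using System;

static class ThreadIndexDemo
{
    // Global ID of a thread: its block's index times the block size,
    // plus the thread's index within the block.
    public static int GlobalId(int blockIdx, int blockDim, int threadIdx)
        => blockIdx * blockDim + threadIdx;

    static void Main()
    {
        // Thread 10 in block 2, with 256 threads per block:
        Console.WriteLine(GlobalId(2, 256, 10)); // 522

        // A three-dimensional block must respect the device limit,
        // e.g. a 16 x 8 x 4 block gives 512 threads, within the 768 above.
        Console.WriteLine(16 * 8 * 4 <= 768);    // True
    }
}
```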

Grids, blocks and threads are called differently depending on technology. Below I have created a table with counterparts of those names.

CUDA          OpenCL
Grid          NDRange
Thread block  Work-group
Thread        Work-item
Thread ID     Global ID
Block index   Block ID
Thread index  Local ID


We can divide graphics card memory into:

Global – the so-called device memory; every thread of every multiprocessor may freely read from and write to it.

Texture cache – memory in every MP that may be filled with data from global memory; it’s read-only.

Constant cache – read-only memory available in every MP.

Shared memory – memory shared between the threads within a multiprocessor; every thread can read from it and write to it within the MP; it’s very fast.

Registers – the fastest kind of memory.
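To show where these memory types appear in code, here is a sketch of a CUDAfy kernel that stages data in shared memory (the BlockSum name and the sizes are my choices; treat it as an illustration of the API, not production code – it needs a CUDAfy-supported device to run):

```csharp
using Cudafy;

public class Kernels
{
    public const int ThreadsPerBlock = 128;

    [Cudafy]
    public static void BlockSum(GThread thread, int[] input, int[] blockTotals)
    {
        // Shared memory: visible to all threads of this block, very fast.
        int[] cache = thread.AllocateShared<int>("cache", ThreadsPerBlock);

        // Each thread copies one element from global memory into the cache.
        int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        cache[thread.threadIdx.x] = input[tid];

        // Wait until every thread in the block has written its element.
        thread.SyncThreads();

        // Let the first thread of each block add up the cached values.
        if (thread.threadIdx.x == 0)
        {
            int sum = 0;
            for (int i = 0; i < ThreadsPerBlock; i++)
                sum += cache[i];
            blockTotals[thread.blockIdx.x] = sum;
        }
    }
}
```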


  • SIMT (Single Instruction, Multiple Threads) is the leading principle.
  • All threads perform the same instruction at the same time – but on different sets of data.
  • There is no context switching – every thread has its own registers.
  • Any thread may remain inactive while waiting for data or for other threads to finish their computations.


We have discussed the basics of how the device operates; now it’s time to see what it looks like from the software side. It’s pretty simple and boils down to these steps:

  1. Device initialization to start computations (GPU or e.g. APU, CPU).
  2. Allocation of both device memory and host memory (GPU global memory and RAM memory).
  3. Duplication of the data from the host to device memory (from RAM memory to GPU global memory).
  4. Beginning of kernel execution on the device.
  5. Duplication of the results of the computations from device memory to host memory (from GPU global memory to RAM).
  6. Repeat steps 3 through 5 if needed.
  7. Release allocated device memory (we don’t have to worry about it in .NET, but here we have to).
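With CUDAfy, the steps above can be sketched roughly as follows (vector addition as a stand-in workload; AddVectors is a hypothetical kernel marked with the [Cudafy] attribute, and the block/thread counts are arbitrary):

```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

class HostSide
{
    static void Main()
    {
        const int N = 1024;
        int[] a = new int[N], b = new int[N], c = new int[N];

        // 1. Initialise the device.
        CudafyModule km = CudafyTranslator.Cudafy();
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
        gpu.LoadModule(km);

        // 2. Allocate device memory (the host arrays already exist).
        int[] devA = gpu.Allocate<int>(N);
        int[] devB = gpu.Allocate<int>(N);
        int[] devC = gpu.Allocate<int>(N);

        // 3. Copy data from host to device.
        gpu.CopyToDevice(a, devA);
        gpu.CopyToDevice(b, devB);

        // 4. Launch the kernel: 4 blocks of 256 threads.
        gpu.Launch(4, 256).AddVectors(devA, devB, devC, N);

        // 5. Copy the results back from device to host.
        gpu.CopyFromDevice(devC, c);

        // 7. Release the allocated device memory.
        gpu.FreeAll();
    }
}
```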

Here is an example of code that is executed on the graphics card:


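A minimal sketch of such device code with CUDAfy – a hypothetical AddVectors kernel in which each thread adds one pair of elements (the method name and parameters are my own; it requires the CUDAfy library and a supported device):

```csharp
using Cudafy;

public class GpuCode
{
    // The [Cudafy] attribute marks this method for translation
    // to CUDA C (or OpenCL) at runtime.
    [Cudafy]
    public static void AddVectors(GThread thread, int[] a, int[] b, int[] result, int n)
    {
        // Each thread computes its global index and handles one element.
        int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
        if (tid < n)
            result[tid] = a[tid] + b[tid];
    }
}
```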
And here is a list of libraries similar to CUDAfy:

  • GPU.NET (TidePowerd) – it also converts C# code to CUDA during compilation, but it’s probably no longer being developed. There is no support for OpenCL.
  • Brahma (LINQ-to-streaming computation provider) – it runs LINQ expressions using OpenCL. It’s an exciting project, but it’s in an early phase and there is no documentation. There are some more advanced examples, but it’s not easy to use. It can be an alternative for speeding up some computations. The last commits are from 2012.
  • C$ – an ambitious project established to create a C#-based language for development on many platforms.
  • Cloo – a wrapper that allows using OpenCL from C#.
  • OpenCL.Net – a wrapper that allows using OpenCL from C#. It was created to be as fast as possible and to contain as few abstractions as possible.
  • Alea GPU – the most mature library, unfortunately only for CUDA devices. It has a free license, excellent documentation and a lot of examples and tutorials. It may well be the best alternative to CUDAfy if we don’t want to use OpenCL.

In the second part, I will tell you about the CUDAfy library – the elements in the package, the devices supported by CUDAfy, why we need it and more. See you soon!

Paweł Kondzior
Software Developer

A Software Engineer specialised in .NET development, working as a developer since 2011 and spending most of that time gaining experience with web applications based on MVC. He spends his free time as a glider pilot and amateur American football player.