**************************************
Tutorial
**************************************

Some Expensive Computation
==========================

Remember that before compiling anything you will need to have sourced
``env.sh`` in your environment::

   $> source env.sh

Open the ``tutorial`` directory found at the root of this CUDA repository. You
will see there a typical CUDA project structure. In particular, ``device.gpr``
builds the code running on the device, and ``host.gpr`` the code running on
the host. The Makefile is responsible for building both projects; note that it
uses the standard ``Makefile.build`` structure. If you open
``cuda/Makefile.build``, you'll see both build commands:


.. code-block:: Makefile

   gprbuild -Xcuda_host=$(CUDA_HOST) -P device
   gprbuild -Xcuda_host=$(CUDA_HOST) -P host -largs $(CURDIR)/lib/*.fatbin.o

The argument ``-Xcuda_host=$(CUDA_HOST)`` allows building for cross platforms
such as ARM Linux. In this tutorial, we're only going to build for native
x86 Linux, so it can be ignored.

One key argument on the host compilation command line is
``-largs $(CURDIR)/lib/*.fatbin.o``, which adds to the list of object files
linked into the host executable the object code containing the device binary
to load on the GPU.

You can make and run the application:

.. code-block::

   $> make && ./main

You should see some timing output, e.g.:

.. code-block::

   Host processing took 0.021741000 seconds
   Device processing took 0.000000000 seconds

This project performs some expensive computation of O(n^3) complexity, on an
array of size 2^8 by default. Device processing isn't implemented yet, which
explains the zero result. You can change the size of the analysis by adding a
parameter on the ``main`` command line, e.g.:


.. code-block::

   $> ./main 9
   Host processing took 1.303823000 seconds
   Device processing took 0.000000000 seconds

These numbers depend on your platform. You should find a value for which host
processing takes around one second, to make the next steps meaningful.

This code is currently fully native and single threaded. We're going to
offload the computation to the GPU.

Open ``src/common/kernel.ads``. You'll see the specification of ``Complex_Computation``:

.. code-block:: ada

   procedure Complex_Computation
     (A : Float_Array;
      B : Float_Array;
      C : out Float_Array;
      I : Integer);

We're going to wrap this call into a CUDA kernel that can be called directly
from the host.
The first thing to do is to create types. Ada array types cannot be passed
directly from host to device - they need to be passed through specific access
types marked as addressing the device memory space. This is done by using a
specific aspect on the type, ``Designated_Storage_Model =>
CUDA.Storage_Models.Model``. When set, allocation and deallocation are done
through CUDA API allocators. Copies between these pointers and native pointers
will also be instrumented to be able to move device data to the host and back.

Introduce a new pointer type in the ``Kernel`` package:

.. code-block:: ada

   type Array_Device_Access is access Float_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;

Note that this pointer has to be pool-specific - i.e., it can't be an
``access all`` type. That means that it conceptually points to a specific pool
of data - the device memory - and that conversions with other pointer types
are not allowed.

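For illustration, here is a hedged sketch - not part of the tutorial's code -
of a declaration that would be rejected for this reason:

.. code-block:: ada

   --  Illegal: the storage-model aspect requires a pool-specific
   --  access type, so a general "access all" type is rejected.
   type Bad_Access is access all Float_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;
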
We're now going to introduce a procedure that can be called from the host. In
the CUDA world, this is called a kernel. Kernels are identified by a special
aspect, ``CUDA_Global``, which corresponds to the ``__global__`` modifier used
in C CUDA code. This kernel will take A, B and C as parameters. The specific
index I is not needed as a parameter; it will be computed inside the kernel.

Write the kernel specification:

.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   with CUDA_Global;

We're now going to implement the kernel. Open ``kernel.adb`` and start the
body of the procedure:


.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   is
   begin
      null;
   end Device_Complex_Computation;

This kernel will be called in parallel, once per index in the array to
compute. Within a kernel, it's possible to index a given call using the thread
number (``Thread_IDx``) and the block number (``Block_IDx``). You can also
retrieve the number of threads in a block that have been scheduled
(``Block_Dim``) and the number of blocks in the grid (``Grid_Dim``). These are
three-dimensional values, indexed by X, Y and Z. In this example, we're only
going to use the X dimension.

Add the computation of the index ``I`` in the body of the kernel, based on the
block and thread indices:


.. code-block:: ada

   I : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

Note that these values are expressed in terms of ``Interfaces.C.int``, so the
result needs to be converted explicitly to ``Integer``.

From there, the call to ``Complex_Computation`` is trivial. The whole
kernel should now look like:

.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   is
      I : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
   begin
      Complex_Computation (A.all, B.all, C.all, I);
   end Device_Complex_Computation;

We're done with the kernel - let's move to the host code. Open
``src/host/main.adb``. This code currently allocates three native arrays,
``H_A``, ``H_B`` and ``H_C``, and runs the computation on them. We're going to
introduce three new arrays, ``D_A``, ``D_B`` and ``D_C``, of the
``Array_Device_Access`` type created earlier.

Introduce the new declarations:

.. code-block:: ada

   D_A, D_B, D_C : Array_Device_Access;

We're now going to decide how kernel calls will be scheduled on the GPU. In
this tutorial, we're going to set each block to contain 256 threads. The
number of blocks needed to cover the entire array can be deduced from this:
essentially ``Num_Elements / Threads_Per_Block``, adding 1 to account for a
potential remainder smaller than ``Threads_Per_Block``.

Introduce computations for ``Threads_Per_Block`` and ``Blocks_Per_Grid``:

.. code-block:: ada

   Threads_Per_Block : Integer := 256;
   Blocks_Per_Grid : Integer := Num_Elements / Threads_Per_Block + 1;

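Note that because of the extra block, the grid may contain more threads than
there are array elements, so the last threads can compute an out-of-range
index. A common CUDA pattern - not part of this tutorial's code - is to guard
the kernel body, sketched here against the kernel written above:

.. code-block:: ada

   --  Hypothetical bounds guard: skip threads whose computed
   --  index falls outside the array.
   if I in C.all'Range then
      Complex_Computation (A.all, B.all, C.all, I);
   end if;
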
We now need to allocate memory on the device side. In order to measure the
actual additional cost of device computation, this allocation will be included
in the total time reported. Indeed, data copies can be a critical limiting
factor for GPU performance gains.

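For reference, a timing measurement of that kind might look like the following
hedged sketch (``Time_Device`` and its structure are our invention; the actual
code in ``main.adb`` may differ):

.. code-block:: ada

   with Ada.Real_Time; use Ada.Real_Time;
   with Ada.Text_IO;   use Ada.Text_IO;

   procedure Time_Device is
      Start : constant Time := Clock;
   begin
      --  Device allocations, kernel launch and copy back would go here
      Put_Line ("Device processing took"
                & Duration'Image (To_Duration (Clock - Start))
                & " seconds");
   end Time_Device;
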
Identify the portion of the body marked ``-- INSERT HERE DEVICE CALL``.
Introduce there two array allocations, with copies from ``H_A`` and ``H_B`` to
``D_A`` and ``D_B`` respectively. Also allocate an array for ``D_C`` with the
same range as ``H_C``:


.. code-block:: ada

   D_A := new Float_Array'(H_A.all);
   D_B := new Float_Array'(H_B.all);
   D_C := new Float_Array (H_C.all'Range);

Note that the three statements above use the storage model instrumentation
introduced before. In particular, allocations are done through the CUDA API,
and copies happen through the CUDA API from the host to the device.

Now it's time to finally call our kernel code! These special calls are done
through a special pragma, ``CUDA_Execute``, which takes at least three
parameters: a procedure call to a kernel, the dimensions of the blocks (how
many threads they contain) and the dimensions of the grid (how many blocks it
contains).

Introduce the CUDA call:

.. code-block:: ada

   pragma CUDA_Execute
     (Device_Complex_Computation (D_A, D_B, D_C),
      Threads_Per_Block,
      Blocks_Per_Grid);

When hitting that pragma, ``Device_Complex_Computation`` will be scheduled to
be executed Blocks_Per_Grid * Threads_Per_Block times on the device. This call
is non-blocking by itself - subsequent dependent operations (such as copies
from the device) are the ones that will actually block.

Let's introduce this copy now. Results are going to be stored in ``D_C``;
let's copy them to ``H_C``:


.. code-block:: ada

   H_C.all := D_C.all;

This is a copy between a host and a device pointer, which will be instrumented
to retrieve device memory back to the host.

The whole sequence should look like:

.. code-block:: ada

   D_A := new Float_Array'(H_A.all);
   D_B := new Float_Array'(H_B.all);
   D_C := new Float_Array (H_C.all'Range);

   pragma CUDA_Execute
     (Device_Complex_Computation (D_A, D_B, D_C),
      Threads_Per_Block,
      Blocks_Per_Grid);

   H_C.all := D_C.all;

That's it! As an extra exercise, you might want to instantiate and call
``Ada.Unchecked_Deallocation`` on the device pointers (a sketch is given after
the output below), but that's not strictly necessary. Compile and run the
code. You can try different values for the array size to observe the
differences. For example:


.. code-block::

   $> ./main 10
   Host processing took 1.227895000 seconds
   Device processing took 0.051606000 seconds

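As for the deallocation exercise mentioned above, a minimal sketch could look
like this (the ``Free`` name is our choice; thanks to the storage model on
``Array_Device_Access``, each call frees device memory through the CUDA API):

.. code-block:: ada

   with Ada.Unchecked_Deallocation;  --  at the top of main.adb

   --  In the declarative part of the main procedure:
   procedure Free is new Ada.Unchecked_Deallocation
     (Float_Array, Array_Device_Access);

   --  After the copy back to H_C:
   Free (D_A);
   Free (D_B);
   Free (D_C);
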
Marching Cubes
==============

You can see a full-fledged example with some more interesting computation in
the marching cubes example. Marching cubes is one of the typical algorithms in
graphical rendering. It allows you to convert a density function, able to
separate absence from presence of a material in a continuous 3D space, into a
mesh of triangles. This implementation is a transcription of the algorithm
detailed in NVIDIA's `GPU Gems 3 chapter
<https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-1-generating-complex-procedural-terrains-using-gpu>`_.
In this example, we'll define a density function through
`Metaballs <https://en.wikipedia.org/wiki/Metaballs>`_.

.. image:: marching.png
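
For intuition, a metaball density field is simply a sum of per-ball
contributions that fall off with the squared distance to each ball; the
surface is wherever the field crosses a threshold. Here is a minimal sketch
(the types and names are ours, not the actual API of the example):

.. code-block:: ada

   type Ball is record
      X, Y, Z, Radius : Float;
   end record;

   type Ball_Array is array (Positive range <>) of Ball;

   Balls : constant Ball_Array :=
     ((0.0, 0.0, 0.0, 1.0), (1.5, 0.5, 0.0, 0.7));

   Threshold : constant Float := 1.0;

   --  Positive inside the surface, negative outside. The small
   --  epsilon avoids dividing by zero at a ball's center.
   function Density (X, Y, Z : Float) return Float is
      Result : Float := 0.0;
   begin
      for B of Balls loop
         Result := Result + B.Radius ** 2 /
           ((X - B.X) ** 2 + (Y - B.Y) ** 2 + (Z - B.Z) ** 2 + 1.0E-6);
      end loop;
      return Result - Threshold;
   end Density;
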
To be able to build and run the example, ensure that you have the following
dependencies installed:

- SDL
- OpenGL

You can build and run the code the same way as other examples:

.. code-block::

   cd cuda/examples/marching
   make
   ./main

This should open a window and display metaballs moving around on the screen.
The speed of the rendering will depend on the GPU power available on your
system. It can be adjusted by changing the sampling of the grid on which
marching cubes is computed - the smaller the sampling, the faster the
computation.

You can adjust that by changing the value in ``src/common/data.ads``:


.. code-block:: ada

   Samples : constant Integer := 256;

Try for example 128 or 64. This value needs to be a power of 2.
A walkthrough of this code is beyond the scope of this tutorial, but it's a
good place to start looking at more complex usage of the technology.