**************************************
Tutorial
**************************************

Some Expensive Computation
==========================

Remember that before compiling anything you will need to have sourced
``env.sh`` in your environment::

   $> source env.sh

Open the ``tutorial`` directory found at the root of this CUDA repository. You
will see there a typical CUDA project structure. In particular, ``device.gpr``
builds the code running on the device, and ``host.gpr`` the code running on
the host. The Makefile is responsible for building both projects; note that it
uses the standard ``Makefile.build`` structure. If you open
``cuda/Makefile.build``, you'll see both build commands:


.. code-block:: Makefile

   gprbuild -Xcuda_host=$(CUDA_HOST) -P device
   gprbuild -Xcuda_host=$(CUDA_HOST) -P host -largs $(CURDIR)/lib/*.fatbin.o

The argument ``-Xcuda_host=$(CUDA_HOST)`` allows building for cross platforms
such as ARM Linux. In this tutorial, we're only going to build for native
x86 Linux, so it can be ignored.

One key argument on the host compilation command line is
``-largs $(CURDIR)/lib/*.fatbin.o``, which adds to the list of object files
linked into the host executable the object code containing the device binary
to load on the GPU.

You can make and run the application:

.. code-block::

   $> make && ./main

You should see some timing output, e.g.:

.. code-block::

   Host processing took 0.021741000 seconds
   Device processing took 0.000000000 seconds

This project performs some expensive computation of O(n^3) complexity, on an
array of size 2^8 by default. Device processing isn't implemented yet, which
explains the zero result. You can change the size of the analysis by adding a
parameter on the ``main`` command line, e.g.:


.. code-block::

   $> ./main 9
   Host processing took 1.303823000 seconds
   Device processing took 0.000000000 seconds

These numbers depend on your platform. You should find a value for which host
processing takes around one second, to make the next steps meaningful.

This code is currently fully native and single threaded. We're going to
offload the computation to the GPU.

Open ``src/common/kernel.ads``. You'll see the specification of ``Complex_Computation``:

.. code-block:: ada

   procedure Complex_Computation
     (A : Float_Array;
      B : Float_Array;
      C : out Float_Array;
      I : Integer);

We're going to wrap this call into a CUDA kernel that can be called directly
from the host.
The first thing to do is to create types. Ada array types cannot be passed
directly from host to device - they need to be passed through specific access
types marked as addressing the device memory space. This is done by using a
specific aspect on the type, ``Designated_Storage_Model =>
CUDA.Storage_Models.Model``. When set, allocation and deallocation are done
through CUDA API allocators. Copies between these pointers and native pointers
will also be instrumented to be able to move device data to the host and back.

Introduce a new pointer type in the ``Kernel`` package:

.. code-block:: ada

   type Array_Device_Access is access Float_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;

Note that this pointer has to be pool-specific - i.e., it can't be an
``access all`` type. That means that it conceptually points to a specific pool
of data - the device memory - and that conversions with other pointer types
are not allowed.

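For illustration, here is a hedged sketch - not part of the tutorial's code -
of a declaration that would be rejected for this reason:

.. code-block:: ada

   --  Illegal: the storage-model aspect requires a pool-specific
   --  access type, so a general "access all" type is rejected.
   type Bad_Access is access all Float_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;
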
We're now going to introduce a procedure that can be called from the host. In
the CUDA world, this is called a kernel. Kernels are identified by a special
aspect, ``CUDA_Global``, which corresponds to the ``__global__`` modifier used
in C CUDA code. This kernel will take A, B and C as parameters. The specific
index I is not needed as a parameter; it will be computed inside the kernel.

Write the kernel specification:

.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   with CUDA_Global;

We're now going to implement the kernel. Open ``kernel.adb`` and start the
body of the procedure:


.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   is
   begin
      null;
   end Device_Complex_Computation;

This kernel will be called in parallel, once per index in the array to
compute. Within a kernel, it's possible to index a given call using the thread
number (``Thread_IDx``) and the block number (``Block_IDx``). You can also
retrieve the number of threads in a block that have been scheduled
(``Block_Dim``) and the number of blocks in the grid (``Grid_Dim``). These are
three-dimensional values, indexed by X, Y and Z. In this example, we're only
going to use the X dimension.

Add the computation of the index ``I`` in the body of the kernel, based on the
block and thread indices:


.. code-block:: ada

   I : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

Note that these values are expressed in terms of ``Interfaces.C.int``, so the
result needs to be converted explicitly to ``Integer``.

From there, the call to ``Complex_Computation`` is trivial. The whole
kernel should now look like:

.. code-block:: ada

   procedure Device_Complex_Computation
     (A : Array_Device_Access;
      B : Array_Device_Access;
      C : Array_Device_Access)
   is
      I : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
   begin
      Complex_Computation (A.all, B.all, C.all, I);
   end Device_Complex_Computation;

We're done with the kernel - let's move to the host code. Open
``src/host/main.adb``. This code currently allocates three native arrays,
``H_A``, ``H_B`` and ``H_C``, and runs the computation on them. We're going to
introduce three new arrays, ``D_A``, ``D_B`` and ``D_C``, of the
``Array_Device_Access`` type created earlier.

Introduce the new declarations:

.. code-block:: ada

   D_A, D_B, D_C : Array_Device_Access;

We're now going to decide how kernel calls will be scheduled on the GPU. In
this tutorial, we're going to set each block to contain 256 threads. The
number of blocks needed to cover the entire array can be deduced from this:
essentially ``Num_Elements / Threads_Per_Block``, adding 1 to account for a
potential remainder smaller than ``Threads_Per_Block``.

Introduce computations for ``Threads_Per_Block`` and ``Blocks_Per_Grid``:

.. code-block:: ada

   Threads_Per_Block : Integer := 256;
   Blocks_Per_Grid : Integer := Num_Elements / Threads_Per_Block + 1;

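Note that because of the extra block, the grid may contain more threads than
there are array elements, so the last threads can compute an out-of-range
index. A common CUDA pattern - not part of this tutorial's code - is to guard
the kernel body, sketched here against the kernel written above:

.. code-block:: ada

   --  Hypothetical bounds guard: skip threads whose computed
   --  index falls outside the array.
   if I in C.all'Range then
      Complex_Computation (A.all, B.all, C.all, I);
   end if;
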
We now need to allocate memory on the device side. In order to measure the
actual additional cost of device computation, this allocation will be included
in the total time reported. Indeed, data copies can be a critical limiting
factor for GPU performance gains.

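For reference, a timing measurement of that kind might look like the following
hedged sketch (``Time_Device`` and its structure are our invention; the actual
code in ``main.adb`` may differ):

.. code-block:: ada

   with Ada.Real_Time; use Ada.Real_Time;
   with Ada.Text_IO;   use Ada.Text_IO;

   procedure Time_Device is
      Start : constant Time := Clock;
   begin
      --  Device allocations, kernel launch and copy back would go here
      Put_Line ("Device processing took"
                & Duration'Image (To_Duration (Clock - Start))
                & " seconds");
   end Time_Device;
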
Identify the portion of the body marked ``-- INSERT HERE DEVICE CALL``.
Introduce there two array allocations, with copies from ``H_A`` and ``H_B`` to
``D_A`` and ``D_B`` respectively. Also allocate an array for ``D_C`` with the
same range as ``H_C``:


.. code-block:: ada

   D_A := new Float_Array'(H_A.all);
   D_B := new Float_Array'(H_B.all);
   D_C := new Float_Array (H_C.all'Range);

Note that the three statements above use the storage model instrumentation
introduced before. In particular, allocations are done through the CUDA API,
and copies happen through the CUDA API from the host to the device.

Now it's time to finally call our kernel code! These special calls are done
through a special pragma, ``CUDA_Execute``, which takes at least three
parameters: a procedure call to a kernel, the dimensions of the blocks (how
many threads they contain) and the dimensions of the grid (how many blocks it
contains).

Introduce the CUDA call:

.. code-block:: ada

   pragma CUDA_Execute
     (Device_Complex_Computation (D_A, D_B, D_C),
      Threads_Per_Block,
      Blocks_Per_Grid);

When hitting that pragma, ``Device_Complex_Computation`` will be scheduled to
be executed Blocks_Per_Grid * Threads_Per_Block times on the device. This call
is non-blocking by itself - subsequent dependent operations (such as copies
from the device) are the ones that will actually block.

Let's introduce this copy now. Results are going to be stored in ``D_C``;
let's copy them to ``H_C``:


.. code-block:: ada

   H_C.all := D_C.all;

This is a copy between a host and a device pointer, which will be instrumented
to retrieve device memory back to the host.

The whole sequence should look like:

.. code-block:: ada

   D_A := new Float_Array'(H_A.all);
   D_B := new Float_Array'(H_B.all);
   D_C := new Float_Array (H_C.all'Range);

   pragma CUDA_Execute
     (Device_Complex_Computation (D_A, D_B, D_C),
      Threads_Per_Block,
      Blocks_Per_Grid);

   H_C.all := D_C.all;

That's it! As an extra exercise, you might want to instantiate and call
``Ada.Unchecked_Deallocation`` on the device pointers (a sketch is given after
the output below), but that's not strictly necessary. Compile and run the
code. You can try different values for the array size to observe the
differences. For example:


.. code-block::

   $> ./main 10
   Host processing took 1.227895000 seconds
   Device processing took 0.051606000 seconds

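As for the deallocation exercise mentioned above, a minimal sketch could look
like this (the ``Free`` name is our choice; thanks to the storage model on
``Array_Device_Access``, each call frees device memory through the CUDA API):

.. code-block:: ada

   with Ada.Unchecked_Deallocation;  --  at the top of main.adb

   --  In the declarative part of the main procedure:
   procedure Free is new Ada.Unchecked_Deallocation
     (Float_Array, Array_Device_Access);

   --  After the copy back to H_C:
   Free (D_A);
   Free (D_B);
   Free (D_C);
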
Marching Cubes
==============

You can see a full-fledged example with some more interesting computation in
the marching cubes example. Marching cubes is one of the typical algorithms in
graphical rendering. It allows you to convert a density function, able to
separate absence from presence of a material in a continuous 3D space, into a
mesh of triangles. This implementation is a transcription of the algorithm
detailed in NVIDIA's `GPU Gems 3 chapter
<https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-1-generating-complex-procedural-terrains-using-gpu>`_.
In this example, we'll define a density function through
`Metaballs <https://en.wikipedia.org/wiki/Metaballs>`_.

.. image:: marching.png
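
For intuition, a metaball density field is simply a sum of per-ball
contributions that fall off with the squared distance to each ball; the
surface is wherever the field crosses a threshold. Here is a minimal sketch
(the types and names are ours, not the actual API of the example):

.. code-block:: ada

   type Ball is record
      X, Y, Z, Radius : Float;
   end record;

   type Ball_Array is array (Positive range <>) of Ball;

   Balls : constant Ball_Array :=
     ((0.0, 0.0, 0.0, 1.0), (1.5, 0.5, 0.0, 0.7));

   Threshold : constant Float := 1.0;

   --  Positive inside the surface, negative outside. The small
   --  epsilon avoids dividing by zero at a ball's center.
   function Density (X, Y, Z : Float) return Float is
      Result : Float := 0.0;
   begin
      for B of Balls loop
         Result := Result + B.Radius ** 2 /
           ((X - B.X) ** 2 + (Y - B.Y) ** 2 + (Z - B.Z) ** 2 + 1.0E-6);
      end loop;
      return Result - Threshold;
   end Density;
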
To be able to build and run the example, ensure that you have the following
dependencies installed:

- SDL
- OpenGL

You can build and run the code the same way as other examples:

.. code-block::

   cd cuda/examples/marching
   make
   ./main

This should open a window and display metaballs moving around on the screen.
The speed of the rendering will depend on the GPU power available on your
system. It can be adjusted by changing the sampling of the grid on which
marching cubes is computed - the smaller the sampling, the faster the
computation.

You can adjust that by changing the value in ``src/common/data.ads``:


.. code-block:: ada

   Samples : constant Integer := 256;

Try for example 128 or 64. This value needs to be a power of 2.
A walkthrough of this code is beyond the scope of this tutorial, but it's a
good place to start looking at more complex usage of the technology.