**************************************
Tutorial
**************************************

.. role:: switch(samp)

Some Expensive Computation
==========================

Before compiling any CUDA code, you will need to source :file:`env.sh` into
your environment.

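A minimal sketch of that step, assuming a Bash-compatible shell and that you
are in the directory containing :file:`env.sh`:

.. code-block:: shell

    # Load the CUDA build environment into the current shell session.
    source ./env.sh
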
Open the :file:`tutorial` directory at the top level of this CUDA
repository. There you'll see a typical CUDA project structure. In
particular, :file:`device.gpr` builds the code that runs on the device and
:file:`host.gpr` builds the code for the host. :file:`Makefile` is
responsible for building both projects. Note that it uses the standard
:file:`Makefile.build` structure. If you look at
:file:`cuda/Makefile.build`, you'll see both build commands:

.. code-block:: shell

    gprbuild -Xcuda_host=$(CUDA_HOST) -P device
    gprbuild -Xcuda_host=$(CUDA_HOST) -P host -largs $(CURDIR)/lib/*.fatbin.o

The switch :switch:`-Xcuda_host=$(CUDA_HOST)` allows building for cross
platforms such as ARM Linux. In this tutorial, we're only building for
native x86 Linux, so it isn't needed.

An important switch on the host compilation command line is :switch:`-largs
$(CURDIR)/lib/*.fatbin.o`. This adds the object code containing the device
binary to the list of object files linked with the host application. The
host will load this library onto the GPU.

You can make and run the application:

.. code-block:: shell

    make && ./main

You should see timing output, e.g.:

.. code-block:: text

    Host processing took 0.021741000 seconds
    Device processing took 0.000000000 seconds

This sample project is doing expensive computations with O(3) complexity on
an array whose size is 2^8 by default. We haven't yet arranged for any
processing to be done on the device, which explains why there was no time
spent on the device. You can change the size of the array by adding a
parameter on the :file:`main` invocation, e.g.:

.. code-block:: text

    ./main 9
    Host processing took 1.303823000 seconds
    Device processing took 0.000000000 seconds

The timing depends on your platform. You should choose a parameter that
causes the processing to take around 1 second to make the next steps
meaningful.

As we said, this code is currently fully native and single-threaded. We're
now going to offload the computation to the GPU.

Open :file:`src/common/kernel.ads`. You'll see the specification of
:code:`Complex_Computation`:

.. code-block:: ada

    procedure Complex_Computation
      (A : Float_Array;
       B : Float_Array;
       C : out Float_Array;
       I : Integer);

We're going to wrap this call into a CUDA kernel that we can call from the
host.

We first need to create types. Ada arrays can't be passed directly from
host to device: they need to be passed through specific access types marked
as addressing device memory space. You do this by putting the aspect
:code:`Designated_Storage_Model => CUDA.Storage_Models.Model` on the access
type. When you do this, allocation and deallocation are done through the
CUDA API. Copies between these pointers and native pointers are also
modified to move data from the device to the host and back.

We next introduce a new pointer type in the :code:`Kernel` package:

.. code-block:: ada

    type Array_Device_Access is access Float_Array
      with Designated_Storage_Model => CUDA.Storage_Models.Model;

This pointer must be pool specific: it can't be an :code:`access all` type.
That means it conceptually points to a specific pool of data, the device
memory, and that conversions to and from other pointer types aren't
allowed.

We're now going to introduce a procedure to be called from the host. In the
CUDA world, this is called a "kernel". Kernels are identified by a special
aspect, :code:`CUDA_Global`, which corresponds to the :code:`__global__`
modifier used in C CUDA code. This kernel accepts :code:`A`, :code:`B` and
:code:`C` as parameters. The specific index, :code:`J`, isn't passed to the
kernel, but is instead computed there.

Write the kernel specification:

.. code-block:: ada

    procedure Device_Complex_Computation
      (A : Array_Device_Access;
       B : Array_Device_Access;
       C : Array_Device_Access)
      with CUDA_Global;

We're now going to implement the kernel. Open :file:`kernel.adb` and start
writing the body of the procedure:

.. code-block:: ada

    procedure Device_Complex_Computation
      (A : Array_Device_Access;
       B : Array_Device_Access;
       C : Array_Device_Access)
    is
    begin
       null;
    end Device_Complex_Computation;

This kernel is called in parallel, once per index in the array we're
computing. Within a kernel, you can index a given call using the thread
number (:code:`Thread_IDx`) and the block number (:code:`Block_IDx`). You
can also retrieve the number of threads in a block that have been scheduled
(:code:`Block_Dim`) and the number of blocks in the grid
(:code:`Grid_Dim`). These correspond to three dimension values, which we
can call :code:`x`, :code:`y`, and :code:`z`. In this example, we're only
going to use the :code:`x` dimension.

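As an aside, and purely as an illustration of the other dimensions (this
tutorial's kernel only uses :code:`x`), a two-dimensional index could be
computed along the same lines, assuming the :code:`Y` components are exposed
the same way as :code:`X`:

.. code-block:: ada

    --  Hypothetical 2D indexing sketch; not used in this tutorial.
    Row : Integer := Integer (Block_Dim.Y * Block_IDx.Y + Thread_IDx.Y);
    Col : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
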
Next, we add a computation of the index, :code:`J`, into the body of the
kernel based on the block and thread index:

.. code-block:: ada

    J : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

These are expressed in terms of :code:`Interfaces.C.int`, so we need to
explicitly convert the result to :code:`Integer`.

At this point the call to :code:`Complex_Computation` is trivial. Our whole
kernel should now look like:

.. code-block:: ada

    procedure Device_Complex_Computation
      (A : Array_Device_Access;
       B : Array_Device_Access;
       C : Array_Device_Access)
    is
       J : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
    begin
       Complex_Computation (A.all, B.all, C.all, J);
    end Device_Complex_Computation;

We're now done with the kernel, so let's move on to the host code. Open
:file:`src/host/main.adb`. That code currently allocates three native
arrays, :code:`H_A`, :code:`H_B` and :code:`H_C`, and runs a computation on
them. We're going to introduce three new arrays, :code:`D_A`, :code:`D_B`,
and :code:`D_C`, using the :code:`Array_Device_Access` type we created
earlier.

The new declaration is:

.. code-block:: ada

    D_A, D_B, D_C : Array_Device_Access;

We're now going to determine how kernel calls will be scheduled on the GPU.
In this tutorial, we're going to have each block contain 256 threads. We
can compute the number of blocks needed to cover the entire array, which is
essentially :code:`Num_Elements / Threads_Per_Block`, plus 1 to account for
any leftover elements when the array size isn't an exact multiple of
:code:`Threads_Per_Block`.

The computations for :code:`Threads_Per_Block` and :code:`Blocks_Per_Grid`
are:

.. code-block:: ada

    Threads_Per_Block : Integer := 256;
    Blocks_Per_Grid   : Integer := Num_Elements / Threads_Per_Block + 1;

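For instance, with the default array size of 2^8 = 256 elements, this gives
a :code:`Blocks_Per_Grid` of 256 / 256 + 1 = 2, i.e. two blocks of 256
threads each.
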
We now need to allocate memory on the device side. To measure the true
additional cost of the device computation, this allocation is included in
the total time reported, since data copies can be a critical limiting
factor on the performance gains a GPU can provide.

Find the portion of the body marked :code:`-- INSERT HERE DEVICE CALL`.
After that, add the two array allocations and copies for :code:`H_A`
and :code:`H_B` to :code:`D_A` and :code:`D_B` respectively. Also allocate
an array for :code:`D_C` which is the size of :code:`H_C`:

.. code-block:: ada

    D_A := new Float_Array'(H_A.all);
    D_B := new Float_Array'(H_B.all);
    D_C := new Float_Array (H_C.all'Range);

These three statements use the storage model introduced before. In
particular, allocations are done through the CUDA API and copies from the
host to the device are also done through the CUDA API.

Now we can finally call our kernel code! We do this using a special pragma,
:code:`CUDA_Execute`, which takes at least three parameters: a procedure
call to a kernel, the dimension of the grid (how many blocks it contains)
and the dimension of the blocks (how many threads each contains).

The CUDA call is as follows:

.. code-block:: ada

    pragma CUDA_Execute
      (Device_Complex_Computation (D_A, D_B, D_C),
       Blocks_Per_Grid,
       Threads_Per_Block);

When executing that pragma, the CUDA API schedules
:code:`Device_Complex_Computation` to be executed
:code:`Blocks_Per_Grid * Threads_Per_Block` times on the device. This call
itself is non-blocking, but subsequent dependent operations (such as copies
from the device) will block host execution until the kernel completes.

Let's introduce this copy now. Results are going to be stored in
:code:`D_C`, so let's copy it to :code:`H_C`:

.. code-block:: ada

    H_C.all := D_C.all;

This is a copy between a host and a device pointer, which will be
implemented as a copy from device memory to the host.

The whole sequence should look like:

.. code-block:: ada

    D_A := new Float_Array'(H_A.all);
    D_B := new Float_Array'(H_B.all);
    D_C := new Float_Array (H_C.all'Range);

    pragma CUDA_Execute
      (Device_Complex_Computation (D_A, D_B, D_C),
       Blocks_Per_Grid,
       Threads_Per_Block);

    H_C.all := D_C.all;

That's it! As an extra exercise, you might want to instantiate and call
:code:`Ada.Unchecked_Deallocation` on the device pointers, but that's not
strictly necessary.

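If you do try that exercise, here is a minimal sketch of what it could look
like; the name :code:`Free` is just an illustrative choice, and the
placement comments indicate where each piece would go in
:file:`src/host/main.adb`:

.. code-block:: ada

    --  With clause at the top of the unit:
    with Ada.Unchecked_Deallocation;

    --  Instantiation in the declarative part, next to D_A, D_B and D_C:
    procedure Free is new Ada.Unchecked_Deallocation
      (Object => Float_Array, Name => Array_Device_Access);

    --  After the copy back to H_C; because of the storage model on
    --  Array_Device_Access, these deallocations go through the CUDA API:
    Free (D_A);
    Free (D_B);
    Free (D_C);
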
Now compile and run the code. You can try different values for the array
size to observe different timings. For example:

.. code-block:: text

    ./main 10
    Host processing took 1.227895000 seconds
    Device processing took 0.051606000 seconds

Marching Cubes
==============

The marching cubes example demonstrates a more interesting
computation. Marching cubes is an important algorithm in graphics
rendering. It converts a density function, which indicates the absence or
presence of a material in a continuous 3D space, into a mesh of
triangles. The algorithm in this example is a transcription of the
algorithm shown in NVIDIA's `GPU Gems 3
<https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-1-generating-complex-procedural-terrains-using-gpu>`_.
In this example, we'll define a density function through `Metaballs
<https://en.wikipedia.org/wiki/Metaballs>`_.

.. image:: marching.png

To build and run the example, ensure you have SDL and OpenGL installed.

You can build and run the code like the other examples:

.. code-block:: shell

    cd cuda/examples/marching
    make
    ./main

This opens a window and shows metaballs moving around the screen. The
speed of the rendering depends on the GPU power available on your
system. You can adjust the speed by changing the sampling of the grid on
which the marching cubes are computed: the fewer the samples, the faster
the computation. You do this by changing the value in
:file:`src/common/data.ads`:

.. code-block:: ada

    Samples : constant Integer := 256;

This value needs to be a power of 2. Try 128 or 64, for example.

A detailed walkthrough of this code is beyond the scope of this tutorial,
but this example is a good place to start looking at more complex usage of
the technology.