**************************************
Programming with GNAT for CUDA®
**************************************

CUDA API
========

The CUDA API available from GNAT for CUDA® is a binding to the CUDA API
provided by NVIDIA. The NVIDIA API is installed with the CUDA driver. You
access the Ada API by adding a reference to :file:`cuda_host.gpr` (on the
host) and :file:`cuda_device.gpr` (on the target). The initial
installation script generates the Ada version of the API from the CUDA
version that's installed on your system.

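For example, a host-side project might pull in the binding as follows (a
minimal sketch: the project name and settings are illustrative, and
:file:`cuda_host.gpr` must be reachable on your project path):

.. code-block:: ada

   with "cuda_host.gpr";

   project Host_App is
      for Source_Dirs use ("src");
      for Main use ("main.adb");
   end Host_App;
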
Two versions of the Ada API are available:

- a "thick" binding version. These are child units of the :code:`CUDA`
  package, the main one being :code:`CUDA.Runtime_API`. This is the API
  you will most likely use. However, this API is still in the process of
  being completed, and a number of types and subprogram specs have not yet
  been mapped to higher-level Ada constructs. For example, you will still
  see many references to :code:`System.Address` where you would normally
  expect specific access types in Ada. A small example of calling the
  thick binding follows this list.

- a "thin" binding version. These units are typically identified by a
  suffix of :code:`_h`. They are direct bindings to the underlying C
  APIs. These bindings are functional and complete. They can be used as a
  low-level alternative to the thick binding, but they don't expose an
  interface consistent with the Ada programming style and may require more
  work to use.

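For instance, a direct use of the thick binding might look like this
minimal sketch; :code:`CUDA.Runtime_API.Malloc` appears later in this
manual, while :code:`Free` is assumed here to be its deallocation
counterpart:

.. code-block:: ada

   with System;
   with CUDA.Runtime_API;

   procedure Show_Thick_Binding is
      --  The thick binding still returns a raw System.Address here
      Device_Ptr : constant System.Address :=
        CUDA.Runtime_API.Malloc (4096);
   begin
      CUDA.Runtime_API.Free (Device_Ptr);  --  assumed counterpart of Malloc
   end Show_Thick_Binding;
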
You can regenerate these bindings at any time. You may want to do this, for
example, if you install a new version of CUDA. To regenerate these
bindings, execute the :file:`bind.sh` script located in :file:`<your GNAT
for CUDA installation>/cuda/api/`.

Defining and Calling Kernels
============================

Just as in a typical CUDA program, programming in GNAT for CUDA requires
you to identify application entry points to the GPU code, called
"kernels". In Ada, you do this by annotating a procedure with the
:code:`CUDA_Global` aspect, which serves the same role as the CUDA
:code:`__global__` modifier. For example:

.. code-block:: ada

   procedure My_Kernel (X : Some_Array_Access)
     with CUDA_Global;

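The body of a kernel is written like any other procedure body. For
illustration, here is a minimal sketch (the element type behind
:code:`Some_Array_Access` and the index mapping are assumptions made for
this example):

.. code-block:: ada

   procedure My_Kernel (X : Some_Array_Access) is
      --  Thread indexes are 0-based; map them onto the array's range
      I : constant Integer := Integer (Thread_IDx.X) + X.all'First;
   begin
      X (I) := X (I) * 2;  --  each thread updates one element
   end My_Kernel;
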
Kernels are compiled both for the host and the device. They can be called
as regular procedures, e.g.:

.. code-block:: ada

   My_Kernel (An_Array_Instance);

The above makes a regular, single-threaded call to the kernel and executes
it on the host. You may want to do this, for example, to take advantage of
the richer debugging environment available on the host.

To call a kernel on the device (which means copying it to the device and
executing it there), you use the :code:`CUDA_Execute` pragma:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), 10, 1);

The procedure call looks the same as a regular call, but this call is
surrounded by the pragma :code:`CUDA_Execute`, which has two extra
parameters defining, respectively, the number of blocks per grid and the
number of threads per block. This is equivalent to the CUDA call:

.. code-block:: c

   myKernel<<<10, 1>>> (someArray);

In each case, these calls launch ten instances of the kernel on the device
(ten blocks of one thread each).

The number of blocks per grid and the number of threads per block can each
be expressed either as a scalar or as a :code:`Dim3` value that specifies
all three dimensions (:code:`x`, :code:`y`, and :code:`z`). For example:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), (3, 3, 3), (3, 3, 3));

The above call launches (3 * 3 * 3) * (3 * 3 * 3) = 729 instances of the
kernel on the device.

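The :code:`Dim3` values can also be named, which helps readability for
larger launches; a minimal sketch (where exactly :code:`Dim3` is declared
depends on the generated bindings, so its visibility here is an
assumption):

.. code-block:: ada

   Blocks  : constant Dim3 := (2, 2, 1);   --  4 blocks per grid
   Threads : constant Dim3 := (16, 16, 1); --  256 threads per block

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), Blocks, Threads);

This launch runs 4 * 256 = 1024 kernel instances.
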
Passing Data between Device and Host
====================================

Using Storage Model Aspect
--------------------------

"Storage Model" is an extension to the Ada language that is currently under
development. A general description of this capability can be found `here
<https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/storage_model_2.rst>`_.

GNAT for CUDA provides a storage model that maps to CUDA primitives for
allocation, deallocation, and copying. The model is declared in the package
:code:`CUDA.Storage_Models`. You may either use
:code:`CUDA.Storage_Models.Model` itself or create your own.

When a pointer type is associated with a CUDA storage model, memory
allocation through that pointer occurs on the device, in the same manner as
it would on the host if no storage model were specified. For example:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;

   --  This allocation happens in device memory
   Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);

In addition to allocation being done on the device, copies between the host
and the device are converted into calls to the CUDA memory copy operations.
So you can write:

.. code-block:: ada

   procedure Main is
      type Int_Array_Host_Access is access Int_Array;

      Host_Array : Int_Array_Host_Access := new Int_Array (1 .. 100);

      --  Allocates on the device and copies Host_Array's contents to it
      Device_Array : Int_Array_Device_Access := new Int_Array'(Host_Array.all);
   begin
      --  Launch one kernel instance per array element
      pragma CUDA_Execute (
         Some_Kernel (Device_Array),
         Host_Array.all'Length,
         1);

      --  Copy the results back from device memory to host memory
      Host_Array.all := Device_Array.all;
   end Main;

On the kernel side, :code:`CUDA.Storage_Models.Model` is the native storage
model (as opposed to a foreign, device-side model as seen from the host).
You can therefore use :code:`Int_Array_Device_Access` directly:

.. code-block:: ada

   procedure Kernel (Device_Array : Int_Array_Device_Access) is
      --  Thread indexes are 0-based while the array starts at 1
      I : constant Integer := Integer (Thread_IDx.X) + 1;
   begin
      Device_Array (I) := Device_Array (I) + 10;
   end Kernel;

This is the recommended way of sharing memory between device and host. The
storage model can also be extended to support capabilities such as streams
or unified memory, described in the following sections.

Using Unified Storage Model
---------------------------

An alternative to the default CUDA storage model is to use so-called
"unified memory". In that model, the device memory is mapped directly onto
host memory, so no explicit copy operation is necessary. The factors that
may lead you to choose one model over the other are outside the scope of
this manual. To use unified memory, you use :code:`Unified_Model` instead
of the default model:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Unified_Model;

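With unified memory, the host can then read the data a kernel wrote
without any explicit copy. A minimal sketch, reusing :code:`Some_Kernel`
from the earlier example (a real program may also need to synchronize with
the device before reading the results):

.. code-block:: ada

   procedure Use_Unified is
      Data         : Int_Array_Device_Access := new Int_Array'(1 .. 100 => 0);
      First_Result : Integer;
   begin
      pragma CUDA_Execute (Some_Kernel (Data), Data.all'Length, 1);

      --  No copy back to a host array is needed with unified memory:
      --  the host reads the same memory the kernel wrote
      First_Result := Data (1);
   end Use_Unified;
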
Using Storage Model with Streams
--------------------------------

CUDA streams allow you to launch several computations in parallel. This
model lets you specify which computations a given write or read operation
must wait for. The Ada CUDA API doesn't provide a pre-allocated
stream-aware storage model. Instead, it provides a type,
:code:`CUDA_Async_Storage_Model`, of which you declare an object,
specifying the stream to use:

.. code-block:: ada

   My_Stream_Model : CUDA.Storage_Models.CUDA_Async_Storage_Model :=
     (Stream => Stream_Create);

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => My_Stream_Model;

The data stream associated with a specific model can vary over time,
allowing different parts of a given object to be used by different streams,
e.g.:

.. code-block:: ada

   X        : Int_Array_Device_Access := new Int_Array (1 .. 10_000);
   Stream_1 : Stream_T := Stream_Create;
   Stream_2 : Stream_T := Stream_Create;
   begin
      --  Operations on X now go through Stream_1
      My_Stream_Model.Stream := Stream_1;
      X (1 .. 5_000) := (others => 0);

      --  Subsequent operations on X go through Stream_2
      My_Stream_Model.Stream := Stream_2;
      X (5_001 .. 10_000) := (others => 0);

Because :code:`X` was allocated through :code:`My_Stream_Model`, each
operation on :code:`X` is issued on whatever stream
:code:`My_Stream_Model.Stream` designates at that moment. In the example
above, the two halves of :code:`X` are therefore initialized through two
different streams, allowing the two operations to proceed concurrently.

Low-Level Data Transfer
-----------------------

At the lowest level, you can allocate device memory using the standard CUDA
allocation function, bound in Ada as :code:`CUDA.Runtime_API.Malloc`.
E.g.:

.. code-block:: ada

   --  'Size is in bits, so divide by 8 to get the size in bytes
   Device_Array : System.Address :=
     CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);

This is equivalent to the following CUDA code:

.. code-block:: c

   int *deviceArray;
   cudaMalloc ((void **) &deviceArray, sizeof (int) * 100);

In this example, objects on the Ada side aren't typed. Creating typed
objects requires more advanced Ada constructions that are described later.

The above statement creates space in device memory for 100 integers. That
space can now be used to perform copies back and forth between host and
device memory. For example:

.. code-block:: ada

   procedure Main is
      type Int_Array is array (Integer range <>) of Integer;
      type Int_Array_Access is access all Int_Array;

      Host_Array : Int_Array_Access := new Int_Array (1 .. 100);

      --  'Size is in bits; divide by 8 to get a size in bytes
      Device_Array : System.Address :=
        CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);
   begin
      Host_Array.all := (others => 0);

      --  Copy the host array into the device buffer
      CUDA.Runtime_API.Memcpy
        (Dst   => Device_Array,
         Src   => Host_Array.all'Address,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Host_To_Device);

      --  Launch one kernel instance per array element
      pragma CUDA_Execute (
         Some_Kernel (Device_Array, Host_Array.all'Length),
         Host_Array.all'Length,
         1);

      --  Copy the results back into the host array
      CUDA.Runtime_API.Memcpy
        (Dst   => Host_Array.all'Address,
         Src   => Device_Array,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Device_To_Host);
   end Main;

This code copies the contents of :code:`Host_Array` to
:code:`Device_Array`, performs some computations on that data on the
device, and then copies the data back. At this level of coding, we're not
passing a typed array but instead a raw address. On the kernel side, we
need to reconstruct the array with an overlay:

.. code-block:: ada

   procedure Kernel (Array_Address : System.Address; Length : Integer) is
      --  A 0-based view lets the 0-based thread indexes map directly
      Device_Array : Int_Array (0 .. Length - 1)
        with Address => Array_Address;
      I : constant Integer := Integer (Thread_IDx.X);
   begin
      Device_Array (I) := Device_Array (I) + 10;
   end Kernel;

While it works, this method of passing data back and forth is not very
satisfactory, and you should reserve it for cases where an alternative
doesn't exist, or doesn't exist yet. In particular, typing is lost at the
interface, so you need to check type correctness manually.

Specifying Where Code is For
============================

Like in CUDA, a GNAT for CUDA application contains code that may be
compiled exclusively for the host, for the device, or for both. By default,
all code is compiled for both the host and the device. You can mark code as
being compilable only for the device by using the :code:`CUDA_Device`
aspect:

.. code-block:: ada

   procedure Some_Device_Procedure
     with CUDA_Device;

:code:`Some_Device_Procedure` does not exist on the host; calling it from
host code results in a compilation error.

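For example, a subprogram that reads the thread index registers is only
meaningful in device code, so it can be restricted with
:code:`CUDA_Device`; a minimal sketch (the subprogram name is
illustrative):

.. code-block:: ada

   --  Thread_IDx maps to a PTX register and only exists on the device
   procedure Add_Thread_Offset (X : in out Integer)
     with CUDA_Device;

   procedure Add_Thread_Offset (X : in out Integer) is
   begin
      X := X + Integer (Thread_IDx.X);
   end Add_Thread_Offset;
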
The corresponding :code:`CUDA_Host` aspect is currently not implemented.

Accessing Block and Thread Indexes and Dimensions
=================================================

GNAT for CUDA® allows you to access block and thread indexes and
dimensions in a way that's similar to CUDA. The package
:code:`CUDA.Runtime_API` declares :code:`Block_Dim`, :code:`Grid_Dim`,
:code:`Block_IDx`, and :code:`Thread_IDx`, which map directly to the
corresponding PTX registers. For example:

.. code-block:: ada

   --  Global thread index along the X dimension
   J : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

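Putting this together, a kernel can use the global index to choose the
element each thread works on. A minimal sketch reusing the types from the
storage model examples (the guard protects against launches with more
threads than elements; the +1 accounts for the array's 1-based indexing):

.. code-block:: ada

   procedure Scale (X : Int_Array_Device_Access)
     with CUDA_Global;

   procedure Scale (X : Int_Array_Device_Access) is
      J : constant Integer :=
        Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
   begin
      if J + 1 in X.all'Range then
         X (J + 1) := X (J + 1) * 2;
      end if;
   end Scale;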