.. mirror of https://github.com/AdaCore/cuda.git, synced 2026-02-12 13:05:54 -08:00

**************************************
Programming with GNAT for CUDA®
**************************************

CUDA API
========

The CUDA API available from GNAT for CUDA® is a binding to the CUDA API
provided by NVIDIA, which is installed with the CUDA driver. It is accessed
by adding a reference to ``cuda_host.gpr`` (on the host) or
``cuda_device.gpr`` (on the target).

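As an illustration, a host-side project might reference the binding as
follows (a minimal sketch: the project name and main unit are hypothetical,
and the actual path to ``cuda_host.gpr`` depends on where GNAT for CUDA is
installed):

.. code-block:: ada

   --  host_app.gpr (hypothetical project file)
   with "cuda_host.gpr";  --  adjust the path to your installation

   project Host_App is
      for Main use ("main.adb");
   end Host_App;
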
The Ada version of the API is generated automatically when running the
initial installation script, and thus corresponds specifically to the CUDA
version installed on the system.

Two versions of the API are available:

- a "thick" binding version. These units are child units of the ``CUDA``
  package, the main one being ``CUDA.Runtime_API``. This is the intended
  API to use. Note that at this stage, this API is still in the process of
  being completed: a number of types and subprogram profiles have not yet
  been mapped to higher-level Ada constructs. For example, you will still
  see many references to ``System.Address`` where Ada would call for
  specific types.

- a "thin" binding version. These units are typically identified by their
  "_h" suffix and are direct bindings to the underlying C APIs. These
  bindings are functional and complete, and can be used as a low-level
  alternative to the thick binding. However, they do not expose an
  interface consistent with Ada programming patterns and may require more
  work at the user level.

These bindings can be regenerated at any time, which can be useful, for
example, when a new version of CUDA is installed. To regenerate them,
execute the ``bind.sh`` script located under
``<your GNAT for CUDA installation>/cuda/api/``.

Defining and Calling Kernels
============================

Just as in a typical CUDA program, programming in GNAT for CUDA requires
the developer to identify the application entry points to the GPU code,
called kernels. In Ada, this is done by associating a procedure with the
``CUDA_Global`` aspect, which serves the same role as the CUDA
``__global__`` modifier. For example:

.. code-block:: ada

   procedure My_Kernel (X : Some_Array_Access)
     with CUDA_Global;

Kernels are compiled for both host and device. They can be called as
regular procedures, e.g.:

.. code-block:: ada

   My_Kernel (An_Array_Instance);

This performs a regular single-threaded call and executes the kernel on
the host, which can help when debugging on the host.

Calling a kernel on the device is done through the ``CUDA_Execute``
pragma:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), 10, 1);

Note that the procedure call looks the same as a regular call. However, it
is wrapped in the pragma ``CUDA_Execute``, which takes two extra
parameters defining, respectively, the number of threads per block and the
number of blocks per grid. This is equivalent to the familiar CUDA call:

.. code-block:: c

   myKernel<<<10, 1>>> (someArray);

The above calls launch ten instances of the kernel on the device.

Threads per block and blocks per grid can be expressed either as a
one-dimensional scalar or as a ``Dim3`` value, which gives a
dimensionality in x, y and z. For example:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), (3, 3, 3), (3, 3, 3));

The above call will launch (3 * 3 * 3) * (3 * 3 * 3) = 729 instances of
the kernel on the device.

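The two forms can also be combined, under the assumption that each
parameter is interpreted independently. The following sketch (reusing
``My_Kernel`` and ``An_Array_Instance`` from above) requests
4 * 4 * 4 = 64 threads per block in a one-dimensional grid of 8 blocks,
i.e. 512 instances:

.. code-block:: ada

   --  Assumption: a Dim3 value and a scalar can be mixed in one launch.
   pragma CUDA_Execute (My_Kernel (An_Array_Instance), (4, 4, 4), 8);
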
Passing Data between Device and Host
====================================

Using Storage Model Aspect
--------------------------

Storage Model is an extension to the Ada language that is currently under
implementation. Discussion around the generic capability can be found
`here <https://github.com/AdaCore/ada-spark-rfcs/pull/76>`_.

GNAT for CUDA provides a storage model that maps to the CUDA primitives
for allocation, deallocation and copy. It is declared in the package
``CUDA.Storage_Models``. Users may use ``CUDA.Storage_Models.Model``
directly or create their own instances.

When a pointer type is associated with a CUDA storage model, memory
allocation happens on the device. This allocation can be a single
operation, or multiple allocations and copies, as is the case in GNAT for
unconstrained arrays. For example:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;

   Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);

Moreover, copies between host and device are instrumented to call the
proper CUDA memory copy operations, so the code can now be written:

.. code-block:: ada

   procedure Main is
      type Int_Array_Host_Access is access Int_Array;

      Host_Array   : Int_Array_Host_Access := new Int_Array (1 .. 100);
      Device_Array : Int_Array_Device_Access := new Int_Array'(Host_Array.all);
   begin
      pragma CUDA_Execute
        (Some_Kernel (Device_Array),
         Host_Array.all'Length,
         1);

      Host_Array.all := Device_Array.all;
   end Main;

On the kernel side, ``CUDA.Storage_Models.Model`` is implemented as the
native storage model (as opposed to the foreign device one seen from the
host), so ``Int_Array_Device_Access`` can be used directly:

.. code-block:: ada

   procedure Kernel (Device_Array : Int_Array_Device_Access) is
   begin
      Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
   end Kernel;

This is the intended way of sharing memory between device and host. Note
that the storage model can be extended to support capabilities such as
streaming or unified memory.

Using Unified Storage Model
---------------------------

An alternative to the default CUDA storage model is so-called unified
memory. With this memory model, the device memory is mapped directly onto
host memory, so no specific copy operation is necessary. The factors that
may lead to choosing one model or the other are outside the scope of this
manual. A specific model called ``Unified_Model`` can be used in place of
the default one:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Unified_Model;

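With unified memory, the same access value can then be used from both
sides without explicit copies. A fragment sketch, assuming the
declarations above, a kernel ``Some_Kernel`` written as in the previous
section, and ``Ada.Text_IO`` for output:

.. code-block:: ada

   Data : Int_Array_Device_Access := new Int_Array'(1 .. 100 => 0);
begin
   --  The host writes through the access value directly...
   Data (1) := 42;

   --  ...the device operates on the same memory...
   pragma CUDA_Execute (Some_Kernel (Data), Data.all'Length, 1);

   --  ...and the host reads the result back without any Memcpy call.
   Ada.Text_IO.Put_Line (Data (1)'Image);
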
Using Storage Model with Streams
--------------------------------

CUDA streams allow several operations to be launched in parallel, and make
it possible to specify which write and read operations an execution has to
wait for. The Ada CUDA API doesn't provide a pre-allocated stream memory
model. Instead, it provides a type that can be instantiated and for which
a specific stream can be specified, e.g.:

.. code-block:: ada

   My_Stream_Model : CUDA.Storage_Models.CUDA_Async_Storage_Model
     (Stream => Stream_Create);

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => My_Stream_Model;

Note that the value of the stream associated with a specific model can
vary over time, allowing different parts of a given object to be used by
different streams, e.g.:

.. code-block:: ada

   X        : Int_Array_Device_Access := new Int_Array (1 .. 10_000);
   Stream_1 : Stream_T := Stream_Create;
   Stream_2 : Stream_T := Stream_Create;
begin
   My_Stream_Model.Stream := Stream_1;
   X (1 .. 5_000) := (others => 0);

   My_Stream_Model.Stream := Stream_2;
   X (5_001 .. 10_000) := (others => 0);

Low-Level Data Transfer
-----------------------

At the lowest level, it is possible to allocate device memory using the
standard CUDA allocation function, bound as ``CUDA.Runtime_API.Malloc``.
E.g.:

.. code-block:: ada

   Device_Array : System.Address :=
     CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);

This is equivalent to the following CUDA code:

.. code-block:: c

   int *deviceArray;
   cudaMalloc ((void **) &deviceArray, sizeof (int) * 100);

Note that the objects on the Ada side aren't typed. Creating typed objects
requires more advanced Ada constructs that are described later.

The above example reserves space on the device for 100 integers, which can
now be used to perform copies back and forth with host memory. For
example:

.. code-block:: ada

   procedure Main is
      type Int_Array is array (Integer range <>) of Integer;
      type Int_Array_Access is access all Int_Array;

      Host_Array   : Int_Array_Access := new Int_Array (1 .. 100);
      Device_Array : System.Address :=
        CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);
   begin
      Host_Array.all := (others => 0);

      CUDA.Runtime_API.Memcpy
        (Dst   => Device_Array,
         Src   => Host_Array.all'Address,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Host_To_Device);

      pragma CUDA_Execute
        (Some_Kernel (Device_Array, Host_Array.all'Length),
         Host_Array.all'Length,
         1);

      CUDA.Runtime_API.Memcpy
        (Dst   => Host_Array.all'Address,
         Src   => Device_Array,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Device_To_Host);
   end Main;

The above copies the contents of ``Host_Array`` to ``Device_Array``,
performs some computation on the device, and then copies the memory back.
Note that at this level of data passing, we're not passing a typed array
but a raw address. On the kernel side, we need to reconstruct the array
with an overlay:

.. code-block:: ada

   procedure Kernel (Array_Address : System.Address; Length : Integer) is
      Device_Array : Int_Array (1 .. Length)
        with Address => Array_Address;
   begin
      Device_Array (Thread_IDx.X) := Device_Array (Thread_IDx.X) + 10;
   end Kernel;

While effective, this method of passing data back and forth is not very
satisfactory and should be reserved for cases where no alternative exists
(yet). In particular, typing is lost at the interface, and the developer
is left with manual means of verification.

Specifying Compilation Side
===========================

As with CUDA, a GNAT for CUDA application contains code that may be
compiled exclusively for the host, exclusively for the device, or for
both. By default, all code is compiled for both the host and the device.
Code can be identified as compilable only for the device with the
``CUDA_Device`` aspect:

.. code-block:: ada

   procedure Some_Device_Procedure
     with CUDA_Device;

The above procedure will not exist on the host; calling it from host code
will result in a compilation error.

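For instance (a sketch; ``Some_Host_Procedure`` is a hypothetical
host-side subprogram), the host compilation rejects such a call:

.. code-block:: ada

   procedure Some_Host_Procedure is
   begin
      Some_Device_Procedure;
      --  Compilation error on the host: the procedure only exists
      --  in the device compilation.
   end Some_Host_Procedure;
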
The corresponding ``CUDA_Host`` aspect is currently not implemented.

Accessing Block and Thread Indexes and Dimensions
=================================================

GNAT for CUDA® gives access to block and thread indexes and dimensions in
a way similar to CUDA. Notably, the package ``CUDA.Runtime_API`` declares
``Block_Dim``, ``Grid_Dim``, ``Block_IDx`` and ``Thread_IDx``, which map
directly to the corresponding PTX registers. For example:

.. code-block:: ada

   I : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

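Such a global index is typically used to pick the element a given kernel
instance works on. A sketch, assuming the ``Int_Array_Device_Access`` type
from the storage-model section and a launch that may create more instances
than there are elements:

.. code-block:: ada

   procedure Add_Ten (X : Int_Array_Device_Access)
     with CUDA_Global;

   procedure Add_Ten (X : Int_Array_Device_Access) is
      --  Global index across all blocks, as computed above, shifted by
      --  X'First since the Ada array need not start at 0.
      I : Integer :=
        Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X) + X.all'First;
   begin
      --  Guard against extra instances when the grid overshoots the array.
      if I <= X.all'Last then
         X (I) := X (I) + 10;
      end if;
   end Add_Ten;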