**************************************
Programming with GNAT for CUDA®
**************************************

CUDA API
========

The CUDA API available from GNAT for CUDA® is a binding to the CUDA API
provided by NVIDIA. The NVIDIA API is installed with the CUDA driver. You
access the Ada API by adding a reference to :file:`cuda_host.gpr` (on the
host) and :file:`cuda_device.gpr` (on the target). The initial
installation script generates the Ada version of the API from the CUDA
version that's installed on your system.

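For example, a host-side project might pull in the binding as follows (a
minimal sketch: the project name and settings are illustrative, and
:file:`cuda_host.gpr` must be reachable on your project path):

.. code-block:: ada

   with "cuda_host.gpr";

   project Host_App is
      for Source_Dirs use ("src");
      for Main use ("main.adb");
   end Host_App;
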
Two versions of the Ada API are available:

- a "thick" binding version. These are child units of the :code:`CUDA`
  package, the main one being :code:`CUDA.Runtime_API`. This is the API
  you will most likely use. However, this API is still in the process of
  being completed, and a number of types and subprogram specs have not yet
  been mapped to higher-level Ada constructs. For example, you will still
  see many references to :code:`System.Address` where you would normally
  expect specific access types in Ada. A small example of calling the
  thick binding follows this list.

- a "thin" binding version. These units are typically identified by a
  suffix of :code:`_h`. They are direct bindings to the underlying C
  APIs. These bindings are functional and complete. They can be used as a
  low-level alternative to the thick binding, but they don't expose an
  interface consistent with the Ada programming style and may require more
  work to use.

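For instance, a direct use of the thick binding might look like this
minimal sketch; :code:`CUDA.Runtime_API.Malloc` appears later in this
manual, while :code:`Free` is assumed here to be its deallocation
counterpart:

.. code-block:: ada

   with System;
   with CUDA.Runtime_API;

   procedure Show_Thick_Binding is
      --  The thick binding still returns a raw System.Address here
      Device_Ptr : constant System.Address :=
        CUDA.Runtime_API.Malloc (4096);
   begin
      CUDA.Runtime_API.Free (Device_Ptr);  --  assumed counterpart of Malloc
   end Show_Thick_Binding;
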
You can regenerate these bindings at any time. You may want to do this, for
example, if you install a new version of CUDA. To regenerate these
bindings, execute the :file:`bind.sh` script located in :file:`<your GNAT
for CUDA installation>/cuda/api/`.

Defining and Calling Kernels
============================

Just as in a typical CUDA program, programming in GNAT for CUDA requires
you to identify application entry points to the GPU code, called
"kernels". In Ada, you do this by annotating a procedure with the
:code:`CUDA_Global` aspect, which serves the same role as the CUDA
:code:`__global__` modifier. For example:

.. code-block:: ada

   procedure My_Kernel (X : Some_Array_Access)
     with CUDA_Global;

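The body of a kernel is written like any other procedure body. For
illustration, here is a minimal sketch (the element type behind
:code:`Some_Array_Access` and the index mapping are assumptions made for
this example):

.. code-block:: ada

   procedure My_Kernel (X : Some_Array_Access) is
      --  Thread indexes are 0-based; map them onto the array's range
      I : constant Integer := Integer (Thread_IDx.X) + X.all'First;
   begin
      X (I) := X (I) * 2;  --  each thread updates one element
   end My_Kernel;
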
Kernels are compiled both for the host and the device. They can be called
as regular procedures, e.g.:

.. code-block:: ada

   My_Kernel (An_Array_Instance);

The above makes a regular, single-threaded call to the kernel and executes
it on the host. You may want to do this, for example, to take advantage of
the richer debugging environment available on the host.

To call a kernel on the device (which means copying it to the device and
executing it there), you use the :code:`CUDA_Execute` pragma:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), 10, 1);

The procedure call looks the same as a regular call, but this call is
surrounded by the pragma :code:`CUDA_Execute`, which has two extra
parameters defining, respectively, the number of blocks per grid and the
number of threads per block. This is equivalent to the CUDA call:

.. code-block:: c

   myKernel<<<10, 1>>> (someArray);

In each case, these calls launch ten instances of the kernel on the device
(ten blocks of one thread each).

The number of blocks per grid and the number of threads per block can each
be expressed either as a scalar or as a :code:`Dim3` value that specifies
all three dimensions (:code:`x`, :code:`y`, and :code:`z`). For example:

.. code-block:: ada

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), (3, 3, 3), (3, 3, 3));

The above call launches (3 * 3 * 3) * (3 * 3 * 3) = 729 instances of the
kernel on the device.

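The :code:`Dim3` values can also be named, which helps readability for
larger launches; a minimal sketch (where exactly :code:`Dim3` is declared
depends on the generated bindings, so its visibility here is an
assumption):

.. code-block:: ada

   Blocks  : constant Dim3 := (2, 2, 1);   --  4 blocks per grid
   Threads : constant Dim3 := (16, 16, 1); --  256 threads per block

   pragma CUDA_Execute (My_Kernel (An_Array_Instance), Blocks, Threads);

This launch runs 4 * 256 = 1024 kernel instances.
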
Passing Data between Device and Host
====================================

Using Storage Model Aspect
--------------------------

"Storage Model" is an extension to the Ada language that is currently under
development. A general description of this capability can be found `here
<https://github.com/AdaCore/ada-spark-rfcs/blob/master/prototyped/storage_model_2.rst>`_.

GNAT for CUDA provides a storage model that maps to CUDA primitives for
allocation, deallocation, and copying. The model is declared in the package
:code:`CUDA.Storage_Models`. You may either use
:code:`CUDA.Storage_Models.Model` itself or create your own.

When a pointer type is associated with a CUDA storage model, memory
allocation through that pointer occurs on the device, in the same manner as
it would on the host if no storage model were specified. For example:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Model;

   --  This allocation happens in device memory
   Device_Array : Int_Array_Device_Access := new Int_Array (1 .. 100);

In addition to allocation being done on the device, copies between the host
and the device are converted into calls to the CUDA memory copy operations.
So you can write:

.. code-block:: ada

   procedure Main is
      type Int_Array_Host_Access is access Int_Array;

      Host_Array : Int_Array_Host_Access := new Int_Array (1 .. 100);

      --  Allocates on the device and copies Host_Array's contents to it
      Device_Array : Int_Array_Device_Access := new Int_Array'(Host_Array.all);
   begin
      --  Launch one kernel instance per array element
      pragma CUDA_Execute (
         Some_Kernel (Device_Array),
         Host_Array.all'Length,
         1);

      --  Copy the results back from device memory to host memory
      Host_Array.all := Device_Array.all;
   end Main;

On the kernel side, :code:`CUDA.Storage_Models.Model` is the native storage
model (as opposed to a foreign, device-side model as seen from the host).
You can therefore use :code:`Int_Array_Device_Access` directly:

.. code-block:: ada

   procedure Kernel (Device_Array : Int_Array_Device_Access) is
      --  Thread indexes are 0-based while the array starts at 1
      I : constant Integer := Integer (Thread_IDx.X) + 1;
   begin
      Device_Array (I) := Device_Array (I) + 10;
   end Kernel;

This is the recommended way of sharing memory between device and host. The
storage model can also be extended to support capabilities such as streams
or unified memory, described in the following sections.

Using Unified Storage Model
---------------------------

An alternative to the default CUDA storage model is to use so-called
"unified memory". In that model, the device memory is mapped directly onto
host memory, so no explicit copy operation is necessary. The factors that
may lead you to choose one model over the other are outside the scope of
this manual. To use unified memory, you use :code:`Unified_Model` instead
of the default model:

.. code-block:: ada

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => CUDA.Storage_Models.Unified_Model;

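With unified memory, the host can then read the data a kernel wrote
without any explicit copy. A minimal sketch, reusing :code:`Some_Kernel`
from the earlier example (a real program may also need to synchronize with
the device before reading the results):

.. code-block:: ada

   procedure Use_Unified is
      Data         : Int_Array_Device_Access := new Int_Array'(1 .. 100 => 0);
      First_Result : Integer;
   begin
      pragma CUDA_Execute (Some_Kernel (Data), Data.all'Length, 1);

      --  No copy back to a host array is needed with unified memory:
      --  the host reads the same memory the kernel wrote
      First_Result := Data (1);
   end Use_Unified;
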
Using Storage Model with Streams
--------------------------------

CUDA streams allow you to launch several computations in parallel. This
model lets you specify which computations a given write or read operation
must wait for. The Ada CUDA API doesn't provide a pre-allocated
stream-aware storage model. Instead, it provides a type,
:code:`CUDA_Async_Storage_Model`, of which you declare an object,
specifying the stream to use:

.. code-block:: ada

   My_Stream_Model : CUDA.Storage_Models.CUDA_Async_Storage_Model :=
     (Stream => Stream_Create);

   type Int_Array is array (Integer range <>) of Integer;

   type Int_Array_Device_Access is access Int_Array
     with Designated_Storage_Model => My_Stream_Model;

The data stream associated with a specific model can vary over time,
allowing different parts of a given object to be used by different streams,
e.g.:

.. code-block:: ada

   X        : Int_Array_Device_Access := new Int_Array (1 .. 10_000);
   Stream_1 : Stream_T := Stream_Create;
   Stream_2 : Stream_T := Stream_Create;
   begin
      --  Operations on X now go through Stream_1
      My_Stream_Model.Stream := Stream_1;
      X (1 .. 5_000) := (others => 0);

      --  Subsequent operations on X go through Stream_2
      My_Stream_Model.Stream := Stream_2;
      X (5_001 .. 10_000) := (others => 0);

Because :code:`X` was allocated through :code:`My_Stream_Model`, each
operation on :code:`X` is issued on whatever stream
:code:`My_Stream_Model.Stream` designates at that moment. In the example
above, the two halves of :code:`X` are therefore initialized through two
different streams, allowing the two operations to proceed concurrently.

Low-Level Data Transfer
-----------------------

At the lowest level, you can allocate device memory using the standard CUDA
allocation function, bound in Ada as :code:`CUDA.Runtime_API.Malloc`.
E.g.:

.. code-block:: ada

   --  'Size is in bits, so divide by 8 to get the size in bytes
   Device_Array : System.Address :=
     CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);

This is equivalent to the following CUDA code:

.. code-block:: c

   int *deviceArray;
   cudaMalloc ((void **) &deviceArray, sizeof (int) * 100);

In this example, objects on the Ada side aren't typed. Creating typed
objects requires more advanced Ada constructions that are described later.

The above statement creates space in device memory for 100 integers. That
space can now be used to perform copies back and forth between host and
device memory. For example:

.. code-block:: ada

   procedure Main is
      type Int_Array is array (Integer range <>) of Integer;
      type Int_Array_Access is access all Int_Array;

      Host_Array : Int_Array_Access := new Int_Array (1 .. 100);

      --  'Size is in bits; divide by 8 to get a size in bytes
      Device_Array : System.Address :=
        CUDA.Runtime_API.Malloc (Integer'Size / 8 * 100);
   begin
      Host_Array.all := (others => 0);

      --  Copy the host array into the device buffer
      CUDA.Runtime_API.Memcpy
        (Dst   => Device_Array,
         Src   => Host_Array.all'Address,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Host_To_Device);

      --  Launch one kernel instance per array element
      pragma CUDA_Execute (
         Some_Kernel (Device_Array, Host_Array.all'Length),
         Host_Array.all'Length,
         1);

      --  Copy the results back into the host array
      CUDA.Runtime_API.Memcpy
        (Dst   => Host_Array.all'Address,
         Src   => Device_Array,
         Count => Host_Array.all'Size / 8,
         Kind  => Memcpy_Device_To_Host);
   end Main;

This code copies the contents of :code:`Host_Array` to
:code:`Device_Array`, performs some computations on that data on the
device, and then copies the data back. At this level of coding, we're not
passing a typed array but instead a raw address. On the kernel side, we
need to reconstruct the array with an overlay:

.. code-block:: ada

   procedure Kernel (Array_Address : System.Address; Length : Integer) is
      --  A 0-based view lets the 0-based thread indexes map directly
      Device_Array : Int_Array (0 .. Length - 1)
        with Address => Array_Address;
      I : constant Integer := Integer (Thread_IDx.X);
   begin
      Device_Array (I) := Device_Array (I) + 10;
   end Kernel;

While it works, this method of passing data back and forth is not very
satisfactory, and you should reserve it for cases where an alternative
doesn't exist, or doesn't exist yet. In particular, typing is lost at the
interface, so you need to check type correctness manually.

Specifying Where Code is For
============================

Like in CUDA, a GNAT for CUDA application contains code that may be
compiled exclusively for the host, for the device, or for both. By default,
all code is compiled for both the host and the device. You can mark code as
being compilable only for the device by using the :code:`CUDA_Device`
aspect:

.. code-block:: ada

   procedure Some_Device_Procedure
     with CUDA_Device;

:code:`Some_Device_Procedure` does not exist on the host; calling it from
host code results in a compilation error.

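For example, a subprogram that reads the thread index registers is only
meaningful in device code, so it can be restricted with
:code:`CUDA_Device`; a minimal sketch (the subprogram name is
illustrative):

.. code-block:: ada

   --  Thread_IDx maps to a PTX register and only exists on the device
   procedure Add_Thread_Offset (X : in out Integer)
     with CUDA_Device;

   procedure Add_Thread_Offset (X : in out Integer) is
   begin
      X := X + Integer (Thread_IDx.X);
   end Add_Thread_Offset;
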
The corresponding :code:`CUDA_Host` aspect is currently not implemented.

Accessing Block and Thread Indexes and Dimensions
=================================================

GNAT for CUDA® allows you to access block and thread indexes and
dimensions in a way that's similar to CUDA. The package
:code:`CUDA.Runtime_API` declares :code:`Block_Dim`, :code:`Grid_Dim`,
:code:`Block_IDx`, and :code:`Thread_IDx`, which map directly to the
corresponding PTX registers. For example:

.. code-block:: ada

   --  Global thread index along the X dimension
   J : Integer := Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);

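Putting this together, a kernel can use the global index to choose the
element each thread works on. A minimal sketch reusing the types from the
storage model examples (the guard protects against launches with more
threads than elements; the +1 accounts for the array's 1-based indexing):

.. code-block:: ada

   procedure Scale (X : Int_Array_Device_Access)
     with CUDA_Global;

   procedure Scale (X : Int_Array_Device_Access) is
      J : constant Integer :=
        Integer (Block_Dim.X * Block_IDx.X + Thread_IDx.X);
   begin
      if J + 1 in X.all'Range then
         X (J + 1) := X (J + 1) * 2;
      end if;
   end Scale;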