CUDA C/C++ Programming Notes
Overview
What is CUDA?
CUDA is a parallel computing architecture developed by NVIDIA that allows developers to leverage GPU acceleration for general-purpose computing, not just graphics.
Key points:
- Exposes GPU parallelism for general-purpose computing while retaining high performance.
- Based on industry-standard C/C++ with a small set of extensions, e.g. malloc() -> cudaMalloc() and memcpy() -> cudaMemcpy().
- Provides straightforward APIs to manage devices, memory, and kernels.
- Enables writing and launching CUDA kernels, managing GPU memory, and handling synchronization between the CPU (host) and GPU (device).
Difference between host and device:
- Host: CPU
- Device: GPU
Using __global__ to declare a function as device code:
- Executes on the device (GPU)
- Called from the host (CPU)
Heterogeneous Computing
Host and Device relationship
- Host: The CPU and its associated memory; executes serial portions of code.
- Device: The GPU and its dedicated memory; executes parallel portions of code efficiently.
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it, caching data on chip for performance.
3. Copy the results from GPU memory back to CPU memory.
Hello World
Host (CPU) Example
A simple C program running entirely on the CPU:
#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
- Standard C code executed by the CPU.
- Compiled using a standard C compiler such as gcc.
- Outputs Hello World! directly.
CUDA Device (GPU) Example
A basic CUDA C program that launches a kernel (device function) on the GPU:
#include <stdio.h>
// Device kernel (does nothing here)
__global__ void mykernel(void) {
}
int main(void) {
// Launch kernel on the GPU
mykernel<<<1, 1>>>();
// Host code continues
printf("Hello World!\n");
return 0;
}
- The __global__ keyword specifies a CUDA kernel (runs on the GPU, called from the CPU).
- Triple angle brackets <<< >>> specify the kernel launch configuration (here, 1 block with 1 thread).
- Requires NVIDIA's CUDA compiler (nvcc) to compile.
- The kernel launch does nothing visible here; real work happens inside device code.
- Still prints Hello World! from the CPU (host).
Addition on the Device
A basic CUDA C program performing addition on the GPU:
#include <stdio.h>
// Device kernel for addition
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
int main(void) {
int a = 2, b = 7, c;
int *d_a, *d_b, *d_c;
// Allocate memory on GPU
cudaMalloc((void **)&d_a, sizeof(int));
cudaMalloc((void **)&d_b, sizeof(int));
cudaMalloc((void **)&d_c, sizeof(int));
// Copy values to GPU
cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
// Launch kernel on GPU
add<<<1, 1>>>(d_a, d_b, d_c);
// Copy result back to CPU
cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
printf("Sum: %d\n", c);
return 0;
}
- __global__ defines a kernel function that executes on the GPU.
- Uses CUDA-specific functions for memory management (cudaMalloc, cudaMemcpy, cudaFree).
- The kernel launch syntax (<<<1, 1>>>) specifies the execution configuration: 1 block, 1 thread.
- Data must be explicitly transferred between the host (CPU) and the device (GPU).
- Compiled with NVIDIA's CUDA compiler (nvcc).
Vector Addition on the Device
Device Kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
- Each block handles one element of the addition.
- Blocks run in parallel on the GPU.
Example of Parallel Execution:
- Block 0: c[0] = a[0] + b[0];
- Block 1: c[1] = a[1] + b[1];
- Block 2: c[2] = a[2] + b[2];
- Block 3: c[3] = a[3] + b[3];
Host Code Example
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Allocate space for device copies
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Allocate space for host copies and setup input values
a = (int *)malloc(size); random_ints(a, N);  // random_ints(): helper that fills an array with random values (not shown)
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU with N blocks
add<<<N, 1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
- Allocates and manages memory explicitly.
- Each GPU block computes one addition in parallel.
- Data transferred between host and GPU explicitly.
Indexing
Efficient indexing is key to mapping threads to data elements in CUDA. In CUDA’s parallel programming model, each thread in a grid must be assigned a unique portion of the data. This section outlines the common strategies and formulas used for indexing in CUDA kernels.
Basic Indexing Concepts
CUDA organizes threads into blocks, and blocks form a grid. Each thread within a block is uniquely identified by threadIdx, while each block in the grid is identified by blockIdx.
Built-in Variables:
- threadIdx.x: The thread’s index within its block (in the x-dimension).
- blockIdx.x: The block’s index within the grid (in the x-dimension).
- blockDim.x: The number of threads per block (in the x-dimension); blockDim.x = 8 in the example below.
The Indexing Formula: When processing one-dimensional arrays, the unique global index for each thread is calculated using:
int index = threadIdx.x + blockIdx.x * blockDim.x;
Practical Example
Consider a kernel launch where each block contains 8 threads. For a thread with:
- threadIdx.x = 5
- blockIdx.x = 2
- blockDim.x = 8
the formula gives index = 5 + 2 * 8 = 21, so the thread processes element 21 of the array (the 22nd element, using zero-based indexing).
Multi-Dimensional Indexing
For higher-dimensional data, CUDA supports three-dimensional configurations:
2D Indexing:
When working with images or matrices, you may compute a unique index using both x and y dimensions:
int col = threadIdx.x + blockIdx.x * blockDim.x;
int row = threadIdx.y + blockIdx.y * blockDim.y;
int index = row * width + col; // 'width' is the number of columns
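As a hedged illustration (the matAdd kernel, the 16x16 block shape, and the d_a/d_b/d_c pointers are assumptions, not from the source notes), a 2D launch that uses this row/col indexing might look like:
// Hypothetical 2D kernel: element-wise matrix addition using the row/col index above
__global__ void matAdd(const float *a, const float *b, float *c, int width, int height) {
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    if (col < width && row < height) {          // guard against partial blocks
        int index = row * width + col;
        c[index] = a[index] + b[index];
    }
}
// Host-side launch with a 2D grid of 2D blocks (16x16 threads per block is an assumption)
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
matAdd<<<grid, block>>>(d_a, d_b, d_c, width, height);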
3D Indexing:
For volume data, the same approach extends to three dimensions using threadIdx.z, blockIdx.z, and blockDim.z.
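A minimal 3D sketch (the process_volume kernel and its parameters are illustrative, not from the source) showing how the three coordinates are linearized:
// Hypothetical 3D kernel: linearize (x, y, z) thread coordinates for a volume
__global__ void process_volume(float *vol, int width, int height, int depth) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int z = threadIdx.z + blockIdx.z * blockDim.z;
    if (x < width && y < height && z < depth) {
        int index = (z * height + y) * width + x;   // row-major linearization
        vol[index] *= 2.0f;                         // placeholder operation
    }
}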
Handling Arbitrary Vector Sizes
Often, the total number of data elements is not an exact multiple of the block size. To safely process all elements without accessing out-of-bound memory, include a bounds check in your kernel:
__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n) {
c[index] = a[index] + b[index];
}
}
- Kernel Launch:
The number of blocks is rounded up so that all n elements are covered. Note that blockDim.x is only available inside device code, so the host uses its own block-size constant (e.g. 512):
#define THREADS_PER_BLOCK 512
add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
Shared Memory, 1D Stencil, and __syncthreads()
Overview
CUDA provides a fast, user-managed on-chip memory called shared memory that all threads within a block can access. This section demonstrates how to leverage shared memory for a 1D stencil operation and explains the role of the synchronization primitive __syncthreads()
.
Shared Memory Basics
-
Definition:
Shared memory is declared with the__shared__
keyword. It is a limited, high-speed memory region accessible by all threads within the same block. -
Purpose:
It is used to cache data from global memory to reduce redundant memory accesses, thereby improving performance—especially in operations that reuse data, such as stencil computations.
The 1D Stencil Operation
A stencil operation processes each element of an input array by combining it with its neighboring elements. For example, with a radius of 3, each output element is computed by summing 7 consecutive elements from the input array.
Example: 1D Stencil Kernel
#define BLOCK_SIZE 16
#define RADIUS 3
__global__ void stencil_1d(int *in, int *out) {
// Declare shared memory with additional space for halo elements
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
// Calculate global and local indices
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Load the main data element into shared memory
temp[lindex] = in[gindex];
// Load halo elements (neighbors) needed for the stencil
// (note: for the first and last blocks these reads fall outside the input
//  array unless it is padded with RADIUS elements on each side or the
//  loads are bounds-checked; see Practical Considerations below)
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize to ensure all shared memory loads are complete
__syncthreads();
// Apply the stencil: sum the element and its neighbors
int result = 0;
for (int offset = -RADIUS; offset <= RADIUS; offset++) {
result += temp[lindex + offset];
}
// Write the result back to global memory
out[gindex] = result;
}
How the Kernel Works
- Data Loading:
Each thread loads one data element from global memory into a designated location in the shared memory array (temp). Threads with an index less than RADIUS also load extra "halo" elements from the boundaries. These halo elements provide the necessary neighboring data for the stencil computation.
- Synchronization with __syncthreads():
The call to __syncthreads() is essential because it ensures that all threads have finished loading their data into shared memory before any thread begins processing. This barrier prevents data races, ensuring that every thread sees the correct, complete data.
- Stencil Computation:
After synchronization, each thread applies the stencil by summing the element at its local index (lindex) along with its surrounding elements within the radius. The result is then written back to global memory.
The Role of __syncthreads()
- Barrier for Threads:
__syncthreads() acts as a barrier that all threads in a block must reach before any can proceed. This is crucial when threads depend on data loaded by other threads.
- Preventing Data Hazards:
It prevents race conditions by ensuring that shared memory writes are completed before any thread begins reading the data for computation.
- Uniform Execution Requirement:
All threads within a block must execute the same __syncthreads() call (i.e., it should not be placed inside a condition that may evaluate differently across threads) to avoid deadlocks, as illustrated in the sketch after this list.
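To make the uniform-execution rule concrete, here is a minimal illustrative sketch (not from the source notes) contrasting a divergent barrier with a correct one:
// Hypothetical kernel contrasting a divergent barrier (deadlock risk) with a correct one
__global__ void barrier_example(int *data) {
    // BAD: if this were enabled, only threads with threadIdx.x < 32 would reach
    // the barrier, leaving the rest of the block waiting forever.
    // if (threadIdx.x < 32) { __syncthreads(); }

    // OK: every thread in the block executes the same barrier, then diverges.
    __syncthreads();
    if (threadIdx.x < 32) {
        data[threadIdx.x] += 1;   // work performed by only some threads
    }
}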
Practical Considerations
- Handling Boundaries:
When the total number of elements does not divide evenly into blocks (or when halo loads reach past the ends of the input), additional boundary checks are necessary to prevent out-of-bounds memory access; see the sketch after this list.
- Performance Optimization:
Efficient use of shared memory can significantly reduce global memory accesses, especially in stencil operations where the same data is reused by multiple threads.
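As a sketch of the boundary handling described above (reusing BLOCK_SIZE and RADIUS from the kernel earlier; the zero-padding choice and the extra n parameter are assumptions, not part of the original kernel), the loads can be guarded like this:
__global__ void stencil_1d_bounded(int *in, int *out, int n) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Load the main element, substituting 0 when the index is out of range
    temp[lindex] = (gindex < n) ? in[gindex] : 0;

    // Load halo elements with the same bounds check
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = (gindex >= RADIUS) ? in[gindex - RADIUS] : 0;
        temp[lindex + BLOCK_SIZE] = (gindex + BLOCK_SIZE < n) ? in[gindex + BLOCK_SIZE] : 0;
    }

    __syncthreads();

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++) {
        result += temp[lindex + offset];
    }

    if (gindex < n) {
        out[gindex] = result;
    }
}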
Managing the Device
Overview
CUDA applications must manage GPU devices effectively to ensure optimal performance, especially in systems with multiple GPUs. This section describes how to query available devices, select the appropriate device for execution, retrieve device properties, and perform advanced operations like peer-to-peer memory copies.
Querying and Selecting Devices
- Querying Devices:
Use cudaGetDeviceCount(int *count) to determine the number of CUDA-capable GPUs in the system. This helps in dynamically adjusting your application to the available hardware.
- Selecting a Device:
Once you know the available devices, use cudaSetDevice(int device) to select the GPU on which your kernels will run. This is essential in multi-GPU systems where you want to distribute workloads.
- Getting the Current Device:
Use cudaGetDevice(int *device) to find out which device is currently active.
- Retrieving Device Properties:
To get detailed information about a GPU, such as its memory size, number of registers, clock rate, and compute capability, use:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);
This information can be used to optimize your code for the specific hardware.
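A small hedged sketch of these query calls in action (the selection of printed fields and the formatting are illustrative choices):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }

    if (count > 0) {
        cudaSetDevice(0);   // select the first device for subsequent CUDA calls
    }
    return 0;
}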
Advanced Device Management
- Multi-threading and Device Sharing:
Multiple host threads can share a single GPU, and a single host thread can manage multiple GPUs by switching between them with successive calls to cudaSetDevice.
- Peer-to-Peer Memory Copies:
CUDA supports direct memory copies between GPUs with cudaMemcpy(...), provided that the operating system and the GPUs support peer-to-peer transfers. This can further improve performance by bypassing host memory for inter-device communication; a hedged sketch follows this list.
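As a hedged sketch of a peer-to-peer copy (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, and cudaMemcpyPeer go beyond the cudaMemcpy call mentioned above; the device numbers, pointers, and size are assumptions):
// Assumes d_src resides on device 0, d_dst on device 1, and size bytes to copy
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 access device 1?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);          // flags argument must currently be 0
    cudaMemcpyPeer(d_dst, 1, d_src, 0, size);  // direct GPU-to-GPU copy
}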
Synchronization and Error Handling
- Synchronizing Host and Device:
Kernel launches are asynchronous by default. Use cudaDeviceSynchronize() to block the host until all preceding CUDA calls have completed. This is crucial before copying results back to the host or when precise timing is required.
- Error Reporting:
Every CUDA API call returns an error code. After making an API call, check for errors using:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
printf("CUDA Error: %s\n", cudaGetErrorString(err));
}
This practice helps in diagnosing issues related to device management and kernel execution.
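One common pattern (the CUDA_CHECK macro name and its exact formatting are illustrative, not a CUDA API) wraps this check so it can be applied to every call, including kernel launches:
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical convenience macro: report the failing CUDA call's file and line
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            printf("CUDA Error %s at %s:%d\n",                         \
                   cudaGetErrorString(err_), __FILE__, __LINE__);      \
        }                                                              \
    } while (0)

// Usage (d_a, d_b, d_c, size, and the add kernel come from the earlier examples):
// CUDA_CHECK(cudaMalloc((void **)&d_a, size));
// add<<<1, 1>>>(d_a, d_b, d_c);
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised during kernel execution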
Conclusion
Managing the device in CUDA involves more than simply launching kernels. By querying device capabilities, selecting the appropriate GPU, and handling synchronization and errors properly, you ensure that your application fully leverages the available hardware resources for optimal performance.