-- COMPUTE UNIFIED DEVICE ARCHITECTURE
-- Used to expose the computational horsepower of NVIDIA GPUs for GPU Computing
-- It is scalable across any number of threads
Based on industry-standard C
Small set of extensions to C language
Low learning curve
Straightforward APIs to manage devices, memory, etc.
- Host -The CPU and its memory
- Device - The GPU and its memory
- Kernel - Function compiled for the device; it is executed on the device by many threads
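A minimal sketch tying the three terms together (the kernel name addOne and the launch configuration are illustrative assumptions, not from the source): a kernel looks like an ordinary C function marked `__global__`; the host launches it, and the device runs one copy of it per thread.

```cuda
// Kernel: compiled for the device, executed on the device by many threads.
__global__ void addOne(int *d_data)   // d_data lives in device (GPU) memory
{
    int i = threadIdx.x;              // each thread gets its own index
    d_data[i] += 1;                   // each thread updates one element
}

// Host-side launch (runs on the CPU); <<<blocks, threads>>> is CUDA's
// kernel-launch syntax. Here one block of 128 device threads runs addOne:
//     addOne<<<1, 128>>>(d_data);
```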
You (probably) need experience with C or C++
You do not need any GPU experience
You do not need any graphics experience
You do not need any parallel programming experience
Ø Data Parallelism - Program property in which many arithmetic operations can be performed on data structures simultaneously and independently.
A 1,000 x 1,000 matrix multiplication yields
1,000,000 independent dot products,
each requiring 1,000 multiply and 1,000 add arithmetic operations.
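The matrix example above can be sketched as a CUDA kernel in which each thread computes exactly one of the 1,000,000 independent dot products (the names matMul and WIDTH are illustrative assumptions):

```cuda
#define WIDTH 1000   // 1,000 x 1,000 matrices

// One thread per output element C[row][col], i.e. one thread per
// independent dot product.
__global__ void matMul(const float *A, const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < WIDTH && col < WIDTH) {
        float sum = 0.0f;
        for (int k = 0; k < WIDTH; ++k)   // 1,000 multiplies + 1,000 adds
            sum += A[row * WIDTH + k] * B[k * WIDTH + col];
        C[row * WIDTH + col] = sum;       // one of 1,000,000 dot products
    }
}
```

Because every dot product is independent, the GPU is free to run as many of them in parallel as it has cores for.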
Ø Thread Creation: CUDA threads are much lighter weight than CPU threads.
They take only a few cycles to generate and schedule, thanks to efficient hardware support.
CPU threads, by contrast, typically take thousands of clock cycles to generate and schedule.
Ø It avoids the performance overhead of graphics-layer APIs by compiling software directly for the hardware (GPU assembly language).
Example of CUDA processing flow
Ø Copy data from main memory to GPU memory
Ø The CPU instructs the GPU to run the process (launches the kernel)
Ø The GPU executes the kernel in parallel on each core
Ø Copy the result from GPU memory to main memory
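The four steps above can be sketched as a complete vector-add program (all names — vecAdd, N, the h_/d_ prefixes for host and device pointers — are illustrative assumptions, not from the source):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024

__global__ void vecAdd(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

int main(void)
{
    size_t bytes = N * sizeof(float);
    float *h_a = (float *)malloc(bytes);      // host (CPU) memory
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;                   // device (GPU) memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Step 1: copy data from main memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: the CPU launches the kernel on the GPU
    // Step 3: the GPU runs one thread per element, in parallel
    vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c);

    // Step 4: copy the result from GPU memory back to main memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("h_c[1] = %f\n", h_c[1]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```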
How to Write
Ø Create or edit the CUDA program with your favorite editor. Note: CUDA C language programs have the suffix ".cu".
Ø Compile the program with nvcc to create the executable.
Ø Run the executable.
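Assuming the CUDA toolkit is installed and the source file is named vecadd.cu (the file name is an assumption), the last two steps look like:

```shell
nvcc -o vecadd vecadd.cu   # compile the .cu source with nvcc
./vecadd                   # run the executable
```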