CUDA
1 Concepts
- thread
- thread block, consists of warps, executed on SM(Streaming Multiprocessor)
- warp, is a 32 thread block. A warp is executed physically in parellel (SIMD, Single Instruction, Multiple Data)
- Tensor core, a matrix-multiply engine that takes operands in registers, warp-wide
- No register in Tensor core (???), can be accessed via CUDA C++ or PTX
- NVLINK C2C, a higher speed PCIe between CPU and GPU
- Unified Memory, demanding paging between host and devices, still need move data between CPU and GPU, but more rapidly
- NUMBA, python for CUDA
2 GPU Latency Hiding
- One of the fundamental designs which differenciate GPU from CPUs
3 CUDA Compatibility
- Code needs to be run on future GPUS “without recompilation”
- Complied to PTX, and to SASS executed on GPU
- “Maybe” we don’