CUDA


1 Concepts

  • thread
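  • thread block, a group of threads that is split into warps and executed on one SM (Streaming Multiprocessor)
  • warp, a group of 32 threads. The threads of a warp execute physically in parallel, SIMD-style (Single Instruction, Multiple Data); NVIDIA calls this model SIMT (Single Instruction, Multiple Threads)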
  • Tensor Core, a matrix-multiply engine that takes its operands from registers, warp-wide
  • Tensor Cores have no registers of their own (???); they can be programmed via CUDA C++ or PTX
  • NVLink-C2C, a chip-to-chip interconnect between CPU and GPU with higher bandwidth than PCIe
  • Unified Memory, demand paging between host and device: data still has to move between CPU and GPU, but the runtime migrates pages automatically instead of relying on explicit copies
  • Numba, a Python JIT compiler with CUDA support
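
To make the thread / thread block / warp relationship concrete, here is a small host-side sketch in plain Python (no GPU needed). It mirrors the usual CUDA convention that threads in a block are numbered with x varying fastest and that consecutive groups of 32 thread IDs form a warp. The function names are made up for illustration and are not part of any CUDA or Numba API.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def linear_thread_id(thread_idx, block_dim):
    """Flatten a 3D thread index (x, y, z) inside a block; x varies fastest."""
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_and_lane(thread_idx, block_dim):
    """Return (warp index within the block, lane index within the warp)."""
    tid = linear_thread_id(thread_idx, block_dim)
    return tid // WARP_SIZE, tid % WARP_SIZE


def warps_per_block(block_dim):
    """Warps needed for one block; a partially filled last warp still counts."""
    threads = block_dim[0] * block_dim[1] * block_dim[2]
    return (threads + WARP_SIZE - 1) // WARP_SIZE


if __name__ == "__main__":
    block = (16, 16, 1)                     # 256 threads
    print(warp_and_lane((0, 2, 0), block))  # thread with linear id 32
    print(warps_per_block(block))
    print(warps_per_block((100, 1, 1)))     # 100 threads, last warp partly empty
```

With a 16x16 block, thread (0, 2, 0) has linear ID 32, so it is lane 0 of warp 1. A block of 100 threads still occupies 4 warps, which is one reason block sizes are usually chosen as multiples of 32.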

2 GPU Latency Hiding

  • One of the fundamental design choices that differentiate GPUs from CPUs
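
A rough way to see why this matters: an SM keeps many warps resident and issues from whichever warp is ready, so a warp stalled on memory costs nothing as long as enough other warps have work. The sketch below is a toy model, not a description of any real scheduler. It assumes every warp alternates between a fixed number of busy cycles and a fixed stall, and the cycle counts are hypothetical numbers chosen only for illustration.

```python
def warps_to_hide_latency(stall_cycles, busy_cycles_per_warp):
    """Warps needed so that some warp is always ready to issue (toy model).

    While one warp stalls for `stall_cycles`, the other warps must supply
    at least that many busy cycles between them.
    """
    others = -(-stall_cycles // busy_cycles_per_warp)  # integer ceiling
    return 1 + others


def issue_utilization(resident_warps, stall_cycles, busy_cycles_per_warp):
    """Fraction of cycles in which the scheduler has something to issue."""
    period = busy_cycles_per_warp + stall_cycles
    return min(1.0, resident_warps * busy_cycles_per_warp / period)


if __name__ == "__main__":
    STALL = 400  # hypothetical memory latency in cycles
    BUSY = 20    # hypothetical cycles of work between two memory stalls
    needed = warps_to_hide_latency(STALL, BUSY)
    print(needed)
    print(issue_utilization(4, STALL, BUSY))
    print(issue_utilization(needed, STALL, BUSY))
```

Under these assumed numbers, 21 resident warps keep the scheduler fully busy, while 4 warps leave it idle most of the time. A CPU attacks the same problem differently, with large caches and out-of-order execution on a few threads.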

3 CUDA Compatibility

  • Code needs to run on future GPUs “without recompilation”
  • Compiled to PTX, and then to SASS, which is what the GPU executes
  • “Maybe” we don’
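
The PTX/SASS split can be summarised as two rules, sketched here in plain Python. SASS built for compute capability X.y runs only on devices X.z with z >= y, while PTX built for some compute capability can be JIT-compiled by the driver for any device of equal or higher capability. This is a simplified model that ignores architecture-specific targets, and the function names and the `(kind, capability)` list format are invented for this example.

```python
def sass_runs_on(built_for, device):
    """SASS is binary compatible only within one major version, minor >= build."""
    return device[0] == built_for[0] and device[1] >= built_for[1]


def ptx_runs_on(built_for, device):
    """PTX can be JIT-compiled for any device of equal or higher capability."""
    return device >= built_for  # tuple comparison: major first, then minor


def can_run(embedded_code, device):
    """True if any embedded image, SASS or PTX, is usable on `device`.

    `embedded_code` is a list of (kind, (major, minor)) pairs, kind being
    "sass" or "ptx" (a hypothetical description of what a binary ships).
    """
    for kind, built_for in embedded_code:
        if kind == "sass" and sass_runs_on(built_for, device):
            return True
        if kind == "ptx" and ptx_runs_on(built_for, device):
            return True
    return False


if __name__ == "__main__":
    sass_only = [("sass", (8, 0))]
    sass_and_ptx = [("sass", (8, 0)), ("ptx", (8, 0))]
    print(can_run(sass_only, (8, 6)))     # same major, newer minor
    print(can_run(sass_only, (9, 0)))     # newer major, no PTX to fall back on
    print(can_run(sass_and_ptx, (9, 0)))  # PTX gets JIT-compiled
```

In this model, shipping PTX alongside SASS is what lets a binary run on a future GPU without recompiling the source: the driver compiles the PTX to that GPU's SASS at load time.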
