CUDA

1 Concepts

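thread, the smallest unit of execution; each thread runs the kernel on its own data
thread block, a group of threads that is split into warps and executed on one SM (Streaming Multiprocessor)
warp, a group of 32 threads. The threads of a warp physically execute the same instruction in parallel on different data (SIMD, Single Instruction, Multiple Data; NVIDIA calls its variant SIMT, Single Instruction, Multiple Threads)
Tensor Core, a matrix-multiply engine that takes its operands from registers and is used warp-wide
Tensor Cores appear to have no registers of their own (unconfirmed); they are programmed through CUDA C++ (for example the WMMA API) or PTX
NVLink-C2C, a chip-to-chip interconnect between CPU and GPU that plays the role of PCIe at higher bandwidth
Unified Memory, demand paging between host and device; data still has to move between CPU and GPU, but pages migrate on demand, and faster over a link such as NVLink-C2C
Numba, a Python JIT compiler that can target CUDA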
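
To make the thread / block / warp hierarchy concrete, here is a minimal sketch in which every thread records the warp and lane it belongs to. The kernel name `whereAmI`, the helpers `warpIndex` and `laneIndex`, and the launch sizes (2 blocks of 64 threads) are illustrative assumptions, not anything from the notes above. The buffers are allocated with `cudaMallocManaged`, so the sketch also shows Unified Memory in its simplest form.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp and lane of a thread inside its block, for a given warp size.
// (Hypothetical helpers, usable on both host and device.)
__host__ __device__ inline int warpIndex(int tid, int ws) { return tid / ws; }
__host__ __device__ inline int laneIndex(int tid, int ws) { return tid % ws; }

// Each thread writes its own warp and lane index to global memory.
__global__ void whereAmI(int *warpOf, int *laneOf, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (gid < n) {
        warpOf[gid] = warpIndex(threadIdx.x, warpSize);  // warpSize is a built-in
        laneOf[gid] = laneIndex(threadIdx.x, warpSize);
    }
}

int main() {
    const int threadsPerBlock = 64;  // illustrative: two warps per block
    const int blocks = 2;
    const int n = blocks * threadsPerBlock;

    int *warpOf = nullptr, *laneOf = nullptr;
    // Unified Memory: the same pointers are valid on host and device.
    if (cudaMallocManaged(&warpOf, n * sizeof(int)) != cudaSuccess ||
        cudaMallocManaged(&laneOf, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "allocation failed (is a CUDA device present?)\n");
        return 1;
    }

    whereAmI<<<blocks, threadsPerBlock>>>(warpOf, laneOf, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel launch or execution failed\n");
        return 1;
    }

    // Print one sample thread per half-warp.
    for (int gid = 0; gid < n; gid += 16)
        printf("global %3d  block %d  warp %d  lane %2d\n",
               gid, gid / threadsPerBlock, warpOf[gid], laneOf[gid]);

    cudaFree(warpOf);
    cudaFree(laneOf);
    return 0;
}
```

Built with something like `nvcc where_am_i.cu -o where_am_i` (the file name is arbitrary), the output should show the warp index restarting at 0 in each block, since warps are formed inside a block.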

2 GPU Latency Hiding

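One of the fundamental designs that differentiate GPUs from CPUs. Each SM keeps many warps resident and switches to a warp that is ready whenever another one stalls on memory, so latency is hidden by parallelism rather than by large caches.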
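
Latency hiding depends on how many warps an SM can keep resident, because a stalled warp can only be covered by another warp that is ready to run. The sketch below asks the runtime how many blocks of a kernel fit on one SM and converts that to warps. The `saxpy` kernel, the helper `warpsPerBlock`, and the block size of 256 are assumptions chosen for illustration; the actual numbers depend on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A small memory-bound kernel, used only as the subject of the query.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Number of warps needed to hold blockSize threads (rounded up).
// (Hypothetical helper.)
__host__ __device__ inline int warpsPerBlock(int blockSize, int ws) {
    return (blockSize + ws - 1) / ws;
}

int main() {
    const int blockSize = 256;  // illustrative choice

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // How many blocks of this kernel can be resident on one SM at once?
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, saxpy, blockSize, 0) != cudaSuccess) {
        fprintf(stderr, "occupancy query failed\n");
        return 1;
    }

    int residentWarps = blocksPerSM * warpsPerBlock(blockSize, prop.warpSize);
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("SMs on device        : %d\n", prop.multiProcessorCount);
    printf("resident warps per SM: %d of %d\n", residentWarps, maxWarps);
    return 0;
}
```

The more resident warps the scheduler has to choose from, the more likely it is that one of them can issue while the others wait on memory.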

3 CUDA Compatibility

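Code needs to run on future GPUs “without recompilation”
CUDA code is compiled to PTX, a virtual instruction set, and then to SASS, the machine code executed on the GPU. When PTX is embedded in the binary, the driver can JIT-compile it to SASS for a GPU newer than the ones the binary was built for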
“Maybe” we don’
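
One way to see the PTX / SASS split at run time is to compare the device's compute capability with the versions the runtime reports for a loaded kernel. This is a sketch under assumptions: the `noop` kernel and the helper `canJitFromPtx` are hypothetical names, and the helper encodes only the general rule that PTX for a given virtual architecture can be JIT-compiled for devices of the same or a newer compute capability.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// An empty kernel, used only so that there is something to inspect.
__global__ void noop() {}

// General forward-compatibility rule (hypothetical helper): PTX built for
// virtual architecture ptxVersion can be JIT-compiled for a device whose
// compute capability is at least that high. Versions are 10*major + minor.
__host__ __device__ inline bool canJitFromPtx(int deviceCC, int ptxVersion) {
    return deviceCC >= ptxVersion;
}

int main() {
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaFuncGetAttributes(&attr, noop) != cudaSuccess) {
        fprintf(stderr, "query failed (is a CUDA device present?)\n");
        return 1;
    }

    int deviceCC = 10 * prop.major + prop.minor;
    printf("device compute capability: %d\n", deviceCC);
    printf("kernel PTX version       : %d\n", attr.ptxVersion);
    printf("kernel binary version    : %d\n", attr.binaryVersion);
    printf("PTX usable on this device: %s\n",
           canJitFromPtx(deviceCC, attr.ptxVersion) ? "yes" : "no");
    return 0;
}
```

Which targets end up in the binary is decided at build time, for example with `nvcc -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 compat.cu` (the file name and the choice of 8.0 are arbitrary). The first `-gencode` embeds SASS for sm_80 and the second embeds PTX that the driver can JIT-compile for later GPUs.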
