GPU and related techs 101

2 minute read

Last time I touched CUDA and GPU was in 2018, when I was preparing for job hunting at CGG. It’s time to review some basics about GPU now

  1. Ray Tracing Ray Tracing models the behavior of light in the scene. The “ray” of light are followed, or “traced” to determine the color of all the objects in the scene.
    Alt text

  2. Resolutions
    Displays consist of pixels. The color of each pixel is determined by little programs, or shaders Alt text A 4K display at 60Hz requires 5million (497,664,000 = 2160 x 3840 x 60) pixels

  3. Virtual GPUs
    • Virutal PC
    • Virtual Applications: app streaming
    • RTX Virutal Workstations: performance graphics
  4. CUDA Cores
    CUDA cores for parallel work
    RT cores for Ray Tracking acceleration
    Tensor cores for AI acceleration

  5. Networking
    • HPC: InfiniBand
    • AI: Mellanox GPUDirect RDMA(Remote Direct Memory Access)
    • Storage: Ethernet Solutions/storage fabric
    • Cloud: BlueField DPU
    • Zero-Trust Security: Nvidia DOCA and BlueField DPU
    • Zero Trust is a security framework requiring all users, whether in or outside the organization’s network, to be authenticated,and continuously validated for security configuration and posture before being granted or keeping access to applications and data.
    • Versus traditional approach automatically trusted users and endpoints within the organization’s perimeter, putting the organization at risk from malicious internal actors
  6. Cypersecurity
    • Digital fingerprinting
    • Spear phishing
  7. Deployment
    • Docker Engine vs Podman
    • Container Runtimes
    • Containerd
    • CRI-O
      - K8s distribution
    • Upstream K8s
    • OpenShift Container Platform (OCP)
    • HPE Ezmeral Runtime Enterprise - RHEL KVM (Kernel based Virtual Machine) - VMware vSphere with Tanzu Last here is a good summary of how container system works Alt text
  8. Application management
    • Container Toolkit w Docker/Podman
    • GPU Operators works w k8s clusters
  9. Workload management
    • Altair PBS pro
    • Slurm
    • K8s
  10. Doherty Threshold
    Productivity soars when a computer and its users interact at a pace (<400ms)
  11. RAPIDS techs
    • Apache Arrow: standard memory format
    • Unified Comminication X (UCX): faster version of TCP for GPU communications
  12. SuperPOD Storage ref
    • DDN
    • NetApp BeeGFS
    • IBM Spectrum Scale
    • VAST
  13. DPU
    • BlueField
    • DOCA
  14. SDN Software Defined Network
    • SR-IOV: Single Root, I/O Virtualization, one NIC to appear as multile virtual NICs
    • OVS Open Virtual Switch
    • VirtIO: Virtual I/O
  15. Storage
    • DAS: Direct Attached Storage
    • NVMe: Non-Volatile Memory Express
    • BlueField Snap: hardware-accelerated virtualization of NVMe
  16. Ethernet
    • ConnectX
    • RDMA over Converged Ethernet (RoCE) enables reading and writing without involving GPU
    • VXLAN offload enables TCP/IP offloads without involving GPU
    • SR-IOV enables VM’s direct access to network adapters - Cumulus Linux
    • OS for Open Ethernet Switches
    • (air.nvidia.com)Virtual - LinkX (up to 40km) - SONiC Software for Open Network in Cloud - NetQ: telemetry collection - What-just-happened: packet lost
  17. InfiniBand
    • EDR/HDR/NDR switches
    • 100/200/400 GB/s
    • Enhanced/High/Next Date Rate
    • S - iSER - RDMA storage protocols - IPoIB: IP over InfiniBand - DAC: Direct Attach Copper - AOC: Active Optical Cables - HCA: Host Channel Adapters
    • EDR/HDR/NDR - SHARP: Scalable Hierarchical Aggregation and Reduction Protocol - UFM Unified Fabric Manager: Telemetry collection - RDAM: one side communication enabled by hardware.
    • No header
    • bypassing the OS
    • socket application
  18. Spectrum-X
    • Spectrum-4 switch
    • BlueField-3 SuperNIC
    • RoCE: InfiniBand Trade Associat(IBTA)

Tags:

Categories:

Updated: