KubeRay Setup and Troubleshooting
Trying to get a 2-node, 8-GPU (4 GPUs from each node) Ray cluster running with KubeRay.
1 KubeRay configuration
We can customize the values.yaml and deploy the Ray cluster with the KubeRay Operator:
helm install my-ray-cluster kuberay/ray-cluster -f my-value.yaml
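This assumes the KubeRay operator is already running in the cluster; if not, a minimal install sketch following the KubeRay Helm docs:
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator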
Here is the sample values file (my-value.yaml):
image:
  repository: vllm/vllm-openai
  tag: latest

# Head group configuration
head:
  nodeSelector:
    kubernetes.io/hostname: dgx-001
  useHostNetwork: true
  resources:
    requests:
      cpu: "16"
      memory: "64Gi"
      nvidia.com/gpu: "4"
    limits:
      cpu: "16"
      memory: "64Gi"
      nvidia.com/gpu: "4"
  containerEnv:
    - name: "CUDA_VISIBLE_DEVICES"
      value: "0,1,2,3"

# Worker group configuration
worker:
  nodeSelector:
    kubernetes.io/hostname: dgx-002
  useHostNetwork: true
  resources:
    requests:
      cpu: "16"
      memory: "64Gi"
      nvidia.com/gpu: "4"
    limits:
      cpu: "16"
      memory: "64Gi"
      nvidia.com/gpu: "4"
  containerEnv:
    - name: "CUDA_VISIBLE_DEVICES"
      value: "4,5,6,7"
2 Troubleshooting
- The deployment failed because the worker group did not get a node assigned. After running kubectl describe on the node, you will see the following Allocated resources section at the end, and there is NO nvidia.com/gpu entry for this node:
Allocated resources:
  Resource           Requests    Limits
  cpu                355m (0%)   100m
  memory             576Mi (0%)  1Gi
  ephemeral-storage  0 (0%)      0
  hugepages-1Gi      0 (0%)      0
  hugepages-2Mi      0 (0%)      0
  nvidia.com/gpu     0           0    # was not here
This is because the NVIDIA Container Toolkit was not installed properly, so fix it with:
sudo apt-get install -y nvidia-container-toolkit
(see more details at the link).
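After installing the toolkit (and configuring the runtime, see section 3), the NVIDIA device plugin should be able to register the GPUs with the kubelet. A sketch of the check, assuming the device plugin runs as the usual DaemonSet in kube-system and using the node name from the values above (the pod name is a placeholder):
# Find and restart the device plugin pod so it re-registers the GPUs
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
kubectl delete pod -n kube-system <device-plugin-pod-on-dgx-002>
# The node should now advertise the GPUs
kubectl describe node dgx-002 | grep nvidia.com/gpu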
Then I checked with the following commands to see whether the toolkit was properly installed:
systemctl status nvidia-container-runtime
# This command always shows NOT found:
# Unit nvidia-container-runtime.service could not be found.
systemctl list-units --type=service | grep nvidia
# This command lists the nvidia services:
# nvidia-dcgm.service            NVIDIA DCGM service
# nvidia-fabricmanager.service   NVIDIA fabric manager service
# nvidia-persistenced.service    NVIDIA Persistence Daemon
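Since recent toolkit versions ship no standalone service (see section 3), a more direct check is to query the toolkit CLI itself:
nvidia-ctk --version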
- To install the fabric manager (its version must match the GPU driver version):
# List all the available fabric manager versions under 570
apt-cache madison cuda-drivers-fabricmanager-570
sudo apt-get install cuda-drivers-fabricmanager-570
# 570.158.01 would be installed by default, but 570.172.08 is the GPU driver version;
# check the installed driver version with:
cat /proc/driver/nvidia/version
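Once installed, verify the fabric manager service is active (on NVSwitch systems nvidia-smi -q also reports a Fabric section, though the exact output varies by driver version):
systemctl status nvidia-fabricmanager
nvidia-smi -q | grep -i -A 2 fabric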
Here is how re-install GPU driver
# remove current driver sudo apt-get purge nvidia* # install drive with specific versoin apt-cache madison nvidia-driver-570 sudo apt install nvidia-driver-570=570.124.06-0ubuntu1
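A reboot is usually needed before the new driver loads; optionally hold the package so unattended upgrades don't bump it out of sync with the fabric manager again:
sudo apt-mark hold nvidia-driver-570
sudo reboot
# After the reboot, confirm the loaded driver version
nvidia-smi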
3 Container Runtime update
The service file nvidia-container-runtime.service likely doesn't exist. In modern versions of the toolkit (like 1.17+), a persistent, standalone service is no longer required; the toolkit integrates directly with your container runtime (Docker, containerd) through configuration.
The goal is not to start a service, but to configure your container engine to use the NVIDIA runtime.
# 1. Reload systemd units
sudo systemctl daemon-reload
# 2. Update the container runtime configuration
# Use this command if you are using Docker
sudo nvidia-ctk runtime configure --runtime=docker
# Or use this command if you are using containerd
# sudo nvidia-ctk runtime configure --runtime=containerd
# 3. Restart the container runtime
# For Docker
sudo systemctl restart docker
# For containerd
# sudo systemctl restart containerd
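With the runtime configured, an end-to-end smoke test (the CUDA image tag is only an example; any CUDA base image works):
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi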
You can check the following files to see the nvidia runtime being used:
cat /etc/docker/daemon.json
cat /etc/containerd/config.toml
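After nvidia-ctk runtime configure --runtime=docker, daemon.json should contain a runtimes entry along these lines (exact contents may vary by toolkit version):
{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}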
4 Key update
# List the currently trusted apt keys
sudo apt-key list
# Install the new CUDA repository keyring
distro=ubuntu2204
arch=x86_64
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
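After installing the keyring, refresh the package index so apt picks up the new signing key:
sudo apt-get update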