Multi-node vLLM on Ray cluster
It’s kind of funny that until now I had barely ever started a Ray cluster without Anyscale. It is actually a complicated process, but most enterprises today aren’t aware of that, mainly because they aren’t at the level of starting large clusters. During the Ray Summit 24 keynote, Yifei talked about how Anyscale can start thousands of nodes within minutes, which is really remarkable. But where is the need for thousands of nodes? After one year with Nvidia, I haven’t even started a 3-node cluster. Yes, sadly, a 2-node cluster is my current upper limit.
0 Python Installation
Install Python on Ubuntu 22.04 (check the OS version with lsb_release -a).
# Python 3.12 installation
sudo apt update && sudo apt upgrade -y
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.12
# venv module update
sudo apt-get install python3.12-venv
# Create venv
python3.12 -m venv .venv312
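Not part of the original install steps, but assuming Ray and vLLM should live in that venv, the follow-up would look something like this:
# Activate the venv and install Ray + vLLM (package choice is my assumption)
source .venv312/bin/activate
pip install --upgrade pip
pip install "ray[default]" vllm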
1 KubeRay
Starting a Ray cluster with KubeRay means going through a couple of helm install steps: first the KubeRay operator, and then the RayCluster or RayService installation.
# KubeRay Helm Chart Setup Script
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.4.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
# Raycluster CR creation
# Deploy a sample RayCluster CR from the KubeRay Helm chart repo:
helm install raycluster kuberay/ray-cluster --version 1.4.0
kubectl get rayclusters
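To sanity-check the cluster (assuming the default release names produced by the commands above), list the head/worker pods and forward the Ray dashboard:
# Check the pods created for the RayCluster CR
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
# Forward the Ray dashboard (default port 8265) from the head service
kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265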
2 Ray CLI
Use ray start to create a Ray cluster from scratch.
The vLLM multi-node example actually uses this approach.
The idea is:
- Create the Ray head node with ray start --head --port=6379
- Create the Ray worker nodes with ray start --address=${HEAD_NODE_ADDRESS}:6379
- Run vllm serve as you normally would, with --tensor-parallel-size 8 --pipeline-parallel-size 2 (see the sketch after this list)
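Putting the three steps together, a minimal sketch (the head address and model path are placeholders, not my actual setup):
# On the head node
ray start --head --port=6379
# On each worker node (HEAD_NODE_ADDRESS is the head node IP)
ray start --address=${HEAD_NODE_ADDRESS}:6379
# Back on the head node: check resources, then serve
ray status
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2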
A couple of notes:
- You can use CUDA_VISIBLE_DEVICES=0,1 ray start ... to control which GPUs a Ray node uses.
- The vLLM log shows whether it is using all the GPUs on a single node or, when there aren’t enough, starting Ray processes to get more GPUs from worker nodes.
- Currently all cross-node GPU access attempts failed …
- A simple Ray test works, with the following code for a Pi calculation:
import math
import random
import time

import ray
import torch

# Takes about 8GB
ndim = 25_000

def run_dummy_job():
    # Burn GPU cycles for ~6 seconds with two large matrices
    start = time.time()
    random1 = torch.randn([ndim, ndim]).to("cuda")
    random2 = torch.randn([ndim, ndim]).to("cuda")
    while time.time() - start < 0.1 * 60:
        random1 = random1 * random2
        random2 = random2 * random1
    del random1, random2
    torch.cuda.empty_cache()

@ray.remote
class ProgressActor:
    def __init__(self, total_num_samples: int):
        self.total_num_samples = total_num_samples
        self.num_samples_completed_per_task = {}

    def report_progress(self, task_id: int, num_samples_completed: int) -> None:
        self.num_samples_completed_per_task[task_id] = num_samples_completed

    def get_progress(self) -> float:
        return (
            sum(self.num_samples_completed_per_task.values())
            / self.total_num_samples
        )

@ray.remote(num_gpus=1)
def sampling_task(num_samples: int, task_id: int, progress_actor: ray.actor.ActorHandle) -> int:
    num_inside = 0
    for i in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
        # Report progress every 1 million samples.
        if (i + 1) % 1_000_000 == 0:
            # This is async. and showing dummy GPU usage
            run_dummy_job()
            progress_actor.report_progress.remote(task_id, i + 1)
    # Report the final progress.
    progress_actor.report_progress.remote(task_id, num_samples)
    return num_inside
- In this example, I didn’t set num_gpus=1 for the actor, so it will NOT take a GPU (see the driver sketch below).
- If I used the same ray.remote(num_gpus=1) setting for the actor, the actor would also take a GPU on the master node, leaving only 7 GPU workers for sampling_task (assuming an 8-GPU node).
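For completeness, here is a driver sketch in the spirit of the Ray Monte Carlo Pi example the code above is adapted from; it assumes it lives in the same file, and the task/sample counts are placeholders, not the values I ran with:
# Connect to the running cluster instead of starting a local one
ray.init(address="auto")

NUM_SAMPLING_TASKS = 8
NUM_SAMPLES_PER_TASK = 10_000_000
TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

# The actor is created without num_gpus, so it does not reserve a GPU
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)

# Each sampling_task reserves one GPU (num_gpus=1 in its decorator)
results = [
    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
    for i in range(NUM_SAMPLING_TASKS)
]

# Poll the progress actor until every sample has been reported
while True:
    progress = ray.get(progress_actor.get_progress.remote())
    print(f"Progress: {int(progress * 100)}%")
    if progress == 1:
        break
    time.sleep(1)

# Aggregate the per-task counts and estimate Pi
total_num_inside = sum(ray.get(results))
pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES
print(f"Estimated value of pi: {pi}")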
3 vLLM multi-node
- After you start the Ray cluster, you can serve vLLM as if you had all the GPUs from every node.
- But somehow my current tests with GPUs across nodes all failed.
- Setup: 4 GPUs from Node 1 (which has 8 GPUs in total) and 4 GPUs from Node 2 (which has 4 GPUs in total) added to the Ray cluster.
- vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 4
would work, using 4 GPUs from Node 1.
- vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8
would work, using 8 GPUs from Node 1 (even though only 4 of them were added to the Ray cluster).
- CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8
failed: it sees only 4 GPUs on Node 1, so it tries to use the Ray cluster to get the other 4 GPUs from Node 2.
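Not part of the failing runs above, but a sanity check worth adding before any cross-node attempt: confirm the GPU count Ray actually reports from both nodes.
# On the head node, after both nodes have joined
ray status
# Or from Python
python -c "import ray; ray.init(address='auto'); print(ray.cluster_resources())"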