Multi-node vLLM on Ray cluster


It’s kind of funny that I’ve barely ever started a Ray cluster without Anyscale. It’s actually a complicated process, but one most enterprises today aren’t aware of, mainly because they are not at the level of starting large clusters. During the Ray Summit ’24 keynote, Yifei talked about how Anyscale can start thousands of nodes within minutes, which is really remarkable. But where is the need for thousands of nodes? After a year with Nvidia, I still haven’t started a 3-node cluster. Yes, sadly, a 2-node cluster is my current upper limit.

0 Python Installation

Install Python 3.12 on Ubuntu 22.04 (check the Ubuntu version with lsb_release -a).

# Python 3.12 installation
sudo apt update && sudo apt upgrade -y
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.12
# Install the venv module for Python 3.12
sudo apt-get install python3.12-venv
# Create venv
python3.12 -m venv .venv312

1 KubeRay

Starting a Ray cluster with KubeRay means going through a couple of helm install steps: first the KubeRay operator, then the RayCluster or RayService installation.

# KubeRay Helm Chart Setup Script
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.4.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
# RayCluster CR creation
# Deploy a sample RayCluster CR from the KubeRay Helm chart repo:
helm install raycluster kuberay/ray-cluster --version 1.4.0
kubectl get rayclusters
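
Once the RayCluster pods are up, you can talk to the cluster through the head pod’s dashboard port. Below is a minimal sketch, assuming you port-forward the head service first with kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265 (that service name comes from the default Helm release name above and may differ in your setup; the entrypoint is just a throwaway command I made up to print the cluster resources).

# Submit a throwaway job against the port-forwarded Ray dashboard (port 8265 is Ray's default).
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Hypothetical entrypoint: print what the KubeRay-managed cluster exposes.
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
)
print(job_id, client.get_job_status(job_id))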

2 Ray CLI

Use ray start to create a Ray cluster from scratch. The vLLM multi-node example actually uses this approach. The idea is:

  1. Create the Ray head node with ray start --head --port=6379
  2. Create the Ray worker nodes with ray start --address=${HEAD_NODE_ADDRESS}:6379
  3. Run vllm serve as normal with --tensor-parallel-size 8 --pipeline-parallel-size 2 (see the quick check after this list to confirm all GPUs are visible to Ray first)
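
Before running vllm serve, it is worth confirming that the worker nodes actually joined. A minimal sketch, run on the head node against the cluster started above:

# Attach to the running cluster and check that every node's GPUs are visible.
import ray

ray.init(address="auto")
print(ray.cluster_resources())   # the "GPU" entry should equal the sum across nodes
for node in ray.nodes():         # one entry per node, with its own resources
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))
ray.shutdown()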

A couple of notes:

  1. You can use CUDA_VISIBLE_DEVICES=0,1 ray start ... to control which GPUs a Ray node uses
  2. The vLLM log shows whether it is using all GPUs on a single node or, when that is not enough, starting Ray workers to get more GPUs from the worker nodes
  3. Currently all cross-node GPU access attempts have failed …
  4. A simple Ray test works with the following code for a Pi calculation (a minimal driver sketch follows after this list)
    # Imports needed by the snippet below
    import math
    import random
    import time

    import ray
    import torch

    # Takes about 8GB of GPU memory
    ndim = 25_000

    def run_dummy_job():
        # Burn GPU cycles for ~6 seconds to show dummy GPU usage.
        start = time.time()
        random1 = torch.randn([ndim, ndim]).to("cuda")
        random2 = torch.randn([ndim, ndim]).to("cuda")
        while time.time() - start < 0.1 * 60:
            random1 = random1 * random2
            random2 = random2 * random1
        del random1, random2
        torch.cuda.empty_cache()

    @ray.remote
    class ProgressActor:
        def __init__(self, total_num_samples: int):
            self.total_num_samples = total_num_samples
            self.num_samples_completed_per_task = {}

        def report_progress(self, task_id: int, num_samples_completed: int) -> None:
            self.num_samples_completed_per_task[task_id] = num_samples_completed

        def get_progress(self) -> float:
            return (
                sum(self.num_samples_completed_per_task.values())
                / self.total_num_samples
            )

    @ray.remote(num_gpus=1)
    def sampling_task(num_samples: int, task_id: int,
                      progress_actor: ray.actor.ActorHandle) -> int:
        num_inside = 0
        for i in range(num_samples):
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if math.hypot(x, y) <= 1:
                num_inside += 1
            # Report progress every 1 million samples.
            if (i + 1) % 1_000_000 == 0:
                # The actor call is async; run_dummy_job() just generates dummy GPU load.
                run_dummy_job()
                progress_actor.report_progress.remote(task_id, i + 1)
        # Report the final progress.
        progress_actor.report_progress.remote(task_id, num_samples)
        return num_inside
    
  5. In this example, I didn’t set num_gpus=1 for the actor, so it will NOT take a GPU.
  6. If the same ray.remote(num_gpus=1) setting is used for the actor, the actor will take a GPU, and there will be only 7 GPU workers left for sampling_task (assuming an 8-GPU node).
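
To actually run the Pi example above on the cluster, you need a small driver. A minimal sketch, assuming the cluster from this section is already running; NUM_SAMPLING_TASKS and NUM_SAMPLES_PER_TASK are values I picked for illustration:

import time

import ray

# Values chosen for illustration: one GPU-backed task per GPU on an 8-GPU node.
NUM_SAMPLING_TASKS = 8
NUM_SAMPLES_PER_TASK = 10_000_000
TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

ray.init(address="auto")  # attach to the cluster started with `ray start`

# As noted above, the actor is created without num_gpus, so it does not hold a GPU.
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)

# Each sampling_task reserves num_gpus=1, so at most 8 run in parallel here.
results = [
    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
    for i in range(NUM_SAMPLING_TASKS)
]

# Poll the progress actor while the GPU-backed tasks run.
while True:
    progress = ray.get(progress_actor.get_progress.remote())
    print(f"Progress: {int(progress * 100)}%")
    if progress == 1:
        break
    time.sleep(5)

pi = 4 * sum(ray.get(results)) / TOTAL_NUM_SAMPLES
print(f"Estimated pi: {pi}")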

3 vLLM multi-node

  1. After you start the Ray cluster, you can serve vLLM as if you had all the GPU nodes locally
  2. But somehow my current tests with GPUs spread across nodes all failed:
    • Setup: 4 GPUs from Node 1 (which has 8 GPUs in total), 4 GPUs from Node 2 (which has 4 GPUs in total)
    • vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 4 works, using 4 GPUs from Node 1 (see the client-side check after this list)
    • vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8 works, using 8 GPUs from Node 1 (only 4 of them were added to the Ray cluster)
    • CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8 fails: it only sees 4 GPUs on Node 1, so it tries to go through the Ray cluster and use the other 4 GPUs from Node 2
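
For the configurations that do come up (the single-node ones above), a quick sanity check against vLLM’s OpenAI-compatible endpoint looks like the sketch below. Port 8000 is vLLM’s default, and the model name defaults to the path passed to vllm serve unless --served-model-name is set; the prompt is just a placeholder.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible server on port 8000 by default; no real key needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="/root/.cache/huggingface/model_folder",  # same path passed to `vllm serve`
    prompt="Multi-node vLLM on a Ray cluster is",
    max_tokens=32,
)
print(resp.choices[0].text)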
