Multi-node vLLM on Ray cluster


It’s kind of funny that I’ve barely ever started a Ray cluster without Anyscale. It’s actually a complicated process, but one most enterprises today aren’t aware of, mainly because they are not at the level of starting large clusters. During the Ray Summit ’24 keynote, Yifei talked about how Anyscale can start thousands of nodes within minutes, which is really remarkable. But where is the need for thousands of nodes? After a year with Nvidia, I still haven’t started a 3-node cluster. Yes, sadly, a 2-node cluster is my current upper limit.

0 Python Installation

Install Python 3.12 on Ubuntu 22.04 (check the Ubuntu version with lsb_release -a).

# Python 3.12 installation
sudo apt update && sudo apt upgrade -y
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.12
# Install the venv module for Python 3.12
sudo apt-get install python3.12-venv
# Create venv
python3.12 -m venv .venv312

1 KubeRay

Starting a Ray cluster with KubeRay means going through a couple of helm install steps: first the KubeRay operator, then the RayCluster or RayService installation.

# KubeRay Helm Chart Setup Script
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.4.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.4.0
# RayCluster CR creation
# Deploy a sample RayCluster CR from the KubeRay Helm chart repo:
helm install raycluster kuberay/ray-cluster --version 1.4.0
kubectl get rayclusters
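
Once the RayCluster pods are up, you can talk to the cluster through the head pod’s dashboard port. Below is a minimal sketch, assuming you port-forward the head service first with kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265 (that service name comes from the default Helm release name above and may differ in your setup; the entrypoint is just a throwaway command I made up to print the cluster resources).

# Submit a throwaway job against the port-forwarded Ray dashboard (port 8265 is Ray's default).
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    # Hypothetical entrypoint: print what the KubeRay-managed cluster exposes.
    entrypoint='python -c "import ray; ray.init(); print(ray.cluster_resources())"',
)
print(job_id, client.get_job_status(job_id))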

2 Ray CLI

Use ray start to create a Ray cluster from scratch. The vLLM multi-node example actually uses this approach. The idea is:

  1. Create the Ray head node with ray start --head --port=6379
  2. Create the Ray worker nodes with ray start --address=${HEAD_NODE_ADDRESS}:6379
  3. Run vllm serve as normal with --tensor-parallel-size 8 --pipeline-parallel-size 2 (see the quick check after this list to confirm all GPUs are visible to Ray first)
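
Before running vllm serve, it is worth confirming that the worker nodes actually joined. A minimal sketch, run on the head node against the cluster started above:

# Attach to the running cluster and check that every node's GPUs are visible.
import ray

ray.init(address="auto")
print(ray.cluster_resources())   # the "GPU" entry should equal the sum across nodes
for node in ray.nodes():         # one entry per node, with its own resources
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0))
ray.shutdown()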

A couple of notes:

  1. You can use CUDA_VISIBLE_DEVICES=0,1 ray start ... to control which GPUs a Ray node uses
  2. The vLLM log shows whether it is using all GPUs on a single node or, when that is not enough, starting Ray workers to get more GPUs from the worker nodes
  3. Currently all cross-node GPU access attempts have failed …
  4. A simple Ray test works with the following code for a Pi calculation (a minimal driver sketch follows after this list)
    # Imports needed by the snippet below
    import math
    import random
    import time

    import ray
    import torch

    # Takes about 8GB of GPU memory
    ndim = 25_000

    def run_dummy_job():
        # Burn GPU cycles for ~6 seconds to show dummy GPU usage.
        start = time.time()
        random1 = torch.randn([ndim, ndim]).to("cuda")
        random2 = torch.randn([ndim, ndim]).to("cuda")
        while time.time() - start < 0.1 * 60:
            random1 = random1 * random2
            random2 = random2 * random1
        del random1, random2
        torch.cuda.empty_cache()

    @ray.remote
    class ProgressActor:
        def __init__(self, total_num_samples: int):
            self.total_num_samples = total_num_samples
            self.num_samples_completed_per_task = {}

        def report_progress(self, task_id: int, num_samples_completed: int) -> None:
            self.num_samples_completed_per_task[task_id] = num_samples_completed

        def get_progress(self) -> float:
            return (
                sum(self.num_samples_completed_per_task.values())
                / self.total_num_samples
            )

    @ray.remote(num_gpus=1)
    def sampling_task(num_samples: int, task_id: int,
                      progress_actor: ray.actor.ActorHandle) -> int:
        num_inside = 0
        for i in range(num_samples):
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if math.hypot(x, y) <= 1:
                num_inside += 1
            # Report progress every 1 million samples.
            if (i + 1) % 1_000_000 == 0:
                # The actor call is async; run_dummy_job() just generates dummy GPU load.
                run_dummy_job()
                progress_actor.report_progress.remote(task_id, i + 1)
        # Report the final progress.
        progress_actor.report_progress.remote(task_id, num_samples)
        return num_inside
    
  5. In this example, I didn’t set num_gpus=1 for the actor, so it will NOT take a GPU.
  6. If the same ray.remote(num_gpus=1) setting is used for the actor, the actor will take a GPU, and there will be only 7 GPU workers left for sampling_task (assuming an 8-GPU node).
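
To actually run the Pi example above on the cluster, you need a small driver. A minimal sketch, assuming the cluster from this section is already running; NUM_SAMPLING_TASKS and NUM_SAMPLES_PER_TASK are values I picked for illustration:

import time

import ray

# Values chosen for illustration: one GPU-backed task per GPU on an 8-GPU node.
NUM_SAMPLING_TASKS = 8
NUM_SAMPLES_PER_TASK = 10_000_000
TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

ray.init(address="auto")  # attach to the cluster started with `ray start`

# As noted above, the actor is created without num_gpus, so it does not hold a GPU.
progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)

# Each sampling_task reserves num_gpus=1, so at most 8 run in parallel here.
results = [
    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
    for i in range(NUM_SAMPLING_TASKS)
]

# Poll the progress actor while the GPU-backed tasks run.
while True:
    progress = ray.get(progress_actor.get_progress.remote())
    print(f"Progress: {int(progress * 100)}%")
    if progress == 1:
        break
    time.sleep(5)

pi = 4 * sum(ray.get(results)) / TOTAL_NUM_SAMPLES
print(f"Estimated pi: {pi}")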

3 vLLM multi-node

  1. After you start the Ray cluster, you can serve vLLM as if you had all the GPU nodes locally
  2. But somehow my current tests with GPUs spread across nodes all failed:
    • Setup: 4 GPUs from Node 1 (which has 8 GPUs in total), 4 GPUs from Node 2 (which has 4 GPUs in total)
    • vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 4 works, using 4 GPUs from Node 1 (see the client-side check after this list)
    • vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8 works, using 8 GPUs from Node 1 (only 4 of them were added to the Ray cluster)
    • CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /root/.cache/huggingface/model_folder --tensor-parallel-size 8 fails: it only sees 4 GPUs on Node 1, so it tries to go through the Ray cluster and use the other 4 GPUs from Node 2
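
For the configurations that do come up (the single-node ones above), a quick sanity check against vLLM’s OpenAI-compatible endpoint looks like the sketch below. Port 8000 is vLLM’s default, and the model name defaults to the path passed to vllm serve unless --served-model-name is set; the prompt is just a placeholder.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible server on port 8000 by default; no real key needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="/root/.cache/huggingface/model_folder",  # same path passed to `vllm serve`
    prompt="Multi-node vLLM on a Ray cluster is",
    max_tokens=32,
)
print(resp.choices[0].text)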
