Ray (continued) on two H200x8 nodes

Issues and fixes

  1. Limiting visible GPUs
    • Setting export CUDA_VISIBLE_DEVICES=0,1,... at the top of _start_ray.sh does NOT work
    • Instead, pass -e CUDA_VISIBLE_DEVICES=0,1,... as a Docker env var
    • Or prefix the command: CUDA_VISIBLE_DEVICES=0,1,... ray start ... (sketch after the list)
  2. IP address
    • Got the error `{'GPU': 1.0, 'node:10.42.22.33': 0.001} * 1, {'GPU': 1.0} * 1 (PACK): 1+ pending placement groups`; similar issue here
    • ChatGPT suggested export VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}', which does NOT work; the PACK strategy is best effort for a single node and can still spread across multiple nodes
    • The solution: set -e VLLM_HOST_IP=10.0.0.55 using the local address instead of the public IP address 89.169.102.44 (sketch after the list)
  3. Unique IP address error
    • error msg
      RuntimeError: Every node should have a unique IP address. Got 2 nodes with node ids ['3d2614bf5e21d0b0ea4a1f6701582aaddbd54e79db2e5aa843bc36e6', '767a1e2406b21a25161c1744c462acb6d8db9323efacaae4f72cf307'] and 3 unique IP addresses {'89.169.102.44', '10.0.0.34', '10.0.0.55'}. Please check your network configuration. If you set `VLLM_HOST_IP` environment variable, make sure it is unique for each node.
      
    • The solution is to use 10.0.0.55 from eth0 as the head node IP instead of 89.169.102.44 (sketch after the list)
  4. GLOO error
    • error msg below
       backend_class = ProcessGroupGloo(
                     ^^^^^^^^^^^^^^^^^
      RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
      

      This is something I still can't resolve; some reference here. A commonly suggested (but untested here) workaround is sketched after the list.

  5. Run hardware test
    • Single-node case: just set --nproc-per-node=<number-of-GPUs>, where the number of GPUs matches the visible GPU devices
    • Multi-node case: tried --rdzv_backend=c10d for an elastic number of nodes and --rdzv_backend=static for a fixed number of nodes (sketches after the list)
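
A minimal sketch for issue 1 (limiting visible GPUs), showing the two approaches that did work; the image name my-ray-image, the GPU list, and the address/port are placeholders for this setup.

    # Option A: pass the restriction as a Docker env var
    docker run -d --gpus all --network host \
      -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
      my-ray-image \
      ray start --address=10.0.0.55:6379 --block

    # Option B: prefix the ray start command itself
    CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --address=10.0.0.55:6379 --block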
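A sketch for issue 2 (pending placement groups): VLLM_HOST_IP goes in as a Docker env var and holds each node's private eth0 address; the image name and port are placeholders.

    # Head node (private IP 10.0.0.55)
    docker run -d --gpus all --network host \
      -e VLLM_HOST_IP=10.0.0.55 \
      my-ray-image \
      ray start --head --port=6379 --block

    # Worker node advertises its own private IP (e.g. 10.0.0.34)
    docker run -d --gpus all --network host \
      -e VLLM_HOST_IP=10.0.0.34 \
      my-ray-image \
      ray start --address=10.0.0.55:6379 --block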
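A sketch for issue 3 (unique IP addresses): derive the IP from eth0 and hand it to ray start explicitly, so every node advertises exactly one address; the port is arbitrary.

    # Head node: pick up the eth0 address (10.0.0.55 here) and bind Ray to it
    HEAD_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
    ray start --head --node-ip-address="$HEAD_IP" --port=6379

    # Worker node: advertise its own eth0 address, join via the head's private IP
    ray start --address=10.0.0.55:6379 --node-ip-address=10.0.0.34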
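For issue 4 (Gloo connectFullMesh), one commonly suggested mitigation, which I have not verified here, is pinning Gloo to the private interface so it does not try to connect over the public address:

    # Untested: force Gloo onto eth0 on every node before starting Ray / vLLM,
    # or pass it into the container with -e GLOO_SOCKET_IFNAME=eth0
    export GLOO_SOCKET_IFNAME=eth0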
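A sketch for issue 5 (hardware test with torchrun); hw_test.py stands in for whatever test script is used, and the rendezvous port is arbitrary.

    # Single node: one process per visible GPU
    torchrun --standalone --nproc-per-node=8 hw_test.py

    # Multi node, elastic node count (c10d rendezvous); 10.0.0.55 is the head node
    torchrun --nnodes=1:2 --nproc-per-node=8 \
      --rdzv_backend=c10d --rdzv_endpoint=10.0.0.55:29500 --rdzv_id=hwtest \
      hw_test.py

    # Multi node, fixed node count (static rendezvous); set --node_rank per node (0, 1)
    torchrun --nnodes=2 --nproc-per-node=8 \
      --rdzv_backend=static --master_addr=10.0.0.55 --master_port=29500 \
      --node_rank=0 \
      hw_test.py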