Ray (continued) on two H200x8 nodes
Issues and fixes
- GPU limitations
  - Setting `export CUDA_VISIBLE_DEVICES=0,1,...` at the top of `_start_ray.sh` does NOT work.
  - Instead, add `-e CUDA_VISIBLE_DEVICES=0,1,...` as a docker env var, or prefix the command: `CUDA_VISIBLE_DEVICES=0,1,... ray start ...` (see the sketch after this list).
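A sketch of the two workarounds, assuming GPUs 0-3 should be visible; the image name, port, and container options are placeholders, not the actual configuration:

```bash
# Option 1: pass CUDA_VISIBLE_DEVICES as a docker env var so every process in
# the container (including ray start) only sees GPUs 0-3.
docker run -d --name ray-head --gpus all --network host \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  rayproject/ray:latest \
  ray start --head --port=6379 --num-gpus=4 --block

# Option 2: scope the variable to the ray start command itself (no export).
CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --head --port=6379 --num-gpus=4
```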
- IP address
  - Got the error `{'GPU': 1.0, 'node:10.42.22.33': 0.001} * 1, {'GPU': 1.0} * 1 (PACK): 1+ pending placement groups`; a similar issue here.
  - ChatGPT suggested `export VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}'`, which does NOT work. The PACK strategy is best effort on a single node but can still SPREAD across nodes.
  - The solution: set `-e VLLM_HOST_IP=10.0.0.55` using the local address instead of the global IP address 89.169.102.44 (sketch after this list).
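A sketch of the working setting, assuming the head node's eth0 address is 10.0.0.55 and the worker's is 10.0.0.34; the image name and ports are placeholders:

```bash
# Head node (eth0 address 10.0.0.55): give vLLM the private address, not the
# public 89.169.102.44.
docker run -d --name ray-head --gpus all --network host \
  -e VLLM_HOST_IP=10.0.0.55 \
  rayproject/ray:latest \
  ray start --head --node-ip-address=10.0.0.55 --port=6379 --block

# Worker node (eth0 address 10.0.0.34): same idea with its own private address.
docker run -d --name ray-worker --gpus all --network host \
  -e VLLM_HOST_IP=10.0.0.34 \
  rayproject/ray:latest \
  ray start --address=10.0.0.55:6379 --node-ip-address=10.0.0.34 --block
```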
- Unique IP address error
  - error msg:
    RuntimeError: Every node should have a unique IP address. Got 2 nodes with node ids ['3d2614bf5e21d0b0ea4a1f6701582aaddbd54e79db2e5aa843bc36e6', '767a1e2406b21a25161c1744c462acb6d8db9323efacaae4f72cf307'] and 3 unique IP addresses {'89.169.102.44', '10.0.0.34', '10.0.0.55'}. Please check your network configuration. If you set `VLLM_HOST_IP` environment variable, make sure it is unique for each node.
  - The solution is to use `10.0.0.55` from `eth0` as the head node IP instead of `89.169.102.44` (sketch after this list).
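A minimal sketch of reading the eth0 address on each node instead of hardcoding it, so Ray and vLLM both see exactly one address per node; the interface name eth0 and the port are assumptions carried over from the notes above:

```bash
# Resolve the node-local eth0 address (e.g. 10.0.0.55 on the head node).
NODE_IP=$(ip -4 addr show eth0 | awk '/inet /{print $2}' | cut -d/ -f1)

# Use that single private address for both Ray and vLLM, never the public IP.
export VLLM_HOST_IP="$NODE_IP"
ray start --head --node-ip-address="$NODE_IP" --port=6379
```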
- GLOO error
  - error msg:
    backend_class = ProcessGroupGloo(
    RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
  - This is something I still can't resolve. Some references here; an untested guess follows this list.
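One commonly suggested mitigation for this failure, not verified in these notes: Gloo can pick the wrong interface on multi-homed nodes, so pinning it (and NCCL) to the private interface is worth a try. `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME` are standard PyTorch/NCCL variables; using eth0 here is an assumption.

```bash
# Untested assumption: force Gloo/NCCL to connect over eth0 (the 10.0.0.x
# network) instead of whichever interface they pick by default.
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
```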
- Run Hardware test
  - Single node case: just set `--nproc-per-node=<number-of-GPUs>`, with the number of processes matching the visible GPU devices.
  - Multi node case: tried `--rdzv_backend=c10d` for an elastic number of nodes and `--rdzv_backend=static` for a fixed number of nodes (sketch after this list).
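A sketch of the corresponding torchrun invocations; the script name hardware_test.py, the rendezvous port 29500, and the node counts are placeholders:

```bash
# Single node: one process per visible GPU.
torchrun --standalone --nproc-per-node=8 hardware_test.py

# Multi node, elastic node count (c10d rendezvous); run the same command on
# every node, pointing at the head node's private address.
torchrun --nnodes=1:2 --nproc-per-node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=10.0.0.55:29500 \
  hardware_test.py

# Multi node, fixed node count (static rendezvous); --node_rank differs per node.
torchrun --nnodes=2 --nproc-per-node=8 --node_rank=0 \
  --rdzv_backend=static --rdzv_endpoint=10.0.0.55:29500 \
  hardware_test.py
```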