Ray (continued) on two H200x8 nodes

Issues and fixes

  1. Limiting visible GPUs
    • Setting export CUDA_VISIBLE_DEVICES=0,1,... at the top of _start_ray.sh does NOT work
    • Instead, pass -e CUDA_VISIBLE_DEVICES=0,1,... as a Docker env var
    • Or prefix the command: CUDA_VISIBLE_DEVICES=0,1,... ray start ... (sketch after the list)
  2. IP address
    • Got the error `{'GPU': 1.0, 'node:10.42.22.33': 0.001} * 1, {'GPU': 1.0} * 1 (PACK): 1+ pending placement groups`; similar issue here
    • ChatGPT suggested export VLLM_DISTRIBUTED_EXECUTOR_CONFIG='{"placement_group_options":{"strategy":"SPREAD"}}', which does NOT work; the PACK strategy is best effort for a single node and can still spread across multiple nodes
    • The solution: set -e VLLM_HOST_IP=10.0.0.55 using the local address instead of the public IP address 89.169.102.44 (sketch after the list)
  3. Unique IP address error
    • error msg
      RuntimeError: Every node should have a unique IP address. Got 2 nodes with node ids ['3d2614bf5e21d0b0ea4a1f6701582aaddbd54e79db2e5aa843bc36e6', '767a1e2406b21a25161c1744c462acb6d8db9323efacaae4f72cf307'] and 3 unique IP addresses {'89.169.102.44', '10.0.0.34', '10.0.0.55'}. Please check your network configuration. If you set `VLLM_HOST_IP` environment variable, make sure it is unique for each node.
      
    • The solution is to use 10.0.0.55 from eth0 as the head node IP instead of 89.169.102.44 (sketch after the list)
  4. GLOO error
    • error msg below
       backend_class = ProcessGroupGloo(
                     ^^^^^^^^^^^^^^^^^
      RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
      

      This is something I still can't resolve; some reference here. A commonly suggested (but untested here) workaround is sketched after the list.

  5. Run hardware test
    • Single-node case: just set --nproc-per-node=<number-of-GPUs>, where the number of GPUs matches the visible GPU devices
    • Multi-node case: tried --rdzv_backend=c10d for an elastic number of nodes and --rdzv_backend=static for a fixed number of nodes (sketches after the list)
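
A minimal sketch for issue 1 (limiting visible GPUs), showing the two approaches that did work; the image name my-ray-image, the GPU list, and the address/port are placeholders for this setup.

    # Option A: pass the restriction as a Docker env var
    docker run -d --gpus all --network host \
      -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
      my-ray-image \
      ray start --address=10.0.0.55:6379 --block

    # Option B: prefix the ray start command itself
    CUDA_VISIBLE_DEVICES=0,1,2,3 ray start --address=10.0.0.55:6379 --block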
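A sketch for issue 2 (pending placement groups): VLLM_HOST_IP goes in as a Docker env var and holds each node's private eth0 address; the image name and port are placeholders.

    # Head node (private IP 10.0.0.55)
    docker run -d --gpus all --network host \
      -e VLLM_HOST_IP=10.0.0.55 \
      my-ray-image \
      ray start --head --port=6379 --block

    # Worker node advertises its own private IP (e.g. 10.0.0.34)
    docker run -d --gpus all --network host \
      -e VLLM_HOST_IP=10.0.0.34 \
      my-ray-image \
      ray start --address=10.0.0.55:6379 --block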
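A sketch for issue 3 (unique IP addresses): derive the IP from eth0 and hand it to ray start explicitly, so every node advertises exactly one address; the port is arbitrary.

    # Head node: pick up the eth0 address (10.0.0.55 here) and bind Ray to it
    HEAD_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)
    ray start --head --node-ip-address="$HEAD_IP" --port=6379

    # Worker node: advertise its own eth0 address, join via the head's private IP
    ray start --address=10.0.0.55:6379 --node-ip-address=10.0.0.34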
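For issue 4 (Gloo connectFullMesh), one commonly suggested mitigation, which I have not verified here, is pinning Gloo to the private interface so it does not try to connect over the public address:

    # Untested: force Gloo onto eth0 on every node before starting Ray / vLLM,
    # or pass it into the container with -e GLOO_SOCKET_IFNAME=eth0
    export GLOO_SOCKET_IFNAME=eth0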
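A sketch for issue 5 (hardware test with torchrun); hw_test.py stands in for whatever test script is used, and the rendezvous port is arbitrary.

    # Single node: one process per visible GPU
    torchrun --standalone --nproc-per-node=8 hw_test.py

    # Multi node, elastic node count (c10d rendezvous); 10.0.0.55 is the head node
    torchrun --nnodes=1:2 --nproc-per-node=8 \
      --rdzv_backend=c10d --rdzv_endpoint=10.0.0.55:29500 --rdzv_id=hwtest \
      hw_test.py

    # Multi node, fixed node count (static rendezvous); set --node_rank per node (0, 1)
    torchrun --nnodes=2 --nproc-per-node=8 \
      --rdzv_backend=static --master_addr=10.0.0.55 --master_port=29500 \
      --node_rank=0 \
      hw_test.py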