Failed to run the model on 4 nodes of 8 x 4090s
I am trying to start the model on 4 nodes of 8 x 4090 servers, and it always fails with an "out of memory" error. Here is my config:
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--host 0.0.0.0 --port $PORT \
--gpu-memory-utilization 0.98 \
--max-model-len 128 \
--tensor-parallel-size 8 --pipeline-parallel-size 4 --enforce-eager
Even with a super small context, 32 RTX 4090s with a total of 768GB VRAM still failed to load the model.
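Rough back-of-the-envelope math, assuming the FP8 checkpoint is about 671B parameters at roughly 1 byte per parameter (so on the order of 671 GB of weights):

671 GB / 32 GPUs (TP=8 x PP=4)   ~= 21 GB of weights per card
23.64 GiB x 0.98 utilization     ~= 23.2 GiB usable per card

That leaves only about 2 GiB per GPU for KV cache, activations, NCCL buffers and the CUDA context, which matches the numbers in the logs below.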
Here are some of the logs:
INFO 01-02 02:07:12 model_runner.py:1099] Loading model weights took 16.9583 GB
(RayWorkerWrapper pid=2040) INFO 01-02 02:07:12 model_runner.py:1099] Loading model weights took 16.9687 GB [repeated 2x across cluster]
(RayWorkerWrapper pid=2031) INFO 01-02 02:07:18 model_runner.py:1099] Loading model weights took 16.9687 GB [repeated 2x across cluster]
(RayWorkerWrapper pid=2025) INFO 01-02 02:07:26 model_runner.py:1099] Loading model weights took 16.9687 GB [repeated 2x across cluster]
(RayWorkerWrapper pid=1053, ip=xxx.xxx.xx.24) INFO 01-02 02:38:57 model_runner.py:1099] Loading model weights took 20.5795 GB [repeated 3x across cluster]
(RayWorkerWrapper pid=1052, ip=xxx.xxx.xx.211) INFO 01-02 02:39:09 model_runner.py:1099] Loading model weights took 20.5795 GB [repeated 8x across cluster]
(RayWorkerWrapper pid=1057, ip=xxx.xxx.xx.52) WARNING 01-02 02:39:13 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at <vllm_install_path>/fused_moe/configs/E=256,N=256,device_name=NVIDIA_GeForce_RTX_4090,dtype=fp8_w8a8.json
(RayWorkerWrapper pid=1054, ip=xxx.xxx.xx.52) INFO 01-02 02:39:17 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250102-023917.pkl...
(RayWorkerWrapper pid=1053, ip=xxx.xxx.xx.24) INFO 01-02 02:39:17 worker.py:241] Memory profiling takes 8.36 seconds
(RayWorkerWrapper pid=1053, ip=xxx.xxx.xx.24) INFO 01-02 02:39:17 worker.py:241] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.98) = 23.17GiB
(RayWorkerWrapper pid=1053, ip=xxx.xxx.xx.24) INFO 01-02 02:39:17 worker.py:241] model weights take 20.58GiB; non_torch_memory takes 0.23GiB; PyTorch activation peak memory takes 0.40GiB; the rest of the memory reserved for KV Cache is 1.96GiB.
(RayWorkerWrapper pid=1054, ip=xxx.xxx.xx.52) ERROR 01-02 02:39:17 worker_base.py:467] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1054, ip=xxx.xxx.xx.52) ERROR 01-02 02:39:17 worker_base.py:467] Traceback (most recent call last):
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 380.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 14.81 MiB is free. Process 258800 has 23.62 GiB memory in use.
Of the allocated memory 22.83 GiB is allocated by PyTorch, and 162.87 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 01-02 02:39:18 worker.py:241] Memory profiling takes 9.27 seconds
INFO 01-02 02:39:18 worker.py:241] the current vLLM instance can use total_gpu_memory (23.64GiB) x gpu_memory_utilization (0.98) = 23.17GiB
INFO 01-02 02:39:18 worker.py:241] model weights take 16.96GiB; non_torch_memory takes 0.22GiB; PyTorch activation peak memory takes 0.37GiB; the rest of the memory reserved for KV Cache is 5.62GiB.
[rank0]: Traceback (most recent call last):
File "<vllm_bin_path>/vllm", line 8, in <module>
File "<vllm_install_path>/vllm/scripts.py", line 201, in main
...
torch.OutOfMemoryError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-023917.pkl): CUDA out of memory. Tried to allocate 380.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 14.81 MiB is free.
Process 258800 has 23.62 GiB memory in use. Of the allocated memory 22.83 GiB is allocated by PyTorch, and 162.87 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[W102 02:39:20.918569899 ProcessGroupNCCL.cpp:1250] Warning: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group.
Ask in chat: give your GPU configuration, how you start the model, and the errors, and you will receive a text answer.
Use chat.deepseek.com.
Yeah, I already tried those solutions; they are not working.
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--host 0.0.0.0 --port $PORT \
--gpu-memory-utilization 0.90 \
--max-model-len 128 \
--tensor-parallel-size 8 --pipeline-parallel-size 4 --enforce-eager \
--dtype fp8
Please just remove all the unverified suggestions. I already told you the solutions here are not working; try to give some useful feedback.
sorry
DO NOT use AI to answer questions; it's impolite and completely useless!
?
Not 100% familiar with this, but are you sure it's using all the GPUs? It looks like it's trying to put it all on one of them.
Yes, here is some additional context for this:
- I have already consulted ChatGPT, DeepSeek, and other large language models and tried various parameters (different context lengths, with and without eager mode), but the issue persists.
- Using ray status, I can see that all 32 GPUs across the four nodes are detected, and they are indeed being utilized after running the script (the checks I run are shown after this list).
- I followed the Distributed Inference and Serving instructions, and the same setup works for Qwen 2.5 72B with two nodes of 4090s.
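For reference, these are the standard checks I run on each node (nothing specific to this setup, just ray and nvidia-smi):

ray status                                              # should list 32 GPU resources across the 4 nodes
nvidia-smi --query-gpu=index,memory.used --format=csv   # per-GPU memory usage while the model loads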
the rest of the memory reserved for KV Cache is 1.96GiB

This seems way too large if the context really is set to 128 tokens. Looking through the options, the only one I can see that might be worth altering is --block-size; a rough sketch follows below.
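Something like this; I have not tried it on this setup, --block-size 8 is just a guess at a smaller value, and the rest of the flags are copied from the command above:

export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--host 0.0.0.0 --port $PORT \
--gpu-memory-utilization 0.90 \
--max-model-len 128 \
--block-size 8 \
--tensor-parallel-size 8 --pipeline-parallel-size 4 --enforce-eager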