vllm

#4
by NikolaSigmoid - opened

I launched vLLM (8 x H100 SXM) using the following command to start the server:

vllm serve --model OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview --tensor-parallel-size 8 --max-model-len 16384 --max_num_seqs 1 --trust-remote-code

Then, I made a simple “Hello” request like this:

from openai import OpenAI

# Assumption: the client points at the local vLLM OpenAI-compatible server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_completion = client.chat.completions.create(
    model="OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview",
    messages=[
        {"role": "user", "content": "Hello"},
    ],
    stream=True,
)

# Print the streamed tokens as they arrive.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

However, the output I received was quite strange, and I’m unsure why this is happening. Any insights into what might be causing this anomaly would be greatly appreciated.

AAAA隨後ffi®AAAAAAAAÄAAAAAAAAICAgICAgAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAflaaaaaaaaAAAAAh–
AAAAAAAAAAAAепflaaaaAAAA–
 AAA mootÎ AAA AAAÎ巴克ÎiaisÎðStringÍABCDíð®AAA GuineaPokAAAAifiÎffffffialiíí @)]DotíÎ84ok安置куп Î HSÎÎ708414 followers Macrom Pembíð Including756HKiOS®157Indonesia Ronald HN仅有 ®ízi ÎIEEE
...
Open Platform for Enterprise AI org

I will run a test on the CPU side. Could you also test the verified prompts from the README?
Since we don't have enough CUDA resources, we cannot test it on the CUDA side.

Alright, I’ll test loading the model using Transformers, but it seems quite slow. How long did it take you to load this model with Transformers?

Open Platform for Enterprise AI org

Alright, I’ll test loading the model using Transformers, but it seems quite slow. How long did it take you to load this model with Transformers?

It takes over 20 minutes. What I suggest is testing the validated prompts in vLLM first. If the results aren't satisfactory, it might be better to try them in Transformers. If neither approach works, I suspect an overflow issue has occurred, similar to what we encountered with Qwen2.5-32B earlier (https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc). In that case we will try to quantize the model in a mixed-precision way.
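
For reference, here is a minimal sketch of the Transformers fallback, assuming the checkpoint loads with plain AutoModelForCausalLM on CPU (follow the loading recipe in the model README if it differs):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Loading all shards on CPU is slow (tens of minutes) and needs a lot of host RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="cpu",  # assumption: CPU-only validation, as done on the OPEA side
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))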

Could you please explain how you managed to quantize this model without sufficient GPU resources?

Oh no, I just waited for a long time and got this error!

Screenshot 2025-01-02 at 16.55.45.png

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 7 has a total capacity of 79.10 GiB of which 7.88 MiB is free. Process 222887 has 79.08 GiB memory in use. Of the allocated memory 56.03 GiB is allocated by PyTorch, and 22.55 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Screenshot 2025-01-02 at 17.07.20.png

Here is the result I obtained after launching with vLLM and testing your example prompt.

Open Platform for Enterprise AI org
edited 5 days ago

Could you please explain how you managed to quantize this model without sufficient GPU resources?
We have not tested it on GPU; we only tested it on CPU.

Open Platform for Enterprise AI org

Screenshot 2025-01-02 at 17.07.20.png

Here is the result I obtained after launching with vLLM and testing your example prompt.

Thank you for the information. We will verify if the input or output of certain layers of this model has exceeded the FP16 range.
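
One way to check this (a sketch, not the exact script used on our side) is to register forward hooks that record the maximum absolute input/output of every down_proj and flag anything above the FP16 limit of 65504:

import torch

FP16_MAX = 65504.0  # torch.finfo(torch.float16).max

def add_overflow_hooks(model):
    """Flag down_proj modules whose input or output exceeds the FP16 range."""
    handles = []
    for name, module in model.named_modules():
        if name.endswith("down_proj"):
            def hook(mod, inputs, output, name=name):
                in_max = inputs[0].detach().abs().max().item()
                out_max = output.detach().abs().max().item()
                if in_max > FP16_MAX or out_max > FP16_MAX:
                    print(name, in_max, out_max)
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done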

Open Platform for Enterprise AI org
edited 4 days ago

I ran a quick test:

model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
model.layers.59.mlp.experts.138.down_proj tensor(1096.) tensor(130.5271)
model.layers.60.mlp.experts.81.down_proj tensor(6016.) tensor(8290.2236)
model.layers.60.mlp.experts.92.down_proj tensor(111616.) tensor(52362.3281)
model.layers.59.mlp.experts.138.down_proj tensor(1056.) tensor(125.2802)
model.layers.60.mlp.experts.81.down_proj tensor(5184.) tensor(7294.0933)
model.layers.60.mlp.experts.92.down_proj tensor(108032.) tensor(51036.6992)
model.layers.60.mlp.experts.81.down_proj tensor(4352.) tensor(6245.4785)
model.layers.60.mlp.experts.92.down_proj tensor(101888.) tensor(48230.1445)
model.layers.59.mlp.experts.138.down_proj tensor(1064.) tensor(124.9290)
model.layers.60.mlp.experts.81.down_proj tensor(5920.) tensor(8268.7275)
model.layers.60.mlp.experts.92.down_proj tensor(110592.) tensor(52426.2188)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.81.down_proj tensor(5472.) tensor(7656.2739)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(50818.8711)
model.layers.60.mlp.experts.81.down_proj tensor(5760.) tensor(7966.4805)
model.layers.60.mlp.experts.92.down_proj tensor(107008.) tensor(51374.0078)
model.layers.60.mlp.experts.81.down_proj tensor(6688.) tensor(9049.8135)
model.layers.60.mlp.experts.92.down_proj tensor(117760.) tensor(55190.7734)

Maybe we need to exclude all the down_proj modules in the last layer from quantization.
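
If that turns out to be the cause, one way to collect the candidate modules to keep in higher precision (a sketch; how the list is consumed depends on the quantization tool's skip/exclude option) is:

# `model` is the loaded DeepSeek-V3 model; its last decoder layer is index 60 (layers 0..60).
last_layer = 60
skip_modules = [
    name
    for name, _ in model.named_modules()
    if name.startswith(f"model.layers.{last_layer}.mlp") and name.endswith("down_proj")
]
print(skip_modules)  # feed these names to the quantizer's exclude / fallback-to-FP16 option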

Thank you so much! Could you please prioritize this? I’m eagerly anticipating the INT4 version of this model for deployment!

Open Platform for Enterprise AI org
edited 4 days ago

Thank you so much! Could you please prioritize this? I’m eagerly anticipating the INT4 version of this model for deployment!

working on it.

I get this error when starting vLLM. Can you help me?

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-133701.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False

Your command?

vllm serve --max-model-len 16384 --max_num_seqs 1 --trust_remote_code --tensor-parallel-size 4 OPEA/DeepSeek-V3-int4-sym-inc-cpu

Open Platform for Enterprise AI org

@NikolaSigmoid we have added a workaround for the overflow issue. Please try the latest model. We have validated it with Transformers, but we don't have enough resources to validate it on vLLM.
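
If you still have an older copy in the local Hugging Face cache, something like this re-syncs the repo before restarting vLLM (a sketch; adjust the repo id to the checkpoint you are serving):

from huggingface_hub import snapshot_download

# Download any files that changed since the last pull of the repo.
snapshot_download(repo_id="OPEA/DeepSeek-V3-int4-sym-gptq-inc-preview")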

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

@NikolaSigmoid I'm also working with vLLM. Maybe you could help me figure some things out :) I have access only to A100s: 9 nodes with 2 GPUs each (so 18 A100 80 GB GPUs, 1440 GB in total). I was trying to serve a bf16 version I found here on HF, but I am getting CUDA OOM, even though 685B params should be somewhere around 1350 GB plus some overhead. Any thoughts? I am also trying to offload to CPU, but that's not working either...

vllm serve opensourcerelease/DeepSeek-V3-bf16 \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 5000 \
    --gpu-memory-utilization 0.7 \
    --cpu-offload-gb 540 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 9 \
    --trust-remote-code

any thoughts? pls help :(
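
For reference, a rough back-of-the-envelope check of the weight memory alone (a sketch; ignores KV cache, activations, and CUDA graph overhead):

# Weight memory of a bf16 DeepSeek-V3 checkpoint vs. the GPU memory vLLM is allowed to use.
params = 685e9                     # ~685B parameters
weight_gib = params * 2 / 1024**3  # bf16 = 2 bytes/param -> ~1276 GiB
total_gpu_gib = 18 * 80            # 18 x A100 80GB
usable_gib = total_gpu_gib * 0.7   # --gpu-memory-utilization 0.7 -> ~1008 GiB
print(f"weights ~{weight_gib:.0f} GiB, usable GPU memory ~{usable_gib:.0f} GiB")
# Weights alone already exceed the usable budget, so it OOMs unless enough is offloaded to CPU.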

Open Platform for Enterprise AI org

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

Thank you for the information. I guess the issue might be related to the Marlin kernel they used. Unfortunately, we don't have enough GPUs to test it ourselves. For now, you can try using this model with Transformers or on a CPU. If you're unable to reproduce the results, please let us know.

Open Platform for Enterprise AI org

Screenshot 2025-01-05 at 15.18.55.png
Still not working =))

Another workaround you could try is changing the code at https://huggingface.co/OPEA/DeepSeek-V3-int4-sym-gptq-inc/blob/main/modeling_deepseek.py#L389 to

down_proj = self.down_proj((self.act_fn(self.gate_proj(x))/2.0) * (self.up_proj(x))/2.0)*4.0

We have validated this and achieved similar results in Transformers. You could also change the 2.0, 2.0, 4.0 to 4.0, 4.0, 16.0.
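
For context, the suggested change sits in the MLP forward of modeling_deepseek.py (class and attribute names as in the DeepSeek-V3 remote code; a sketch, not the full file). Since down_proj is linear, scaling its input down and its output back up by the same overall factor is a mathematical no-op, but it keeps the intermediate elementwise product inside the FP16 range:

import torch.nn as nn

class DeepseekV3MLP(nn.Module):
    # __init__ unchanged: it defines self.gate_proj, self.up_proj, self.down_proj, self.act_fn

    def forward(self, x):
        # Original: down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        # Workaround: divide by 2.0 twice before the product reaches down_proj, then multiply
        # the (linear) projection's output by 4.0 to restore the original scale.
        down_proj = self.down_proj((self.act_fn(self.gate_proj(x))/2.0) * (self.up_proj(x))/2.0)*4.0
        return down_proj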
