Training problem

#29
by DonGan13 - opened

Good afternoon. Does anyone here have experience with neural network training? I’m working on an architecture for audio completion based on DeepSeek V3 (deepseek-ai/DeepSeek-V3), and I’ve run into the following issue. In the modeling code, in the DeepSeekV3MoE class, the forward method contains a condition if not self.training: in eval mode the routed experts are applied through moe_infer, which runs under no_grad, but in training mode only a single shared expert is used.

Here’s my question: is this expected behavior? I’m concerned that the routed experts and the gate don’t seem to influence the output during training. What should I do?
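
To make the behavior I’m describing concrete, here is a minimal, self-contained sketch of that forward pattern. The class name, the toy top-1 routing, and the single-Linear “experts” are my own simplifications for illustration; this is not the actual DeepSeek V3 modeling code.

```python
import torch
import torch.nn as nn

class MoEForwardSketch(nn.Module):
    # Simplified stand-in for the pattern in question: routed experts only run
    # in eval mode (under no_grad), the shared expert always runs.
    def __init__(self, hidden_size=8, n_routed_experts=4):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_routed_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(n_routed_experts)]
        )
        self.shared_experts = nn.Linear(hidden_size, hidden_size)

    @torch.no_grad()  # the routed-expert path runs without gradients
    def moe_infer(self, x, topk_idx, topk_weight):
        # toy top-1 routing: send each token to its highest-scoring expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).squeeze(-1)
            if mask.any():
                out[mask] = expert(x[mask]) * topk_weight[mask]
        return out

    def forward(self, hidden_states):
        identity = hidden_states
        topk_weight, topk_idx = self.gate(hidden_states).softmax(-1).max(-1, keepdim=True)
        y = torch.zeros_like(hidden_states)
        if not self.training:
            # gate scores and routed experts only contribute in eval mode
            y = self.moe_infer(hidden_states, topk_idx, topk_weight)
        # in training mode, only this shared expert contributes to the output
        y = y + self.shared_experts(identity)
        return y


moe = MoEForwardSketch()
x = torch.randn(2, 5, 8)
moe.train()
moe(x).sum().backward()
print(moe.gate.weight.grad)            # None: the gate never entered the graph
print(moe.shared_experts.weight.grad)  # populated: only the shared expert trains
```

The gradient check at the end is just to illustrate my concern: in train() mode only the shared expert ends up in the autograd graph.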

DonGan13 changed discussion title from Training model to Training problem

[Attached screenshots: Screenshot from 2025-01-02 13-03-01.png, Screenshot from 2025-01-02 13-03-06.png]

I was asking for an answer from a knowledgeable person, not from some machine.

We know now what you are asking.
