Training problem
#29 opened by DonGan13
Good afternoon. Does anyone here know about neural network training? I'm working on an architecture for audio completion based on DeepSeek V3 (deepseek-ai/DeepSeek-V3), and I've run into the following issue: in the DeepSeekV3MoE class, the forward method has an if not self.training condition under which the routed experts are applied via moe_infer, which carries a no_grad annotation; but when the model is in training mode, only the shared expert path is used.
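To make the control flow I mean concrete, here is a minimal toy module I wrote that reproduces the pattern as I understand it. This is my own sketch, not the real DeepSeek code: the class name ToyMoE, the linear "experts", and the simplified moe_infer are placeholders, only the training/eval branching mirrors what I see in the repository.

```python
import torch
import torch.nn as nn


class ToyMoE(nn.Module):
    """Toy stand-in that mimics the branching I am asking about.
    Not the real DeepSeek V3 implementation."""

    def __init__(self, dim: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.shared_experts = nn.Linear(dim, dim)
        self.top_k = top_k

    @torch.no_grad()  # inference-only path: no gradients are tracked here
    def moe_infer(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x).softmax(dim=-1)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)
        y = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e
                if mask.any():
                    y[mask] += topk_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return y

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        if not self.training:
            # routed experts only run in eval mode, and under no_grad
            y = self.moe_infer(x)
        else:
            # in training mode the routed experts are skipped entirely
            y = torch.zeros_like(x)
        # the shared expert contributes in both modes
        return y + self.shared_experts(identity)


if __name__ == "__main__":
    moe = ToyMoE(dim=8)
    x = torch.randn(3, 8)
    moe.train()
    print(moe(x).shape)  # routed experts and gate do not contribute here
    moe.eval()
    print(moe(x).shape)  # routed experts run, but inside no_grad
```

Running the toy module in train() versus eval() shows what I mean: the routed-expert path only executes in eval mode.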
Here's my question: is this normal? I'm concerned that the experts and the gate don't seem to influence the result during training. What should I do?
DonGan13 changed discussion title from Training model to Training problem
I was asking for an answer from a knowledgeable person, not from some machine.
We now know what you are asking.