`expert_bias`
#3 · opened by J22
Why is `expert_bias` not used?
https://huggingface.co/inclusionAI/GroveMoE-Inst/blob/main/modeling_grove_moe.py#L303
Hello. The expert_bias is used only during training, as you can see in the code below.
```python
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
bias_routing_weights = torch.sigmoid(router_logits).to(torch.float)
if self.training:
    # The load-balancing bias perturbs expert selection only in training mode.
    bias_routing_weights = bias_routing_weights + self.expert_bias.to(routing_weights.device)
_, selected_experts = torch.topk(bias_routing_weights, self.top_k, dim=-1)
```
During inference, we drop the bias and allow the expert load to be imbalanced, because real-world data distributions are not uniform. This produces better results than forcing an even load across experts. Furthermore, the bias obtained at the final pre-training step is not guaranteed to balance the model on a new inference distribution.
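To make the mechanism concrete, here is a minimal, self-contained sketch of the selection logic above. The logits and the bias tensor are made up for illustration (they are not the model's learned values), and the exaggerated bias on expert 0 is only there to make its effect visible: the bias shifts which experts are selected in training mode, while inference-time selection ignores it entirely.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_experts, top_k = 8, 2
logits = torch.randn(4, num_experts)  # stand-in for self.gate(hidden_states)

# Mixing weights always come from the plain softmax of the logits.
routing_weights = F.softmax(logits, dim=-1, dtype=torch.float)

# Hypothetical per-expert bias; exaggerated so the effect is obvious.
expert_bias = torch.zeros(num_experts)
expert_bias[0] = 10.0

def select_experts(logits, training):
    # Selection scores use a sigmoid; the bias is added only in training mode.
    scores = torch.sigmoid(logits).to(torch.float)
    if training:
        scores = scores + expert_bias
    return torch.topk(scores, top_k, dim=-1).indices

train_sel = select_experts(logits, training=True)
infer_sel = select_experts(logits, training=False)

# With the large bias, expert 0 appears in every training-mode selection...
assert (train_sel == 0).any(dim=-1).all()
# ...while inference-mode selection is just top-k over the unbiased sigmoid.
assert torch.equal(
    infer_sel, torch.topk(torch.sigmoid(logits), top_k, dim=-1).indices
)
```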