`expert_bias`

#3
by J22 - opened

Hello. The `expert_bias` is applied only during training, as you can see in the code below.

```python
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
bias_routing_weights = torch.sigmoid(router_logits).to(torch.float)

if self.training:
    bias_routing_weights = bias_routing_weights + self.expert_bias.to(routing_weights.device)

_, selected_experts = torch.topk(bias_routing_weights, self.top_k, dim=-1)
```
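A minimal, self-contained sketch of the effect (toy logits and a hypothetical bias vector, not values from the model): the bias only shifts which experts enter the top-k during training, while the softmax combine weights are computed from the raw logits either way.

```python
import torch
import torch.nn.functional as F

top_k = 2
# One token, four experts (toy values).
router_logits = torch.tensor([[2.0, 1.9, 0.5, 0.1]])
# Hypothetical bias that nudges under-used expert 2 upward.
expert_bias = torch.tensor([0.0, 0.0, 1.0, 0.0])

# Combine weights: plain softmax over the logits, with or without the bias.
routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)

# Selection scores: sigmoid of the logits, plus the bias only in training mode.
scores_train = torch.sigmoid(router_logits).float() + expert_bias
scores_infer = torch.sigmoid(router_logits).float()

_, selected_train = torch.topk(scores_train, top_k, dim=-1)
_, selected_infer = torch.topk(scores_infer, top_k, dim=-1)

print(selected_train.tolist())  # → [[2, 0]]  (bias pushes expert 2 into the top-k)
print(selected_infer.tolist())  # → [[0, 1]]  (without bias, the two highest-logit experts win)
```

This illustrates why dropping the bias at inference changes only expert selection pressure, not the mixing weights applied to the selected experts' outputs.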

During inference, we allow the expert load to be imbalanced, because real-world data distributions are not uniform. This produces better results than forcibly averaging the load across experts. Furthermore, the bias obtained from the final pre-training step is not guaranteed to balance the model on a new inference distribution anyway.
