`expert_bias`

#3
by J22 - opened

Hello. The `expert_bias` is applied only during training, as you can see in the code below.

```python
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)
bias_routing_weights = torch.sigmoid(router_logits).to(torch.float)

if self.training:
    bias_routing_weights = bias_routing_weights + self.expert_bias.to(routing_weights.device)

_, selected_experts = torch.topk(bias_routing_weights, self.top_k, dim=-1)
```
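A minimal, self-contained sketch of the effect (toy logits and a hypothetical bias vector, not values from the model): the bias only shifts which experts enter the top-k during training, while the softmax combine weights are computed from the raw logits either way.

```python
import torch
import torch.nn.functional as F

top_k = 2
# One token, four experts (toy values).
router_logits = torch.tensor([[2.0, 1.9, 0.5, 0.1]])
# Hypothetical bias that nudges under-used expert 2 upward.
expert_bias = torch.tensor([0.0, 0.0, 1.0, 0.0])

# Combine weights: plain softmax over the logits, with or without the bias.
routing_weights = F.softmax(router_logits, dim=-1, dtype=torch.float)

# Selection scores: sigmoid of the logits, plus the bias only in training mode.
scores_train = torch.sigmoid(router_logits).float() + expert_bias
scores_infer = torch.sigmoid(router_logits).float()

_, selected_train = torch.topk(scores_train, top_k, dim=-1)
_, selected_infer = torch.topk(scores_infer, top_k, dim=-1)

print(selected_train.tolist())  # → [[2, 0]]  (bias pushes expert 2 into the top-k)
print(selected_infer.tolist())  # → [[0, 1]]  (without bias, the two highest-logit experts win)
```

This illustrates why dropping the bias at inference changes only expert selection pressure, not the mixing weights applied to the selected experts' outputs.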

During inference, we allow the expert load to be imbalanced, because real-world data distributions are not uniform. This produces better results than forcibly averaging the load across experts. Furthermore, the bias obtained from the final pre-training step is not guaranteed to balance the model on a new inference distribution anyway.
