Tool Use Benchmarks
Hi! Thanks for the great work. Do you also have benchmark scores for tool use and BrowseComp? MiniMax M2 aims to be the best agent model at a smaller size, so it would be helpful to have these benchmarks for reference as well.
Also, do you have recommended sampling parameters?
@louis-szeto
We're working on adding more tool-calling / agentic evals for this model, and just added tau2-bench Telecom and BFCL v3. Note that keeping the think traces in context instead of discarding them (via `--reasoning-parser minimax_m2_append_think` in vLLM) does boost performance, at the expense of a larger KV cache. On tau2-bench Telecom in particular, the boost is a few percentage points; we're going to add those results soon.
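For reference, here is a minimal sketch of how the vLLM launch command with that parser could be assembled. Only the `--reasoning-parser minimax_m2_append_think` flag comes from the note above; the model ID is a placeholder and `build_serve_command` is a hypothetical helper, so adjust both to your setup.

```python
import shlex

# Placeholder: replace with the checkpoint you are actually serving.
MODEL_ID = "your-org/your-minimax-m2-checkpoint"


def build_serve_command(model_id):
    """Return the argv list for `vllm serve` with think traces kept in context."""
    return [
        "vllm", "serve", model_id,
        # Parser named in the reply above; it keeps the think traces in the output.
        "--reasoning-parser", "minimax_m2_append_think",
    ]


if __name__ == "__main__":
    # Print a shell-safe command line that can be copied into a terminal.
    print(shlex.join(build_serve_command(MODEL_ID)))
```

Any other flags your deployment needs (parallelism, context length, and so on) would be appended to the same list.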
For sampling, we're following the original MiniMax guide (`temperature=1.0`, `top_p=0.95`, `top_k=40`). We tried greedy decoding with tool calling and it was considerably worse.
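As a rough sketch, these settings could be passed to a vLLM OpenAI-compatible endpoint like this. The three sampling values are the ones quoted above; the helper names, base URL, and model ID are assumptions for illustration. `top_k` is not part of the standard OpenAI schema, and this assumes the server accepts it as an extra field in the request body, as vLLM does.

```python
import json
import urllib.request

# Sampling values from the MiniMax guide quoted above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_payload(model_id, messages, tools=None):
    """Build a chat-completions request body with the recommended sampling."""
    payload = {"model": model_id, "messages": messages, **SAMPLING}
    if tools:
        # Tool schemas are passed through unchanged when provided.
        payload["tools"] = tools
    return payload


def send_chat(base_url, payload):
    """POST the payload to an OpenAI-compatible server (assumed to be running)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `send_chat("http://localhost:8000", build_chat_payload(model_id, messages))` would issue one request, assuming a server is listening at that address.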