Tool Use Benchmarks

#2
by louis-szeto - opened

Hi! Thanks for the great work. Do you also have the benchmark scores for tool use and browser comp? MiniMax M2 is targeted to be the best agent with a smaller size, so it would be helpful if these benchmarks are provided for reference as well.

And also, do you have the recommended sampling parameters?

Cerebras org

@louis-szeto we're working on adding more tool calling / agentic evals for this model, just added tau2-bench telecom and BFCLv3. note that not discarding the think traces (via --reasoning-parser minimax_m2_append_think in vLLM) does boost performance at the expense of larger KV cache. on tau2-bench telecom in particular, it's a boost of a few percentage points, we're going to add that soon.

for sampling, we're following the original MiniMax guide (temperature=1.0, top_p = 0.95, top_k = 40). we tried greedy with tool calling and it's considerably worse.

Sign up or log in to comment