Tool Use Benchmarks

by louis-szeto - opened 7 days ago

7 days ago

•

Hi! Thanks for the great work. Do you also have the benchmark scores for tool use and browser comp? MiniMax M2 is targeted to be the best agent with a smaller size, so it would be helpful if these benchmarks are provided for reference as well.

And also, do you have the recommended sampling parameters?

lazarevich

Cerebras org 5 days ago

@louis-szeto we're working on adding more tool calling / agentic evals for this model, just added tau2-bench telecom and BFCLv3. note that not discarding the think traces (via --reasoning-parser minimax_m2_append_think in vLLM) does boost performance at the expense of larger KV cache. on tau2-bench telecom in particular, it's a boost of a few percentage points, we're going to add that soon.

for sampling, we're following the original MiniMax guide (temperature=1.0, top_p = 0.95, top_k = 40). we tried greedy with tool calling and it's considerably worse.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment