model size prompt Prec.Avg Prud.Avg Prec.(A) Prud.(A) Len.(A) Prec.(U) Prud.(U) Len.(U)
ByteDance/doubao-1.5-thinking-vision-pro ??? Reliable 0.642 0.005 0.754 0.006 - 0.53 0.005 -
deepseek-ai/DeepSeek-R1 671 Reliable 0.642 0.004 0.735 0 3.81k 0.549 0.007 4.40k
OpenAI/o3-mini-2025-01-31 ??? Reliable 0.504 0.006 0.716 0.006 1.57k 0.293 0.005 4.20k
deepseek-ai/DeepSeek-V3 671 Reliable 0.521 0.001 0.665 0 1.34k 0.377 0.003 1.50k
OpenAI/gpt-4o-2024-08-06 ??? Reliable 0.397 0.015 0.46 0.006 0.58k 0.335 0.025 0.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B 32 Reliable 0.551 0.001 0.684 0 5.05k 0.418 0.002 9.40k
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B 14 Reliable 0.547 0 0.629 0 6.23k 0.465 0.001 11.00k
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 7 Reliable 0.289 0 0.575 0 6.24k 0.003 0 6.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B 1.5 Reliable 0.198 0 0.396 0 9.37k 0 0 9.70k
Qwen/Qwen3-235B-A22B 235 Reliable 0.621 0.001 0.767 0 5.64k 0.475 0.003 5.60k
Qwen/Qwen3-32B 32 Reliable 0.545 0 0.764 0 5.88k 0.326 0 6.00k
Qwen/Qwen3-14B 14 Reliable 0.573 0.002 0.748 0.003 5.87k 0.399 0 6.10k
Qwen/Qwen2.5-Math-7B-Instruct 7 Reliable 0.266 0 0.505 0 0.82k 0.027 0 0.90k
Qwen/Qwen2.5-Math-1.5B-Instruct 1.5 Reliable 0.218 0 0.422 0 0.74k 0.015 0 0.80k
ByteDance/doubao-seed-1.6-thinking-250615 ??? Reliable 0.594 0.01 0.789 0.006 6.59k 0.398 0.014 8.45k
Anthropic/claude-sonnet-4-thinking ??? Reliable 0.52 0 0.706 0 - 0.335 0 -
deepseek-ai/DeepSeek-R1-0528 671 Reliable 0.569 0 0.767 0 8.01k 0.37 0 10.51k
Anthropic/claude-sonnet-4-20250514 ??? Reliable 0.473 0 0.645 0 0.78k 0.301 0 0.82k
google/gemini-2.5-flash-preview-04-17 ??? Reliable 0.518 0.001 0.706 0 0.98k 0.33 0.002 1.01k
google/gemini-2.5-flash-preview-04-17-thinking ??? Reliable 0.508 0.001 0.684 0 4.92k 0.333 0.002 6.74k