model	size	prompt	Prec.Avg	Prud.Avg	Prec.(A)	Prud.(A)	Len.(A)	Prec.(U)	Prud.(U)	Len.(U)
ByteDance/doubao-1.5-thinking-vision-pro	???	Reliable	0.642	0.005	0.754	0.006	-	0.53	0.005	-
deepseek-ai/DeepSeek-R1	671	Reliable	0.642	0.004	0.735	0	3.81k	0.549	0.007	4.40k
OpenAI/o3-mini-2025-01-31	???	Reliable	0.504	0.006	0.716	0.006	1.57k	0.293	0.005	4.20k
deepseek-ai/DeepSeek-V3	671	Reliable	0.521	0.001	0.665	0	1.34k	0.377	0.003	1.50k
OpenAI/gpt-4o-2024-08-06	???	Reliable	0.397	0.015	0.46	0.006	0.58k	0.335	0.025	0.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	32	Reliable	0.551	0.001	0.684	0	5.05k	0.418	0.002	9.40k
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	14	Reliable	0.547	0	0.629	0	6.23k	0.465	0.001	11.00k
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	7	Reliable	0.289	0	0.575	0	6.24k	0.003	0	6.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	1.5	Reliable	0.198	0	0.396	0	9.37k	0	0	9.70k
Qwen/Qwen3-235B-A22B	235	Reliable	0.621	0.001	0.767	0	5.64k	0.475	0.003	5.60k
Qwen/Qwen3-32B	32	Reliable	0.545	0	0.764	0	5.88k	0.326	0	6.00k
Qwen/Qwen3-14B	14	Reliable	0.573	0.002	0.748	0.003	5.87k	0.399	0	6.10k
Qwen/Qwen2.5-Math-7B-Instruct	7	Reliable	0.266	0	0.505	0	0.82k	0.027	0	0.90k
Qwen/Qwen2.5-Math-1.5B-Instruct	1.5	Reliable	0.218	0	0.422	0	0.74k	0.015	0	0.80k
ByteDance/doubao-seed-1.6-thinking-250615	???	Reliable	0.594	0.01	0.789	0.006	6.59k	0.398	0.014	8.45k
Anthropic/claude-sonnet-4-thinking	???	Reliable	0.52	0	0.706	0	-	0.335	0	-
deepseek-ai/DeepSeek-R1-0528	671	Reliable	0.569	0	0.767	0	8.01k	0.37	0	10.51k
Anthropic/claude-sonnet-4-20250514	???	Reliable	0.473	0	0.645	0	0.78k	0.301	0	0.82k
google/gemini-2.5-flash-preview-04-17	???	Reliable	0.518	0.001	0.706	0	0.98k	0.33	0.002	1.01k
google/gemini-2.5-flash-preview-04-17-thinking	???	Reliable	0.508	0.001	0.684	0	4.92k	0.333	0.002	6.74k