Where's the knowledge?
This model has OK knowledge, but not for its size. A 456-BILLION-total-parameter model should have a much higher SimpleQA score than 18.5. Even tiny little non-thinking Llama 3 70B scores higher on SimpleQA.
Also, where's the "thinking"? If models like this one were actually thinking, then their performance across all cognitive tasks, such as writing poems and jokes, would improve. But the improvements are almost exclusively seen in a handful of overfit domains, particularly coding and math.
If you had ended this model's training on trillions of poem and joke tokens instead of coding and math tokens, then this "thinking" model would produce far better poems and jokes that align more closely with users' prompts, while performing far worse on math and coding tests.
I get the sense that the entire AI industry is in 'fake it until you make it' mode: pretend to be making gains by grossly overfitting a handful of tasks, especially coding, math, and STEM knowledge, so that scores like LiveCodeBench, MATH-500, and MMLU creep up, all while general knowledge and abilities regress. We've all known since day one that if you end training on trillions of tokens from a select domain, performance in that domain will increase, but that doing so starts scrambling the previously trained weights, causing an across-the-board regression in general knowledge and abilities.
Profit makes it hard to resist the urge to hype.
That's an insightful point. Another factor could be that while creative tasks like writing poems and jokes are more reflective of daily life, it's inherently difficult to establish quantitative benchmarks for a model's "wit", given the lack of universal standards in such subjective domains. Today, models are most often evaluated on highly challenging academic contests. Perhaps agentic benchmarks like SWE-bench and tau-bench could encourage a more diversified evaluation approach.
I agree that it's far easier to test domains like coding and math than poems and jokes, primarily because the outputs of good poems and jokes are unpredictable, complex, and diverse, while the outputs of good math converge on a single correct result, and good code has to compile and execute the desired function, making them far easier to evaluate. A rough sketch of that pass/fail style of grading is below.
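To illustrate what I mean (a minimal, made-up sketch, not any benchmark's actual harness; the task, tests, and function names here are hypothetical), grading code really can be reduced to running a candidate against test cases and counting passes:

```python
# Toy illustration of pass/fail code grading (hypothetical task and tests,
# roughly the idea behind harnesses like LiveCodeBench, not their actual code).

def candidate_solution(nums):
    """A model-generated answer to: 'return the sum of the even numbers'."""
    return sum(n for n in nums if n % 2 == 0)

# Hand-written expected outputs: the single correct result for each input.
TEST_CASES = [
    ([1, 2, 3, 4], 6),
    ([], 0),
    ([7, 9], 0),
]

def score(fn) -> float:
    """Fraction of test cases passed: one objective, reproducible number."""
    passed = sum(1 for inp, expected in TEST_CASES if fn(inp) == expected)
    return passed / len(TEST_CASES)

print(score(candidate_solution))  # 1.0
```

There's no equivalent one-liner for deciding whether a joke lands.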
However, there are still objective measures for poetry, such as adherence to rhyme and meter (a rough sketch of such a check is just below), and by these standards the poem writing of these models is horrific. By more subjective, harder-to-evaluate criteria, such as adherence to prompt directives while telling a coherent story using apt symbolism, metaphor, humor... they're absolutely abysmal.
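For the record, by "objective measures" I mean checks this mechanical (a crude toy heuristic of my own, not any real eval; a serious version would use a pronunciation dictionary like CMUdict instead of spelling-based guesses):

```python
import re

def syllables(word: str) -> int:
    """Crude syllable estimate: count vowel groups (a real check would use CMUdict)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def rhymes(a: str, b: str) -> bool:
    """Crude end-rhyme test: same spelling from the last vowel group onward."""
    def tail(w: str) -> str:
        m = re.search(r"[aeiouy]+[^aeiouy]*$", w.lower())
        return m.group() if m else w.lower()
    return tail(a) == tail(b)

def check_couplet(line1: str, line2: str, target_syllables: int = 8) -> dict:
    """Score a couplet on two objective axes: end rhyme and per-line syllable count."""
    last_word = lambda line: line.rstrip(".,!?;:").split()[-1]
    return {
        "end_rhyme": rhymes(last_word(line1), last_word(line2)),
        "meter_ok": all(
            sum(syllables(w) for w in line.split()) == target_syllables
            for line in (line1, line2)
        ),
    }

print(check_couplet("The cat sat on the dusty mat",
                    "then slept all day upon a hat"))
# {'end_rhyme': True, 'meter_ok': True}
```

Even a blunt instrument like this is enough to flag the broken rhyme schemes and uneven line lengths I'm complaining about.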
Regardless, why do the tests matter? If you're making an agent, such as a math or coding model, then ending training on trillions of math or coding tokens makes perfect sense. But when making a general-purpose AI model, doing the same doesn't make a lick of sense; a diverse corpus that retains a balance of humanity's most popular data across all domains is absolutely essential.
When models are produced that can compete in the Math Olympiad, yet can't write poems and jokes as well as young children who were repeatedly dropped on their heads, and score very low on broad knowledge tests like SimpleQA despite having hundreds of billions of parameters, then what's being produced are grossly overfit agents, not general-purpose AI models.
I'm not trying to single this model out. In fact, Alibaba's Qwen2.5/3, Microsoft's Phi series... are even more guilty of this. And I'm not sure why I'm bothering to bitch about it, because it's likely too late. At this point, any attempt to stop grossly overfitting math, coding, and STEM and to start training general-purpose AI models across all popular domains equally will inevitably cause a drop in coding ability, and this community of mostly autistic coding nerds will respond by flooding X, YouTube... with tirades about how awful the model is. So my advice to you is to not take my advice and to continue grossly overfitting math, coding, and STEM. I just have to accept that the open-source AI community has largely given up on making general-purpose AI models post-Llama 3.1.