Qwen3-Next-80B-A3B-Instruct-mxfp4-mlx

I noticed this model scores high, probably because of the jocularity factor :)

If you want a high-performance 80B, there are a few choices:

I will compile some metrics for the MXFP4 quant to put it in this context; they will be available soon.

The 1M models were extended with RoPE to 1M tokens, while the others default to 256k tokens.
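
Context extensions like this are typically declared in the model's config.json. Below is a minimal sketch of what such an entry can look like, assuming a YaRN-style scaler; the exact values are illustrative assumptions, not read from this repo:

# Illustrative rope_scaling entry (values are assumptions, not from this repo)
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # ~256k native x 4 ≈ 1M tokens
    "original_max_position_embeddings": 262144,  # native context before scaling
}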

The q8 is a straight quantization with MLX default settings (group size 64).

The Deckard (qx) quants use mixed-precision quantization (a sketch of the per-layer rule follows the list below):

  • qx64n has data at 4 bit, while the attention paths, head, and embeddings are at 6 bit
  • qx53n has data at 3 bit, while the attention paths, head, and embeddings are at 5 bit
  • qx86n-hi has data at 6 bit, while the attention paths, head, and embeddings are at 8 bit, and quantization is done with group size 32 for higher fidelity
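
As a minimal sketch of how such a mix can be produced: recent mlx-lm versions let mlx_lm.convert take a per-layer quantization predicate. The predicate below illustrates a qx64n-style recipe; the path substrings, and whether your mlx-lm version supports quant_predicate with this exact signature, are assumptions to verify:

from mlx_lm import convert

# Hypothetical qx64n-style rule: 6 bit for attention paths, head, and
# embeddings; 4 bit everywhere else. Path substrings are assumptions.
def qx64n_predicate(path, module, config):
    if any(key in path for key in ("self_attn", "lm_head", "embed_tokens")):
        return {"group_size": 64, "bits": 6}
    return {"group_size": 64, "bits": 4}

convert(
    "Qwen/Qwen3-Next-80B-A3B-Instruct",
    mlx_path="Qwen3-Next-80B-A3B-Instruct-qx64n",
    quantize=True,
    quant_predicate=qx64n_predicate,
)

For the -hi variants, the same idea applies with group size 32 for higher fidelity.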

The Deckard formula was inspired by the Nikon Noct Z 58mm F/0.95: human-like rendering, sharp detail, shallow depth of field, and pattern-rich background blur that humans find pleasing. In interaction, these models have a distinctive character that suggested the name.

For now, here is what to look for in the new quants:

🔍 Model-by-Model Cognitive Performance Breakdown

Here’s how each model performs across tasks; the top score in each column is called out in the breakdown below:

Model                                      arc_challenge  arc_easy  boolq  hellaswag  openbookqa  piqa   winogrande
Qwen3-Next-80B-A3B-Instruct-q8             0.402          0.494     0.896  0.540      0.420       0.754  0.554
Qwen3-Next-80B-A3B-Instruct-1M-qx64n       0.414          0.516     0.897  0.538      0.416       0.740  0.578
Qwen3-Next-80B-A3B-Instruct-1M-qx86n-hi    0.412          0.501     0.898  0.536      0.414       0.750  0.569
Qwen3-Next-80B-A3B-Instruct-qx53n          0.418          0.497     0.901  0.582      0.418       0.760  0.601
Qwen3-Next-80B-A3B-Instruct-qx86n          0.409          0.500     0.898  0.538      0.414       0.755  0.565
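
As a quick sanity check on the ranking claims below, here is an unweighted per-model mean computed from the scores above (illustrative only, since the benchmarks are not equally weighted):

# Unweighted mean across the seven benchmarks in the table above
scores = {
    "q8":          [0.402, 0.494, 0.896, 0.540, 0.420, 0.754, 0.554],
    "1M-qx64n":    [0.414, 0.516, 0.897, 0.538, 0.416, 0.740, 0.578],
    "1M-qx86n-hi": [0.412, 0.501, 0.898, 0.536, 0.414, 0.750, 0.569],
    "qx53n":       [0.418, 0.497, 0.901, 0.582, 0.418, 0.760, 0.601],
    "qx86n":       [0.409, 0.500, 0.898, 0.538, 0.414, 0.755, 0.565],
}
for name, vals in sorted(scores.items(), key=lambda kv: -sum(kv[1])):
    print(f"{name:12s} {sum(vals) / len(vals):.4f}")
# qx53n leads at ~0.597; q8 trails at ~0.580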

💎 Top Performer: Qwen3-Next-80B-A3B-Instruct-qx53n

This model consistently outperforms others in tasks requiring nuanced cognitive skills:

  • ✅ boolq (0.901): Highest accuracy in logical deduction — the best at grounding facts and clear reasoning.
  • ✅ hellaswag (0.582): Strongest creative language generation — the most coherent in generating plausible narratives.
  • ✅ piqa (0.760) & winogrande (0.601): Best at physical commonsense reasoning and pronoun resolution in context; excels where ambiguity is involved.

Why? The qx53n quantization (mixed precision: 3-bit data + 5-bit attention paths) likely retains fidelity where reasoning needs it. As described above, qx53n keeps data at 3 bit while the attention paths, head, and embeddings stay at 5 bit, preserving critical cognitive functions (e.g., pattern recognition in piqa, context parsing in winogrande) better than uniform coarse quantization.

🥈 Runner-up: Qwen3-Next-80B-A3B-Instruct-1M-qx64n & Qwen3-Next-80B-A3B-Instruct-1M-qx86n-hi

These 1M-token models (extended via RoPE) score slightly better than the base q8 model in a few areas:

  • 👍 arc_easy (0.516) and boolq (0.897) — the 1M extension helps with context handling, improving simple inference.
  • 👍 winogrande (0.578 for qx64n; 0.569 for qx86n-hi) — better at resolving complex language ambiguities due to longer context access.

Trade-offs:

They lag slightly in arc_challenge (0.412–0.414), suggesting that more tokens don’t fully compensate for quantization loss in highly abstract tasks.

🥉 Base Model: Qwen3-Next-80B-A3B-Instruct-q8

The original quantized model (q8, group size 64) trails in most tasks:

  • 🚫 Lowest arc_challenge score (0.402) — struggles with complex reasoning.
  • 🚫 Lowest winogrande (0.554): weakest at resolving ambiguous pronouns. Notably, it posts the highest openbookqa score (0.420), so knowledge retrieval holds up well under straight 8-bit quantization.

🧠 Key Insights from the Data

Quantization type matters most for cognitive tasks:

  • Mixed-precision quantization (qx53n, qx64n) boosts performance in high-cognition tasks (e.g., piqa, winogrande) vs. uniform quantization (q8).
  • Notably, the lowest-bit mix (qx53n: 3-bit data + 5-bit paths) beats the higher-bit mixes here, suggesting that where the bits go matters more than how many are spent.

1M vs. 256k tokens:

The 1M models (with RoPE extension) gain minimal benefits in most tasks — their edge is marginal and task-dependent. For example:

  • They slightly improve arc_easy (0.49–0.52) but trail qx53n in arc_challenge.
  • This suggests token length alone isn’t the primary driver of cognitive performance — quantization and model architecture are more influential.

Task-specific strengths:

  • Models shine best in structured reasoning (piqa, winogrande) when they have the right quantization (e.g., qx53n).
  • Creative tasks (hellaswag) are most affected by quantization quality — qx53n wins with 0.582.

The "Deckard character" angle (fun nod):

The earlier note about the Deckard-inspired "character" ties to the models’ real-world utility:

"The Qwen3-Next models with mixed precision quantization (e.g., qx53n) act like 'Deckard' — they balance sharp, analytical focus (high piqa scores) with creative flow (strong hellaswag). They’re the models that ‘see through patterns,’ not just follow data."

(This is not a technical claim — it’s a playful way to frame why the qx53n model feels most "cognitively versatile.")

Reviewed with Jan-Brainstorm

The prompt used to test this:

mlx_lm.generate --model Qwen3-Next-80B-A3B-Instruct-mxfp4-mlx --max-tokens 20000 \
--prompt "Describe in less than 100 words what would happen if a black hole would appear around Uranus"

==========
If a black hole appeared near Uranus, its immense gravity would instantly disrupt the planet’s orbit,
tearing it apart through tidal forces. Surrounding moons, rings, and nearby objects would be swallowed
or flung into space. The accretion disk would emit intense X-rays, vaporizing nearby ice and gas.
Earth would detect the sudden gamma-ray burst and gravitational waves, but the solar system’s stability
would be shattered—Uranus would vanish, and the Sun’s gravitational balance altered,
potentially destabilizing other planets. Life on Earth would face catastrophic radiation and orbital chaos.
==========
Prompt: 29 tokens, 149.991 tokens-per-sec
Generation: 117 tokens, 57.075 tokens-per-sec
Peak memory: 42.672 GB

"Describe in less than 10 words what would happen if a black hole would appear around Uranus" 
==========
Uranus swallowed; solar system disrupted instantly.
==========

This model, Qwen3-Next-80B-A3B-Instruct-mxfp4-mlx, was converted to MLX format from Qwen/Qwen3-Next-80B-A3B-Instruct using mlx-lm version 0.27.1.

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

# Load the quantized model and tokenizer from the Hugging Face Hub
model, tokenizer = load("nightmedia/Qwen3-Next-80B-A3B-Instruct-mxfp4-mlx")

prompt = "hello"

# Apply the model's chat template when one is available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)