These GGUF files are to be used with llama.cpp PR 16095.
Update:
I have tested some of the smaller quantizations on an NVIDIA L40S GPU, with a default CUDA compile of the excellent release from @cturan.
Since the L40S has 48GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4_MXFP4_MOE,
but Q4_K_M was too big. It still works with -ngl 45 (offloading fewer layers to the GPU), but it slowed down quite a bit.
There may be a better way, but I did not have time to test.
I was able to get a good speed of 53 tokens per second for generation and 800 tokens per second for prompt processing.
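A partial-offload run would look something like this (the Q4_K_M file name assumes the same naming pattern as the other quantizations in this repo):
# Q4_K_M does not fit fully in 48GB VRAM, so offload only 45 layers to the GPU
build/bin/llama-cli -ngl 45 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf --no-mmap --prompt 'What is the capital of France?' -st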
wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
tar xf test.tar.gz
cd llama.cpp-test
# export PATH=/usr/local/cuda/bin:$PATH
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
You may need to add /usr/local/cuda/bin to your PATH so the build can find nvcc (the NVIDIA CUDA compiler).
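If you are not sure, you can check for nvcc before configuring (assuming a default CUDA install under /usr/local/cuda):
command -v nvcc || export PATH=/usr/local/cuda/bin:$PATH
nvcc --version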
Building from source took about 7 minutes.
For more detail on the CUDA build, see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda
Quantized Models:
These quantized models were generated using the excellent pull request from @pwilkin,
#16095
on 2025-10-19, at commit 2fdbf16eb.
NOTE: these models currently only work with the llama.cpp 16095 pull request, which is still in development. Speed and quality should improve over time.
How to build and run on macOS
PR=16095
git clone https://github.com/ggml-org/llama.cpp llama.cpp-PR-$PR
cd llama.cpp-PR-$PR
git fetch origin pull/$PR/head:pr-$PR
git checkout pr-$PR
time cmake -B build
time cmake --build build --config Release --parallel $(sysctl -n hw.ncpu)
# macOS does not ship nproc, so sysctl -n hw.ncpu is used to get the core count
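To confirm that the PR branch was checked out and built, you can inspect the commit and the binary (both are standard git/llama.cpp commands, not specific to this PR):
git log -1 --oneline          # should show the pr-16095 head commit
build/bin/llama-cli --version # prints the commit the binary was built from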
Run examples
Run with Hugging Face model:
build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF --prompt 'What is the capital of France?' --no-mmap -st
By default this will download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q4_K_M.
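To pick a different quantization, you can append its name as a tag to the -hf argument; for example, to use the smaller Q2_K file instead:
build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q2_K --prompt 'What is the capital of France?' --no-mmap -st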
To download:
wget https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF/resolve/main/Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
or
pip install hf_transfer 'huggingface_hub[cli]'
hf download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
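Note that installing hf_transfer alone does not enable it; as far as I know you also have to set an environment variable before downloading:
export HF_HUB_ENABLE_HF_TRANSFER=1
hf download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf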
Run with local model file:
build/bin/llama-cli -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf --prompt 'Write a paragraph about quantum computing' --no-mmap -st
build/bin/llama-cli -ngl 100 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q2_K.gguf --no-mmap --prompt 'what is the capital of france' -st
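If you would rather serve the model over HTTP than use the interactive CLI, the same build also produces llama-server; a minimal sketch (the port is an arbitrary choice):
build/bin/llama-server -ngl 100 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q2_K.gguf --no-mmap --port 8080
# then query the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"messages":[{"role":"user","content":"What is the capital of France?"}]}'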
Example prompt and output
User prompt: what is the capital of france
Assistant output:
<think>
Okay, the user asked, "what is the capital of France?" Hmm, that's a pretty basic geography question. Let me think. First, I know the capital of France is Paris. But wait, maybe I should double-check to be absolutely sure. Yeah, definitely Paris. It's one of those common facts that's easy to remember.
Wait, but why would someone ask this? Maybe they're a student doing homework, or maybe a traveler planning a trip. Or perhaps they're just curious. Either way, the answer is straightforward. No need for complex analysis here.
I recall that Paris has been the capital for a long time, since the Middle Ages. It's not like some countries where the capital changed. France has always had Paris as its capital. Yeah, that's right.
Should I mention anything else? Like maybe that it's the largest city in France, or that it's known for landmarks like the Eiffel Tower? But the question is just about the capital, so maybe keep it simple. Just confirm Paris.
Wait, is there any chance of confusion? Like, sometimes people think of Marseille or Lyon, but no, those are other major cities but not the capital. Definitely Paris.
I think the best answer is to state clearly that the capital is Paris. Maybe add a bit about it being a major global city. But keep it short and accurate.
Yeah, I think that's it. No need to overcomplicate. Just give the answer directly. So, "The capital of France is Paris." Done.
</think>
The capital of France is **Paris**.
Paris has been the capital since the 6th century and is France's largest city, as well as a major global center for culture, commerce, and tourism. It is renowned for landmarks like the Eiffel Tower, the Louvre, and Notre-Dame Cathedral. [end of text]
llama_perf_sampler_print: sampling time = 33.98 ms / 403 runs ( 0.08 ms per token, 11858.87 tokens per second)
llama_perf_context_print: load time = 10380.46 ms
llama_perf_context_print: prompt eval time = 5709.11 ms / 14 tokens ( 407.79 ms per token, 2.45 tokens per second)
llama_perf_context_print: eval time = 85045.12 ms / 388 runs ( 219.19 ms per token, 4.56 tokens per second)
llama_perf_context_print: total time = 90917.58 ms / 402 tokens
llama_perf_context_print: graphs reused = 0
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - Metal (Apple M4 Max) | 98304 = 69920 + (28151 = 27675 + 171 + 304) + 232 |
llama_memory_breakdown_print: | - Host | 167 = 97 + 0 + 70 |
ggml_metal_free: deallocating
real 1m41.530s
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
Model tree for lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF
Base model: Qwen/Qwen3-Next-80B-A3B-Thinking