To be used with llama.cpp PR 16095.

## Update:

I have tested some of these smaller models on an NVIDIA L40S GPU, using a default CUDA build of the excellent release from @cturan.

Since the L40S has 48 GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0 and Q4_MXFP4_MOE,
but Q4_K_M was too big. It does work with `-ngl 45` (see the sketch below), but it slowed down quite a bit.

There may be a better way, but I did not have time to test.
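
A rough sketch of that partial-offload workaround; the Q4_K_M filename below is an assumption based on this repo's naming pattern for the other quants:

```bash
# Offload only 45 of the model's layers to the GPU so the Q4_K_M quant
# fits in 48 GB of VRAM; the remaining layers run on the CPU (slower).
# Filename is assumed, adjust to the file you actually downloaded.
build/bin/llama-cli -ngl 45 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_K_M.gguf \
    --no-mmap --prompt 'what is the capital of france' -st
```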

I was able to get a good speed of 53 tokens per second for generation
and 800 tokens per second for prompt processing.

```bash
# download and unpack @cturan's llama.cpp release
wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
tar xf test.tar.gz
cd llama.cpp-test

# export PATH=/usr/local/cuda/bin:$PATH

# configure and build with CUDA support enabled
time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
```

You may need to add /usr/local/cuda/bin to your PATH
so that nvcc (the NVIDIA CUDA compiler) can be found.
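
For example (a minimal sketch, assuming the CUDA toolkit is installed in the standard /usr/local/cuda location):

```bash
# Make nvcc visible to cmake, then confirm it is found
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version
```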

Building from source took about 7 minutes.

For more detail on the CUDA build, see:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda
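
If you want to reproduce throughput numbers like the ones above, one option is `llama-bench`, which is built alongside `llama-cli`; a minimal sketch (point `-m` at whichever quant you downloaded):

```bash
# Measure prompt-processing (-p) and generation (-n) speed in tokens/second
build/bin/llama-bench -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf -p 512 -n 128
```
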
## Quantized Models:

These quantized models were generated using the excellent pull request from @pwilkin,
[#16095](https://github.com/ggml-org/llama.cpp/pull/16095),
on 2025-10-19 with commit `2fdbf16eb`.

NOTE: currently they only work with the llama.cpp 16095 pull request, which is still in development.
Speed and quality should improve over time.
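
For reference, the usual GGUF quantization flow with llama.cpp looks roughly like the sketch below; this is a generic illustration, not the exact commands used for this repo, and it assumes a local copy of the original Qwen/Qwen3-Next-80B-A3B-Thinking checkpoint:

```bash
# 1. Convert the original Hugging Face checkpoint to an F16 GGUF
#    (the checkpoint path is a placeholder)
python convert_hf_to_gguf.py /path/to/Qwen3-Next-80B-A3B-Thinking \
    --outfile Qwen3-Next-80B-A3B-Thinking-F16.gguf

# 2. Quantize the F16 GGUF to the desired format, e.g. Q4_0
build/bin/llama-quantize Qwen3-Next-80B-A3B-Thinking-F16.gguf \
    Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf Q4_0
```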

### How to build and run for macOS

```bash
PR=16095
git clone https://github.com/ggml-org/llama.cpp llama.cpp-PR-$PR
cd llama.cpp-PR-$PR

# fetch and check out the pull-request branch
git fetch origin pull/$PR/head:pr-$PR
git checkout pr-$PR

# note: stock macOS has no nproc; $(sysctl -n hw.ncpu) works instead
time cmake -B build
time cmake --build build --config Release --parallel $(nproc --all)
```

### Run examples

Run with a Hugging Face model:

```bash
build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF --prompt 'What is the capital of France?' --no-mmap -st
```
By default this will download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q4_K_M.
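
To use a different quantization directly, you can append its tag to the `-hf` argument (a small sketch; the tag-selection syntax is supported by recent llama.cpp builds):

```bash
build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q2_K --prompt 'What is the capital of France?' --no-mmap -st
```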

To download:
```bash
wget https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF/resolve/main/Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
```
or
```bash
pip install hf_transfer 'huggingface_hub[cli]'
hf download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
```
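
Since hf_transfer is installed above, you can optionally enable the faster download backend via an environment variable (an optional tweak, not required):

```bash
# Opt in to the hf_transfer backend for higher download throughput
HF_HUB_ENABLE_HF_TRANSFER=1 hf download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
```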

Run with a local model file:

```bash
build/bin/llama-cli -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf --prompt 'Write a paragraph about quantum computing' --no-mmap -st
```
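
If you prefer an OpenAI-compatible HTTP endpoint instead of the interactive CLI, the same build also produces `llama-server`; a minimal sketch (flags beyond `-m` and `--port` are up to you):

```bash
# Serve the local GGUF over HTTP on port 8080
build/bin/llama-server -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf --port 8080
```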

Run the Q2_K quant with all layers offloaded to the GPU:

```bash
build/bin/llama-cli -ngl 100 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q2_K.gguf --no-mmap --prompt 'what is the capital of france' -st
```

### Example prompt and output

**User prompt:**
what is the capital of france

**Assistant output:**

```
<think>
Okay, the user asked, "what is the capital of France?" Hmm, that's a pretty basic geography question. Let me think. First, I know the capital of France is Paris. But wait, maybe I should double-check to be absolutely sure. Yeah, definitely Paris. It's one of those common facts that's easy to remember.