lefromage committed (verified) · Commit 74c324f · 1 parent: 93e4b5d

Update README.md

Files changed (1): README.md (+90 −4)
to be used with llama.cpp PR 16095

## Update:
I have tested some of these smaller models, built with the default CUDA configuration, using the excellent release from @cturan, on an NVIDIA L40S GPU.

Since the L40S has 48 GB of VRAM, I was able to run Q2_K, Q3_K_M, Q4_K_S, Q4_0, and Q4_MXFP4_MOE, but Q4_K_M was too big. It does work with -ngl 45 (offloading only 45 layers to the GPU), but it slowed down quite a bit. There may be a better way, but I did not have time to test.
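As a rough back-of-the-envelope check on why Q4_K_M is borderline at 48 GB, one can multiply the ~80B parameter count by each quant's average bits per weight. The bits-per-weight values below are approximate figures assumed for illustration, not exact GGUF file sizes, and the KV cache and activations need VRAM on top of the weights:

```python
# Rough model-size estimate: parameters x average bits-per-weight / 8.
# The bpw values are approximate assumptions for illustration only;
# real GGUF files will differ somewhat.
PARAMS = 80e9  # Qwen3-Next-80B total parameter count (approximate)
BPW = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_0": 4.5, "Q4_K_S": 4.6, "Q4_K_M": 4.8}

def est_gib(quant: str) -> float:
    """Estimated weight size in GiB for a given quant type."""
    return PARAMS * BPW[quant] / 8 / 2**30

for q in sorted(BPW, key=BPW.get):
    print(f"{q:7s} ~{est_gib(q):5.1f} GiB")
```

Under these assumptions Q4_K_M comes out around 45 GiB of weights alone, leaving little of the L40S's 48 GB for the KV cache, which is consistent with it not fitting fully.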
I was able to get a good speed of about 53 tokens per second for generation and about 800 tokens per second for prompt processing.
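To put those throughput numbers in perspective, a small sketch (using the measured rates above; the prompt and reply sizes are just example values):

```python
# Estimate end-to-end time from the measured L40S throughput figures above.
PROMPT_TPS = 800.0  # prompt processing, tokens/second (measured)
GEN_TPS = 53.0      # generation, tokens/second (measured)

def estimated_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Approximate wall time: prompt ingestion plus token generation."""
    return prompt_tokens / PROMPT_TPS + output_tokens / GEN_TPS

# Example: a 2000-token prompt with a 500-token reply
print(f"{estimated_seconds(2000, 500):.1f} s")  # 2.5 s prompt + ~9.4 s generation
```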
```bash
wget https://github.com/cturan/llama.cpp/archive/refs/tags/test.tar.gz
tar xf test.tar.gz
cd llama.cpp-test

# export PATH=/usr/local/cuda/bin:$PATH

time cmake -B build -DGGML_CUDA=ON
time cmake --build build --config Release --parallel $(nproc --all)
```
You may need to add /usr/local/cuda/bin to your PATH so that the build can find nvcc (the NVIDIA CUDA compiler).

Building from source took about 7 minutes.

For more detail on the CUDA build, see:
https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#cuda
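A quick way to check whether nvcc is already visible on your PATH (a generic check, not part of the llama.cpp build scripts):

```python
# Print the location of nvcc if it is on PATH, else a hint to fix PATH.
import shutil

nvcc = shutil.which("nvcc")
print(nvcc if nvcc else "nvcc not found; add /usr/local/cuda/bin to PATH")
```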

## Quantized Models:

These quantized models were generated using the excellent pull request from @pwilkin,
[#16095](https://github.com/ggml-org/llama.cpp/pull/16095),
on 2025-10-19 at commit `2fdbf16eb`.

NOTE: they currently work only with the llama.cpp 16095 pull request, which is still in development.
Speed and quality should improve over time.
### How to build and run for macOS

```bash
PR=16095
git clone https://github.com/ggml-org/llama.cpp llama.cpp-PR-$PR
cd llama.cpp-PR-$PR

git fetch origin pull/$PR/head:pr-$PR
git checkout pr-$PR

time cmake -B build
# note: nproc is not available on macOS by default; use sysctl for the core count
time cmake --build build --config Release --parallel $(sysctl -n hw.ncpu)
```

### Run examples

Run with a Hugging Face model:

```bash
build/bin/llama-cli -hf lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF --prompt 'What is the capital of France?' --no-mmap -st
```
By default this will download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF:Q4_K_M.

To download a specific file directly:
```bash
wget https://huggingface.co/lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF/resolve/main/Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
```
or
```bash
pip install hf_transfer 'huggingface_hub[cli]'
hf download lefromage/Qwen3-Next-80B-A3B-Thinking-GGUF Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf
```
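After downloading, a quick sanity check is that every GGUF file begins with the 4-byte magic `GGUF`. This small helper (my own sketch, not part of llama.cpp) verifies that before you point llama-cli at the file:

```python
# Check that a downloaded file starts with the GGUF magic bytes b"GGUF".
# This only validates the header magic, not the full file integrity.
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Usage (filename is an example):
# looks_like_gguf("Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf")
```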
Run with a local model file:

```bash
build/bin/llama-cli -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q4_0.gguf --prompt 'Write a paragraph about quantum computing' --no-mmap -st
```

or, with all layers offloaded to the GPU:

```bash
build/bin/llama-cli -ngl 100 -m Qwen__Qwen3-Next-80B-A3B-Thinking-Q2_K.gguf --no-mmap --prompt 'what is the capital of france' -st
```
### Example prompt and output

**User prompt:**
what is the capital of france

**Assistant output:**

```
<think>
Okay, the user asked, "what is the capital of France?" Hmm, that's a pretty basic geography question. Let me think. First, I know the capital of France is Paris. But wait, maybe I should double-check to be absolutely sure. Yeah, definitely Paris. It's one of those common facts that's easy to remember.
110