Update README.md

**Pure text performance**
![]()
## How to Use
To use these models with `llama.cpp`, please ensure you are using the **latest version**, either by [building from source](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) or by downloading the most recent [release](https://github.com/ggml-org/llama.cpp/releases/tag/b6907) for your platform.
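If you choose to build from source, the standard CMake flow looks roughly like the sketch below; backend-specific options (CUDA, Metal, Vulkan, etc.) are covered in the build guide linked above, so treat this as a minimal starting point rather than the canonical recipe.

```bash
# Sketch: default (CPU) build of llama.cpp from source with CMake.
# See the build guide linked above for backend-specific flags (CUDA, Metal, Vulkan, ...).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# The tools used below (llama-mtmd-cli, llama-server, llama-quantize) land in build/bin/.
```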
You can run inference via the command line or through a web-based chat interface.
### CLI Inference (`llama-mtmd-cli`)
For example, to run Qwen3-VL-2B-Thinking with an FP16 vision encoder and a Q8_0-quantized LLM:
```bash
llama-mtmd-cli \
    -m path/to/Qwen3VL-2B-Thinking-Q8_0.gguf \
    --mmproj path/to/mmproj-Qwen3VL-2B-Thinking-F16.gguf \
    --image test.jpeg \
    -p "What is the publisher name of the newspaper?" \
    --temp 1.0 --top-k 20 --top-p 0.95 -n 1024
```
### Web Chat (using `llama-server`)
To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:
```bash
llama-server \
    -m path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-split-00001-of-00003.gguf \
    --mmproj path/to/mmproj-Qwen3VL-235B-A22B-Instruct-Q8_0.gguf
```
> **Tip**: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
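As a quick, non-authoritative illustration, a request to that endpoint might look like the sketch below; the prompt and image URL are placeholders, and it assumes the server was started with `--mmproj` (as above) on a recent build that accepts OpenAI-style `image_url` content parts.

```bash
# Sketch: call the OpenAI-compatible chat endpoint with curl.
# Placeholder values: the prompt text, the image URL, and the sampling settings.
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpeg"}}
          ]
        }
      ],
      "temperature": 1.0,
      "top_p": 0.95
    }'
```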
### Quantize Your Custom Model
You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:
```bash
# Quantize to 2-bit (IQ2_XXS); the trailing 8 is the number of threads to use
llama-quantize \
    path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
    path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
    iq2_xxs 8
```
For a full list of supported quantization types and detailed instructions, refer to the [quantization documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md).
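As a quick check of what your local build supports, the tool's own help text should also list the allowed quantization types (the exact output varies between versions):

```bash
# Print usage, including the list of quantization types this build accepts.
llama-quantize --help
```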
### Generation Hyperparameters
#### VL