### Qwen3-4B-Thinking-2507 q4f32_1 (MLC/WebLLM)
Base model: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
- Model ID: `Qwen3-4B-Thinking-2507-q4f32_1-MLC`
- Quantization: `q4f32_1`
- Context window: 4096 tokens
- Backend: WebGPU (WASM), via WebLLM
- Thinking: optional via `<think>...</think>` blocks, with a controllable token budget
### Requirements
- Browser with WebGPU support (Chrome/Edge ≥ 121; on some builds, enabling `chrome://flags/#enable-webgpu-developer-features` may help)
- Secure context: WebGPU is only exposed on HTTPS origins (or `localhost`)
- Sufficient GPU memory (4B q4f32_1 typically runs on integrated GPUs; performance varies by device)
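Before downloading model weights, it can help to verify WebGPU availability up front. A minimal sketch (the `hasWebGPU` helper name is ours, not part of WebLLM):

```javascript
// Minimal feature check: resolves true only when a WebGPU adapter is available.
// Safe to call in any environment; resolves false where WebGPU is absent.
async function hasWebGPU() {
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is found.
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```

Call it before engine initialization, e.g. `if (!(await hasWebGPU())) { /* show a fallback message */ }`.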
### Quick start (JavaScript, in-browser)
```html
<!-- Minimal sketch: load WebLLM from a CDN and run one completion.
     Assumes the model ID is resolvable by WebLLM (if it is not in the
     prebuilt model list, pass a custom appConfig with the model's URL). -->
<script type="module">
  import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";

  const engine = await CreateMLCEngine(
    "Qwen3-4B-Thinking-2507-q4f32_1-MLC",
    { initProgressCallback: (p) => console.log(p.text) } // download/compile progress
  );

  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello!" }],
  });
  console.log(reply.choices[0].message.content);
</script>
```
### Streaming example
```js
// Enable "thinking" and cap its token budget. Depending on the WebLLM
// version, these knobs are read either at the top level of `extra_body`
// or from `chat_template_kwargs`, so both are set here.
const extra_body = {
  enable_thinking: true,
  max_thinking_tokens: 2000,
  chat_template_kwargs: { enable_thinking: true, max_thinking_tokens: 2000 },
};
// `engine` is the MLCEngine created in the quick start above.
const stream = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: 'system', content: 'Answer briefly.' },
    { role: 'user', content: 'Explain WebGPU in one sentence.' },
  ],
  temperature: 0.2,
  max_tokens: 512,
  extra_body,
});
// Consume streamed deltas; optionally hold back output until the closing
// </think> tag so the reasoning trace is not shown to the user.
let buffer = '';
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content ?? '';
  buffer += delta;
  // Example: start rendering only after the model closes its <think> block
  const thinkEndIdx = buffer.indexOf('</think>');
  if (thinkEndIdx !== -1) {
    const finalPart = buffer.slice(thinkEndIdx + '</think>'.length);
    // render finalPart incrementally as more arrives
  }
}
```
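Once a full response has been collected, the reasoning trace can be separated from the visible answer. A small sketch (the `splitThinking` helper is ours, assuming the model wraps its reasoning in `<think>...</think>` tags):

```javascript
// Split a thinking-mode completion into its reasoning trace and final
// answer. Assumes an optional <think>...</think> block precedes the
// user-visible text; text without the block passes through unchanged.
function splitThinking(text) {
  const OPEN = '<think>';
  const CLOSE = '</think>';
  const end = text.indexOf(CLOSE);
  if (end === -1) return { thinking: '', answer: text.trim() }; // no thinking block
  const open = text.indexOf(OPEN);
  const start = open === -1 ? 0 : open + OPEN.length;
  return {
    thinking: text.slice(start, end).trim(),
    answer: text.slice(end + CLOSE.length).trim(),
  };
}
```

For example, `splitThinking('<think>step 1</think>Answer.')` yields `{ thinking: 'step 1', answer: 'Answer.' }`.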
### License
- Base model license: Qwen license (see original upstream).
- This quantized build and configuration follow the same licensing terms. Ensure compliance for your use case.
### Citation
If you use this model, please cite the original Qwen authors and MLC/WebLLM.