### Qwen3-4B-Thinking-2507 q4f32_1 (MLC/WebLLM)

Base model: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507

- Model ID: `Qwen3-4B-Thinking-2507-q4f32_1-MLC`
- Quantization: `q4f32_1`
- Context window: 4096 tokens
- Backend: WebGPU (with WASM runtime), via WebLLM
- Thinking: optional via `<think>...</think>` blocks, with a controllable token budget

### Requirements

- Browser with WebGPU (Chrome/Edge ≥ 121; enable `chrome://flags/#enable-webgpu-developer-features` if needed). See "Checking WebGPU support" below for a runtime check.
- Secure context (HTTPS, or `localhost` during development): WebGPU is only exposed on secure origins
- Sufficient GPU memory (a 4B model at q4f32_1 typically runs on integrated GPUs; performance varies by device)

### Quick start (JavaScript, in-browser)

A minimal sketch, assuming the `@mlc-ai/web-llm` package is loaded from the esm.run CDN; adapt the import if you install the package from npm and use a bundler.

```html
<!doctype html>
<script type="module">
  import { CreateMLCEngine } from "https://esm.run/@mlc-ai/web-llm";

  // First use downloads and compiles the model; progress is reported
  // through initProgressCallback.
  const engine = await CreateMLCEngine("Qwen3-4B-Thinking-2507-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Hello! What can you do?" }],
  });
  console.log(reply.choices[0].message.content);
</script>
```

### Streaming example

```js
// Enable "thinking" if desired; control the maximum number of thinking tokens.
const extra_body = {
  enable_thinking: true,
  max_thinking_tokens: 2000,
  chat_template_kwargs: { enable_thinking: true, max_thinking_tokens: 2000 },
};

const stream = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: 'system', content: 'Answer briefly.' },
    { role: 'user', content: 'Explain WebGPU in one sentence.' },
  ],
  temperature: 0.2,
  max_tokens: 512,
  extra_body,
});

// Consume streamed deltas; optionally ignore content until you see </think>.
let buffer = '';
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content ?? '';
  buffer += delta;

  // Example: start rendering only after the model closes its <think> block.
  const thinkEndIdx = buffer.indexOf('</think>');
  if (thinkEndIdx !== -1) {
    const finalPart = buffer.slice(thinkEndIdx + '</think>'.length);
    // render finalPart incrementally as more arrives
  }
}
```

### License

- Base model license: Qwen license (see the upstream model card).
- This quantized build and configuration follow the same licensing terms; ensure compliance for your use case.

### Citation

If you use this model, please cite the original Qwen authors and MLC/WebLLM.
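
### Checking WebGPU support

As referenced in Requirements, here is a small runtime check using the standard `navigator.gpu` API. Run it in a `<script type="module">` (so top-level await is available) before loading the model.

```js
if (!('gpu' in navigator)) {
  // Browser too old, WebGPU disabled, or the page is not on a secure origin.
  console.error('WebGPU is not available in this browser/context.');
} else {
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter) {
    console.log('WebGPU is available; model loading should work.');
  } else {
    console.error('WebGPU is exposed, but no suitable GPU adapter was found.');
  }
}
```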
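
### Freeing GPU memory

Because GPU memory is the main constraint for in-browser inference, it helps to release the model once it is no longer needed. A short sketch, assuming a current WebLLM `MLCEngine`, which exposes `unload()` and `reload(modelId)`:

```js
// Release model weights and GPU buffers held by the engine.
await engine.unload();

// Later, load this model (or a different one) on the same engine instance.
await engine.reload('Qwen3-4B-Thinking-2507-q4f32_1-MLC');
```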