Qwen3-4B-Thinking-2507 q4f32_1 (MLC/WebLLM)
Base model: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507
- Model ID: Qwen3-4B-Thinking-2507-q4f32_1-MLC
- Quantization: q4f32_1
- Context window: 4096 tokens
- Backend: WebGPU (WASM), via WebLLM
- Thinking: optional via <think>...</think> with controllable budget
Requirements
- Browser with WebGPU (Chrome/Edge ≥ 121; enable chrome://flags/#enable-webgpu-developer-features if needed); a quick availability check is shown below
- HTTPS origin recommended for WebGPU
- Sufficient GPU memory (4B q4f32_1 typically runs on integrated GPUs; performance varies by device)
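You can confirm WebGPU is actually usable before downloading any weights. A minimal check using the standard navigator.gpu API (the error messages are illustrative):

<script type="module">
  // Fail fast if the browser exposes no WebGPU adapter
  if (!('gpu' in navigator)) {
    throw new Error('WebGPU is not available in this browser.');
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error('No suitable GPU adapter found (check flags and drivers).');
  }
  console.log('WebGPU adapter ready');
</script>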
Quick start (JavaScript, in-browser)
<script type="module">
  import * as webllm from 'https://cdn.jsdelivr.net/npm/@mlc-ai/web-llm@latest/+esm';

  const modelId = 'Qwen3-4B-Thinking-2507-q4f32_1-MLC';

  // Register this repo as a custom model on top of the prebuilt app config
  const appConfig = {
    ...webllm.prebuiltAppConfig,
    model_list: [
      {
        model: 'https://huggingface.co/vladbuinceanu/Qwen3-4B-Thinking-2507-q4f32_1-MLC',
        model_id: 'Qwen3-4B-Thinking-2507-q4f32_1-MLC',
        // Reuse a compatible Qwen3-4B runtime library (WASM) from the WebLLM CDN
        model_lib: webllm.modelLibURLPrefix + webllm.modelVersion + '/Qwen3-4B-q4f32_1-ctx4k_cs1k-webgpu.wasm',
        overrides: { context_window_size: 4096 },
      },
    ],
  };

  // Download the weights (cached after the first load) and initialize the WebGPU engine
  const engine = await webllm.CreateMLCEngine(modelId, {
    appConfig,
    logLevel: 'INFO',
    initProgressCallback: (p) => console.log(p?.text ?? ''),
  });

  // Simple non-streamed chat
  const res = await engine.chat.completions.create({
    stream: false,
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Hello!' },
    ],
    temperature: 0.2,
    max_tokens: 512,
  });
  console.log(res.choices[0].message.content);
</script>
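Because this is a thinking model, the returned content may contain the reasoning trace wrapped in <think>...</think>. A small sketch for splitting it from the final answer (assuming the trace is returned inline in message.content; splitThinking is an illustrative helper, not a WebLLM API):

// Separate the optional <think>...</think> reasoning trace from the answer.
// Assumes the trace, if present, appears inline in message.content.
function splitThinking(content) {
  const end = content.indexOf('</think>');
  if (end === -1) return { thinking: '', answer: content };
  const start = content.indexOf('<think>');
  return {
    thinking: content.slice(start === -1 ? 0 : start + '<think>'.length, end).trim(),
    answer: content.slice(end + '</think>'.length).trim(),
  };
}

const { thinking, answer } = splitThinking(res.choices[0].message.content);
console.log('Answer:', answer);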
Streaming example
// Enable "thinking" if desired and control the thinking-token budget
const extra_body = {
  enable_thinking: true,
  max_thinking_tokens: 2000,
  chat_template_kwargs: { enable_thinking: true, max_thinking_tokens: 2000 },
};

const stream = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: 'system', content: 'Answer briefly.' },
    { role: 'user', content: 'Explain WebGPU in one sentence.' },
  ],
  temperature: 0.2,
  max_tokens: 512,
  extra_body,
});

// Consume streamed deltas; optionally hide content until you see </think>
let buffer = '';
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content ?? '';
  buffer += delta;
  // Example: start rendering only after the model closes its <think> block
  const thinkEndIdx = buffer.indexOf('</think>');
  if (thinkEndIdx !== -1) {
    const finalPart = buffer.slice(thinkEndIdx + '</think>'.length);
    // render finalPart incrementally as more tokens arrive
  }
}
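When you are finished with the model (for example before switching to a different one), you can release the GPU memory the engine holds. A brief sketch, assuming your @mlc-ai/web-llm version exposes unload() and reload() on the engine as recent releases do:

// Release GPU buffers held by the current model (assumed API: engine.unload()).
await engine.unload();

// Or swap to another model registered in appConfig (assumed API: engine.reload(id)).
// await engine.reload('Qwen3-4B-Thinking-2507-q4f32_1-MLC');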
License
- Base model license: Qwen license (see original upstream).
- This quantized build and configuration follow the same licensing terms. Ensure compliance for your use case.
Citation
If you use this model, please cite the original Qwen authors and MLC/WebLLM.