
Qwen3-4B-Thinking-2507 q4f32_1 (MLC/WebLLM)

Base model: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507

  • Model ID: Qwen3-4B-Thinking-2507-q4f32_1-MLC
  • Quantization: q4f32_1
  • Context window: 4096 tokens
  • Backend: WebGPU via WebLLM (WASM runtime library)
  • Thinking: optional via <think>...</think> with controllable budget

Requirements

  • Browser with WebGPU support (Chrome/Edge ≥ 121; if WebGPU is unavailable, try enabling chrome://flags/#enable-webgpu-developer-features; see the feature-detection sketch after this list)
  • Secure context required (HTTPS or localhost); WebGPU is not exposed on insecure origins
  • Sufficient GPU memory (4B q4f32_1 typically runs on integrated GPUs; performance varies by device)
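
A quick way to confirm WebGPU is actually available before downloading weights (standard browser API; minimal sketch):

// Feature-detect WebGPU before attempting to load the model
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is available
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}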

Quick start (JavaScript, in-browser)

<script type="module">
import * as webllm from 'https://cdn.jsdelivr.net/npm/@mlc-ai/web-llm@latest/+esm';

const modelId = 'Qwen3-4B-Thinking-2507-q4f32_1-MLC';
const appConfig = {
  ...webllm.prebuiltAppConfig,
  model_list: [
    {
      model: 'https://huggingface.co/vladbuinceanu/Qwen3-4B-Thinking-2507-q4f32_1-MLC',
      model_id: 'Qwen3-4B-Thinking-2507-q4f32_1-MLC',
      // Reuse a compatible Qwen3-4B runtime library (WASM) from WebLLM CDN
      model_lib: webllm.modelLibURLPrefix + webllm.modelVersion + '/Qwen3-4B-q4f32_1-ctx4k_cs1k-webgpu.wasm',
      overrides: { context_window_size: 4096 },
    },
  ],
};

const engine = await webllm.CreateMLCEngine(modelId, {
  appConfig,
  logLevel: 'INFO',
  initProgressCallback: (p) => console.log(p?.text ?? ''),
});

// Simple non-streamed chat
const res = await engine.chat.completions.create({
  stream: false,
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
  temperature: 0.2,
  max_tokens: 512,
});
console.log(res.choices[0].message.content);
</script>
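
When you need to stop a long generation or free GPU memory, the engine object exposes cleanup helpers; a minimal sketch, assuming the interruptGenerate and unload methods of current @mlc-ai/web-llm releases:

// Stop an in-flight generation (e.g. wired to a "Stop" button)
engine.interruptGenerate();

// Release model weights and GPU buffers when the engine is no longer needed
await engine.unload();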

Streaming example

// Enable "thinking" if desired; control max thinking tokens
const extra_body = {
  enable_thinking: true,
  max_thinking_tokens: 2000,
  chat_template_kwargs: { enable_thinking: true, max_thinking_tokens: 2000 },
};

const stream = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: 'system', content: 'Answer briefly.' },
    { role: 'user', content: 'Explain WebGPU in one sentence.' },
  ],
  temperature: 0.2,
  max_tokens: 512,
  extra_body,
});

// Consume streamed deltas; optionally ignore content until you see </think>
let buffer = '';
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content ?? '';
  buffer += delta;
  // Example: start rendering only after the model closes its <think> block
  const thinkEndIdx = buffer.indexOf('</think>');
  if (thinkEndIdx !== -1) {
    const finalPart = buffer.slice(thinkEndIdx + '</think>'.length);
    // render finalPart incrementally as more arrives
  }
}
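
To display the reasoning and the final answer separately, a small helper can split the accumulated text on the closing tag. splitThinking below is a hypothetical helper (not part of WebLLM); it keys off </think> alone because Qwen3 Thinking builds typically emit only the closing tag, with the opening <think> supplied by the chat template:

// Hypothetical helper: separate the reasoning block from the final answer.
// Works whether or not an opening <think> tag appears in the output.
function splitThinking(text) {
  const end = text.indexOf('</think>');
  if (end === -1) {
    // No closing tag yet: treat the text as in-progress reasoning
    return { thinking: text.replace(/^<think>/, ''), answer: '' };
  }
  return {
    thinking: text.slice(0, end).replace(/^<think>/, ''),
    answer: text.slice(end + '</think>'.length).trimStart(),
  };
}

const { thinking, answer } = splitThinking(buffer);
console.log('Answer:', answer);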

License

  • Base model license: Qwen license (see original upstream).
  • This quantized build and configuration follow the same licensing terms. Ensure compliance for your use case.

Citation

If you use this model, please cite the original Qwen authors and MLC/WebLLM.
