
Qwen3-4B-Thinking-2507 q4f32_1 (MLC/WebLLM)

Base model: https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507

  • Model ID: Qwen3-4B-Thinking-2507-q4f32_1-MLC
  • Quantization: q4f32_1
  • Context window: 4096 tokens
  • Backend: WebGPU via WebLLM (WASM runtime library)
  • Thinking: optional via <think>...</think> with controllable budget

Requirements

  • Browser with WebGPU support (Chrome/Edge ≥ 121; if WebGPU is unavailable, try enabling chrome://flags/#enable-webgpu-developer-features; see the feature-detection sketch after this list)
  • Secure context required (HTTPS or localhost); WebGPU is not exposed on insecure origins
  • Sufficient GPU memory (4B q4f32_1 typically runs on integrated GPUs; performance varies by device)
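
A quick way to confirm WebGPU is actually available before downloading weights (standard browser API; minimal sketch):

// Feature-detect WebGPU before attempting to load the model
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  try {
    // requestAdapter() resolves to null when no suitable GPU is available
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}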

Quick start (JavaScript, in-browser)

<script type="module">
import * as webllm from 'https://cdn.jsdelivr.net/npm/@mlc-ai/web-llm@latest/+esm';

const modelId = 'Qwen3-4B-Thinking-2507-q4f32_1-MLC';
const appConfig = {
  ...webllm.prebuiltAppConfig,
  model_list: [
    {
      model: 'https://huggingface.co/vladbuinceanu/Qwen3-4B-Thinking-2507-q4f32_1-MLC',
      model_id: 'Qwen3-4B-Thinking-2507-q4f32_1-MLC',
      // Reuse a compatible Qwen3-4B runtime library (WASM) from WebLLM CDN
      model_lib: webllm.modelLibURLPrefix + webllm.modelVersion + '/Qwen3-4B-q4f32_1-ctx4k_cs1k-webgpu.wasm',
      overrides: { context_window_size: 4096 },
    },
  ],
};

const engine = await webllm.CreateMLCEngine(modelId, {
  appConfig,
  logLevel: 'INFO',
  initProgressCallback: (p) => console.log(p?.text ?? ''),
});

// Simple non-streamed chat
const res = await engine.chat.completions.create({
  stream: false,
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
  temperature: 0.2,
  max_tokens: 512,
});
console.log(res.choices[0].message.content);
</script>
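
When you need to stop a long generation or free GPU memory, the engine object exposes cleanup helpers; a minimal sketch, assuming the interruptGenerate and unload methods of current @mlc-ai/web-llm releases:

// Stop an in-flight generation (e.g. wired to a "Stop" button)
engine.interruptGenerate();

// Release model weights and GPU buffers when the engine is no longer needed
await engine.unload();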

Streaming example

// Enable "thinking" if desired; control max thinking tokens
const extra_body = {
  enable_thinking: true,
  max_thinking_tokens: 2000,
  chat_template_kwargs: { enable_thinking: true, max_thinking_tokens: 2000 },
};

const stream = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: 'system', content: 'Answer briefly.' },
    { role: 'user', content: 'Explain WebGPU in one sentence.' },
  ],
  temperature: 0.2,
  max_tokens: 512,
  extra_body,
});

// Consume streamed deltas; optionally ignore content until you see </think>
let buffer = '';
for await (const chunk of stream) {
  const delta = chunk.choices?.[0]?.delta?.content ?? '';
  buffer += delta;
  // Example: start rendering only after the model closes its <think> block
  const thinkEndIdx = buffer.indexOf('</think>');
  if (thinkEndIdx !== -1) {
    const finalPart = buffer.slice(thinkEndIdx + '</think>'.length);
    // render finalPart incrementally as more arrives
  }
}
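
To display the reasoning and the final answer separately, a small helper can split the accumulated text on the closing tag. splitThinking below is a hypothetical helper (not part of WebLLM); it keys off </think> alone because Qwen3 Thinking builds typically emit only the closing tag, with the opening <think> supplied by the chat template:

// Hypothetical helper: separate the reasoning block from the final answer.
// Works whether or not an opening <think> tag appears in the output.
function splitThinking(text) {
  const end = text.indexOf('</think>');
  if (end === -1) {
    // No closing tag yet: treat the text as in-progress reasoning
    return { thinking: text.replace(/^<think>/, ''), answer: '' };
  }
  return {
    thinking: text.slice(0, end).replace(/^<think>/, ''),
    answer: text.slice(end + '</think>'.length).trimStart(),
  };
}

const { thinking, answer } = splitThinking(buffer);
console.log('Answer:', answer);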

License

  • Base model license: Qwen license (see original upstream).
  • This quantized build and configuration follow the same licensing terms. Ensure compliance for your use case.

Citation

If you use this model, please cite the original Qwen authors and MLC/WebLLM.
