Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference

This is an INT8-quantized ONNX version of Qwen/Qwen3-Embedding-0.6B, optimized for CPU inference with Text Embeddings Inference (TEI).

Key Features

  • INT8 Quantization: ~8x smaller model size (0.56 GB vs 4.7 GB)
  • CPU Optimized: 2-4x faster inference on CPU compared to float32
  • TEI Compatible: Properly formatted for Text Embeddings Inference
  • Multilingual: Supports 29 languages, including English, Chinese, Russian, and Japanese
  • Mean Pooling: Configured for mean pooling (handled by TEI)

Performance

  • Model size: 0.56 GB (vs 4.7 GB float32)
  • Expected speedup: 2-4x on CPU
  • Accuracy: Minimal loss (1-3%) compared to float32 (see the sanity-check sketch after this list)
  • Best for: CPU deployments, edge devices, high-throughput scenarios
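
The accuracy figure above is easiest to verify outside TEI by loading the ONNX weights directly. The sketch below uses Optimum's ONNX Runtime integration and mirrors the mean pooling this card configures; it assumes the repository follows the standard Optimum layout (a single model.onnx plus tokenizer files at the repo root) and keeps the placeholder repo id used elsewhere in this card. Adjust file_name= and the graph's input/output names if your export differs.

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

repo_id = "YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx"  # placeholder, as above
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id)

batch = tokenizer(["What is Deep Learning?"], padding=True, truncation=True, return_tensors="pt")
token_embeddings = model(**batch).last_hidden_state

# Mean pooling over non-padding tokens, then L2 normalization
# (TEI applies the same pooling and, by default, normalization server-side)
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
sentence_embedding = torch.nn.functional.normalize(sentence_embedding, p=2, dim=1)
print(sentence_embedding.shape)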

Usage with Text Embeddings Inference

Docker Deployment (CPU)

docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
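
Once the container is running, a quick readiness check and smoke test can be done from Python. This is a minimal sketch that assumes the default TEI routes (/health, /info, /embed) and the 8080 port mapping used above.

import requests

TEI_URL = "http://localhost:8080"

# /health returns 200 once the model is loaded; /info reports model metadata
requests.get(f"{TEI_URL}/health", timeout=5).raise_for_status()
print(requests.get(f"{TEI_URL}/info", timeout=5).json())

# Smoke test: /embed returns one vector per input, even for a single string
vector = requests.post(f"{TEI_URL}/embed", json={"inputs": "hello"}).json()[0]
print(len(vector))  # embedding dimension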

Python Client

import requests

TEI_URL = "http://localhost:8080"

# Call TEI's /embed route directly; it always returns a list of vectors
response = requests.post(
    f"{TEI_URL}/embed",
    json={"inputs": "What is Deep Learning?"},
)
embedding = response.json()[0]

# Batch embeddings: one vector per input, in order
response = requests.post(
    f"{TEI_URL}/embed",
    json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]},
)
embeddings = response.json()
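
For retrieval-style use, the upstream Qwen3-Embedding card recommends prefixing queries (but not documents) with a task instruction. The sketch below follows that convention and assumes TEI's default behaviour of returning L2-normalized vectors, so a plain dot product acts as cosine similarity; the task wording and documents are just examples.

import requests

TEI_URL = "http://localhost:8080"

def embed(texts):
    r = requests.post(f"{TEI_URL}/embed", json={"inputs": texts})
    r.raise_for_status()
    return r.json()

task = "Given a web search query, retrieve relevant passages that answer the query"
query = f"Instruct: {task}\nQuery: What is Deep Learning?"
documents = [
    "Deep learning is a branch of machine learning based on multi-layer neural networks.",
    "The capital of France is Paris.",
]

query_vec = embed([query])[0]
doc_vecs = embed(documents)

# Dot product of normalized vectors = cosine similarity
scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
print(scores)  # the first document should score higher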

CPU Optimization

For optimal CPU performance, set these environment variables:

export OMP_NUM_THREADS=$(nproc)          # Use all available cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
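
Whether these settings help depends on the host, so it is worth measuring throughput from the client side. The following sketch sends concurrent batched requests to /embed; the batch size, request count, and thread count are arbitrary starting points, not tuned values.

import concurrent.futures
import time

import requests

TEI_URL = "http://localhost:8080"
BATCH = ["What is Deep Learning?"] * 32   # 32 inputs per request
NUM_REQUESTS = 50

def one_request(_):
    # Each call embeds a full batch; raise on any HTTP error
    requests.post(f"{TEI_URL}/embed", json={"inputs": BATCH}).raise_for_status()

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - start

print(f"{NUM_REQUESTS * len(BATCH) / elapsed:.1f} embeddings/s")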

License

Apache 2.0
