---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---

# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference

This is an INT8-quantized ONNX version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized specifically for [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) with CPU acceleration.

## Key Features

- **INT8 Quantization**: ~8x smaller model size (0.56 GB vs 4.7 GB)
- **CPU Optimized**: 2-4x faster inference on CPU compared to float32
- **TEI Compatible**: properly formatted for Text Embeddings Inference
- **Multilingual**: covers the 34 languages listed in the metadata above, including English, Chinese, Russian, and Japanese
- **Mean Pooling**: configured for mean pooling (handled by TEI)

## Performance

- **Model size**: 0.56 GB (vs 4.7 GB float32)
- **Expected speedup**: 2-4x on CPU
- **Accuracy**: minimal loss (1-3%) compared to float32
- **Best for**: CPU deployments, edge devices, high-throughput scenarios

## Usage with Text Embeddings Inference

### Docker Deployment (CPU)

```bash
docker run -p 8080:80 \
  -e OMP_NUM_THREADS=$(nproc) \
  -e KMP_AFFINITY=granularity=fine,compact,1,0 \
  -e ORT_THREAD_POOL_SIZE=$(nproc) \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```

### Python Client

```python
import json

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding. InferenceClient.post() returns raw bytes,
# so decode the response with json.loads rather than .json().
response = client.post(json={"inputs": "What is Deep Learning?"})
embedding = json.loads(response)

# Batch embeddings: one vector per input string, in order.
response = client.post(json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]})
embeddings = json.loads(response)
```

A plain-HTTP similarity check and a local `onnxruntime` sanity check are sketched at the end of this card.

## CPU Optimization

For optimal CPU performance, set these environment variables:

```bash
export OMP_NUM_THREADS=$(nproc)  # use all available cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export ORT_THREAD_POOL_SIZE=$(nproc)
```

## License

Apache 2.0
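
## Quick Similarity Check over HTTP

If you prefer plain HTTP over the `huggingface_hub` client, TEI also exposes an `/embed` route directly. Below is a minimal sketch of a semantic-similarity check, assuming the Docker container from above is reachable at `localhost:8080`; since `/embed` returns L2-normalized vectors by default, a dot product is a cosine similarity.

```python
# Quick semantic-similarity check against the running TEI server.
# Assumes the container from the Docker section is listening on localhost:8080.
import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is Deep Learning?", "深度学习是什么?", "How do I bake bread?"]},
    timeout=30,
)
resp.raise_for_status()

emb = np.array(resp.json())  # shape: (3, hidden_size)

# TEI normalizes embeddings by default, so emb @ emb.T is a cosine-similarity
# matrix; the two Deep Learning questions should score highest against each other.
print(emb @ emb.T)
```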
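
## Verifying the ONNX Model Locally

To sanity-check the quantized graph without spinning up TEI at all, you can run it directly with `onnxruntime`. This is only a sketch: the file name `model.onnx`, the input names, and the output layout are assumptions about this export, so compare against `session.get_inputs()` and the actual repo contents before relying on it.

```python
# Local sanity check of the INT8 graph with onnxruntime (no TEI involved).
# Assumed: the repo ships a single model.onnx plus a standard tokenizer;
# adjust repo_id and file names to match the actual upload.
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx"  # placeholder, as above
model_path = hf_hub_download(repo_id, "model.onnx")           # assumed file name
tokenizer = AutoTokenizer.from_pretrained(repo_id)

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

texts = ["What is Deep Learning?", "深度学习是什么?"]
enc = tokenizer(texts, padding=True, return_tensors="np")

# Input names are assumptions -- inspect session.get_inputs() /
# session.get_outputs() if this export differs.
outputs = session.run(
    None,
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)
hidden = outputs[0]  # assumed layout: (batch, seq_len, hidden_size) token states

# Mean pooling over non-padding tokens, mirroring what TEI does server-side.
mask = enc["attention_mask"][..., None].astype(hidden.dtype)
embeddings = (hidden * mask).sum(axis=1) / mask.sum(axis=1)

# L2-normalize so dot products equal cosine similarities.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
print(embeddings.shape)
print(embeddings @ embeddings.T)
```

The pooled, normalized vectors should closely match what the TEI server returns for the same inputs; small numeric differences are expected from INT8 kernels and padding handling.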