janni-t committed
Commit 8fe0c23 · verified · 1 Parent(s): d716917

feat: model card

Files changed (1):
  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
---
language:
- en
- zh
- ru
- ja
- de
- fr
- es
- pt
- vi
- th
- ar
- ko
- it
- pl
- nl
- sv
- tr
- he
- cs
- uk
- ro
- bg
- hu
- el
- da
- fi
- nb
- sk
- sl
- hr
- lt
- lv
- et
- mt
pipeline_tag: sentence-similarity
tags:
- qwen
- embedding
- onnx
- int8
- quantized
- text-embeddings-inference
license: apache-2.0
---

# Qwen3-Embedding-0.6B ONNX INT8 for Text Embeddings Inference

This is an INT8-quantized ONNX export of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized specifically for [Text Embeddings Inference (TEI)](https://github.com/huggingface/text-embeddings-inference) with a focus on CPU inference.

## Key Features

- **INT8 Quantization**: ~8x smaller model size (0.56 GB vs 4.7 GB)
- **CPU Optimized**: 2-4x faster inference on CPU compared to float32
- **TEI Compatible**: Properly formatted and configured for Text Embeddings Inference
- **Multilingual**: Supports 30+ languages (see the `language` list above), including English, Chinese, Russian, and Japanese
- **Mean Pooling**: Configured for mean pooling, which TEI applies server-side

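The mean pooling mentioned above happens inside TEI, so clients always receive one vector per input. Conceptually, it averages the token-level hidden states while ignoring padding positions. The NumPy sketch below is illustrative only (shapes and values are made up) and is not part of this repository:

```python
import numpy as np

# Illustrative shapes: batch of 2 sequences, 5 token positions, hidden size 4
token_embeddings = np.random.rand(2, 5, 4).astype(np.float32)
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]], dtype=np.float32)

# Mean pooling: sum the token vectors at non-padding positions,
# then divide by the number of real tokens in each sequence.
mask = attention_mask[:, :, None]                 # (2, 5, 1)
summed = (token_embeddings * mask).sum(axis=1)    # (2, 4)
counts = mask.sum(axis=1).clip(min=1e-9)          # (2, 1)
sentence_embeddings = summed / counts             # one vector per input
```
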
## Performance

- **Model size**: 0.56 GB (vs 4.7 GB float32)
- **Expected speedup**: 2-4x on CPU
- **Accuracy**: Minimal loss (1-3%) compared to float32
- **Best for**: CPU deployments, edge devices, high-throughput scenarios

## Usage with Text Embeddings Inference

### Docker Deployment (CPU)

```bash
docker run -p 8080:80 \
    -e OMP_NUM_THREADS=$(nproc) \
    -e KMP_AFFINITY=granularity=fine,compact,1,0 \
    -e ORT_THREAD_POOL_SIZE=$(nproc) \
    ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
    --model-id YOUR_USERNAME/qwen3-embedding-0.6b-int8-tei-onnx
```

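Once the container is running, a quick way to confirm it is serving embeddings is to call TEI's `/embed` route directly. This is a minimal sketch using `requests`; the port follows the `docker run` command above, and the printed dimension should match the base model's 1024-dimensional embeddings:

```python
import requests

# TEI exposes POST /embed, which accepts {"inputs": ...} and returns
# one embedding vector per input.
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "Hello, world!"},
    timeout=30,
)
resp.raise_for_status()

vectors = resp.json()
print(len(vectors))     # 1 input -> 1 vector
print(len(vectors[0]))  # embedding dimension (expected: 1024)
```
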
### Python Client

```python
import json

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single embedding (client.post returns the raw response bytes)
response = client.post(
    json={"inputs": "What is Deep Learning?"},
)
embedding = json.loads(response)

# Batch embeddings
response = client.post(
    json={"inputs": ["What is Deep Learning?", "深度学习是什么?"]},  # second input: "What is deep learning?" in Chinese
)
embeddings = json.loads(response)
```

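Since this model targets sentence similarity, a natural follow-up is to compare the vectors returned above. A small sketch reusing the `embeddings` list from the batch call (semantically equivalent inputs, such as the English and Chinese questions, should score close to 1.0):

```python
import numpy as np

# Cosine similarity between the two batch embeddings from the example above.
a = np.asarray(embeddings[0])
b = np.asarray(embeddings[1])
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {similarity:.3f}")
```
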
## CPU Optimization

For optimal CPU performance (for example, when running TEI outside Docker), set these environment variables:

```bash
export OMP_NUM_THREADS=$(nproc)                   # use all available cores (nproc reports logical CPUs)
export KMP_AFFINITY=granularity=fine,compact,1,0  # pin OpenMP threads for better cache locality
export ORT_THREAD_POOL_SIZE=$(nproc)              # size of the ONNX Runtime thread pool
```

## License

Apache 2.0