Introduction
Tarka-Embedding-150M-V1 is a 150M parameter embedding model designed to produce 768-dimensional dense text representations. It is optimized for a wide range of downstream applications such as semantic similarity, search, and retrieval-augmented generation (RAG). The model focuses on capturing deep contextual semantics to support general-purpose text understanding across diverse domains.
The model is trained using Data-Free Knowledge Distillation (DFKD). To prepare the training data, standard open-source datasets were used only as a source of raw textual content — all labels, annotations, and structural elements were stripped to create a plain, unlabeled text corpus. The resulting dataset contained approximately 2 billion tokens, all of which were used during model training.
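The card does not state which loss was used for distillation. As a minimal sketch only, assuming a cosine-distance objective between student and teacher embeddings (a common choice for embedding distillation, not confirmed by the source), the per-batch loss could look like this. The function names and the toy 3-dimensional vectors are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine similarity) over paired student/teacher embeddings.

    Assumed objective for illustration; the actual training loss is not
    documented in this card.
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(losses) / len(losses)

# Toy embeddings: the first pair points in the same direction (loss 0),
# the second pair is orthogonal (loss 1).
teacher = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
loss = cosine_distillation_loss(student, teacher)
print(loss)
```

Because the objective depends only on the teacher's output vectors, no labels are needed, which is consistent with the unlabeled corpus described above.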
Find more information about Tarka-Embedding-150M-V1 in our blog post.
🚀 Try our demo: https://huggingface.co/spaces/Tarka-AIR/Tarka-Embedding
Model Details
Tarka-Embedding-150M-V1 has the following features:
- Model Type: Text Embedding
- Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
- Number of Parameters: 150M
- Context Length: 2048
- Embedding Dimension: 768
So far, the model has been evaluated only on the MTEB (English) benchmark. Additional multilingual benchmark results will be released in future updates.
Training Details
- Initialization: Based on google/embeddinggemma-300m and LiquidAI/LFM2-350M
- Architecture Modifications: The EmbeddingGemma tokenizer was replaced with the LFM2-350M tokenizer. The embedding layer and several hidden layers were reinitialized to align with the modified architecture, changing the embedding matrix shape from [262,144, 768] to [64,400, 768] and reducing the total parameter count by approximately 50%.
- Training Data: 2 billion tokens curated from multiple open-source datasets.
- Teacher Model: google/embeddinggemma-300m
- Compute Resources: 40 GPU hours on NVIDIA A100
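The parameter reduction described under Architecture Modifications can be checked with simple arithmetic. The sketch below uses only the two embedding shapes given above. The 300M baseline is a rough assumption taken from the nominal size in the teacher's name, not an exact count.

```python
HIDDEN_DIM = 768
OLD_VOCAB = 262_144   # EmbeddingGemma vocabulary size, from the shape above
NEW_VOCAB = 64_400    # vocabulary size after the tokenizer swap, from the shape above

old_embedding_params = OLD_VOCAB * HIDDEN_DIM
new_embedding_params = NEW_VOCAB * HIDDEN_DIM
saved_params = old_embedding_params - new_embedding_params

# Assumption: nominal 300M parameters for the original model.
NOMINAL_BASE_PARAMS = 300_000_000
fraction_saved = saved_params / NOMINAL_BASE_PARAMS

print(f"old embedding matrix: {old_embedding_params:,} parameters")
print(f"new embedding matrix: {new_embedding_params:,} parameters")
print(f"saved: {saved_params:,} parameters ({fraction_saved:.1%} of nominal base)")
```

The embedding matrix alone accounts for roughly 152M fewer parameters, which matches the reported reduction of about 50%.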
Citation
@misc{tarka_ai_research_2025,
author = { Tarka AI Research },
title = { Tarka-Embedding-150M-V1 (Revision c5f4f43) },
year = 2025,
url = { https://huggingface.co/Tarka-AIR/Tarka-Embedding-150M-V1 },
doi = { 10.57967/hf/6875 },
publisher = { Hugging Face }
}
MTEB (Eng v2)
| MTEB English / Models | Param. | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| GIST-large-Embedding-v0 | 335M | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| mxbai-embed-large-v1 | 335M | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| UAE-Large-V1 | 335M | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |
| GIST-Embedding-v0 | 109M | 65.5 | 61.4 | 78.16 | 48.5 | 86.33 | 47.52 | 53.59 | 83.35 | 32.32 |
| bge-large-en-v1.5 | 335M | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| multilingual-e5-large-instruct | 560M | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| gte-large | 335M | 64.77 | 60.86 | 75.47 | 48.2 | 85.08 | 47.84 | 53.29 | 83.27 | 32.9 |
| bge-base-en-v1.5 | 109M | 65.14 | 60.77 | 77.69 | 47.42 | 86.56 | 46.66 | 54.75 | 82.12 | 30.19 |
| mini-gte | 66M | 65.06 | 60.65 | 79.95 | 47.89 | 84.78 | 46.86 | 53.23 | 81.55 | 30.31 |
| bilingual-embedding-large | 559M | 63.77 | 60.2 | 77.17 | 46.53 | 85.62 | 46.25 | 46.86 | 86 | 32.95 |
| gte-base | 109M | 63.9 | 59.94 | 75.04 | 47.74 | 84.68 | 47.17 | 51.9 | 82.17 | 30.9 |
| mmlw-roberta-large | 434M | 61.8 | 59.45 | 79.66 | 47.89 | 85.2 | 47.56 | 39.69 | 81.2 | 34.97 |
| e5-large | 335M | 63.13 | 59.68 | 75.61 | 45.88 | 85.94 | 45.43 | 49.64 | 82 | 33.26 |
| mmlw-e5-base | 278M | 61.43 | 58.61 | 77.88 | 47.11 | 84.88 | 46.4 | 40.21 | 81.92 | 31.87 |
| e5-large-v2 | 335M | 62.79 | 59.4 | 76.44 | 45.23 | 86.06 | 45.72 | 49.31 | 80.67 | 32.34 |
| Tarka-Embedding-150M-V1 | 150M | 66.40 | 61.35 | 86.28 | 51.94 | 81.65 | 45.66 | 51.62 | 81.48 | 30.86 |
Sentence Transformers Usage
# Requires transformers>=4.51.0
# Requires sentence-transformers>=5.0.0 (encode_query and encode_document were added in v5.0)
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")
# The queries and documents to embed
queries = [
"What is the capital of China?",
"Explain gravity",
]
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (2, 768) (2, 768)
# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7830, 0.2929],[0.3390, 0.6536]])
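The similarity matrix has one row per query and one column per document, so ranking means sorting each row in descending order. The following sketch hard-codes the scores printed above as plain lists so it runs without downloading the model. The variable names are illustrative.

```python
queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other.",
]

# Scores copied from the example output above (rows: queries, columns: documents).
similarities = [
    [0.7830, 0.2929],
    [0.3390, 0.6536],
]

# For each query, order document indices from most to least similar.
rankings = [
    sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    for row in similarities
]

for query, row, order in zip(queries, similarities, rankings):
    best = order[0]
    print(f"{query!r} -> {documents[best]!r} (score {row[best]:.4f})")
```

Each query is matched to the document that actually answers it, which is the behavior expected of a retrieval model.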
Prompt Instructions
EmbeddingGemma can generate optimized embeddings for various use cases—such as document retrieval, question answering, and fact verification—or for specific input types—either a query or a document—using prompts that are prepended to the input strings.
Query prompts follow the form `task: {task description} | query: `, where the task description varies by use case and defaults to `search result`. Document-style prompts follow the form `title: {title | "none"} | text: `, where the title is either `none` (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.
Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.
| Use Case (task type enum) | Descriptions | Recommended Prompt |
|---|---|---|
| Retrieval (Query) | Used to generate embeddings that are optimized for document search or information retrieval | `task: search result \| query: {content}` |
| Retrieval (Document) | Used to generate embeddings that are optimized for document search or information retrieval | `title: {title \| "none"} \| text: {content}` |
| Question Answering | Used to generate embeddings that are optimized for document search or information retrieval | `task: question answering \| query: {content}` |
| Fact Verification | Used to generate embeddings that are optimized for document search or information retrieval | `task: fact checking \| query: {content}` |
| Classification | Used to generate embeddings that are optimized to classify texts according to preset labels | `task: classification \| query: {content}` |
| Clustering | Used to generate embeddings that are optimized to cluster texts based on their similarities | `task: clustering \| query: {content}` |
| Semantic Similarity | Used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases. | `task: sentence similarity \| query: {content}` |
| Code Retrieval | Used to retrieve a code block based on a natural language query, such as sort an array or reverse a linked list. Embeddings of the code blocks are computed using retrieval_document. | `task: code retrieval \| query: {content}` |
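When the prompts are not already configured in your framework, the templates above can be applied by hand. The helpers below are a minimal sketch: `format_query` and `format_document` are hypothetical names introduced for illustration, not part of any library, and they only build the prompt strings described in this section.

```python
def format_query(content, task="search result"):
    """Build a query prompt of the form 'task: {task} | query: {content}'."""
    return f"task: {task} | query: {content}"

def format_document(content, title=None):
    """Build a document prompt; the title falls back to 'none' when absent."""
    return f"title: {title if title else 'none'} | text: {content}"

retrieval_query = format_query("Explain gravity")
qa_query = format_query("What is the capital of China?", task="question answering")
untitled_doc = format_document("The capital of China is Beijing.")
titled_doc = format_document("Gravity is a force.", title="Gravity")

print(retrieval_query)
print(qa_query)
print(untitled_doc)
print(titled_doc)
```

Keeping the task description as a parameter lets the same helper cover every query-style row in the table, while documents always use the title/text form.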
Usage and Limitations
These models have certain limitations that users should be aware of.
Intended Usage
Open embedding models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Semantic Similarity: Embeddings optimized to assess text similarity, such as recommendation systems and duplicate detection
- Classification: Embeddings optimized to classify texts according to preset labels, such as sentiment analysis and spam detection
- Clustering: Embeddings optimized to cluster texts based on their similarities, such as document organization, market research, and anomaly detection
- Retrieval
  - Document: Embeddings optimized for document search, such as indexing articles, books, or web pages for search
  - Query: Embeddings optimized for general search queries, such as custom search
  - Code Query: Embeddings optimized for retrieval of code blocks based on natural language queries, such as code suggestions and search
- Question Answering: Embeddings for questions in a question-answering system, optimized for finding documents that answer the question, such as a chatbot
- Fact Verification: Embeddings for statements that need to be verified, optimized for retrieving documents that contain evidence supporting or refuting the statement, such as automated fact-checking systems
Acknowledgments
Special thanks to:
- Google DeepMind and the LFM team for providing the base models and foundational research.
Gratitude is also extended to the open-source community for creating the tools, frameworks, and datasets that enabled fine-tuning and evaluation of this model.
Disclaimer: The creator of this model is not responsible for any misuse, damages, or legal issues arising from the use of this model. This model incorporates Gemma materials provided under the Gemma Terms of Use, which continue to apply.