Introduction

Tarka-Embedding-150M-V1 is a 150M parameter embedding model designed to produce 768-dimensional dense text representations. It is optimized for a wide range of downstream applications such as semantic similarity, search, and retrieval-augmented generation (RAG). The model focuses on capturing deep contextual semantics to support general-purpose text understanding across diverse domains.

The model is trained using Data-Free Knowledge Distillation (DFKD). To prepare the training data, standard open-source datasets were used only as a source of raw textual content — all labels, annotations, and structural elements were stripped to create a plain, unlabeled text corpus. The resulting dataset contained approximately 2 billion tokens, all of which were used during model training.
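
As a rough illustration of how such a distillation objective can look (the exact loss used for Tarka-Embedding-150M-V1 is not published, so this is only a sketch), the student's sentence embeddings can be pulled toward the frozen teacher's embeddings on unlabeled text with a combined MSE and cosine term:

import torch
import torch.nn.functional as F

def embedding_distillation_loss(student_emb: torch.Tensor,
                                teacher_emb: torch.Tensor) -> torch.Tensor:
    """Pull student embeddings toward frozen teacher embeddings on unlabeled text.

    Illustrative only: the actual objective used for Tarka-Embedding-150M-V1 is
    not published. Both tensors are (batch, 768), since student and teacher
    share the 768-dimensional output space.
    """
    mse = F.mse_loss(student_emb, teacher_emb)
    cosine = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return mse + cosine

# Example with random stand-in embeddings
student = torch.randn(4, 768, requires_grad=True)
teacher = torch.randn(4, 768)
print(embedding_distillation_loss(student, teacher))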

Find more information about Tarka-Embedding-150M-V1 in our blog post.

🚀 Try our demo: https://huggingface.co/spaces/Tarka-AIR/Tarka-Embedding

Model Details

Tarka-Embedding-150M-V1 has the following features:

  • Model Type: Text Embedding
  • Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
  • Number of Parameters: 150M
  • Context Length: 2048
  • Embedding Dimension: 768

Currently, only the MTEB (English) benchmark has been evaluated. Additional multilingual benchmark results will be released in future updates.

Training Details

  • Initialization: Based on google/embeddinggemma-300m and LiquidAI/LFM2-350M
  • Architecture Modifications: The EmbeddingGemma tokenizer was replaced with the LFM2-350M tokenizer, and the embedding layer and several hidden layers were reinitialized to match the new vocabulary. This shrinks the embedding matrix from [262,144, 768] to [64,400, 768], reducing the total parameter count by approximately 50% (a minimal sketch of this swap follows this list).
  • Training Data: 2 billion tokens curated from multiple open-source datasets.
  • Teacher Model: google/embeddinggemma-300m
  • Compute Resources: 40 GPU hours on NVIDIA A100
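
The tokenizer/embedding swap above can be approximated with standard transformers APIs. A minimal sketch, assuming both base checkpoints load via AutoModel/AutoTokenizer; the exact recipe (which hidden layers were reinitialized, and how) is not published:

# Illustrative sketch of the tokenizer swap described above (not the exact recipe).
from transformers import AutoModel, AutoTokenizer

# Teacher backbone and the replacement tokenizer
model = AutoModel.from_pretrained("google/embeddinggemma-300m")
new_tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")

# Resize the input embedding matrix from the Gemma vocabulary (262,144 rows)
# to the LFM2 vocabulary size; per the card, the embedding layer and several
# hidden layers were then reinitialized before distillation.
model.resize_token_embeddings(len(new_tokenizer))
print(model.get_input_embeddings().weight.shape)  # roughly [64400, 768]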

Citation

@misc{tarka_ai_research_2025,
    author       = { Tarka AI Research },
    title        = { Tarka-Embedding-150M-V1 (Revision c5f4f43) },
    year         = 2025,
    url          = { https://huggingface.co/Tarka-AIR/Tarka-Embedding-150M-V1 },
    doi          = { 10.57967/hf/6875 },
    publisher    = { Hugging Face }
}

MTEB (Eng v2)

| Model | Param. | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| GIST-large-Embedding-v0 | 335M | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| mxbai-embed-large-v1 | 335M | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| UAE-Large-V1 | 335M | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |
| GIST-Embedding-v0 | 109M | 65.5 | 61.4 | 78.16 | 48.5 | 86.33 | 47.52 | 53.59 | 83.35 | 32.32 |
| bge-large-en-v1.5 | 335M | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| multilingual-e5-large-instruct | 560M | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| gte-large | 335M | 64.77 | 60.86 | 75.47 | 48.2 | 85.08 | 47.84 | 53.29 | 83.27 | 32.9 |
| bge-base-en-v1.5 | 109M | 65.14 | 60.77 | 77.69 | 47.42 | 86.56 | 46.66 | 54.75 | 82.12 | 30.19 |
| mini-gte | 66M | 65.06 | 60.65 | 79.95 | 47.89 | 84.78 | 46.86 | 53.23 | 81.55 | 30.31 |
| bilingual-embedding-large | 559M | 63.77 | 60.2 | 77.17 | 46.53 | 85.62 | 46.25 | 46.86 | 86 | 32.95 |
| gte-base | 109M | 63.9 | 59.94 | 75.04 | 47.74 | 84.68 | 47.17 | 51.9 | 82.17 | 30.9 |
| mmlw-roberta-large | 434M | 61.8 | 59.45 | 79.66 | 47.89 | 85.2 | 47.56 | 39.69 | 81.2 | 34.97 |
| e5-large | 335M | 63.13 | 59.68 | 75.61 | 45.88 | 85.94 | 45.43 | 49.64 | 82 | 33.26 |
| mmlw-e5-base | 278M | 61.43 | 58.61 | 77.88 | 47.11 | 84.88 | 46.4 | 40.21 | 81.92 | 31.87 |
| e5-large-v2 | 335M | 62.79 | 59.4 | 76.44 | 45.23 | 86.06 | 45.72 | 49.31 | 80.67 | 32.34 |
| Tarka-Embedding-150M-V1 | 150M | 66.40 | 61.35 | 86.28 | 51.94 | 81.65 | 45.66 | 51.62 | 81.48 | 30.86 |
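
Numbers like these can be reproduced with the open-source mteb package. A minimal sketch, assuming a recent mteb release (the authors' exact evaluation setup is not published, and the mteb API varies across versions):

# Minimal MTEB evaluation sketch (illustrative; not the authors' exact setup).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# Run a single English STS task as a smoke test; swap in the full
# MTEB (eng, v2) benchmark tasks for the complete evaluation.
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/Tarka-Embedding-150M-V1")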

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=5.0.0 (for encode_query / encode_document)

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (2, 768) (2, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7830, 0.2929],[0.3390, 0.6536]])

Prompt Instructions

EmbeddingGemma can generate optimized embeddings for various use cases—such as document retrieval, question answering, and fact verification—or for specific input types—either a query or a document—using prompts that are prepended to the input strings. Query prompts follow the form task: {task description} | query: where the task description varies by the use case, with the default task description being search result. Document-style prompts follow the form title: {title | "none"} | text: where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.

Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.


  • Retrieval (Query): used to generate embeddings that are optimized for document search or information retrieval.
    Prompt: task: search result | query: {content}

  • Retrieval (Document)
    Prompt: title: {title | "none"} | text: {content}

  • Question Answering
    Prompt: task: question answering | query: {content}

  • Fact Verification
    Prompt: task: fact checking | query: {content}

  • Classification: used to generate embeddings that are optimized to classify texts according to preset labels.
    Prompt: task: classification | query: {content}

  • Clustering: used to generate embeddings that are optimized to cluster texts based on their similarities.
    Prompt: task: clustering | query: {content}

  • Semantic Similarity: used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases.
    Prompt: task: sentence similarity | query: {content}

  • Code Retrieval: used to retrieve a code block based on a natural language query, such as "sort an array" or "reverse a linked list". Embeddings of the code blocks are computed using retrieval_document.
    Prompt: task: code retrieval | query: {content}
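
Because these prompts are plain text prefixes, they can also be prepended by hand when your framework does not apply them for you. A minimal sketch using the retrieval prompts from the list above (ordinary sentence-transformers calls; nothing beyond the prompt strings is model-specific):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# Prepend the query prompt and the default document prompt manually.
query = "task: search result | query: How do antibiotics work?"
document = "title: none | text: Antibiotics treat bacterial infections by killing bacteria or slowing their growth."

query_embedding = model.encode([query])
document_embedding = model.encode([document])
print(model.similarity(query_embedding, document_embedding))

# If the checkpoint ships prompt templates, they are listed in model.prompts,
# and encode_query / encode_document may apply them automatically.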

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open embedding models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Semantic Similarity: Embeddings optimized to assess text similarity, such as recommendation systems and duplicate detection (a small sketch appears after this list)

  • Classification: Embeddings optimized to classify texts according to preset labels, such as sentiment analysis and spam detection

  • Clustering: Embeddings optimized to cluster texts based on their similarities, such as document organization, market research, and anomaly detection

  • Retrieval

    • Document: Embeddings optimized for document search, such as indexing articles, books, or web pages for search
    • Query: Embeddings optimized for general search queries, such as custom search
    • Code Query: Embeddings optimized for retrieval of code blocks based on natural language queries, such as code suggestions and search
  • Question Answering: Embeddings for questions in a question-answering system, optimized for finding documents that answer the question, such as a chatbot.

  • Fact Verification: Embeddings for statements that need to be verified, optimized for retrieving documents that contain evidence supporting or refuting the statement, such as automated fact-checking systems.
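
As a concrete illustration of the semantic-similarity use case above (duplicate detection), the sketch below flags near-duplicate sentences by thresholding pairwise cosine similarity; the 0.85 threshold and the sample sentences are arbitrary choices for illustration, not recommended settings:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

sentences = [
    "How do I reset my password?",
    "What are the steps to reset my password?",
    "What is the refund policy?",
]

# Use the sentence-similarity prompt described earlier (non-retrieval scoring).
embeddings = model.encode(["task: sentence similarity | query: " + s for s in sentences])

# Pairwise cosine similarities; pairs above an illustrative threshold are flagged.
scores = model.similarity(embeddings, embeddings)
THRESHOLD = 0.85  # tune on your own data
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if float(scores[i][j]) >= THRESHOLD:
            print(f"Possible duplicates ({float(scores[i][j]):.2f}):", sentences[i], "<->", sentences[j])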

Acknowledgments

Special thanks to:

  • Google DeepMind and the LiquidAI LFM team for providing the base models and foundational research.

Gratitude is also extended to the open-source community for creating the tools, frameworks, and datasets that enabled fine-tuning and evaluation of this model.

Disclaimer: The creator of this model is not responsible for any misuse, damages, or legal issues arising from its use. This model incorporates Gemma materials provided under the Gemma Terms of Use, which continue to apply.
