Introduction

Tarka-Embedding-150M-V1 is a 150M parameter embedding model designed to produce 768-dimensional dense text representations. It is optimized for a wide range of downstream applications such as semantic similarity, search, and retrieval-augmented generation (RAG). The model focuses on capturing deep contextual semantics to support general-purpose text understanding across diverse domains.

The model is trained using Data-Free Knowledge Distillation (DFKD). To prepare the training data, standard open-source datasets were used only as a source of raw textual content — all labels, annotations, and structural elements were stripped to create a plain, unlabeled text corpus. The resulting dataset contained approximately 2 billion tokens, all of which were used during model training.
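
As a rough illustration of how such a distillation objective can look (the exact loss used for Tarka-Embedding-150M-V1 is not published, so this is only a sketch), the student's sentence embeddings can be pulled toward the frozen teacher's embeddings on unlabeled text with a combined MSE and cosine term:

import torch
import torch.nn.functional as F

def embedding_distillation_loss(student_emb: torch.Tensor,
                                teacher_emb: torch.Tensor) -> torch.Tensor:
    """Pull student embeddings toward frozen teacher embeddings on unlabeled text.

    Illustrative only: the actual objective used for Tarka-Embedding-150M-V1 is
    not published. Both tensors are (batch, 768), since student and teacher
    share the 768-dimensional output space.
    """
    mse = F.mse_loss(student_emb, teacher_emb)
    cosine = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
    return mse + cosine

# Example with random stand-in embeddings
student = torch.randn(4, 768, requires_grad=True)
teacher = torch.randn(4, 768)
print(embedding_distillation_loss(student, teacher))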

Find more information about Tarka-Embedding-150M-V1 in our blog post.

🚀 Try our demo: https://huggingface.co/spaces/Tarka-AIR/Tarka-Embedding

Model Details

Tarka-Embedding-150M-V1 has the following features:

  • Model Type: Text Embedding
  • Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
  • Number of Parameters: 150M
  • Context Length: 2048
  • Embedding Dimension: 768

Currently, only the MTEB (English) benchmark has been evaluated. Additional multilingual benchmark results will be released in future updates.

Training Details

  • Initialization: Based on google/embeddinggemma-300m and LiquidAI/LFM2-350M
  • Architecture Modifications: The EmbeddingGemma tokenizer was replaced with the LFM2-350M tokenizer, and the embedding layer and several hidden layers were reinitialized to match the new vocabulary. This shrinks the embedding matrix from [262,144, 768] to [64,400, 768], reducing the total parameter count by approximately 50% (a minimal sketch of this swap follows this list).
  • Training Data: 2 billion tokens curated from multiple open-source datasets.
  • Teacher Model: google/embeddinggemma-300m
  • Compute Resources: 40 GPU hours on NVIDIA A100
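
The tokenizer/embedding swap above can be approximated with standard transformers APIs. A minimal sketch, assuming both base checkpoints load via AutoModel/AutoTokenizer; the exact recipe (which hidden layers were reinitialized, and how) is not published:

# Illustrative sketch of the tokenizer swap described above (not the exact recipe).
from transformers import AutoModel, AutoTokenizer

# Teacher backbone and the replacement tokenizer
model = AutoModel.from_pretrained("google/embeddinggemma-300m")
new_tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")

# Resize the input embedding matrix from the Gemma vocabulary (262,144 rows)
# to the LFM2 vocabulary size; per the card, the embedding layer and several
# hidden layers were then reinitialized before distillation.
model.resize_token_embeddings(len(new_tokenizer))
print(model.get_input_embeddings().weight.shape)  # roughly [64400, 768]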

Citation

@misc{tarka_ai_research_2025,
    author       = { Tarka AI Research },
    title        = { Tarka-Embedding-150M-V1 (Revision c5f4f43) },
    year         = 2025,
    url          = { https://huggingface.co/Tarka-AIR/Tarka-Embedding-150M-V1 },
    doi          = { 10.57967/hf/6875 },
    publisher    = { Hugging Face }
}

MTEB (Eng v2)

| Model | Param. | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| GIST-large-Embedding-v0 | 335M | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| mxbai-embed-large-v1 | 335M | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| UAE-Large-V1 | 335M | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |
| GIST-Embedding-v0 | 109M | 65.5 | 61.4 | 78.16 | 48.5 | 86.33 | 47.52 | 53.59 | 83.35 | 32.32 |
| bge-large-en-v1.5 | 335M | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| multilingual-e5-large-instruct | 560M | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| gte-large | 335M | 64.77 | 60.86 | 75.47 | 48.2 | 85.08 | 47.84 | 53.29 | 83.27 | 32.9 |
| bge-base-en-v1.5 | 109M | 65.14 | 60.77 | 77.69 | 47.42 | 86.56 | 46.66 | 54.75 | 82.12 | 30.19 |
| mini-gte | 66M | 65.06 | 60.65 | 79.95 | 47.89 | 84.78 | 46.86 | 53.23 | 81.55 | 30.31 |
| bilingual-embedding-large | 559M | 63.77 | 60.2 | 77.17 | 46.53 | 85.62 | 46.25 | 46.86 | 86 | 32.95 |
| gte-base | 109M | 63.9 | 59.94 | 75.04 | 47.74 | 84.68 | 47.17 | 51.9 | 82.17 | 30.9 |
| mmlw-roberta-large | 434M | 61.8 | 59.45 | 79.66 | 47.89 | 85.2 | 47.56 | 39.69 | 81.2 | 34.97 |
| e5-large | 335M | 63.13 | 59.68 | 75.61 | 45.88 | 85.94 | 45.43 | 49.64 | 82 | 33.26 |
| mmlw-e5-base | 278M | 61.43 | 58.61 | 77.88 | 47.11 | 84.88 | 46.4 | 40.21 | 81.92 | 31.87 |
| e5-large-v2 | 335M | 62.79 | 59.4 | 76.44 | 45.23 | 86.06 | 45.72 | 49.31 | 80.67 | 32.34 |
| Tarka-Embedding-150M-V1 | 150M | 66.40 | 61.35 | 86.28 | 51.94 | 81.65 | 45.66 | 51.62 | 81.48 | 30.86 |
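
Numbers like these can be reproduced with the open-source mteb package. A minimal sketch, assuming a recent mteb release (the authors' exact evaluation setup is not published, and the mteb API varies across versions):

# Minimal MTEB evaluation sketch (illustrative; not the authors' exact setup).
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# Run a single English STS task as a smoke test; swap in the full
# MTEB (eng, v2) benchmark tasks for the complete evaluation.
tasks = mteb.get_tasks(tasks=["STSBenchmark"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/Tarka-Embedding-150M-V1")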

Sentence Transformers Usage

# Requires transformers>=4.51.0
# Requires sentence-transformers>=5.0.0 (for encode_query / encode_document)

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (2, 768) (2, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7830, 0.2929],[0.3390, 0.6536]])

Prompt Instructions

EmbeddingGemma can generate optimized embeddings for various use cases—such as document retrieval, question answering, and fact verification—or for specific input types—either a query or a document—using prompts that are prepended to the input strings. Query prompts follow the form task: {task description} | query: where the task description varies by the use case, with the default task description being search result. Document-style prompts follow the form title: {title | "none"} | text: where the title is either none (the default) or the actual title of the document. Note that providing a title, if available, will improve model performance for document prompts but may require manual formatting.

Use the following prompts based on your use case and input data type. These may already be available in the EmbeddingGemma configuration in your modeling framework of choice.


  • Retrieval (Query): used to generate embeddings that are optimized for document search or information retrieval.
    Prompt: task: search result | query: {content}

  • Retrieval (Document)
    Prompt: title: {title | "none"} | text: {content}

  • Question Answering
    Prompt: task: question answering | query: {content}

  • Fact Verification
    Prompt: task: fact checking | query: {content}

  • Classification: used to generate embeddings that are optimized to classify texts according to preset labels.
    Prompt: task: classification | query: {content}

  • Clustering: used to generate embeddings that are optimized to cluster texts based on their similarities.
    Prompt: task: clustering | query: {content}

  • Semantic Similarity: used to generate embeddings that are optimized to assess text similarity. This is not intended for retrieval use cases.
    Prompt: task: sentence similarity | query: {content}

  • Code Retrieval: used to retrieve a code block based on a natural language query, such as "sort an array" or "reverse a linked list". Embeddings of the code blocks are computed using retrieval_document.
    Prompt: task: code retrieval | query: {content}
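
Because these prompts are plain text prefixes, they can also be prepended by hand when your framework does not apply them for you. A minimal sketch using the retrieval prompts from the list above (ordinary sentence-transformers calls; nothing beyond the prompt strings is model-specific):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

# Prepend the query prompt and the default document prompt manually.
query = "task: search result | query: How do antibiotics work?"
document = "title: none | text: Antibiotics treat bacterial infections by killing bacteria or slowing their growth."

query_embedding = model.encode([query])
document_embedding = model.encode([document])
print(model.similarity(query_embedding, document_embedding))

# If the checkpoint ships prompt templates, they are listed in model.prompts,
# and encode_query / encode_document may apply them automatically.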

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Open embedding models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Semantic Similarity: Embeddings optimized to assess text similarity, such as recommendation systems and duplicate detection (a small sketch appears after this list)

  • Classification: Embeddings optimized to classify texts according to preset labels, such as sentiment analysis and spam detection

  • Clustering: Embeddings optimized to cluster texts based on their similarities, such as document organization, market research, and anomaly detection

  • Retrieval

    • Document: Embeddings optimized for document search, such as indexing articles, books, or web pages for search
    • Query: Embeddings optimized for general search queries, such as custom search
    • Code Query: Embeddings optimized for retrieval of code blocks based on natural language queries, such as code suggestions and search
  • Question Answering: Embeddings for questions in a question-answering system, optimized for finding documents that answer the question, such as a chatbot.

  • Fact Verification: Embeddings for statements that need to be verified, optimized for retrieving documents that contain evidence supporting or refuting the statement, such as automated fact-checking systems.
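
As a concrete illustration of the semantic-similarity use case above (duplicate detection), the sketch below flags near-duplicate sentences by thresholding pairwise cosine similarity; the 0.85 threshold and the sample sentences are arbitrary choices for illustration, not recommended settings:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Tarka-AIR/Tarka-Embedding-150M-V1")

sentences = [
    "How do I reset my password?",
    "What are the steps to reset my password?",
    "What is the refund policy?",
]

# Use the sentence-similarity prompt described earlier (non-retrieval scoring).
embeddings = model.encode(["task: sentence similarity | query: " + s for s in sentences])

# Pairwise cosine similarities; pairs above an illustrative threshold are flagged.
scores = model.similarity(embeddings, embeddings)
THRESHOLD = 0.85  # tune on your own data
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if float(scores[i][j]) >= THRESHOLD:
            print(f"Possible duplicates ({float(scores[i][j]):.2f}):", sentences[i], "<->", sentences[j])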

Acknowledgments

Special thanks to:

  • Google DeepMind and the LiquidAI LFM team for providing the base models and foundational research.

Gratitude is also extended to the open-source community for creating the tools, frameworks, and datasets that enabled fine-tuning and evaluation of this model.

Disclaimer: The creator of this model is not responsible for any misuse, damages, or legal issues arising from its use. This model incorporates Gemma materials provided under the Gemma Terms of Use, which continue to apply.
