SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
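
For intuition, the three modules above can be reproduced by hand with the transformers library. This is a minimal sketch for illustration only; the SentenceTransformer API shown under Usage is the supported path:

import torch
from transformers import AutoModel, AutoTokenizer

repo = "rajatgupta99924/AIE6-S09-eca4bfc6-eb64-44a4-a71d-e09bf2b78f50"
tokenizer = AutoTokenizer.from_pretrained(repo)
encoder = AutoModel.from_pretrained(repo)  # module (0): the BertModel

batch = tokenizer(["example sentence"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq, 1024)

# Module (1): pooling_mode_cls_token=True keeps only the [CLS] token embedding
cls_embedding = token_embeddings[:, 0]  # (batch, 1024)
# Module (2): L2-normalize, so dot products equal cosine similarities
sentence_embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)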

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("rajatgupta99924/AIE6-S09-eca4bfc6-eb64-44a4-a71d-e09bf2b78f50")
# Run inference
queries = [
    "What type of dish is shown in the photo and what does it contain?",
]
documents = [
    'Against this photo of butterflies at the California Academy of Sciences:\n\n\nA shallow dish, likely a hummingbird or butterfly feeder, is red.  Pieces of orange slices of fruit are visible inside the dish.\nTwo butterflies are positioned in the feeder, one is a dark brown/black butterfly with white/cream-colored markings.  The other is a large, brown butterfly with patterns of lighter brown, beige, and black markings, including prominent eye spots. The larger brown butterfly appears to be feeding on the fruit.',
    'Except... you can run generated code to see if it’s correct. And with patterns like ChatGPT Code Interpreter the LLM can execute the code itself, process the error message, then rewrite it and keep trying until it works!\nSo hallucination is a much lesser problem for code generation than for anything else. If only we had the equivalent of Code Interpreter for fact-checking natural language!\nHow should we feel about this as software engineers?\nOn the one hand, this feels like a threat: who needs a programmer if ChatGPT can write code for you?',
    'On the other hand, as software engineers we are better placed to take advantage of this than anyone else. We’ve all been given weird coding interns—we can use our deep knowledge to prompt them to solve coding problems more effectively than anyone else can.\nThe ethics of this space remain diabolically complex\nIn September last year Andy Baio and I produced the first major story on the unlicensed training data behind Stable Diffusion.\nSince then, almost every major LLM (and most of the image generation models) have also been trained on unlicensed data.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 1024) (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.4179, -0.0420,  0.0399]])
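
To turn those scores into a ranking, continuing the snippet above:

import torch

# Order documents per query by descending similarity
ranking = torch.argsort(similarities, dim=1, descending=True)
print(ranking)
# tensor([[0, 2, 1]]): the butterfly-feeder passage ranks first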

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.95
cosine_accuracy@3    1.0
cosine_accuracy@5    1.0
cosine_accuracy@10   1.0
cosine_precision@1   0.95
cosine_precision@3   0.3333
cosine_precision@5   0.2
cosine_precision@10  0.1
cosine_recall@1      0.95
cosine_recall@3      1.0
cosine_recall@5      1.0
cosine_recall@10     1.0
cosine_ndcg@10       0.9815
cosine_mrr@10        0.975
cosine_map@100       0.975
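
These numbers come from an information-retrieval evaluation. A hedged sketch of producing such metrics with the library's InformationRetrievalEvaluator follows; the queries, corpus, and relevance judgments below are placeholders, since the actual evaluation split is not published with this card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("rajatgupta99924/AIE6-S09-eca4bfc6-eb64-44a4-a71d-e09bf2b78f50")

# Placeholder data: query id -> text, doc id -> text, query id -> relevant doc ids
queries = {"q1": "What type of dish is shown in the photo?"}
corpus = {
    "d1": "A shallow dish, likely a butterfly feeder, with orange slices inside.",
    "d2": "Prompt injection explained, with video, slides, and a transcript.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
results = evaluator(model)
print(results["cosine_ndcg@10"])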

Training Details

Training Dataset

Unnamed Dataset

  • Size: 160 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 160 samples:
                 sentence_0   sentence_1
    type         string       string
    min tokens   12           34
    mean tokens  20.58        133.43
    max tokens   33           214
  • Samples:

    sentence_0: What topics are covered in the articles related to large language models (LLMs) and AI development in the provided context?
    sentence_1:
      Embeddings: What they are and why they matter (61.7k, 79.3k)
      Catching up on the weird world of LLMs (61.6k, 85.9k)
      llamafile is the new best way to run an LLM on your own computer (52k, 66k)
      Prompt injection explained, with video, slides, and a transcript (51k, 61.9k)
      AI-enhanced development makes me more ambitious with my projects (49.6k, 60.1k)
      Understanding GPT tokenizers (49.5k, 61.1k)
      Exploring GPTs: ChatGPT in a trench coat? (46.4k, 58.5k)
      Could you train a ChatGPT-beating model for $85,000 and run it in a browser? (40.5k, 49.2k)
      How to implement Q&A against your documentation with GPT3, embeddings and Datasette (37.3k, 44.9k)
      Lawyer cites fake cases invented by ChatGPT, judge is not amused (37.1k, 47.4k)

    sentence_0: Which article discusses the potential cost and feasibility of training a ChatGPT-beating model to run in a browser?
    sentence_1: (same article list as in the first sample)

    sentence_0: What are some of the capabilities of Large Language Models mentioned in the context?
    sentence_1: Here’s the sequel to this post: Things we learned about LLMs in 2024. Large Language Models. In the past 24-36 months, our species has discovered that you can take a GIANT corpus of text, run it through a pile of GPUs, and use it to create a fascinating new kind of software. LLMs can do a lot of things. They can answer questions, summarize documents, translate from one language to another, extract information and even write surprisingly competent code. They can also help you cheat at your homework, generate unlimited streams of fake content and be used for all manner of nefarious purposes.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
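
Because training used MatryoshkaLoss, the leading dimensions of each embedding are useful on their own, so the model can be loaded with a truncated output size. A short sketch using 256, one of the trained dimensions (expect some quality trade-off at smaller sizes):

from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first 256 of the 1024 output dimensions
model_256 = SentenceTransformer(
    "rajatgupta99924/AIE6-S09-eca4bfc6-eb64-44a4-a71d-e09bf2b78f50",
    truncate_dim=256,
)
embeddings = model_256.encode(["example sentence"])
print(embeddings.shape)
# (1, 256)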
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin
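
For reference, a hedged sketch of a training setup consistent with these hyperparameters and the loss configuration above; the two example pairs stand in for the unnamed 160-sample dataset, and the evaluation wiring is omitted to keep the sketch minimal:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Stand-in for the 160 (sentence_0, sentence_1) training pairs
train_dataset = Dataset.from_dict({
    "sentence_0": ["example query A", "example query B"],
    "sentence_1": ["example passage A", "example passage B"],
})

base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    num_train_epochs=10,
    per_device_train_batch_size=10,
    # the original run also used eval_strategy="steps" with an evaluator
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()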

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch  Step  cosine_ndcg@10
1.0    16    0.9815
2.0    32    0.9815
3.0    48    0.9815
3.125  50    0.9815
4.0    64    0.9815
5.0    80    0.9815
6.0    96    0.9815
6.25   100   0.9815
7.0    112   0.9815
8.0    128   0.9815
9.0    144   0.9815
9.375  150   0.9815
10.0   160   0.9815

Framework Versions

  • Python: 3.13.7
  • Sentence Transformers: 5.1.0
  • Transformers: 4.56.1
  • PyTorch: 2.8.0+cpu
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.22.0
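
To reproduce this environment, the versions above can be pinned at install time (torch 2.8.0 here was the CPU build):

pip install "sentence-transformers==5.1.0" "transformers==4.56.1" "torch==2.8.0" \
    "accelerate==1.10.1" "datasets==4.0.0" "tokenizers==0.22.0"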

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}