Splade++ SelfDistil finetuned on MS MARCO

This is a SPLADE Sparse Encoder model finetuned from naver/splade-cocondenser-selfdistil using the sentence-transformers library. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

Model Type: SPLADE Sparse Encoder
Base model: naver/splade-cocondenser-selfdistil
Maximum Sequence Length: 512 tokens
Output Dimensionality: 30522 dimensions
Similarity Function: Dot Product
Language: en
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Documentation: Sparse Encoder Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sparse Encoders on Hugging Face

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
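
The two modules above can be read as follows: the masked-language-model head produces one logit per vocabulary entry for every input token, and the pooling layer collapses those rows into a single 30522-dimensional vector. The block below is a plain-Python sketch of that pooling step, assuming the usual SPLADE formulation (log(1 + ReLU(logit)) followed by a max over tokens). The function name `splade_pool` and the toy logits are illustrative and not part of the library.

```python
import math

def splade_pool(token_logits):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: list of rows, one per input token, each row holding
    one MLM logit per vocabulary entry.
    """
    vocab_size = len(token_logits[0])
    pooled = []
    for j in range(vocab_size):
        # ReLU drops negative logits; log1p dampens large activations.
        activations = [math.log1p(max(0.0, row[j])) for row in token_logits]
        # 'max' pooling keeps the strongest activation across tokens.
        pooled.append(max(activations))
    return pooled

# Toy example: 2 tokens, vocabulary of 4 entries (the real model uses 30522).
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [-3.0, -0.5, 1.0, 0.0],
]
vector = splade_pool(logits)
# Dimensions with a non-zero weight are the "active" dimensions.
active = [j for j, w in enumerate(vector) if w > 0]
print(active)
```

Any vocabulary entry whose logit is non-positive for every token ends up with weight zero, which is where the sparsity of the output comes from.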

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("tomaarsen/splade-cocondenser-selfdistil-msmarco-kldiv-marginmse-minilm-temp-4")
# Run inference
queries = [
    "who started gladiator lacrosse",
]
documents = [
    'Weed Eater was a string trimmer company founded in 1971 in Houston, Texas by George C. Ballas, Sr. , the inventor of the device. The idea for the Weed Eater trimmer came to him from the spinning nylon bristles of an automatic car wash.He thought that he could come up with a similar technique to protect the bark on trees that he was trimming around. His company was eventually bought by Emerson Electric and merged with Poulan.Poulan/Weed Eater was later purchased by Electrolux, which spun off the outdoors division as Husqvarna AB in 2006.Inventor Ballas was the father of champion ballroom dancer Corky Ballas and the grandfather of Dancing with the Stars dancer Mark Ballas.George Ballas died on June 25, 2011.he idea for the Weed Eater trimmer came to him from the spinning nylon bristles of an automatic car wash. He thought that he could come up with a similar technique to protect the bark on trees that he was trimming around. His company was eventually bought by Emerson Electric and merged with Poulan.',
    "The earliest types of gladiator were named after Rome's enemies of that time: the Samnite, Thracian and Gaul. The Samnite, heavily armed, elegantly helmed and probably the most popular type, was renamed Secutor and the Gaul renamed Murmillo, once these former enemies had been conquered then absorbed into Rome's Empire.",
    'Summit Hill, PA. Sponsored Topics. Summit Hill is a borough in Carbon County, Pennsylvania, United States. The population was 2,974 at the 2000 census. Summit Hill is located at 40Â°49â\x80²39â\x80³N 75Â°51â\x80²57â\x80³W / 40.8275Â°N 75.86583Â°W / 40.8275; -75.86583 (40.827420, -75.865892).',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.5921, 18.5601,  1.2212]])
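
Because the similarity function is a dot product over mostly-zero vectors, only dimensions that are active in both the query and the document contribute to a score. The sketch below illustrates that idea in plain Python. It represents each embedding as a hypothetical `{dimension: weight}` dict rather than the tensors the library returns, and the ids and weights are invented for illustration.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight}."""
    # Iterate over the smaller dict; only shared dimensions contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

# Hypothetical vocabulary ids and weights, chosen only for illustration.
query = {101: 1.5, 2047: 0.8}
document = {2047: 2.0, 3000: 0.4, 101: 0.5}
score = sparse_dot(query, document)
print(score)
```

Dimension 3000 is active only in the document, so it adds nothing to the score. This is the property that lets sparse embeddings be served from an inverted index.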

Evaluation

Metrics

Sparse Information Retrieval

Datasets: NanoMSMARCO, NanoNFCorpus and NanoNQ
Evaluated with SparseInformationRetrievalEvaluator

Metric	NanoMSMARCO	NanoNFCorpus	NanoNQ
dot_accuracy@1	0.5	0.4	0.5
dot_accuracy@3	0.72	0.62	0.76
dot_accuracy@5	0.78	0.66	0.84
dot_accuracy@10	0.9	0.7	0.9
dot_precision@1	0.5	0.4	0.5
dot_precision@3	0.24	0.4133	0.26
dot_precision@5	0.156	0.376	0.176
dot_precision@10	0.09	0.28	0.096
dot_recall@1	0.5	0.0427	0.48
dot_recall@3	0.72	0.0984	0.72
dot_recall@5	0.78	0.1213	0.79
dot_recall@10	0.9	0.1442	0.86
dot_ndcg@10	0.6926	0.3524	0.6905
dot_mrr@10	0.6271	0.5146	0.6484
dot_map@100	0.6332	0.1603	0.6306
query_active_dims	52.62	48.42	58.66
query_sparsity_ratio	0.9983	0.9984	0.9981
corpus_active_dims	479.5088	909.0663	531.6292
corpus_sparsity_ratio	0.9843	0.9702	0.9826
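
The sparsity rows in the table are consistent with a simple relationship to the active-dimension rows: one minus the mean number of active dimensions divided by the 30522-dimensional output size. The sketch below checks this, assuming that is how the evaluator derives the ratio. The helper name `sparsity_ratio` is illustrative.

```python
VOCAB_SIZE = 30522  # output dimensionality stated in the model description

def sparsity_ratio(active_dims, vocab_size=VOCAB_SIZE):
    """Fraction of dimensions that are zero, given the mean active count."""
    return 1.0 - active_dims / vocab_size

# Mean active dimensions copied from the table above.
query_active = {"NanoMSMARCO": 52.62, "NanoNFCorpus": 48.42, "NanoNQ": 58.66}
corpus_active = {"NanoMSMARCO": 479.5088, "NanoNFCorpus": 909.0663, "NanoNQ": 531.6292}

for name in query_active:
    print(
        name,
        round(sparsity_ratio(query_active[name]), 4),
        round(sparsity_ratio(corpus_active[name]), 4),
    )
```

Rounded to four decimals, the results match the query_sparsity_ratio and corpus_sparsity_ratio rows for all three datasets.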

Sparse Nano BEIR

Dataset: NanoBEIR_mean

Evaluated with SparseNanoBEIREvaluator with these parameters:

{
    "dataset_names": [
        "msmarco",
        "nfcorpus",
        "nq"
    ]
}

Metric	Value
dot_accuracy@1	0.4667
dot_accuracy@3	0.7
dot_accuracy@5	0.76
dot_accuracy@10	0.8333
dot_precision@1	0.4667
dot_precision@3	0.3044
dot_precision@5	0.236
dot_precision@10	0.1553
dot_recall@1	0.3409
dot_recall@3	0.5128
dot_recall@5	0.5638
dot_recall@10	0.6347
dot_ndcg@10	0.5785
dot_mrr@10	0.5967
dot_map@100	0.4747
query_active_dims	53.2333
query_sparsity_ratio	0.9983
corpus_active_dims	596.9909
corpus_sparsity_ratio	0.9804
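
The ranking metrics in this table are consistent with an unweighted mean of the three per-dataset values reported in the Sparse Information Retrieval table. The sketch below checks two of them, assuming plain averaging. The corpus_active_dims row does not equal the plain mean of its three per-dataset values, so the sketch does not attempt to reproduce it.

```python
from statistics import mean

# Per-dataset values copied from the Sparse Information Retrieval table.
ndcg_at_10 = {"msmarco": 0.6926, "nfcorpus": 0.3524, "nq": 0.6905}
mrr_at_10 = {"msmarco": 0.6271, "nfcorpus": 0.5146, "nq": 0.6484}

# Unweighted mean across datasets, rounded like the table.
nano_beir_ndcg = round(mean(ndcg_at_10.values()), 4)
nano_beir_mrr = round(mean(mrr_at_10.values()), 4)
print(nano_beir_ndcg, nano_beir_mrr)
```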

Training Details

Training Dataset

Unnamed Dataset

Size: 99,000 training samples
Columns: query, positive, negative, and label

Approximate statistics based on the first 1000 samples:

	query	positive	negative	label
type	string	string	string	list
details	min: 4 tokens mean: 9.2 tokens max: 34 tokens	min: 18 tokens mean: 79.86 tokens max: 219 tokens	min: 18 tokens mean: 79.96 tokens max: 270 tokens	size: 2 elements

Samples:

query	positive	negative	label
`rtn tv network`	`Home Shopping Network. Home Shopping Network (HSN) is an American broadcast, basic cable and satellite television network that is owned by HSN, Inc. (NASDAQ: HSNI), which also owns catalog company Cornerstone Brands. Based in St. Petersburg, Florida, United States, the home shopping channel has former and current sister channels in several other countries.`	`The Public Switched Telephone Network - The public switched telephone network (PSTN) is the international network of circuit-switched telephones. Learn more about PSTN at HowStuffWorks. x`	`[-1.0804121494293213, -5.908488750457764]`
`how did president nixon react to the watergate investigation?`	`The Watergate scandal was a major political scandal that occurred in the United States during the early 1970s, following a break-in by five men at the Democratic National Committee headquarters at the Watergate office complex in Washington, D.C. on June 17, 1972, and President Richard Nixon's administration's subsequent attempt to cover up its involvement. After the five burglars were caught and the conspiracy was discovered, Watergate was investigated by the United States Congress. Meanwhile, N`	`The release of the tape was ordered by the Supreme Court on July 24, 1974, in a case known as United States v. Nixon. The courtâs decision was unanimous. President Nixon released the tape on August 5. It was one of three conversations he had with Haldeman six days after the Watergate break-in. The tapes prove that he ordered a cover-up of the Watergate burglary. The Smoking Gun tape reveals that Nixon ordered the FBI to abandon its investigation of the break-in. [Read moreâ¦]`	`[4.117279052734375, 3.191757917404175]`
`what is a summary offense in pennsylvania`	`We provide cost effective house arrest and electronic monitoring services to magisterial district court systems throughout Pennsylvania including York, Harrisburg, Philadelphia and Allentown.In addition, we also serve the York County, Lancaster County and Chester County.e provide cost effective house arrest and electronic monitoring services to magisterial district court systems throughout Pennsylvania including York, Harrisburg, Philadelphia and Allentown.`	`In order to be convicted of Simple Assault, one must cause bodily injury. To be convicted of Aggravated Assault, one must cause serious bodily injury. From my research, Pennsylvania law defines bodily injury as the impairment of physical condition or substantial pain.`	`[-8.954689025878906, -1.3361705541610718]`
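
Each label holds two floats. Assuming these are teacher relevance scores for the (query, positive) and (query, negative) pairs, as is common in margin-based distillation, the MarginMSE idea is to make the student's score margin match the teacher's. The sketch below covers only that component. How the configured loss combines it with a KL-divergence term is not stated here, and the student scores are invented for illustration.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins."""
    errors = []
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        student_margin = sp - sn
        teacher_margin = tp - tn
        errors.append((student_margin - teacher_margin) ** 2)
    return sum(errors) / len(errors)

# Teacher scores rounded from the first two sample rows above.
teacher_pos = [-1.0804, 4.1173]
teacher_neg = [-5.9085, 3.1918]
# Hypothetical student dot-product scores, invented for illustration.
student_pos = [14.0, 20.0]
student_neg = [9.0, 19.0]
loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
print(loss)
```

Only the difference between the positive and negative score matters here, so the student's dot products can sit on a different scale than the teacher's scores.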

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseDistillKLDivMarginMSELoss",
    "lambda_corpus": 0.0005,
    "lambda_query": 0.0005
}
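
The two lambda values weight sparsity penalties on the query and document embeddings. The sketch below assumes the total objective is the base distillation loss plus a FLOPS regularizer (see the FlopsLoss citation) for each side, scaled by its lambda. The function names and the toy batch are illustrative, and any scheduling of the weights during training is ignored.

```python
def flops(embeddings):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    batch_size = len(embeddings)
    dims = len(embeddings[0])
    total = 0.0
    for j in range(dims):
        mean_activation = sum(row[j] for row in embeddings) / batch_size
        total += mean_activation ** 2
    return total

def splade_total(base_loss, query_emb, doc_emb,
                 lambda_query=0.0005, lambda_corpus=0.0005):
    """Base ranking loss plus weighted sparsity penalties (assumed form)."""
    return (base_loss
            + lambda_query * flops(query_emb)
            + lambda_corpus * flops(doc_emb))

# Toy batch of 2 embeddings over 3 dimensions; values are illustrative.
queries = [[1.0, 0.0, 0.0], [3.0, 0.0, 2.0]]
docs = [[2.0, 2.0, 0.0], [0.0, 4.0, 0.0]]
total = splade_total(0.6, queries, docs)
print(flops(queries), flops(docs), total)
```

Squaring the per-dimension mean penalizes dimensions that are active across many inputs more heavily than dimensions that fire rarely, which pushes the model toward evenly distributed, sparse activations.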

Evaluation Dataset

Unnamed Dataset

Size: 1,000 evaluation samples
Columns: query, positive, negative, and label

Approximate statistics based on the first 1000 samples:

	query	positive	negative	label
type	string	string	string	list
details	min: 4 tokens mean: 9.12 tokens max: 37 tokens	min: 17 tokens mean: 78.91 tokens max: 239 tokens	min: 25 tokens mean: 81.25 tokens max: 239 tokens	size: 2 elements

Samples:

query	positive	negative	label
`how long to cook roast beef for`	`Roasting times for beef. Preheat your oven to 160Â°C (325Â°F) and use these cooking times to prepare a roast that's moist, tender and delicious. Your roast should be covered with foil for the first half of the roasting time to prevent drying the outer layer.3 to 5lb Joint 1Â½ to 2 hours.reheat your oven to 160Â°C (325Â°F) and use these cooking times to prepare a roast that's moist, tender and delicious. Your roast should be covered with foil for the first half of the roasting time to prevent drying the outer layer.`	`Estimating Cooking Time for Large Beef Roasts. If you roast at a steady 325F (160C), subtract 2 minutes or so per pound. If the roast is refrigerated just before going into the oven, add 2 or 3 minutes per pound. WARNING NOTES: Remember, the rib roast will continue to cook as it sets.`	`[6.501978874206543, 8.214995384216309]`
`definition of fire inspection`	`Learn how to do a monthly fire extinguisher inspection in your workplace. Departments must assign an individual to inspect monthly the extinguishers in or adjacent to the department's facilities.1 Read Fire Extinguisher Types and Maintenance for more information.earn how to do a monthly fire extinguisher inspection in your workplace. Departments must assign an individual to inspect monthly the extinguishers in or adjacent to the department's facilities.`	`reconnaissance by fire-a method of reconnaissance in which fire is placed on a suspected enemy position in order to cause the enemy to disclose his presence by moving or returning fire. reconnaissance in force-an offensive operation designed to discover or test the enemy's strength (or to obtain other information). mission undertaken to obtain, by visual observation or other detection methods, information about the activities and resources of an enemy or potential enemy, or to secure data concerning the meteorological, hydrographic, or geographic characteristics of a particular area.`	`[-0.38299351930618286, -0.9372650384902954]`
`how many stores does family dollar have`	`Property Spotlight: New Retail Center at Hamilton & Warner - Outlots Available!! Family Dollar is closing stores following a disappointing second quarter. Family Dollar Stores Inc. wonât just be cutting prices in an attempt to boost its business â itâll be closing stores as well. The Matthews, N.C.-based discount retailer plans to shutter 370 under-performing shops, according to the Charlotte Business Journal.`	`Glassdoor has 1,976 Family Dollar Stores reviews submitted anonymously by Family Dollar Stores employees. Read employee reviews and ratings on Glassdoor to decide if Family Dollar Stores is right for you.`	`[4.726407527923584, 8.284608840942383]`

Loss: SpladeLoss with these parameters:

{
    "loss": "SparseDistillKLDivMarginMSELoss",
    "lambda_corpus": 0.0005,
    "lambda_query": 0.0005
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
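
Taken together, a learning rate of 2e-05, a warmup ratio of 0.1 and the linear scheduler listed under All Hyperparameters imply a ramp up to the peak rate followed by a linear decay to zero. The sketch below reproduces that shape in plain Python. It assumes one optimizer step per batch of 16 on a single GPU and that the last partial batch is kept. The no_duplicates batch sampler may change the exact step count slightly, and `linear_schedule_lr` is an illustrative helper, not a library function.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 99,000 samples at batch size 16, one epoch, last partial batch kept.
total_steps = math.ceil(99_000 / 16)
warmup_steps = math.ceil(total_steps * 0.1)
print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, total_steps))
```

The estimate of 6188 steps agrees with the training logs, where step 6100 corresponds to epoch 0.9858.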

All Hyperparameters


overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss	NanoMSMARCO_dot_ndcg@10	NanoNFCorpus_dot_ndcg@10	NanoNQ_dot_ndcg@10	NanoBEIR_mean_dot_ndcg@10
-1	-1	-	-	0.6592	0.3737	0.6949	0.5759
0.0162	100	0.6756	-	-	-	-	-
0.0323	200	0.5972	-	-	-	-	-
0.0485	300	0.6618	-	-	-	-	-
0.0646	400	0.6002	-	-	-	-	-
0.0808	500	0.6829	0.6560	0.6625	0.3631	0.7049	0.5768
0.0970	600	0.6771	-	-	-	-	-
0.1131	700	0.698	-	-	-	-	-
0.1293	800	0.6797	-	-	-	-	-
0.1454	900	0.7305	-	-	-	-	-
0.1616	1000	0.6967	0.7134	0.6809	0.3479	0.7080	0.5790
0.1778	1100	0.7317	-	-	-	-	-
0.1939	1200	0.6883	-	-	-	-	-
0.2101	1300	0.7193	-	-	-	-	-
0.2262	1400	0.6907	-	-	-	-	-
0.2424	1500	0.7232	0.6975	0.6675	0.3678	0.6848	0.5734
0.2586	1600	0.7119	-	-	-	-	-
0.2747	1700	0.6636	-	-	-	-	-
0.2909	1800	0.7288	-	-	-	-	-
0.3070	1900	0.7088	-	-	-	-	-
0.3232	2000	0.6427	0.6781	0.7055	0.3648	0.6932	0.5878
0.3394	2100	0.7419	-	-	-	-	-
0.3555	2200	0.6716	-	-	-	-	-
0.3717	2300	0.6726	-	-	-	-	-
0.3878	2400	0.6356	-	-	-	-	-
0.4040	2500	0.6827	0.6649	0.6845	0.3515	0.6935	0.5765
0.4202	2600	0.6984	-	-	-	-	-
0.4363	2700	0.6382	-	-	-	-	-
0.4525	2800	0.7045	-	-	-	-	-
0.4686	2900	0.6559	-	-	-	-	-
0.4848	3000	0.6053	0.6348	0.6839	0.3533	0.7043	0.5805
0.5010	3100	0.6589	-	-	-	-	-
0.5171	3200	0.6326	-	-	-	-	-
0.5333	3300	0.6237	-	-	-	-	-
0.5495	3400	0.6429	-	-	-	-	-
0.5656	3500	0.675	0.6037	0.7066	0.3561	0.6698	0.5775
0.5818	3600	0.5958	-	-	-	-	-
0.5979	3700	0.6323	-	-	-	-	-
0.6141	3800	0.6252	-	-	-	-	-
0.6303	3900	0.5801	-	-	-	-	-
0.6464	4000	0.6231	0.5921	0.6971	0.3626	0.6850	0.5815
0.6626	4100	0.6171	-	-	-	-	-
0.6787	4200	0.6024	-	-	-	-	-
0.6949	4300	0.6149	-	-	-	-	-
0.7111	4400	0.591	-	-	-	-	-
0.7272	4500	0.6045	0.5972	0.7017	0.3529	0.6862	0.5803
0.7434	4600	0.608	-	-	-	-	-
0.7595	4700	0.5621	-	-	-	-	-
0.7757	4800	0.5807	-	-	-	-	-
0.7919	4900	0.5568	-	-	-	-	-
0.8080	5000	0.5669	0.5739	0.6878	0.3537	0.6808	0.5741
0.8242	5100	0.6046	-	-	-	-	-
0.8403	5200	0.5583	-	-	-	-	-
0.8565	5300	0.573	-	-	-	-	-
0.8727	5400	0.5758	-	-	-	-	-
0.8888	5500	0.5538	0.5792	0.6892	0.3523	0.6869	0.5761
0.9050	5600	0.5776	-	-	-	-	-
0.9211	5700	0.5591	-	-	-	-	-
0.9373	5800	0.5959	-	-	-	-	-
0.9535	5900	0.5783	-	-	-	-	-
0.9696	6000	0.5689	0.5595	0.6852	0.3521	0.6902	0.5758
0.9858	6100	0.6144	-	-	-	-	-
-1	-1	-	-	0.6926	0.3524	0.6905	0.5785

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Energy Consumed: 0.191 kWh
Carbon Emitted: 0.074 kg of CO2
Hours Used: 0.529 hours

Training Hardware

On Cloud: No
GPU Model: 1 x NVIDIA GeForce RTX 3090
CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM Size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 4.2.0.dev0
Transformers: 4.52.4
PyTorch: 2.7.1+cu126
Accelerate: 1.5.1
Datasets: 2.21.0
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}
