loic-dagnas-sinequa committed
Commit dc93f12 · verified · 1 Parent(s): 95cb8fb

Update README.md

Files changed (1)
  1. README.md +124 -125
README.md

---
pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
language:
- de
- en
- es
- fr
- it
- nl
- ja
- pt
- zh
---

# Model Card for `vectorizer.raspberry`

This model is a vectorizer developed by Sinequa. It produces an embedding vector given a passage or a query. The passage vectors are stored in our vector index, and the query vector is used at query time to look up relevant passages in the index.

Model name: `vectorizer.raspberry`
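
For illustration, the sketch below reproduces this flow outside of Sinequa, using the cited base checkpoint. The mean pooling and the untrained 384-to-256 projection are stand-ins for the production model's internals (assumptions, not the shipped weights); inside Sinequa, the vectorizer is invoked through the platform rather than called directly.

```python
# Hedged sketch of the embedding flow: encode text, pool token states,
# project to 256 dimensions, and score query-passage similarity.
# The pooling choice and projection weights are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)
projection = torch.nn.Linear(384, 256)  # stand-in for the dense reduction layer

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (B, T, 384)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
    return torch.nn.functional.normalize(projection(pooled), dim=-1)

query_vec = embed(["how do vector indexes work?"])
passage_vecs = embed(["Vector indexes store passage embeddings for lookup.",
                      "Unrelated passage."])
print(query_vec @ passage_vecs.T)  # cosine similarities on normalized vectors
```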

## Supported Languages

The model was trained and tested in the following languages:

- English
- French
- German
- Spanish
- Italian
- Dutch
- Japanese
- Portuguese
- Chinese (simplified)

Besides these languages, basic support can be expected for the 91 additional languages that were used during the pretraining of the base model (see Appendix A of the XLM-R paper).

## Scores

| Metric                 | Value |
|:-----------------------|------:|
| Relevance (Recall@100) | 0.613 |

Note that the relevance score is computed as an average over 14 retrieval datasets (see [details below](#evaluation-metrics)).

## Inference Times

| GPU        | Quantization type | Batch size 1 | Batch size 32 |
|:-----------|:------------------|-------------:|--------------:|
| NVIDIA A10 | FP16              | 1 ms         | 5 ms          |
| NVIDIA A10 | FP32              | 2 ms         | 18 ms         |
| NVIDIA T4  | FP16              | 1 ms         | 12 ms         |
| NVIDIA T4  | FP32              | 3 ms         | 52 ms         |
| NVIDIA L4  | FP16              | 2 ms         | 5 ms          |
| NVIDIA L4  | FP32              | 4 ms         | 24 ms         |

## GPU Memory Usage

| Quantization type | Memory   |
|:------------------|---------:|
| FP16              | 550 MiB  |
| FP32              | 1050 MiB |

Note that GPU memory usage only includes how much GPU memory the actual model consumes on an NVIDIA T4 GPU with a batch size of 32. It does not include the fixed amount of memory that is consumed by the ONNX Runtime upon initialization, which can be around 0.5 to 1 GiB depending on the GPU used.
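
Since the model is executed with ONNX Runtime, figures like the ones above can be approximated with a simple timing loop. In the sketch below, the file name `vectorizer.onnx`, the input names, and the sequence length are all hypothetical stand-ins for the artifacts Sinequa ships.

```python
# Hypothetical latency probe for an exported ONNX text encoder on GPU.
# Model file and input names are assumptions, not Sinequa's actual artifacts.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vectorizer.onnx",
                               providers=["CUDAExecutionProvider"])

batch_size, seq_len = 32, 128
inputs = {
    "input_ids": np.ones((batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
}

session.run(None, inputs)  # warm-up run, triggers lazy CUDA allocations
start = time.perf_counter()
for _ in range(100):
    session.run(None, inputs)
print(f"mean latency: {(time.perf_counter() - start) / 100 * 1000:.1f} ms")
```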

## Requirements

- Minimum Sinequa version: 11.10.0
- [CUDA compute capability](https://developer.nvidia.com/cuda-gpus): above 5.0 (above 6.0 for FP16 use); a quick check is sketched below
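
One way to verify that a GPU meets these thresholds (assuming PyTorch is installed) is:

```python
# Compare the device's CUDA compute capability against the thresholds above.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("meets FP32 requirement (above 5.0):", (major, minor) > (5, 0))
print("meets FP16 requirement (above 6.0):", (major, minor) > (6, 0))
```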

## Model Details

### Overview

- Number of parameters: 107 million
- Base language model: [mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) ([Paper](https://arxiv.org/abs/2012.15828), [GitHub](https://github.com/microsoft/unilm/tree/master/minilm))
- Insensitive to casing and accents
- Output dimensions: 256 (reduced with an additional dense layer)
- Training procedure: query-passage-negative triplets for datasets that have mined hard-negative data, query-passage pairs for the rest. The number of negatives is augmented with an in-batch negatives strategy (see the sketch after this list)
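
For intuition, the in-batch negatives strategy treats the other passages in a batch as negatives for each query, multiplying the number of negatives at no extra encoding cost. A minimal sketch of such a loss follows (illustrative only, not Sinequa's actual training code; the temperature value is an assumption):

```python
# In-batch negatives contrastive loss: row i of each matrix is a positive
# query-passage pair; every other passage in the batch acts as a negative.
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb: torch.Tensor,
                            passage_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    # (B, B) similarity matrix between all queries and all passages
    scores = query_emb @ passage_emb.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)  # positives on the diagonal
    return F.cross_entropy(scores, labels)
```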

### Training Data

The model has been trained using all datasets cited in the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model card. In addition, it has been trained on the datasets cited in [this paper](https://arxiv.org/pdf/2108.13897.pdf), covering the nine aforementioned languages.

### Evaluation Metrics

To determine the relevance score, we averaged the results obtained when evaluating on the datasets of the [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in English.
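
Recall@100 measures the fraction of each query's relevant documents that appear in the top 100 retrieved results, averaged over queries. A minimal sketch of the computation (the reported numbers come from the BEIR tooling, not this helper):

```python
# Recall@k averaged over queries; ranked_ids[i] lists retrieved document ids
# for query i in rank order, relevant_ids[i] holds its relevant document ids.
def recall_at_k(ranked_ids: list[list[str]],
                relevant_ids: list[set[str]],
                k: int = 100) -> float:
    recalls = [len(set(ranked[:k]) & relevant) / len(relevant)
               for ranked, relevant in zip(ranked_ids, relevant_ids)
               if relevant]
    return sum(recalls) / len(recalls)
```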

| Dataset           | Recall@100 |
|:------------------|-----------:|
| Average           |      0.613 |
|                   |            |
| Arguana           |      0.957 |
| CLIMATE-FEVER     |      0.468 |
| DBPedia Entity    |      0.377 |
| FEVER             |      0.820 |
| FiQA-2018         |      0.639 |
| HotpotQA          |      0.560 |
| MS MARCO          |      0.845 |
| NFCorpus          |      0.287 |
| NQ                |      0.756 |
| Quora             |      0.992 |
| SCIDOCS           |      0.456 |
| SciFact           |      0.906 |
| TREC-COVID        |      0.100 |
| Webis-Touche-2020 |      0.413 |

We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report metrics for the languages it includes.

| Language             | Recall@100 |
|:---------------------|-----------:|
| French               |      0.650 |
| German               |      0.528 |
| Spanish              |      0.602 |
| Japanese             |      0.614 |
| Chinese (simplified) |      0.680 |