---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:96
- loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-small-en-v1.5
widget:
- source_sentence: What are the de facto required fields in a SAM/BAM read group?
  sentences:
  - "Question: Several gene set enrichment methods are available, the most famous/popular\
    \ is the Broad Institute tool. Many other tools are available (See for example\
    \ the biocView of GSE which list 82 different packages). There are several parameters\
    \ in consideration :\n\nthe statistic used to order the genes, \nif it competitive\
    \ or self-contained,\nif it is supervised or not,\nand how is the enrichment score\
    \ calculated.\n\nI am using the fgsea - Fast Gene Set Enrichment Analysis package\
    \ to calculate the enrichment scores and someone told me that the numbers are\
    \ different from the ones on the Broad Institute despite all the other parameters\
    \ being equivalent.\nAre these two methods (fgsea and Broad Institute GSEA) equivalent\
    \ to calculate the enrichment score?\nI looked to the algorithms of both papers,\
    \ and they seem fairly similar, but I don't know if in real datasets they are\
    \ equivalent or not.\nIs there any article reviewing and comparing how does the\
    \ enrichment score method affect to the result?\n\nAnswer: According to the FGSEA\
    \ preprint:\n\nWe ran reference GSEA with default parameters. The permutation\
    \ number\n  was set to 1000, which means that for each input gene set 1000\n \
    \ independent samples were generated. The run took 100 seconds and\n  resulted\
    \ in 79 gene sets with GSEA-adjusted FDR q-value of less than\n  10−2. All significant\
    \ gene sets were in a positive mode. First, to get\n  a similar nominal p-values\
    \ accuracy we ran FGSEA algorithm on 1000\n  permutations. This took 2 seconds,\
    \ but resulted in no significant hits\n  due after multiple testing correction\
    \ (with FRD ≤ 1%).\n\nThus, FGSEA and GSEA are not identical.\nAnd again in the\
    \ conclusion:\n\nConsequently, gene sets can be ranked more precisely in the results\n\
    \  and, which is even more important, standard multiple testing\n  correction\
    \ methods can be applied instead of approximate ones as in\n  [GSEA].\n\nThe author\
    \ argues that FGSEA is more accurate, so it can't be equivalent.\nIf you are interested\
    \ specifically in the enrichment score, that was addressed by the author in the\
    \ preprint comments:\n\nValues of enrichment scores and normalized enrichment\
    \ scores are the\n  same for both broad version and fgsea.\n\nSo that part seems\
    \ to be the same."
  - 'Question: I am running samtools mpileup (v1.4) on a bam file with very choppy
    coverage (ChIP-seq style data). I want to get a first-pass list of positions with
    SNVs and their frequency as reported by the read counts, but no matter what I
    do, I keep getting all SNVs filtered out as not passing QC.

    What''s the magic parameter set for an initial list of SNVs and frequencies?

    EDIT: this is a question I posted on "the other" website, but didn''t get a reply
    there.


    Answer: I used this in the past for ChIP-seq data and it generated SNVs:

    samtools mpileup \

    --uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \

    --fasta-ref ref.fa file.bam \

    | bcftools call --consensus-caller \

    > out.vcf


    This was samtools 1.3 in case that makes a difference.'
  - "Question: The SAM specification indicates that each read group must have a unique\
    \ ID field, but does not mark any other field as required. \nI have also discovered\
    \ that htsjdk throws exceptions if the sample (SM) field is empty, though there\
    \ is no indication in the specification that this is required. \nAre there other\
    \ read group fields that I should expect to be required by common tools? \n\n\
    Answer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM spec (see\
    \ the .pages file; you need a mac to open it). When transitioned to Latex, this\
    \ requirement was mysteriously dropped. Picard is conforming to the initial spec.\
    \ Anyway, the sample tag is important to quite a few tools. I would encourage\
    \ you to add it."
- source_sentence: Is the optional SAM NM field strictly computable from the MD and
    CIGAR?
  sentences:
  - "Question: I'm looking for tools to check the quality of a VCF I have of a human\
    \ genome. I would like to check the VCF against publicly known variants across\
    \ other human genomes, e.g. how many SNPs are already in public databases, whether\
    \ insertions/deletions are at known positions, insertion/deletion length distribution,\
    \ other SNVs/SVs, etc.? I suspect that there are resources from previous projects\
    \ to check for known SNPs and InDels by human subpopulations.\nWhat resources\
    \ exist for this, and how do I do it? \n\nAnswer: To achieve (at least some of)\
    \ your goals, I would recommend the Variant Effect Predictor (VEP). It is a flexible\
    \ tool that provides several types of annotations on an input .vcf file.  I agree\
    \ that ExAC is the de facto gold standard catalog for human genetic variation\
    \ in coding regions.  To see the frequency distribution of variants by global\
    \ subpopulation make sure \"ExAC allele frequencies\" is checked in addition to\
    \ the 1000 genomes. \nOutput in the web-browser:\n\nIf you download the annotated\
    \ .vcf, frequencies will be in the INFO field:\n##INFO=<ID=CSQ,Number=.,Type=String,Description=\"\
    Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|ExAC_AF|ExAC_Adj_AF|ExAC_AFR_AF|ExAC_AMR_AF|ExAC_EAS_AF|ExAC_FIN_AF|ExAC_NFE_AF|ExAC_OTH_AF|ExAC_SAS_AF|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE\n\
    \nThe previously mentioned Annovar can also annotate with ExAC allele frequencies.\
    \  Finally, should mention the newest whole-genome resource, gnomAD."
  - 'Question: I produced a bam file by aligning reads to a small set of synthetic
    sequences using bwa-mem.

    I am heavily filtering reads that are not paired and of a certain orientation.

    Applying the filtering, I get a few thousands of reads:

    samtools view -h $myfilebam | \

    samtools view -h -F4 - | \

    samtools view -h -F8 - | \

    samtools view -h -F256 - | \

    samtools view -h -F512 - | \

    samtools view -h -F1024 - | \

    samtools view -h -F2048 - | \

    samtools view -h -f16 - | \

    samtools view -h -f32 -  | wc -l


    Gives me 89502 reads.

    If I then pipe this into samtools mpileup, I get no results:

    samtools view -h $myfilebam | \

    samtools view -h -F4 - | \

    samtools view -h -F8 - | \

    samtools view -h -F256 - | \

    samtools view -h -F512 - | \

    samtools view -h -F1024 - | \

    samtools view -h -F2048 - | \

    samtools view -h -f16 - | \

    samtools view -h -f32 -  | \

    samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l


    Returns 0.

    I tried different combinations of filtering, and when I do both -f 16 and -f 32
    returns empty, but if I do either of those, then it works:

    samtools view -h $myfilebam | \

    samtools view -h -F4 - | \

    samtools view -h -F8 - | \

    samtools view -h -F256 - | \

    samtools view -h -F512 - | \

    samtools view -h -F1024 - | \

    samtools view -h -F2048 - | \

    samtools view -h -f16 - | \

    samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l


    Returns 1056.

    Any ideas why? My thinking was that it would work with --excl-flags 0.

    EDIT: substituting mpileup for depth does work, and prints out each position and
    the depth as expected.

    EDIT2: adding -q 0 to mpileup gives the same empty result.

    Thanks in advance


    Answer: By using -h in the samtools view command, you''re including all the header
    lines in your word count. If you happen to have about 89500 reference sequences,
    then the lengths of those would all appear in the header and inflate the -h word
    count, but not the mpileup count. Try piping it through an additional samtools
    view (i.e. without -h) and see if the counts change:

    ...

    samtools view -h -f32 -  | \

    samtools view | wc -l


    Also, samtools mpileup by default only considers high-quality bases and concordant
    reads. Try adding a -A to your mpileup line (which stops anomalous read pairs
    from being discarded):

    ...

    samtools mpileup -A -Q0 -B -d 999999 - | wc -l


    Whether or not this is actually a good idea will be dependent on what you want
    to get out of the analysis, and what the downstream programs / analyses are expecting.'
  - "Question: From SAM Optional Fields Specification the NM field is \n\nEdit distance\
    \ to the reference, including ambiguous bases but excluding clipping\n\nAssuming\
    \ both the MD and CIGAR are present, is the edit distance simply the number of\
    \ characters [A-Z] appearing in the MD field plus the number of bases inserted\
    \ (xI, if any) from the CIGAR string? Are there any other complications? \n\n\
    Answer: Assuming both the MD and CIGAR are present and correct, then yes, you\
    \ can parse both to get the edit distance (NM auxiliary tag). One big caveat to\
    \ this is that there's a reason that the samtools calmd command exists, since\
    \ it's historically been the case that not all aligners have output correct MD\
    \ strings. It's rare for the CIGAR string to be wrong and that'd be more of a\
    \ catastrophic error on the part of an aligner. For what it's worth, if the NM\
    \ auxiliary is absent on a given alignment but present on others produced by the\
    \ same aligner then it's fair to assume NM:i:0 for a given alignment by default\
    \ (many aligners only produce NM:i:XXX if the edit distance is at least 1)."
- source_sentence: How to read structural variant VCF?
  sentences:
  - "Question: I am calling SNPs from WGS samples produced at my lab. I am currently\
    \ using bwa-mem for mapping Illumina reads as it is recommended by GATK best practice.\
    \ However, bwa is a bit slow. I heard from my colleague that SNAP is much faster\
    \ than bwa. I tried it on a small set of reads and it is indeed faster. However,\
    \ I am not sure how it works with downstream SNP callers, so here are my questions:\
    \ have you used SNAP for short-read mapping? What is your experience? Does SNAP\
    \ work well with SNP callers like GATK and freebayes? Thanks!\n\nAnswer: GATK\
    \ best practices are explicably meant to consume BWA MEM generated BAM.  Whilst\
    \ SNAP may be faster, the Broad will not have tested it for compatibility with\
    \ GATK as such you can't guaranty using it won't have unexpected consequences.\
    \  \nAs such you'd be better off using BWA MEM because I assume accurately called\
    \ variation is always better than fast and incorrectly called variation.  The\
    \ main issue you'll have is ensuring shorter split hits and mapping quality are\
    \ reported in the same way as bwa MEM -M which GATK/Picard is expecting.  Ultimately\
    \ however you'd be better off posting this question on the GATK forum. \nIt's\
    \ also worth noting that the soon to be released GATK 4 will utilise bwaspark\
    \ which can distribute it's alignment processes across Apache Spark for increase\
    \ performance.  Consequently I can't see SNAP being adopted anytime soon."
  - 'Question: I have a computer engineering background, not biology.

    I started working on a bioinformatics project recently, which involves de-novo
    assembly. I came to know the terms Transcriptome and Genome, but I cannot identify
    the difference between these two.

    I know a transcriptome is the set of all messenger RNA molecules in a cell, but
    am not sure how this is different from a genome.


    Answer: In brief, the  “genome”  is the collection of all  DNA  present  in  the  nucleus  and  the  mitochondria
    of a  somatic  cell. The initial product of genome expression is the “transcriptome”,
    a collection of RNA molecules derived from those genes.'
  - "Question: The IGSR has a sample for encoding structural variants in the VCF 4.0\
    \ format.\nAn example from the site (the first record):\n#CHROM  POS   ID  REF\
    \ ALT   QUAL  FILTER  INFO  FORMAT  NA00001\n1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA\
    \  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66\
    \ GT:GQ 1/1:13.9\n\nHow to read it? From what I can see:\n\nThis is a deletion\
    \ (SVTYPE=DEL)\nThe end position of the variant comes before the starting position\
    \ (reverse strand?)\nThe reference starts from 2827693 to 2827680 (13 bases on\
    \ the reverse strand)\nThe difference between reference and alternative is 66\
    \ bases (SVLEN=-66)\n\nThis doesn't sound right to me. For instance, I don't see\
    \ where exactly the deletion starts. The SVLEN field says 66 bases deleted, but\
    \ where? 2827693 to 2827680 only has 13 bases between.\nQ: How to read the deletion\
    \ correctly from this structural VCF record? Where is the missing 66-13=53 bases?\n\
    \nAnswer: I just received a reply from 1000Genomes regarding this. I'll post it\
    \ in its entirety below:\n\nLooking at the example you mention, I find it difficult\
    \ to come up with an\n  interpretation of the information whereby the stated end\
    \ seems to be correct,\n  so believe that this may indeed be an error.\nSince\
    \ the v4.0 was created, however, new versions of VCF have been introduced,\n \
    \ improving and correcting the specification. The current version is v4.3\n  (http://samtools.github.io/hts-specs/).\
    \ I believe the first record shown on\n  page 11 provides an accurate example\
    \ of this type of deletion.\nI will update the web page to include this information.\n\
    \nSo we can take this as official confirmation that we were all correct in suspecting\
    \ the example was just wrong."
- source_sentence: Publicly available genome sequence database for viruses?
  sentences:
  - "Question: This question is based on a question on BioStars  posted >2 years ago\
    \ by user jack.\nIt describes a very frequent problem of generating GO annotations\
    \ for non-model organisms. While it is based on some specific format and single\
    \ application (Ontologizer), it would be useful to have a general description\
    \ of the pathway to getting to a GAF file. \nNote, that the input format is lacking\
    \ a bit of essential information, like how it was obtained. Therefore, it is har\
    \ to assign evidence code. Therefore, lets assume that the assignments of GO terms\
    \ were done automagically. \n\nI want to do the Gene enrichment using Ontologizer\
    \ without a\n  predefined association file(it's not model organism). \nI have\
    \ parsed a file with two columns for that organism like this : \ngeneA  GO:0006950,GO:0005737\n\
    geneB  GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031\n\
    geneC  GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863\n\
    geneD  GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528\n\
    \nI have downloaded the .ob file from Gene ontology file which contain\n  this\
    \ information (from here) : \n!\n! GO IDs (primary only) and name text strings\n\
    ! GO:0000000 [tab] text string [tab] F|P|C\n! where F = molecular function, P\
    \ = biological process, C = cellular component\n!\nGO:0000001  mitochondrion inheritance\
    \   P\nGO:0000002  mitochondrial genome maintenance    P\nGO:0000003  reproduction\
    \    P\nGO:0000005  ribosomal chaperone activity    F\nGO:0000006  high affinity\
    \ zinc uptake transmembrane transporter activity    F\nGO:0000007  low-affinity\
    \ zinc ion transmembrane transporter activity    F\nGO:0000008  thioredoxin F\n\
    GO:0000009  alpha-1,6-mannosyltransferase activity  F\nGO:0000010  trans-hexaprenyltranstransferase\
    \ activity   F\nGO:0000011  vacuole inheritance P\n\nWhat I need as output is\
    \ .gaf file in the following format (in the\n  format of the files here):\n!gaf-version:\
    \ 2.0\n\n!Project_name: Leishmania major GeneDB\n\n!URL: http://www.genedb.org/leish\n\
    \n!Contact Email: mb4@sanger.ac.uk\n\n GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770\
    \        GO:0003723    PMID:22396527    ISO    GeneDB:Tb927.10.10130    F    mitochondrial\
    \ RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515\
    \    20120910    GeneDB_Lmajor       \n GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770\
    \        GO:0044429    PMID:20660476    ISS        C    mitochondrial RNA binding\
    \ complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515    20100803\
    \ GeneDB_Lmajor             GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770    \
    \    GO:0016554    PMID:22396527    ISO    GeneDB:Tb927.10.10130    P    mitochondrial\
    \ RNA binding complex 1 subunit, putative    LmjF36.4770    gene   taxon:347515\
    \    20120910    GeneDB_Lmajor       \n GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770\
    \        GO:0048255    PMID:22396527    ISO    GeneDB:Tb927.10.10130    P    mitochondrial\
    \ RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515\
    \    20120910    GeneDB_Lmajor  \n\nHow to create your own GO association file\
    \ (gaf)?\n\nAnswer: Here's a Perl script that can do this:\n#!/usr/bin/env perl\
    \ \nuse strict;\nuse warnings;\n\n## Change this to whatever taxon you are working\
    \ with\nmy $taxon = 'taxon:1000';\nchomp(my $date = `date +%Y%M%d`);\n\nmy (%aspect,\
    \ %gos);\n## Read the GO.terms_and_ids file to get the aspect (sub ontology)\n\
    ## of each GO term. \nopen(my $fh, $ARGV[0]) or die \"Need a GO.terms_and_ids\
    \ file as 1st arg: $!\\n\";\nwhile (<$fh>) {\n    next if /^!/;\n    chomp;\n\
    \    my @fields = split(/\\t/);\n    ## $aspect{GO:0000001} = 'P'\n    $aspect{$fields[0]}\
    \ = $fields[2];\n}\nclose($fh);\n\n## Read the list of gene annotations\nopen($fh,\
    \ $ARGV[1]) or die \"Need a list of gene annotattions as 2nd arg: $!\\n\";\nwhile\
    \ (<$fh>) {\n    chomp;\n    my ($gene, @terms) = split(/[\\s,]+/);\n    ## $gos{geneA}\
    \ = (go1, go2 ... goN)\n    $gos{$gene} = [ @terms ];\n}\nclose($fh);\n\nforeach\
    \ my $gene (keys(%gos)) {\n    foreach my $term (@{$gos{$gene}}) {\n        ##\
    \ Warn and skip if there is no aspect for this term\n        if (!$aspect{$term})\
    \ {\n            print STDERR \"Unknown GO term ($term) for gene $gene\\n\";\n\
    \            next;\n        }\n        ## Build a pseudo GAF line \n        my\
    \ @out = ('DB', $gene, $gene, ' ', $term, 'PMID:foo', 'TAS', ' ', $aspect{$term},\n\
    \                             $gene, ' ', 'protein', $taxon, $date, 'DB', ' ',\
    \ ' ');\n        print join(\"\\t\", @out). \"\\n\";\n    }\n}\n\nMake it executable\
    \ and run it with the GO.terms_and_ids file as the 1st argument and the list of\
    \ gene annotations as the second. Using the current GO.terms_and_ids and the example\
    \ annotations in the question, I get:\n$ foo.pl GO.terms_and_ids file.gos \nDB\
    \  geneD   geneD       GO:0005634  PMID:foo    TAS     C   geneD       protein\
    \ taxon:1000  20170308    DB       \nDB  geneD   geneD       GO:0003677  PMID:foo\
    \    TAS     F   geneD       protein taxon:1000  20170308    DB       \nDB  geneD\
    \   geneD       GO:0030154  PMID:foo    TAS     P   geneD       protein taxon:1000\
    \  20170308    DB       \nUnknown GO term (GO:0006350) for gene geneD\nDB  geneD\
    \   geneD       GO:0006355  PMID:foo    TAS     P   geneD       protein taxon:1000\
    \  20170308    DB       \nDB  geneD   geneD       GO:0007275  PMID:foo    TAS\
    \     P   geneD       protein taxon:1000  20170308    DB       \nDB  geneD   geneD\
    \       GO:0030528  PMID:foo    TAS     F   geneD       protein taxon:1000  20170308\
    \    DB       \nDB  geneB   geneB       GO:0016020  PMID:foo    TAS     C   geneB\
    \       protein taxon:1000  20170308    DB       \nDB  geneB   geneB       GO:0005524\
    \  PMID:foo    TAS     F   geneB       protein taxon:1000  20170308    DB    \
    \   \nDB  geneB   geneB       GO:0006468  PMID:foo    TAS     P   geneB      \
    \ protein taxon:1000  20170308    DB       \nDB  geneB   geneB       GO:0005737\
    \  PMID:foo    TAS     C   geneB       protein taxon:1000  20170308    DB    \
    \   \nDB  geneB   geneB       GO:0004674  PMID:foo    TAS     F   geneB      \
    \ protein taxon:1000  20170308    DB       \nDB  geneB   geneB       GO:0006914\
    \  PMID:foo    TAS     P   geneB       protein taxon:1000  20170308    DB    \
    \   \nDB  geneB   geneB       GO:0016021  PMID:foo    TAS     C   geneB      \
    \ protein taxon:1000  20170308    DB       \nDB  geneB   geneB       GO:0015031\
    \  PMID:foo    TAS     P   geneB       protein taxon:1000  20170308    DB    \
    \   \nDB  geneA   geneA       GO:0006950  PMID:foo    TAS     P   geneA      \
    \ protein taxon:1000  20170308    DB       \nDB  geneA   geneA       GO:0005737\
    \  PMID:foo    TAS     C   geneA       protein taxon:1000  20170308    DB    \
    \   \nDB  geneC   geneC       GO:0003779  PMID:foo    TAS     F   geneC      \
    \ protein taxon:1000  20170308    DB       \nDB  geneC   geneC       GO:0006941\
    \  PMID:foo    TAS     P   geneC       protein taxon:1000  20170308    DB    \
    \   \nDB  geneC   geneC       GO:0005524  PMID:foo    TAS     F   geneC      \
    \ protein taxon:1000  20170308    DB       \nDB  geneC   geneC       GO:0003774\
    \  PMID:foo    TAS     F   geneC       protein taxon:1000  20170308    DB    \
    \   \nDB  geneC   geneC       GO:0005516  PMID:foo    TAS     F   geneC      \
    \ protein taxon:1000  20170308    DB       \nDB  geneC   geneC       GO:0005737\
    \  PMID:foo    TAS     C   geneC       protein taxon:1000  20170308    DB    \
    \   \nDB  geneC   geneC       GO:0005863  PMID:foo    TAS     C   geneC      \
    \ protein taxon:1000  20170308    DB       \n\nNote that this is very much a pseudo-GAF\
    \ file since most of the fields apart from the gene name, GO term and sub-ontology\
    \ are fake. It should still work for what you need, however."
  - 'Question: As a small introductory project, I want to compare genome sequences
    of  different strains of influenza virus.

    What are the publicly available databases of influenza virus gene/genome sequences?


    Answer: There area few different influenza virus database resources:


    The Influenza Research Database (IRD) (a.k.a FluDB - based upon URL)


    A NIAID Bioinformatics Resource Center or BRC which highly curates the data brought
    in and integrates it with numerous other relevant data types


    The NCBI Influenza Virus Resource


    A sub-project of the NCBI with data curated over and above the GenBank data that
    is part of the NCBI


    The GISAID EpiFlu Database


    A database of sequences from the Global Initiative on Sharing All Influenza Data.
    Has unique data from many countries but requires user agree to a data sharing
    policy.


    The OpenFluDB


    Former GISAID database that contains some sequence data that GenBank does not
    have.


    For those who also may be interested in other virus databases, there are:


    Virus Pathogen Resource (VIPR)


    A companion portal to the IRD, which hosts curated and integrated data for most
    other NIAID A-C virus pathogens including (but not limited to) Ebola, Zika, Dengue,
    Enterovirus, and Hepatitis C


    LANL HIV database


    Los Alamos National Laboratory HIV database with HIV data and many useful tools
    for all virus bioinformatics


    PaVE: Papilloma virus genome database (from quintik comment)


    NIAID developed and maintained Papilloma virus bioinformatics portal


    Disclaimer: I used to work for the IRD / VIPR and currently work for NIAID.'
  - "Question: I have a set of genomic ranges that are potentially overlapping. I\
    \ want to count the amount of ranges at certain positions using R. \nI'm Pretty\
    \ sure there are good solutions, but I seem to be unable to find them. \nSolutions\
    \ like cut or findIntervals don't achieve what I want as they only count on one\
    \ vector or accumulate by all values <= break.\nAlso countMatches {GenomicRanges}\
    \ doesn't seem to cover it.\nProbably one could use Bedtools, but I don't want\
    \ to leave R.\nI could only come up with a hilariously slow solution\n# generate\
    \ test data\ntestdata <- data.frame(chrom = rep(seq(1,10),10),\n             \
    \          starts = abs(rnorm(100, mean = 1, sd = 1)) * 1000,\n              \
    \         ends = abs(rnorm(100, mean = 2, sd = 1)) * 2000)\n\n# make sure that\
    \ all end coordinates are bigger than start\n# this is a requirement of the original\
    \ data\ntestdata <- testdata[testdata$ends - testdata$starts > 0,]\n\n# count\
    \ overlapping ranges on certain positions\ncount.data <- lapply(unique(testdata$chrom),\
    \ function(chromosome){\n    tmp.inner <- lapply(seq(1,10000, by = 120), function(i){\n\
    \        sum(testdata$chrom == chromosome & testdata$starts <= i & testdata$ends\
    \ >= i)\n    })\n    return(unlist(tmp.inner))\n})\n\n# generate a data.frame\
    \ containing all data\ndf.count.data <- ldply(count.data, rbind)\n\n# ideally\
    \ the chromosome will be columns and not rows\nt(df.count.data)\n\nAnswer: GenomicRanges::countOverlaps\
    \ seems to be what you’re after:\nposition_range = GRanges(position$chrom, IRanges(position,\
    \ position, width = 1))\nranges_at_position = countOverlaps(position_ranges, granges)"
- source_sentence: samtools depth print out all positions
  sentences:
  - "Question: I have around ~3,000 short sequences of approximately ~10Kb long. What\
    \ are the best ways to find the motifs among all of these sequences? Is there\
    \ a certain software/method recommended?\nThere are several ways to do this. My\
    \ goal would be to:\n(1) Check for motifs repeated within individual sequences\n\
    (2) Check for motifs shared among all sequences\n(3) Check for the presence of\
    \ \"expected\" or known motifs\nWith respect to #3, I'm also curious if I find\
    \ e.g. trinucleotide sequences, how does one check the context around these regions?\n\
    Thank you for the recommendations/help!\n\nAnswer: For (3), this page has a lot\
    \ of links to pattern/motif finding tools. Following through the YMF link on that\
    \ page, I came across the University of Washington Motif Discovery section. Of\
    \ these projection seemed to be the only downloadable tool. I find it interesting\
    \ how old all these tools are; maybe the introduction of microarrays and NGS has\
    \ made them all redundant.\nYour sub-problem (2) seems similar to the problem\
    \ I'm having with Nippostrongylus brasiliensis genome sequences, where I'd like\
    \ to find regions of very high homology (length 500bp to 20kb or more, 95-99%\
    \ similar) that are repeated throughout the genome. These sequences are killing\
    \ the assembly.\nThe main way I can find these regions is by looking at a coverage\
    \ plot of long nanopore reads mapped to the assembled genome (using GraphMap or\
    \ BWA). Any regions with substantially higher than median coverage are likely\
    \ to be shared repeats.\nI've played around in the past with chopping up the reads\
    \ to smaller sizes, which works better for hitting smaller repeated regions that\
    \ are such a small proportion of most reads that they are never mapped to all\
    \ the repeated locations. I wrote my own script a while back to chop up reads\
    \ (for a different purpose), which produces a FASTA/FASTQ file where all reads\
    \ are exactly the same length. For some unknown reason I took the time to document\
    \ that script \"properly\" using POD, so here's a short summary:\n\nConverts all\
    \ sequences in the input FASTA file to the same length.\n     Sequences shorter\
    \ than the target length are dropped, and sequences longer\n     than the target\
    \ length are split into overlapping subsequences covering\n     the entire range.\
    \ This prepares the sequences for use in an\n     overlap-consensus assembler\
    \ requiring constant-length sequences (such as\n     edena).\n\nAnd here's the\
    \ syntax:\n$ ./normalise_seqlengths.pl -h\nUsage:\n    ./normalise_seqlengths.pl\
    \ <reads.fa> [options]\n\n  Options:\n    -help\n      Only display this help\
    \ message\n\n    -fraglength\n      Target fragment length (in base-pairs, default\
    \ 2000)\n\n    -overlap\n      Minimum overlap length (in base-pairs, default\
    \ 200)\n\n    -short\n      Keep short sequences (shorter than fraglength)"
  - 'Question: Without going into too much background, I just joined up with a lab
    as a bioinformatics intern while I''m completing my masters degree in the field.
    The lab has data from an RNA-seq they outsourced, but the only problem is that
    the only data they have is preprocessed from the company that did the sequencing:
    filtering the reads, aligning them, and putting the aligned reads through RSEM.
    I currently have output from RSEM for each of the four samples consisting of:
    gene id, transcript id(s), length, expected count, and FPKM. I am attempting to
    get the FASTQ files from the sequencing, but for now, this is what I have, and
    I''m trying to get something out of it if possible.

    I found this article that talks about how expected read counts can be better than
    raw read counts when analyzing differential expression using EBSeq; it''s just
    one guy''s opinion, and it''s from 2014, so it may be wrong or outdated, but I
    thought I''d give it a try since I have the expected counts.

    However, I have just a couple of questions about running EBSeq that I can''t find
    the answers to:

    1: In the output RSEM files I have, not all genes are represented in each, about
    80% of them are, but for the ones that aren''t, should I remove them before analysis
    with EBSeq? It runs when I do, but I''m not sure if it is correct.

    2: How do I know which normalization factor to use when running EBSeq? This is
    more of a conceptual question rather than a technical question.

    Thanks!


    Answer: Yes, that blog post does represent just one guy''s opinion (hi!) and it
    does date all the way back to 2014, which is, like, decades in genomics years.
    :-) By the way, there is quite a bit of literature discussing the improvements
    that expected read counts derived from an Expectation Maximization algorithm provide
    over raw read counts. I''d suggest reading the RSEM papers for a start[1][2].

    But your main question is about the mechanics of running RSEM and EBSeq. First,
    RSEM was written explicitly to be compatible with EBSeq, so I''d be very surprised
    if it does not work correctly out-of-the-box. Second, EBSeq''s MedianNorm function
    worked very well in my experience for normalizing the library counts. Along those
    lines, the blog you mentioned above has another post that you may find useful.

    But all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools
    provide orders-of-magnitude improvements in runtime over the older alignment-based
    alternatives, with comparable accuracy. Sailfish was the first in a growing list
    of tools that now includes Salmon and Kallisto. When starting a new analysis from
    scratch (i.e. if you ever get the original FASTQ files), there''s really no good
    reason not to estimate expression using these much faster tools, followed by a
    differential expression analysis with DESeq2, edgeR, or sleuth.


    1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression
    estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.

    2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data
    with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.'
  - 'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed
    file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out
    the coverage at every position:

    cat GRCh38.karyo.bed | awk ''{print $3}'' | datamash sum 1

    3088286401


    I would like to know how to run samtools depth so that it produces 3,088,286,401
    entries when run against a GRCh38 bam file:

    samtools depth -b $bedfile -a $inputfile


    I tried it for a few bam files that were aligned the same way, and I get differing
    number of entries:

    3087003274

    3087005666

    3087007158

    3087009435

    3087009439

    3087009621

    3087009818

    3087010065

    3087010408

    3087010477

    3087010481

    3087012115

    3087013147

    3087013186

    3087013500

    3087149616


    Is there a special flag in samtools depth so that it reports all entries from
    the bed file?

    If samtools depth is not the best tool for this, what would be the equivalent
    with sambamba depth base?

    sambamba depth base --min-coverage=0 --regions $bedfile $inputfile


    Any other options?


    Answer: You might try using bedtools genomecov instead. If you provide the -d
    option, it reports the coverage at every position in the BAM file.

    bedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"


    You can also provide a BED file if you just want to calculate in the target  region.'
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) <!-- at revision 5c38ec7c405ec4b44b94cc5a9bb96e735b38267a -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
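
The pooling configuration above keeps only the CLS token (`pooling_mode_cls_token: True`), and the final `Normalize()` module L2-normalizes the result, so cosine similarity reduces to a dot product. For reference, here is a minimal sketch of the equivalent manual forward pass with `transformers`; it is shown against the base model id, since the finetuned repository id is a placeholder in this card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Base model id used for illustration; substitute the finetuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")

inputs = tokenizer(
    ["samtools depth print out all positions"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# (1) Pooling: keep only the CLS token (pooling_mode_cls_token=True).
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: [1, 384]
# (2) Normalize: L2-normalize so cosine similarity is a plain dot product.
embedding = torch.nn.functional.normalize(cls_embedding, p=2, dim=1)
```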

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'samtools depth print out all positions',
    'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target  region.',
    "Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
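
For semantic search, a minimal follow-on sketch (the corpus strings are illustrative, and the model id is the same placeholder as above) encodes a corpus once and ranks it against each query with the configured cosine similarity:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence_transformers_model_id")

# Illustrative corpus; in practice, your Q&A documents.
corpus = [
    "Question: How can samtools depth report coverage at every position? ...",
    "Question: Which read group fields do common tools require? ...",
]
query = "samtools depth print out all positions"

corpus_embeddings = model.encode(corpus)   # shape: [2, 384]
query_embedding = model.encode([query])    # shape: [1, 384]

# model.similarity applies the configured cosine similarity function.
scores = model.similarity(query_embedding, corpus_embeddings)  # shape: [1, 2]
best = int(scores.argmax())
print(corpus[best][:60], float(scores[0, best]))
```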

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 96 training samples
* Columns: <code>sentence_0</code> and <code>sentence_1</code>
* Approximate statistics based on the first 96 samples:
  |         | sentence_0                                                                        | sentence_1                                                                            |
  |:--------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
  | type    | string                                                                            | string                                                                                |
  | details | <ul><li>min: 6 tokens</li><li>mean: 14.93 tokens</li><li>max: 34 tokens</li></ul> | <ul><li>min: 103 tokens</li><li>mean: 397.92 tokens</li><li>max: 512 tokens</li></ul> |
* Samples:
  | sentence_0                                                                                 | sentence_1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
  |:-------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
  | <code>Using shells other than bash</code>                                                  | <code>Question: As someone who's beginning to delve into bioinformatics, I'm noticing that like biology there are industry standards here, similar to Illumina in genomics and bowtie for alignment, many people use bash as shell. <br>Is using a shell besides bash going to cause issues for me?<br><br>Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell.<br>If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell.  See Csh Programming Considered Harmful and other essays/polemics.<br>A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools.  It's not uncommon to have things like<br><br>Set some variables pointing at reference data and add the script to your PATH to run it:<br>export...</code> |
  | <code>Linear models of complex diseases</code>                                             | <code>Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice). <br>For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't)<br>My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B. <br>If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains.<br>How can such a disease be described in terms of linear models with such data and information?<br><br>Answer: There are t...</code>       |
  | <code>Detecting portions of human proteins with high degree of microbial similarity</code> | <code>Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem.<br>My goal is to take a list of human proteins, and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with list of FASTA sequences, and have an easy way to generate an output of the corresponding high similarity segments of each protein.<br>Are there existing tools or software that I should be aware of that will make my life easier?<br>Thanks in advance.<br><br>Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options:<br><br>PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu...</code>    |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```
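
MultipleNegativesRankingLoss uses the other pairs in each batch as negatives: for each anchor, its paired sentence is the positive and every other paired sentence in the batch is a negative, and the `scale` of 20.0 multiplies the cosine similarities before a cross-entropy over the batch. A self-contained sketch of that objective (assuming already-encoded, L2-normalized batches; this is not the library's internal code):

```python
import torch
import torch.nn.functional as F

def mnr_loss(anchors: torch.Tensor, positives: torch.Tensor,
             scale: float = 20.0) -> torch.Tensor:
    """In-batch-negatives ranking loss over L2-normalized embeddings."""
    # Cosine similarity of every anchor with every positive in the batch.
    sims = anchors @ positives.T * scale  # shape: [batch, batch]
    # Each anchor's true positive sits on the diagonal.
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, labels)

# Toy batch of four pre-normalized 384-d embeddings (illustrative only).
a = F.normalize(torch.randn(4, 384), dim=1)
b = F.normalize(torch.randn(4, 384), dim=1)
print(mnr_loss(a, b))
```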

### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `num_train_epochs`: 1
- `fp16`: True
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin
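
A minimal sketch of how these non-default values map onto the `SentenceTransformerTrainer` API; the dataset construction and `output_dir` below are illustrative stand-ins, since the actual 96-pair dataset is not published with this card:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Illustrative stand-in for the 96 (query, Q&A document) pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["samtools depth print out all positions"],
    "sentence_1": ["Question: ... Answer: ..."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="bge-small-finetuned",   # hypothetical output directory
    per_device_train_batch_size=32,
    num_train_epochs=1,
    fp16=True,
    # Avoids duplicate sentences within a batch serving as false negatives.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # defaults: scale=20.0, cos_sim
)
trainer.train()
```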

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin

</details>

### Framework Versions
- Python: 3.12.8
- Sentence Transformers: 3.4.1
- Transformers: 4.51.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.7.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->