|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- generated_from_trainer |
|
- dataset_size:96 |
|
- loss:MultipleNegativesRankingLoss |
|
base_model: BAAI/bge-small-en-v1.5 |
|
widget: |
|
- source_sentence: What are the de facto required fields in a SAM/BAM read group? |
|
sentences: |
|
- "Question: Several gene set enrichment methods are available, the most famous/popular\ |
|
\ is the Broad Institute tool. Many other tools are available (See for example\ |
|
\ the biocView of GSE which lists 82 different packages). There are several parameters\

\ to consider:\n\nthe statistic used to order the genes,\nwhether it is competitive\

\ or self-contained,\nwhether it is supervised or not,\nand how the enrichment score\

\ is calculated.\n\nI am using the fgsea - Fast Gene Set Enrichment Analysis package\
|
\ to calculate the enrichment scores and someone told me that the numbers are\ |
|
\ different from the ones on the Broad Institute despite all the other parameters\ |
|
\ being equivalent.\nAre these two methods (fgsea and Broad Institute GSEA) equivalent\ |
|
\ to calculate the enrichment score?\nI looked to the algorithms of both papers,\ |
|
\ and they seem fairly similar, but I don't know if in real datasets they are\ |
|
\ equivalent or not.\nIs there any article reviewing and comparing how the\

\ enrichment score method affects the result?\n\nAnswer: According to the FGSEA\
|
\ preprint:\n\nWe ran reference GSEA with default parameters. The permutation\ |
|
\ number\n was set to 1000, which means that for each input gene set 1000\n \ |
|
\ independent samples were generated. The run took 100 seconds and\n resulted\ |
|
\ in 79 gene sets with GSEA-adjusted FDR q-value of less than\n 10⁻². All significant\
|
\ gene sets were in a positive mode. First, to get\n a similar nominal p-values\ |
|
\ accuracy we ran FGSEA algorithm on 1000\n permutations. This took 2 seconds,\ |
|
\ but resulted in no significant hits\n after multiple testing correction\

\ (with FDR ≤ 1%).\n\nThus, FGSEA and GSEA are not identical.\nAnd again in the\
|
\ conclusion:\n\nConsequently, gene sets can be ranked more precisely in the results\n\ |
|
\ and, which is even more important, standard multiple testing\n correction\ |
|
\ methods can be applied instead of approximate ones as in\n [GSEA].\n\nThe author\ |
|
\ argues that FGSEA is more accurate, so it can't be equivalent.\nIf you are interested\ |
|
\ specifically in the enrichment score, that was addressed by the author in the\ |
|
\ preprint comments:\n\nValues of enrichment scores and normalized enrichment\ |
|
\ scores are the\n same for both broad version and fgsea.\n\nSo that part seems\ |
|
\ to be the same." |
|
- 'Question: I am running samtools mpileup (v1.4) on a bam file with very choppy |
|
coverage (ChIP-seq style data). I want to get a first-pass list of positions with |
|
SNVs and their frequency as reported by the read counts, but no matter what I |
|
do, I keep getting all SNVs filtered out as not passing QC. |
|
|
|
What''s the magic parameter set for an initial list of SNVs and frequencies? |
|
|
|
EDIT: this is a question I posted on "the other" website, but didn''t get a reply |
|
there. |
|
|
|
|
|
Answer: I used this in the past for ChIP-seq data and it generated SNVs: |
|
|
|
samtools mpileup \ |
|
|
|
--uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \ |
|
|
|
--fasta-ref ref.fa file.bam \ |
|
|
|
| bcftools call --consensus-caller \ |
|
|
|
> out.vcf |
|
|
|
|
|
This was samtools 1.3 in case that makes a difference.' |
|
- "Question: The SAM specification indicates that each read group must have a unique\ |
|
\ ID field, but does not mark any other field as required. \nI have also discovered\ |
|
\ that htsjdk throws exceptions if the sample (SM) field is empty, though there\ |
|
\ is no indication in the specification that this is required. \nAre there other\ |
|
\ read group fields that I should expect to be required by common tools? \n\n\ |
|
Answer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM spec (see\ |
|
\ the .pages file; you need a Mac to open it). When the spec transitioned to LaTeX, this\
|
\ requirement was mysteriously dropped. Picard is conforming to the initial spec.\ |
|
\ Anyway, the sample tag is important to quite a few tools. I would encourage\ |
|
\ you to add it." |
|
- source_sentence: Is the optional SAM NM field strictly computable from the MD and |
|
CIGAR? |
|
sentences: |
|
- "Question: I'm looking for tools to check the quality of a VCF I have of a human\ |
|
\ genome. I would like to check the VCF against publicly known variants across\ |
|
\ other human genomes, e.g. how many SNPs are already in public databases, whether\ |
|
\ insertions/deletions are at known positions, insertion/deletion length distribution,\ |
|
\ other SNVs/SVs, etc.? I suspect that there are resources from previous projects\ |
|
\ to check for known SNPs and InDels by human subpopulations.\nWhat resources\ |
|
\ exist for this, and how do I do it? \n\nAnswer: To achieve (at least some of)\ |
|
\ your goals, I would recommend the Variant Effect Predictor (VEP). It is a flexible\ |
|
\ tool that provides several types of annotations on an input .vcf file. I agree\ |
|
\ that ExAC is the de facto gold standard catalog for human genetic variation\ |
|
\ in coding regions. To see the frequency distribution of variants by global\ |
|
\ subpopulation make sure \"ExAC allele frequencies\" is checked in addition to\ |
|
\ the 1000 genomes. \nOutput in the web-browser:\n\nIf you download the annotated\ |
|
\ .vcf, frequencies will be in the INFO field:\n##INFO=<ID=CSQ,Number=.,Type=String,Description=\"\ |
|
Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|ExAC_AF|ExAC_Adj_AF|ExAC_AFR_AF|ExAC_AMR_AF|ExAC_EAS_AF|ExAC_FIN_AF|ExAC_NFE_AF|ExAC_OTH_AF|ExAC_SAS_AF|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE\n\ |
|
\nThe previously mentioned Annovar can also annotate with ExAC allele frequencies.\ |
|
\ Finally, should mention the newest whole-genome resource, gnomAD." |
|
- 'Question: I produced a bam file by aligning reads to a small set of synthetic |
|
sequences using bwa-mem. |
|
|
|
I am heavily filtering reads that are not paired and of a certain orientation. |
|
|
|
Applying the filtering, I get a few thousands of reads: |
|
|
|
samtools view -h $myfilebam | \ |
|
|
|
samtools view -h -F4 - | \ |
|
|
|
samtools view -h -F8 - | \ |
|
|
|
samtools view -h -F256 - | \ |
|
|
|
samtools view -h -F512 - | \ |
|
|
|
samtools view -h -F1024 - | \ |
|
|
|
samtools view -h -F2048 - | \ |
|
|
|
samtools view -h -f16 - | \ |
|
|
|
samtools view -h -f32 - | wc -l |
|
|
|
|
|
Gives me 89502 reads. |
|
|
|
If I then pipe this into samtools mpileup, I get no results: |
|
|
|
samtools view -h $myfilebam | \ |
|
|
|
samtools view -h -F4 - | \ |
|
|
|
samtools view -h -F8 - | \ |
|
|
|
samtools view -h -F256 - | \ |
|
|
|
samtools view -h -F512 - | \ |
|
|
|
samtools view -h -F1024 - | \ |
|
|
|
samtools view -h -F2048 - | \ |
|
|
|
samtools view -h -f16 - | \ |
|
|
|
samtools view -h -f32 - | \ |
|
|
|
samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l |
|
|
|
|
|
Returns 0. |
|
|
|
I tried different combinations of filtering, and when I do both -f 16 and -f 32 |
|
returns empty, but if I do either of those, then it works: |
|
|
|
samtools view -h $myfilebam | \ |
|
|
|
samtools view -h -F4 - | \ |
|
|
|
samtools view -h -F8 - | \ |
|
|
|
samtools view -h -F256 - | \ |
|
|
|
samtools view -h -F512 - | \ |
|
|
|
samtools view -h -F1024 - | \ |
|
|
|
samtools view -h -F2048 - | \ |
|
|
|
samtools view -h -f16 - | \ |
|
|
|
samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l |
|
|
|
|
|
Returns 1056. |
|
|
|
Any ideas why? My thinking was that it would work with --excl-flags 0. |
|
|
|
EDIT: substituting mpileup for depth does work, and prints out each position and |
|
the depth as expected. |
|
|
|
EDIT2: adding -q 0 to mpileup gives the same empty result. |
|
|
|
Thanks in advance |
|
|
|
|
|
Answer: By using -h in the samtools view command, you''re including all the header |
|
lines in your word count. If you happen to have about 89500 reference sequences, |
|
then the lengths of those would all appear in the header and inflate the -h word |
|
count, but not the mpileup count. Try piping it through an additional samtools |
|
view (i.e. without -h) and see if the counts change: |
|
|
|
... |
|
|
|
samtools view -h -f32 - | \ |
|
|
|
samtools view | wc -l |
|
|
|
|
|
Also, samtools mpileup by default only considers high-quality bases and concordant |
|
reads. Try adding a -A to your mpileup line (which stops anomalous read pairs |
|
from being discarded): |
|
|
|
... |
|
|
|
samtools mpileup -A -Q0 -B -d 999999 - | wc -l |
|
|
|
|
|
Whether or not this is actually a good idea will be dependent on what you want |
|
to get out of the analysis, and what the downstream programs / analyses are expecting.' |
|
- "Question: From SAM Optional Fields Specification the NM field is \n\nEdit distance\ |
|
\ to the reference, including ambiguous bases but excluding clipping\n\nAssuming\ |
|
\ both the MD and CIGAR are present, is the edit distance simply the number of\ |
|
\ characters [A-Z] appearing in the MD field plus the number of bases inserted\ |
|
\ (xI, if any) from the CIGAR string? Are there any other complications? \n\n\ |
|
Answer: Assuming both the MD and CIGAR are present and correct, then yes, you\ |
|
\ can parse both to get the edit distance (NM auxiliary tag). One big caveat to\ |
|
\ this is that there's a reason that the samtools calmd command exists, since\ |
|
\ it's historically been the case that not all aligners have output correct MD\ |
|
\ strings. It's rare for the CIGAR string to be wrong and that'd be more of a\ |
|
\ catastrophic error on the part of an aligner. For what it's worth, if the NM\ |
|
\ auxiliary is absent on a given alignment but present on others produced by the\ |
|
\ same aligner then it's fair to assume NM:i:0 for a given alignment by default\ |
|
\ (many aligners only produce NM:i:XXX if the edit distance is at least 1)." |
|
- source_sentence: How to read structural variant VCF? |
|
sentences: |
|
- "Question: I am calling SNPs from WGS samples produced at my lab. I am currently\ |
|
\ using bwa-mem for mapping Illumina reads as it is recommended by GATK best practice.\ |
|
\ However, bwa is a bit slow. I heard from my colleague that SNAP is much faster\ |
|
\ than bwa. I tried it on a small set of reads and it is indeed faster. However,\ |
|
\ I am not sure how it works with downstream SNP callers, so here are my questions:\ |
|
\ have you used SNAP for short-read mapping? What is your experience? Does SNAP\ |
|
\ work well with SNP callers like GATK and freebayes? Thanks!\n\nAnswer: GATK\ |
|
\ best practices are explicitly meant to consume BWA MEM generated BAM. Whilst\
|
\ SNAP may be faster, the Broad will not have tested it for compatibility with\ |
|
\ GATK, so you can't guarantee using it won't have unexpected consequences.\
|
\ \nAs such you'd be better off using BWA MEM because I assume accurately called\ |
|
\ variation is always better than fast and incorrectly called variation. The\ |
|
\ main issue you'll have is ensuring shorter split hits and mapping quality are\ |
|
\ reported in the same way as bwa MEM -M which GATK/Picard is expecting. Ultimately\ |
|
\ however you'd be better off posting this question on the GATK forum. \nIt's\ |
|
\ also worth noting that the soon to be released GATK 4 will utilise bwaspark\ |
|
\ which can distribute its alignment processes across Apache Spark for increased\
|
\ performance. Consequently I can't see SNAP being adopted anytime soon." |
|
- 'Question: I have a computer engineering background, not biology. |
|
|
|
I started working on a bioinformatics project recently, which involves de-novo |
|
assembly. I came to know the terms Transcriptome and Genome, but I cannot identify |
|
the difference between these two. |
|
|
|
I know a transcriptome is the set of all messenger RNA molecules in a cell, but |
|
am not sure how this is different from a genome. |
|
|
|
|
|
Answer: In brief, the “genome” is the collection of all DNA present in the nucleus and the mitochondria |
|
of a somatic cell. The initial product of genome expression is the “transcriptome”, |
|
a collection of RNA molecules derived from those genes.' |
|
- "Question: The IGSR has a sample for encoding structural variants in the VCF 4.0\ |
|
\ format.\nAn example from the site (the first record):\n#CHROM POS ID REF\ |
|
\ ALT QUAL FILTER INFO FORMAT NA00001\n1 2827693 . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA\ |
|
\ C . PASS SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66\ |
|
\ GT:GQ 1/1:13.9\n\nHow to read it? From what I can see:\n\nThis is a deletion\ |
|
\ (SVTYPE=DEL)\nThe end position of the variant comes before the starting position\ |
|
\ (reverse strand?)\nThe reference starts from 2827693 to 2827680 (13 bases on\ |
|
\ the reverse strand)\nThe difference between reference and alternative is 66\ |
|
\ bases (SVLEN=-66)\n\nThis doesn't sound right to me. For instance, I don't see\ |
|
\ where exactly the deletion starts. The SVLEN field says 66 bases deleted, but\ |
|
\ where? 2827693 to 2827680 only has 13 bases between.\nQ: How to read the deletion\ |
|
\ correctly from this structural VCF record? Where is the missing 66-13=53 bases?\n\ |
|
\nAnswer: I just received a reply from 1000Genomes regarding this. I'll post it\ |
|
\ in its entirety below:\n\nLooking at the example you mention, I find it difficult\ |
|
\ to come up with an\n interpretation of the information whereby the stated end\ |
|
\ seems to be correct,\n so believe that this may indeed be an error.\nSince\ |
|
\ the v4.0 was created, however, new versions of VCF have been introduced,\n \ |
|
\ improving and correcting the specification. The current version is v4.3\n (http://samtools.github.io/hts-specs/).\ |
|
\ I believe the first record shown on\n page 11 provides an accurate example\ |
|
\ of this type of deletion.\nI will update the web page to include this information.\n\ |
|
\nSo we can take this as official confirmation that we were all correct in suspecting\ |
|
\ the example was just wrong." |
|
- source_sentence: Publicly available genome sequence database for viruses? |
|
sentences: |
|
- "Question: This question is based on a question on BioStars posted >2 years ago\ |
|
\ by user jack.\nIt describes a very frequent problem of generating GO annotations\ |
|
\ for non-model organisms. While it is based on some specific format and single\ |
|
\ application (Ontologizer), it would be useful to have a general description\ |
|
\ of the pathway to getting to a GAF file. \nNote, that the input format is lacking\ |
|
\ a bit of essential information, like how it was obtained. Therefore, it is hard\

\ to assign an evidence code. So let's assume that the assignments of GO terms\
|
\ were done automagically. \n\nI want to do the Gene enrichment using Ontologizer\ |
|
\ without a\n predefined association file(it's not model organism). \nI have\ |
|
\ parsed a file with two columns for that organism like this : \ngeneA GO:0006950,GO:0005737\n\ |
|
geneB GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031\n\ |
|
geneC GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863\n\ |
|
geneD GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528\n\ |
|
\nI have downloaded the .ob file from Gene ontology file which contain\n this\ |
|
\ information (from here) : \n!\n! GO IDs (primary only) and name text strings\n\ |
|
! GO:0000000 [tab] text string [tab] F|P|C\n! where F = molecular function, P\ |
|
\ = biological process, C = cellular component\n!\nGO:0000001 mitochondrion inheritance\ |
|
\ P\nGO:0000002 mitochondrial genome maintenance P\nGO:0000003 reproduction\ |
|
\ P\nGO:0000005 ribosomal chaperone activity F\nGO:0000006 high affinity\ |
|
\ zinc uptake transmembrane transporter activity F\nGO:0000007 low-affinity\ |
|
\ zinc ion transmembrane transporter activity F\nGO:0000008 thioredoxin F\n\ |
|
GO:0000009 alpha-1,6-mannosyltransferase activity F\nGO:0000010 trans-hexaprenyltranstransferase\ |
|
\ activity F\nGO:0000011 vacuole inheritance P\n\nWhat I need as output is\ |
|
\ .gaf file in the following format (in the\n format of the files here):\n!gaf-version:\ |
|
\ 2.0\n\n!Project_name: Leishmania major GeneDB\n\n!URL: http://www.genedb.org/leish\n\ |
|
\n!Contact Email: mb4@sanger.ac.uk\n\n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ |
|
\ GO:0003723 PMID:22396527 ISO GeneDB:Tb927.10.10130 F mitochondrial\ |
|
\ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ |
|
\ 20120910 GeneDB_Lmajor \n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ |
|
\ GO:0044429 PMID:20660476 ISS C mitochondrial RNA binding\ |
|
\ complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20100803\ |
|
\ GeneDB_Lmajor GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 \ |
|
\ GO:0016554 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial\ |
|
\ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ |
|
\ 20120910 GeneDB_Lmajor \n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ |
|
\ GO:0048255 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial\ |
|
\ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ |
|
\ 20120910 GeneDB_Lmajor \n\nHow to create your own GO association file\ |
|
\ (gaf)?\n\nAnswer: Here's a Perl script that can do this:\n#!/usr/bin/env perl\ |
|
\ \nuse strict;\nuse warnings;\n\n## Change this to whatever taxon you are working\ |
|
\ with\nmy $taxon = 'taxon:1000';\nchomp(my $date = `date +%Y%M%d`);\n\nmy (%aspect,\ |
|
\ %gos);\n## Read the GO.terms_and_ids file to get the aspect (sub ontology)\n\ |
|
## of each GO term. \nopen(my $fh, $ARGV[0]) or die \"Need a GO.terms_and_ids\ |
|
\ file as 1st arg: $!\\n\";\nwhile (<$fh>) {\n next if /^!/;\n chomp;\n\ |
|
\ my @fields = split(/\\t/);\n ## $aspect{GO:0000001} = 'P'\n $aspect{$fields[0]}\ |
|
\ = $fields[2];\n}\nclose($fh);\n\n## Read the list of gene annotations\nopen($fh,\ |
|
\ $ARGV[1]) or die \"Need a list of gene annotattions as 2nd arg: $!\\n\";\nwhile\ |
|
\ (<$fh>) {\n chomp;\n my ($gene, @terms) = split(/[\\s,]+/);\n ## $gos{geneA}\ |
|
\ = (go1, go2 ... goN)\n $gos{$gene} = [ @terms ];\n}\nclose($fh);\n\nforeach\ |
|
\ my $gene (keys(%gos)) {\n foreach my $term (@{$gos{$gene}}) {\n ##\ |
|
\ Warn and skip if there is no aspect for this term\n if (!$aspect{$term})\ |
|
\ {\n print STDERR \"Unknown GO term ($term) for gene $gene\\n\";\n\ |
|
\ next;\n }\n ## Build a pseudo GAF line \n my\ |
|
\ @out = ('DB', $gene, $gene, ' ', $term, 'PMID:foo', 'TAS', ' ', $aspect{$term},\n\ |
|
\ $gene, ' ', 'protein', $taxon, $date, 'DB', ' ',\ |
|
\ ' ');\n print join(\"\\t\", @out). \"\\n\";\n }\n}\n\nMake it executable\ |
|
\ and run it with the GO.terms_and_ids file as the 1st argument and the list of\ |
|
\ gene annotations as the second. Using the current GO.terms_and_ids and the example\ |
|
\ annotations in the question, I get:\n$ foo.pl GO.terms_and_ids file.gos \nDB\ |
|
\ geneD geneD GO:0005634 PMID:foo TAS C geneD protein\ |
|
\ taxon:1000 20170308 DB \nDB geneD geneD GO:0003677 PMID:foo\ |
|
\ TAS F geneD protein taxon:1000 20170308 DB \nDB geneD\ |
|
\ geneD GO:0030154 PMID:foo TAS P geneD protein taxon:1000\ |
|
\ 20170308 DB \nUnknown GO term (GO:0006350) for gene geneD\nDB geneD\ |
|
\ geneD GO:0006355 PMID:foo TAS P geneD protein taxon:1000\ |
|
\ 20170308 DB \nDB geneD geneD GO:0007275 PMID:foo TAS\ |
|
\ P geneD protein taxon:1000 20170308 DB \nDB geneD geneD\ |
|
\ GO:0030528 PMID:foo TAS F geneD protein taxon:1000 20170308\ |
|
\ DB \nDB geneB geneB GO:0016020 PMID:foo TAS C geneB\ |
|
\ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0005524\ |
|
\ PMID:foo TAS F geneB protein taxon:1000 20170308 DB \ |
|
\ \nDB geneB geneB GO:0006468 PMID:foo TAS P geneB \ |
|
\ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0005737\ |
|
\ PMID:foo TAS C geneB protein taxon:1000 20170308 DB \ |
|
\ \nDB geneB geneB GO:0004674 PMID:foo TAS F geneB \ |
|
\ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0006914\ |
|
\ PMID:foo TAS P geneB protein taxon:1000 20170308 DB \ |
|
\ \nDB geneB geneB GO:0016021 PMID:foo TAS C geneB \ |
|
\ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0015031\ |
|
\ PMID:foo TAS P geneB protein taxon:1000 20170308 DB \ |
|
\ \nDB geneA geneA GO:0006950 PMID:foo TAS P geneA \ |
|
\ protein taxon:1000 20170308 DB \nDB geneA geneA GO:0005737\ |
|
\ PMID:foo TAS C geneA protein taxon:1000 20170308 DB \ |
|
\ \nDB geneC geneC GO:0003779 PMID:foo TAS F geneC \ |
|
\ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0006941\ |
|
\ PMID:foo TAS P geneC protein taxon:1000 20170308 DB \ |
|
\ \nDB geneC geneC GO:0005524 PMID:foo TAS F geneC \ |
|
\ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0003774\ |
|
\ PMID:foo TAS F geneC protein taxon:1000 20170308 DB \ |
|
\ \nDB geneC geneC GO:0005516 PMID:foo TAS F geneC \ |
|
\ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0005737\ |
|
\ PMID:foo TAS C geneC protein taxon:1000 20170308 DB \ |
|
\ \nDB geneC geneC GO:0005863 PMID:foo TAS C geneC \ |
|
\ protein taxon:1000 20170308 DB \n\nNote that this is very much a pseudo-GAF\ |
|
\ file since most of the fields apart from the gene name, GO term and sub-ontology\ |
|
\ are fake. It should still work for what you need, however." |
|
- 'Question: As a small introductory project, I want to compare genome sequences |
|
of different strains of influenza virus. |
|
|
|
What are the publicly available databases of influenza virus gene/genome sequences? |
|
|
|
|
|
Answer: There area few different influenza virus database resources: |
|
|
|
|
|
The Influenza Research Database (IRD) (a.k.a FluDB - based upon URL) |
|
|
|
|
|
A NIAID Bioinformatics Resource Center or BRC which highly curates the data brought |
|
in and integrates it with numerous other relevant data types |
|
|
|
|
|
The NCBI Influenza Virus Resource |
|
|
|
|
|
A sub-project of the NCBI with data curated over and above the GenBank data that |
|
is part of the NCBI |
|
|
|
|
|
The GISAID EpiFlu Database |
|
|
|
|
|
A database of sequences from the Global Initiative on Sharing All Influenza Data. |
|
Has unique data from many countries but requires user agree to a data sharing |
|
policy. |
|
|
|
|
|
The OpenFluDB |
|
|
|
|
|
Former GISAID database that contains some sequence data that GenBank does not |
|
have. |
|
|
|
|
|
For those who also may be interested in other virus databases, there are: |
|
|
|
|
|
Virus Pathogen Resource (VIPR) |
|
|
|
|
|
A companion portal to the IRD, which hosts curated and integrated data for most |
|
other NIAID A-C virus pathogens including (but not limited to) Ebola, Zika, Dengue, |
|
Enterovirus, and Hepatitis C |
|
|
|
|
|
LANL HIV database |
|
|
|
|
|
Los Alamos National Laboratory HIV database with HIV data and many useful tools |
|
for all virus bioinformatics |
|
|
|
|
|
PaVE: Papilloma virus genome database (from quintik comment) |
|
|
|
|
|
NIAID developed and maintained Papilloma virus bioinformatics portal |
|
|
|
|
|
Disclaimer: I used to work for the IRD / VIPR and currently work for NIAID.' |
|
- "Question: I have a set of genomic ranges that are potentially overlapping. I\ |
|
\ want to count the number of ranges at certain positions using R.\nI'm pretty\
|
\ sure there are good solutions, but I seem to be unable to find them. \nSolutions\ |
|
\ like cut or findIntervals don't achieve what I want as they only count on one\ |
|
\ vector or accumulate by all values <= break.\nAlso countMatches {GenomicRanges}\ |
|
\ doesn't seem to cover it.\nProbably one could use Bedtools, but I don't want\ |
|
\ to leave R.\nI could only come up with a hilariously slow solution\n# generate\ |
|
\ test data\ntestdata <- data.frame(chrom = rep(seq(1,10),10),\n \ |
|
\ starts = abs(rnorm(100, mean = 1, sd = 1)) * 1000,\n \ |
|
\ ends = abs(rnorm(100, mean = 2, sd = 1)) * 2000)\n\n# make sure that\ |
|
\ all end coordinates are bigger than start\n# this is a requirement of the original\ |
|
\ data\ntestdata <- testdata[testdata$ends - testdata$starts > 0,]\n\n# count\ |
|
\ overlapping ranges on certain positions\ncount.data <- lapply(unique(testdata$chrom),\ |
|
\ function(chromosome){\n tmp.inner <- lapply(seq(1,10000, by = 120), function(i){\n\ |
|
\ sum(testdata$chrom == chromosome & testdata$starts <= i & testdata$ends\ |
|
\ >= i)\n })\n return(unlist(tmp.inner))\n})\n\n# generate a data.frame\ |
|
\ containing all data\ndf.count.data <- ldply(count.data, rbind)\n\n# ideally\ |
|
\ the chromosome will be columns and not rows\nt(df.count.data)\n\nAnswer: GenomicRanges::countOverlaps\ |
|
\ seems to be what you’re after:\nposition_ranges = GRanges(position$chrom, IRanges(position,\
|
\ position, width = 1))\nranges_at_position = countOverlaps(position_ranges, granges)" |
|
- source_sentence: samtools depth print out all positions |
|
sentences: |
|
- "Question: I have around ~3,000 short sequences of approximately ~10Kb long. What\ |
|
\ are the best ways to find the motifs among all of these sequences? Is there\ |
|
\ a certain software/method recommended?\nThere are several ways to do this. My\ |
|
\ goal would be to:\n(1) Check for motifs repeated within individual sequences\n\ |
|
(2) Check for motifs shared among all sequences\n(3) Check for the presence of\ |
|
\ \"expected\" or known motifs\nWith respect to #3, I'm also curious if I find\ |
|
\ e.g. trinucleotide sequences, how does one check the context around these regions?\n\ |
|
Thank you for the recommendations/help!\n\nAnswer: For (3), this page has a lot\ |
|
\ of links to pattern/motif finding tools. Following through the YMF link on that\ |
|
\ page, I came across the University of Washington Motif Discovery section. Of\ |
|
\ these projection seemed to be the only downloadable tool. I find it interesting\ |
|
\ how old all these tools are; maybe the introduction of microarrays and NGS has\ |
|
\ made them all redundant.\nYour sub-problem (2) seems similar to the problem\ |
|
\ I'm having with Nippostrongylus brasiliensis genome sequences, where I'd like\ |
|
\ to find regions of very high homology (length 500bp to 20kb or more, 95-99%\ |
|
\ similar) that are repeated throughout the genome. These sequences are killing\ |
|
\ the assembly.\nThe main way I can find these regions is by looking at a coverage\ |
|
\ plot of long nanopore reads mapped to the assembled genome (using GraphMap or\ |
|
\ BWA). Any regions with substantially higher than median coverage are likely\ |
|
\ to be shared repeats.\nI've played around in the past with chopping up the reads\ |
|
\ to smaller sizes, which works better for hitting smaller repeated regions that\ |
|
\ are such a small proportion of most reads that they are never mapped to all\ |
|
\ the repeated locations. I wrote my own script a while back to chop up reads\ |
|
\ (for a different purpose), which produces a FASTA/FASTQ file where all reads\ |
|
\ are exactly the same length. For some unknown reason I took the time to document\ |
|
\ that script \"properly\" using POD, so here's a short summary:\n\nConverts all\ |
|
\ sequences in the input FASTA file to the same length.\n Sequences shorter\ |
|
\ than the target length are dropped, and sequences longer\n than the target\ |
|
\ length are split into overlapping subsequences covering\n the entire range.\ |
|
\ This prepares the sequences for use in an\n overlap-consensus assembler\ |
|
\ requiring constant-length sequences (such as\n edena).\n\nAnd here's the\ |
|
\ syntax:\n$ ./normalise_seqlengths.pl -h\nUsage:\n ./normalise_seqlengths.pl\ |
|
\ <reads.fa> [options]\n\n Options:\n -help\n Only display this help\ |
|
\ message\n\n -fraglength\n Target fragment length (in base-pairs, default\ |
|
\ 2000)\n\n -overlap\n Minimum overlap length (in base-pairs, default\ |
|
\ 200)\n\n -short\n Keep short sequences (shorter than fraglength)" |
|
- 'Question: Without going into too much background, I just joined up with a lab |
|
as a bioinformatics intern while I''m completing my masters degree in the field. |
|
The lab has data from an RNA-seq they outsourced, but the only problem is that |
|
the only data they have is preprocessed from the company that did the sequencing: |
|
filtering the reads, aligning them, and putting the aligned reads through RSEM. |
|
I currently have output from RSEM for each of the four samples consisting of: |
|
gene id, transcript id(s), length, expected count, and FPKM. I am attempting to |
|
get the FASTQ files from the sequencing, but for now, this is what I have, and |
|
I''m trying to get something out of it if possible. |
|
|
|
I found this article that talks about how expected read counts can be better than |
|
raw read counts when analyzing differential expression using EBSeq; it''s just |
|
one guy''s opinion, and it''s from 2014, so it may be wrong or outdated, but I |
|
thought I''d give it a try since I have the expected counts. |
|
|
|
However, I have just a couple of questions about running EBSeq that I can''t find |
|
the answers to: |
|
|
|
1: In the output RSEM files I have, not all genes are represented in each, about |
|
80% of them are, but for the ones that aren''t, should I remove them before analysis |
|
with EBSeq? It runs when I do, but I''m not sure if it is correct. |
|
|
|
2: How do I know which normalization factor to use when running EBSeq? This is |
|
more of a conceptual question rather than a technical question. |
|
|
|
Thanks! |
|
|
|
|
|
Answer: Yes, that blog post does represent just one guy''s opinion (hi!) and it |
|
does date all the way back to 2014, which is, like, decades in genomics years. |
|
:-) By the way, there is quite a bit of literature discussing the improvements |
|
that expected read counts derived from an Expectation Maximization algorithm provide |
|
over raw read counts. I''d suggest reading the RSEM papers for a start[1][2]. |
|
|
|
But your main question is about the mechanics of running RSEM and EBSeq. First, |
|
RSEM was written explicitly to be compatible with EBSeq, so I''d be very surprised |
|
if it does not work correctly out-of-the-box. Second, EBSeq''s MedianNorm function |
|
worked very well in my experience for normalizing the library counts. Along those |
|
lines, the blog you mentioned above has another post that you may find useful. |
|
|
|
But all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools |
|
provide orders-of-magnitude improvements in runtime over the older alignment-based |
|
alternatives, with comparable accuracy. Sailfish was the first in a growing list |
|
of tools that now includes Salmon and Kallisto. When starting a new analysis from |
|
scratch (i.e. if you ever get the original FASTQ files), there''s really no good |
|
reason not to estimate expression using these much faster tools, followed by a |
|
differential expression analysis with DESeq2, edgeR, or sleuth. |
|
|
|
|
|
1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression |
|
estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692. |
|
|
|
2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data |
|
with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.' |
|
- 'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed |
|
file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out |
|
the coverage at every position: |
|
|
|
cat GRCh38.karyo.bed | awk ''{print $3}'' | datamash sum 1 |
|
|
|
3088286401 |
|
|
|
|
|
I would like to know how to run samtools depth so that it produces 3,088,286,401 |
|
entries when run against a GRCh38 bam file: |
|
|
|
samtools depth -b $bedfile -a $inputfile |
|
|
|
|
|
I tried it for a few bam files that were aligned the same way, and I get differing |
|
number of entries: |
|
|
|
3087003274 |
|
|
|
3087005666 |
|
|
|
3087007158 |
|
|
|
3087009435 |
|
|
|
3087009439 |
|
|
|
3087009621 |
|
|
|
3087009818 |
|
|
|
3087010065 |
|
|
|
3087010408 |
|
|
|
3087010477 |
|
|
|
3087010481 |
|
|
|
3087012115 |
|
|
|
3087013147 |
|
|
|
3087013186 |
|
|
|
3087013500 |
|
|
|
3087149616 |
|
|
|
|
|
Is there a special flag in samtools depth so that it reports all entries from |
|
the bed file? |
|
|
|
If samtools depth is not the best tool for this, what would be the equivalent |
|
with sambamba depth base? |
|
|
|
sambamba depth base --min-coverage=0 --regions $bedfile $inputfile |
|
|
|
|
|
Any other options? |
|
|
|
|
|
Answer: You might try using bedtools genomecov instead. If you provide the -d |
|
option, it reports the coverage at every position in the BAM file. |
|
|
|
bedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov" |
|
|
|
|
|
You can also provide a BED file if you just want to calculate in the target region.' |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
--- |
|
|
|
# SentenceTransformer based on BAAI/bge-small-en-v1.5 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) <!-- at revision 5c38ec7c405ec4b44b94cc5a9bb96e735b38267a --> |
|
- **Maximum Sequence Length:** 512 tokens |
|
- **Output Dimensionality:** 384 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
<!-- - **Training Dataset:** Unknown --> |
|
<!-- - **Language:** Unknown --> |
|
<!-- - **License:** Unknown --> |
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
### Full Model Architecture |
|
|
|
``` |
|
SentenceTransformer( |
|
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel |
|
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) |
|
(2): Normalize() |
|
) |
|
``` |
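
In other words, the encoder output is pooled by taking the `[CLS]` token embedding (`pooling_mode_cls_token: True`) and then L2-normalized, so cosine similarity between embeddings reduces to a plain dot product. As a minimal sketch of the same pipeline done by hand with plain `transformers` (loading the base checkpoint for illustration):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
encoder = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")

encoded = tokenizer(
    ["samtools depth print out all positions"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    hidden = encoder(**encoded).last_hidden_state  # (batch, seq_len, 384)

cls = hidden[:, 0]                         # CLS-token pooling
embeddings = F.normalize(cls, p=2, dim=1)  # unit-length 384-dim vectors
```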
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("sentence_transformers_model_id") |
|
# Run inference |
|
sentences = [ |
|
'samtools depth print out all positions', |
|
'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target region.', |
|
"Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.", |
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# [3, 384] |
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape) |
|
# [3, 3] |
|
``` |
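
The training pairs couple short question titles (`sentence_0`) with full Q&A passages (`sentence_1`), so the natural inference pattern is asymmetric retrieval: embed a short query, then rank candidate passages by similarity. A small sketch reusing the `sentences` list from the example above:

```python
# Treat the first sentence as the query and the Q&A passages as candidates.
query_embedding = model.encode([sentences[0]])
passage_embeddings = model.encode(sentences[1:])

# Shape [1, 2]; the higher score marks the more relevant passage.
scores = model.similarity(query_embedding, passage_embeddings)
print(scores)
```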
|
|
|
<!-- |
|
### Direct Usage (Transformers) |
|
|
|
<details><summary>Click to see the direct usage in Transformers</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Downstream Usage (Sentence Transformers) |
|
|
|
You can finetune this model on your own dataset. |
|
|
|
<details><summary>Click to expand</summary> |
|
|
|
</details> |
|
--> |
|
|
|
<!-- |
|
### Out-of-Scope Use |
|
|
|
*List how the model may foreseeably be misused and address what users ought not to do with the model.* |
|
--> |
|
|
|
<!-- |
|
## Bias, Risks and Limitations |
|
|
|
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.* |
|
--> |
|
|
|
<!-- |
|
### Recommendations |
|
|
|
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.* |
|
--> |
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### Unnamed Dataset |
|
|
|
* Size: 96 training samples |
|
* Columns: <code>sentence_0</code> and <code>sentence_1</code> |
|
* Approximate statistics based on the first 96 samples: |
|
| | sentence_0 | sentence_1 | |
|
|:--------|:----------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------| |
|
| type | string | string | |
|
| details | <ul><li>min: 6 tokens</li><li>mean: 14.93 tokens</li><li>max: 34 tokens</li></ul> | <ul><li>min: 103 tokens</li><li>mean: 397.92 tokens</li><li>max: 512 tokens</li></ul> | |
|
* Samples: |
|
| sentence_0 | sentence_1 | |
|
|:-------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>Using shells other than bash</code> | <code>Question: As someone who's beginning to delve into bioinformatics, I'm noticing that like biology there are industry standards here, similar to Illumina in genomics and bowtie for alignment, many people use bash as shell. <br>Is using a shell besides bash going to cause issues for me?<br><br>Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell.<br>If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell. See Csh Programming Considered Harmful and other essays/polemics.<br>A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools. It's not uncommon to have things like<br><br>Set some variables pointing at reference data and add the script to your PATH to run it:<br>export...</code> | |
|
| <code>Linear models of complex diseases</code> | <code>Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice). <br>For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't)<br>My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B. <br>If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains.<br>How can such a disease be described in terms of linear models with such data and information?<br><br>Answer: There are t...</code> | |
|
| <code>Detecting portions of human proteins with high degree of microbial similarity</code> | <code>Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem.<br>My goal is to take a list of human proteins, and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with list of FASTA sequences, and have an easy way to generate an output of the corresponding high similarity segments of each protein.<br>Are there existing tools or software that I should be aware of that will make my life easier?<br>Thanks in advance.<br><br>Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options:<br><br>PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu...</code> | |
|
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters: |
|
```json |
|
{ |
|
"scale": 20.0, |
|
"similarity_fct": "cos_sim" |
|
} |
|
``` |
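
With these parameters, each `(sentence_0, sentence_1)` pair in a batch is a positive, and every other `sentence_1` in the same batch serves as an in-batch negative: cosine similarities are scaled by 20 and scored with a cross-entropy over the batch. A minimal sketch of the computation (not the library's actual implementation):

```python
import torch
import torch.nn.functional as F

def mnrl_sketch(anchors: torch.Tensor, positives: torch.Tensor, scale: float = 20.0):
    """anchors/positives: (batch, dim) embeddings of sentence_0 / sentence_1."""
    # Cosine similarity of every anchor against every positive in the batch.
    sims = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1)
    # The matching positive sits on the diagonal; every off-diagonal
    # entry acts as a negative.
    labels = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims * scale, labels)
```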
|
|
|
### Training Hyperparameters |
|
#### Non-Default Hyperparameters |
|
|
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 32 |
|
- `num_train_epochs`: 1 |
|
- `fp16`: True |
|
- `batch_sampler`: no_duplicates |
|
- `multi_dataset_batch_sampler`: round_robin |
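
A sketch of a training script reproducing these non-default settings; the dataset construction and output path are hypothetical stand-ins for the unnamed 96-pair dataset:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import (
    BatchSamplers,
    MultiDatasetBatchSamplers,
    SentenceTransformerTrainingArguments,
)

# Hypothetical stand-in for the unnamed dataset (columns: sentence_0, sentence_1).
train_dataset = Dataset.from_dict({
    "sentence_0": ["samtools depth print out all positions"],
    "sentence_1": ["Question: ... Answer: ..."],
})

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-small-en-v1.5-finetuned",  # hypothetical
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    fp16=True,
    # No duplicate samples within a batch: duplicates would act as
    # false negatives for MultipleNegativesRankingLoss.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss,
)
trainer.train()
```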
|
|
|
#### All Hyperparameters |
|
<details><summary>Click to expand</summary> |
|
|
|
- `overwrite_output_dir`: False |
|
- `do_predict`: False |
|
- `eval_strategy`: no |
|
- `prediction_loss_only`: True |
|
- `per_device_train_batch_size`: 32 |
|
- `per_device_eval_batch_size`: 32 |
|
- `per_gpu_train_batch_size`: None |
|
- `per_gpu_eval_batch_size`: None |
|
- `gradient_accumulation_steps`: 1 |
|
- `eval_accumulation_steps`: None |
|
- `torch_empty_cache_steps`: None |
|
- `learning_rate`: 5e-05 |
|
- `weight_decay`: 0.0 |
|
- `adam_beta1`: 0.9 |
|
- `adam_beta2`: 0.999 |
|
- `adam_epsilon`: 1e-08 |
|
- `max_grad_norm`: 1 |
|
- `num_train_epochs`: 1 |
|
- `max_steps`: -1 |
|
- `lr_scheduler_type`: linear |
|
- `lr_scheduler_kwargs`: {} |
|
- `warmup_ratio`: 0.0 |
|
- `warmup_steps`: 0 |
|
- `log_level`: passive |
|
- `log_level_replica`: warning |
|
- `log_on_each_node`: True |
|
- `logging_nan_inf_filter`: True |
|
- `save_safetensors`: True |
|
- `save_on_each_node`: False |
|
- `save_only_model`: False |
|
- `restore_callback_states_from_checkpoint`: False |
|
- `no_cuda`: False |
|
- `use_cpu`: False |
|
- `use_mps_device`: False |
|
- `seed`: 42 |
|
- `data_seed`: None |
|
- `jit_mode_eval`: False |
|
- `use_ipex`: False |
|
- `bf16`: False |
|
- `fp16`: True |
|
- `fp16_opt_level`: O1 |
|
- `half_precision_backend`: auto |
|
- `bf16_full_eval`: False |
|
- `fp16_full_eval`: False |
|
- `tf32`: None |
|
- `local_rank`: 0 |
|
- `ddp_backend`: None |
|
- `tpu_num_cores`: None |
|
- `tpu_metrics_debug`: False |
|
- `debug`: [] |
|
- `dataloader_drop_last`: False |
|
- `dataloader_num_workers`: 0 |
|
- `dataloader_prefetch_factor`: None |
|
- `past_index`: -1 |
|
- `disable_tqdm`: False |
|
- `remove_unused_columns`: True |
|
- `label_names`: None |
|
- `load_best_model_at_end`: False |
|
- `ignore_data_skip`: False |
|
- `fsdp`: [] |
|
- `fsdp_min_num_params`: 0 |
|
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} |
|
- `tp_size`: 0 |
|
- `fsdp_transformer_layer_cls_to_wrap`: None |
|
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} |
|
- `deepspeed`: None |
|
- `label_smoothing_factor`: 0.0 |
|
- `optim`: adamw_torch |
|
- `optim_args`: None |
|
- `adafactor`: False |
|
- `group_by_length`: False |
|
- `length_column_name`: length |
|
- `ddp_find_unused_parameters`: None |
|
- `ddp_bucket_cap_mb`: None |
|
- `ddp_broadcast_buffers`: False |
|
- `dataloader_pin_memory`: True |
|
- `dataloader_persistent_workers`: False |
|
- `skip_memory_metrics`: True |
|
- `use_legacy_prediction_loop`: False |
|
- `push_to_hub`: False |
|
- `resume_from_checkpoint`: None |
|
- `hub_model_id`: None |
|
- `hub_strategy`: every_save |
|
- `hub_private_repo`: None |
|
- `hub_always_push`: False |
|
- `gradient_checkpointing`: False |
|
- `gradient_checkpointing_kwargs`: None |
|
- `include_inputs_for_metrics`: False |
|
- `include_for_metrics`: [] |
|
- `eval_do_concat_batches`: True |
|
- `fp16_backend`: auto |
|
- `push_to_hub_model_id`: None |
|
- `push_to_hub_organization`: None |
|
- `mp_parameters`: |
|
- `auto_find_batch_size`: False |
|
- `full_determinism`: False |
|
- `torchdynamo`: None |
|
- `ray_scope`: last |
|
- `ddp_timeout`: 1800 |
|
- `torch_compile`: False |
|
- `torch_compile_backend`: None |
|
- `torch_compile_mode`: None |
|
- `include_tokens_per_second`: False |
|
- `include_num_input_tokens_seen`: False |
|
- `neftune_noise_alpha`: None |
|
- `optim_target_modules`: None |
|
- `batch_eval_metrics`: False |
|
- `eval_on_start`: False |
|
- `use_liger_kernel`: False |
|
- `eval_use_gather_object`: False |
|
- `average_tokens_across_devices`: False |
|
- `prompts`: None |
|
- `batch_sampler`: no_duplicates |
|
- `multi_dataset_batch_sampler`: round_robin |
|
|
|
</details> |
|
|
|
### Framework Versions |
|
- Python: 3.12.8 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.51.3 |
|
- PyTorch: 2.5.1+cu124 |
|
- Accelerate: 1.7.0 |
|
- Datasets: 3.2.0 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Sentence Transformers |
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|
|
#### MultipleNegativesRankingLoss |
|
```bibtex |
|
@misc{henderson2017efficient, |
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
year={2017}, |
|
eprint={1705.00652}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL} |
|
} |
|
``` |
|
|
|
<!-- |
|
## Glossary |
|
|
|
*Clearly define terms in order to be accessible across audiences.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Authors |
|
|
|
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.* |
|
--> |
|
|
|
<!-- |
|
## Model Card Contact |
|
|
|
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.* |
|
--> |