--- tags: - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - dataset_size:96 - loss:MultipleNegativesRankingLoss base_model: BAAI/bge-small-en-v1.5 widget: - source_sentence: What are the de facto required fields in a SAM/BAM read group? sentences: - "Question: Several gene set enrichment methods are available, the most famous/popular\ \ is the Broad Institute tool. Many other tools are available (See for example\ \ the biocView of GSE which list 82 different packages). There are several parameters\ \ in consideration :\n\nthe statistic used to order the genes, \nif it competitive\ \ or self-contained,\nif it is supervised or not,\nand how is the enrichment score\ \ calculated.\n\nI am using the fgsea - Fast Gene Set Enrichment Analysis package\ \ to calculate the enrichment scores and someone told me that the numbers are\ \ different from the ones on the Broad Institute despite all the other parameters\ \ being equivalent.\nAre these two methods (fgsea and Broad Institute GSEA) equivalent\ \ to calculate the enrichment score?\nI looked to the algorithms of both papers,\ \ and they seem fairly similar, but I don't know if in real datasets they are\ \ equivalent or not.\nIs there any article reviewing and comparing how does the\ \ enrichment score method affect to the result?\n\nAnswer: According to the FGSEA\ \ preprint:\n\nWe ran reference GSEA with default parameters. The permutation\ \ number\n was set to 1000, which means that for each input gene set 1000\n \ \ independent samples were generated. The run took 100 seconds and\n resulted\ \ in 79 gene sets with GSEA-adjusted FDR q-value of less than\n 10−2. All significant\ \ gene sets were in a positive mode. First, to get\n a similar nominal p-values\ \ accuracy we ran FGSEA algorithm on 1000\n permutations. This took 2 seconds,\ \ but resulted in no significant hits\n due after multiple testing correction\ \ (with FRD ≤ 1%).\n\nThus, FGSEA and GSEA are not identical.\nAnd again in the\ \ conclusion:\n\nConsequently, gene sets can be ranked more precisely in the results\n\ \ and, which is even more important, standard multiple testing\n correction\ \ methods can be applied instead of approximate ones as in\n [GSEA].\n\nThe author\ \ argues that FGSEA is more accurate, so it can't be equivalent.\nIf you are interested\ \ specifically in the enrichment score, that was addressed by the author in the\ \ preprint comments:\n\nValues of enrichment scores and normalized enrichment\ \ scores are the\n same for both broad version and fgsea.\n\nSo that part seems\ \ to be the same." - 'Question: I am running samtools mpileup (v1.4) on a bam file with very choppy coverage (ChIP-seq style data). I want to get a first-pass list of positions with SNVs and their frequency as reported by the read counts, but no matter what I do, I keep getting all SNVs filtered out as not passing QC. What''s the magic parameter set for an initial list of SNVs and frequencies? EDIT: this is a question I posted on "the other" website, but didn''t get a reply there. Answer: I used this in the past for ChIP-seq data and it generated SNVs: samtools mpileup \ --uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \ --fasta-ref ref.fa file.bam \ | bcftools call --consensus-caller \ > out.vcf This was samtools 1.3 in case that makes a difference.' - "Question: The SAM specification indicates that each read group must have a unique\ \ ID field, but does not mark any other field as required. 
\nI have also discovered\ \ that htsjdk throws exceptions if the sample (SM) field is empty, though there\ \ is no indication in the specification that this is required. \nAre there other\ \ read group fields that I should expect to be required by common tools? \n\n\ Answer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM spec (see\ \ the .pages file; you need a mac to open it). When transitioned to Latex, this\ \ requirement was mysteriously dropped. Picard is conforming to the initial spec.\ \ Anyway, the sample tag is important to quite a few tools. I would encourage\ \ you to add it." - source_sentence: Is the optional SAM NM field strictly computable from the MD and CIGAR? sentences: - "Question: I'm looking for tools to check the quality of a VCF I have of a human\ \ genome. I would like to check the VCF against publicly known variants across\ \ other human genomes, e.g. how many SNPs are already in public databases, whether\ \ insertions/deletions are at known positions, insertion/deletion length distribution,\ \ other SNVs/SVs, etc.? I suspect that there are resources from previous projects\ \ to check for known SNPs and InDels by human subpopulations.\nWhat resources\ \ exist for this, and how do I do it? \n\nAnswer: To achieve (at least some of)\ \ your goals, I would recommend the Variant Effect Predictor (VEP). It is a flexible\ \ tool that provides several types of annotations on an input .vcf file. I agree\ \ that ExAC is the de facto gold standard catalog for human genetic variation\ \ in coding regions. To see the frequency distribution of variants by global\ \ subpopulation make sure \"ExAC allele frequencies\" is checked in addition to\ \ the 1000 genomes. \nOutput in the web-browser:\n\nIf you download the annotated\ \ .vcf, frequencies will be in the INFO field:\n##INFO=2 years ago\ \ by user jack.\nIt describes a very frequent problem of generating GO annotations\ \ for non-model organisms. While it is based on some specific format and single\ \ application (Ontologizer), it would be useful to have a general description\ \ of the pathway to getting to a GAF file. \nNote, that the input format is lacking\ \ a bit of essential information, like how it was obtained. Therefore, it is har\ \ to assign evidence code. Therefore, lets assume that the assignments of GO terms\ \ were done automagically. \n\nI want to do the Gene enrichment using Ontologizer\ \ without a\n predefined association file(it's not model organism). \nI have\ \ parsed a file with two columns for that organism like this : \ngeneA GO:0006950,GO:0005737\n\ geneB GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031\n\ geneC GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863\n\ geneD GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528\n\ \nI have downloaded the .ob file from Gene ontology file which contain\n this\ \ information (from here) : \n!\n! GO IDs (primary only) and name text strings\n\ ! GO:0000000 [tab] text string [tab] F|P|C\n! 
where F = molecular function, P\ \ = biological process, C = cellular component\n!\nGO:0000001 mitochondrion inheritance\ \ P\nGO:0000002 mitochondrial genome maintenance P\nGO:0000003 reproduction\ \ P\nGO:0000005 ribosomal chaperone activity F\nGO:0000006 high affinity\ \ zinc uptake transmembrane transporter activity F\nGO:0000007 low-affinity\ \ zinc ion transmembrane transporter activity F\nGO:0000008 thioredoxin F\n\ GO:0000009 alpha-1,6-mannosyltransferase activity F\nGO:0000010 trans-hexaprenyltranstransferase\ \ activity F\nGO:0000011 vacuole inheritance P\n\nWhat I need as output is\ \ .gaf file in the following format (in the\n format of the files here):\n!gaf-version:\ \ 2.0\n\n!Project_name: Leishmania major GeneDB\n\n!URL: http://www.genedb.org/leish\n\ \n!Contact Email: mb4@sanger.ac.uk\n\n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ \ GO:0003723 PMID:22396527 ISO GeneDB:Tb927.10.10130 F mitochondrial\ \ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ \ 20120910 GeneDB_Lmajor \n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ \ GO:0044429 PMID:20660476 ISS C mitochondrial RNA binding\ \ complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20100803\ \ GeneDB_Lmajor GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 \ \ GO:0016554 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial\ \ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ \ 20120910 GeneDB_Lmajor \n GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770\ \ GO:0048255 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial\ \ RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515\ \ 20120910 GeneDB_Lmajor \n\nHow to create your own GO association file\ \ (gaf)?\n\nAnswer: Here's a Perl script that can do this:\n#!/usr/bin/env perl\ \ \nuse strict;\nuse warnings;\n\n## Change this to whatever taxon you are working\ \ with\nmy $taxon = 'taxon:1000';\nchomp(my $date = `date +%Y%M%d`);\n\nmy (%aspect,\ \ %gos);\n## Read the GO.terms_and_ids file to get the aspect (sub ontology)\n\ ## of each GO term. \nopen(my $fh, $ARGV[0]) or die \"Need a GO.terms_and_ids\ \ file as 1st arg: $!\\n\";\nwhile (<$fh>) {\n next if /^!/;\n chomp;\n\ \ my @fields = split(/\\t/);\n ## $aspect{GO:0000001} = 'P'\n $aspect{$fields[0]}\ \ = $fields[2];\n}\nclose($fh);\n\n## Read the list of gene annotations\nopen($fh,\ \ $ARGV[1]) or die \"Need a list of gene annotattions as 2nd arg: $!\\n\";\nwhile\ \ (<$fh>) {\n chomp;\n my ($gene, @terms) = split(/[\\s,]+/);\n ## $gos{geneA}\ \ = (go1, go2 ... goN)\n $gos{$gene} = [ @terms ];\n}\nclose($fh);\n\nforeach\ \ my $gene (keys(%gos)) {\n foreach my $term (@{$gos{$gene}}) {\n ##\ \ Warn and skip if there is no aspect for this term\n if (!$aspect{$term})\ \ {\n print STDERR \"Unknown GO term ($term) for gene $gene\\n\";\n\ \ next;\n }\n ## Build a pseudo GAF line \n my\ \ @out = ('DB', $gene, $gene, ' ', $term, 'PMID:foo', 'TAS', ' ', $aspect{$term},\n\ \ $gene, ' ', 'protein', $taxon, $date, 'DB', ' ',\ \ ' ');\n print join(\"\\t\", @out). \"\\n\";\n }\n}\n\nMake it executable\ \ and run it with the GO.terms_and_ids file as the 1st argument and the list of\ \ gene annotations as the second. 
Using the current GO.terms_and_ids and the example\ \ annotations in the question, I get:\n$ foo.pl GO.terms_and_ids file.gos \nDB\ \ geneD geneD GO:0005634 PMID:foo TAS C geneD protein\ \ taxon:1000 20170308 DB \nDB geneD geneD GO:0003677 PMID:foo\ \ TAS F geneD protein taxon:1000 20170308 DB \nDB geneD\ \ geneD GO:0030154 PMID:foo TAS P geneD protein taxon:1000\ \ 20170308 DB \nUnknown GO term (GO:0006350) for gene geneD\nDB geneD\ \ geneD GO:0006355 PMID:foo TAS P geneD protein taxon:1000\ \ 20170308 DB \nDB geneD geneD GO:0007275 PMID:foo TAS\ \ P geneD protein taxon:1000 20170308 DB \nDB geneD geneD\ \ GO:0030528 PMID:foo TAS F geneD protein taxon:1000 20170308\ \ DB \nDB geneB geneB GO:0016020 PMID:foo TAS C geneB\ \ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0005524\ \ PMID:foo TAS F geneB protein taxon:1000 20170308 DB \ \ \nDB geneB geneB GO:0006468 PMID:foo TAS P geneB \ \ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0005737\ \ PMID:foo TAS C geneB protein taxon:1000 20170308 DB \ \ \nDB geneB geneB GO:0004674 PMID:foo TAS F geneB \ \ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0006914\ \ PMID:foo TAS P geneB protein taxon:1000 20170308 DB \ \ \nDB geneB geneB GO:0016021 PMID:foo TAS C geneB \ \ protein taxon:1000 20170308 DB \nDB geneB geneB GO:0015031\ \ PMID:foo TAS P geneB protein taxon:1000 20170308 DB \ \ \nDB geneA geneA GO:0006950 PMID:foo TAS P geneA \ \ protein taxon:1000 20170308 DB \nDB geneA geneA GO:0005737\ \ PMID:foo TAS C geneA protein taxon:1000 20170308 DB \ \ \nDB geneC geneC GO:0003779 PMID:foo TAS F geneC \ \ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0006941\ \ PMID:foo TAS P geneC protein taxon:1000 20170308 DB \ \ \nDB geneC geneC GO:0005524 PMID:foo TAS F geneC \ \ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0003774\ \ PMID:foo TAS F geneC protein taxon:1000 20170308 DB \ \ \nDB geneC geneC GO:0005516 PMID:foo TAS F geneC \ \ protein taxon:1000 20170308 DB \nDB geneC geneC GO:0005737\ \ PMID:foo TAS C geneC protein taxon:1000 20170308 DB \ \ \nDB geneC geneC GO:0005863 PMID:foo TAS C geneC \ \ protein taxon:1000 20170308 DB \n\nNote that this is very much a pseudo-GAF\ \ file since most of the fields apart from the gene name, GO term and sub-ontology\ \ are fake. It should still work for what you need, however." - 'Question: As a small introductory project, I want to compare genome sequences of different strains of influenza virus. What are the publicly available databases of influenza virus gene/genome sequences? Answer: There area few different influenza virus database resources: The Influenza Research Database (IRD) (a.k.a FluDB - based upon URL) A NIAID Bioinformatics Resource Center or BRC which highly curates the data brought in and integrates it with numerous other relevant data types The NCBI Influenza Virus Resource A sub-project of the NCBI with data curated over and above the GenBank data that is part of the NCBI The GISAID EpiFlu Database A database of sequences from the Global Initiative on Sharing All Influenza Data. Has unique data from many countries but requires user agree to a data sharing policy. The OpenFluDB Former GISAID database that contains some sequence data that GenBank does not have. 
For those who also may be interested in other virus databases, there are: Virus Pathogen Resource (VIPR) A companion portal to the IRD, which hosts curated and integrated data for most other NIAID A-C virus pathogens including (but not limited to) Ebola, Zika, Dengue, Enterovirus, and Hepatitis C LANL HIV database Los Alamos National Laboratory HIV database with HIV data and many useful tools for all virus bioinformatics PaVE: Papilloma virus genome database (from quintik comment) NIAID developed and maintained Papilloma virus bioinformatics portal Disclaimer: I used to work for the IRD / VIPR and currently work for NIAID.' - "Question: I have a set of genomic ranges that are potentially overlapping. I\ \ want to count the amount of ranges at certain positions using R. \nI'm Pretty\ \ sure there are good solutions, but I seem to be unable to find them. \nSolutions\ \ like cut or findIntervals don't achieve what I want as they only count on one\ \ vector or accumulate by all values <= break.\nAlso countMatches {GenomicRanges}\ \ doesn't seem to cover it.\nProbably one could use Bedtools, but I don't want\ \ to leave R.\nI could only come up with a hilariously slow solution\n# generate\ \ test data\ntestdata <- data.frame(chrom = rep(seq(1,10),10),\n \ \ starts = abs(rnorm(100, mean = 1, sd = 1)) * 1000,\n \ \ ends = abs(rnorm(100, mean = 2, sd = 1)) * 2000)\n\n# make sure that\ \ all end coordinates are bigger than start\n# this is a requirement of the original\ \ data\ntestdata <- testdata[testdata$ends - testdata$starts > 0,]\n\n# count\ \ overlapping ranges on certain positions\ncount.data <- lapply(unique(testdata$chrom),\ \ function(chromosome){\n tmp.inner <- lapply(seq(1,10000, by = 120), function(i){\n\ \ sum(testdata$chrom == chromosome & testdata$starts <= i & testdata$ends\ \ >= i)\n })\n return(unlist(tmp.inner))\n})\n\n# generate a data.frame\ \ containing all data\ndf.count.data <- ldply(count.data, rbind)\n\n# ideally\ \ the chromosome will be columns and not rows\nt(df.count.data)\n\nAnswer: GenomicRanges::countOverlaps\ \ seems to be what you’re after:\nposition_range = GRanges(position$chrom, IRanges(position,\ \ position, width = 1))\nranges_at_position = countOverlaps(position_ranges, granges)" - source_sentence: samtools depth print out all positions sentences: - "Question: I have around ~3,000 short sequences of approximately ~10Kb long. What\ \ are the best ways to find the motifs among all of these sequences? Is there\ \ a certain software/method recommended?\nThere are several ways to do this. My\ \ goal would be to:\n(1) Check for motifs repeated within individual sequences\n\ (2) Check for motifs shared among all sequences\n(3) Check for the presence of\ \ \"expected\" or known motifs\nWith respect to #3, I'm also curious if I find\ \ e.g. trinucleotide sequences, how does one check the context around these regions?\n\ Thank you for the recommendations/help!\n\nAnswer: For (3), this page has a lot\ \ of links to pattern/motif finding tools. Following through the YMF link on that\ \ page, I came across the University of Washington Motif Discovery section. Of\ \ these projection seemed to be the only downloadable tool. 
I find it interesting\ \ how old all these tools are; maybe the introduction of microarrays and NGS has\ \ made them all redundant.\nYour sub-problem (2) seems similar to the problem\ \ I'm having with Nippostrongylus brasiliensis genome sequences, where I'd like\ \ to find regions of very high homology (length 500bp to 20kb or more, 95-99%\ \ similar) that are repeated throughout the genome. These sequences are killing\ \ the assembly.\nThe main way I can find these regions is by looking at a coverage\ \ plot of long nanopore reads mapped to the assembled genome (using GraphMap or\ \ BWA). Any regions with substantially higher than median coverage are likely\ \ to be shared repeats.\nI've played around in the past with chopping up the reads\ \ to smaller sizes, which works better for hitting smaller repeated regions that\ \ are such a small proportion of most reads that they are never mapped to all\ \ the repeated locations. I wrote my own script a while back to chop up reads\ \ (for a different purpose), which produces a FASTA/FASTQ file where all reads\ \ are exactly the same length. For some unknown reason I took the time to document\ \ that script \"properly\" using POD, so here's a short summary:\n\nConverts all\ \ sequences in the input FASTA file to the same length.\n Sequences shorter\ \ than the target length are dropped, and sequences longer\n than the target\ \ length are split into overlapping subsequences covering\n the entire range.\ \ This prepares the sequences for use in an\n overlap-consensus assembler\ \ requiring constant-length sequences (such as\n edena).\n\nAnd here's the\ \ syntax:\n$ ./normalise_seqlengths.pl -h\nUsage:\n ./normalise_seqlengths.pl\ \ [options]\n\n Options:\n -help\n Only display this help\ \ message\n\n -fraglength\n Target fragment length (in base-pairs, default\ \ 2000)\n\n -overlap\n Minimum overlap length (in base-pairs, default\ \ 200)\n\n -short\n Keep short sequences (shorter than fraglength)" - 'Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I''m completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I''m trying to get something out of it if possible. I found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it''s just one guy''s opinion, and it''s from 2014, so it may be wrong or outdated, but I thought I''d give it a try since I have the expected counts. However, I have just a couple of questions about running EBSeq that I can''t find the answers to: 1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren''t, should I remove them before analysis with EBSeq? It runs when I do, but I''m not sure if it is correct. 2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question. Thanks! Answer: Yes, that blog post does represent just one guy''s opinion (hi!) 
and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I''d suggest reading the RSEM papers for a start[1][2]. But your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I''d be very surprised if it does not work correctly out-of-the-box. Second, EBSeq''s MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful. But all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there''s really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth. 1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692. 2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.' - 'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position: cat GRCh38.karyo.bed | awk ''{print $3}'' | datamash sum 1 3088286401 I would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file: samtools depth -b $bedfile -a $inputfile I tried it for a few bam files that were aligned the same way, and I get differing number of entries: 3087003274 3087005666 3087007158 3087009435 3087009439 3087009621 3087009818 3087010065 3087010408 3087010477 3087010481 3087012115 3087013147 3087013186 3087013500 3087149616 Is there a special flag in samtools depth so that it reports all entries from the bed file? If samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base? sambamba depth base --min-coverage=0 --regions $bedfile $inputfile Any other options? Answer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file. bedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov" You can also provide a BED file if you just want to calculate in the target region.' pipeline_tag: sentence-similarity library_name: sentence-transformers --- # SentenceTransformer based on BAAI/bge-small-en-v1.5 This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. 
## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 384 dimensions
- **Similarity Function:** Cosine Similarity

### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'samtools depth print out all positions',
    'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target region.',
    "Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
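Because the model ends with a `Normalize` module, the embeddings are unit-length and cosine similarity reduces to a dot product, which makes the model directly usable for semantic search. Below is a minimal retrieval sketch under the same `sentence_transformers_model_id` placeholder as above; the query and corpus strings are illustrative, not part of the training data:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical retrieval sketch; replace "sentence_transformers_model_id"
# with the actual model ID, as in the example above.
model = SentenceTransformer("sentence_transformers_model_id")

query = "samtools depth print out all positions"
corpus = [  # illustrative documents, not the training set
    "Question: How can I make samtools depth report every position in a BED file?",
    "Question: Which read group fields do common tools require in a SAM/BAM file?",
]

query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# model.similarity applies the model's configured similarity function
# (cosine similarity here) between the two sets of embeddings.
scores = model.similarity(query_embedding, corpus_embeddings)  # shape: [1, 2]
best = scores.argmax().item()
print(f"Best match ({scores[0, best].item():.3f}): {corpus[best]}")
```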
## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 96 training samples
* Columns: sentence_0 and sentence_1
* Approximate statistics based on the first 96 samples:
  |         | sentence_0 | sentence_1 |
  |:--------|:-----------|:-----------|
  | type    | string     | string     |
  | details |            |            |
* Samples:
  | sentence_0 | sentence_1 |
  |:-----------|:-----------|
  | Using shells other than bash | Question: As someone who's beginning to delve into bioinformatics, I'm noticing that like biology there are industry standards here, similar to Illumina in genomics and bowtie for alignment, many people use bash as shell.<br>Is using a shell besides bash going to cause issues for me?<br><br>Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell.<br>If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell. See Csh Programming Considered Harmful and other essays/polemics.<br>A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools. It's not uncommon to have things like<br><br>Set some variables pointing at reference data and add the script to your PATH to run it:<br>export... |
  | Linear models of complex diseases | Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice).<br>For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't)<br>My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B.<br>If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains.<br>How can such a disease be described in terms of linear models with such data and information?<br><br>Answer: There are t... |
  | Detecting portions of human proteins with high degree of microbial similarity | Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem.<br>My goal is to take a list of human proteins, and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with list of FASTA sequences, and have an easy way to generate an output of the corresponding high similarity segments of each protein.<br>Are there existing tools or software that I should be aware of that will make my life easier?<br>Thanks in advance.<br><br>Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options:<br><br>PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu... |
* Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim"
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `num_train_epochs`: 1
- `fp16`: True
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin
#### All Hyperparameters

<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin

</details>
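The original training script is not included in this card, but a run with the loss and non-default hyperparameters listed above can be sketched roughly as follows. The two-column dataset literal and the output directory are illustrative stand-ins, not the actual 96 training pairs:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Illustrative stand-in for the real 96 (sentence_0, sentence_1) pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["samtools depth print out all positions"],
    "sentence_1": ["Question: ... Answer: Try bedtools genomecov -d ..."],
})

# In-batch negatives ranking loss; scale=20.0 and cosine similarity are
# the defaults, matching the loss parameters shown above.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-small-en-v1.5-finetuned",  # illustrative
    num_train_epochs=1,
    per_device_train_batch_size=32,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```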
### Framework Versions
- Python: 3.12.8
- Sentence Transformers: 3.4.1
- Transformers: 4.51.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.7.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```