metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:96
  - loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-small-en-v1.5
widget:
  - source_sentence: What are the de facto required fields in a SAM/BAM read group?
    sentences:
      - >-
        Question: Several gene set enrichment methods are available; the most
        famous/popular is the Broad Institute tool. Many other tools are
        available (see for example the biocView of GSE, which lists 82 different
        packages). There are several parameters to consider:


        the statistic used to order the genes,

        if it is competitive or self-contained,

        if it is supervised or not,

        and how the enrichment score is calculated.


        I am using the fgsea (Fast Gene Set Enrichment Analysis) package to
        calculate the enrichment scores, and someone told me that the numbers
        are different from the ones produced by the Broad Institute tool,
        despite all the other parameters being equivalent.

        Are these two methods (fgsea and Broad Institute GSEA) equivalent ways
        to calculate the enrichment score?

        I looked at the algorithms in both papers, and they seem fairly
        similar, but I don't know whether they are equivalent on real datasets.

        Is there any article reviewing and comparing how the enrichment score
        method affects the result?


        Answer: According to the FGSEA preprint:


        We ran reference GSEA with default parameters. The permutation number
          was set to 1000, which means that for each input gene set 1000
          independent samples were generated. The run took 100 seconds and
          resulted in 79 gene sets with GSEA-adjusted FDR q-value of less than
          10⁻². All significant gene sets were in a positive mode. First, to get
          a similar nominal p-value accuracy we ran the FGSEA algorithm on 1000
          permutations. This took 2 seconds, but resulted in no significant hits
          after multiple testing correction (with FDR < 1%).

        Thus, FGSEA and GSEA are not identical.

        And again in the conclusion:


        Consequently, gene sets can be ranked more precisely in the results
          and, which is even more important, standard multiple testing
          correction methods can be applied instead of approximate ones as in
          [GSEA].

        The author argues that FGSEA is more accurate, so it can't be
        equivalent.

        If you are interested specifically in the enrichment score, that was
        addressed by the author in the preprint comments:


        Values of enrichment scores and normalized enrichment scores are the
          same for both broad version and fgsea.

        So that part seems to be the same.
      - >-
        Question: I am running samtools mpileup (v1.4) on a bam file with very
        choppy coverage (ChIP-seq style data). I want to get a first-pass list
        of positions with SNVs and their frequency as reported by the read
        counts, but no matter what I do, I keep getting all SNVs filtered out as
        not passing QC.

        What's the magic parameter set for an initial list of SNVs and
        frequencies?

        EDIT: this is a question I posted on "the other" website, but didn't get
        a reply there.


        Answer: I used this in the past for ChIP-seq data and it generated SNVs:

        samtools mpileup \

        --uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \

        --fasta-ref ref.fa file.bam \

        | bcftools call --consensus-caller \

        > out.vcf


        This was samtools 1.3 in case that makes a difference.
      - >-
        Question: The SAM specification indicates that each read group must have
        a unique ID field, but does not mark any other field as required. 

        I have also discovered that htsjdk throws exceptions if the sample (SM)
        field is empty, though there is no indication in the specification that
        this is required. 

        Are there other read group fields that I should expect to be required by
        common tools? 


        Answer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM
        spec (see the .pages file; you need a Mac to open it). When the spec was
        transitioned to LaTeX, this requirement was mysteriously dropped. Picard is
        conforming to the initial spec. Anyway, the sample tag is important to
        quite a few tools. I would encourage you to add it.
  - source_sentence: Is the optional SAM NM field strictly computable from the MD and CIGAR?
    sentences:
      - >-
        Question: I'm looking for tools to check the quality of a VCF I have of
        a human genome. I would like to check the VCF against publicly known
        variants across other human genomes, e.g. how many SNPs are already in
        public databases, whether insertions/deletions are at known positions,
        insertion/deletion length distribution, other SNVs/SVs, etc.? I suspect
        that there are resources from previous projects to check for known SNPs
        and InDels by human subpopulations.

        What resources exist for this, and how do I do it? 


        Answer: To achieve (at least some of) your goals, I would recommend the
        Variant Effect Predictor (VEP). It is a flexible tool that provides
        several types of annotations on an input .vcf file.  I agree that ExAC
        is the de facto gold standard catalog for human genetic variation in
        coding regions.  To see the frequency distribution of variants by global
        subpopulation make sure "ExAC allele frequencies" is checked in addition
        to the 1000 genomes. 

        Output in the web-browser:


        If you download the annotated .vcf, frequencies will be in the INFO
        field:

        ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations
        from Ensembl VEP. Format:
        Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|ExAC_AF|ExAC_Adj_AF|ExAC_AFR_AF|ExAC_AMR_AF|ExAC_EAS_AF|ExAC_FIN_AF|ExAC_NFE_AF|ExAC_OTH_AF|ExAC_SAS_AF|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE


        The previously mentioned Annovar can also annotate with ExAC allele
        frequencies.  Finally, should mention the newest whole-genome resource,
        gnomAD.
      - >-
        Question: I produced a bam file by aligning reads to a small set of
        synthetic sequences using bwa-mem.

        I am heavily filtering reads that are not paired and of a certain
        orientation.

        Applying the filtering, I get a few thousands of reads:

        samtools view -h $myfilebam | \

        samtools view -h -F4 - | \

        samtools view -h -F8 - | \

        samtools view -h -F256 - | \

        samtools view -h -F512 - | \

        samtools view -h -F1024 - | \

        samtools view -h -F2048 - | \

        samtools view -h -f16 - | \

        samtools view -h -f32 -  | wc -l


        Gives me 89502 reads.

        If I then pipe this into samtools mpileup, I get no results:

        samtools view -h $myfilebam | \

        samtools view -h -F4 - | \

        samtools view -h -F8 - | \

        samtools view -h -F256 - | \

        samtools view -h -F512 - | \

        samtools view -h -F1024 - | \

        samtools view -h -F2048 - | \

        samtools view -h -f16 - | \

        samtools view -h -f32 -  | \

        samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l


        Returns 0.

        I tried different combinations of filtering; when I use both -f 16 and
        -f 32 it returns empty, but if I use either one on its own, then it works:

        samtools view -h $myfilebam | \

        samtools view -h -F4 - | \

        samtools view -h -F8 - | \

        samtools view -h -F256 - | \

        samtools view -h -F512 - | \

        samtools view -h -F1024 - | \

        samtools view -h -F2048 - | \

        samtools view -h -f16 - | \

        samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l


        Returns 1056.

        Any ideas why? My thinking was that it would work with --excl-flags 0.

        EDIT: substituting mpileup for depth does work, and prints out each
        position and the depth as expected.

        EDIT2: adding -q 0 to mpileup gives the same empty result.

        Thanks in advance


        Answer: By using -h in the samtools view command, you're including all
        the header lines in your word count. If you happen to have about 89500
        reference sequences, then the lengths of those would all appear in the
        header and inflate the -h word count, but not the mpileup count. Try
        piping it through an additional samtools view (i.e. without -h) and see
        if the counts change:

        ...

        samtools view -h -f32 -  | \

        samtools view | wc -l


        Also, samtools mpileup by default only considers high-quality bases and
        concordant reads. Try adding a -A to your mpileup line (which stops
        anomalous read pairs from being discarded):

        ...

        samtools mpileup -A -Q0 -B -d 999999 - | wc -l


        Whether or not this is actually a good idea will be dependent on what
        you want to get out of the analysis, and what the downstream programs /
        analyses are expecting.
      - >-
        Question: From SAM Optional Fields Specification the NM field is 


        Edit distance to the reference, including ambiguous bases but excluding
        clipping


        Assuming both the MD and CIGAR are present, is the edit distance simply
        the number of characters [A-Z] appearing in the MD field plus the number
        of bases inserted (xI, if any) from the CIGAR string? Are there any
        other complications? 


        Answer: Assuming both the MD and CIGAR are present and correct, then
        yes, you can parse both to get the edit distance (NM auxiliary tag). One
        big caveat to this is that there's a reason that the samtools calmd
        command exists, since it's historically been the case that not all
        aligners have output correct MD strings. It's rare for the CIGAR string
        to be wrong and that'd be more of a catastrophic error on the part of an
        aligner. For what it's worth, if the NM auxiliary is absent on a given
        alignment but present on others produced by the same aligner then it's
        fair to assume NM:i:0 for a given alignment by default (many aligners
        only produce NM:i:XXX if the edit distance is at least 1).
  - source_sentence: How to read structural variant VCF?
    sentences:
      - >-
        Question: I am calling SNPs from WGS samples produced at my lab. I am
        currently using bwa-mem for mapping Illumina reads as it is recommended
        by GATK best practice. However, bwa is a bit slow. I heard from my
        colleague that SNAP is much faster than bwa. I tried it on a small set
        of reads and it is indeed faster. However, I am not sure how it works
        with downstream SNP callers, so here are my questions: have you used
        SNAP for short-read mapping? What is your experience? Does SNAP work
        well with SNP callers like GATK and freebayes? Thanks!


        Answer: GATK best practices are explicitly meant to consume BWA-MEM
        generated BAMs. Whilst SNAP may be faster, the Broad will not have
        tested it for compatibility with GATK, so you can't guarantee that
        using it won't have unexpected consequences.

        As such you'd be better off using BWA MEM because I assume accurately
        called variation is always better than fast and incorrectly called
        variation.  The main issue you'll have is ensuring shorter split hits
        and mapping quality are reported in the same way as bwa MEM -M which
        GATK/Picard is expecting.  Ultimately however you'd be better off
        posting this question on the GATK forum. 

        It's also worth noting that the soon-to-be-released GATK 4 will utilise
        bwaspark, which can distribute its alignment processes across Apache
        Spark for increased performance. Consequently I can't see SNAP being
        adopted anytime soon.
      - >-
        Question: I have a computer engineering background, not biology.

        I started working on a bioinformatics project recently, which involves
        de-novo assembly. I came to know the terms Transcriptome and Genome, but
        I cannot identify the difference between these two.

        I know a transcriptome is the set of all messenger RNA molecules in a
        cell, but am not sure how this is different from a genome.


        Answer: In brief, the “genome” is the collection of all DNA present in
        the nucleus and the mitochondria of a somatic cell. The initial product
        of genome expression is the “transcriptome”, a collection of RNA
        molecules derived from those genes.
      - >-
        Question: The IGSR has a sample for encoding structural variants in the
        VCF 4.0 format.

        An example from the site (the first record):

        #CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001

        1 2827693   .
        CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA 
        C . PASS 
        SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66
        GT:GQ 1/1:13.9


        How to read it? From what I can see:


        This is a deletion (SVTYPE=DEL)

        The end position of the variant comes before the starting position
        (reverse strand?)

        The reference starts from 2827693 to 2827680 (13 bases on the reverse
        strand)

        The difference between reference and alternative is 66 bases (SVLEN=-66)


        This doesn't sound right to me. For instance, I don't see where exactly
        the deletion starts. The SVLEN field says 66 bases deleted, but where?
        2827693 to 2827680 only has 13 bases between.

        Q: How to read the deletion correctly from this structural VCF record?
        Where is the missing 66-13=53 bases?


        Answer: I just received a reply from 1000Genomes regarding this. I'll
        post it in its entirety below:


        Looking at the example you mention, I find it difficult to come up with
        an interpretation of the information whereby the stated end seems to be
        correct, so I believe that this may indeed be an error.
        Since the v4.0 was created, however, new versions of VCF have been
        introduced,
          improving and correcting the specification. The current version is v4.3
          (http://samtools.github.io/hts-specs/). I believe the first record shown on
          page 11 provides an accurate example of this type of deletion.
        I will update the web page to include this information.


        So we can take this as official confirmation that we were all correct in
        suspecting the example was just wrong.
  - source_sentence: Publicly available genome sequence database for viruses?
    sentences:
      - >-
        Question: This question is based on a question on BioStars  posted >2
        years ago by user jack.

        It describes a very frequent problem of generating GO annotations for
        non-model organisms. While it is based on some specific format and
        single application (Ontologizer), it would be useful to have a general
        description of the pathway to getting to a GAF file. 

        Note that the input format is lacking a bit of essential information,
        like how it was obtained, so it is hard to assign an evidence code.
        Therefore, let's assume that the assignments of GO terms were done
        automagically.


        I want to do the gene enrichment using Ontologizer without a
          predefined association file (it's not a model organism).
        I have parsed a file with two columns for that organism like this : 

        geneA  GO:0006950,GO:0005737

        geneB 
        GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031

        geneC 
        GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863

        geneD 
        GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528


        I have downloaded the .ob file from Gene Ontology, which contains
          this information (from here):
        !

        ! GO IDs (primary only) and name text strings

        ! GO:0000000 [tab] text string [tab] F|P|C

        ! where F = molecular function, P = biological process, C = cellular
        component

        !

        GO:0000001  mitochondrion inheritance   P

        GO:0000002  mitochondrial genome maintenance    P

        GO:0000003  reproduction    P

        GO:0000005  ribosomal chaperone activity    F

        GO:0000006  high affinity zinc uptake transmembrane transporter
        activity    F

        GO:0000007  low-affinity zinc ion transmembrane transporter activity   
        F

        GO:0000008  thioredoxin F

        GO:0000009  alpha-1,6-mannosyltransferase activity  F

        GO:0000010  trans-hexaprenyltranstransferase activity   F

        GO:0000011  vacuole inheritance P


        What I need as output is a .gaf file in the following format (in
          the format of the files here):
        !gaf-version: 2.0


        !Project_name: Leishmania major GeneDB


        !URL: http://www.genedb.org/leish


        !Contact Email: mb4@sanger.ac.uk

         GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770        GO:0003723    PMID:22396527    ISO    GeneDB:Tb927.10.10130    F    mitochondrial RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515    20120910    GeneDB_Lmajor       
         GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770        GO:0044429    PMID:20660476    ISS        C    mitochondrial RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515    20100803    GeneDB_Lmajor
         GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770        GO:0016554    PMID:22396527    ISO    GeneDB:Tb927.10.10130    P    mitochondrial RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515    20120910    GeneDB_Lmajor
         GeneDB_Lmajor    LmjF.36.4770    LmjF.36.4770        GO:0048255    PMID:22396527    ISO    GeneDB:Tb927.10.10130    P    mitochondrial RNA binding complex 1 subunit, putative    LmjF36.4770    gene    taxon:347515    20120910    GeneDB_Lmajor  

        How to create your own GO association file (gaf)?


        Answer: Here's a Perl script that can do this:

        #!/usr/bin/env perl 

        use strict;

        use warnings;


        ## Change this to whatever taxon you are working with

        my $taxon = 'taxon:1000';

        chomp(my $date = `date +%Y%m%d`);  # YYYYMMDD


        my (%aspect, %gos);

        ## Read the GO.terms_and_ids file to get the aspect (sub ontology)

        ## of each GO term. 

        open(my $fh, $ARGV[0]) or die "Need a GO.terms_and_ids file as 1st arg:
        $!\n";

        while (<$fh>) {
            next if /^!/;
            chomp;
            my @fields = split(/\t/);
            ## $aspect{GO:0000001} = 'P'
            $aspect{$fields[0]} = $fields[2];
        }

        close($fh);


        ## Read the list of gene annotations

        open($fh, $ARGV[1]) or die "Need a list of gene annotations as 2nd arg:
        $!\n";

        while (<$fh>) {
            chomp;
            my ($gene, @terms) = split(/[\s,]+/);
            ## $gos{geneA} = (go1, go2 ... goN)
            $gos{$gene} = [ @terms ];
        }

        close($fh);


        foreach my $gene (keys(%gos)) {
            foreach my $term (@{$gos{$gene}}) {
                ## Warn and skip if there is no aspect for this term
                if (!$aspect{$term}) {
                    print STDERR "Unknown GO term ($term) for gene $gene\n";
                    next;
                }
                ## Build a pseudo GAF line 
                my @out = ('DB', $gene, $gene, ' ', $term, 'PMID:foo', 'TAS', ' ', $aspect{$term},
                                     $gene, ' ', 'protein', $taxon, $date, 'DB', ' ', ' ');
                print join("\t", @out). "\n";
            }
        }


        Make it executable and run it with the GO.terms_and_ids file as the 1st
        argument and the list of gene annotations as the second. Using the
        current GO.terms_and_ids and the example annotations in the question, I
        get:

        $ foo.pl GO.terms_and_ids file.gos 

        DB  geneD   geneD       GO:0005634  PMID:foo    TAS     C   geneD      
        protein taxon:1000  20170308    DB       

        DB  geneD   geneD       GO:0003677  PMID:foo    TAS     F   geneD      
        protein taxon:1000  20170308    DB       

        DB  geneD   geneD       GO:0030154  PMID:foo    TAS     P   geneD      
        protein taxon:1000  20170308    DB       

        Unknown GO term (GO:0006350) for gene geneD

        DB  geneD   geneD       GO:0006355  PMID:foo    TAS     P   geneD      
        protein taxon:1000  20170308    DB       

        DB  geneD   geneD       GO:0007275  PMID:foo    TAS     P   geneD      
        protein taxon:1000  20170308    DB       

        DB  geneD   geneD       GO:0030528  PMID:foo    TAS     F   geneD      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0016020  PMID:foo    TAS     C   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0005524  PMID:foo    TAS     F   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0006468  PMID:foo    TAS     P   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0005737  PMID:foo    TAS     C   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0004674  PMID:foo    TAS     F   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0006914  PMID:foo    TAS     P   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0016021  PMID:foo    TAS     C   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneB   geneB       GO:0015031  PMID:foo    TAS     P   geneB      
        protein taxon:1000  20170308    DB       

        DB  geneA   geneA       GO:0006950  PMID:foo    TAS     P   geneA      
        protein taxon:1000  20170308    DB       

        DB  geneA   geneA       GO:0005737  PMID:foo    TAS     C   geneA      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0003779  PMID:foo    TAS     F   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0006941  PMID:foo    TAS     P   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0005524  PMID:foo    TAS     F   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0003774  PMID:foo    TAS     F   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0005516  PMID:foo    TAS     F   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0005737  PMID:foo    TAS     C   geneC      
        protein taxon:1000  20170308    DB       

        DB  geneC   geneC       GO:0005863  PMID:foo    TAS     C   geneC      
        protein taxon:1000  20170308    DB       


        Note that this is very much a pseudo-GAF file since most of the fields
        apart from the gene name, GO term and sub-ontology are fake. It should
        still work for what you need, however.
      - >-
        Question: As a small introductory project, I want to compare genome
        sequences of  different strains of influenza virus.

        What are the publicly available databases of influenza virus gene/genome
        sequences?


        Answer: There are a few different influenza virus database resources:


        The Influenza Research Database (IRD) (a.k.a FluDB - based upon URL)


        A NIAID Bioinformatics Resource Center or BRC which highly curates the
        data brought in and integrates it with numerous other relevant data
        types


        The NCBI Influenza Virus Resource


        A sub-project of the NCBI with data curated over and above the GenBank
        data that is part of the NCBI


        The GISAID EpiFlu Database


        A database of sequences from the Global Initiative on Sharing All
        Influenza Data. Has unique data from many countries but requires users
        to agree to a data sharing policy.


        The OpenFluDB


        Former GISAID database that contains some sequence data that GenBank
        does not have.


        For those who also may be interested in other virus databases, there
        are:


        Virus Pathogen Resource (VIPR)


        A companion portal to the IRD, which hosts curated and integrated data
        for most other NIAID A-C virus pathogens including (but not limited to)
        Ebola, Zika, Dengue, Enterovirus, and Hepatitis C


        LANL HIV database


        Los Alamos National Laboratory HIV database with HIV data and many
        useful tools for all virus bioinformatics


        PaVE: Papilloma virus genome database (from quintik comment)


        NIAID developed and maintained Papilloma virus bioinformatics portal


        Disclaimer: I used to work for the IRD / VIPR and currently work for
        NIAID.
      - >-
        Question: I have a set of genomic ranges that are potentially
        overlapping. I want to count the number of ranges at certain positions
        using R. 

        I'm pretty sure there are good solutions, but I seem to be unable to
        find them. 

        Solutions like cut or findIntervals don't achieve what I want as they
        only count on one vector or accumulate by all values <= break.

        Also countMatches {GenomicRanges} doesn't seem to cover it.

        Probably one could use Bedtools, but I don't want to leave R.

        I could only come up with a hilariously slow solution

        # generate test data

        testdata <- data.frame(chrom = rep(seq(1,10),10),
                               starts = abs(rnorm(100, mean = 1, sd = 1)) * 1000,
                               ends = abs(rnorm(100, mean = 2, sd = 1)) * 2000)

        # make sure that all end coordinates are bigger than start

        # this is a requirement of the original data

        testdata <- testdata[testdata$ends - testdata$starts > 0,]


        # count overlapping ranges on certain positions

        count.data <- lapply(unique(testdata$chrom), function(chromosome){
            tmp.inner <- lapply(seq(1,10000, by = 120), function(i){
                sum(testdata$chrom == chromosome & testdata$starts <= i & testdata$ends >= i)
            })
            return(unlist(tmp.inner))
        })


        # generate a data.frame containing all data

        df.count.data <- ldply(count.data, rbind)


        # ideally the chromosome will be columns and not rows

        t(df.count.data)


        Answer: GenomicRanges::countOverlaps seems to be what you’re after:

        position_ranges = GRanges(position$chrom, IRanges(position, position))

        ranges_at_position = countOverlaps(position_ranges, granges)
  - source_sentence: samtools depth print out all positions
    sentences:
      - >-
        Question: I have around 3,000 short sequences, each approximately 10Kb
        long. What are the best ways to find the motifs among all of these
        sequences? Is there a certain software/method recommended?

        There are several ways to do this. My goal would be to:

        (1) Check for motifs repeated within individual sequences

        (2) Check for motifs shared among all sequences

        (3) Check for the presence of "expected" or known motifs

        With respect to #3, I'm also curious: if I find e.g. trinucleotide
        sequences, how does one check the context around these regions?

        Thank you for the recommendations/help!


        Answer: For (3), this page has a lot of links to pattern/motif finding
        tools. Following through the YMF link on that page, I came across the
        University of Washington Motif Discovery section. Of these, projection
        seemed to be the only downloadable tool. I find it interesting how old
        all these tools are; maybe the introduction of microarrays and NGS has
        made them all redundant.

        Your sub-problem (2) seems similar to the problem I'm having with
        Nippostrongylus brasiliensis genome sequences, where I'd like to find
        regions of very high homology (length 500bp to 20kb or more, 95-99%
        similar) that are repeated throughout the genome. These sequences are
        killing the assembly.

        The main way I can find these regions is by looking at a coverage plot
        of long nanopore reads mapped to the assembled genome (using GraphMap or
        BWA). Any regions with substantially higher than median coverage are
        likely to be shared repeats.

        I've played around in the past with chopping up the reads to smaller
        sizes, which works better for hitting smaller repeated regions that are
        such a small proportion of most reads that they are never mapped to all
        the repeated locations. I wrote my own script a while back to chop up
        reads (for a different purpose), which produces a FASTA/FASTQ file where
        all reads are exactly the same length. For some unknown reason I took
        the time to document that script "properly" using POD, so here's a short
        summary:


        Converts all sequences in the input FASTA file to the same length.
             Sequences shorter than the target length are dropped, and sequences longer
             than the target length are split into overlapping subsequences covering
             the entire range. This prepares the sequences for use in an
             overlap-consensus assembler requiring constant-length sequences (such as
             edena).

        And here's the syntax:

        $ ./normalise_seqlengths.pl -h

        Usage:
            ./normalise_seqlengths.pl <reads.fa> [options]

          Options:
            -help
              Only display this help message

            -fraglength
              Target fragment length (in base-pairs, default 2000)

            -overlap
              Minimum overlap length (in base-pairs, default 200)

            -short
              Keep short sequences (shorter than fraglength)
      - >-
        Question: Without going into too much background, I just joined up with
        a lab as a bioinformatics intern while I'm completing my masters degree
        in the field. The lab has data from an RNA-seq they outsourced, but the
        only problem is that the only data they have is preprocessed from the
        company that did the sequencing: filtering the reads, aligning them, and
        putting the aligned reads through RSEM. I currently have output from
        RSEM for each of the four samples consisting of: gene id, transcript
        id(s), length, expected count, and FPKM. I am attempting to get the
        FASTQ files from the sequencing, but for now, this is what I have, and
        I'm trying to get something out of it if possible.

        I found this article that talks about how expected read counts can be
        better than raw read counts when analyzing differential expression using
        EBSeq; it's just one guy's opinion, and it's from 2014, so it may be
        wrong or outdated, but I thought I'd give it a try since I have the
        expected counts.

        However, I have just a couple of questions about running EBSeq that I
        can't find the answers to:

        1: In the output RSEM files I have, not all genes are represented in
        each, about 80% of them are, but for the ones that aren't, should I
        remove them before analysis with EBSeq? It runs when I do, but I'm not
        sure if it is correct.

        2: How do I know which normalization factor to use when running EBSeq?
        This is more of a conceptual question rather than a technical question.

        Thanks!


        Answer: Yes, that blog post does represent just one guy's opinion (hi!)
        and it does date all the way back to 2014, which is, like, decades in
        genomics years. :-) By the way, there is quite a bit of literature
        discussing the improvements that expected read counts derived from an
        Expectation Maximization algorithm provide over raw read counts. I'd
        suggest reading the RSEM papers for a start[1][2].

        But your main question is about the mechanics of running RSEM and EBSeq.
        First, RSEM was written explicitly to be compatible with EBSeq, so I'd
        be very surprised if it does not work correctly out-of-the-box. Second,
        EBSeq's MedianNorm function worked very well in my experience for
        normalizing the library counts. Along those lines, the blog you
        mentioned above has another post that you may find useful.

        But all joking aside, these tools are indeed dated. Alignment-free
        RNA-Seq tools provide orders-of-magnitude improvements in runtime over
        the older alignment-based alternatives, with comparable accuracy.
        Sailfish was the first in a growing list of tools that now includes
        Salmon and Kallisto. When starting a new analysis from scratch (i.e. if
        you ever get the original FASTQ files), there's really no good reason
        not to estimate expression using these much faster tools, followed by a
        differential expression analysis with DESeq2, edgeR, or sleuth.


        1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene
        expression estimation with read mapping uncertainty. Bioinformatics,
        26(4):493–500, doi:10.1093/bioinformatics/btp692.

        2Li B, Dewey C (2011) RSEM: accurate transcript quantification from
        RNA-Seq data with or without a reference genome. BMC Bioinformatics,
        12:323, doi:10.1186/1471-2105-12-323.
      - >-
        Question: I am trying to use samtools depth (v1.4) with the -a option
        and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and
        chrM to print out the coverage at every position:

        cat GRCh38.karyo.bed | awk '{print $3}' | datamash sum 1

        3088286401


        I would like to know how to run samtools depth so that it produces
        3,088,286,401 entries when run against a GRCh38 bam file:

        samtools depth -b $bedfile -a $inputfile


        I tried it for a few bam files that were aligned the same way, and I get
        differing number of entries:

        3087003274

        3087005666

        3087007158

        3087009435

        3087009439

        3087009621

        3087009818

        3087010065

        3087010408

        3087010477

        3087010481

        3087012115

        3087013147

        3087013186

        3087013500

        3087149616


        Is there a special flag in samtools depth so that it reports all entries
        from the bed file?

        If samtools depth is not the best tool for this, what would be the
        equivalent with sambamba depth base?

        sambamba depth base --min-coverage=0 --regions $bedfile $inputfile


        Any other options?


        Answer: You might try using bedtools genomecov instead. If you provide
        the -d option, it reports the coverage at every position in the BAM
        file.

        bedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"


        You can also provide a BED file if you just want to calculate coverage
        in the target region.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-small-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
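
For reference, the pipeline above (a Transformer encoder, CLS-token pooling, then L2 normalization) can also be reproduced with plain transformers. This is a minimal sketch, assuming the checkpoint loads with AutoModel/AutoTokenizer; "your-model-id" is a placeholder rather than this repository's actual id.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # placeholder id
model = AutoModel.from_pretrained("your-model-id")
model.eval()

inputs = tokenizer(
    ["samtools depth print out all positions"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# CLS-token pooling, matching pooling_mode_cls_token=True above
cls_embeddings = outputs.last_hidden_state[:, 0]
# L2 normalization, matching the Normalize() module above
embeddings = torch.nn.functional.normalize(cls_embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 384])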

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'samtools depth print out all positions',
    'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target  region.',
    "Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
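
The pairwise similarities above extend naturally to a small semantic-search loop: encode a query, encode a corpus, and rank by cosine similarity. The sketch below reuses the model loaded above; the query and corpus strings are illustrative only.

query_embedding = model.encode("How do I report coverage at every position?")
corpus = [
    "samtools depth print out all positions",
    "Publicly available genome sequence database for viruses?",
]
corpus_embeddings = model.encode(corpus)

# Cosine similarity of the query against each corpus entry, shape [1, 2]
scores = model.similarity(query_embedding, corpus_embeddings)
best = int(scores[0].argmax())
print(corpus[best], float(scores[0][best]))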

Training Details

Training Dataset

Unnamed Dataset

  • Size: 96 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 96 samples:
    • sentence_0: type string; min: 6 tokens, mean: 14.93 tokens, max: 34 tokens
    • sentence_1: type string; min: 103 tokens, mean: 397.92 tokens, max: 512 tokens
  • Samples:
    sentence_0: Using shells other than bash
    sentence_1: Question: As someone who's beginning to delve into bioinformatics, I'm noticing that, like biology, there are industry standards here; similar to Illumina in genomics and bowtie for alignment, many people use bash as their shell.
    Is using a shell besides bash going to cause issues for me?

    Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell.
    If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell. See Csh Programming Considered Harmful and other essays/polemics.
    A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools. It's not uncommon to have things like

    Set some variables pointing at reference data and add the script to your PATH to run it:
    export...
    sentence_0: Linear models of complex diseases
    sentence_1: Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice).
    For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't)
    My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B.
    If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains.
    How can such a disease be described in terms of linear models with such data and information?

    Answer: There are t...
    sentence_0: Detecting portions of human proteins with high degree of microbial similarity
    sentence_1: Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem.
    My goal is to take a list of human proteins, and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with list of FASTA sequences, and have an easy way to generate an output of the corresponding high similarity segments of each protein.
    Are there existing tools or software that I should be aware of that will make my life easier?
    Thanks in advance.

    Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options:

    PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
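
For reference, a loss configured with these parameters can be built as in the minimal sketch below (shown against the base checkpoint; the full training wiring appears under Training Hyperparameters):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# scale=20.0 and cosine similarity match the parameters listed above; the other
# (sentence_0, sentence_1) pairs in each batch act as in-batch negatives.
loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=cos_sim)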
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • num_train_epochs: 1
  • fp16: True
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: round_robin
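
As an illustration only, here is a minimal sketch of how these non-default values map onto a sentence-transformers training run. The single (sentence_0, sentence_1) pair stands in for the 96 actual training samples, and the output path is a placeholder.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
loss = MultipleNegativesRankingLoss(model, scale=20.0)

# Stand-in for the 96 (sentence_0, sentence_1) training pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["samtools depth print out all positions"],
    "sentence_1": ["Question: ... Answer: try bedtools genomecov -d ..."],
})

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    fp16=True,  # requires a CUDA-capable GPU
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()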

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: round_robin

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 3.4.1
  • Transformers: 4.51.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.7.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}