---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:96
- loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-small-en-v1.5
widget:
- source_sentence: What are the de facto required fields in a SAM/BAM read group?
sentences:
- >-
Question: Several gene set enrichment methods are available; the most
famous/popular is the Broad Institute tool. Many other tools are
available (see, for example, the biocView of GSE, which lists 82 different
packages). There are several parameters to consider:
the statistic used to order the genes,
whether it is competitive or self-contained,
whether it is supervised or not,
and how the enrichment score is calculated.
I am using the fgsea - Fast Gene Set Enrichment Analysis package to
calculate the enrichment scores and someone told me that the numbers are
different from the ones on the Broad Institute despite all the other
parameters being equivalent.
Are these two methods (fgsea and Broad Institute GSEA) equivalent to
calculate the enrichment score?
I looked at the algorithms in both papers, and they seem fairly similar,
but I don't know whether they are equivalent on real datasets.
Is there any article reviewing and comparing how the enrichment score
method affects the result?
Answer: According to the FGSEA preprint:
We ran reference GSEA with default parameters. The permutation number
was set to 1000, which means that for each input gene set 1000
independent samples were generated. The run took 100 seconds and
resulted in 79 gene sets with GSEA-adjusted FDR q-value of less than
10−2. All significant gene sets were in a positive mode. First, to get
a similar nominal p-value accuracy we ran the FGSEA algorithm on 1000
permutations. This took 2 seconds, but resulted in no significant hits
after multiple testing correction (with FDR ≤ 1%).
Thus, FGSEA and GSEA are not identical.
And again in the conclusion:
Consequently, gene sets can be ranked more precisely in the results
and, which is even more important, standard multiple testing
correction methods can be applied instead of approximate ones as in
[GSEA].
The author argues that FGSEA is more accurate, so it can't be
equivalent.
If you are interested specifically in the enrichment score, that was
addressed by the author in the preprint comments:
Values of enrichment scores and normalized enrichment scores are the
same for both broad version and fgsea.
So that part seems to be the same.
- >-
Question: I am running samtools mpileup (v1.4) on a bam file with very
choppy coverage (ChIP-seq style data). I want to get a first-pass list
of positions with SNVs and their frequency as reported by the read
counts, but no matter what I do, I keep getting all SNVs filtered out as
not passing QC.
What's the magic parameter set for an initial list of SNVs and
frequencies?
EDIT: this is a question I posted on "the other" website, but didn't get
a reply there.
Answer: I used this in the past for ChIP-seq data and it generated SNVs:
samtools mpileup \
--uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \
--fasta-ref ref.fa file.bam \
| bcftools call --consensus-caller \
> out.vcf
This was samtools 1.3 in case that makes a difference.
- >-
Question: The SAM specification indicates that each read group must have
a unique ID field, but does not mark any other field as required.
I have also discovered that htsjdk throws exceptions if the sample (SM)
field is empty, though there is no indication in the specification that
this is required.
Are there other read group fields that I should expect to be required by
common tools?
Answer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM
spec (see the .pages file; you need a Mac to open it). When the spec was
transitioned to LaTeX, this requirement was mysteriously dropped. Picard
is conforming to the initial spec. Anyway, the sample tag is important to
quite a few tools. I would encourage you to add it.
- source_sentence: Is the optional SAM NM field strictly computable from the MD and CIGAR?
sentences:
- >-
Question: I'm looking for tools to check the quality of a VCF I have of
a human genome. I would like to check the VCF against publicly known
variants across other human genomes, e.g. how many SNPs are already in
public databases, whether insertions/deletions are at known positions,
insertion/deletion length distribution, other SNVs/SVs, etc.? I suspect
that there are resources from previous projects to check for known SNPs
and InDels by human subpopulations.
What resources exist for this, and how do I do it?
Answer: To achieve (at least some of) your goals, I would recommend the
Variant Effect Predictor (VEP). It is a flexible tool that provides
several types of annotations on an input .vcf file. I agree that ExAC
is the de facto gold standard catalog for human genetic variation in
coding regions. To see the frequency distribution of variants by global
subpopulation make sure "ExAC allele frequencies" is checked in addition
to the 1000 genomes.
Output in the web-browser:
If you download the annotated .vcf, frequencies will be in the INFO
field:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations
from Ensembl VEP. Format:
Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|TSL|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|AA_AF|EA_AF|ExAC_AF|ExAC_Adj_AF|ExAC_AFR_AF|ExAC_AMR_AF|ExAC_EAS_AF|ExAC_FIN_AF|ExAC_NFE_AF|ExAC_OTH_AF|ExAC_SAS_AF|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE
The previously mentioned Annovar can also annotate with ExAC allele
frequencies. Finally, I should mention the newest whole-genome resource,
gnomAD.
- >-
Question: I produced a bam file by aligning reads to a small set of
synthetic sequences using bwa-mem.
I am heavily filtering reads that are not paired and of a certain
orientation.
Applying the filtering, I get a few thousand reads:
samtools view -h $myfilebam | \
samtools view -h -F4 - | \
samtools view -h -F8 - | \
samtools view -h -F256 - | \
samtools view -h -F512 - | \
samtools view -h -F1024 - | \
samtools view -h -F2048 - | \
samtools view -h -f16 - | \
samtools view -h -f32 - | wc -l
Gives me 89502 reads.
If I then pipe this into samtools mpileup, I get no results:
samtools view -h $myfilebam | \
samtools view -h -F4 - | \
samtools view -h -F8 - | \
samtools view -h -F256 - | \
samtools view -h -F512 - | \
samtools view -h -F1024 - | \
samtools view -h -F2048 - | \
samtools view -h -f16 - | \
samtools view -h -f32 - | \
samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l
Returns 0.
I tried different combinations of filtering; when I use both -f 16 and
-f 32 it returns empty, but if I use either one alone, then it works:
samtools view -h $myfilebam | \
samtools view -h -F4 - | \
samtools view -h -F8 - | \
samtools view -h -F256 - | \
samtools view -h -F512 - | \
samtools view -h -F1024 - | \
samtools view -h -F2048 - | \
samtools view -h -f16 - | \
samtools mpileup --excl-flags 0 -Q0 -B -d 999999 - | wc -l
Returns 1056.
Any ideas why? My thinking was that it would work with --excl-flags 0.
EDIT: substituting mpileup for depth does work, and prints out each
position and the depth as expected.
EDIT2: adding -q 0 to mpileup gives the same empty result.
Thanks in advance
Answer: By using -h in the samtools view command, you're including all
the header lines in your word count. If you happen to have about 89500
reference sequences, then the lengths of those would all appear in the
header and inflate the -h word count, but not the mpileup count. Try
piping it through an additional samtools view (i.e. without -h) and see
if the counts change:
...
samtools view -h -f32 - | \
samtools view | wc -l
Also, samtools mpileup by default only considers high-quality bases and
concordant reads. Try adding a -A to your mpileup line (which stops
anomalous read pairs from being discarded):
...
samtools mpileup -A -Q0 -B -d 999999 - | wc -l
Whether or not this is actually a good idea will be dependent on what
you want to get out of the analysis, and what the downstream programs /
analyses are expecting.
- >-
Question: From SAM Optional Fields Specification the NM field is
Edit distance to the reference, including ambiguous bases but excluding
clipping
Assuming both the MD and CIGAR are present, is the edit distance simply
the number of characters [A-Z] appearing in the MD field plus the number
of bases inserted (xI, if any) from the CIGAR string? Are there any
other complications?
Answer: Assuming both the MD and CIGAR are present and correct, then
yes, you can parse both to get the edit distance (NM auxiliary tag). One
big caveat to this is that there's a reason that the samtools calmd
command exists, since it's historically been the case that not all
aligners have output correct MD strings. It's rare for the CIGAR string
to be wrong and that'd be more of a catastrophic error on the part of an
aligner. For what it's worth, if the NM auxiliary is absent on a given
alignment but present on others produced by the same aligner then it's
fair to assume NM:i:0 for a given alignment by default (many aligners
only produce NM:i:XXX if the edit distance is at least 1).
- source_sentence: How to read structural variant VCF?
sentences:
- >-
Question: I am calling SNPs from WGS samples produced at my lab. I am
currently using bwa-mem for mapping Illumina reads as it is recommended
by GATK best practice. However, bwa is a bit slow. I heard from my
colleague that SNAP is much faster than bwa. I tried it on a small set
of reads and it is indeed faster. However, I am not sure how it works
with downstream SNP callers, so here are my questions: have you used
SNAP for short-read mapping? What is your experience? Does SNAP work
well with SNP callers like GATK and freebayes? Thanks!
Answer: GATK best practices are explicitly meant to consume BWA-MEM
generated BAMs. Whilst SNAP may be faster, the Broad will not have
tested it for compatibility with GATK, so you can't guarantee that using
it won't have unexpected consequences.
As such you'd be better off using BWA MEM, because accurately called
variation is always better than fast but incorrectly called variation.
The main issue you'll have is ensuring that shorter split hits and
mapping quality are reported in the same way as bwa mem -M, which
GATK/Picard is expecting. Ultimately, however, you'd be better off
posting this question on the GATK forum.
It's also worth noting that the soon-to-be-released GATK 4 will utilise
bwaspark, which can distribute its alignment processes across Apache
Spark for increased performance. Consequently I can't see SNAP being
adopted anytime soon.
- >-
Question: I have a computer engineering background, not biology.
I started working on a bioinformatics project recently, which involves
de-novo assembly. I came to know the terms Transcriptome and Genome, but
I cannot identify the difference between these two.
I know a transcriptome is the set of all messenger RNA molecules in a
cell, but am not sure how this is different from a genome.
Answer: In brief, the “genome” is the collection of all DNA present
in the nucleus and the mitochondria of a somatic cell. The
initial product of genome expression is the “transcriptome”, a
collection of RNA molecules derived from those genes.
- >-
Question: The IGSR has a sample for encoding structural variants in the
VCF 4.0 format.
An example from the site (the first record):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 2827693 .
CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA
C . PASS
SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66
GT:GQ 1/1:13.9
How to read it? From what I can see:
This is a deletion (SVTYPE=DEL)
The end position of the variant comes before the starting position
(reverse strand?)
The reference starts from 2827693 to 2827680 (13 bases on the reverse
strand)
The difference between reference and alternative is 66 bases (SVLEN=-66)
This doesn't sound right to me. For instance, I don't see where exactly
the deletion starts. The SVLEN field says 66 bases deleted, but where?
2827693 to 2827680 only has 13 bases between.
Q: How to read the deletion correctly from this structural VCF record?
Where is the missing 66-13=53 bases?
Answer: I just received a reply from 1000Genomes regarding this. I'll
post it in its entirety below:
Looking at the example you mention, I find it difficult to come up with
an interpretation of the information whereby the stated end seems to be
correct, so believe that this may indeed be an error.
Since the v4.0 was created, however, new versions of VCF have been
introduced, improving and correcting the specification. The current
version is v4.3 (http://samtools.github.io/hts-specs/). I believe the
first record shown on page 11 provides an accurate example of this type
of deletion.
I will update the web page to include this information.
So we can take this as official confirmation that we were all correct in
suspecting the example was just wrong.
- source_sentence: Publicly available genome sequence database for viruses?
sentences:
- >-
Question: This question is based on a question on BioStars posted >2
years ago by user jack.
It describes a very frequent problem of generating GO annotations for
non-model organisms. While it is based on some specific format and
single application (Ontologizer), it would be useful to have a general
description of the pathway to getting to a GAF file.
Note that the input format is lacking a bit of essential information,
like how it was obtained, so it is hard to assign evidence codes.
Therefore, let's assume that the assignments of GO terms were done
automagically.
I want to do gene enrichment using Ontologizer without a predefined
association file (it's not a model organism).
I have parsed a file with two columns for that organism like this :
geneA GO:0006950,GO:0005737
geneB
GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031
geneC
GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
geneD
GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
I have downloaded the .ob file from the Gene Ontology site, which
contains this information (from here):
!
! GO IDs (primary only) and name text strings
! GO:0000000 [tab] text string [tab] F|P|C
! where F = molecular function, P = biological process, C = cellular
component
!
GO:0000001 mitochondrion inheritance P
GO:0000002 mitochondrial genome maintenance P
GO:0000003 reproduction P
GO:0000005 ribosomal chaperone activity F
GO:0000006 high affinity zinc uptake transmembrane transporter
activity F
GO:0000007 low-affinity zinc ion transmembrane transporter activity
F
GO:0000008 thioredoxin F
GO:0000009 alpha-1,6-mannosyltransferase activity F
GO:0000010 trans-hexaprenyltranstransferase activity F
GO:0000011 vacuole inheritance P
What I need as output is .gaf file in the following format (in the
format of the files here):
!gaf-version: 2.0
!Project_name: Leishmania major GeneDB
!URL: http://www.genedb.org/leish
!Contact Email: mb4@sanger.ac.uk
GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 GO:0003723 PMID:22396527 ISO GeneDB:Tb927.10.10130 F mitochondrial RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20120910 GeneDB_Lmajor
GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 GO:0044429 PMID:20660476 ISS C mitochondrial RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20100803 GeneDB_Lmajor
GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 GO:0016554 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20120910 GeneDB_Lmajor
GeneDB_Lmajor LmjF.36.4770 LmjF.36.4770 GO:0048255 PMID:22396527 ISO GeneDB:Tb927.10.10130 P mitochondrial RNA binding complex 1 subunit, putative LmjF36.4770 gene taxon:347515 20120910 GeneDB_Lmajor
How to create your own GO association file (gaf)?
Answer: Here's a Perl script that can do this:
#!/usr/bin/env perl
use strict;
use warnings;
## Change this to whatever taxon you are working with
my $taxon = 'taxon:1000';
chomp(my $date = `date +%Y%m%d`);  ## YYYYMMDD (%m is the month)
my (%aspect, %gos);
## Read the GO.terms_and_ids file to get the aspect (sub ontology)
## of each GO term.
open(my $fh, $ARGV[0]) or die "Need a GO.terms_and_ids file as 1st arg:
$!\n";
while (<$fh>) {
next if /^!/;
chomp;
my @fields = split(/\t/);
## $aspect{GO:0000001} = 'P'
$aspect{$fields[0]} = $fields[2];
}
close($fh);
## Read the list of gene annotations
open($fh, $ARGV[1]) or die "Need a list of gene annotations as 2nd arg:
$!\n";
while (<$fh>) {
chomp;
my ($gene, @terms) = split(/[\s,]+/);
## $gos{geneA} = (go1, go2 ... goN)
$gos{$gene} = [ @terms ];
}
close($fh);
foreach my $gene (keys(%gos)) {
foreach my $term (@{$gos{$gene}}) {
## Warn and skip if there is no aspect for this term
if (!$aspect{$term}) {
print STDERR "Unknown GO term ($term) for gene $gene\n";
next;
}
## Build a pseudo GAF line
my @out = ('DB', $gene, $gene, ' ', $term, 'PMID:foo', 'TAS', ' ', $aspect{$term},
$gene, ' ', 'protein', $taxon, $date, 'DB', ' ', ' ');
print join("\t", @out). "\n";
}
}
Make it executable and run it with the GO.terms_and_ids file as the 1st
argument and the list of gene annotations as the second. Using the
current GO.terms_and_ids and the example annotations in the question, I
get:
$ foo.pl GO.terms_and_ids file.gos
DB geneD geneD GO:0005634 PMID:foo TAS C geneD
protein taxon:1000 20170308 DB
DB geneD geneD GO:0003677 PMID:foo TAS F geneD
protein taxon:1000 20170308 DB
DB geneD geneD GO:0030154 PMID:foo TAS P geneD
protein taxon:1000 20170308 DB
Unknown GO term (GO:0006350) for gene geneD
DB geneD geneD GO:0006355 PMID:foo TAS P geneD
protein taxon:1000 20170308 DB
DB geneD geneD GO:0007275 PMID:foo TAS P geneD
protein taxon:1000 20170308 DB
DB geneD geneD GO:0030528 PMID:foo TAS F geneD
protein taxon:1000 20170308 DB
DB geneB geneB GO:0016020 PMID:foo TAS C geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0005524 PMID:foo TAS F geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0006468 PMID:foo TAS P geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0005737 PMID:foo TAS C geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0004674 PMID:foo TAS F geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0006914 PMID:foo TAS P geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0016021 PMID:foo TAS C geneB
protein taxon:1000 20170308 DB
DB geneB geneB GO:0015031 PMID:foo TAS P geneB
protein taxon:1000 20170308 DB
DB geneA geneA GO:0006950 PMID:foo TAS P geneA
protein taxon:1000 20170308 DB
DB geneA geneA GO:0005737 PMID:foo TAS C geneA
protein taxon:1000 20170308 DB
DB geneC geneC GO:0003779 PMID:foo TAS F geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0006941 PMID:foo TAS P geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0005524 PMID:foo TAS F geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0003774 PMID:foo TAS F geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0005516 PMID:foo TAS F geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0005737 PMID:foo TAS C geneC
protein taxon:1000 20170308 DB
DB geneC geneC GO:0005863 PMID:foo TAS C geneC
protein taxon:1000 20170308 DB
Note that this is very much a pseudo-GAF file since most of the fields
apart from the gene name, GO term and sub-ontology are fake. It should
still work for what you need, however.
- >-
Question: As a small introductory project, I want to compare genome
sequences of different strains of influenza virus.
What are the publicly available databases of influenza virus gene/genome
sequences?
Answer: There are a few different influenza virus database resources:
The Influenza Research Database (IRD) (a.k.a FluDB - based upon URL)
A NIAID Bioinformatics Resource Center or BRC which highly curates the
data brought in and integrates it with numerous other relevant data
types
The NCBI Influenza Virus Resource
A sub-project of the NCBI with data curated over and above the GenBank
data that is part of the NCBI
The GISAID EpiFlu Database
A database of sequences from the Global Initiative on Sharing All
Influenza Data. Has unique data from many countries but requires users
to agree to a data-sharing policy.
The OpenFluDB
Former GISAID database that contains some sequence data that GenBank
does not have.
For those who also may be interested in other virus databases, there
are:
Virus Pathogen Resource (VIPR)
A companion portal to the IRD, which hosts curated and integrated data
for most other NIAID A-C virus pathogens including (but not limited to)
Ebola, Zika, Dengue, Enterovirus, and Hepatitis C
LANL HIV database
Los Alamos National Laboratory HIV database with HIV data and many
useful tools for all virus bioinformatics
PaVE: Papilloma virus genome database (from quintik comment)
NIAID developed and maintained Papilloma virus bioinformatics portal
Disclaimer: I used to work for the IRD / VIPR and currently work for
NIAID.
- >-
Question: I have a set of genomic ranges that are potentially
overlapping. I want to count the amount of ranges at certain positions
using R.
I'm pretty sure there are good solutions, but I seem to be unable to
find them.
Solutions like cut or findIntervals don't achieve what I want as they
only count on one vector or accumulate by all values <= break.
Also countMatches {GenomicRanges} doesn't seem to cover it.
Probably one could use Bedtools, but I don't want to leave R.
I could only come up with a hilariously slow solution
# generate test data
testdata <- data.frame(chrom = rep(seq(1,10),10),
starts = abs(rnorm(100, mean = 1, sd = 1)) * 1000,
ends = abs(rnorm(100, mean = 2, sd = 1)) * 2000)
# make sure that all end coordinates are bigger than start
# this is a requirement of the original data
testdata <- testdata[testdata$ends - testdata$starts > 0,]
# count overlapping ranges on certain positions
count.data <- lapply(unique(testdata$chrom), function(chromosome){
tmp.inner <- lapply(seq(1,10000, by = 120), function(i){
sum(testdata$chrom == chromosome & testdata$starts <= i & testdata$ends >= i)
})
return(unlist(tmp.inner))
})
# generate a data.frame containing all data
df.count.data <- ldply(count.data, rbind)
# ideally the chromosome will be columns and not rows
t(df.count.data)
Answer: GenomicRanges::countOverlaps seems to be what you’re after:
position_range = GRanges(position$chrom, IRanges(position, position,
width = 1))
ranges_at_position = countOverlaps(position_range, granges)
- source_sentence: samtools depth print out all positions
sentences:
- >-
Question: I have around ~3,000 short sequences of approximately ~10Kb
long. What are the best ways to find the motifs among all of these
sequences? Is there a certain software/method recommended?
There are several ways to do this. My goal would be to:
(1) Check for motifs repeated within individual sequences
(2) Check for motifs shared among all sequences
(3) Check for the presence of "expected" or known motifs
With respect to #3, I'm also curious if I find e.g. trinucleotide
sequences, how does one check the context around these regions?
Thank you for the recommendations/help!
Answer: For (3), this page has a lot of links to pattern/motif finding
tools. Following through the YMF link on that page, I came across the
University of Washington Motif Discovery section. Of these projection
seemed to be the only downloadable tool. I find it interesting how old
all these tools are; maybe the introduction of microarrays and NGS has
made them all redundant.
Your sub-problem (2) seems similar to the problem I'm having with
Nippostrongylus brasiliensis genome sequences, where I'd like to find
regions of very high homology (length 500bp to 20kb or more, 95-99%
similar) that are repeated throughout the genome. These sequences are
killing the assembly.
The main way I can find these regions is by looking at a coverage plot
of long nanopore reads mapped to the assembled genome (using GraphMap or
BWA). Any regions with substantially higher than median coverage are
likely to be shared repeats.
I've played around in the past with chopping up the reads to smaller
sizes, which works better for hitting smaller repeated regions that are
such a small proportion of most reads that they are never mapped to all
the repeated locations. I wrote my own script a while back to chop up
reads (for a different purpose), which produces a FASTA/FASTQ file where
all reads are exactly the same length. For some unknown reason I took
the time to document that script "properly" using POD, so here's a short
summary:
Converts all sequences in the input FASTA file to the same length.
Sequences shorter than the target length are dropped, and sequences longer
than the target length are split into overlapping subsequences covering
the entire range. This prepares the sequences for use in an
overlap-consensus assembler requiring constant-length sequences (such as
edena).
And here's the syntax:
$ ./normalise_seqlengths.pl -h
Usage:
./normalise_seqlengths.pl <reads.fa> [options]
Options:
-help
Only display this help message
-fraglength
Target fragment length (in base-pairs, default 2000)
-overlap
Minimum overlap length (in base-pairs, default 200)
-short
Keep short sequences (shorter than fraglength)
- >-
Question: Without going into too much background, I just joined up with
a lab as a bioinformatics intern while I'm completing my masters degree
in the field. The lab has data from an RNA-seq they outsourced, but the
only problem is that the only data they have is preprocessed from the
company that did the sequencing: filtering the reads, aligning them, and
putting the aligned reads through RSEM. I currently have output from
RSEM for each of the four samples consisting of: gene id, transcript
id(s), length, expected count, and FPKM. I am attempting to get the
FASTQ files from the sequencing, but for now, this is what I have, and
I'm trying to get something out of it if possible.
I found this article that talks about how expected read counts can be
better than raw read counts when analyzing differential expression using
EBSeq; it's just one guy's opinion, and it's from 2014, so it may be
wrong or outdated, but I thought I'd give it a try since I have the
expected counts.
However, I have just a couple of questions about running EBSeq that I
can't find the answers to:
1: In the output RSEM files I have, not all genes are represented in
each, about 80% of them are, but for the ones that aren't, should I
remove them before analysis with EBSeq? It runs when I do, but I'm not
sure if it is correct.
2: How do I know which normalization factor to use when running EBSeq?
This is more of a conceptual question rather than a technical question.
Thanks!
Answer: Yes, that blog post does represent just one guy's opinion (hi!)
and it does date all the way back to 2014, which is, like, decades in
genomics years. :-) By the way, there is quite a bit of literature
discussing the improvements that expected read counts derived from an
Expectation Maximization algorithm provide over raw read counts. I'd
suggest reading the RSEM papers for a start[1][2].
But your main question is about the mechanics of running RSEM and EBSeq.
First, RSEM was written explicitly to be compatible with EBSeq, so I'd
be very surprised if it does not work correctly out-of-the-box. Second,
EBSeq's MedianNorm function worked very well in my experience for
normalizing the library counts. Along those lines, the blog you
mentioned above has another post that you may find useful.
But all joking aside, these tools are indeed dated. Alignment-free
RNA-Seq tools provide orders-of-magnitude improvements in runtime over
the older alignment-based alternatives, with comparable accuracy.
Sailfish was the first in a growing list of tools that now includes
Salmon and Kallisto. When starting a new analysis from scratch (i.e. if
you ever get the original FASTQ files), there's really no good reason
not to estimate expression using these much faster tools, followed by a
differential expression analysis with DESeq2, edgeR, or sleuth.
[1] Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene
expression estimation with read mapping uncertainty. Bioinformatics,
26(4):493–500, doi:10.1093/bioinformatics/btp692.
[2] Li B, Dewey C (2011) RSEM: accurate transcript quantification from
RNA-Seq data with or without a reference genome. BMC Bioinformatics,
12:323, doi:10.1186/1471-2105-12-323.
- >-
Question: I am trying to use samtools depth (v1.4) with the -a option
and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and
chrM to print out the coverage at every position:
cat GRCh38.karyo.bed | awk '{print $3}' | datamash sum 1
3088286401
I would like to know how to run samtools depth so that it produces
3,088,286,401 entries when run against a GRCh38 bam file:
samtools depth -b $bedfile -a $inputfile
I tried it for a few bam files that were aligned the same way, and I get
differing number of entries:
3087003274
3087005666
3087007158
3087009435
3087009439
3087009621
3087009818
3087010065
3087010408
3087010477
3087010481
3087012115
3087013147
3087013186
3087013500
3087149616
Is there a special flag in samtools depth so that it reports all entries
from the bed file?
If samtools depth is not the best tool for this, what would be the
equivalent with sambamba depth base?
sambamba depth base --min-coverage=0 --regions $bedfile $inputfile
Any other options?
Answer: You might try using bedtools genomecov instead. If you provide
the -d option, it reports the coverage at every position in the BAM
file.
bedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"
You can also provide a BED file if you just want to calculate in the
target region.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
SentenceTransformer based on BAAI/bge-small-en-v1.5
This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: BAAI/bge-small-en-v1.5
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
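For illustration only, the three modules above can be mirrored with plain transformers code: take the CLS token from the BertModel output (pooling_mode_cls_token) and L2-normalize it. This is a minimal sketch, assuming the finetuned weights are available under the same placeholder ID used in the usage example below; for normal use, load the model with SentenceTransformer as shown in the Usage section.

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "sentence_transformers_model_id"  # placeholder ID, as in the usage example below
tokenizer = AutoTokenizer.from_pretrained(model_id)
bert = AutoModel.from_pretrained(model_id)

batch = tokenizer(
    ["What are the de facto required fields in a SAM/BAM read group?"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    token_embeddings = bert(**batch).last_hidden_state  # (batch, seq_len, 384)

cls = token_embeddings[:, 0]                                  # CLS-token pooling (module 1)
embeddings = torch.nn.functional.normalize(cls, p=2, dim=1)   # Normalize() (module 2)
print(embeddings.shape)                                       # torch.Size([1, 384])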
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'samtools depth print out all positions',
'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target region.',
"Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
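For example, the scores above can be used directly for semantic search by ranking the two Q&A passages against the first (query) sentence:

# similarities[0] holds the query's similarity to itself and to the two passages
query_scores = similarities[0, 1:]
best_passage = sentences[1 + int(query_scores.argmax())]
print(best_passage[:80])  # the passage most similar to the query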
Training Details
Training Dataset
Unnamed Dataset
- Size: 96 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 96 samples:
  - sentence_0: string; min: 6 tokens, mean: 14.93 tokens, max: 34 tokens
  - sentence_1: string; min: 103 tokens, mean: 397.92 tokens, max: 512 tokens
- Samples (the first three are shown below; sentence_1 is truncated):

  sentence_0: Using shells other than bash
  sentence_1:
  Question: As someone who's beginning to delve into bioinformatics, I'm noticing that, like biology, there are industry standards here, similar to Illumina in genomics and bowtie for alignment; many people use bash as their shell.
  Is using a shell besides bash going to cause issues for me?
  Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell.
  If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell. See Csh Programming Considered Harmful and other essays/polemics.
  A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools. It's not uncommon to have things like
  Set some variables pointing at reference data and add the script to your PATH to run it:
  export...

  sentence_0: Linear models of complex diseases
  sentence_1:
  Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice).
  For instance we have a disease D with three stages as defined by clinicians: A, B and C. 10 samples per stage, plus healthy controls H to compare with, are RNA-sequenced. A typical linear model would be to observe the three stages ~A+B+C independently. The data of each stage is not from the same person (but for the question assume it isn't).
  My understanding is that such a model would not take into account that stage C appears only in 30% of patients in stage B, and that a healthy patient can, upon external factors, jump to stage B.
  If we want to find the role of a gene in the disease we should include this information in the model somehow, which makes me think about mixing linear models and hidden Markov chains.
  How can such a disease be described in terms of linear models with such data and information?
  Answer: There are t...

  sentence_0: Detecting portions of human proteins with high degree of microbial similarity
  sentence_1:
  Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem.
  My goal is to take a list of human proteins and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with a list of FASTA sequences and have an easy way to generate an output of the corresponding high-similarity segments of each protein.
  Are there existing tools or software that I should be aware of that will make my life easier?
  Thanks in advance.
  Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options:
  PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu...

- Loss: MultipleNegativesRankingLoss with these parameters:
  { "scale": 20.0, "similarity_fct": "cos_sim" }
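As a rough illustration (not the exact training script), a finetuning run consistent with this loss and the non-default hyperparameters listed below could look like the following sketch; the dataset contents and output directory are placeholders:

from datasets import Dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Placeholder pairs: a short query (sentence_0) and a relevant Q&A passage (sentence_1)
train_dataset = Dataset.from_dict({
    "sentence_0": ["samtools depth print out all positions",
                   "How to read structural variant VCF?"],
    "sentence_1": ["Question: I am trying to use samtools depth (v1.4) ...",
                   "Question: The IGSR has a sample for encoding structural variants ..."],
})

loss = MultipleNegativesRankingLoss(model)  # defaults: scale=20.0, similarity_fct=cos_sim

args = SentenceTransformerTrainingArguments(
    output_dir="bge-small-en-v1.5-finetuned",  # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=32,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(model=model, args=args,
                                     train_dataset=train_dataset, loss=loss)
trainer.train()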
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 1
- fp16: True
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin
Framework Versions
- Python: 3.12.8
- Sentence Transformers: 3.4.1
- Transformers: 4.51.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.7.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}