@fulop_dan Indeed, strandedness of the libraries does not (presently) affect alignments.
The --soloStrand option is necessary for assigning reads to genes in single-cell gene expression quantification.
@AnnLorainePhD @nomad421 @anshulkundaje Good suggestions from Rob!
And as Anshul pointed out, tweaking parameters could be helpful.
If you have specific examples of stubbornly wrong alignments, please post them on GitHub:
https://t.co/0plPdXOISk
@anshulkundaje @satijalab @nomad421 @stephaniehicks @humancellatlas @_hubmap Seconding all responses, good discussion!
One issue with including intronic reads concerns genes whose exons overlap introns of other genes. Reads mapping to such overlapped regions will be considered multi-gene and (typically) excluded.
(1/2)
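A minimal sketch of the overlap problem described above, assuming a toy annotation. The gene names, coordinates, and the helper functions `genes_for_read` and `count_read` are hypothetical and only illustrate the idea; this is not STARsolo's actual assignment logic.

```python
# Hypothetical toy annotation: gene A has two exons with an intron between
# them; gene B is a single-exon gene located entirely inside A's intron.
GENES = {
    "A": {"span": (100, 1000), "exons": [(100, 200), (900, 1000)]},
    "B": {"span": (400, 600), "exons": [(400, 600)]},
}


def contains(interval, read):
    """True if the read (start, end) lies fully inside the interval."""
    return interval[0] <= read[0] and read[1] <= interval[1]


def genes_for_read(read, include_introns):
    """Return the set of genes compatible with the read.

    With include_introns=True the whole gene span (exons + introns) counts;
    otherwise only the exons do.
    """
    hits = set()
    for name, gene in GENES.items():
        regions = [gene["span"]] if include_introns else gene["exons"]
        if any(contains(region, read) for region in regions):
            hits.add(name)
    return hits


def count_read(read, include_introns):
    """Assign the read to a gene only if exactly one gene is compatible."""
    hits = genes_for_read(read, include_introns)
    return next(iter(hits)) if len(hits) == 1 else None


read = (450, 550)  # falls in B's exon, which sits inside A's intron
print(count_read(read, include_introns=False))  # B
print(count_read(read, include_introns=True))   # None (multi-gene: A and B)
```

In this toy case the read is counted for B when only exons are considered, but becomes multi-gene and is dropped once A's intron is also counted.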
@APredeus Indeed, we were using the "abridged" 10X annotations that exclude small non-coding RNA and pseudogenes.
We checked it for the full Gencode 37 annotations, and the results were very similar.
STARsolo preprint is out on bioRxiv:
https://t.co/okqCUWIERH
STAR release 2.7.9a:
https://t.co/bjkJskVfnl
The major new feature is quantification of multi-gene (multi-mapping) reads/UMIs, which is necessary to detect expression from overlapping genes and paralogs.
1/5
@alexwstockinger SuperTranscripts should work if you can make a set of SuperTranscript sequences and a GTF describing spliced/unspliced transcripts with respect to transcripts, and give both to the STAR genome generation step.
@alexwstockinger The SuperTranscripts are very cool - but they would require spliced alignments. We were actually looking into that at some point but did not get far.
The redundancy is not a problem, as long as redundant transcripts are assigned to the same gene.
@alexwstockinger This is a good point: for species without a genome assembly, mapping to the transcriptome is the only option. You can do it with STARsolo by generating the genome index from transcript sequences instead of chromosomes.
3/3
@alexwstockinger Using simulations, we show the differences are due to Kallisto's lower accuracy, which is caused by the pseudoalignment-to-transcriptome algorithm. It forces intronic reads (abundant in single-cell data) to map to spurious genes.
2/3
@timtriche @manvendr7 @MollyHammell Interesting paper, thanks!
It looks like they are aggregating reads over "meta" TE - they are not doing EM over individual genes.
@bdeonovic @BMirauta @biomonika @lpachter Sure, no disagreement here.
I was thinking about a specific data type, scRNA-seq gene/cell counts: mostly 0s, many 1s, and fewer >=2 elements.
But maybe Lior has something else on his mind, and I am being paranoid.
https://t.co/pqNF3IF3qN
@hypercompetent @lpachter It’s getting late on the East Coast, and still no blog from Lior, so I will make my presumptuous guess.
I think Lior is trying to puzzle out why the Kallisto-to-CellRanger correlation is lower in our Fig.4C https://t.co/okqCUWIERH vs. their Fig.2D https://t.co/x55WNVzIDh
1/3
@BMirauta @bdeonovic @biomonika @lpachter And the correlation coefficient does not have to be higher than the proportion of equal elements. An even simpler toy example:
x=[0 0 1 1]
y=[0 1 0 1]
corr(x,y)=0 (obviously)
while 50% of the elements agree.
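The toy example above can be checked with a few lines of plain Python. The `pearson` helper is written out here only for illustration; any standard correlation routine should give the same result.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]

r = pearson(x, y)
agreement = sum(a == b for a, b in zip(x, y)) / len(x)

print(r)          # 0.0
print(agreement)  # 0.5
```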
@p_bourguet Right, there are a few features in STARsolo that would be good to have for bulk (e.g., counting only reads that are concordant with transcripts). They are high on my TODO list. Though for multimappers, quantifying with RSEM is still a better (albeit slower) option.
@hypercompetent @lpachter The answer to “why Kallisto to CellRanger correlation is lower in our calculation” is simple.
We used Spearman correlation, while they used Pearson. Pearson correlation, of course, can be inflated by various artifacts and is not a good choice for RNA-seq data.
3/3
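A small sketch of how one such artifact can inflate Pearson correlation. It assumes made-up count vectors that disagree on all but one low-count entry and share a single very highly expressed gene. The data and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, not taken from either paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up counts: the two methods agree on only one low-count gene,
# but both see one very highly expressed gene (last entry).
x = [0, 1, 0, 2, 1, 0, 3, 1000]
y = [1, 0, 2, 0, 1, 3, 0, 900]

print(pearson(x, y))   # close to 1, driven by the single outlier
print(spearman(x, y))  # negative: the ranks mostly disagree
```

Here a single high-count gene dominates the Pearson value, while the rank-based Spearman value reflects the disagreement among the remaining genes.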
@hypercompetent @lpachter I am still not sure what the point of Lior’s toy example is. Should we not use correlation as a metric at all? Then why was it used in the Kallisto paper?
2/3