-
Notifications
You must be signed in to change notification settings - Fork 531
Open
Labels
Description
Hi,
First of all, thanks for your amazing tool(s), I find STAR and STARsolo extraordinarily useful. This is to report that STARsolo's UB and UR don't have the same length when processing BDRhapsody data. The manual reads (page 13):
CR CY UR UY : sequences and quality scores of cell barcodes and UMIs for the solo* demulti-
plexing, not error corrected.
CB UB : error-corrected cell barcodes and UMIs for solo* demultiplexing.
so I expected UR and UB to share length, perhaps I misunderstood. If they were intended to share length, I might have found a bug. The snippet below shows how to generate an output where their lengths don't match.
STAR 2.7.10b, compiled from source, running on Ubuntu 18.04.6 LTS.
#!/bin/bash
NTHREADS=5
mkdir sandbox; cd $_
## these are whitelists (BDRhapsody)
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS1.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS2.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS3.txt
## fake, short genome
cat << EOF > genome.fa
>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA
CCACCTGAAACGGGCATCGTCGCTCTTCGTTGTTCGTCGACTTCTAGTGT
GGAGACGAATTGCCAGAATTATTAACTGCGCAGTTAGGGCAGCGTCTGAG
GAAGTTTGCTGCGGTTTCGCCTTGACCGCGGGAAGGAGACATAACGATAG
CGACTCTGTCTCAGGGGATCTGCATATGTTTGCAGCATACTTTAGGTGGG
CCTTGGCTTCCTTCCGCAGTCAAAACCGCGCAATTATCCCCGTCCTGATT
TACTGGACTCGCAACGTGGGTCCATCAGTTGTCCGTATACCAAGACGTCT
AAGGGCGGTGTACACCCTTTTGAGCAATGATTGCACAACCTGCGATCACC
TTATACAGAATTATCAATCAAGCTCCCCGAGGAGCGGACTTGTAAGGACC
GCCGCTTTCGCTCGGGTCTGCGGGTTATAGCTTTTCAGTCTCGACGGGCT
AGCACACATCTGGTTGACTAGGCGCATAGTCGCCATTCACAGATTTGCTC
GGCAATCAGTACTGGTAGGCGTTAGACCCCGTGACTCGTGGCTGAACGGC
CGTACAACTCGACAGCCGGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAA
AAAAAAAAAAA
>ERCC-00003
CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTGCCATTATAAAGAAGTTT
CCTCCAGCAACTCCTTTCTTAATTCCAAACTTAGCTTCAGTTATAAATTC
CCCTCCCATGATTGGGATTTTATAAACTTTTCTTCCATATAATTCATCTT
TCTTCTCATAACCGTCTCCGAAAAACTTCAACTTAAATCCAACCTTTAAC
TGCTCATCAGCCATGTCTCCCACAGCATCAAAAATAGCAGTTGTTGGACA
TGTTAAGACACACTGCCCCAATCTCTCTAACATTTGATGCTCTAACTCTG
ACTTTTTAGGGTGGCATATCTGTATTATAAATCCTGGTCTTCCATCTGGT
GTTTTTGATGGAGGGACATATTTCTCAATTCCTGCTTCTGCTGGACACAT
TATAACTGAACAACCAAAACCTGTTGCCTCTGTAGCTGCAATCTTAGCCC
ACTTCTTTGTAGCTGCTGTTATTAAAACTCTTGAAACCCATATTGGGAAT
GCTTCTGCAAATGTATCTTCAATATATACTCCATTTATTTCCATAGTTTC
CCTCCATTAAGATTTTAACAATTATAGTTTATCTTAGGGGCTATTAATAT
CTTATCATTTGGTTTTTAATATTCGATAAATCCATAAATAAAAATATATC
AACAATAATTTTAAATAATCTAAGTATAGGTAATATAACAATTAAAAAGA
TTTAGAGGGATAGAATTGAACGGCATTAGGAGAATTGTTTTAGATATATT
GAAGCCGCATGAGCCAAAAATAACAGATATGGCATTAAAATTAACATCAT
TATCAAACATTGATGGGGTTAATATTACAGTCTATGAAATAGATAAAGAG
ACTGAGAATGTTAAAGTTACAATTGAAGGGAATAATTTAGATTTTGATGA
GATTCAGGAAATTATTGAAAGTTTGGGAGGGACTATTCACAGTATAGATG
AGGTTGTTGCAGGTAAAAAGATTATTGAAGAGTTAGAACACCACAAGATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00004
TCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCC
CACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTA
AATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT
ATCTAAGGTAAAGTGCTTCTCAATAACATCCGCTCCTAAGGCAACAGAAA
CTACTGGGGCGAGTATTCCCAATGTATGGTCAGAATATCCCACAGGGATA
TTGAATATACTTTTCAAGGTTTTAATAGCGTTTAAATTGACATCTTCATA
AGGGGTTGGGTAAGATGAAATACAATGCAATAAAATAATATCCCTGCATC
CATTATTTTCTAAAACTTTAACTGCTTCCCAAATTTCCCCAATATCAGAC
ATTCCTGTAGATAAAATCACCGGCTTGCCTGTTTTTGCCACTTTTTCTAA
TAAGGGATAAAAGGTTAAATCACCAGAGGCAATTTTAAATCAGGCACATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00009
CAATGATAGGCTAGTCTCGCGCAGTACATGGTAGTTCAGCCAATAGATGC
CTAGTACGCTGACGGCATTCAGAGTACGCTGATCGGCTTATGACGTATGT
GACGCAGCTCTTAGCGCAATGTATGTGCTGTTATCGAAGCCTATGGCTGA
GTATGTAACGCTATGGCGTGCTAGTCGTCTCATATACGTCTGATGACCTC
GTATCATGTTATAGGGCTGCGAACTGTCGATGATGGTCACGACTCTGTCG
ATAGCTGTGTGACTCATTCAGAAGGTGTGCAGCCTATATGATACGCAGTC
GCATCCTATCTTACGTGTCAGTACTATGTGTGAGTGCTCCGCCCTAGTGC
TGATGTATGCCCCATAGTGCTCAGTGGAGTCTCTCTTAGCATAGTGTCCG
CTCATACATTAGATGGACGGCTCATTAGTATCATCGTCGGCTGATATAGG
TCGTGGCTCCCTGTATATCGAGGTGAGTCTATCTGGATCAACGTCGCACT
ATGATGTGCAAAGTGTCGTCCATGTATAGACAGTGCGCGTATCATATAGG
ATGCGGCGATCTCATACAGCGTTACGGTCGCTGCGTACTGTATAAGGATG
CTCTGTGAACTGTCATCGGTCCGATCAATTAGTCTAGTGTGCGTTATTCA
GATCGAGTGAGTACATGATTCGTCAGTGTGGATCAATTACAGTTAGGCCG
CTGACACATTAGTAACGTCGGCAAGCACTTAGTCGTGTCGTAAGCCAGTG
TGTCGTGTCTTAGACGACTGTGTGTGATTCTCGAGCGATTTATACATCCG
TGACAGCGCTTATAGTGTGCTGACAGACTGGTTGGTTATCCAATGATCGA
CCTGGAGTCTAATATCTGACCACGCCTTGTAATCGTATGACACGCGCTTG
ACACGACTGAATCCAGCTTAAGAGCCCTGCAACGCGATATACAGGCGCTG
CTACCGATATAAAAAAAAAAAAAAAAAAAAAAAA
EOF
## fake, short canonical transcripts
echo -e 'ERCC-00002\tERCC\texon\t1\t1061\t.\t+\t.\tgene_id "ERCC-00002"; transcript_id "DQ459430";
ERCC-00003\tERCC\texon\t1\t1023\t.\t+\t.\tgene_id "ERCC-00003"; transcript_id "DQ516784";
ERCC-00004\tERCC\texon\t1\t523\t.\t+\t.\tgene_id "ERCC-00004"; transcript_id "DQ516752";
ERCC-00009\tERCC\texon\t1\t984\t.\t+\t.\tgene_id "ERCC-00009"; transcript_id "DQ668364";' > genome.gtf
## index
STAR --runThreadN $NTHREADS \
--runMode genomeGenerate \
--sjdbGTFfile genome.gtf \
--genomeDir a_genome \
--genomeSAindexNbases 4 \
--sjdbOverhang 100 \
--genomeFastaFiles genome.fa
## generate 100 R1s (cDNA) and R2s (CB + UMI)
seq 1 100 | awk ' {print "@"$0"\nTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCCCACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTAAATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r1.fq.gz
seq 1 100 | awk ' {print "@"$0"\nCTTGTACTAGTGATGTTCTCCAGACAGGCTACAGATTTGATGGTTTTTTTTTTTTTTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r2.fq.gz
## run starsolo; UB and UR reported explicitly, among other flags
STAR --runThreadN 5 \
--genomeDir a_genome \
--readFilesCommand zcat \
--outFileNamePrefix mapping_test/ \
--readFilesIn r1.fq.gz r2.fq.gz \
--soloType CB_UMI_Complex \
--soloAdapterSequence GTGANNNNNNNNNGACA \
--soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
--soloUMIposition 3_10_3_17 \
--soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
--soloCBmatchWLtype 1MM \
--soloCellFilter None \
--soloCellReadStats Standard \
--sjdbOverhang 100 \
--outSAMattributes sS sM CB UB CR CY UR UY \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--sjdbGTFfile genome.gtf
## UB looks like UR, but with a prepended AA dinucleotide
samtools view mapping_test/Al*bam | cut -f16- | head
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
# UR:Z:TTTGATGG UY:Z:FFFFFFFF CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA UB:Z:AATTTGATGG
Thanks again for your time,
John