Skip to content

(STARsolo's) UR and UB SAM flags differ in length #1909

@johnmoll

Description

@johnmoll

Hi,

First of all, thanks for your amazing tool(s), I find STAR and STARsolo extraordinarily useful. This is to report that STARsolo's UB and UR don't have the same length when processing BDRhapsody data. The manual reads (page 13):

CR CY UR UY : sequences and quality scores of cell barcodes and UMIs for the solo* demulti-
plexing, not error corrected.
CB UB : error-corrected cell barcodes and UMIs for solo* demultiplexing.

so I expected UR and UB to share length, perhaps I misunderstood. If they were intended to share length, I might have found a bug. The snippet below shows how to generate an output where their lengths don't match.

STAR 2.7.10b, compiled from source, running on Ubuntu 18.04.6 LTS.

#!/bin/bash

NTHREADS=5

mkdir sandbox; cd $_

## these are whitelists (BDRhapsody)
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS1.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS2.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS3.txt

## fake, short genome
cat << EOF > genome.fa
>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA
CCACCTGAAACGGGCATCGTCGCTCTTCGTTGTTCGTCGACTTCTAGTGT
GGAGACGAATTGCCAGAATTATTAACTGCGCAGTTAGGGCAGCGTCTGAG
GAAGTTTGCTGCGGTTTCGCCTTGACCGCGGGAAGGAGACATAACGATAG
CGACTCTGTCTCAGGGGATCTGCATATGTTTGCAGCATACTTTAGGTGGG
CCTTGGCTTCCTTCCGCAGTCAAAACCGCGCAATTATCCCCGTCCTGATT
TACTGGACTCGCAACGTGGGTCCATCAGTTGTCCGTATACCAAGACGTCT
AAGGGCGGTGTACACCCTTTTGAGCAATGATTGCACAACCTGCGATCACC
TTATACAGAATTATCAATCAAGCTCCCCGAGGAGCGGACTTGTAAGGACC
GCCGCTTTCGCTCGGGTCTGCGGGTTATAGCTTTTCAGTCTCGACGGGCT
AGCACACATCTGGTTGACTAGGCGCATAGTCGCCATTCACAGATTTGCTC
GGCAATCAGTACTGGTAGGCGTTAGACCCCGTGACTCGTGGCTGAACGGC
CGTACAACTCGACAGCCGGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAA
AAAAAAAAAAA
>ERCC-00003
CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTGCCATTATAAAGAAGTTT
CCTCCAGCAACTCCTTTCTTAATTCCAAACTTAGCTTCAGTTATAAATTC
CCCTCCCATGATTGGGATTTTATAAACTTTTCTTCCATATAATTCATCTT
TCTTCTCATAACCGTCTCCGAAAAACTTCAACTTAAATCCAACCTTTAAC
TGCTCATCAGCCATGTCTCCCACAGCATCAAAAATAGCAGTTGTTGGACA
TGTTAAGACACACTGCCCCAATCTCTCTAACATTTGATGCTCTAACTCTG
ACTTTTTAGGGTGGCATATCTGTATTATAAATCCTGGTCTTCCATCTGGT
GTTTTTGATGGAGGGACATATTTCTCAATTCCTGCTTCTGCTGGACACAT
TATAACTGAACAACCAAAACCTGTTGCCTCTGTAGCTGCAATCTTAGCCC
ACTTCTTTGTAGCTGCTGTTATTAAAACTCTTGAAACCCATATTGGGAAT
GCTTCTGCAAATGTATCTTCAATATATACTCCATTTATTTCCATAGTTTC
CCTCCATTAAGATTTTAACAATTATAGTTTATCTTAGGGGCTATTAATAT
CTTATCATTTGGTTTTTAATATTCGATAAATCCATAAATAAAAATATATC
AACAATAATTTTAAATAATCTAAGTATAGGTAATATAACAATTAAAAAGA
TTTAGAGGGATAGAATTGAACGGCATTAGGAGAATTGTTTTAGATATATT
GAAGCCGCATGAGCCAAAAATAACAGATATGGCATTAAAATTAACATCAT
TATCAAACATTGATGGGGTTAATATTACAGTCTATGAAATAGATAAAGAG
ACTGAGAATGTTAAAGTTACAATTGAAGGGAATAATTTAGATTTTGATGA
GATTCAGGAAATTATTGAAAGTTTGGGAGGGACTATTCACAGTATAGATG
AGGTTGTTGCAGGTAAAAAGATTATTGAAGAGTTAGAACACCACAAGATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00004
TCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCC
CACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTA
AATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT
ATCTAAGGTAAAGTGCTTCTCAATAACATCCGCTCCTAAGGCAACAGAAA
CTACTGGGGCGAGTATTCCCAATGTATGGTCAGAATATCCCACAGGGATA
TTGAATATACTTTTCAAGGTTTTAATAGCGTTTAAATTGACATCTTCATA
AGGGGTTGGGTAAGATGAAATACAATGCAATAAAATAATATCCCTGCATC
CATTATTTTCTAAAACTTTAACTGCTTCCCAAATTTCCCCAATATCAGAC
ATTCCTGTAGATAAAATCACCGGCTTGCCTGTTTTTGCCACTTTTTCTAA
TAAGGGATAAAAGGTTAAATCACCAGAGGCAATTTTAAATCAGGCACATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00009
CAATGATAGGCTAGTCTCGCGCAGTACATGGTAGTTCAGCCAATAGATGC
CTAGTACGCTGACGGCATTCAGAGTACGCTGATCGGCTTATGACGTATGT
GACGCAGCTCTTAGCGCAATGTATGTGCTGTTATCGAAGCCTATGGCTGA
GTATGTAACGCTATGGCGTGCTAGTCGTCTCATATACGTCTGATGACCTC
GTATCATGTTATAGGGCTGCGAACTGTCGATGATGGTCACGACTCTGTCG
ATAGCTGTGTGACTCATTCAGAAGGTGTGCAGCCTATATGATACGCAGTC
GCATCCTATCTTACGTGTCAGTACTATGTGTGAGTGCTCCGCCCTAGTGC
TGATGTATGCCCCATAGTGCTCAGTGGAGTCTCTCTTAGCATAGTGTCCG
CTCATACATTAGATGGACGGCTCATTAGTATCATCGTCGGCTGATATAGG
TCGTGGCTCCCTGTATATCGAGGTGAGTCTATCTGGATCAACGTCGCACT
ATGATGTGCAAAGTGTCGTCCATGTATAGACAGTGCGCGTATCATATAGG
ATGCGGCGATCTCATACAGCGTTACGGTCGCTGCGTACTGTATAAGGATG
CTCTGTGAACTGTCATCGGTCCGATCAATTAGTCTAGTGTGCGTTATTCA
GATCGAGTGAGTACATGATTCGTCAGTGTGGATCAATTACAGTTAGGCCG
CTGACACATTAGTAACGTCGGCAAGCACTTAGTCGTGTCGTAAGCCAGTG
TGTCGTGTCTTAGACGACTGTGTGTGATTCTCGAGCGATTTATACATCCG
TGACAGCGCTTATAGTGTGCTGACAGACTGGTTGGTTATCCAATGATCGA
CCTGGAGTCTAATATCTGACCACGCCTTGTAATCGTATGACACGCGCTTG
ACACGACTGAATCCAGCTTAAGAGCCCTGCAACGCGATATACAGGCGCTG
CTACCGATATAAAAAAAAAAAAAAAAAAAAAAAA
EOF

## fake, short canonical transcripts
echo -e 'ERCC-00002\tERCC\texon\t1\t1061\t.\t+\t.\tgene_id "ERCC-00002"; transcript_id "DQ459430";
ERCC-00003\tERCC\texon\t1\t1023\t.\t+\t.\tgene_id "ERCC-00003"; transcript_id "DQ516784";
ERCC-00004\tERCC\texon\t1\t523\t.\t+\t.\tgene_id "ERCC-00004"; transcript_id "DQ516752";
ERCC-00009\tERCC\texon\t1\t984\t.\t+\t.\tgene_id "ERCC-00009"; transcript_id "DQ668364";' > genome.gtf

## index
STAR --runThreadN $NTHREADS \
     --runMode genomeGenerate \
     --sjdbGTFfile genome.gtf \
     --genomeDir a_genome \
     --genomeSAindexNbases 4 \
     --sjdbOverhang 100 \
     --genomeFastaFiles genome.fa
     
## generate 100 R1s (cDNA) and R2s (CB + UMI)
seq 1 100 | awk ' {print "@"$0"\nTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCCCACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTAAATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r1.fq.gz

seq 1 100 | awk ' {print "@"$0"\nCTTGTACTAGTGATGTTCTCCAGACAGGCTACAGATTTGATGGTTTTTTTTTTTTTTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r2.fq.gz

## run starsolo; UB and UR reported explicitly, among other flags
STAR --runThreadN 5 \
     --genomeDir a_genome \
     --readFilesCommand zcat \
     --outFileNamePrefix mapping_test/ \
     --readFilesIn  r1.fq.gz r2.fq.gz  \
     --soloType CB_UMI_Complex \
     --soloAdapterSequence GTGANNNNNNNNNGACA \
     --soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
     --soloUMIposition 3_10_3_17 \
     --soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
     --soloCBmatchWLtype 1MM \
     --soloCellFilter None \
     --soloCellReadStats Standard \
     --sjdbOverhang 100 \
     --outSAMattributes sS sM CB UB CR CY UR UY \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --sjdbGTFfile genome.gtf

## UB looks like UR, but with a prepended AA dinucleotide
samtools view mapping_test/Al*bam | cut -f16- | head
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG

Thanks again for your time,

John

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions