-
Notifications
You must be signed in to change notification settings - Fork 77
Update SR counting and genotyping #553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fix inf log-pvals Fix rewrite_SR_coords.py Increase median homozygous count in SR genotyping for INS Update dockers Minor doc fix Fix AddGenotypes index output path Fix SplitVariants Add ins_median_hom_multiplier as an adjustable parameter Update dockers Minor doc change Fix invalid inputs to _test_coord for some edge cases Update dockers Update get_inputs_outputs.py Rework genotyping for INS median_hom 120; fixed gq model Update wdls More script fixes Update dockers Use same SR_count cutoff for one- and two-sided variants Simple probabilities for GQ Delete debugging scripts from wdl Use sv-pipeline docker for sr genotyping Fix maxgq in plot_sv_perSample_distribs.R Add hom_cutoff_multiplier and median_hom_ins parameters Update dockers Revert get_inputs_outputs.py Revert plot_sv_perSample_distribs.R Revert dockers.json Replace grep with awk Modify a comment
Genotype-filtered VCF and QC are available here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this fix, this is a massive amount of work and should be a big improvement. The QC is looking good. The new code you've added looks clean, and I especially appreciate the added documentation.
I wasn't very familiar with a lot of this code so although I spent a good deal of time reviewing, I still have some questions about your changes - hopefully not too hard to answer and then we can move forward!
src/sv-pipeline/04_variant_resolution/scripts/SR_genotype.opt_part2.sh
Outdated
Show resolved
Hide resolved
|
||
##Pull out variants to genotype## | ||
join -j 1 -a 1 -a 2 -e "." -o 1.1 2.1 1.2 1.3 1.4 1.5 2.2 2.3 2.4 2.5 \ | ||
<(zcat sr.geno.final.txt.gz \ | ||
|fgrep -wf both.pass.txt \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we used to null out SR GTs if the SR background was too high but we are no longer doing so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has more to do with two-sided support than SR background.
It is a bit complicated, but basically there are two models for genotyping: a two-sided and a one-sided. Before, it would select the model based on whether the site was deemed one- or two-sided. The methodology for flagging a site as two-sided was fine for the purposes of just assigning a two-side support status to the variant (by checking what fraction of non-ref samples have two-sided support, above).
However, the method also used to genotype according to the site-level both-side support status. It turns out that within a single site, there can be a mixture of two-side support among samples, and the cutoff method was not good enough. This wasn't an issue before since it happened that the one- and two-sided models were actually identical. But now that the one-sided model has different cutoffs, this approach was causing havoc in a lot of variants.
Here I have just removed the requirement that two-sided genotypes have two-sided status at the site level.
log_pval_A = float(resultA.log_pval) | ||
log_pval_B = float(resultB.log_pval) | ||
if record.info['SVTYPE'] == 'INS': | ||
if posA > posB + self.ins_window or posA < posB - self.ins_window: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that posA and posB must be within self.ins_window of each other for INS? What about INS with a deletion at the insertion site, or a sequence of microhomology > 50bp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they could be larger, but I was attempting to match the intended behavior of the previous implementation which uses 50bp. Larger window sizes could increase fall positives, e.g. in messy regions you'd be more likely to find significant SR signals by chance. The microhomologous sequences are most frequently ~20bp but they can be larger. This is probably a point worth discussing so let's follow up.
src/svtk/svtk/pesr/sr_test.py
Outdated
# posA is better, so use posA as anchor and check for best posB in valid window | ||
resultB = self._test_coord( | ||
record.stop, record.info['STRANDS'][1], record.info['CHR2'], | ||
called, background, posA, posA + self.window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that posA could be returned as the optimal 2nd coordinate because it is in the test window?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think there's a rare corner case here for intrachromosomal variants with ++
and --
strandedness (INV and some BNDs) that are small (less than twice the window size). In this case with an overlapping window, if the best SR position is within that overlapping window, the algorithm will choose posA=posB, which shouldn't be valid for this variant class.
Luckily this seems to be a rare case, at least in the 1kgp reference panel there were only 2, and there were no differences in the variants between runs on main and in this branch.
I'm still going to add a check for this case, specifically by rerunning _test_coord
for posB but explicitly disallowing it to test at posA.
# vice versa | ||
resultA = self._test_coord( | ||
posB, record.info['STRANDS'][0], record.chrom, | ||
called, background, posB - self.ins_window, posB + self.ins_window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this return a pair of coordinates that don't meet the requirements you are imposing? It seems that after being updated once they are not checked again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does allow for posB < posA, but that will get fixed later in rewrite_SR_coords.py
. Is there another case you're thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I guess it's just posA > posB. To clarify, why is that considered invalid on line 52 if it is fixed in rewrite_SR_coords.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the difference here is the window size. In the first test we use self.window
but here we only test over self.ins_window
, which could be smaller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates and explanations! Approved pending decision about the INS breakpoint window
-+
in addition to+-
strandedness on INS variants. The sr-test strategy also was updated so that it is more likely to find a significant SR signature on each side, and now defaults to the nearest significant position rather than one 50bp away. These changes result in substantially more two-sided support for INS variants, as well as more accurate breakpoint refinement for small variants, particularly insertions.The combination of these updates substantially improves INS genotyping and alleviates some of the het bias of the previous model in all SR-supported variants.
This branch has been tested on the 1kgp reference panel as well as batch_23 from AoU-PhaseI.