Skip to content

Conversation

mwalker174
Copy link
Collaborator

@mwalker174 mwalker174 commented Jun 16, 2023

  • Fixes svtk sr-test and count-sr code to allow for -+ in addition to +- strandedness on INS variants. The sr-test strategy also was updated so that it is more likely to find a significant SR signature on each side, and now defaults to the nearest significant position rather than one 50bp away. These changes result in substantially more two-sided support for INS variants, as well as more accurate breakpoint refinement for small variants, particularly insertions.
  • Updates SR genotyping model. First, INS are now genotyped separately from other balanced variants, as they exhibit a distinctive SR count distribution. Second, separate model parameters are now used for one-sided and two-sided variants. The two models are also applied based on whether the particular sample exhibits one- or two-sided support, rather than the variant as a whole. Third, the SR GQ model has been simplified to compute a p-value of the left-hand tail based on the count z-score. It therefore scores the confidence that the genotype is not over-called rather than that the genotype is correct (i.e. using a two-sided test).

The combination of these updates substantially improves INS genotyping and alleviates some of the het bias of the previous model in all SR-supported variants.

This branch has been tested on the 1kgp reference panel as well as batch_23 from AoU-PhaseI.

Fix inf log-pvals

Fix rewrite_SR_coords.py

Increase median homozygous count in SR genotyping for INS

Update dockers

Minor doc fix

Fix AddGenotypes index output path

Fix SplitVariants

Add ins_median_hom_multiplier as an adjustable parameter

Update dockers

Minor doc change

Fix invalid inputs to _test_coord for some edge cases

Update dockers

Update get_inputs_outputs.py

Rework genotyping for INS

median_hom 120; fixed gq model

Update wdls

More script fixes

Update dockers

Use same SR_count cutoff for one- and two-sided variants

Simple probabilities for GQ

Delete debugging scripts from wdl

Use sv-pipeline docker for sr genotyping

Fix maxgq in plot_sv_perSample_distribs.R

Add hom_cutoff_multiplier and median_hom_ins parameters

Update dockers

Revert get_inputs_outputs.py

Revert plot_sv_perSample_distribs.R

Revert dockers.json

Replace grep with awk

Modify a comment
@mwalker174
Copy link
Collaborator Author

Genotype-filtered VCF and QC are available here: gs://broad-dsde-methods-markw/mw-ins-sr-strands/FilterGenotypes/outputs/

Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix, this is a massive amount of work and should be a big improvement. The QC is looking good. The new code you've added looks clean, and I especially appreciate the added documentation.

I wasn't very familiar with a lot of this code so although I spent a good deal of time reviewing, I still have some questions about your changes - hopefully not too hard to answer and then we can move forward!


##Pull out variants to genotype##
join -j 1 -a 1 -a 2 -e "." -o 1.1 2.1 1.2 1.3 1.4 1.5 2.2 2.3 2.4 2.5 \
<(zcat sr.geno.final.txt.gz \
|fgrep -wf both.pass.txt \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we used to null out SR GTs if the SR background was too high but we are no longer doing so?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has more to do with two-sided support than SR background.

It is a bit complicated, but basically there are two models for genotyping: a two-sided and a one-sided. Before, it would select the model based on whether the site was deemed one- or two-sided. The methodology for flagging a site as two-sided was fine for the purposes of just assigning a two-side support status to the variant (by checking what fraction of non-ref samples have two-sided support, above).

However, the method also used to genotype according to the site-level both-side support status. It turns out that within a single site, there can be a mixture of two-side support among samples, and the cutoff method was not good enough. This wasn't an issue before since it happened that the one- and two-sided models were actually identical. But now that the one-sided model has different cutoffs, this approach was causing havoc in a lot of variants.

Here I have just removed the requirement that two-sided genotypes have two-sided status at the site level.

log_pval_A = float(resultA.log_pval)
log_pval_B = float(resultB.log_pval)
if record.info['SVTYPE'] == 'INS':
if posA > posB + self.ins_window or posA < posB - self.ins_window:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that posA and posB must be within self.ins_window of each other for INS? What about INS with a deletion at the insertion site, or a sequence of microhomology > 50bp?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they could be larger, but I was attempting to match the intended behavior of the previous implementation which uses 50bp. Larger window sizes could increase fall positives, e.g. in messy regions you'd be more likely to find significant SR signals by chance. The microhomologous sequences are most frequently ~20bp but they can be larger. This is probably a point worth discussing so let's follow up.

# posA is better, so use posA as anchor and check for best posB in valid window
resultB = self._test_coord(
record.stop, record.info['STRANDS'][1], record.info['CHR2'],
called, background, posA, posA + self.window)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that posA could be returned as the optimal 2nd coordinate because it is in the test window?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think there's a rare corner case here for intrachromosomal variants with ++ and -- strandedness (INV and some BNDs) that are small (less than twice the window size). In this case with an overlapping window, if the best SR position is within that overlapping window, the algorithm will choose posA=posB, which shouldn't be valid for this variant class.

Luckily this seems to be a rare case, at least in the 1kgp reference panel there were only 2, and there were no differences in the variants between runs on main and in this branch.

I'm still going to add a check for this case, specifically by rerunning _test_coord for posB but explicitly disallowing it to test at posA.

# vice versa
resultA = self._test_coord(
posB, record.info['STRANDS'][0], record.chrom,
called, background, posB - self.ins_window, posB + self.ins_window)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this return a pair of coordinates that don't meet the requirements you are imposing? It seems that after being updated once they are not checked again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does allow for posB < posA, but that will get fixed later in rewrite_SR_coords.py. Is there another case you're thinking of?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I guess it's just posA > posB. To clarify, why is that considered invalid on line 52 if it is fixed in rewrite_SR_coords.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the difference here is the window size. In the first test we use self.window but here we only test over self.ins_window, which could be smaller.

Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates and explanations! Approved pending decision about the INS breakpoint window

@mwalker174 mwalker174 merged commit 0be9cf8 into main Jun 27, 2023
@mwalker174 mwalker174 deleted the mw_ins_sr_strands branch June 27, 2023 14:16
gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023
gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants