Update SR counting and genotyping #553

mwalker174 · 2023-06-16T18:56:31Z

Fixes svtk sr-test and count-sr code to allow for -+ in addition to +- strandedness on INS variants. The sr-test strategy also was updated so that it is more likely to find a significant SR signature on each side, and now defaults to the nearest significant position rather than one 50bp away. These changes result in substantially more two-sided support for INS variants, as well as more accurate breakpoint refinement for small variants, particularly insertions.
Updates SR genotyping model. First, INS are now genotyped separately from other balanced variants, as they exhibit a distinctive SR count distribution. Second, separate model parameters are now used for one-sided and two-sided variants. The two models are also applied based on whether the particular sample exhibits one- or two-sided support, rather than the variant as a whole. Third, the SR GQ model has been simplified to compute a p-value of the left-hand tail based on the count z-score. It therefore scores the confidence that the genotype is not over-called rather than that the genotype is correct (i.e. using a two-sided test).

The combination of these updates substantially improves INS genotyping and alleviates some of the het bias of the previous model in all SR-supported variants.

This branch has been tested on the 1kgp reference panel as well as batch_23 from AoU-PhaseI.

Fix inf log-pvals Fix rewrite_SR_coords.py Increase median homozygous count in SR genotyping for INS Update dockers Minor doc fix Fix AddGenotypes index output path Fix SplitVariants Add ins_median_hom_multiplier as an adjustable parameter Update dockers Minor doc change Fix invalid inputs to _test_coord for some edge cases Update dockers Update get_inputs_outputs.py Rework genotyping for INS median_hom 120; fixed gq model Update wdls More script fixes Update dockers Use same SR_count cutoff for one- and two-sided variants Simple probabilities for GQ Delete debugging scripts from wdl Use sv-pipeline docker for sr genotyping Fix maxgq in plot_sv_perSample_distribs.R Add hom_cutoff_multiplier and median_hom_ins parameters Update dockers Revert get_inputs_outputs.py Revert plot_sv_perSample_distribs.R Revert dockers.json Replace grep with awk Modify a comment

mwalker174 · 2023-06-22T16:50:09Z

Genotype-filtered VCF and QC are available here: gs://broad-dsde-methods-markw/mw-ins-sr-strands/FilterGenotypes/outputs/

epiercehoffman

Thanks for this fix, this is a massive amount of work and should be a big improvement. The QC is looking good. The new code you've added looks clean, and I especially appreciate the added documentation.

I wasn't very familiar with a lot of this code so although I spent a good deal of time reviewing, I still have some questions about your changes - hopefully not too hard to answer and then we can move forward!

src/sv-pipeline/04_variant_resolution/scripts/SR_genotype.opt_part2.sh

epiercehoffman · 2023-06-22T20:03:29Z

src/sv-pipeline/04_variant_resolution/scripts/SR_genotype.opt_part2.sh


 ##Pull out variants to genotype##
 join -j 1 -a 1 -a 2  -e "." -o 1.1 2.1 1.2 1.3 1.4 1.5 2.2 2.3 2.4 2.5 \
  <(zcat sr.geno.final.txt.gz \
-  |fgrep -wf both.pass.txt \


Does this mean that we used to null out SR GTs if the SR background was too high but we are no longer doing so?

This has more to do with two-sided support than SR background.

It is a bit complicated, but basically there are two models for genotyping: a two-sided and a one-sided. Before, it would select the model based on whether the site was deemed one- or two-sided. The methodology for flagging a site as two-sided was fine for the purposes of just assigning a two-side support status to the variant (by checking what fraction of non-ref samples have two-sided support, above).

However, the method also used to genotype according to the site-level both-side support status. It turns out that within a single site, there can be a mixture of two-side support among samples, and the cutoff method was not good enough. This wasn't an issue before since it happened that the one- and two-sided models were actually identical. But now that the one-sided model has different cutoffs, this approach was causing havoc in a lot of variants.

Here I have just removed the requirement that two-sided genotypes have two-sided status at the site level.

epiercehoffman · 2023-06-22T20:23:02Z

src/svtk/svtk/pesr/sr_test.py

+        log_pval_A = float(resultA.log_pval)
+        log_pval_B = float(resultB.log_pval)
+        if record.info['SVTYPE'] == 'INS':
+            if posA > posB + self.ins_window or posA < posB - self.ins_window:


Does this mean that posA and posB must be within self.ins_window of each other for INS? What about INS with a deletion at the insertion site, or a sequence of microhomology > 50bp?

Yes they could be larger, but I was attempting to match the intended behavior of the previous implementation which uses 50bp. Larger window sizes could increase fall positives, e.g. in messy regions you'd be more likely to find significant SR signals by chance. The microhomologous sequences are most frequently ~20bp but they can be larger. This is probably a point worth discussing so let's follow up.

epiercehoffman · 2023-06-22T20:28:27Z

src/svtk/svtk/pesr/sr_test.py

+                # posA is better, so use posA as anchor and check for best posB in valid window
+                resultB = self._test_coord(
+                    record.stop, record.info['STRANDS'][1], record.info['CHR2'],
+                    called, background, posA, posA + self.window)


Does this mean that posA could be returned as the optimal 2nd coordinate because it is in the test window?

Yes I think there's a rare corner case here for intrachromosomal variants with ++ and -- strandedness (INV and some BNDs) that are small (less than twice the window size). In this case with an overlapping window, if the best SR position is within that overlapping window, the algorithm will choose posA=posB, which shouldn't be valid for this variant class.

Luckily this seems to be a rare case, at least in the 1kgp reference panel there were only 2, and there were no differences in the variants between runs on main and in this branch.

I'm still going to add a check for this case, specifically by rerunning _test_coord for posB but explicitly disallowing it to test at posA.

epiercehoffman · 2023-06-22T20:30:36Z

src/svtk/svtk/pesr/sr_test.py

+                    # vice versa
+                    resultA = self._test_coord(
+                        posB, record.info['STRANDS'][0], record.chrom,
+                        called, background, posB - self.ins_window, posB + self.ins_window)


Could this return a pair of coordinates that don't meet the requirements you are imposing? It seems that after being updated once they are not checked again

It does allow for posB < posA, but that will get fixed later in rewrite_SR_coords.py. Is there another case you're thinking of?

Yes, I guess it's just posA > posB. To clarify, why is that considered invalid on line 52 if it is fixed in rewrite_SR_coords.py?

Ah, the difference here is the window size. In the first test we use self.window but here we only test over self.ins_window, which could be smaller.

epiercehoffman

Thanks for the updates and explanations! Approved pending decision about the INS breakpoint window

mwalker174 requested a review from epiercehoffman June 16, 2023 18:56

mwalker174 mentioned this pull request Jun 22, 2023

Update SV split-read strand validation and clustering broadinstitute/gatk#8378

Merged

epiercehoffman reviewed Jun 22, 2023

View reviewed changes

Remove set -x; address rare corner case in SRTest

d820790

epiercehoffman approved these changes Jun 26, 2023

View reviewed changes

mwalker174 merged commit 0be9cf8 into main Jun 27, 2023

mwalker174 deleted the mw_ins_sr_strands branch June 27, 2023 14:16

gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023

Update SR counting and genotyping (broadinstitute#553)

b2ddf02

gatk-sv-bot pushed a commit to Genometric/gatk-sv that referenced this pull request Jun 27, 2023

Update SR counting and genotyping (broadinstitute#553)

98a8a2c

Update SR counting and genotyping #553

Update SR counting and genotyping #553

Uh oh!

Conversation

mwalker174 commented Jun 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mwalker174 commented Jun 22, 2023

Uh oh!

epiercehoffman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

epiercehoffman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mwalker174 commented Jun 16, 2023 •

edited

Loading