Conversation

takutosato
Contributor

Addresses #6242.

Current behavior: when all the reads in a read group are filtered out during the base recalibration step, that read group is not logged in the recal table. ApplyBQSR then encounters these reads, can't find the read group in the recal table, and throws an error.

New behavior: if the --allow-missing-read-groups flag is set to true, then ApplyBQSR outputs the original base qualities (after quantizing).

I avoided the alternative approach of collapsing (marginalizing) across the read groups, mostly because it would require a complete overhaul of the code. I also think that using recal data from other read groups might not be a good idea. In any case, using OQ should be good enough; I assume that these "missing" read groups are low enough quality to be filtered out and are likely to be thrown out by downstream tools.
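The fallback described above can be sketched as follows. This is a minimal, hypothetical illustration of the behavior (quantize the original qualities and use them as the recalibrated qualities), not GATK's actual implementation; the class and method names here are invented for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of the --allow-missing-read-groups fallback: when a read's
// group is absent from the recal table, pass the read's original (OQ) base
// qualities through the quantization table instead of throwing an error.
public class MissingReadGroupFallback {

    /** Map each original quality through a quantization table (indexed by quality). */
    public static byte[] quantize(final byte[] originalQuals, final byte[] quantizationMap) {
        final byte[] out = new byte[originalQuals.length];
        for (int i = 0; i < originalQuals.length; i++) {
            out[i] = quantizationMap[originalQuals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy quantizer: bin every quality down to the nearest multiple of 10.
        final byte[] map = new byte[64];
        for (int q = 0; q < map.length; q++) {
            map[q] = (byte) ((q / 10) * 10);
        }
        final byte[] originalQuals = {37, 22, 5};
        final byte[] recalibrated = quantize(originalQuals, map);
        System.out.println(Arrays.toString(recalibrated)); // quantized OQs, e.g. [30, 20, 0]
    }
}
```

The real quantization map comes from the "Quantized" table in the recal report; the point of the sketch is only that no recalibration shift is applied for these reads.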

I also refactored the BQSR code, mostly to update the variable and class names to be more accurate and descriptive. For instance:

ReadCovariates.java -> PerReadCovariateMatrix.java
EstimatedQReported -> ReportedQuality

@davidbenjamin davidbenjamin self-requested a review November 4, 2024 14:55
@davidbenjamin davidbenjamin self-assigned this Nov 4, 2024
Contributor

@davidbenjamin davidbenjamin left a comment

Looks good, just needs another pass at some of your comments and TODOs.

build.gradle Outdated
@@ -188,7 +188,7 @@ configurations.all {
}

tasks.withType(JavaCompile) {
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Werror', '-Xdiags:verbose']
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Xdiags:verbose']
Contributor

This is beyond my job description but I trust there's a reason?

* If set to true, do not throw an error upon encountering a read with a read group that's not in the recalibration table.
* Instead, simply set the quantized original base qualities as the recalibrated base qualities.
*/ // tsato: should this be in the *unique* ApplyBQSRArgumentCollection?
@Argument(fullName = ALLOW_MISSING_READ_GROUPS_LONG_NAME, doc = "", optional = true)
Contributor

empty doc string?

Contributor Author

done

@@ -90,7 +90,7 @@ public final class ApplyBQSR extends ReadWalker{
*/
@Argument(fullName=StandardArgumentDefinitions.BQSR_TABLE_LONG_NAME, shortName=StandardArgumentDefinitions.BQSR_TABLE_SHORT_NAME, doc="Input recalibration table for BQSR")
@WorkflowInput
public File BQSR_RECAL_FILE;
public File bqsrRecalFile; // tsato: change to GATKPath?
Contributor

Address this TODO

Contributor Author

done

@@ -26,14 +28,14 @@

public final class BQSRReadTransformer implements ReadTransformer {
private static final long serialVersionUID = 1L;

// tsato: remove?
Contributor

Address

Contributor Author

done


//clear indel qualities
// clear indel qualities // tsato: why? Is this ok? Do we update them?
Contributor

Address

Contributor Author

done

for ( final NestedIntegerArray.Leaf<RecalDatum> leaf : byQualTable.getAllLeaves() ) {
final int rgKey = leaf.keys[0];
final int eventIndex = leaf.keys[2];
final int rgKey = leaf.keys[readGroupIndex]; // tsato: ??? What are these indices?
Contributor

comment?

Contributor Author

done

@@ -32,7 +32,7 @@
public final class RecalUtils {
public static final String ARGUMENT_REPORT_TABLE_TITLE = "Arguments";
public static final String QUANTIZED_REPORT_TABLE_TITLE = "Quantized";
public static final String READGROUP_REPORT_TABLE_TITLE = "RecalTable0";
public static final String READGROUP_REPORT_TABLE_TITLE = "RecalTable0"; // TODO: make them more descriptive
Contributor

TODO?

Contributor Author

done

private static boolean warnUserNullPlatform = false;

private static final String SCRIPT_FILE = "BQSR.R";
public static final int EMPIRICAL_QUAL_DECIMAL_PLACES = 4;
public static final int EMPIRICAL_Q_REPORTED_DECIMAL_PLACES = 4;
public static final int REPORTED_QUALITY_DECIMAL_PLACES = 4; // tsato: "estimated" q reported...we need to rename (DONE)
Contributor

DONE?

Contributor Author

yes

public final class ContextCovariate implements Covariate {
private static final long serialVersionUID = 1L;
private static final Logger logger = LogManager.getLogger(ContextCovariate.class);

private final int mismatchesContextSize;
private final int mismatchesContextSize; // TODO: rename "mismatch" here
Contributor

TODO?

Contributor Author

done

@@ -136,27 +152,30 @@ public double getEmpiricalErrorRate() {
return doubleMismatches / doubleObservations;
}
}

public void setEmpiricalQuality(final double empiricalQuality) {
// tsato: int-double
Contributor

clean up comment

Contributor Author

done

Contributor

@lbergelson lbergelson left a comment

@takutosato I took a look. Definitely want to put the gradle build back to how it was. If you need help with the warning we can help. I'm not sure I totally understand the motivation for changing the internal quality value from a double to an int. I think you changed it to being PHRED scaled, but it seems like that might change the precision of the calculation, which could change the results? Or is it always output as a PHRED value before further calculations are made? It's not clear to me.

Thank you for the clarifying comments.

Lots of unaddressed TODOs. Some of them probably should be removed, but if you think that some of them are useful to future readers then that's fine.

build.gradle Outdated
@@ -188,7 +188,7 @@ configurations.all {
}

tasks.withType(JavaCompile) {
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Werror', '-Xdiags:verbose']
options.compilerArgs = ['-proc:none', '-Xlint:all', '-Xdiags:verbose']
Contributor

What warning are we seeing that made you take this out?

Contributor

Let's fix the warning or suppress it instead of turning this off.

Contributor Author

This has been reverted.

* If set to true, do not throw an error upon encountering a read with a read group that's not in the recalibration table.
* Instead, simply set the quantized original base qualities as the recalibrated base qualities.
*/ // tsato: should this be in the *unique* ApplyBQSRArgumentCollection?
@Argument(fullName = ALLOW_MISSING_READ_GROUPS_LONG_NAME, doc = "", optional = true)
Contributor

Forgot the doc string.

Contributor

Thanks for this note. I think this argument should be moved up to ApplyBQSRUniqueArgumentCollection. It is only relevant to the apply stage and not to the recalibration stage, so it makes sense for it to be there. Also, I think you need to update the method ApplyBQSRUniqueArgumentCollection.toApplyBQSRArgumentCollection to pass this argument to the newly created collection.
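The suggestion above amounts to forwarding the new field when the combined argument collection is built. A simplified sketch (class and field names mirror the PR but the bodies are invented for illustration):

```java
// Hypothetical sketch of the argument plumbing the reviewer describes: the
// apply-stage-only collection must copy the new flag into the combined
// collection it constructs, or a value set on the command line is silently lost.
public class ArgumentPlumbing {

    static class ApplyBQSRArgumentCollection {
        public boolean allowMissingReadGroups = false;
    }

    static class ApplyBQSRUniqueArgumentCollection {
        public boolean allowMissingReadGroups = false;

        ApplyBQSRArgumentCollection toApplyBQSRArgumentCollection() {
            final ApplyBQSRArgumentCollection combined = new ApplyBQSRArgumentCollection();
            // The easy-to-forget line: forward the new argument.
            combined.allowMissingReadGroups = this.allowMissingReadGroups;
            return combined;
        }
    }

    public static void main(String[] args) {
        final ApplyBQSRUniqueArgumentCollection unique = new ApplyBQSRUniqueArgumentCollection();
        unique.allowMissingReadGroups = true;
        System.out.println(unique.toApplyBQSRArgumentCollection().allowMissingReadGroups);
    }
}
```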

Contributor Author

I think we can do away with this UniqueArgumentCollection construction --- for another time.

@@ -90,7 +90,7 @@ public final class ApplyBQSR extends ReadWalker{
*/
@Argument(fullName=StandardArgumentDefinitions.BQSR_TABLE_LONG_NAME, shortName=StandardArgumentDefinitions.BQSR_TABLE_SHORT_NAME, doc="Input recalibration table for BQSR")
@WorkflowInput
public File BQSR_RECAL_FILE;
public File bqsrRecalFile; // tsato: change to GATKPath?
Contributor

That would be a good thing to do but we could do it in a separate PR.

Contributor Author

Yes

@@ -26,14 +28,14 @@

public final class BQSRReadTransformer implements ReadTransformer {
private static final long serialVersionUID = 1L;

// tsato: remove?
Contributor

I think it can be removed.

private byte[] staticQuantizedMapping;
private final CovariateKeyCache keyCache;

// tsato: new flag, needs to be moved to apply BQSR argument
private boolean allowMissingReadGroup;
Contributor

  1. allowMissingReadGroup -> allowMissingReadGroups
  2. make this final
  3. I think the comment is resolved.

Contributor Author

done

@@ -92,6 +92,8 @@ public abstract class GATKBaseTest extends BaseTest {
// Variants from a DBSNP 138 VCF form the first 65Mb of chr1
public static final String dbsnp_138_b37_1_65M_vcf = largeFileTestDir + "dbsnp_138.b37.1.1-65M.vcf";

public static final String BQSR_TEST_RESOURCE_DIR = toolsTestDir + "BQSR/";
Contributor

This seems like the wrong place for this constant. I'd put it in a BQSR test or maybe just use the CommandLineProgram.getTestDataDir() if that works.

Contributor Author

removed

IllegalStateException.class);
spec.executeTest("testMissingReadGroup", this);
public void testMissingReadGroup() {
// TODO: there are two copies of this file (one under Validation and one under BQSR) in the repo; conslidate.
Contributor

That's fine. Duplicated test files are alright. Git deduplicates them internally for transfer so they don't really take up meaningful amounts of room.

Contributor Author

sounds good!

final byte[] oldQuals = ReadUtils.getOriginalBaseQualities(originalRead);

if (ReadUtils.getPlatformUnit(originalRead, originalBamHeader).equals(readGroupToFilterOut)) {
// These are the read groups that are not in teh recal table.
Contributor

typo :)

Contributor Author

fixed


if (ReadUtils.getPlatformUnit(originalRead, originalBamHeader).equals(readGroupToFilterOut)) {
// These are the read groups that are not in teh recal table.
// final Random random = new Random();
Contributor

Can this be deleted?

Contributor Author

yes

@@ -97,12 +97,12 @@ public void testRecalDatumCopyAndCombine(RecalDatumTestProvider cfg) {
@Test(dataProvider = "RecalDatumTestProvider")
public void testRecalDatumModification(RecalDatumTestProvider cfg) {
RecalDatum datum = cfg.makeRecalDatum();
datum.setEmpiricalQuality(10.1);
Assert.assertEquals(datum.getEmpiricalQuality(), 10.1);
datum.setEmpiricalQuality(10); // tsato: 10.1 -> 10. Downstream data might be affected.
Contributor

Does changing the quality from double -> int potentially change the results of the recalibration table by rounding earlier? I'm not sure what the advantage of that is?

@takutosato
Contributor Author

@lbergelson thanks for the review! I will address the changes as soon as I can.

The motivation for the double -> int change is in RecalDatum::bayesianEstimateOfEmpiricalQuality(). In the master branch now, this method always returns a (phred-scaled integer) / RESOLUTION_BINS_PER_QUAL, and RESOLUTION_BINS_PER_QUAL is a final constant set to 1. So this method really returns an integer, and my change attempts to make this clear.
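The argument above can be made concrete with a small sketch. This follows the names in RecalDatum but the method body below is a simplified illustration, not the actual GATK code: with RESOLUTION_BINS_PER_QUAL fixed at 1, rounding to the nearest bin always lands on a whole number, so storing the result as an int loses no precision.

```java
// Sketch of why the double -> int change is safe: with one bin per quality
// unit, the binned empirical quality is always an integer value.
public class EmpiricalQualityBinning {

    // Final constant set to 1 in the current code, per the discussion above.
    private static final int RESOLUTION_BINS_PER_QUAL = 1;

    /** Round a raw phred-scaled quality to the nearest resolution bin. */
    public static double binQuality(final double rawPhredQuality) {
        return Math.round(rawPhredQuality * RESOLUTION_BINS_PER_QUAL)
                / (double) RESOLUTION_BINS_PER_QUAL;
    }

    public static void main(String[] args) {
        final double binned = binQuality(10.1);
        // With RESOLUTION_BINS_PER_QUAL == 1, 10.1 bins to exactly 10.0,
        // matching the 10.1 -> 10 change in testRecalDatumModification.
        System.out.println(binned);
    }
}
```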

@gatk-bot

gatk-bot commented Nov 19, 2024

Github actions tests reported job failures from actions build 11903509453
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud 17.0.6+10 11903509453.10 logs
unit 17.0.6+10 11903509453.12 logs
integration 17.0.6+10 11903509453.11 logs
unit 17.0.6+10 11903509453.1 logs
integration 17.0.6+10 11903509453.0 logs

@gatk-bot

gatk-bot commented Jan 10, 2025

Github actions tests reported job failures from actions build 12702971619
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 12702971619.12 logs
unit 17.0.6+10 12702971619.1 logs

@gatk-bot

gatk-bot commented Feb 9, 2025

Github actions tests reported job failures from actions build 13226664486
Failures in the following jobs:

Test Type JDK Job ID Logs
unit 17.0.6+10 13226664486.12 logs
cloud 17.0.6+10 13226664486.10 logs
integration 17.0.6+10 13226664486.11 logs
unit 17.0.6+10 13226664486.1 logs
conda 17.0.6+10 13226664486.3 logs
integration 17.0.6+10 13226664486.0 logs

@takutosato
Contributor Author

@lbergelson back to you. I want to address the TODO regarding when an unknown read group is encountered by BaseRecalibrator in a future PR (it should probably throw an error as it did before, but it doesn't.) Let me know if that's ok.

Other than the tests in the repo, I ran the new code on Element Bio WGS data, and it produced output identical to the old master.

Contributor

@lbergelson lbergelson left a comment

@takutosato Thank you!

@takutosato takutosato merged commit 8a032ba into master Feb 26, 2025
20 checks passed
@takutosato takutosato deleted the ts_bqsr_read_group branch February 26, 2025 19:50
gbggrant added a commit that referenced this pull request Apr 23, 2025
* Funcotator Update for Datasource Release V1.8 (#8512)

* Fixed Funcotator VCF output renderer to correctly preserve B37 contig names on output for B37 aligned files (#8539)

* Fix for events not in minimal representation (#8567)

* Ultima.flow annotations.fix (#8442)

* hmer ondel must have mon length

* Revert "hmer ondel must have mon length"

This reverts commit 7852871.

* remove superfluous variant type condition

* fix error message to actually reflect missing argument

* fixed unittest to include variant type

* Remove conflict

* Removes unnecessary and buggy validation check (#8580)

* Additional fix + logging fixes
* Added missing initialization

* Update our HTSJDK dependency to 4.0.2 (#8584)

* Update picard to 3.1.1 (#8585)

* Add option to AnalyzeSaturationMutagenesis to keep disjoint mates (#8557)

* Add option for keeping disjoint mates in ASM

* Better name and fixing reports

* Finish fixing report

* Fix report name

* New/Updated Flow Based Read tools (#8579)

New Tool: GroundTruthScorer
Update: FlowFeatureMapper

* GroundTruthScorer doc update (#8597)

* Add a native GATK implementation for 2bit references, and remove the dependency on the ADAM library (#8606)

* Add a native GATK implementation for 2bit references, with comprehensive unit tests

* For now, this is only hooked up to the Spark codepath, but it could easily be hooked up to ReferenceDataSource and the Walker codepath as well

* Remove the dependency on the ADAM library, to resolve conflicts with future dependency upgrades

* Update dependencies to address security vulnerabilities, and add a security scanner to build.gradle (#8607)

* Updated many GATK dependencies to address known security vulnerabilities

* Added a security scanner to build.gradle

* There are still some remaining vulnerabilities in GATK dependencies, but this eliminates most of them

* Update http-nio and wire its new settings (#8611)

* Update http-nio and wire it so it's configured at startup along with GCS setttings.

* PrintFileDiagnostics for cram, crai and bai. (#8577)

* New experimental tool to print out human readable file diagnostics for cram/crai/bai files.

* Allow GenomicsDBImport to connect to az:// files without interference (#8438)

* GATK's lack of support for az:// URIs means that although GenomicsDB can
  natively read them, parts of the java code crash when interacting with them
* Adding --avoid-nio and --header arguments
  These allow disabling all of the java interaction with the az:// links
  and simply passing  them through to genomicsdb
  This disables some safeguards but allows operating on files in azur
* Update GenomicsDB version to 1.5.1 for azure improved support

* There are no direct tests on azure since we do not yet have any infrastructure
  to generate the necessary tokens, there is a disabled test which requires
  #8612 before we can enable it.

---------

Co-authored-by: Nalini Ganapati <nalini@datma.com>
Co-authored-by: Nalini Ganapati <nalinigans@github.com>

* Disable line-by-line codecov comments (#8613)

* Support for custom ploidy regions in HaplotypeCaller (#8609)

For having variable ploidy in different regions, like making haploid calls outside the PAR on chrX or chrY, 
there is now a --ploidy-regions flag. The -ploidy flag sets the default ploidy to use everywhere, and --ploidy-regions
should be a .bed or .interval_list with "name" column containing the desired ploidy to use in that region
when genotyping. Note that variants near the boundary may not have the matching ploidy since the ploidy used will be determined using the following precedence:

* ploidy given in --ploidy-regions for all intervals overlapping the active region when calling your variant
  with ties broken by using largest ploidy); note ploidy interval may only overlap the active region and determine 
  the ploidy of your variant even if the end coordinate written for your variant lies outside the given region
* ploidy given via global -ploidy flag
* ploidy determined by the default global built-in constant for humans (2).

---------

Co-authored-by: Ty Kay <kay@wm6d8-674.broadinstitute.org>
Co-authored-by: rickymagner <81349869+rickymagner@users.noreply.github.com>

* Update the GATK base image to a newer LTS ubuntu release (#8610)

* Update the GATK base image to the latest Ubuntu LTS release (22.04)

* Add some additional useful utilities to the base image

* Switch to a newer conda version with a much faster solver

* Update the scripts and documentation for building the base image

* Update the VETS integration tests to allow for a small epsilon during numeric comparisons, and include the full diff output in exception messages when a mismatch is detected

* build_docker_remote: add ability to specify the RELEASE arg to the cloud-based docker build, and add a release script (#8247)

* Added a -r argument to build_docker_remote.sh to toggle the RELEASE flag during
  docker builds

* Added a release_prebuilt_docker_image.sh to release a prebuilt docker image to the
  official repos

* Update to htsjdk 4.1.0 (#8620)

* update to htsjdk 4.1.0 which enables http-nio in more cases
* remove several test cases handling genomicsdb path parsing which were testing nonsensical paths that are now illegal

* Fix the Spark version in the GATK jar manifest, and used the right constant in build.gradle (#8625)

* Update http-nio to 1.1.0 which implements Path.resolve() methods (#8626)

* This should make http access seamless in many places

* The way this handles query parameters is not ideal for signed url cases so we'll need to revisit that

* Fix GT header in PostprocessGermlineCNVCalls's --output-genotyped-intervals output (#8621)

* Write gCNV interval output ID=GT header as Type=String

Incorrectly writing this as Type=Integer causes bcftools to misparse
the genotype field.

* Use correct header types and numbers in test VCF file

* Output the new image name at the end of a successful cloud docker build (#8627)

* Reduce SVConcordance memory footprint (#8623)

* Rewrite complex SV functional annotation in SVAnnotate (#8516)

* Improvements to Mutect2's Permutect training data mode (#8663)

* include normal seq error log likelihood in Permutect dataset

* handle different alelle representations in multiallelic / indel variants for Permutect training data mode

* set the default artifact to non-artifact ratio to 1 in Permutect training data mode

* normal artifact lod is now defined without the extra minus sign (#8668)

* Parameterize the logging frequency for ProgressLogger in GatherVcfsCloud. (#8662)

* Handle CTX_INV subtype in SVAnnotate (#8693)

* Move to GenomicsDB 1.5.2 which supports M1 macs (#8710)

* Support for MacOS universal builds (intel AND M1)
* Catch JNI importer exceptions and propagate them as java IOExceptions
* Turn off HDFS support by default

Co-authored-by: Nalini Ganapati <nalini@datma.com>

* Standardize test results directory between normal/docker tests (#8718)

Normalize the test results location between docker and non docker tests

* fix no data hom refs (#8715)

Output GQ0 genotypes from reference blocks as no-call rather than hom-ref

* Update the setup_cloud github action (#8651)

* update setup-gcloud@v0 -> v2 since v0 is deprecated

* Improve failure message in VariantContextTestUtils (#8725)

* added --inverted-read-filter argument to allow for selecting reads that fail read filters from the command line easily (#8724)

* GatherVcfsCloud is no longer beta (#8680)

* Fix to long deletions that overhang into the assembly window causing exceptions in HaplotypeCaller (#8731)

* dragstr model in Mutect2 WDL (#8716)

* Update README to include list of popular software included in docker image (#8745)

* Update README to include list of popular software included in docker image

* Now we should be excluding the test folder from code coverage calculations (#8744)

* Make M2 haplotype and clustered events filters smarter about germline events (#8717)

* M2 bad haplotype filter does not filter variants that share a haplotype with a germline event

* two ECNT annotations -- haplotype and region -- and clustered events filter looks at both

* Funcotator: suppress a log message about b37 contigs when not doing b37/hg19 conversion (#8758)

Don't print the very long and misleading "The following contigs are present in b37 and
missing in the input VCF sequence dictionary" log message when we're not even doing b37/hg19
conversion.

* SNVQ recalibration tool added for flow based reads (#8697)

Co-authored-by: Dror Kessler <dror.kessler@ultimagen.com>

* Several GQ0 cleanup changes: (#8741)

Set GGVCFs --all-sites GQ0 hom-refs to no-calls
Set regular GGVCFs GQ0 hom-refs to no-calls (any DP, PL) for better AF/AN annotations
Remove PLs in "no data" case where DP=0 for more accurate QUAL score

* Re-commit large files as lfs stubs (#8769)

Several files tracked by git lfs were accidentally reimported as normal files.
This makes them stubs again.

* Enable ReblockGVCF to subset AS annotations that aren't "raw" (pipe-delimited) (#8771)

* Enable ReblockGVCF to subset AS annotations that aren't "raw" (i.e. pipe-delimited)

* Fix tests by removing AssemblyComplexity from default annotations

* Gc getpipeupsummaries use mappingqualityreadfilter (#8781)

* Add MappingQualityReadFilter

* Added additional warnings for mmq

* Fixed doc typo

* Add malaria spanning deletion exception regression test with fix (#8802)

* Add malaria spanning deletion exception regression test with fix

* Disabling codecov.

---------

Co-authored-by: Jonn Smith <jonn@broadinstitute.org>

* Bug fix in flow allele filtering  (#8775)

* Fixed a bug that prevented filtering by SOR in many cases

* Allow for GT to be a nocall if GQ and PL[0] are zero instead of homref in GenomicsDB (#8759)

* Allow for GT to be a nocall if GQ and PL[0] are zero instead of homref in GenomicsDB

* Move to 1.5.3 release from snapshot

---------

Co-authored-by: Nalini Ganapati <nalini@datma.com>
Co-authored-by: Nalini Ganapati <nalinigans@github.com>

* Reduced docker layers in GATK image from 44 to 16 (#8808)

* Reduced total layers in the GATK docker image from 44 down to 16.

* Reduced GATK base image layers from 20 to 3.

* This might be a better solution than a full squash down to a single layer, because: 

If we are hosting this in a premium ACR, the limit is 10,000 readOps per minute. So with 16 layers, you get around 625 pulls per minute. Also, this will be able to still take advantage of parallel pulls (default is 3, but at most 16 threads in this case, I believe) as opposed to one big layer which will not download in parallel. There's the potential of that being a lot slower and subsequent jobs falling into the same "minute" because others are not done, making it easier to hit that 10k readOps limit. Lastly, people using GATK outside data pipelines will not be able to take advantage of layer caching too.

Resolves #8684

* VariantFiltration: added arg to write custom mask filter description in VCF header (#8831)

Added a --mask-description argument to VariantFiltration to write a custom description for the mask filter in the VCF header

* Bigger Permutect tensors and Permutect test datasets can be annotated with truth VCF (#8836)

* added 20 more Permutect read features

* Permutect test data can, like training data, be annotated with a truth VCF

* [BIOIN-1570] Fixed edge case in variant annotation  (#8810)

* [BIOIN-1570] Fixed edge case in variant annotation when the variant is close to the edge of the reference

* Mutect2 germline resource can have split multiallelic format (#8837)

* Mutect2 WDL and GetSampleName can handle multiple sample names in BAM headers (#8859)

* Permutect dataset engine outputs contig and read group indices, not names (#8860)

* Fixed a bug in AlleleFiltering that ignored more than a single sample (#8841)

* Fixing bug in ReblockGVCFs when removing annotations (#8870)

* fix for gnarly when PLs are null (#8878)

* Fix GenotypeGVCFs with mixed ploidy sites (#8862)

* Remove deprecated genomes in the cloud docker image that was causing CNN WDL test failures (#8891)

Replace usage of a deprecated Genomes-in-the-Cloud docker image that was causing the CNN WDL tests to fail with the GATK base image.

* Restore gnarly tests (#8893)

* Update no-calls in Gnarly haploid test
* Put back turned off tests

* Remove header lines in ReblockGVCFs when we remove FORMAT annotations (#8895)

* remove header lines when removing annotation in Reblocking

* clean up

* Update several dependencies to fix vulnerabilities (#8898)

* Update http-nio to 1.1.1 (#8889)

This release fixes several bugs when using HttpPath resolve methods that have been affecting users. In particular, they were causing issues when trying to construct HTTP URIs for companion files such as fasta/bam indices.

* Inverted SoftClippedReadFilter to conform to the standard filtering logic (#8888)

* Tool to detect CRAM base corruption caused by GATK issue 8768 (#8819)

A new diagnostic tool, CRAMIssue8768Detector, that analyzes a CRAM file to look for possible base corruption caused by #8768

This issue affects GATK versions 4.3.0.0 through 4.5.0.0, and is fixed in GATK 4.6.0.0.
This issue also affects Picard versions 2.27.3 through 3.1.1, and is fixed in Picard 3.2.0.

The bug is triggered when writing a CRAM file using one of the affected GATK/Picard versions,
and both of the following conditions are met:

    * At least one read is mapped to the very first base of a reference contig
    * The file contains more than one CRAM container (10,000 reads) with reads mapped to that same reference contig

When both of these conditions are met, the resulting CRAM file may have corrupt containers containing reads with an incorrect sequence.

This tool writes a report to an output text file indicating whether the CRAM file appears to have read base corruption caused by issue 8768, and listing the affected containers. By default, the output report will have a summary of the average mismatch rate for all suspected bad containers and a few presumed good containers in order to determine if there is a large difference in the base mismatch rate.

Optionally, a TSV file with the same information as the textual report, but in tabular form, can be written using the "--output-tsv" argument.

---------

Co-authored-by: David Roazen <droazen@broadinstitute.org>

* Update HTSJDK to 4.1.1 and Picard to 3.2.0 (#8900)

Includes a unit test to check for the presence of the fix in HTSJDK 4.1.1 for the
CRAM base corruption bug reported in #8768

Resolves #8768

* Handle CTX_PP/QQ and CTX_PQ/QP CPX_TYPE values in SVConcordance (#8885)

* Updated Python and PyMC, removed TensorFlow, and added PyTorch in conda environment. (#8561)

* Updated Python and PyMC, removed TensorFlow, and added PyTorch in Docker and conda environments. Notable environment changes: python 3.6.10 -> 3.10.13, pymc 3.1 -> 5.10.0, theano 1.0.4 -> pytensor 2.18.1, added pytorch 2.1.0, removed tensorflow 1.15.0 and other CNN dependencies, added libblas-dev to the base Docker.

* Updated gCNV code to account for changes from PyMC3/Theano to PyMC/PyTensor.

* Updated gCNV integration tests.

* Updated gCNV WDL tests.

* Updated other tests and tools affected by environment changes.

* Reverted posterior sampling to online implementation.

* Updated localDevCondaEnv task in build.gradle.

* Addressed review comments and cleaned up TODOs.

* Released gatkbase-3.3.0 and updated Dockerfile.

* Added DeprecatedFeature tags to CNNScoreVariants, CNNVariantTrain, and CNNVariantWriteTensors.

* Complex SV intervals support (#8521)

* Add support for CPX variants in SVCluster and SVConcordance

* Update hdf5-java-bindings to version 1.2.0-hdf5_2.11.0, which removes log4j 1.x (#8908)

* Clarify in the README which git lfs files are required to build GATK (#8914)

Resolves #8912

* Modified HaplotypeBasedVariantRecaller to support non-flow reads (#8896)

* X_FILTERED_COUNT semantics adjusted in FlowFeatureMapper (#8894)

* adding test to match WARP tests edge case (#8928)

* Adds VcfComparator tool (#8933)

* Systematically compare the outputs of two pipeline versions, especially
post-reblocking when many deletion "cleanup" changes occur

* Subset annotations to account for different alleles

* Refactor and clean up

* Check ref blocks

* Be explicit about AS_RAW annotation differences

* Update changed star attributes from NaNs to missing; add option to
ignore

* Ignoring star attributes still needs work

* Lookin' pretty good...

* Add new mode to log all differences before failing

* Misplaced brace prevented PL checks

* Check on */* genotypes was failing DRAGEN outputs -- removed

* clean up after rebase

* adding test

* style

* not sure why we need to cap values for the comparator so removing it

* addressing comments

* more comments

---------

Co-authored-by: Laura Gauthier <gauthier@broadinstitute.org>

* Add docs about citing GATK (#8947)

* Issue #7159 Create tool for producing genomic regions (as a BED file) (#8942)

* Initial commit and basic code to read gtf

* add: code to write to bed & integration test

* fix: make getAllFeatures public and use the nesting of features to get to transcripts

* add: filtering transcripts by basic tag

* add: sorts by contig and start (need to fix - sorting lexicographically)

* fix: now sorts by contig then start & output is correct
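The sorting bug fixed above is the classic one: comparing start positions as strings puts start 100 before start 2. A minimal sketch of the corrected ordering in plain Python; the record shape and contig-ranking approach are assumptions, not the tool's actual code:

```python
def sort_bed_records(records, contig_order):
    """Sort BED-style (contig, start) records by contig rank from the
    sequence dictionary, then by numeric start position."""
    rank = {contig: i for i, contig in enumerate(contig_order)}
    return sorted(records, key=lambda r: (rank[r[0]], int(r[1])))
```

Ranking contigs by their order in the sequence dictionary also avoids the related pitfall of sorting contig names lexicographically (chr10 before chr2).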

* fix: make dictionary an arg

* add: comments + simplified CompareGtfInfo

* refactor: apply method
test: add separate tests for gene and transcript

* refactor: onTraversalSuccess and writeToBed

* add: more tests

* fix: test files in correct dir pt1. (files are too large)

* fix: test files in correct dir pt2.

* add: compareFiles and ground truth bed files

* fix: runGtfToBed assert

* add: comments to GtfToBed

* fix: error handling for different versions of gtf and dictionary

* fix: edited some bad conventions

* fix: remove spaces from input file fullName

* add: gtf file with MYT1L and MAPK1

* add: many transcripts unit test and refactoring

* add: tiebreaker sorting by id

* add: make sort by basic optional

* add: html doc comment

* fix: dictionary arg

* fix: add "Gencode" to description

* add: sample mouse gencode testing

* fix: Remove arg shortnames

* fix: rename and move CompareGtfInfo

* fix: kebab-case args

* fix: update html doc

* fix: use IntegrationTestSpec.assertEqualTextFiles()

* fix: remove unnecessary test of pik3ca

* fix: remove set functions in GtfInfo

* fix: style of comparator

* fix: style of comparator

* fix: use Files.newOutputStream() to write and logger for errors

* fix: use getBestAvailableSequenceDictionary()

* fix: use dataProvider for integration tests

* fix: better encapsulation

* fix: move mapk1.gtf to large dir

* fix: arg names

* fix: rename reference dict.

* fix: sequence-dictionary arg javadoc

* add: javadoc to GtfInfo

* add: dictionary exception and corresponding test

* add: test with fasta file as reference arg

* add: javadoc for fasta file

* fix: javadoc and onTraversalStart exception

* Update the large test CRAM files to CRAM v3.0. (#8832)

* Update detector output files. (#8971)

Repair the CRAM detector/diagnostics test output to reflect the CRAM file name change that was introduced by updating the large CRAM files to v3.0.

* Require both overlap and breakend proximity for depth-only SV clustering (#8962)

* Adds QD and AS_QD emission from VariantAnnotator on GVS input (#8978)

* Updating VariantAnnotator to emit QD and AS_QD on GVS input

* clean up

* addressing comments

* Added new argument "--variant-output-filtering" to variant walkers for customized filtering of output variants to intervals. (#6388)

Co-authored-by: James <emeryj@broadinstitute.org>

* Swapped mito mode in Mutect to use the mode argument utils, providing better override behavior and documentation (#8986)

* Warn instead of throwing when querying intervals that were not in GenomicsDBImport (#8987)

* Switch to logging a warning instead of an exception for intervals in a query that were not part of GenomicsDBImport
* upgrade GenomicsDB 1.5.3 -> 1.5.4
* fixes #8415

* Updates to VcfComparator (#8973)

* updates

changes for gq0 comparison

more changes

some vcfs have no gqs

clean up

change for NPE

clean up

whitespace

* addressing comments

* Update Mutect2.java Documentation (#8999)

This commit corrects the documented default --mitochondria-mode parameters to match what they actually are in the source code, based on a question raised in the forum:

https://gatk.broadinstitute.org/hc/en-us/community/posts/29969305493147-Incorrect-documentation-for-mitochondria-mode-in-Mutect2

This commit also corrects the name of the mitochondria mode flag in the documentation.

* Add dependency submission workflow to monitor vulnerabilities (#9002)

* Adds a new github actions workflow to submit our dependencies for analysis

* Add more detailed conda setup instructions to the GATK README (#9001)

Our README was missing some important details on how to properly setup the GATK conda environment locally.

* Port of nvscorevariants into GATK, with a basic tool frontend (#8004)

* Port of the NVIDIA-authored nvscorevariants tool into GATK, with a basic tool frontend

* This is a direct replacement for the legacy tool CNNScoreVariants. It produces results that are almost identical to that tool, but is implemented on top of a more modern ML library, PyTorch.

* The Python code is taken from https://github.com/NVIDIA-Genomics-Research/nvscorevariants, with a few minor modifications necessary to get the tool working on newer versions of the Python libraries.

* Added pytorch-lightning to the GATK conda environment, as it's required by this tool

* Disabled jacoco in build.gradle, as it was causing strange errors related to jacoco trying to parse the new PyTorch model files in the resources directory

---------

Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Re-added `--only-output-calls-starting-in-intervals` as a deprecated argument. (#9000)

* Adding small warning messages not to feed GVCF files to these tools (#9008)

* Adding small warning messages not to feed GVCF files to these tools.

* Update src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/VariantRecalibrator.java

Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Update src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotator.java

Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Update src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotator.java

Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

---------

Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Update docker base (#9005)

* Rebuild base image to incorporate recent patches
* gatkbase-3.3.0 -> 3.3.1

* Added a check for whether files can be created and executed within the configured tmp-dir (#8951)

* Update gradle and build.gradle (#8998)

* Update gradle and build.gradle

* update gradle wrapper 8.2.1 ->  8.10.2
* remove 'versions' plugin because we don't use it
* update gradle plugins to new versions
  * shadow plugin changed maintainers and coordinates
	com.github.johnrengelman.shadow:8.1.1 -> com.gradleup.shadow:8.3.3
  * git-version 0.5.1 -> 3.1.0
  * sonatype scan 2.6.1 -> 2.8.3
  * download 5.4.0 -> 5.6.0
* use tasks.register() which is the newer style
* add gradle.properties to configure daemon memory

* update several vulnerable dependencies to safer versions
* update htsjdk and picard
* replace deprecated override of ThresholdStream.getStream() -> getOutputStream()

* Remove CNNScoreVariants, CNNVariantTrain, and CNNVariantWriteTensors (#9009)

Removed the code for these tools, now that NVScoreVariants is merged
and we've moved to a Python environment that is incompatible with these
legacy tools.

Also added the tools to the DeprecatedToolsRegistry.

* Mark NVScoreVariants as a beta feature (#9010)

* Additional Dependency updates (#9006)

* Updating dependency management and vulnerable dependencies

* Update dependencies to fix vulnerabilities as reported in #8950
* Update our dependency management to make use of some newish gradle features
  * Add dependency constraints to update transitive dependencies, this allows us to specify versions without making them
    direct dependencies
  * Remove most force expressions and replace them where necessary with version strict requirements
  * Make use of several published bom's to configure consistent dependency versions for platforms like netty and log4j2
  * Remove exclude statements that are now handled by variant dependency resolution (like guava android vs jdk)
* Exclude the org.bouncycastle:bcprov-jdk15on dependency and replace it with bcprov-jdk18on.
  This adds an unnecessary testUtilImplementation-level dependency on what is really a transitive dependency, but I couldn't get gradle's explicit version replacement logic to work, so this is a workaround.

* Added a '--prefer-mane-transcripts' mode that enforces MANE_Select tagged Gencode transcripts where possible (#9012)

* Added a '--prefer-mane-transcripts' mode that enforces MANE_Select tagged Gencode transcripts where possible

* Use jetty bom to enforce uniform jetty versions (#9016)

* fix dependabot alert for jetty-xml
* bom version -> 9.4.56.v2024082

* Change the kryo version specification to use maven style [,) instead of + (#9018)

* maven central doesn't like + in version ranges and gradle doesn't translate it

* Port of `CallableLoci` from GATK3 (#9031)

* This is a port of a tool from GATK3 which is being used in some active pipelines.
* CallableLoci is a tool which calculates coverage-based stats and identifies regions of the genome which should be considered "callable" or not, based on several filters.

* Retain all source IDs on VariantContext (#9032)

* Retain all source IDs on VariantContext merge

In GATK3, when merging variants, the IDs of all the source VCFs were retained. This code path seems to have intended that, since the variantSources set is generated, but it doesn't get used for anything. This PR uses that set to set the source of the resulting merged VC.

* Add overloaded method to GATKVariantContextUtils.simpleMerge to preserve existing behavior when storeAllVcfSources=false
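The idea of the change above, reduced to its essence: collect the source ID of every input variant being merged and keep them all, rather than keeping only the first. A minimal sketch in plain Python; the function name and the joined-string representation are illustrative, not GATK's actual API:

```python
def merge_sources(variant_sources, store_all_vcf_sources=True):
    """Combine the source IDs of the variants being merged.
    With store_all_vcf_sources=False, keep only the first source
    (the pre-existing behavior)."""
    if not store_all_vcf_sources:
        return variant_sources[0]
    # deduplicate while keeping first-seen order
    seen = dict.fromkeys(variant_sources)
    return "-".join(seen)
```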


Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Add SVStratify and GroupedSVCluster tools (#8990)

* VariantRecalibrator R script fix for recent R versions. (#9046)

* change space=rgb to space=lab

* change space=rgb to space=lab

* VQSR Rscript fix

* Update README.md (#9047)

* Update README to match our R and Python requirements. 
* Update brew installation instructions.


Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Prioritize het calls when merging clustered SVs (#9058)

* Updating upload_artifact in github actions (#9061)

Update upload_artifact from 3 -> 4

* Close a FeatureReader after use in HaplotypeCallerEngine (#9078)

* Bump ch.qos.logback:logback-core from 1.4.14 to 1.5.13 (#9079)

* Bump ch.qos.logback:logback-core from 1.4.14 to 1.5.13

Bumps [ch.qos.logback:logback-core](https://github.com/qos-ch/logback) from 1.4.14 to 1.5.13.
- [Commits](qos-ch/logback@v_1.4.14...v_1.5.13)

---
updated-dependencies:
- dependency-name: ch.qos.logback:logback-core
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* also update logback-classic to keep it in sync

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Remove unecessary calculation and methods in FlowBasedRead (#9077)

* removed the field FlowBasedRead.forwardSequence, which was calculated at non-trivial cost and then only used for its length
* removed the method seqlength because it was a more convoluted version of getLength
* fixed some typos and finals because I couldn't help myself

* Lots of little changes to Permutect dataset generation (#9094)

* alt downsampling is drawn uniformly at random, not matching artifact distribution
* fixed the variant distance from end of read to account for clipping
* feature range only goes up to 20 since reads can be clipped that small
* took out total read mismatches in favor of ranged mismatches
* removed STR INFO features now computed in Python; added TLOD / alt count INFO feature
* renamed mutect3 and m3 to permutect everywhere
* M2 can create test and training datasets at the same time (also in WDL), with distinct truth VCFs
* masked intervals, option to skip filtering in M2 WDL
* argument for considering only a fraction of germline sites active i.e. partial genotypeGermlineSites mode
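The first bullet above describes drawing the retained alt reads uniformly at random instead of matching the artifact distribution, so that no read property biases which reads survive downsampling. A minimal sketch in plain Python; the cap argument and function name are assumptions for illustration:

```python
import random

def downsample_alt_reads(alt_reads, max_alt_count, rng=None):
    """Keep at most max_alt_count alt reads, each chosen uniformly
    at random from the input."""
    if len(alt_reads) <= max_alt_count:
        return list(alt_reads)
    rng = rng or random.Random()
    return rng.sample(alt_reads, max_alt_count)
```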

* Bring in some performance improvements from genomicsdb (#9059)

* Bring in some performance improvements from genomicsdb which fix a problematic memory leak
* Move to 1.5.5 version of genomicsdb

* BQSR: avoid throwing an error when read group is missing in the recal table, and some refactoring. (#9020)

* Addressed an edge case in ScoreVariantAnnotations that can occur when one variant type is not present in the input VCF. (#9112)

* Update netty version (#9120)

* netty 4.1.114.Final -> 4.1.118.Final
* includes security patches

* Uncomment line excluding bad version of bouncycastle library (#9129)

* This seems to have been accidentally commented out for testing and never un-commented

* Exclude logback-classic to fix logger configuration warning (#9128)

* Updated references to the funcotator datasets bucket to point to the new google bucket (#9131)

* Update gradle sonatype plugin (#9133)

* 2.8.3 -> 3.0.0

* Bump org.apache.commons:commons-vfs2 from 2.9.0 to 2.10.0 (#9130)

* Bump org.apache.commons:commons-vfs2 from 2.9.0 to 2.10.0
* Remove use of newly deprecated methods in test code: 
 RandomStringUtils.randomAlphanumeric() -> RandomStringUtils.insecure().nextAlphanumeric()

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>

* Update to org.apache.commons.lang3 from org.apache.commons.lang

* Fixed the ExtractCohortToVcfTest and ExtractCohortToPgenTest

* Do not put QD into extracted VCFs

* Added the --validation-type-to-exclude ALLELES flag to ValidateVariants to get around issue where GTs with GQ=0 now being no-called in extracted VCF and thus causing some alt alleles to not be present

* Update the VDS Creation to transmute LGT so that it is a no-call if the GQ is 0
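The VDS change above can be summarized as: treat a genotype whose GQ is 0 as a no-call, since the call carries no confidence. A minimal sketch in plain Python over a simple genotype tuple; the field names follow VCF/VDS convention but the function itself is hypothetical:

```python
NO_CALL = (None, None)

def transmute_lgt(lgt, gq):
    """Return the local genotype (LGT), replaced with a no-call
    when the genotype quality (GQ) is 0."""
    return NO_CALL if gq == 0 else lgt
```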

* Fix for code scanning alert no. 5: Arbitrary file access during archive extraction ("Zip Slip")

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jamesemery <emeryj@broadinstitute.org>
Co-authored-by: meganshand <mshand@broadinstitute.org>
Co-authored-by: Dror Kessler <dror27.kessler@gmail.com>
Co-authored-by: Ilya Soifer <ilya.soifer@ultimagen.com>
Co-authored-by: droazen <droazen@broadinstitute.org>
Co-authored-by: Louis Bergelson <louisb@broadinstitute.org>
Co-authored-by: Chris Macdonald <17653365+odcambc@users.noreply.github.com>
Co-authored-by: Chris Norman <cnorman@broadinstitute.org>
Co-authored-by: Nalini Ganapati <nalini@datma.com>
Co-authored-by: Nalini Ganapati <nalinigans@github.com>
Co-authored-by: Ty Kay <kay@wm6d8-674.broadinstitute.org>
Co-authored-by: rickymagner <81349869+rickymagner@users.noreply.github.com>
Co-authored-by: John Marshall <jmarshall@hey.com>
Co-authored-by: Mark Walker <markw@broadinstitute.org>
Co-authored-by: epiercehoffman <epierceh@broadinstitute.org>
Co-authored-by: David Benjamin <davidben@broadinstitute.org>
Co-authored-by: Nalini Ganapati <35076948+nalinigans@users.noreply.github.com>
Co-authored-by: Laura Gauthier <gauthier@broadinstitute.org>
Co-authored-by: Kevin Lydon <klydon@broadinstitute.org>
Co-authored-by: Dror Kessler <dror.kessler@ultimagen.com>
Co-authored-by: Gökalp Çelik <37572619+gokalpcelik@users.noreply.github.com>
Co-authored-by: Jonn Smith <jonn@broadinstitute.org>
Co-authored-by: Kevin Palis <palis@broadinstitute.org>
Co-authored-by: samuelklee <samuelklee@users.noreply.github.com>
Co-authored-by: Sana Shah <54493565+sanashah007@users.noreply.github.com>
Co-authored-by: Jonn Smith <jonn-smith@users.noreply.github.com>
Co-authored-by: bbimber <bbimber@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Takuto Sato <tsato@broadinstitute.org>