Skip to content

bcf_dup() can corrupt records created de novo #1030

@mlin

Description

@mlin

One of those "all day and late into the night" bugs 😎

If you

  1. Fill out a bcf1_t record 'from scratch' (through a bunch of bcf_update_ function calls, rather than reading an existing VCF/BCF file)
  2. bcf_dup() the record
  3. Make further bcf_update_ changes to the record
  4. Write the record out to a file.

The updates made in step 3 are not reflected in the output. I think the reason is that the bcf_dup() in step 2 has the side effect of leaving the record in a 'packed and never unpacked' state, since it packs the fields in order to duplicate them, but nothing was ever unpacked since the record was created de novo. Later, in step 4, bcf1_sync goes to pack the new updates of step 3, but noticing that nothing was ever unpacked, first unpacks the stale data and overwrites the updates!

Here's a dumb repro program,

#include "vcf.h"

int main() {
    /* prelude -- create a toy bcf1_t record from scratch */
    bcf_hdr_t* hdr = bcf_hdr_init("w1");
    bcf1_t* rec = bcf_init();
    bcf_hdr_append(hdr, "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">");
    bcf_hdr_append(hdr, "##contig=<ID=chr1,length=1000>");
    bcf_hdr_add_sample(hdr, "Alice");
    bcf_hdr_sync(hdr);
    rec->rid = 0;
    rec->pos = 99;
    rec->rlen = 1;
    const char *alleles[] = { "A", "G" };
    bcf_update_alleles(hdr, rec, alleles, 2);
    const int32_t gt[] = { bcf_gt_unphased(0), bcf_gt_unphased(0) };
    bcf_update_genotypes(hdr, rec, gt, 2);

    /* here is the bug -- first, perform a no-op bcf_dup*/
    bcf_dup(rec);
    /* then set the record's ID */
    bcf_update_id(hdr, rec, "HelloWorld");

    /* write the header & record out to standard output */
    htsFile *stdout = vcf_open("-", "w0");
    bcf_hdr_write(stdout, hdr);
    bcf_write(stdout, hdr, rec);
    bcf_close(stdout);
}

If the bcf_dup() is there, we get the record with no ID:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=1000>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Alice
chr1	100	.	A	G	0	.	.	GT	0/0

Comment out the bcf_dup() and:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=1000>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Alice
chr1	100	HelloWorld	A	G	0	.	.	GT	0/0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions