Bug in writing best_model.nex file #272

@roblanf

I encountered this error on my own data, so I backed up and followed the online tutorial here: http://www.iqtree.org/doc/Estimating-amino-acid-substitution-models, and got the same error. This tutorial used to work, so I'm not quite sure what's going on.

I'm using IQ-TREE v2.3.5.

To reproduce the error, one can follow the tutorial to the letter:

wget http://www.iqtree.org/doc/data/plant_10alignments.zip
unzip plant_10alignments.zip 
cd plant_10alignments/

# 1st command works fine, as expected...
iqtree2 -seed 1 -T 10 -S train_plant -mset LG,WAG,JTT -cmax 4 -pre train_plant

# 2nd command gives output below, ending in error
iqtree2 -seed 1 -T 10 -S train_plant.best_model.nex -te train_plant.treefile --model-joint GTR20+FO --init-model LG -pre train_plant.GTR20

Output from the final command:

(phylo) rob@rosa:~/Qplant_tutorial/plant_10alignments$ iqtree2 -seed 1 -T 10 -S train_plant.best_model.nex -te train_plant.treefile --model-joint GTR20+FO --init-model LG -pre train_plant.GTR20
IQ-TREE multicore version 2.3.5 for Linux x86 64-bit built Jul  4 2024
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong

Host:    rosa (AVX512, FMA3, 755 GB RAM)
Command: iqtree2 -seed 1 -T 10 -S train_plant.best_model.nex -te train_plant.treefile --model-joint GTR20+FO --init-model LG -pre train_plant.GTR20
Seed:    1 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Wed Jul 10 07:57:56 2024
Kernel:  AVX+FMA - 10 threads (128 CPU cores detected)

Reading partition model file train_plant.best_model.nex ...

Loading 10 partitions...
Reading alignment file train_plant/CDS_OG003327.nex ... Nexus format detected
Alignment has 38 sequences with 229 columns, 170 distinct patterns
112 parsimony-informative, 28 singleton sites, 89 constant sites
     Gap/Ambiguity  Composition  p-value
Analyzing sequences: done in 1.67117e-05 secs
   1  Abi    0.00%    passed     99.57%
...
  38  Zam    0.00%    passed     99.19%
****  TOTAL    0.88%  0 sequences failed composition chi2 test (p-value<5%; df=19)
ERROR: Expecting integer, but found "," instead

Figuring out the bug

That error seems to come from one of two places, both in this block of code:

    for (; *endptr == ' '; endptr++) {}
    str = endptr;
    d = strtol(str, &endptr, 10);
    if ((d == 0 && endptr == str) || abs(d) == HUGE_VALL) {
        if (str[0] == '.') {
            // 2019-06-03: special character '.' for whatever ending position
            d = lower-1;
            endptr++;
        } else {
            string err = "Expecting integer, but found \"";
            err += str;
            err += "\" instead";
            throw err;
        }
    }
    //lower = d_save;
    upper = d;
    // skip blank chars
    for (; *endptr == ' '; endptr++) {}
    if (*endptr != '\\') return;
    // parse the step size of the range
    str = endptr+1;
    d = strtol(str, &endptr, 10);
    if ((d == 0 && endptr == str) || abs(d) == HUGE_VALL) {
        string err = "Expecting integer, but found \"";
        err += str;
        err += "\" instead";
        throw err;
    }
    step_size = d;
}
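
To illustrate why either check can fire, here is a minimal standalone sketch (the helper name starts_with_integer is hypothetical, not IQ-TREE code). When strtol cannot convert anything, it returns 0 and leaves endptr equal to str, which is the condition tested above; a token that begins with a comma behaves this way.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of IQ-TREE): returns true when strtol
// consumes at least one character of `token`, i.e. when the token begins
// (after optional leading whitespace) with a base-10 integer.
bool starts_with_integer(const std::string& token) {
    const char* str = token.c_str();
    char* endptr = nullptr;
    std::strtol(str, &endptr, 10);
    // If no conversion was possible, strtol stores str itself in endptr.
    return endptr != str;
}
```

For example, starts_with_integer(",") is false while starts_with_integer("1-229") is true, so a stray comma in the position field would be enough to trigger the error.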

This code seems to run when IQ-TREE parses the model file train_plant.best_model.nex. For this analysis, that file looks like:

#nexus
begin sets;
  charset CDS_OG003327.nex = train_plant/CDS_OG003327.nex: , ;
  charset CDS_OG003719_F_M.nex = train_plant/CDS_OG003719_F_M.nex: , ;
  charset CDS_OG003934_F_M.nex = train_plant/CDS_OG003934_F_M.nex: , ;
  charset CDS_OG005936.nex = train_plant/CDS_OG005936.nex: , ;
  charset CDS_OG006143.nex = train_plant/CDS_OG006143.nex: , ;
  charset CDS_OG006423.nex = train_plant/CDS_OG006423.nex: , ;
  charset CDS_OG006489_F_M.nex = train_plant/CDS_OG006489_F_M.nex: , ;
  charset CDS_OG007591.nex = train_plant/CDS_OG007591.nex: , ;
  charset CDS_OG007779.nex = train_plant/CDS_OG007779.nex: , ;
  charset CDS_OG008045_F.nex = train_plant/CDS_OG008045_F.nex: , ;
  charpartition mymodels =
    JTT+I{0.245661}+G4{0.761203}: CDS_OG003327.nex{4.40403},
    JTT+R3{0.561677,0.129922,0.284212,1.18801,0.154111,3.82437}: CDS_OG003719_F_M.nex{4.09621},
    LG+G4{0.713662}: CDS_OG003934_F_M.nex{5.22796},
    JTT+I{0.422431}+G4{0.8117}: CDS_OG005936.nex{2.50875},
    LG+G4{1.38001}: CDS_OG006143.nex{9.32513},
    JTT+I{0.245657}+G4{1.26698}: CDS_OG006423.nex{3.97915},
    LG+I{0.231278}+G4{1.21671}: CDS_OG006489_F_M.nex{5.49832},
    JTT+I{0.207393}+G4{1.64759}: CDS_OG007591.nex{6.51156},
    JTT+I{0.236365}+G4{0.896023}: CDS_OG007779.nex{5.78573},
    JTT+I{0.101596}+G4{1.19633}: CDS_OG008045_F.nex{7.4561};
end;

Looking at that file, I suspected that the problem was the empty comma-separated list after each filename, i.e. the ": , ;" that follows each .nex path.

So I re-ran the first command above with some older versions of IQ-TREE.

  1. With v2.3.4, the best_model file looks the same, and the second command gives the same error.
  2. With v2.1.0, the file looks different:
#nexus
begin sets;
  charset CDS_OG003327.nex = train_plant/CDS_OG003327.nex: ;
  charset CDS_OG003719_F_M.nex = train_plant/CDS_OG003719_F_M.nex: ;
  charset CDS_OG003934_F_M.nex = train_plant/CDS_OG003934_F_M.nex: ;
  charset CDS_OG005936.nex = train_plant/CDS_OG005936.nex: ;
  charset CDS_OG006143.nex = train_plant/CDS_OG006143.nex: ;
  charset CDS_OG006423.nex = train_plant/CDS_OG006423.nex: ;
  charset CDS_OG006489_F_M.nex = train_plant/CDS_OG006489_F_M.nex: ;
  charset CDS_OG007591.nex = train_plant/CDS_OG007591.nex: ;
  charset CDS_OG007779.nex = train_plant/CDS_OG007779.nex: ;
  charset CDS_OG008045_F.nex = train_plant/CDS_OG008045_F.nex: ;
  charpartition mymodels =
    JTT+I{0.24506}+G4{0.759131}: CDS_OG003327.nex{4.40106},
    JTT+R3{0.561663,0.129876,0.284201,1.18761,0.154136,3.82477}: CDS_OG003719_F_M.nex{4.096},
    LG+G4{0.713662}: CDS_OG003934_F_M.nex{5.22796},
    JTT+I{0.421361}+G4{0.807842}: CDS_OG005936.nex{2.50855},
    LG+G4{1.37992}: CDS_OG006143.nex{9.32506},
    JTT+I{0.24567}+G4{1.26708}: CDS_OG006423.nex{3.97916},
    LG+I{0.23146}+G4{1.21727}: CDS_OG006489_F_M.nex{5.49889},
    JTT+I{0.207144}+G4{1.64671}: CDS_OG007591.nex{6.51053},
    JTT+I{0.236369}+G4{0.896036}: CDS_OG007779.nex{5.78574},
    JTT+I{0.101066}+G4{1.19426}: CDS_OG008045_F.nex{7.45356};
end;

The difference is that v2.1.0 writes each charset line like so:

  charset CDS_OG008045_F.nex = train_plant/CDS_OG008045_F.nex: ;

but v2.3.4, v2.3.5, and probably a bunch of other versions write them like so, with an extra comma and space before the semicolon:

  charset CDS_OG008045_F.nex = train_plant/CDS_OG008045_F.nex: , ;

So, if I go back to the original file and remove the offending commas and spaces like this:

sed -i 's/: , ;/: ;/g' train_plant.best_model.nex

that file now looks like this:

#nexus
begin sets;
  charset CDS_OG003327.nex = train_plant/CDS_OG003327.nex: ;
  charset CDS_OG003719_F_M.nex = train_plant/CDS_OG003719_F_M.nex: ;
  charset CDS_OG003934_F_M.nex = train_plant/CDS_OG003934_F_M.nex: ;
  charset CDS_OG005936.nex = train_plant/CDS_OG005936.nex: ;
  charset CDS_OG006143.nex = train_plant/CDS_OG006143.nex: ;
  charset CDS_OG006423.nex = train_plant/CDS_OG006423.nex: ;
  charset CDS_OG006489_F_M.nex = train_plant/CDS_OG006489_F_M.nex: ;
  charset CDS_OG007591.nex = train_plant/CDS_OG007591.nex: ;
  charset CDS_OG007779.nex = train_plant/CDS_OG007779.nex: ;
  charset CDS_OG008045_F.nex = train_plant/CDS_OG008045_F.nex: ;
  charpartition mymodels =
    JTT+I{0.245661}+G4{0.761203}: CDS_OG003327.nex{4.40403},
    JTT+R3{0.561677,0.129922,0.284212,1.18801,0.154111,3.82437}: CDS_OG003719_F_M.nex{4.09621},
    LG+G4{0.713662}: CDS_OG003934_F_M.nex{5.22796},
    JTT+I{0.422431}+G4{0.8117}: CDS_OG005936.nex{2.50875},
    LG+G4{1.38001}: CDS_OG006143.nex{9.32513},
    JTT+I{0.245657}+G4{1.26698}: CDS_OG006423.nex{3.97915},
    LG+I{0.231278}+G4{1.21671}: CDS_OG006489_F_M.nex{5.49832},
    JTT+I{0.207393}+G4{1.64759}: CDS_OG007591.nex{6.51156},
    JTT+I{0.236365}+G4{0.896023}: CDS_OG007779.nex{5.78573},
    JTT+I{0.101596}+G4{1.19633}: CDS_OG008045_F.nex{7.4561};
end;

and the second command runs fine. In other words, the bug is in writing this file with a comma and space which shouldn't be there.

Fixing the bug

So, the bug occurs when we write a model file that refers to alignment files on disk, i.e. when the analysis uses -S.

The problem is that this file includes commas that shouldn't be there.

The fix is to remove these commas when writing that file, i.e. when there are no start and end numbers to the partition, because it's a file on disk not a reference to a larger alignment.
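
A minimal sketch of what that could look like, assuming the writer assembles each charset line from a list of position ranges that is empty (or holds only empty strings) for a file on disk. The function name, parameters, and the exact layout for non-empty ranges are illustrative assumptions, not IQ-TREE's actual code:

```cpp
#include <string>
#include <vector>

// Hypothetical writer helper (names are illustrative, not IQ-TREE's):
// builds one charset line for a best_model.nex file.
std::string format_charset_line(const std::string& name,
                                const std::string& file,
                                const std::vector<std::string>& ranges) {
    std::string line = "  charset " + name + " = " + file + ":";
    bool first = true;
    for (const std::string& range : ranges) {
        if (range.empty()) continue;   // skip empty position fields
        line += first ? " " : ", ";    // separator only between real ranges
        line += range;
        first = false;
    }
    line += " ;";
    return line;
}
```

With no ranges this yields the v2.1.0-style line ending in ": ;", and a comma is only written between two non-empty ranges.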

@thomaskf, any chance you could get this fixed?
