-
-
Notifications
You must be signed in to change notification settings - Fork 68
Description
During my research I have encountered what might be a bug or perhaps a problem in the model description in AliSim. The problem arises when simulating sequences specifying a discretized gamma distribution.
The following example is run on Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-148-generic x86_64) with
iqtree2 –version
IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug 1 2024
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong
The following analysis uses the tree:
(((Reference_1:0.5,Reference_2:0.5):0.49,Sample_1:0.79):0.01,Reference_3:1.0);
When simulating sequences using
iqtree2 --alisim ./alignment -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} --length 100000 -af fasta -t ./tree.nwk -seed 1234
and then feeding those sequences into IQ-TREE and estimating branch lengths only with the following command
iqtree2 -s ./alignment.fa -te ./tree.nwk -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} -seed 1234
I obtained a log likelihood ratio of 510.878 when comparing the likelihood under the true model (initial log-likelihood) to that obtained when estimating branch lengths (optimal log-likelihood). I can reject the true simulation model with strong statistical significance when using a likelihood ratio test. Furthermore, the branch length estimates appear biased in repeated simulations.
I have confirmed that the problem relates to the discretization of the gamma distribution, because if I instead simulate directly from the discretized distribution for alpha = 0.1 using the following
iqtree2 --alisim ./alignment_cat1 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 5.265193e-07 -seed 1234
iqtree2 --alisim ./alignment_cat2 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 1.078089e-03 -seed 1234
iqtree2 --alisim ./alignment_cat3 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 9.375338e-02 -seed 1234
iqtree2 --alisim ./alignment_cat4 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 3.905168e+00 -seed 1234
and concatenate the sequences, the likelihood ratio test is no longer significant (4.114). The simulation model then seems to match the estimation model in IQ-TREE.
The estimation in IQ-TREE seems to work well as I have also compared that to other programs such as RAxML and phangorn. I also tested alpha = 1 and found the log likelihood ratio to be significant (408.518), but at alpha = 10 the log likelihood ratio was no longer rejected (9.454). By reducing the rate variation by increasing the alpha parameter, the models between sequence simulation and tree estimation start to match each other better. The observations also seem to hold when I scale the tree to shorter branch lengths.
It seems that AliSim either has some error here or perhaps is not using the discretization using the mean as described in the documentation and stated in the log.
Thank you for your time. Please let me know if there are any other questions/things that need to be clarified.