Gamma discretization in AliSim not consistent with tree estimation

During my research I have encountered what might be a bug or perhaps a problem in the model description in AliSim. The problem arises when simulating sequences specifying a discretized gamma distribution.

The following example is run on Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-148-generic x86_64) with
```
iqtree2 –version
IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug  1 2024
Developed by Bui Quang Minh, Nguyen Lam Tung, Olga Chernomor, Heiko Schmidt,
Dominik Schrempf, Michael Woodhams, Ly Trong Nhan, Thomas Wong
```

The following analysis uses the tree:
(((Reference_1:0.5,Reference_2:0.5):0.49,Sample_1:0.79):0.01,Reference_3:1.0);

When simulating sequences using

`iqtree2 --alisim ./alignment -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} --length 100000 -af fasta -t ./tree.nwk -seed 1234`

and then feeding those sequences into IQ-TREE and estimating branch lengths only with the following command

`iqtree2 -s ./alignment.fa -te ./tree.nwk -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25}+G4{0.1} -seed 1234`

I obtained a log likelihood ratio of 510.878 when comparing the likelihood under the true model (initial log-likelihood) to that obtained when estimating branch lengths (optimal log-likelihood). I can reject the true simulation model with strong statistical significance when using a likelihood ratio test. Furthermore, the branch length estimates appear biased in repeated simulations.

I have confirmed that the problem relates to the discretization of the gamma distribution, because if I instead simulate directly from the discretized distribution for alpha = 0.1 using the following 

```
iqtree2 --alisim ./alignment_cat1 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 5.265193e-07 -seed 1234
iqtree2 --alisim ./alignment_cat2 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 1.078089e-03 -seed 1234
iqtree2 --alisim ./alignment_cat3 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 9.375338e-02 -seed 1234
iqtree2 --alisim ./alignment_cat4 -m GTR{1/1/1/1/1/1}+F{0.25/0.25/0.25/0.25} --length 25000 -af fasta -t ./tree.nwk --branch-scale 3.905168e+00 -seed 1234
```

and concatenate the sequences, the likelihood ratio test is no longer significant (4.114). The simulation model then seems to match the estimation model in IQ-TREE.

The estimation in IQ-TREE seems to work well as I have also compared that to other programs such as RAxML and phangorn. I also tested alpha = 1 and found the log likelihood ratio to be significant (408.518), but at alpha = 10 the log likelihood ratio was no longer rejected (9.454). By reducing the rate variation by increasing the alpha parameter, the models between sequence simulation and tree estimation start to match each other better. The observations also seem to hold when I scale the tree to shorter branch lengths.

It seems that AliSim either has some error here or perhaps is not using the discretization using the mean as described in the documentation and stated in the log.

Thank you for your time. Please let me know if there are any other questions/things that need to be clarified.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Gamma discretization in AliSim not consistent with tree estimation #333

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Gamma discretization in AliSim not consistent with tree estimation #333

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions