It would be nice if we could conduct more extensive robustness analyses from the following perspectives:
* re-evaluate the model on an adversarial evaluation set
* evaluate the faithfulness of the solution chains
* further analyses along these lines