Major concern about evaluation #32

Description

@ezhang7423

Hi there!
I've found that rolling out ground-truth trajectories from the dataset (the ones labelled by the language annotator) is not always judged successful by `Tasks.get_task_info`. This seems quite concerning. Perhaps I've done something wrong on my end?

To reproduce, I have forked the repo with minimal changes here: #33
The only difference is on line 47 of `calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py`, where I roll out the dataset actions instead of the model's predictions.

The exact commands I ran from beginning to end:

# set up environment
git clone git@github.com:ezhang7423/calvin.git --recursive
cd calvin
conda create --name calvin python=3.8
conda activate calvin
pip install setuptools==57.5.0 torchmetrics==0.6.0
./install.sh

# get pretrained weights and fix the config.yaml
wget http://calvin.cs.uni-freiburg.de/model_weights/D_D_static_rgb_baseline.zip
unzip D_D_static_rgb_baseline.zip
cp ./D_D_static_rgb_baseline/.hydra/config.yaml ./tmp.yaml
# edit tmp.yaml as needed, then restore it
mv ./tmp.yaml ./D_D_static_rgb_baseline/.hydra/config.yaml

# get data
cd dataset
./download_data.sh D
cd ../

# run the evaluation
python calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py --dataset_path $DATA_GRAND_CENTRAL/task_D_D/ --train_folder ./D_D_static_rgb_baseline/ --checkpoint D_D_static_rgb_baseline/mcil_baseline.ckpt
