Major concern about evaluation #32

Description

@ezhang7423

Hi there!
I've found that rolling out ground-truth trajectories from the dataset (the ones labelled by the language annotator) is not always judged successful by `Tasks.get_task_info`. This seems quite concerning. Perhaps I've done something wrong on my end?

To reproduce, I have forked the repo with minimal changes here: #33
The only difference is on line 47 of `calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py`, where I roll out the dataset actions instead of the model's predictions.

The exact commands I ran from beginning to end:

# set up environment
git clone git@github.com:ezhang7423/calvin.git --recursive
cd calvin
conda create --name calvin python=3.8
conda activate calvin
pip install setuptools==57.5.0 torchmetrics==0.6.0
./install.sh

# get pretrained weights and fix the config.yaml
wget http://calvin.cs.uni-freiburg.de/model_weights/D_D_static_rgb_baseline.zip
unzip D_D_static_rgb_baseline.zip
cp ./D_D_static_rgb_baseline/.hydra/config.yaml ./tmp.yaml
# edit tmp.yaml as needed, then restore it
mv ./tmp.yaml ./D_D_static_rgb_baseline/.hydra/config.yaml

# get data
cd dataset
./download_data.sh D
cd ../

# run the evaluation
python calvin_models/calvin_agent/evaluation/evaluate_policy_singlestep.py --dataset_path $DATA_GRAND_CENTRAL/task_D_D/ --train_folder ./D_D_static_rgb_baseline/ --checkpoint D_D_static_rgb_baseline/mcil_baseline.ckpt
