
Conversation

@joecummings joecummings (Member) commented Oct 10, 2024

Context:

In v0.4.5, EleutherAI officially added multimodal support to their Eval Harness. Before this release was public, we had some hacky code to verify that the user had installed the Harness from the main GitHub branch. Now that the release is out, we should simply require users to upgrade their Harness version so that our multimodal code works.

I also changed the way we gate external packages in our recipes. If lm-eval is not installed at all, the recipe raises the usual ModuleNotFoundError, which should be enough to tell users they need it to run the recipe. Because we actually want them on the latest version, there is now an additional check in the setup portion of the recipe that verifies v0.4.5 specifically is installed. This happens before any heavy lifting starts.

Changelog:

  • Removed hacky code
  • Updated workers to download lm-eval==0.4.5
  • Added gating logic on version in the setup portion of the recipe
  • Updated the test to return the "wrong" version of lm-eval to ensure the version check is exercised
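
The gating logic described above could look roughly like the following sketch. The helper name `check_package_version` is illustrative, not the actual recipe code; the key idea is reading the installed distribution's metadata and failing fast before any checkpoint loading begins:

```python
from importlib.metadata import PackageNotFoundError, version


def check_package_version(package: str, required: str) -> None:
    """Fail fast if `package` is missing or not pinned to `required`.

    Raises ModuleNotFoundError when the distribution is absent (matching
    the error a bare import would produce), and RuntimeError when an
    incompatible version is installed.
    """
    try:
        installed = version(package)
    except PackageNotFoundError as err:
        raise ModuleNotFoundError(
            f"{package} is required; install it with "
            f"`pip install {package}=={required}`"
        ) from err
    if installed != required:
        raise RuntimeError(
            f"This recipe requires {package}=={required}, but found "
            f"{installed}. Please run `pip install {package}=={required}`."
        )
```

Calling something like `check_package_version("lm_eval", "0.4.5")` at the top of the recipe's `__init__` surfaces the problem before any heavy lifting starts.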

Testing:
Recipe tests: python -m pytest tests/recipes/test_eleuther_eval.py --with-integration
Output from text run:

batch_size: 8
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-2-7b-hf
  checkpoint_files:
  - pytorch_model-00001-of-00002.bin
  - pytorch_model-00002-of-00002.bin
  model_type: LLAMA2
  output_dir: /tmp/Llama-2-7b-hf
device: cuda
dtype: bf16
enable_kv_cache: true
limit: null
max_seq_length: 4096
model:
  _component_: torchtune.models.llama2.llama2_7b
quantizer: null
seed: 1234
tasks:
- truthfulqa_mc2
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  max_seq_len: null
  path: /tmp/Llama-2-7b-hf/tokenizer.model

Model is initialized with precision torch.bfloat16.
2024-10-14:08:58:59,426 INFO     [eleuther_eval.py:505] Model is initialized with precision torch.bfloat16.
2024-10-14:08:58:59,488 INFO     [huggingface.py:129] Using device 'cuda:0'
2024-10-14:08:58:59,607 INFO     [huggingface.py:481] Using model type 'default'
2024-10-14:08:58:59,843 INFO     [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
Running evaluation on the following tasks: ['truthfulqa_mc2']
2024-10-14:08:59:12,366 INFO     [eleuther_eval.py:549] Running evaluation on the following tasks: ['truthfulqa_mc2']
2024-10-14:08:59:12,367 INFO     [task.py:415] Building contexts for truthfulqa_mc2 on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 817/817 [00:01<00:00, 700.97it/s]
2024-10-14:08:59:13,605 INFO     [evaluator.py:489] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████████████████████████████████████████████████████████| 5882/5882 [01:52<00:00, 52.11it/s]
Eval completed in 119.48 seconds.
2024-10-14:09:01:11,843 INFO     [eleuther_eval.py:558] Eval completed in 119.48 seconds.
Max memory allocated: 39.48 GB
2024-10-14:09:01:11,843 INFO     [eleuther_eval.py:559] Max memory allocated: 39.48 GB


|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.3895|±  |0.0136|


2024-10-14:09:01:11,926 INFO     [eleuther_eval.py:563]

|    Tasks     |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------|------:|------|-----:|------|---|-----:|---|-----:|
|truthfulqa_mc2|      2|none  |     0|acc   |↑  |0.3895|±  |0.0136|


Output from multimodal run:

2024-10-14:09:05:40,454 INFO     [_logging.py:101] Running EleutherEvalRecipe with resolved config:

batch_size: 1
checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer
  checkpoint_dir: /tmp/Llama-3.2-11B-Vision-Instruct/original
  checkpoint_files:
  - consolidated.pth
  model_type: LLAMA3_VISION
  output_dir: ./
device: cuda
dtype: bf16
enable_kv_cache: true
limit: 3
log_level: INFO
max_seq_length: 8192
model:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
quantizer: null
seed: 1234
tasks:
- mmmu_val_biology
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  max_seq_len: 8192
  path: /tmp/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model

Model is initialized with precision torch.bfloat16.
2024-10-14:09:05:49,232 INFO     [eleuther_eval.py:505] Model is initialized with precision torch.bfloat16.
Running evaluation on the following tasks: ['mmmu_val_biology']
2024-10-14:09:06:01,712 INFO     [eleuther_eval.py:549] Running evaluation on the following tasks: ['mmmu_val_biology']
2024-10-14:09:06:01,714 INFO     [task.py:415] Building contexts for mmmu_val_biology on rank 0...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 5706.54it/s]
2024-10-14:09:06:01,766 INFO     [evaluator.py:489] Running generate_until requests
Running generate_until requests with text+image input: 100%|█████████████████████████████████████████████| 3/3 [00:34<00:00, 11.61s/it]
Eval completed in 34.93 seconds.
2024-10-14:09:06:36,641 INFO     [eleuther_eval.py:558] Eval completed in 34.93 seconds.
Max memory allocated: 32.01 GB
2024-10-14:09:06:36,642 INFO     [eleuther_eval.py:559] Max memory allocated: 32.01 GB


| Tasks |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-------|------:|------|------|------|---|-----:|---|-----:|
|Biology|      0|none  |None  |acc   |↑  |0.3333|±  |0.3333|


2024-10-14:09:06:36,723 INFO     [eleuther_eval.py:563]

| Tasks |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|-------|------:|------|------|------|---|-----:|---|-----:|
|Biology|      0|none  |None  |acc   |↑  |0.3333|±  |0.3333|


pytorch-bot bot commented Oct 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1800

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 13a65f0 with merge base 5de5001 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 10, 2024
Comment on lines +16 to +20
from lm_eval.evaluator import evaluate, get_task_list
from lm_eval.models.hf_vlms import HFMultimodalLM
from lm_eval.models.huggingface import HFLM
from lm_eval.tasks import get_task_dict, TaskManager
from lm_eval.utils import make_table
@ebsmothers ebsmothers (Contributor) commented Oct 11, 2024

Don't we still need to gate here? lm_eval is still not in our pyproject.toml. Or is the idea that we just directly raise the usual ModuleNotFoundError since we are now on the latest version of lm_eval? (Still seems like it might be good to explicitly say "install lm_eval==0.4.5" or whatever)

@joecummings joecummings (Member Author) replied

Yeah I was going to default to just let the ModuleNotFoundError roll through since it seems explicit enough.

@joecummings joecummings changed the title from "[WIP] Update Eleuther to v0.4.5" to "Update EleutherAI Eval Harness to v0.4.5" on Oct 14, 2024
@@ -469,6 +440,16 @@ class EleutherEvalRecipe(EvalRecipeInterface):
"""

def __init__(self, cfg: DictConfig) -> None:
# Double check we have the right Eval Harness version
from importlib.metadata import version
Contributor commented

Do they not have __version__ defined?

@joecummings joecummings (Member Author) replied

(joe-torchtune) [jrcummings@devvm050.nha0 ~/projects/joe-torchtune (update-eluther-pin)]$ python
Python 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import lm_eval
>>> lm_eval.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'lm_eval' has no attribute '__version__'
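
Since the `lm_eval` module exposes no `__version__` attribute, the recipe instead reads the version from the installed distribution's metadata via `importlib.metadata`. A minimal sketch of that approach (the helper name `dist_version` is illustrative):

```python
from importlib import metadata
from typing import Optional


def dist_version(name: str) -> Optional[str]:
    """Look up a distribution's version from its installed metadata.

    This works even when the module itself defines no `__version__`
    attribute; returns None if the distribution is not installed.
    """
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
```

With `lm_eval` installed, `dist_version("lm_eval")` returns the version string recorded at install time, which is what the recipe's setup check compares against "0.4.5".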

@joecummings joecummings merged commit 7bbaa89 into pytorch:main Oct 14, 2024
17 checks passed
@joecummings joecummings deleted the update-eluther-pin branch October 14, 2024 18:15