simveit
Contributor

@simveit simveit commented Feb 18, 2025

Motivation

In this PR we introduce a reasoning benchmark.

We estimate

$\text{PASS@1} = \frac{1}{N_{question}}\sum_{i=1}^{N_{question}}\frac{1}{N_{tries}}\sum_{j=1}^{N_{tries}}correct_{i,j}$
where $correct_{i,j}$ is 1 if question $i$ is answered correctly in try $j$, and 0 otherwise.
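To illustrate the formula, here is a minimal sketch of the PASS@1 estimate. The function name `pass_at_1` and the nested-list layout of the results are assumptions made for this example, not the code from this PR, and the numbers are made up.

```python
def pass_at_1(correct):
    """Average per-question accuracy over all questions.

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    Every question is assumed to have at least one try.
    """
    # Inner average: fraction of correct tries for each question.
    per_question = [sum(tries) / len(tries) for tries in correct]
    # Outer average: mean of those fractions over all questions.
    return sum(per_question) / len(per_question)


# Hypothetical results: 3 questions, 4 tries each.
correct = [
    [1, 1, 1, 1],  # always correct
    [1, 0, 1, 0],  # correct half of the time
    [0, 0, 0, 0],  # never correct
]
score = pass_at_1(correct)
print(score)  # 0.5
```

With these made-up results the per-question accuracies are 1, 0.5 and 0, so the estimate is 0.5.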

In this PR we want to benchmark not only the accuracy but also the variance of the answers across tries. For this we use the metric:

$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$
where $SE_i=\frac{\sigma_i}{\sqrt{N_{tries}}}$ is the standard error for question $i$, and $\sigma_i$ is the standard deviation of $correct_{i,j}$ over the tries for that question.
This means that
$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$
should reflect how much, on average, we deviate from the reported accuracy.
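The metric can be sketched in the same way. The function name `mean_standard_error` and the data are hypothetical, and the PR text does not say whether $\sigma_i$ is the sample or the population standard deviation; this sketch assumes the sample version (`statistics.stdev`, which divides by $N_{tries}-1$) and therefore needs at least two tries per question.

```python
import math
import statistics


def mean_standard_error(correct):
    """Mean over questions of SE_i = sigma_i / sqrt(N_tries).

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    Assumption: sigma_i is the sample standard deviation, so each question
    needs at least two tries.
    """
    errors = []
    for tries in correct:
        sigma = statistics.stdev(tries)  # sample std over the tries
        errors.append(sigma / math.sqrt(len(tries)))
    return sum(errors) / len(errors)


# Hypothetical results: 3 questions, 4 tries each.
correct = [
    [1, 1, 1, 1],  # no variation, so SE is 0
    [1, 0, 1, 0],  # maximal variation for 4 tries
    [0, 0, 0, 0],  # no variation, so SE is 0
]
mean_se = mean_standard_error(correct)
print(mean_se)
```

Questions that are always or never answered correctly contribute zero, so the metric is driven by the questions on which the model is inconsistent. Switching to `statistics.pstdev` would give the population version instead.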

Next steps:

  • Use the provided code to benchmark the standard error on AIME 2024. For instructions on how to run the benchmark on AIME, please see the provided README. Run it multiple times to see how accurate the results are.
  • Report the results from the first step in a plot and include this plot in the README.

@simveit
Contributor Author

simveit commented Feb 18, 2025

@zhaochenyang20 maybe someone can take over from here.
The only thing that remains to be done is to run the benchmark multiple times.

@zhaochenyang20
Collaborator

@simveit should this be an issue or a PR? I can encourage others to take it on.

@simveit
Contributor Author

simveit commented Feb 19, 2025

Now that you say it, maybe it's cleaner to make this a PR and let me write a separate issue for the benchmarking. This code is working and complete.
What do you think?

@zhaochenyang20
Collaborator

@simveit could you send me the issue link and tell others how to do the variance measurements, starting from how to run the code 😂

I found someone interested in this. Also, should we merge this PR now?

@simveit
Contributor Author

simveit commented Feb 19, 2025

Yes, we can merge this PR. I will write the issue later.

@zhaochenyang20 zhaochenyang20 marked this pull request as ready for review February 19, 2025 18:56
@zhaochenyang20
Collaborator

zhaochenyang20 commented Feb 19, 2025

@simveit I told yineng to merge it. Thanks! @zhyncs

@zhyncs zhyncs merged commit bb12121 into sgl-project:main Feb 19, 2025
1 check passed
@simveit simveit deleted the feature/evaluate-reasoning-variance branch February 20, 2025 19:47
3 participants