Variance measure for reasoning benchmark #3677

simveit · 2025-02-18T19:13:44Z

Motivation

In this PR we introduce a reasoning benchmark.

We estimate

$PASS@1 = \frac{1}{N_{question}}\sum_{i=1}^{N_{question}}\frac{1}{N_{tries}}\sum_{j=1}^{N_{tries}}correct_{i,j}$
where $correct_{i,j}$ is 1 if question $i$ is answered correctly in try $j$, and 0 otherwise.
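As a minimal sketch of the estimate above (not the code from this PR), the following computes PASS@1 from a nested list of 0/1 outcomes. The function name `pass_at_1` and the sample `correct` data are illustrative assumptions.

```python
def pass_at_1(correct):
    """Average per-question accuracy over all questions.

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    (Hypothetical helper for illustration, not part of the PR.)
    """
    # Inner average: fraction of correct tries for each question.
    per_question = [sum(tries) / len(tries) for tries in correct]
    # Outer average: mean over all questions.
    return sum(per_question) / len(per_question)


# Made-up outcomes: 3 questions, 4 tries each.
correct = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(pass_at_1(correct))
```

Here the per-question accuracies are 0.75, 0 and 1, so the result is their mean.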

In this PR we want to perform benchmarking not only on the accuracy but also on the variance of the answers. For this we use the metric:

$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$
where $SE_i=\frac{\sigma_i}{\sqrt{N_{tries}}}$ is the standard error of question $i$ and $\sigma_i$ is the standard deviation of $correct_{i,j}$ over the tries for question $i$.
This means
$\frac{1}{N_{question}}\sum_{i=1}^{N_{question}}SE_i$
should reflect how much we deviate on average from the reported accuracy.
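The averaged standard error can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: the name `mean_standard_error` is hypothetical, the data is made up, and $\sigma_i$ is taken to be the population standard deviation, since the description does not say whether the population or sample form is used.

```python
import math
import statistics


def mean_standard_error(correct):
    """Average of the per-question standard errors SE_i = sigma_i / sqrt(N_tries).

    correct[i][j] is 1 if question i was answered correctly in try j, else 0.
    (Hypothetical helper for illustration, not part of the PR.)
    """
    standard_errors = []
    for tries in correct:
        # Assumption: population standard deviation (divide by N, not N - 1).
        sigma = statistics.pstdev(tries)
        standard_errors.append(sigma / math.sqrt(len(tries)))
    return sum(standard_errors) / len(standard_errors)


# Made-up outcomes: 3 questions, 4 tries each.
outcomes = [
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(mean_standard_error(outcomes))
```

Questions that are always answered correctly or always incorrectly contribute a standard error of 0, so only questions with mixed outcomes raise the metric.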

Next steps:

Use the provided code to benchmark the standard error on AIME 2024. For instructions on how to run the benchmark on AIME, please see the provided README. Run it multiple times to see how accurate the results are.
Report the results from the first step in a plot and include this plot in the README.

simveit · 2025-02-18T19:15:14Z

@zhaochenyang20 maybe someone can take over from here.
The only thing that remains to be done is to run the benchmark multiple times.

zhaochenyang20 · 2025-02-19T00:39:33Z

@simveit should this be an issue or a PR? I can encourage others to take it on.

simveit · 2025-02-19T10:12:22Z

Now that you say it, maybe it's cleaner to make this a PR and let me write a separate issue for the benchmarking. This code is working and complete.
What do you think?

zhaochenyang20 · 2025-02-19T18:54:39Z

@simveit could you send me the issue link and tell others how to do the variance measurements, starting from how to run the code 😂

I found someone interested in this. Also, should we merge this PR now?

simveit · 2025-02-19T18:56:03Z

Yes, we can merge this PR. I will write the issue later.

zhaochenyang20 · 2025-02-19T18:57:17Z

@simveit I told yineng to merge it. Thanks! @zhyncs

simveit added 2 commits February 18, 2025 19:38

Included standard error

238556d

removed redundant package

aaba3ab

Merge branch 'main' into feature/evaluate-reasoning-variance

30c3bfa

zhaochenyang20 marked this pull request as ready for review February 19, 2025 18:56

zhaochenyang20 approved these changes Feb 19, 2025


zhyncs merged commit bb12121 into sgl-project:main Feb 19, 2025
1 check passed

simveit mentioned this pull request Feb 20, 2025

Extensive benchmarking of reasoning models including variance #3725

Open

simveit deleted the feature/evaluate-reasoning-variance branch February 20, 2025 19:47
