gpt-5 benchmark #2146

prabhu · 2025-08-08T10:23:49Z

Summary of Spec Category Comparison

| Model | Accuracy (%) |
|---|---|
| `gemini-2.5-pro` | 100.00 |
| `deepseek-r1` | 98.58 |
| `cdx1-pro-mlx-8bit` | 98.30 |
| `gpt-5` | 95.17 |
| `qwen3-coder-480B` | 90.34 |
| `gpt-oss-120b` | 89.20 |
| `cdx1-mlx-8bit` | 83.52 |
| `deepthink-r1` | 12.36 |
| `gpt-oss-20b` | 9.09 |
| `o4-mini-high` | 0.00 |

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Spec Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b, gpt-5]
    y-axis "Accuracy (%)" 0 --> 100
    bar [83.52, 98.3, 100, 0, 90.34, 12.36, 98.58, 89.2, 9.09, 95.17]

Summary of Logic Category Comparison

| Model | Accuracy (%) |
|---|---|
| `gemini-2.5-pro` | 93.60 |
| `deepthink-r1` | 89.63 |
| `gpt-5` | 83.23 |
| `deepseek-r1` | 82.92 |
| `gpt-oss-120b` | 80.49 |
| `gpt-oss-20b` | 79.27 |
| `cdx1-pro-mlx-8bit` | 73.17 |
| `o4-mini-high` | 67.99 |
| `qwen3-coder-480B` | 48.48 |
| `cdx1-mlx-8bit` | 46.04 |

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Logic Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b, gpt-5]
    y-axis "Accuracy (%)" 0 --> 100
    bar [46.04, 73.17, 93.6, 67.99, 48.48, 89.63, 82.92, 80.49, 79.27, 83.23]
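The tables are sorted by accuracy, while both charts share a fixed x-axis order, so each `bar` series has to be reordered by hand. A minimal sketch of how that mapping could be checked is below. The helper `bar_series` and the constant names are hypothetical and not part of the benchmark tooling; the scores are copied from the two tables in this PR.

```python
# Hypothetical helper for cross-checking the chart data against the tables.
# None of these names come from the benchmark tooling itself.

# x-axis order shared by both xychart-beta blocks.
X_AXIS = [
    "cdx1-mlx-8bit", "cdx1-pro-mlx-8bit", "gemini-2.5-pro", "o4-mini-high",
    "qwen3-coder-480B", "deepthink-r1", "deepseek-r1", "gpt-oss-120b",
    "gpt-oss-20b", "gpt-5",
]

# Accuracy (%) per model, copied from the Spec table.
SPEC = {
    "gemini-2.5-pro": 100.00, "deepseek-r1": 98.58, "cdx1-pro-mlx-8bit": 98.30,
    "gpt-5": 95.17, "qwen3-coder-480B": 90.34, "gpt-oss-120b": 89.20,
    "cdx1-mlx-8bit": 83.52, "deepthink-r1": 12.36, "gpt-oss-20b": 9.09,
    "o4-mini-high": 0.00,
}

# Accuracy (%) per model, copied from the Logic table.
LOGIC = {
    "gemini-2.5-pro": 93.60, "deepthink-r1": 89.63, "gpt-5": 83.23,
    "deepseek-r1": 82.92, "gpt-oss-120b": 80.49, "gpt-oss-20b": 79.27,
    "cdx1-pro-mlx-8bit": 73.17, "o4-mini-high": 67.99,
    "qwen3-coder-480B": 48.48, "cdx1-mlx-8bit": 46.04,
}


def bar_series(scores, axis_order):
    """Return a Mermaid `bar [...]` line with scores in x-axis order."""
    missing = [model for model in axis_order if model not in scores]
    if missing:
        raise KeyError(f"no score for: {', '.join(missing)}")
    # The 'g' format drops trailing zeros (98.30 -> 98.3, 100.00 -> 100).
    values = ", ".join(f"{scores[model]:g}" for model in axis_order)
    return f"bar [{values}]"


if __name__ == "__main__":
    print(bar_series(SPEC, X_AXIS))
    print(bar_series(LOGIC, X_AXIS))
```

Comparing the two printed lines with the `bar` lines in the charts is a quick way to confirm that the table and chart values agree.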

Exception

gpt-5 technically failed the test: it required six separate confirmations to complete the tests.

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

gpt-5 benchmark

0d27348

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu added the ml label Aug 8, 2025

prabhu merged commit 079e5fc into master Aug 8, 2025
3 checks passed

prabhu deleted the feature/gpt-5-test branch August 8, 2025 10:36
