@prabhu prabhu commented Aug 8, 2025

Summary of Spec Category Comparison

| Model | Accuracy (%) |
|---|---|
| gemini-2.5-pro | 100.00 |
| deepseek-r1 | 98.58 |
| cdx1-pro-mlx-8bit | 98.30 |
| gpt-5 | 95.17 |
| qwen3-coder-480B | 90.34 |
| gpt-oss-120b | 89.20 |
| cdx1-mlx-8bit | 83.52 |
| deepthink-r1 | 12.36 |
| gpt-oss-20b | 9.09 |
| o4-mini-high | 0.00 |

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Spec Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b, gpt-5]
    y-axis "Accuracy (%)" 0 --> 100
    bar [83.52, 98.3, 100, 0, 90.34, 12.36, 98.58, 89.2, 9.09, 95.17]

Summary of Logic Category Comparison

| Model | Accuracy (%) |
|---|---|
| gemini-2.5-pro | 93.60 |
| deepthink-r1 | 89.63 |
| gpt-5 | 83.23 |
| deepseek-r1 | 82.92 |
| gpt-oss-120b | 80.49 |
| gpt-oss-20b | 79.27 |
| cdx1-pro-mlx-8bit | 73.17 |
| o4-mini-high | 67.99 |
| qwen3-coder-480B | 48.48 |
| cdx1-mlx-8bit | 46.04 |

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Logic Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b, gpt-5]
    y-axis "Accuracy (%)" 0 --> 100
    bar [46.04, 73.17, 93.6, 67.99, 48.48, 89.63, 82.92, 80.49, 79.27, 83.23]
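
The chart sources list models in a fixed x-axis order, while the summary tables are sorted by accuracy. As a rough cross-check, the sketch below pairs the x-axis labels with the `bar` values copied from the two charts and sorts them from highest to lowest, which should reproduce the table order. The names `MODELS`, `SPEC`, `LOGIC`, and `rank` are illustrative assumptions and are not part of the benchmark tooling.

```python
# Labels and values copied from the two xychart-beta blocks in this PR.
# MODELS, SPEC, LOGIC and rank() are hypothetical names used only for this sketch.
MODELS = [
    "cdx1-mlx-8bit", "cdx1-pro-mlx-8bit", "gemini-2.5-pro", "o4-mini-high",
    "qwen3-coder-480B", "deepthink-r1", "deepseek-r1", "gpt-oss-120b",
    "gpt-oss-20b", "gpt-5",
]
SPEC = [83.52, 98.3, 100, 0, 90.34, 12.36, 98.58, 89.2, 9.09, 95.17]
LOGIC = [46.04, 73.17, 93.6, 67.99, 48.48, 89.63, 82.92, 80.49, 79.27, 83.23]


def rank(models, scores):
    """Pair each model with its score and sort from highest to lowest accuracy."""
    return sorted(zip(models, scores), key=lambda pair: pair[1], reverse=True)


spec_ranked = rank(MODELS, SPEC)
logic_ranked = rank(MODELS, LOGIC)

# Print both rankings in the same Model / Accuracy (%) layout as the tables.
for name, ranked in (("Spec", spec_ranked), ("Logic", logic_ranked)):
    print(f"{name} category")
    for model, score in ranked:
        print(f"  {model:<20} {score:6.2f}")
```

Keeping the labels in one shared list mirrors the charts, where both categories use the same x-axis order.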

Exception

gpt-5 technically failed the test: it needed six separate confirmations before it completed the tests.

[Image: gpt-5-batch, Screenshot 2025-08-08 at 11 02 31]

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>
@prabhu prabhu added the ml label Aug 8, 2025
@prabhu prabhu merged commit 079e5fc into master Aug 8, 2025
3 checks passed
@prabhu prabhu deleted the feature/gpt-5-test branch August 8, 2025 10:36