evaluation-metrics

Here are 12 public repositories matching this topic...

An evaluation of Claude 4 Sonnet's mathematical assessment capabilities on 500 original problems, revealing JSON-induced errors and systematic patterns in LLM evaluation tasks. The study reports 100% accuracy on incorrect answers but only 84.3% on correct ones, which it attributes to premature decision-making within the JSON structure.

  • Updated Jul 7, 2025
  • HTML
