llm-evaluation

Here are 10 public repositories matching this topic...

alopatenko / LLMEvaluation

A comprehensive guide to LLM evaluation methods designed to assist in identifying the most suitable evaluation techniques for various use cases, promote the adoption of best practices in LLM assessment, and critically assess the effectiveness of these evaluation methods.

evaluation llm llm-evaluation llm-benchmarking generative-ai-benchmarking

Updated Aug 22, 2025
HTML

cedrickchee / vibe-jet

A browser-based 3D multiplayer flying game with arcade-style mechanics, created using Gemini 2.5 Pro through a technique called "vibe coding"

game-development flight-simulator evaluation-framework vibe-check llm-evaluation vibe-coding gemini-2-5-pro-exp

Updated Apr 7, 2025
HTML

LRudL / sad

Situational Awareness Dataset

ml llm-evaluation

Updated Dec 14, 2024
HTML

AmanPriyanshu / GPT-OSS-MoE-ExpertFingerprinting

ExpertFingerprinting: Behavioral Pattern Analysis and Specialization Mapping of Experts in GPT-OSS-20B's Mixture-of-Experts Architecture

Updated Aug 13, 2025
HTML

attogram / ai_test_zone

AI Test Zone - compare the same prompt against many open source LLMs

ai-evaluation llm-evaluation attogram-project

Updated Aug 30, 2025
HTML

thabit-ai / thabit

Thabit is a platform to evaluate prompts on multiple LLMs to determine the best one for your data

llm llm-evaluation

Updated Aug 2, 2024
HTML

anya-mb / help_center_chatbot

Generative AI RAG Chatbot for Electricity and Gas Company

chatbot evaluation poc chroma rag streamlit gpt-4 llm llama-index nemo-guardrails llm-evaluation ragas

Updated Aug 27, 2024
HTML

shane-reaume / LLM-Finetuning-Sentiment-Analysis

A beginner-friendly project for fine-tuning, testing, and deploying language models for sentiment analysis with a strong emphasis on quality assurance and testing methodologies.

testing unit-testing sentiment-analysis python3 functional-testing performance-testing metrics-gathering memory-test imdb-dataset trainings wsl-ubuntu huggingface distilbert weights-and-biases transformers-models llm-evaluation

Updated Mar 17, 2025
HTML

gsurrel / AiderDashboard

Visualize Aider's performance results

vega-lite aider llm-evaluation

Updated May 6, 2025
HTML

naholav / claude_4_sonnet_math_evaluation

Comprehensive evaluation of Claude 4 Sonnet's mathematical assessment capabilities: 500 original problems revealing JSON-induced errors and systematic patterns in LLM evaluation tasks. Research demonstrates 100% accuracy on incorrect answers but 84.3% on correct ones due to premature decision-making in JSON structure.

nlp benchmark machine-learning research artificial-intelligence dataset nlp-machine-learning evaluation-metrics cognitive-dissonance mathematical-reasoning llm-evaluation ai-assessment claude-4-sonnet json-bias systematic-errors

Updated Jul 7, 2025
HTML
