Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
Frontier Models playing the board game Diplomacy.
Ranking LLMs on agentic tasks
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
From Benchmarks to Architecture: we tested 30+ AI APIs, designed routing from the data, and then Anthropic published the same patterns. A 3-part series.
La Perf is a framework for AI performance benchmarking, covering LLMs, VLMs, and embeddings, with power-metrics collection.
Python Performance Tester & More...
TrustyAI's LMEval provider for Llama Stack
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
A technical guide and live-tracking repository for the world's top AI models, specialized by coding, reasoning, and multimodal performance.
The ultimate benchmark for AI coding agents. Give an AI $500, an empty vending machine, and 90 days — race Claude, Codex, Gemini head-to-head with a live dashboard.
AccX Clocking Kiosk — offline-first tablet app for time tracking and fire safety. Humans: run .harness/harness.sh (pw: changeme)
Benchmark local LLM models: speed, quality, and hardware fitness scoring. CLI, MCP server, and IDE plugins.
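For local-model speed benchmarks like the one above, the core measurement is usually just timing generation and dividing tokens by wall-clock seconds. Below is a minimal, tool-agnostic sketch; the `generate` callable, the whitespace token estimate, and the stub model are placeholders for illustration, not any listed project's API.

```python
import time
from typing import Callable, List

def benchmark_throughput(generate: Callable[[str], str],
                         prompts: List[str]) -> dict:
    """Time a generation callable over a set of prompts.

    `generate` stands in for whatever local backend you use (llama.cpp
    bindings, an HTTP endpoint, etc.); tokens are approximated by
    whitespace splitting, since real tokenizers vary by model.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        output = generate(prompt)
        total_tokens += len(output.split())  # rough token estimate
    elapsed = time.perf_counter() - start
    return {
        "prompts": len(prompts),
        "approx_tokens": total_tokens,
        "seconds": round(elapsed, 2),
        "approx_tokens_per_sec": round(total_tokens / elapsed, 2),
    }

if __name__ == "__main__":
    # Stub model so the script runs standalone; swap in a real backend.
    fake_model = lambda prompt: "lorem ipsum " * 50
    print(benchmark_throughput(fake_model, ["Explain AI benchmarks."] * 5))
```

Quality and hardware-fitness scoring need model- and machine-specific criteria, so they are left out of this sketch.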