Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
Frontier Models playing the board game Diplomacy.
Ranking LLMs on agentic tasks
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
From Benchmarks to Architecture: we tested 30+ AI APIs, designed routing from the data, and then Anthropic published the same patterns. A 3-part series.
La Perf is a framework for AI performance benchmarking, covering LLMs, VLMs, and embeddings, with power-metrics collection.
Python Performance Tester & More...
TrustyAI's LMEval provider for Llama Stack
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
A technical guide and live-tracking repository for the world's top AI models, specialized by coding, reasoning, and multimodal performance.
The ultimate benchmark for AI coding agents. Give an AI $500, an empty vending machine, and 90 days — race Claude, Codex, Gemini head-to-head with a live dashboard.
AccX Clocking Kiosk — offline-first tablet app for time tracking and fire safety. Humans: run .harness/harness.sh (pw: changeme)
Benchmark local LLM models: speed, quality, and hardware fitness scoring. CLI, MCP server, and IDE plugins.
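For local-model speed benchmarks like the one above, the core measurement is usually just timing generation and dividing tokens by wall-clock seconds. Below is a minimal, tool-agnostic sketch; the `generate` callable, the whitespace token estimate, and the stub model are placeholders for illustration, not any listed project's API.

```python
import time
from typing import Callable, List

def benchmark_throughput(generate: Callable[[str], str],
                         prompts: List[str]) -> dict:
    """Time a generation callable over a set of prompts.

    `generate` stands in for whatever local backend you use (llama.cpp
    bindings, an HTTP endpoint, etc.); tokens are approximated by
    whitespace splitting, since real tokenizers vary by model.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        output = generate(prompt)
        total_tokens += len(output.split())  # rough token estimate
    elapsed = time.perf_counter() - start
    return {
        "prompts": len(prompts),
        "approx_tokens": total_tokens,
        "seconds": round(elapsed, 2),
        "approx_tokens_per_sec": round(total_tokens / elapsed, 2),
    }

if __name__ == "__main__":
    # Stub model so the script runs standalone; swap in a real backend.
    fake_model = lambda prompt: "lorem ipsum " * 50
    print(benchmark_throughput(fake_model, ["Explain AI benchmarks."] * 5))
```

Quality and hardware-fitness scoring need model- and machine-specific criteria, so they are left out of this sketch.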