Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.
This repository contains:
- Benchmark task definitions (SDLC and Org suites with task specs, tests, and metadata)
- Evaluation and run configs (paired baseline vs MCP-enabled execution modes)
- Metrics extraction and reporting pipelines for score/cost/retrieval analysis
- Run artifacts and agent traces (in `runs/`, with published summaries under `docs/official_results/`)
Tasks are executed via the Harbor runner with the Claude Code agent harness.
- Researchers evaluating coding agents on realistic software engineering tasks
- Practitioners comparing baseline vs MCP-enabled agent configurations
You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.
```bash
git clone https://github.com/sourcegraph/CodeScaleBench.git
cd CodeScaleBench

# Fast repo sanity check (docs/config refs)
python3 scripts/repo_health.py --quick

# Explore task-based docs navigation
sed -n '1,120p' docs/START_HERE_BY_TASK.md

# Inspect available benchmark suites
ls benchmarks
```

Running benchmark tasks requires:
- Harbor installed and configured
Our internal default setup often uses:
- Daytona account and API key (preferred in this repo). See `docs/DAYTONA.md`
- Docker for Daytona-incompatible tasks
- Agent/runtime credentials as needed by your Harbor harness
Recommended pre-run checks:
```bash
python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --all
```

Then start with a dry run:

```bash
bash configs/run_selected_tasks.sh --dry-run
```

See also:
- `docs/START_HERE_BY_TASK.md` for task-oriented navigation
- `docs/reference/CONFIGS.md` for the 2-config evaluation matrix
- `docs/EVALUATION_PIPELINE.md` for scoring and reporting outputs
Nine suites organized by software development lifecycle phase:
| Suite | SDLC Phase | Tasks | Description |
|---|---|---|---|
| `csb_sdlc_fix` | Bug Repair | 26 | Diagnosing and fixing real bugs across production codebases |
| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
| `csb_sdlc_debug` | Debugging & Investigation | 18 | Root cause tracing, fault localization, provenance |
| `csb_sdlc_test` | Testing & QA | 18 | Code review, performance testing, code search validation, test generation |
| `csb_sdlc_refactor` | Cross-File Refactoring | 16 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
| `csb_sdlc_design` | Architecture & Design | 14 | Architecture analysis, dependency graphs, change impact |
| `csb_sdlc_document` | Documentation | 13 | API references, architecture docs, migration guides, runbooks |
| `csb_sdlc_secure` | Security & Compliance | 12 | CVE analysis, reachability, governance, access control |
| `csb_sdlc_understand` | Requirements & Discovery | 10 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| **Total** | | **150** | |
Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
| Suite | Category | Tasks | Description |
|---|---|---|---|
| `csb_org_onboarding` | Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
| `csb_org_migration` | Framework Migration | 26 | API migrations, breaking changes across repos |
| `csb_org_security` | Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
| `csb_org_crossrepo_tracing` | Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
| `csb_org_domain` | Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
| `csb_org_incident` | Incident Debugging | 20 | Error-to-code-path tracing across microservices |
| `csb_org_compliance` | Compliance | 18 | Standards adherence, audit, and provenance workflows |
| `csb_org_platform` | Platform Knowledge | 18 | Service template discovery and tribal knowledge |
| `csb_org_crossorg` | Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
| `csb_org_org` | Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
| `csb_org_crossrepo` | Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
| **Total** | | **220** | |
Combined canonical benchmark: 370 tasks (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in benchmarks/backups/.
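For intuition on the allocation rule, here is a minimal sketch of Neyman-optimal allocation. The pool sizes and score standard deviations below are made-up placeholders for illustration, not the calibration data actually used to size the suites.

```python
# Hedged sketch of Neyman-optimal allocation: each suite gets tasks in proportion
# to (candidate pool size) x (pilot score standard deviation). Inputs are illustrative.
def neyman_allocation(total_tasks: int, strata: dict[str, tuple[int, float]]) -> dict[str, int]:
    """strata maps suite name -> (candidate pool size N_h, pilot score std dev s_h)."""
    weights = {suite: n_h * s_h for suite, (n_h, s_h) in strata.items()}
    total = sum(weights.values())
    return {suite: round(total_tasks * w / total) for suite, w in weights.items()}

# Hypothetical pilot numbers for three suites; noisier, larger strata receive more tasks.
print(neyman_allocation(60, {
    "csb_sdlc_fix": (120, 0.42),
    "csb_sdlc_feature": (90, 0.40),
    "csb_sdlc_understand": (45, 0.25),
}))
```

Suites with larger candidate pools and higher per-task score variance receive proportionally more tasks, which is why the suite sizes above are uneven rather than a flat 20 per suite.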
Both baseline and MCP-Full agents have access to all repos in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
See docs/MCP_UNIQUE_TASKS.md for the full task system, authoring guide, and oracle evaluation framework. See docs/MCP_UNIQUE_CALIBRATION.md for oracle coverage analysis.
All benchmarks are evaluated across two primary configurations (Baseline vs MCP). The concrete run config names differ by task type:
- SDLC suites (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
- Org suites (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct`
At a high level, the distinction is:
| Config Name | Internal MCP mode | MCP Tools Available |
|---|---|---|
| Baseline | `none` | None (agent uses only built-in tools) |
| MCP | `sourcegraph` / `artifact` (task-dependent) | All 13 Sourcegraph MCP tools, including `sg_deepsearch` and `sg_deepsearch_read` |
See docs/reference/CONFIGS.md for the canonical configuration matrix and tool-by-tool breakdown.
```
benchmarks/                       # Task definitions organized by SDLC phase + Org
  csb_sdlc_fix/                   # Bug Repair (26 tasks)
  csb_sdlc_feature/               # Feature Implementation (23 tasks)
  csb_sdlc_debug/                 # Debugging & Investigation (18 tasks)
  csb_sdlc_test/                  # Testing & QA (18 tasks)
  csb_sdlc_refactor/              # Cross-File Refactoring (16 tasks)
  csb_sdlc_design/                # Architecture & Design (14 tasks)
  csb_sdlc_document/              # Documentation (13 tasks)
  csb_sdlc_secure/                # Security & Compliance (12 tasks)
  csb_sdlc_understand/            # Requirements & Discovery (10 tasks)
  backups/                        # Archived backup tasks (28 total)
  csb_org_onboarding/             # Org: onboarding (28 tasks)
  csb_org_migration/              # Org: framework migration (26 tasks)
  csb_org_security/               # Org: vulnerability remediation (24 tasks)
  csb_org_crossrepo_tracing/      # Org: dependency tracing (22 tasks)
  csb_org_domain/                 # Org: domain lineage (20 tasks)
  csb_org_incident/               # Org: incident debugging (20 tasks)
  csb_org_compliance/             # Org: compliance & audit (18 tasks)
  csb_org_platform/               # Org: platform knowledge (18 tasks)
  csb_org_crossorg/               # Org: cross-org discovery (15 tasks)
  csb_org_org/                    # Org: org context (15 tasks)
  csb_org_crossrepo/              # Org: cross-repo discovery (14 tasks)
configs/                          # Run configs and task selection
  _common.sh                      # Shared infra: token refresh, parallel execution, multi-account
  sdlc_suite_2config.sh           # Generic SDLC runner (used by phase wrappers below)
  feature_2config.sh              # Phase wrapper: Feature (20 tasks)
  refactor_2config.sh             # Phase wrapper: Refactor (20 tasks)
  debug_2config.sh                # Phase wrapper: Debug (20 tasks)
  design_2config.sh               # Phase wrapper: Design (20 tasks)
  document_2config.sh             # Phase wrapper: Document (20 tasks)
  fix_2config.sh                  # Phase wrapper: Fix (25 tasks)
  secure_2config.sh               # Phase wrapper: Secure (20 tasks)
  test_2config.sh                 # Phase wrapper: Test (20 tasks)
  run_selected_tasks.sh           # Unified runner for all tasks
  validate_one_per_benchmark.sh   # Pre-flight smoke (1 task per suite)
  selected_benchmark_tasks.json   # Canonical task selection: 370 tasks (150 SDLC + 220 Org)
  use_case_registry.json          # 100 GTM use cases (Org task source)
  archive/                        # Pre-SDLC migration scripts (preserved for history)
scripts/                          # Metrics extraction, evaluation, and operational tooling
  csb_metrics/                    # Python package: models, extractors, discovery, judge context
  generate_eval_report.py         # CLI: deterministic evaluation report generator
  aggregate_status.py             # Core run scanner (status, errors, watch mode)
  status_fingerprints.py          # Error classification (12 regex patterns)
  validate_tasks_preflight.py     # Pre-flight task validation
  validate_task_run.py            # Post-run validation
  check_infra.py                  # Infrastructure readiness checker
  compare_configs.py              # Cross-config divergence analysis
  cost_report.py                  # Token/cost aggregation
  sync_task_metadata.py           # task.toml vs selection registry reconciliation
  generate_manifest.py            # Rebuild MANIFEST from on-disk results
  archive_run.py                  # Archive old runs to save disk
  rerun_failed.py                 # Generate rerun commands for failed tasks
  abc_audit.py                    # ABC benchmark quality audit framework
  abc_score_task.py               # Per-task quality scoring
  abc_criteria.py                 # ABC criteria data model (32 criteria)
  docs_consistency_check.py       # Documentation drift guard
tests/                            # Unit tests for scripts/
  test_abc_audit.py               # Tests for ABC audit framework
  test_abc_criteria.py            # Tests for ABC criteria data model
  test_abc_score_task.py          # Tests for task quality scorer
  test_extract_task_metrics.py    # Tests for metrics extraction
docs/                             # Operational documentation
  CONFIGS.md                      # 2-config tool breakdown
  ERROR_CATALOG.md                # Known error fingerprints, causes, fixes
  QA_PROCESS.md                   # Quality assurance and validation pipeline
  EVALUATION_PIPELINE.md          # Unified eval: verifier → judge → statistics → report
  TASK_CATALOG.md                 # Detailed per-task reference
  TASK_SELECTION.md               # Selection criteria, difficulty calibration, MCP scoring
  SCORING_SEMANTICS.md            # Reward and pass interpretation per benchmark
  MCP_UNIQUE_TASKS.md             # Org task system, authoring, oracle evaluation
  MCP_UNIQUE_CALIBRATION.md       # Oracle coverage analysis and threshold calibration
  WORKFLOW_METRICS.md             # Timing/cost metric definitions
  AGENT_INTERFACE.md              # Runtime I/O contract for agents
  EXTENSIBILITY.md                # Safe suite/task/config extension guide
  LEADERBOARD.md                  # Ranking policy
  SUBMISSION.md                   # Submission format specification
skills/                           # AI agent skill definitions (operational runbooks)
  csb/                            # CSB-specific: pre-run, monitoring, triage, analysis, maintenance
  general/                        # Reusable: workflow tools, agent delegation, dev practices
schemas/                          # JSON schemas for MANIFEST.json, task.toml, etc.
```
Each suite directory contains per-task subdirectories with instruction.md, task.toml, tests/, and ground truth (or solution/). Org tasks additionally include task_spec.json, oracle_answer.json, and Dockerfile variants for baseline/MCP-only execution.
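As a rough illustration of that layout (not the repository's actual validator; pre-flight checks live in `scripts/validate_tasks_preflight.py`), a minimal file-presence check might look like this:

```python
# Illustrative sketch only: verify a task directory contains the files described above.
# The real validation logic lives in scripts/validate_tasks_preflight.py.
from pathlib import Path

COMMON = ["instruction.md", "task.toml"]               # present in every task
ORG_EXTRAS = ["task_spec.json", "oracle_answer.json"]  # Org tasks only

def missing_files(task_dir: Path, is_org: bool) -> list[str]:
    expected = COMMON + (ORG_EXTRAS if is_org else [])
    missing = [name for name in expected if not (task_dir / name).exists()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing

for task_dir in sorted(p for p in Path("benchmarks/csb_sdlc_fix").iterdir() if p.is_dir()):
    problems = missing_files(task_dir, is_org=False)
    if problems:
        print(f"{task_dir.name}: missing {', '.join(problems)}")
```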
The scripts/ directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output.
Use `runs/analysis` for active analysis runs (and `runs/official` when producing publishable exports):

```bash
# Generate evaluation report from analysis runs
python3 scripts/generate_eval_report.py \
    --runs-dir /path/to/runs/analysis/ \
    --output-dir ./eval_reports/

# Generate LLM judge context files
python3 -m scripts.csb_metrics.judge_context \
    --runs-dir /path/to/runs/analysis/ \
    --benchmarks-dir ./benchmarks/ \
    --output-dir ./judge_contexts/
```

The report generator produces:
- `eval_report.json` -- full structured report
- `REPORT.md` -- markdown tables (performance, efficiency, tool utilization)
- `harness_configs.json` -- exact harness configuration per run
- CSV files per table for downstream analysis
See `python3 scripts/generate_eval_report.py --help` for all options.
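If you want to consume the JSON report programmatically, a small sketch like the one below can serve as a starting point; the key names (`configs`, `mean_reward`, `task_count`) are assumptions for illustration, so check your generated `eval_report.json` for the actual schema.

```python
# Hedged sketch: print a per-config summary from eval_report.json. Field names here
# are assumed for illustration; inspect your generated report for the real schema.
import json
from pathlib import Path

report = json.loads(Path("eval_reports/eval_report.json").read_text())
for config_name, stats in report.get("configs", {}).items():
    print(f"{config_name}: mean reward={stats.get('mean_reward')}, tasks={stats.get('task_count')}")
```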
To export official results (valid scored tasks only) with parsed trace summaries and local browsing UI:
```bash
python3 scripts/export_official_results.py \
    --runs-dir ./runs/official/ \
    --output-dir ./docs/official_results/
```

This writes:
- `docs/official_results/README.md` -- run/config score summary
- `docs/official_results/runs/*.md` -- per-run task tables
- `docs/official_results/tasks/*.md` -- per-task metrics + parsed tool/trace view
- `docs/official_results/data/official_results.json` -- machine-readable dataset
- `docs/official_results/audits/*.json` -- per-task audit artifacts (checksums + parsed trace events)
- `docs/official_results/traces/*/trajectory.json` -- bundled raw trajectory traces
- `docs/official_results/index.html` -- interactive local browser
Suite summaries are deduplicated to the latest result per `suite + config + task_name`; full historical rows remain in `official_results.json` under `all_tasks`.
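The deduplication rule amounts to keeping the last row per key when rows are ordered by time; a minimal sketch (with assumed field names such as `timestamp`) follows.

```python
# Hedged sketch of the dedup rule above: keep only the most recent row per
# (suite, config, task_name). Field names are assumed for illustration.
def dedupe_latest(rows: list[dict]) -> list[dict]:
    latest: dict[tuple[str, str, str], dict] = {}
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        latest[(row["suite"], row["config"], row["task_name"])] = row
    return list(latest.values())
```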
For SDLC suites, the export normalizes legacy config labels: `baseline` -> `baseline-local-direct`, `mcp` -> `mcp-remote-direct`.
Serve locally:

```bash
python3 scripts/export_official_results.py --serve
```

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see `docs/EVALUATION_PIPELINE.md`.
This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and `python3 scripts/check_infra.py`.
The unified runner executes all 370 canonical tasks across the 2-config matrix:
```bash
# Run all 370 tasks across 2 configs
bash configs/run_selected_tasks.sh

# Run only the baseline config
bash configs/run_selected_tasks.sh --baseline-only

# Run a single SDLC phase
bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix

# Dry run to list tasks without executing
bash configs/run_selected_tasks.sh --dry-run
```

Per-phase runners are also available:
```bash
bash configs/fix_2config.sh         # 26 Bug Repair tasks
bash configs/feature_2config.sh     # 23 Feature Implementation tasks
bash configs/debug_2config.sh       # 18 Debugging & Investigation tasks
bash configs/test_2config.sh        # 18 Testing & QA tasks
bash configs/refactor_2config.sh    # 16 Cross-File Refactoring tasks
bash configs/design_2config.sh      # 14 Architecture & Design tasks
bash configs/document_2config.sh    # 13 Documentation tasks
bash configs/secure_2config.sh      # 12 Security & Compliance tasks
bash configs/understand_2config.sh  # 10 Requirements & Discovery tasks
```

All tasks (SDLC and Org) are in the unified `selected_benchmark_tasks.json`. Filter by suite with the `--benchmark` flag:
```bash
# Run only Org security tasks
bash configs/run_selected_tasks.sh --benchmark csb_org_security

# Run only SDLC fix tasks
bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix
```

All runners support `--baseline-only`, `--full-only`, `--task TASK_ID`, and `--parallel N` flags.
CodeScaleBench includes a multi-stage QA pipeline to ensure task integrity, reproducible runs, and accurate scoring.
| Phase | Script | Purpose |
|---|---|---|
| Pre-flight | `scripts/validate_tasks_preflight.py` | Catches truncated instructions, template placeholders, language/difficulty mismatches, missing `test.sh` |
| Infra check | `scripts/check_infra.py` | Verifies OAuth tokens (all accounts), Docker, disk space, Harbor CLI |
| Error fingerprinting | `scripts/status_fingerprints.py` | Classifies failures with 12 regex patterns; auto-retry guidance per pattern (see the sketch below the table) |
| Post-run | `scripts/validate_task_run.py` | Flags crashes, MCP tool usage anomalies, suspicious scoring |
| Metadata sync | `scripts/sync_task_metadata.py` | Keeps `task.toml` in sync with `selected_benchmark_tasks.json`; `--fix` to auto-update |
| Run analysis | `scripts/aggregate_status.py` | Scans run dirs, classifies per-task status, writes `status.json`, supports `--watch` mode |
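To make the fingerprinting step concrete, here is a simplified sketch of regex-based error classification. The patterns and labels are illustrative; the repository's real classifier (`scripts/status_fingerprints.py`) uses its own 12 curated patterns with per-pattern retry guidance.

```python
# Simplified sketch of regex-based error fingerprinting; patterns and labels below
# are illustrative, not the 12 patterns used by scripts/status_fingerprints.py.
import re

FINGERPRINTS = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("oom", re.compile(r"out of memory|oomkilled", re.IGNORECASE)),
    ("auth", re.compile(r"\b401\b|unauthorized|invalid token", re.IGNORECASE)),
]

def classify(log_text: str) -> str:
    """Return the first matching fingerprint label, or 'unclassified'."""
    for label, pattern in FINGERPRINTS:
        if pattern.search(log_text):
            return label
    return "unclassified"

print(classify("Agent run failed: request timed out after 120s"))  # -> timeout
```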
The QA methodology uses a 6-dimension audit framework: instruction contamination, reproducibility, verifier correctness, ghost/false-positive detection, error misclassification, and tool effectiveness analysis.
See docs/QA_PROCESS.md for the full pipeline documentation and docs/ERROR_CATALOG.md for the known error catalog.
Key scripts organized by workflow phase:
| Phase | Script | Usage |
|---|---|---|
| Pre-run | `validate_tasks_preflight.py` | `python3 scripts/validate_tasks_preflight.py [--suite csb_sdlc_fix] [--task sgt-001]` |
| Pre-run | `check_infra.py` | `python3 scripts/check_infra.py` |
| During run | `aggregate_status.py --since 2h` | `python3 scripts/aggregate_status.py --since 2h` |
| Post-run | `aggregate_status.py` | `python3 scripts/aggregate_status.py [--watch]` |
| Post-run | `validate_task_run.py` | `python3 scripts/validate_task_run.py <run_dir>` |
| Analysis | `compare_configs.py` | `python3 scripts/compare_configs.py` |
| Analysis | `cost_report.py` | `python3 scripts/cost_report.py` |
| Analysis | `generate_manifest.py` | `python3 scripts/generate_manifest.py` |
| Maintenance | `sync_task_metadata.py` | `python3 scripts/sync_task_metadata.py [--fix]` |
| Maintenance | `archive_run.py` | `python3 scripts/archive_run.py <run_dir> [--compress]` |
| Maintenance | `rerun_failed.py` | `python3 scripts/rerun_failed.py [--fingerprint timeout] [--suite csb_sdlc_fix]` |
The skills/ directory contains structured runbooks for AI coding agents operating on this repository. These encode operational workflows — infrastructure checks, task validation, failure triage, report generation — so any agent (Claude Code, Cursor, Copilot, etc.) can follow them autonomously.
| Category | Skills | Description |
|---|---|---|
| CSB Operations | 20 skills in 6 files | Pre-run checks, monitoring, triage, analysis, maintenance, task authoring |
| General Purpose | 11 skills in 4 files | Session management, agent delegation, search patterns, dev practices |
Skills are plain markdown and tool-agnostic. See skills/README.md for the full index and integration guides for Cursor, Claude Code, and other agents. See docs/SKILLS.md for background on the skills system.
See LICENSE.