Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC.
This repository contains:
- Benchmark task definitions (SDLC and Org suites with task specs, tests, and metadata)
- Evaluation and run configs (paired baseline vs MCP-enabled execution modes)
- Metrics extraction and reporting pipelines for score/cost/retrieval analysis
- Run artifacts and agent traces (in `runs/`, with published summaries under `docs/official_results/`)
Tasks are executed via the Harbor runner with the Claude Code agent harness.
- Researchers evaluating coding agents on realistic software engineering tasks
- Practitioners comparing baseline vs MCP-enabled agent configurations
You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.
```bash
git clone https://github.com/sourcegraph/CodeScaleBench.git
cd CodeScaleBench

# Fast repo sanity check (docs/config refs)
python3 scripts/repo_health.py --quick

# Explore task-based docs navigation
sed -n '1,120p' docs/START_HERE_BY_TASK.md

# Inspect available benchmark suites
ls benchmarks
```

Running benchmark tasks requires:
- Harbor installed and configured
Our internal default setup often uses:
- Daytona account and API key (preferred in this repo). See `docs/DAYTONA.md`
- Docker for Daytona-incompatible tasks
- Agent/runtime credentials as needed by your Harbor harness
Recommended pre-run checks:
```bash
python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --all
```

Then start with a dry run:

```bash
bash configs/run_selected_tasks.sh --dry-run
```

See also:
- `docs/START_HERE_BY_TASK.md` for task-oriented navigation
- `docs/reference/CONFIGS.md` for the 2-config evaluation matrix
- `docs/EVALUATION_PIPELINE.md` for scoring and reporting outputs
Nine suites organized by software development lifecycle phase:
| Suite | SDLC Phase | Tasks | Description |
|---|---|---|---|
| `csb_sdlc_fix` | Bug Repair | 26 | Diagnosing and fixing real bugs across production codebases |
| `csb_sdlc_feature` | Feature Implementation | 23 | New features, interface implementation, big-code features |
| `csb_sdlc_debug` | Debugging & Investigation | 18 | Root cause tracing, fault localization, provenance |
| `csb_sdlc_test` | Testing & QA | 18 | Code review, performance testing, code search validation, test generation |
| `csb_sdlc_refactor` | Cross-File Refactoring | 16 | Cross-file refactoring, enterprise dependency refactoring, rename refactoring |
| `csb_sdlc_design` | Architecture & Design | 14 | Architecture analysis, dependency graphs, change impact |
| `csb_sdlc_document` | Documentation | 13 | API references, architecture docs, migration guides, runbooks |
| `csb_sdlc_secure` | Security & Compliance | 12 | CVE analysis, reachability, governance, access control |
| `csb_sdlc_understand` | Requirements & Discovery | 10 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| **Total** | | **150** | |
Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
| Suite | Category | Tasks | Description |
|---|---|---|---|
| `csb_org_onboarding` | Onboarding & Comprehension | 28 | API consumption mapping, end-to-end flow, architecture maps |
| `csb_org_migration` | Framework Migration | 26 | API migrations, breaking changes across repos |
| `csb_org_security` | Vulnerability Remediation | 24 | CVE mapping, missing auth middleware across repos |
| `csb_org_crossrepo_tracing` | Dependency Tracing | 22 | Cross-repo dependency chains, blast radius, symbol resolution |
| `csb_org_domain` | Domain Lineage | 20 | Config propagation, architecture patterns, domain analysis |
| `csb_org_incident` | Incident Debugging | 20 | Error-to-code-path tracing across microservices |
| `csb_org_compliance` | Compliance | 18 | Standards adherence, audit, and provenance workflows |
| `csb_org_platform` | Platform Knowledge | 18 | Service template discovery and tribal knowledge |
| `csb_org_crossorg` | Cross-Org Discovery | 15 | Interface implementations and authoritative repo identification across orgs |
| `csb_org_org` | Organizational Context | 15 | Agentic discovery, org-wide coding correctness |
| `csb_org_crossrepo` | Cross-Repo Discovery | 14 | Cross-repo search, dependency discovery, impact analysis |
| **Total** | | **220** | |
Combined canonical benchmark: 370 tasks (150 SDLC across 9 suites + 220 Org across 11 suites). Suite sizes are DOE-driven (Neyman-optimal allocation) to maximize statistical power per suite rather than uniform 20-task sizing. An additional 28 backup tasks are archived in benchmarks/backups/.
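For intuition on the allocation rule, here is a minimal sketch of Neyman-optimal allocation. The pool sizes and score standard deviations below are made-up placeholders for illustration, not the calibration data actually used to size the suites.

```python
# Hedged sketch of Neyman-optimal allocation: each suite gets tasks in proportion
# to (candidate pool size) x (pilot score standard deviation). Inputs are illustrative.
def neyman_allocation(total_tasks: int, strata: dict[str, tuple[int, float]]) -> dict[str, int]:
    """strata maps suite name -> (candidate pool size N_h, pilot score std dev s_h)."""
    weights = {suite: n_h * s_h for suite, (n_h, s_h) in strata.items()}
    total = sum(weights.values())
    return {suite: round(total_tasks * w / total) for suite, w in weights.items()}

# Hypothetical pilot numbers for three suites; noisier, larger strata receive more tasks.
print(neyman_allocation(60, {
    "csb_sdlc_fix": (120, 0.42),
    "csb_sdlc_feature": (90, 0.40),
    "csb_sdlc_understand": (45, 0.25),
}))
```

Suites with larger candidate pools and higher per-task score variance receive proportionally more tasks, which is why the suite sizes above are uneven rather than a flat 20 per suite.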
Both baseline and MCP-Full agents have access to all repos in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
See docs/MCP_UNIQUE_TASKS.md for the full task system, authoring guide, and oracle evaluation framework. See docs/MCP_UNIQUE_CALIBRATION.md for oracle coverage analysis.
All benchmarks are evaluated across two primary configurations (Baseline vs MCP). The concrete run config names differ by task type:
- SDLC suites (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
- Org suites (`csb_org_*`): `baseline-local-direct` + `mcp-remote-direct`
At a high level, the distinction is:
| Config Name | Internal MCP mode | MCP Tools Available |
|---|---|---|
| Baseline | `none` | None (agent uses only built-in tools) |
| MCP | `sourcegraph` / `artifact` (task-dependent) | All 13 Sourcegraph MCP tools, including `sg_deepsearch` and `sg_deepsearch_read` |
See docs/reference/CONFIGS.md for the canonical configuration matrix and tool-by-tool breakdown.
```
benchmarks/                       # Task definitions organized by SDLC phase + Org
  csb_sdlc_fix/                   # Bug Repair (26 tasks)
  csb_sdlc_feature/               # Feature Implementation (23 tasks)
  csb_sdlc_debug/                 # Debugging & Investigation (18 tasks)
  csb_sdlc_test/                  # Testing & QA (18 tasks)
  csb_sdlc_refactor/              # Cross-File Refactoring (16 tasks)
  csb_sdlc_design/                # Architecture & Design (14 tasks)
  csb_sdlc_document/              # Documentation (13 tasks)
  csb_sdlc_secure/                # Security & Compliance (12 tasks)
  csb_sdlc_understand/            # Requirements & Discovery (10 tasks)
  backups/                        # Archived backup tasks (28 total)
  csb_org_onboarding/             # Org: onboarding (28 tasks)
  csb_org_migration/              # Org: framework migration (26 tasks)
  csb_org_security/               # Org: vulnerability remediation (24 tasks)
  csb_org_crossrepo_tracing/      # Org: dependency tracing (22 tasks)
  csb_org_domain/                 # Org: domain lineage (20 tasks)
  csb_org_incident/               # Org: incident debugging (20 tasks)
  csb_org_compliance/             # Org: compliance & audit (18 tasks)
  csb_org_platform/               # Org: platform knowledge (18 tasks)
  csb_org_crossorg/               # Org: cross-org discovery (15 tasks)
  csb_org_org/                    # Org: org context (15 tasks)
  csb_org_crossrepo/              # Org: cross-repo discovery (14 tasks)
configs/                          # Run configs and task selection
  _common.sh                      # Shared infra: token refresh, parallel execution, multi-account
  sdlc_suite_2config.sh           # Generic SDLC runner (used by phase wrappers below)
  feature_2config.sh              # Phase wrapper: Feature (20 tasks)
  refactor_2config.sh             # Phase wrapper: Refactor (20 tasks)
  debug_2config.sh                # Phase wrapper: Debug (20 tasks)
  design_2config.sh               # Phase wrapper: Design (20 tasks)
  document_2config.sh             # Phase wrapper: Document (20 tasks)
  fix_2config.sh                  # Phase wrapper: Fix (25 tasks)
  secure_2config.sh               # Phase wrapper: Secure (20 tasks)
  test_2config.sh                 # Phase wrapper: Test (20 tasks)
  run_selected_tasks.sh           # Unified runner for all tasks
  validate_one_per_benchmark.sh   # Pre-flight smoke (1 task per suite)
  selected_benchmark_tasks.json   # Canonical task selection: 370 tasks (150 SDLC + 220 Org)
  use_case_registry.json          # 100 GTM use cases (Org task source)
  archive/                        # Pre-SDLC migration scripts (preserved for history)
scripts/                          # Metrics extraction, evaluation, and operational tooling
  csb_metrics/                    # Python package: models, extractors, discovery, judge context
  generate_eval_report.py         # CLI: deterministic evaluation report generator
  aggregate_status.py             # Core run scanner (status, errors, watch mode)
  status_fingerprints.py          # Error classification (12 regex patterns)
  validate_tasks_preflight.py     # Pre-flight task validation
  validate_task_run.py            # Post-run validation
  check_infra.py                  # Infrastructure readiness checker
  compare_configs.py              # Cross-config divergence analysis
  cost_report.py                  # Token/cost aggregation
  sync_task_metadata.py           # task.toml vs selection registry reconciliation
  generate_manifest.py            # Rebuild MANIFEST from on-disk results
  archive_run.py                  # Archive old runs to save disk
  rerun_failed.py                 # Generate rerun commands for failed tasks
  abc_audit.py                    # ABC benchmark quality audit framework
  abc_score_task.py               # Per-task quality scoring
  abc_criteria.py                 # ABC criteria data model (32 criteria)
  docs_consistency_check.py       # Documentation drift guard
tests/                            # Unit tests for scripts/
  test_abc_audit.py               # Tests for ABC audit framework
  test_abc_criteria.py            # Tests for ABC criteria data model
  test_abc_score_task.py          # Tests for task quality scorer
  test_extract_task_metrics.py    # Tests for metrics extraction
docs/                             # Operational documentation
  CONFIGS.md                      # 2-config tool breakdown
  ERROR_CATALOG.md                # Known error fingerprints, causes, fixes
  QA_PROCESS.md                   # Quality assurance and validation pipeline
  EVALUATION_PIPELINE.md          # Unified eval: verifier → judge → statistics → report
  TASK_CATALOG.md                 # Detailed per-task reference
  TASK_SELECTION.md               # Selection criteria, difficulty calibration, MCP scoring
  SCORING_SEMANTICS.md            # Reward and pass interpretation per benchmark
  MCP_UNIQUE_TASKS.md             # Org task system, authoring, oracle evaluation
  MCP_UNIQUE_CALIBRATION.md       # Oracle coverage analysis and threshold calibration
  WORKFLOW_METRICS.md             # Timing/cost metric definitions
  AGENT_INTERFACE.md              # Runtime I/O contract for agents
  EXTENSIBILITY.md                # Safe suite/task/config extension guide
  LEADERBOARD.md                  # Ranking policy
  SUBMISSION.md                   # Submission format specification
skills/                           # AI agent skill definitions (operational runbooks)
  csb/                            # CSB-specific: pre-run, monitoring, triage, analysis, maintenance
  general/                        # Reusable: workflow tools, agent delegation, dev practices
schemas/                          # JSON schemas for MANIFEST.json, task.toml, etc.
```
Each suite directory contains per-task subdirectories with instruction.md, task.toml, tests/, and ground truth (or solution/). Org tasks additionally include task_spec.json, oracle_answer.json, and Dockerfile variants for baseline/MCP-only execution.
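As a rough illustration of that layout (not the repository's actual validator; pre-flight checks live in `scripts/validate_tasks_preflight.py`), a minimal file-presence check might look like this:

```python
# Illustrative sketch only: verify a task directory contains the files described above.
# The real validation logic lives in scripts/validate_tasks_preflight.py.
from pathlib import Path

COMMON = ["instruction.md", "task.toml"]               # present in every task
ORG_EXTRAS = ["task_spec.json", "oracle_answer.json"]  # Org tasks only

def missing_files(task_dir: Path, is_org: bool) -> list[str]:
    expected = COMMON + (ORG_EXTRAS if is_org else [])
    missing = [name for name in expected if not (task_dir / name).exists()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing

for task_dir in sorted(p for p in Path("benchmarks/csb_sdlc_fix").iterdir() if p.is_dir()):
    problems = missing_files(task_dir, is_org=False)
    if problems:
        print(f"{task_dir.name}: missing {', '.join(problems)}")
```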
The scripts/ directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output.
Use `runs/analysis` for active analysis runs (and `runs/official` when producing publishable exports):

```bash
# Generate evaluation report from analysis runs
python3 scripts/generate_eval_report.py \
    --runs-dir /path/to/runs/analysis/ \
    --output-dir ./eval_reports/

# Generate LLM judge context files
python3 -m scripts.csb_metrics.judge_context \
    --runs-dir /path/to/runs/analysis/ \
    --benchmarks-dir ./benchmarks/ \
    --output-dir ./judge_contexts/
```

The report generator produces:
- `eval_report.json` -- full structured report
- `REPORT.md` -- markdown tables (performance, efficiency, tool utilization)
- `harness_configs.json` -- exact harness configuration per run
- CSV files per table for downstream analysis
See `python3 scripts/generate_eval_report.py --help` for all options.
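If you want to consume the JSON report programmatically, a small sketch like the one below can serve as a starting point; the key names (`configs`, `mean_reward`, `task_count`) are assumptions for illustration, so check your generated `eval_report.json` for the actual schema.

```python
# Hedged sketch: print a per-config summary from eval_report.json. Field names here
# are assumed for illustration; inspect your generated report for the real schema.
import json
from pathlib import Path

report = json.loads(Path("eval_reports/eval_report.json").read_text())
for config_name, stats in report.get("configs", {}).items():
    print(f"{config_name}: mean reward={stats.get('mean_reward')}, tasks={stats.get('task_count')}")
```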
To export official results (valid scored tasks only) with parsed trace summaries and local browsing UI:
```bash
python3 scripts/export_official_results.py \
    --runs-dir ./runs/official/ \
    --output-dir ./docs/official_results/
```

This writes:
- `docs/official_results/README.md` -- run/config score summary
- `docs/official_results/runs/*.md` -- per-run task tables
- `docs/official_results/tasks/*.md` -- per-task metrics + parsed tool/trace view
- `docs/official_results/data/official_results.json` -- machine-readable dataset
- `docs/official_results/audits/*.json` -- per-task audit artifacts (checksums + parsed trace events)
- `docs/official_results/traces/*/trajectory.json` -- bundled raw trajectory traces
- `docs/official_results/index.html` -- interactive local browser
Suite summaries are deduplicated to the latest result per `suite + config + task_name`; full historical rows remain in `official_results.json` under `all_tasks`.
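The deduplication rule amounts to keeping the last row per key when rows are ordered by time; a minimal sketch (with assumed field names such as `timestamp`) follows.

```python
# Hedged sketch of the dedup rule above: keep only the most recent row per
# (suite, config, task_name). Field names are assumed for illustration.
def dedupe_latest(rows: list[dict]) -> list[dict]:
    latest: dict[tuple[str, str, str], dict] = {}
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        latest[(row["suite"], row["config"], row["task_name"])] = row
    return list(latest.values())
```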
For SDLC suites, the export normalizes legacy config labels: `baseline` -> `baseline-local-direct`, `mcp` -> `mcp-remote-direct`.
Serve locally:

```bash
python3 scripts/export_official_results.py --serve
```

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see `docs/EVALUATION_PIPELINE.md`.
This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and `python3 scripts/check_infra.py`.
The unified runner executes all 370 canonical tasks across the 2-config matrix:
```bash
# Run all 370 tasks across 2 configs
bash configs/run_selected_tasks.sh

# Run only the baseline config
bash configs/run_selected_tasks.sh --baseline-only

# Run a single SDLC phase
bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix

# Dry run to list tasks without executing
bash configs/run_selected_tasks.sh --dry-run
```

Per-phase runners are also available:
```bash
bash configs/fix_2config.sh         # 26 Bug Repair tasks
bash configs/feature_2config.sh     # 23 Feature Implementation tasks
bash configs/debug_2config.sh       # 18 Debugging & Investigation tasks
bash configs/test_2config.sh        # 18 Testing & QA tasks
bash configs/refactor_2config.sh    # 16 Cross-File Refactoring tasks
bash configs/design_2config.sh      # 14 Architecture & Design tasks
bash configs/document_2config.sh    # 13 Documentation tasks
bash configs/secure_2config.sh      # 12 Security & Compliance tasks
bash configs/understand_2config.sh  # 10 Requirements & Discovery tasks
```

All tasks (SDLC and Org) are in the unified `selected_benchmark_tasks.json`. Filter by suite with the `--benchmark` flag:
```bash
# Run only Org security tasks
bash configs/run_selected_tasks.sh --benchmark csb_org_security

# Run only SDLC fix tasks
bash configs/run_selected_tasks.sh --benchmark csb_sdlc_fix
```

All runners support `--baseline-only`, `--full-only`, `--task TASK_ID`, and `--parallel N` flags.
CodeScaleBench includes a multi-stage QA pipeline to ensure task integrity, reproducible runs, and accurate scoring.
| Phase | Script | Purpose |
|---|---|---|
| Pre-flight | `scripts/validate_tasks_preflight.py` | Catches truncated instructions, template placeholders, language/difficulty mismatches, missing `test.sh` |
| Infra check | `scripts/check_infra.py` | Verifies OAuth tokens (all accounts), Docker, disk space, Harbor CLI |
| Error fingerprinting | `scripts/status_fingerprints.py` | Classifies failures with 12 regex patterns; auto-retry guidance per pattern (see the sketch below the table) |
| Post-run | `scripts/validate_task_run.py` | Flags crashes, MCP tool usage anomalies, suspicious scoring |
| Metadata sync | `scripts/sync_task_metadata.py` | Keeps `task.toml` in sync with `selected_benchmark_tasks.json`; `--fix` to auto-update |
| Run analysis | `scripts/aggregate_status.py` | Scans run dirs, classifies per-task status, writes `status.json`, supports `--watch` mode |
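To make the fingerprinting step concrete, here is a simplified sketch of regex-based error classification. The patterns and labels are illustrative; the repository's real classifier (`scripts/status_fingerprints.py`) uses its own 12 curated patterns with per-pattern retry guidance.

```python
# Simplified sketch of regex-based error fingerprinting; patterns and labels below
# are illustrative, not the 12 patterns used by scripts/status_fingerprints.py.
import re

FINGERPRINTS = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("oom", re.compile(r"out of memory|oomkilled", re.IGNORECASE)),
    ("auth", re.compile(r"\b401\b|unauthorized|invalid token", re.IGNORECASE)),
]

def classify(log_text: str) -> str:
    """Return the first matching fingerprint label, or 'unclassified'."""
    for label, pattern in FINGERPRINTS:
        if pattern.search(log_text):
            return label
    return "unclassified"

print(classify("Agent run failed: request timed out after 120s"))  # -> timeout
```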
The QA methodology uses a 6-dimension audit framework: instruction contamination, reproducibility, verifier correctness, ghost/false-positive detection, error misclassification, and tool effectiveness analysis.
See docs/QA_PROCESS.md for the full pipeline documentation and docs/ERROR_CATALOG.md for the known error catalog.
Key scripts organized by workflow phase:
| Phase | Script | Usage |
|---|---|---|
| Pre-run | `validate_tasks_preflight.py` | `python3 scripts/validate_tasks_preflight.py [--suite csb_sdlc_fix] [--task sgt-001]` |
| Pre-run | `check_infra.py` | `python3 scripts/check_infra.py` |
| During run | `aggregate_status.py --since 2h` | `python3 scripts/aggregate_status.py --since 2h` |
| Post-run | `aggregate_status.py` | `python3 scripts/aggregate_status.py [--watch]` |
| Post-run | `validate_task_run.py` | `python3 scripts/validate_task_run.py <run_dir>` |
| Analysis | `compare_configs.py` | `python3 scripts/compare_configs.py` |
| Analysis | `cost_report.py` | `python3 scripts/cost_report.py` |
| Analysis | `generate_manifest.py` | `python3 scripts/generate_manifest.py` |
| Maintenance | `sync_task_metadata.py` | `python3 scripts/sync_task_metadata.py [--fix]` |
| Maintenance | `archive_run.py` | `python3 scripts/archive_run.py <run_dir> [--compress]` |
| Maintenance | `rerun_failed.py` | `python3 scripts/rerun_failed.py [--fingerprint timeout] [--suite csb_sdlc_fix]` |
The skills/ directory contains structured runbooks for AI coding agents operating on this repository. These encode operational workflows — infrastructure checks, task validation, failure triage, report generation — so any agent (Claude Code, Cursor, Copilot, etc.) can follow them autonomously.
| Category | Skills | Description |
|---|---|---|
| CSB Operations | 20 skills in 6 files | Pre-run checks, monitoring, triage, analysis, maintenance, task authoring |
| General Purpose | 11 skills in 4 files | Session management, agent delegation, search patterns, dev practices |
Skills are plain markdown and tool-agnostic. See skills/README.md for the full index and integration guides for Cursor, Claude Code, and other agents. See docs/SKILLS.md for background on the skills system.
See LICENSE.