
feat: add AutoRL-Bench for RL post-training evaluation#1336

Open
couragec wants to merge 511 commits into main from rl-posttraining

Conversation


@couragec couragec commented Mar 4, 2026

  • Add AutoRL-Bench: an evaluation framework for RL post-training agents
  • Merge latest main into rl-posttraining, resolve 46 conflicts (all accepting main's version)
  • Includes benchmarks (GSM8K, HumanEval, ALFWorld), agents (OpenHands, OpenCode), and Streamlit UI

📚 Documentation preview 📚: https://RDAgent--1336.org.readthedocs.build/en/1336/

Jensen246 and others added 30 commits December 22, 2025 08:46
shatianming5 and others added 21 commits March 2, 2026 13:56
- Add agents/opencode/ with config.yaml, start.sh, README.md
- Include opencode-rl pipeline code (pipeline/, runner_fsm/, benchmarks/)
- Merge opencode-rl dependencies into autorl_bench requirements.txt
- Remove separate venv requirement, share main environment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sync opencode-rl runner_fsm with latest simplifications
- Add smith benchmarks integration
- Update opencompass configs and server with GPU support + error handling
- Document external repo architecture (opencode-rl as independent plugin)
- Add setup instructions for cloning and configuring opencode-rl
- Add architecture diagram showing RD-Agent ↔ opencode-rl interaction
- Document OPENCODE_RL_ROOT for custom paths
- Add smith/ module for dynamic benchmark discovery from rl-smith
- Add PerSampleEvaluator for per-sample scoring via vLLM
- Update utils.py to support script-based data download for smith benchmarks
- Update opencode agent config
- instructions.md: prohibit SFT, require RL (GRPO/PPO) for all benchmarks
- remove agents/opencode/opencode-rl/ (runtime uses external OPENCODE_RL_ROOT)

Made-with: Cursor
openai, httpx, python-dotenv, tenacity are for OpenCode agent's
separate environment. Keep peft and pydantic as shared deps.

Made-with: Cursor
- run.py: replace two nested 3-level try/except blocks with a shared
  _kill_process_group() using a loop + specific exceptions
- server.py: except Exception → except (RuntimeError, ValueError, OSError)
- utils.py: except Exception → except requests.ConnectionError

Made-with: Cursor
Extract from run.py into core/utils.py so other runners
can also use it. Exported via core/__init__.py.

Made-with: Cursor
Use relative paths, forbid cd outside workspace, ignore symlink targets.

Made-with: Cursor
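The workspace-isolation rules above (relative paths only, no escaping the workspace, symlink targets checked) can be sketched as a small validator — the function name `is_inside_workspace` and the exact rejection rules are hypothetical, not the repository's actual implementation:

```python
# Hypothetical sketch of a workspace-isolation check: accept only relative
# paths that, after resolving symlinks, stay inside the workspace root.
from pathlib import Path

def is_inside_workspace(workspace: Path, candidate: str) -> bool:
    p = Path(candidate)
    if p.is_absolute():
        return False  # rule: relative paths only
    resolved = (workspace / p).resolve()  # follows symlinks and collapses ".."
    try:
        resolved.relative_to(workspace.resolve())
    except ValueError:
        return False  # escapes via ".." or a symlink target outside the root
    return True
```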
…CLI, remove unsupported args

Made-with: Cursor
Ensures OpenCode-FSM-Runner writes outputs into the workspace prepared
by AutoRL-Bench instead of creating its own runs/ directory.

Made-with: Cursor
Ensures LLM agent bash calls (e.g. python3 -c "from trl import ...")
resolve to the correct training environment, instead of relying on
parent shell conda activation.

Made-with: Cursor
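One way the override described above might work is to prepend the training interpreter's bin directory to PATH before invoking the agent's bash command, so `python3` resolves to the training environment rather than whatever the parent shell activated. A sketch under that assumption — `run_agent_bash` is a hypothetical name:

```python
# Hypothetical sketch: route agent bash calls through the training environment
# by putting the TRAINING_PYTHON interpreter's bin/ first on PATH.
import os
import subprocess

def run_agent_bash(cmd: str) -> subprocess.CompletedProcess:
    env = os.environ.copy()
    training_python = env.get("TRAINING_PYTHON")  # e.g. /envs/train/bin/python
    if training_python:
        bin_dir = os.path.dirname(training_python)
        env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    return subprocess.run(
        ["bash", "-c", cmd], env=env, capture_output=True, text=True
    )
```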
…nfig

- Resolve dataset variable names via importlib before generating config,
  so the template uses `from xxx import datasets` instead of `import *`
- Remove the fragile runtime cleanup hack that set leaked modules to None
- Increase OpenCompass timeout from 3600s to 7200s
- Fix score parsing to average across multiple subdatasets
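The importlib-based resolution in the first bullet could look roughly like this: import the OpenCompass config module, confirm it actually defines a `datasets` variable, and emit an explicit import line instead of a wildcard. `resolve_dataset_import` is a hypothetical name for illustration:

```python
# Hypothetical sketch: verify a config module defines `datasets` before
# generating `from <module> import datasets` (instead of `import *`).
import importlib

def resolve_dataset_import(module_path: str) -> str:
    """Return an explicit import line if the config module defines `datasets`."""
    mod = importlib.import_module(module_path)
    if not hasattr(mod, "datasets"):
        raise ValueError(f"{module_path} does not define `datasets`")
    return f"from {module_path} import datasets"
```

Resolving the name ahead of time removes the need for the runtime cleanup hack that set leaked wildcard-imported modules to None.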
human-eval package requires clone from open-compass/human-eval with
a one-line patch to relax assertion for partial evaluation (test split only).

Made-with: Cursor
Merge latest main branch into rl-posttraining. Resolved 46 both-added
conflicts by accepting main's version — none of the conflicting files
are under rdagent/scenarios/rl/.

## RL post-training changes (from rl-posttraining branch)

feat: AutoRL-Bench — an evaluation framework for RL post-training agents

- feat: add AutoRL-Bench core framework (run.py, evaluator, server, utils)
- feat: add OpenCompass-based evaluator for code/math benchmarks
- feat: add GSM8K benchmark (data download, OpenCompass eval)
- feat: add HumanEval benchmark (data split, OpenCompass eval, dependency instructions)
- feat: add ALFWorld benchmark (rollout-based eval with game environments)
- feat: add Smith benchmark discovery and per-sample evaluator
- feat: register OpenHands agent into autorl_bench framework
- feat: register OpenCode agent into autorl_bench framework
- feat: parallel experiment execution with multi-port evaluation servers
- feat: baseline caching to avoid redundant evaluation
- feat: workspace isolation rules (relative paths, no cd outside workspace)
- feat: TRAINING_PYTHON env var for separate agent/training environments
- feat: Streamlit UI for experiment monitoring (ui.py)
- docs: add study.md (run guide) and stratup.md (deployment guide)
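The baseline-caching feature above can be sketched as a score cache keyed by benchmark and model identifier, so an unchanged baseline is evaluated only once. `cached_eval` and the on-disk JSON layout are assumptions, not the actual implementation:

```python
# Hypothetical sketch of baseline caching: persist scores keyed by
# (benchmark, model id) so repeated runs skip the expensive evaluation.
import hashlib
import json
from pathlib import Path

def cached_eval(cache_dir: Path, benchmark: str, model_id: str, evaluate) -> float:
    key = hashlib.sha256(f"{benchmark}:{model_id}".encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["score"]  # cache hit
    score = evaluate()  # expensive baseline evaluation (e.g. OpenCompass run)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"benchmark": benchmark, "model": model_id, "score": score})
    )
    return score
```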

## Key changes from main

- feat: add LLM-finetune scenario (#1314)
- fix: preserve null end_time in dataset segments template (#1326)
- fix: prevent calendar index overflow (#1324)
- refactor: unify qlib experiment configs and templates (#1320)
- fix: parse package names safely from requirements (#1313)

## Merge conflicts (46 files, all resolved with main's version)

- root: .gitignore, Makefile
- rdagent/core/: evaluation, evolving_agent, evolving_framework,
  exception, experiment, proposal, scenario
- rdagent/oai/backend/: base, litellm
- rdagent/log/: storage, ui/{app,ds_trace,web}
- rdagent/components/coder/: CoSTEER/evaluators, finetune/{conf,eval,prompts,unified_validator}
- rdagent/scenarios/finetune/: benchmark/, datasets/, dev/, loop, proposal/, train/
- rdagent/scenarios/qlib/: docker/, experiment/, factor_experiment_loader/
- rdagent/scenarios/data_science/: loop, proposal/
- rdagent/app/: finetune/, qlib_rd_loop/
- rdagent/utils/: env, workflow/loop
- test/: oai/, utils/
        return jsonify(result)
    except (RuntimeError, ValueError, OSError) as e:
        logger.error(f"[SUBMIT] Error: {e}")
        return jsonify({"error": str(e)}), 500

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix (AI, about 3 hours ago)

In general, the fix is to avoid returning the raw exception message from the server to the client. Instead, log the detailed error (possibly including the stack trace) on the server side and respond with a generic error message that does not reveal internal details.

The best targeted fix here is to change the except block in submit() so that:

  • The server logs the detailed exception (and, if desired, stack trace) using the existing logger.
  • The HTTP response uses a generic, localized error message, without including str(e).

Because we must not change imports except to add well-known libraries and we already have a logger, we can simply modify the returned JSON. A minimal change is to replace {"error": str(e)} with a fixed message such as {"error": "Internal server error during submission"} (or similar). If more diagnostics are needed for clients, we can also include a non-sensitive error code like {"error": "internal_error", "message": "Submission failed due to internal error"}. This keeps functionality the same from a control-flow perspective while removing sensitive information leakage. All of this happens in rdagent/scenarios/rl/autorl_bench/core/server.py within the submit function’s except block.

Suggested changeset 1
rdagent/scenarios/rl/autorl_bench/core/server.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/rdagent/scenarios/rl/autorl_bench/core/server.py b/rdagent/scenarios/rl/autorl_bench/core/server.py
--- a/rdagent/scenarios/rl/autorl_bench/core/server.py
+++ b/rdagent/scenarios/rl/autorl_bench/core/server.py
@@ -210,7 +210,7 @@
         return jsonify(result)
     except (RuntimeError, ValueError, OSError) as e:
         logger.error(f"[SUBMIT] Error: {e}")
-        return jsonify({"error": str(e)}), 500
+        return jsonify({"error": "Internal server error during submission"}), 500
 
 
 @app.route("/health", methods=["GET"])
EOF
        return safe_root
    try:
        normalized = os.path.normpath(user_input)
        path_obj = Path(normalized).expanduser()

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix (AI, about 3 hours ago)

General fix: further validate the user-provided path string before converting it into a Path, and ensure that paths containing traversal sequences (..), absolute path indicators, or other clearly unsafe content are rejected early. This keeps the logic of constraining everything under safe_root while making the taint flow obviously safe to CodeQL and to human reviewers.

Best concrete fix here:

  • Extend _safe_resolve to:
    • Reject path strings containing NUL bytes (which can cause surprising behavior in some filesystem APIs).
    • Split the normalized path into components and reject any component equal to "..".
    • Avoid constructing Path from unfiltered user input; instead, build the path by iterating over allowed components using Path() / .joinpath(...).
  • Keep existing checks that reject absolute paths and ensure the final resolved path is under safe_root.

Specific changes in rdagent/app/rl/ui/app.py:

  • Modify _safe_resolve (lines 23–36) to:
    • Normalize the input string.
    • Inspect the components of the normalized path (using Path(normalized).parts).
    • Reject ".." components before forming the path relative to safe_root.
    • Build path_obj from the filtered parts (rather than directly from the user string).
    • Keep the absolute path check and relative_to(safe_root) guard.

No new imports are strictly necessary; we can rely on the existing os and Path imports.


Suggested changeset 1
rdagent/app/rl/ui/app.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/rdagent/app/rl/ui/app.py b/rdagent/app/rl/ui/app.py
--- a/rdagent/app/rl/ui/app.py
+++ b/rdagent/app/rl/ui/app.py
@@ -25,13 +25,28 @@
     if not user_input:
         return safe_root
     try:
+        # Normalize input to collapse any ".." segments, etc.
         normalized = os.path.normpath(user_input)
-        path_obj = Path(normalized).expanduser()
-        if path_obj.is_absolute():
+        # Basic sanity checks on the raw string
+        if "\x00" in normalized:
+            raise ValueError("Null bytes are not allowed in paths")
+        # Build the path from safe components only
+        candidate = safe_root
+        for part in Path(normalized).parts:
+            # Reject explicit parent-directory traversal
+            if part == "..":
+                raise ValueError("Path traversal is not allowed")
+            # Skip current-directory references
+            if part in ("", "."):
+                continue
+            candidate = candidate / part
+        # Disallow absolute paths even after normalization
+        if Path(normalized).is_absolute():
             raise ValueError("Absolute paths are not allowed")
-        candidate = (safe_root / path_obj).resolve(strict=False)
-        candidate.relative_to(safe_root)
-        return candidate
+        # Resolve without requiring the path to exist, then ensure it is under safe_root
+        resolved = candidate.resolve(strict=False)
+        resolved.relative_to(safe_root)
+        return resolved
     except (OSError, ValueError) as exc:
         raise ValueError(f"Invalid path outside of allowed root: {user_input}") from exc
 
EOF
