feat: add AutoRL-Bench for RL post-training evaluation #1336
Conversation
- Add agents/opencode/ with config.yaml, start.sh, README.md
- Include opencode-rl pipeline code (pipeline/, runner_fsm/, benchmarks/)
- Merge opencode-rl dependencies into autorl_bench requirements.txt
- Remove separate venv requirement, share main environment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sync opencode-rl runner_fsm with latest simplifications
- Add smith benchmarks integration
- Update opencompass configs and server with GPU support + error handling
- Document external repo architecture (opencode-rl as independent plugin)
- Add setup instructions for cloning and configuring opencode-rl
- Add architecture diagram showing RD-Agent ↔ opencode-rl interaction
- Document OPENCODE_RL_ROOT for custom paths
- Add smith/ module for dynamic benchmark discovery from rl-smith
- Add PerSampleEvaluator for per-sample scoring via vLLM
- Update utils.py to support script-based data download for smith benchmarks
- Update opencode agent config
- instructions.md: prohibit SFT, require RL (GRPO/PPO) for all benchmarks
- remove agents/opencode/opencode-rl/ (runtime uses external OPENCODE_RL_ROOT)

Made-with: Cursor
`openai`, `httpx`, `python-dotenv`, and `tenacity` are for the OpenCode agent's separate environment. Keep `peft` and `pydantic` as shared deps.

Made-with: Cursor
- run.py: replace 2x nested 3-level try/except with shared `_kill_process_group()` using loop + specific exceptions
- server.py: `except Exception` → `except (RuntimeError, ValueError, OSError)`
- utils.py: `except Exception` → `except requests.ConnectionError`

Made-with: Cursor
Extract from run.py into core/utils.py so other runners can also use it. Exported via core/__init__.py. Made-with: Cursor
Use relative paths, forbid cd outside workspace, ignore symlink targets. Made-with: Cursor
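A minimal, purely lexical check for these isolation rules could look like this. The name `is_workspace_relative` is illustrative, not a function from this PR; it deliberately does not resolve symlink targets, matching the "ignore symlink targets" rule above.

```python
from pathlib import PurePosixPath


def is_workspace_relative(path: str) -> bool:
    """Accept only relative paths that cannot escape the workspace.

    Lexical check only: absolute paths and any `..` component are
    rejected; symlink targets are intentionally not followed.
    """
    p = PurePosixPath(path)
    if p.is_absolute():
        return False
    return ".." not in p.parts
```

A check like this would run before any agent-issued file operation or `cd`, rejecting the path rather than trying to sanitize it.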
…CLI, remove unsupported args Made-with: Cursor
Ensures OpenCode-FSM-Runner writes outputs into the workspace prepared by AutoRL-Bench instead of creating its own runs/ directory. Made-with: Cursor
Ensures LLM agent bash calls (e.g. python3 -c "from trl import ...") resolve to the correct training environment, instead of relying on parent shell conda activation. Made-with: Cursor
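One way to implement this is to resolve the interpreter from the environment variable at call time instead of inheriting the parent shell's conda activation. A sketch, where `TRAINING_PYTHON` is the variable named in this commit and both function names are illustrative:

```python
import os
import shutil
import subprocess


def training_python() -> str:
    """Resolve the training environment's Python interpreter.

    TRAINING_PYTHON should point at the venv/conda interpreter that has
    trl, peft, etc. installed; otherwise fall back to python3 on PATH.
    """
    return os.environ.get("TRAINING_PYTHON") or shutil.which("python3") or "python3"


def run_in_training_env(code: str) -> subprocess.CompletedProcess:
    """Run a one-liner (e.g. an import check) under the training interpreter."""
    return subprocess.run(
        [training_python(), "-c", code],
        capture_output=True,
        text=True,
    )
```

With this, an agent call like `python3 -c "from trl import ..."` can be rewritten to use the resolved interpreter explicitly, so it works regardless of which environment the parent shell activated.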
…ode-rl Made-with: Cursor
…nfig

- Resolve dataset variable names via importlib before generating config, so the template uses `from xxx import datasets` instead of `import *`
- Remove the fragile runtime cleanup hack that set leaked modules to None
- Increase OpenCompass timeout from 3600s to 7200s
- Fix score parsing to average across multiple subdatasets
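The importlib-based resolution might be sketched as follows. The names here are illustrative and the real config-generation code in this PR is more involved; the point is verifying the export exists before emitting an explicit import instead of `import *`.

```python
import importlib


def resolve_datasets_import(module_name: str, var_name: str = "datasets") -> str:
    """Return an explicit import line for a generated OpenCompass config.

    Imports the config module and checks that it actually exports
    `var_name`, so the template can say `from <module> import datasets`
    rather than a fragile `from <module> import *`.
    """
    module = importlib.import_module(module_name)
    if not hasattr(module, var_name):
        raise AttributeError(f"{module_name} does not define {var_name!r}")
    return f"from {module_name} import {var_name}"
```

Failing fast here (at config-generation time) also removes the need for the runtime cleanup hack mentioned above, since no unknown names leak into the template's namespace.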
The human-eval package requires a clone from open-compass/human-eval with a one-line patch that relaxes an assertion to allow partial evaluation (test split only). Made-with: Cursor
Merge latest main branch into rl-posttraining. Resolved 46 both-added conflicts by accepting main's version — none of the conflicting files are under rdagent/scenarios/rl/.

## RL post-training changes (from rl-posttraining branch)

feat: AutoRL-Bench — an evaluation framework for RL post-training agents

- feat: add AutoRL-Bench core framework (run.py, evaluator, server, utils)
- feat: add OpenCompass-based evaluator for code/math benchmarks
- feat: add GSM8K benchmark (data download, OpenCompass eval)
- feat: add HumanEval benchmark (data split, OpenCompass eval, dependency instructions)
- feat: add ALFWorld benchmark (rollout-based eval with game environments)
- feat: add Smith benchmark discovery and per-sample evaluator
- feat: register OpenHands agent into autorl_bench framework
- feat: register OpenCode agent into autorl_bench framework
- feat: parallel experiment execution with multi-port evaluation servers
- feat: baseline caching to avoid redundant evaluation
- feat: workspace isolation rules (relative paths, no cd outside workspace)
- feat: TRAINING_PYTHON env var for separate agent/training environments
- feat: Streamlit UI for experiment monitoring (ui.py)
- docs: add study.md (run guide) and stratup.md (deployment guide)

## Key changes from main

- feat: add LLM-finetune scenario (#1314)
- fix: preserve null end_time in dataset segments template (#1326)
- fix: prevent calendar index overflow (#1324)
- refactor: unify qlib experiment configs and templates (#1320)
- fix: parse package names safely from requirements (#1313)

## Merge conflicts (46 files, all resolved with main's version)

- root: .gitignore, Makefile
- rdagent/core/: evaluation, evolving_agent, evolving_framework, exception, experiment, proposal, scenario
- rdagent/oai/backend/: base, litellm
- rdagent/log/: storage, ui/{app,ds_trace,web}
- rdagent/components/coder/: CoSTEER/evaluators, finetune/{conf,eval,prompts,unified_validator}
- rdagent/scenarios/finetune/: benchmark/, datasets/, dev/, loop, proposal/, train/
- rdagent/scenarios/qlib/: docker/, experiment/, factor_experiment_loader/
- rdagent/scenarios/data_science/: loop, proposal/
- rdagent/app/: finetune/, qlib_rd_loop/
- rdagent/utils/: env, workflow/loop
- test/: oai/, utils/
```python
        return jsonify(result)
    except (RuntimeError, ValueError, OSError) as e:
        logger.error(f"[SUBMIT] Error: {e}")
        return jsonify({"error": str(e)}), 500
```
**Check warning** (Code scanning / CodeQL): Information exposure through an exception (Medium)

**Copilot Autofix**
In general, the fix is to avoid returning the raw exception message from the server to the client. Instead, log the detailed error (possibly including the stack trace) on the server side and respond with a generic error message that does not reveal internal details.
The best targeted fix here is to change the except block in `submit()` so that:

- The server logs the detailed exception (and, if desired, stack trace) using the existing logger.
- The HTTP response uses a generic, localized error message, without including `str(e)`.

Because we must not change imports except to add well-known libraries and we already have a logger, we can simply modify the returned JSON. A minimal change is to replace `{"error": str(e)}` with a fixed message such as `{"error": "Internal server error during submission"}` (or similar). If more diagnostics are needed for clients, we can also include a non-sensitive error code like `{"error": "internal_error", "message": "Submission failed due to internal error"}`. This keeps functionality the same from a control-flow perspective while removing sensitive information leakage. All of this happens in `rdagent/scenarios/rl/autorl_bench/core/server.py` within the `submit` function's except block.
```diff
@@ -210,7 +210,7 @@
         return jsonify(result)
     except (RuntimeError, ValueError, OSError) as e:
         logger.error(f"[SUBMIT] Error: {e}")
-        return jsonify({"error": str(e)}), 500
+        return jsonify({"error": "Internal server error during submission"}), 500


 @app.route("/health", methods=["GET"])
```
```python
        return safe_root
    try:
        normalized = os.path.normpath(user_input)
        path_obj = Path(normalized).expanduser()
```
**Check failure** (Code scanning / CodeQL): Uncontrolled data used in path expression (High)

**Copilot Autofix**
General fix: further validate the user-provided path string before converting it into a `Path`, and ensure that paths containing traversal sequences (`..`), absolute path indicators, or other clearly unsafe content are rejected early. This keeps the logic of constraining everything under `safe_root` while making the taint flow obviously safe to CodeQL and to human reviewers.

Best concrete fix here:

- Extend `_safe_resolve` to:
  - Reject path strings containing NUL bytes (which can cause surprising behavior in some filesystem APIs).
  - Split the normalized path into components and reject any component equal to `".."`.
  - Avoid constructing `Path` from unfiltered user input; instead, build the path by iterating over allowed components using `Path()` / `.joinpath(...)`.
- Keep existing checks that reject absolute paths and ensure the final resolved path is under `safe_root`.

Specific changes in `rdagent/app/rl/ui/app.py`:

- Modify `_safe_resolve` (lines 23–36) to:
  - Normalize the input string.
  - Inspect the components of the normalized path (using `Path(normalized).parts`).
  - Reject `".."` components before forming the path relative to `safe_root`.
  - Build the path from the filtered parts (rather than directly from the user string).
  - Keep the absolute path check and `relative_to(safe_root)` guard.

No new imports are strictly necessary; we can rely on the existing `os` and `Path` imports.
```diff
@@ -25,13 +25,28 @@
     if not user_input:
         return safe_root
     try:
         # Normalize input to collapse any ".." segments, etc.
         normalized = os.path.normpath(user_input)
-        path_obj = Path(normalized).expanduser()
-        if path_obj.is_absolute():
+        # Basic sanity checks on the raw string
+        if "\x00" in normalized:
+            raise ValueError("Null bytes are not allowed in paths")
+        # Build the path from safe components only
+        candidate = safe_root
+        for part in Path(normalized).parts:
+            # Reject explicit parent-directory traversal
+            if part == "..":
+                raise ValueError("Path traversal is not allowed")
+            # Skip current-directory references
+            if part in ("", "."):
+                continue
+            candidate = candidate / part
+        # Disallow absolute paths even after normalization
+        if Path(normalized).is_absolute():
             raise ValueError("Absolute paths are not allowed")
-        candidate = (safe_root / path_obj).resolve(strict=False)
-        candidate.relative_to(safe_root)
-        return candidate
+        # Resolve without requiring the path to exist, then ensure it is under safe_root
+        resolved = candidate.resolve(strict=False)
+        resolved.relative_to(safe_root)
+        return resolved
     except (OSError, ValueError) as exc:
         raise ValueError(f"Invalid path outside of allowed root: {user_input}") from exc
```
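Put together, the hardened helper from the autofix behaves roughly like this self-contained sketch. The function name and signature are assumed here, since only the hunk is shown; `safe_root` is the trusted base directory the UI is allowed to serve from.

```python
import os
from pathlib import Path


def safe_resolve(safe_root: Path, user_input: str) -> Path:
    """Sketch of the hardened path resolution from the autofix above."""
    if not user_input:
        return safe_root
    try:
        normalized = os.path.normpath(user_input)
        # Basic sanity checks on the raw string
        if "\x00" in normalized:
            raise ValueError("Null bytes are not allowed in paths")
        # Build the path from safe components only
        candidate = safe_root
        for part in Path(normalized).parts:
            if part == "..":
                raise ValueError("Path traversal is not allowed")
            if part in ("", "."):
                continue
            candidate = candidate / part
        # Disallow absolute paths even after normalization
        if Path(normalized).is_absolute():
            raise ValueError("Absolute paths are not allowed")
        # Resolve without requiring existence, then confine to safe_root
        resolved = candidate.resolve(strict=False)
        resolved.relative_to(safe_root)
        return resolved
    except (OSError, ValueError) as exc:
        raise ValueError(f"Invalid path outside of allowed root: {user_input}") from exc
```

Note that `safe_root` should itself be pre-resolved (symlinks expanded), otherwise the final `relative_to` containment check can reject valid paths.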
📚 Documentation preview 📚: https://RDAgent--1336.org.readthedocs.build/en/1336/