
feat: add AutoRL-Bench for RL post-training evaluation#1336

Open
couragec wants to merge 511 commits into main from rl-posttraining

Conversation


@couragec couragec commented Mar 4, 2026

  • Add AutoRL-Bench: an evaluation framework for RL post-training agents
  • Merge latest main into rl-posttraining, resolve 46 conflicts (all accepting main's version)
  • Includes benchmarks (GSM8K, HumanEval, ALFWorld), agents (OpenHands, OpenCode), and Streamlit UI

📚 Documentation preview 📚: https://RDAgent--1336.org.readthedocs.build/en/1336/

Jensen246 and others added 30 commits December 22, 2025 08:46
shatianming5 and others added 21 commits March 2, 2026 13:56
- Add agents/opencode/ with config.yaml, start.sh, README.md
- Include opencode-rl pipeline code (pipeline/, runner_fsm/, benchmarks/)
- Merge opencode-rl dependencies into autorl_bench requirements.txt
- Remove separate venv requirement, share main environment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sync opencode-rl runner_fsm with latest simplifications
- Add smith benchmarks integration
- Update opencompass configs and server with GPU support + error handling
- Document external repo architecture (opencode-rl as independent plugin)
- Add setup instructions for cloning and configuring opencode-rl
- Add architecture diagram showing RD-Agent ↔ opencode-rl interaction
- Document OPENCODE_RL_ROOT for custom paths
- Add smith/ module for dynamic benchmark discovery from rl-smith
- Add PerSampleEvaluator for per-sample scoring via vLLM
- Update utils.py to support script-based data download for smith benchmarks
- Update opencode agent config
- instructions.md: prohibit SFT, require RL (GRPO/PPO) for all benchmarks
- remove agents/opencode/opencode-rl/ (runtime uses external OPENCODE_RL_ROOT)

Made-with: Cursor
openai, httpx, python-dotenv, tenacity are for OpenCode agent's
separate environment. Keep peft and pydantic as shared deps.

Made-with: Cursor
- run.py: replace two nested 3-level try/except blocks with a shared
  _kill_process_group() using a loop + specific exceptions
- server.py: except Exception → except (RuntimeError, ValueError, OSError)
- utils.py: except Exception → except requests.ConnectionError

Made-with: Cursor
Extract from run.py into core/utils.py so other runners
can also use it. Exported via core/__init__.py.

Made-with: Cursor
Use relative paths, forbid cd outside workspace, ignore symlink targets.

Made-with: Cursor
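The workspace-isolation rules above (relative paths only, no escaping the workspace, symlink targets checked) can be sketched as a small validator — the function name `is_inside_workspace` and the exact rejection rules are hypothetical, not the repository's actual implementation:

```python
# Hypothetical sketch of a workspace-isolation check: accept only relative
# paths that, after resolving symlinks, stay inside the workspace root.
from pathlib import Path

def is_inside_workspace(workspace: Path, candidate: str) -> bool:
    p = Path(candidate)
    if p.is_absolute():
        return False  # rule: relative paths only
    resolved = (workspace / p).resolve()  # follows symlinks and collapses ".."
    try:
        resolved.relative_to(workspace.resolve())
    except ValueError:
        return False  # escapes via ".." or a symlink target outside the root
    return True
```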
…CLI, remove unsupported args

Made-with: Cursor
Ensures OpenCode-FSM-Runner writes outputs into the workspace prepared
by AutoRL-Bench instead of creating its own runs/ directory.

Made-with: Cursor
Ensures LLM agent bash calls (e.g. python3 -c "from trl import ...")
resolve to the correct training environment, instead of relying on
parent shell conda activation.

Made-with: Cursor
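One way the override described above might work is to prepend the training interpreter's bin directory to PATH before invoking the agent's bash command, so `python3` resolves to the training environment rather than whatever the parent shell activated. A sketch under that assumption — `run_agent_bash` is a hypothetical name:

```python
# Hypothetical sketch: route agent bash calls through the training environment
# by putting the TRAINING_PYTHON interpreter's bin/ first on PATH.
import os
import subprocess

def run_agent_bash(cmd: str) -> subprocess.CompletedProcess:
    env = os.environ.copy()
    training_python = env.get("TRAINING_PYTHON")  # e.g. /envs/train/bin/python
    if training_python:
        bin_dir = os.path.dirname(training_python)
        env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    return subprocess.run(
        ["bash", "-c", cmd], env=env, capture_output=True, text=True
    )
```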
…nfig

- Resolve dataset variable names via importlib before generating config,
  so the template uses `from xxx import datasets` instead of `import *`
- Remove the fragile runtime cleanup hack that set leaked modules to None
- Increase OpenCompass timeout from 3600s to 7200s
- Fix score parsing to average across multiple subdatasets
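The importlib-based resolution in the first bullet could look roughly like this: import the OpenCompass config module, confirm it actually defines a `datasets` variable, and emit an explicit import line instead of a wildcard. `resolve_dataset_import` is a hypothetical name for illustration:

```python
# Hypothetical sketch: verify a config module defines `datasets` before
# generating `from <module> import datasets` (instead of `import *`).
import importlib

def resolve_dataset_import(module_path: str) -> str:
    """Return an explicit import line if the config module defines `datasets`."""
    mod = importlib.import_module(module_path)
    if not hasattr(mod, "datasets"):
        raise ValueError(f"{module_path} does not define `datasets`")
    return f"from {module_path} import datasets"
```

Resolving the name ahead of time removes the need for the runtime cleanup hack that set leaked wildcard-imported modules to None.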
human-eval package requires clone from open-compass/human-eval with
a one-line patch to relax assertion for partial evaluation (test split only).

Made-with: Cursor
Merge latest main branch into rl-posttraining. Resolved 46 both-added
conflicts by accepting main's version — none of the conflicting files
are under rdagent/scenarios/rl/.

## RL post-training changes (from rl-posttraining branch)

feat: AutoRL-Bench — an evaluation framework for RL post-training agents

- feat: add AutoRL-Bench core framework (run.py, evaluator, server, utils)
- feat: add OpenCompass-based evaluator for code/math benchmarks
- feat: add GSM8K benchmark (data download, OpenCompass eval)
- feat: add HumanEval benchmark (data split, OpenCompass eval, dependency instructions)
- feat: add ALFWorld benchmark (rollout-based eval with game environments)
- feat: add Smith benchmark discovery and per-sample evaluator
- feat: register OpenHands agent into autorl_bench framework
- feat: register OpenCode agent into autorl_bench framework
- feat: parallel experiment execution with multi-port evaluation servers
- feat: baseline caching to avoid redundant evaluation
- feat: workspace isolation rules (relative paths, no cd outside workspace)
- feat: TRAINING_PYTHON env var for separate agent/training environments
- feat: Streamlit UI for experiment monitoring (ui.py)
- docs: add study.md (run guide) and stratup.md (deployment guide)
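The baseline-caching feature above can be sketched as a score cache keyed by benchmark and model identifier, so an unchanged baseline is evaluated only once. `cached_eval` and the on-disk JSON layout are assumptions, not the actual implementation:

```python
# Hypothetical sketch of baseline caching: persist scores keyed by
# (benchmark, model id) so repeated runs skip the expensive evaluation.
import hashlib
import json
from pathlib import Path

def cached_eval(cache_dir: Path, benchmark: str, model_id: str, evaluate) -> float:
    key = hashlib.sha256(f"{benchmark}:{model_id}".encode()).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["score"]  # cache hit
    score = evaluate()  # expensive baseline evaluation (e.g. OpenCompass run)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({"benchmark": benchmark, "model": model_id, "score": score})
    )
    return score
```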

## Key changes from main

- feat: add LLM-finetune scenario (#1314)
- fix: preserve null end_time in dataset segments template (#1326)
- fix: prevent calendar index overflow (#1324)
- refactor: unify qlib experiment configs and templates (#1320)
- fix: parse package names safely from requirements (#1313)

## Merge conflicts (46 files, all resolved with main's version)

- root: .gitignore, Makefile
- rdagent/core/: evaluation, evolving_agent, evolving_framework,
  exception, experiment, proposal, scenario
- rdagent/oai/backend/: base, litellm
- rdagent/log/: storage, ui/{app,ds_trace,web}
- rdagent/components/coder/: CoSTEER/evaluators, finetune/{conf,eval,prompts,unified_validator}
- rdagent/scenarios/finetune/: benchmark/, datasets/, dev/, loop, proposal/, train/
- rdagent/scenarios/qlib/: docker/, experiment/, factor_experiment_loader/
- rdagent/scenarios/data_science/: loop, proposal/
- rdagent/app/: finetune/, qlib_rd_loop/
- rdagent/utils/: env, workflow/loop
- test/: oai/, utils/
        return jsonify(result)
    except (RuntimeError, ValueError, OSError) as e:
        logger.error(f"[SUBMIT] Error: {e}")
        return jsonify({"error": str(e)}), 500

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information flows to this location and may be exposed to an external user.

Copilot Autofix (AI, about 3 hours ago)

In general, the fix is to avoid returning the raw exception message from the server to the client. Instead, log the detailed error (possibly including the stack trace) on the server side and respond with a generic error message that does not reveal internal details.

The best targeted fix here is to change the except block in submit() so that:

  • The server logs the detailed exception (and, if desired, stack trace) using the existing logger.
  • The HTTP response uses a generic, localized error message, without including str(e).

Because we must not change imports except to add well-known libraries and we already have a logger, we can simply modify the returned JSON. A minimal change is to replace {"error": str(e)} with a fixed message such as {"error": "Internal server error during submission"} (or similar). If more diagnostics are needed for clients, we can also include a non-sensitive error code like {"error": "internal_error", "message": "Submission failed due to internal error"}. This keeps functionality the same from a control-flow perspective while removing sensitive information leakage. All of this happens in rdagent/scenarios/rl/autorl_bench/core/server.py within the submit function’s except block.

Suggested changeset 1
rdagent/scenarios/rl/autorl_bench/core/server.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/rdagent/scenarios/rl/autorl_bench/core/server.py b/rdagent/scenarios/rl/autorl_bench/core/server.py
--- a/rdagent/scenarios/rl/autorl_bench/core/server.py
+++ b/rdagent/scenarios/rl/autorl_bench/core/server.py
@@ -210,7 +210,7 @@
         return jsonify(result)
     except (RuntimeError, ValueError, OSError) as e:
         logger.error(f"[SUBMIT] Error: {e}")
-        return jsonify({"error": str(e)}), 500
+        return jsonify({"error": "Internal server error during submission"}), 500
 
 
 @app.route("/health", methods=["GET"])
EOF
        return safe_root
    try:
        normalized = os.path.normpath(user_input)
        path_obj = Path(normalized).expanduser()

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix (AI, about 3 hours ago)

General fix: further validate the user-provided path string before converting it into a Path, and ensure that paths containing traversal sequences (..), absolute path indicators, or other clearly unsafe content are rejected early. This keeps the logic of constraining everything under safe_root while making the taint flow obviously safe to CodeQL and to human reviewers.

Best concrete fix here:

  • Extend _safe_resolve to:
    • Reject path strings containing NUL bytes (which can cause surprising behavior in some filesystem APIs).
    • Split the normalized path into components and reject any component equal to "..".
    • Avoid constructing Path from unfiltered user input; instead, build the path by iterating over allowed components using Path() / .joinpath(...).
  • Keep existing checks that reject absolute paths and ensure the final resolved path is under safe_root.

Specific changes in rdagent/app/rl/ui/app.py:

  • Modify _safe_resolve (lines 23–36) to:
    • Normalize the input string.
    • Inspect the components of the normalized path (using Path(normalized).parts).
    • Reject ".." components before forming the path relative to safe_root.
    • Build path_obj from the filtered parts (rather than directly from the user string).
    • Keep the absolute path check and relative_to(safe_root) guard.

No new imports are strictly necessary; we can rely on the existing os and Path imports.


Suggested changeset 1
rdagent/app/rl/ui/app.py

Autofix patch

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/rdagent/app/rl/ui/app.py b/rdagent/app/rl/ui/app.py
--- a/rdagent/app/rl/ui/app.py
+++ b/rdagent/app/rl/ui/app.py
@@ -25,13 +25,28 @@
     if not user_input:
         return safe_root
     try:
+        # Normalize input to collapse any ".." segments, etc.
         normalized = os.path.normpath(user_input)
-        path_obj = Path(normalized).expanduser()
-        if path_obj.is_absolute():
+        # Basic sanity checks on the raw string
+        if "\x00" in normalized:
+            raise ValueError("Null bytes are not allowed in paths")
+        # Build the path from safe components only
+        candidate = safe_root
+        for part in Path(normalized).parts:
+            # Reject explicit parent-directory traversal
+            if part == "..":
+                raise ValueError("Path traversal is not allowed")
+            # Skip current-directory references
+            if part in ("", "."):
+                continue
+            candidate = candidate / part
+        # Disallow absolute paths even after normalization
+        if Path(normalized).is_absolute():
             raise ValueError("Absolute paths are not allowed")
-        candidate = (safe_root / path_obj).resolve(strict=False)
-        candidate.relative_to(safe_root)
-        return candidate
+        # Resolve without requiring the path to exist, then ensure it is under safe_root
+        resolved = candidate.resolve(strict=False)
+        resolved.relative_to(safe_root)
+        return resolved
     except (OSError, ValueError) as exc:
         raise ValueError(f"Invalid path outside of allowed root: {user_input}") from exc
 
EOF
