[tune] Fix _schedule_trial_pause caching STOP instead of PAUSE when trial is saving #61448

Open
sjp611 wants to merge 1 commit into ray-project:master from sjp611:fix/tune-pause-cached-stop-57906

sjp611 (Contributor) commented Mar 3, 2026

Description

When synchronous PBT (`synch=True`) calls `pause_trial(trial, should_checkpoint=False)` on a trial that is currently saving, the following happens:

  1. _schedule_trial_pause calls _schedule_trial_stop internally
  2. _schedule_trial_stop sees is_saving=True → caches a STOP decision
  3. When save completes, the cached STOP is executed → trial terminated

| | Before (bug) | After (fix) |
|---|---|---|
| Cached decision | STOP | PAUSE |
| Trial outcome after save | Terminated | Paused |

This fix overrides the cached decision from STOP to PAUSE after _schedule_trial_stop returns, so that the trial is correctly paused when the save completes.
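
The sequence above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ToyTrial`, `ToyController`, `on_save_complete`, and the `apply_fix` flag are hypothetical names invented for illustration. They do not reproduce the actual `TuneController` code or the diff in this PR; they only mimic the caching behavior described here.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the scheduler decisions named in this PR.
STOP = "STOP"
PAUSE = "PAUSE"


@dataclass
class ToyTrial:
    trial_id: str
    is_saving: bool = False
    status: str = "RUNNING"


@dataclass
class ToyController:
    """Toy model of the decision cache (not Ray's real controller)."""

    apply_fix: bool
    cached_decisions: dict = field(default_factory=dict)

    def schedule_trial_stop(self, trial: ToyTrial) -> None:
        if trial.is_saving:
            # A save is in flight: defer by caching a STOP decision.
            self.cached_decisions[trial.trial_id] = STOP
            return
        trial.status = "TERMINATED"

    def schedule_trial_pause(self, trial: ToyTrial) -> None:
        # Models only the should_checkpoint=False path.
        was_saving = trial.is_saving
        self.schedule_trial_stop(trial)
        if not was_saving:
            # Actor stopped immediately; mark the trial as paused.
            trial.status = "PAUSED"
        elif self.apply_fix:
            # The fix: replace the cached STOP with PAUSE.
            self.cached_decisions[trial.trial_id] = PAUSE

    def on_save_complete(self, trial: ToyTrial) -> None:
        # Execute whatever decision was cached while the save was running.
        trial.is_saving = False
        decision = self.cached_decisions.pop(trial.trial_id, None)
        if decision == STOP:
            trial.status = "TERMINATED"
        elif decision == PAUSE:
            trial.status = "PAUSED"


def outcome_after_save(apply_fix: bool) -> str:
    controller = ToyController(apply_fix=apply_fix)
    trial = ToyTrial("t1", is_saving=True)
    controller.schedule_trial_pause(trial)
    controller.on_save_complete(trial)
    return trial.status


if __name__ == "__main__":
    print("before fix:", outcome_after_save(apply_fix=False))
    print("after fix: ", outcome_after_save(apply_fix=True))
```

In this model the only difference between the two runs is the decision left in the cache when the save finishes, which matches the before/after comparison in the description.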

Related issues

Closes #57906

…rial is saving

When _schedule_trial_pause(should_checkpoint=False) is called on a saving
trial, _schedule_trial_stop caches a STOP decision. Override it with PAUSE
so the trial is correctly paused when the save completes.

Closes ray-project#57906

Signed-off-by: Sung Joon Park <sjp611@gmail.com>
sjp611 requested a review from a team as a code owner on Mar 3, 2026, 13:21
ray-gardener bot added the tune (Tune-related issues) and community-contribution (Contributed by the community) labels on Mar 3, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an issue where trials were incorrectly terminated instead of paused. The scenario occurs when _schedule_trial_pause is called with should_checkpoint=False on a trial that is currently saving. The change in python/ray/tune/execution/tune_controller.py modifies _schedule_trial_pause to override the cached STOP decision with a PAUSE decision. A corresponding regression test has been added in python/ray/tune/tests/test_trial_scheduler_pbt.py to validate the fix.

Development

Successfully merging this pull request may close these issues.

[tune] Population based Training (PBT) always terminates last trial instead of pausing it

1 participant