[tune] Fix _schedule_trial_pause caching STOP instead of PAUSE when trial is saving #61448
Open
sjp611 wants to merge 1 commit into ray-project:master
…rial is saving
When _schedule_trial_pause(should_checkpoint=False) is called on a saving trial, _schedule_trial_stop caches a STOP decision. Override it with PAUSE so the trial is correctly paused when the save completes.
Closes ray-project#57906
Signed-off-by: Sung Joon Park <sjp611@gmail.com>
Code Review
This pull request addresses an issue where trials were incorrectly terminated instead of paused. The scenario occurs when _schedule_trial_pause is called with should_checkpoint=False on a trial that is currently saving. The change in python/ray/tune/execution/tune_controller.py modifies _schedule_trial_pause to override the cached STOP decision with a PAUSE decision. A corresponding regression test has been added in python/ray/tune/tests/test_trial_scheduler_pbt.py to validate the fix.
Description
When synchronous PBT calls pause_trial(trial, should_checkpoint=False) on a trial that is currently saving, the following happens:
1. _schedule_trial_pause calls _schedule_trial_stop internally.
2. _schedule_trial_stop sees is_saving=True → caches a STOP decision.
3. When the save completes, the cached STOP is executed → the trial is terminated, where a PAUSE was intended.
This fix overrides the cached decision from STOP to PAUSE after _schedule_trial_stop returns, so that the trial is correctly paused when the save completes.
Related issues
Closes #57906
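The sequence in the description can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the code in tune_controller.py: `ToyTrial`, `ToyController`, `cached_decisions`, `on_save_complete`, and the `apply_fix` switch are hypothetical names invented for illustration, and the statuses are plain strings. It shows why a pause request issued during a save ends in termination without the override, and in a pause with it.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stand-ins for the scheduler decisions named in the PR.
STOP = "STOP"
PAUSE = "PAUSE"


@dataclass
class ToyTrial:
    trial_id: str
    is_saving: bool = False
    status: str = "RUNNING"


@dataclass
class ToyController:
    # apply_fix toggles the override described in the PR (toy switch only).
    apply_fix: bool = True
    cached_decisions: Dict[str, str] = field(default_factory=dict)

    def schedule_trial_stop(self, trial: ToyTrial) -> None:
        if trial.is_saving:
            # A save is in flight: defer by caching a STOP decision.
            self.cached_decisions[trial.trial_id] = STOP
            return
        trial.status = "TERMINATED"

    def schedule_trial_pause(self, trial: ToyTrial) -> None:
        # Models only the should_checkpoint=False path.
        self.schedule_trial_stop(trial)
        if trial.is_saving:
            if self.apply_fix:
                # Override the STOP that schedule_trial_stop just cached.
                self.cached_decisions[trial.trial_id] = PAUSE
            return
        # Not saving: the toy simply overwrites the status with PAUSED.
        trial.status = "PAUSED"

    def on_save_complete(self, trial: ToyTrial) -> None:
        # Execute whatever decision was cached while the save was running.
        trial.is_saving = False
        decision = self.cached_decisions.pop(trial.trial_id, None)
        if decision == STOP:
            trial.status = "TERMINATED"
        elif decision == PAUSE:
            trial.status = "PAUSED"


def run_scenario(apply_fix: bool) -> str:
    controller = ToyController(apply_fix=apply_fix)
    trial = ToyTrial("t1", is_saving=True)
    controller.schedule_trial_pause(trial)
    controller.on_save_complete(trial)
    return trial.status


if __name__ == "__main__":
    print("without override:", run_scenario(apply_fix=False))
    print("with override:   ", run_scenario(apply_fix=True))
```

In this model the only difference between the two runs is which decision sits in the cache when the save finishes, which mirrors the one-line nature of the change described above.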