[serve][6/n] Gang scheduling -- autoscaling#61467
[serve][6/n] Gang scheduling -- autoscaling#61467jeffreywang-anyscale wants to merge 1 commit intomasterfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request enables autoscaling for gang-scheduled deployments, a significant feature enhancement. The changes correctly remove the previous restriction and introduce logic to align replica counts with gang_size during scaling operations, ensuring gang boundaries are respected. The implementation is clean, with rounding up for scale-ups and rounding down for scale-downs. Schema validations are also updated to enforce that autoscaling bounds (min_replicas, max_replicas, initial_replicas) are multiples of gang_size. The addition of comprehensive unit and integration tests is excellent and covers various scenarios, including unaligned scaling targets. Overall, this is a high-quality contribution. I have one minor suggestion to improve a comment for clarity.
Note: Security Review did not run due to the size of the PR.
| and isinstance(new_deployment_config.num_replicas, int) | ||
| and new_deployment_config.autoscaling_config is None | ||
| ): | ||
| # When autoscaling is enabled, num_replicas defaults to 1 |
There was a problem hiding this comment.
The comment on this line is a bit confusing. It says "When autoscaling is enabled...", but this code block is executed only when new_deployment_config.autoscaling_config is None (i.e., when autoscaling is disabled). This might confuse future readers.
A clearer comment could explain that this check is for fixed-replica deployments.
| # When autoscaling is enabled, num_replicas defaults to 1 | |
| # For fixed-replica deployments, num_replicas must be a multiple of gang_size. |
| autoscaling_config={"min_replicas": 4, "max_replicas": 8}, | ||
| ) | ||
| assert f2._deployment_config.autoscaling_config is not None | ||
| assert f._deployment_config.gang_scheduling_config.gang_size == 4 |
There was a problem hiding this comment.
Test asserts on wrong variable, missing coverage for f2
Medium Severity
In test_gang_scheduling_config_auto_num_replicas_via_options, line 848 asserts f._deployment_config.gang_scheduling_config.gang_size == 4 on the original deployment f instead of the new deployment f2 returned by .options(). The test is meant to verify that gang_scheduling_config is preserved through .options(), but checking the original f always passes trivially. If .options() ever regressed and dropped the gang config from the copy, this test would not catch it.
| raise ValueError( | ||
| f"num_replicas ({num_replicas}) must be a multiple of " | ||
| f"gang_size ({v.gang_size})." | ||
| f"num_replicas ({values['num_replicas']}) must be a multiple of gang_size ({v.gang_size})." |
There was a problem hiding this comment.
Missing autoscaling bounds gang_size validation in Python API
Medium Severity
The validate_gang_scheduling_config validator in DeploymentConfig skips all checks when autoscaling_config is present but never validates that min_replicas, max_replicas, and initial_replicas are multiples of gang_size. This validation only exists in schema.py (REST API path), so users of the Python API (@serve.deployment() or .options()) can create misaligned configurations. When apply_bounds later clips to non-gang-aligned bounds, the resulting replica count won't be a multiple of gang_size, defeating gang scheduling.
Allow num_replicas="auto" with gang_scheduling_config. The autoscaler's apply_bounds() now aligns replica counts to gang_size multiples: ceil for upscale, floor for downscale. Schema validation ensures min/max/initial replicas are multiples of gang_size. Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
e588662 to
200f128
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| raise ValueError( | ||
| f"autoscaling_config.{field} ({val}) must be a " | ||
| f"multiple of gang_size ({gang_config.gang_size})." | ||
| ) |
There was a problem hiding this comment.
Schema validation misses default autoscaling bound values
Medium Severity
The gang-size validation in schema.py uses autoscaling_config.get(field) on the raw user-provided dict, which returns None for fields the user didn't explicitly set. Default values like min_replicas=1 and max_replicas=1 from AutoscalingConfig are never checked against gang_size. A user providing autoscaling_config={"target_ongoing_requests": 5} with gang_size=4 would bypass validation entirely, yet the effective defaults (1, 1) aren't multiples of 4.


Description
Adds gang-aware autoscaling: Enable
num_replicas="auto"with gang scheduling so deployments can autoscale while respecting gang boundaries.Approach
apply_bounds()now aligns replica counts togang_sizemultiples: rounds up when scaling up (to ensure enough capacity) and rounds down when scaling down (to release only complete gangs).min_replicas,max_replicas, andinitial_replicasare multiples ofgang_size.Test Plan
Unit Tests
TestDeploymentSchema ::test_gang_scheduling_config_auto_replicas_acceptednum_replicas="auto"accepted with valid autoscaling boundsTestDeploymentSchema ::test_gang_scheduling_config_auto_replicas_invalid_boundsmin_replicas,max_replicas,initial_replicasnot divisible bygang_sizeTestGangSchedulingConfig ::test_gang_scheduling_config_auto_num_replicasnum_replicas="auto"accepted via@serve.deploymentTestGangSchedulingConfig ::test_gang_scheduling_config_auto_num_replicas_via_optionsnum_replicas="auto"accepted via.options()TestGangSchedulingAutoscaling ::test_scale_up_rounds_upgang_sizemultipleTestGangSchedulingAutoscaling ::test_scale_down_rounds_downgang_sizemultipleTestGangSchedulingAutoscaling ::test_scale_down_respects_minmin_replicasTestGangSchedulingAutoscaling ::test_scale_up_respects_maxmax_replicasTestGangSchedulingAutoscaling ::test_no_gangIntegration Tests
TestGangScaling ::test_gang_autoscalinggang_sizeTestGangScaling ::test_gang_autoscaling_unaligned_upscaleTestGangScaling ::test_gang_autoscaling_unaligned_downscaleRelated issues
RFC: #60873
Precedent: #61215