[Serve] Add RAY_SERVE_HAPROXY_TCP_NODELAY env var #61468
Open
eicherseiji wants to merge 5 commits into ray-project:master from
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
'http-request set-nodelay' / 'http-response set-nodelay' are not valid actions in HAProxy 2.8.x. Replace with 'option http-no-delay' which has been available since HAProxy 1.5+ and forces TCP_NODELAY on every outgoing segment for both client and server connections. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Contributor
Code Review
This pull request introduces the RAY_SERVE_HAPROXY_DISABLE_NAGLE environment variable to disable Nagle's algorithm in HAProxy, which can improve latency for certain workloads. The implementation correctly adds the necessary HAProxy directives. My feedback includes a couple of suggestions to enhance code consistency and simplify the HAProxy configuration.
Comment on lines +674 to +676
    RAY_SERVE_HAPROXY_DISABLE_NAGLE = (
        os.environ.get("RAY_SERVE_HAPROXY_DISABLE_NAGLE", "0") == "1"
    )
Contributor
For consistency with other boolean environment variables in this file (e.g., RAY_SERVE_LOG_TO_STDERR), it would be better to use the get_env_bool utility function. This also makes the check more robust by handling values like "true" in addition to "1".
    RAY_SERVE_HAPROXY_DISABLE_NAGLE = get_env_bool("RAY_SERVE_HAPROXY_DISABLE_NAGLE", "0")
Positive framing using the canonical socket option name. More grep-friendly and avoids the double-negative of "disable Nagle = 1". Signed-off-by: Seiji Eicher <seiji@anyscale.com>
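To illustrate the robustness point in the review comment above, here is a minimal sketch comparing a strict `== "1"` check with a more lenient parser. The helper `parse_bool_env` and its set of accepted values are hypothetical and are not Ray's `get_env_bool`, whose exact behavior is not shown in this thread.

```python
import os

# Hypothetical helper for illustration only; NOT Ray's get_env_bool.
# The accepted spellings below are an assumption.
_TRUTHY = {"1", "true"}


def parse_bool_env(name: str, default: str = "0") -> bool:
    """Read an env var and treat "1" or "true" (any case) as True."""
    raw = os.environ.get(name, default)
    return raw.strip().lower() in _TRUTHY


# A user sets the flag with "true" instead of "1".
os.environ["RAY_SERVE_HAPROXY_TCP_NODELAY"] = "true"

# Strict comparison silently treats "true" as disabled.
strict = os.environ.get("RAY_SERVE_HAPROXY_TCP_NODELAY", "0") == "1"

# The lenient parser accepts it.
lenient = parse_bool_env("RAY_SERVE_HAPROXY_TCP_NODELAY")
```

With the strict check, a value such as "true" leaves the feature off without any warning, which is the failure mode the reviewer is pointing at.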
76ff873 to 9587bc5
Verifies that when tcp_nodelay=True, the generated config contains 'option http-no-delay' and HAProxy starts successfully. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
a8ad9b5 to 9ac2709
abrarsheikh approved these changes Mar 4, 2026
Summary
Adds a RAY_SERVE_HAPROXY_TCP_NODELAY environment variable (default 0) that, when set to 1, adds 'option http-no-delay' to the defaults section of the HAProxy config. This sets TCP_NODELAY on both client and server connections, disabling Nagle's algorithm.
Why this matters
HAProxy's default behavior leaves Nagle's algorithm enabled, which buffers small writes before flushing to the wire. For LLM serving workloads where responses are streamed token-by-token as small SSE frames, this introduces artificial latency — the kernel holds back each token waiting for more data to coalesce, directly inflating TTFT and tail ITL.
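For readers unfamiliar with the socket option itself, the following is a small sketch of what setting TCP_NODELAY looks like at the socket level, using Python's standard `socket` module. It only illustrates the option that disables Nagle's algorithm; it is not how HAProxy applies it internally, and the function name `make_nodelay_socket` is made up for this example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm the kernel accepted it.
enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

For a stream of small SSE frames, each write on such a socket can go out as its own segment rather than waiting behind unacknowledged data.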
Benchmark results
Setup: Qwen2.5-0.5B-Instruct, 256 requests, max concurrency 32, prefix-heavy prompt (640 input tokens/req)
Nagle causes a ~3x mean TTFT regression and a ~5x P99 ITL regression with no meaningful throughput benefit. The effect is likely to be more pronounced at lower concurrency and in interactive use cases.
Code change
The implementation is a one-liner in the Jinja template, gated behind an env var:
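As a rough sketch of that gating, written in plain Python rather than Jinja: the function `render_defaults` and the minimal `defaults` section it emits are hypothetical stand-ins, not the actual Ray Serve template. Only the directive 'option http-no-delay' and the env var name come from this PR.

```python
import os


def render_defaults(tcp_nodelay: bool) -> str:
    """Render a simplified HAProxy defaults section.

    Hypothetical stand-in for the real template; the real config
    contains many more directives.
    """
    lines = ["defaults", "    mode http"]
    if tcp_nodelay:
        # Emitted only when the flag is on, mirroring the env-var gate.
        lines.append("    option http-no-delay")
    return "\n".join(lines) + "\n"


# Default is "0", so the directive is omitted unless explicitly enabled.
tcp_nodelay = os.environ.get("RAY_SERVE_HAPROXY_TCP_NODELAY", "0") == "1"
config = render_defaults(tcp_nodelay)
```

Keeping the directive behind a flag that defaults to off means existing deployments render the same config as before unless they opt in.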
Test plan
haproxy.cfg contains 'option http-no-delay' when env var is set
test_controller_haproxy.py
test_start_with_tcp_nodelay: verifies HAProxy starts successfully with tcp_nodelay=True and config contains the directive