Fix client-disconnect session leaks in PyTorch MP engine#4655
Conversation
There was a problem hiding this comment.
Pull request overview
This PR targets leaked backend sessions and stale session mappings when API clients disconnect (or cancel) during PyTorch MP serving, by making MP streaming startup cancellation-safe and making serve-side cleanup idempotent.
Changes:
- Add a stream-startup barrier (
SessionState.init_done) and make MP streaming startup robust to cancellation (ZMQ/Ray), including abandoned stream cleanup and backend-death wakeups. - Make session cleanup idempotent across engine/generator/API wrapper paths, and tighten
AsyncEngine.generate()session removal behavior under cancellations/errors. - Fix TP-local Q/KV head metadata usage for FlashAttention/FlashMLA and correct KV metadata handling for last-chunk spec-decode input rewriting; add regression tests.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_lmdeploy/serve/test_session_cleanup.py | New regression tests for idempotent session-map cleanup across normal exit/cancel paths. |
| tests/pytorch/spec_decode/test_spec_agent.py | Update regression test to ensure long-context KV metadata stays aligned after input rewriting. |
| tests/pytorch/paging/test_state_manager.py | New tests for reserved state-cache row and num_state_caches=None behavior. |
| tests/pytorch/engine/test_zmq_rpc.py | Add tests for stream-start barrier, cancel safety, backend-death wakeups, and idempotent stream output handling. |
| tests/pytorch/engine/test_ray_mp_engine.py | Add tests for Ray stream startup cancellation and idempotent result retrieval after drop. |
| tests/pytorch/config/test_model_config.py | New tests for TP-local head count computation and dist_config preservation. |
| lmdeploy/serve/openai/api_server.py | Add streaming/non-streaming request wrapper that ensures generator/session cleanup on disconnect/cancel. |
| lmdeploy/serve/managers/session_manager.py | Add request-exit-driven session removal and make SessionManager.remove() idempotent/stale-safe. |
| lmdeploy/serve/core/async_engine.py | Make safe_run cancellation-safe; ensure session removal happens consistently on prompt/cancel/error paths. |
| lmdeploy/pytorch/spec_decode/spec_agent.py | Keep aggregate KV metadata unchanged for last-chunk input rewriting. |
| lmdeploy/pytorch/paging/state_manager.py | Ensure reserved state row is excluded from allocatable IDs; handle num_state_caches=None. |
| lmdeploy/pytorch/models/utils/cudagraph.py | Use TP-local head counts for FlashAttention/FlashMLA metadata. |
| lmdeploy/pytorch/envs.py | Add LMDEPLOY_FAKE_CUDA_GRAPH_CAPTURE env flag. |
| lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py | Add backend-death handling, cancellation-safe stream startup, abandoned stream drop, and streaming startup barrier plumbing. |
| lmdeploy/pytorch/engine/mp_engine/zmq_engine.py | Wire backend liveness callbacks/sentinel; make port-wait robust to early backend exit. |
| lmdeploy/pytorch/engine/mp_engine/ray_engine.py | Add cancellation-safe Ray stream startup and abandoned-stream drop support. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Replace is_exists with init_done barrier; make async_end() wait for startup completion. |
| lmdeploy/pytorch/engine/mp_engine/base_worker.py | Add EngineOutputGather.discard() to drop abandoned stream buffers. |
| lmdeploy/pytorch/config.py | Add dist_config to ModelConfig; implement get_num_qkv_head_by_tp() and preserve dist_config in from_hf_config(). |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Use TP-local head counts when building FlashAttention/FlashMLA metadata. |
| lmdeploy/pytorch/backends/cuda/graph_runner.py | Add “fake capture” path to bypass actual CUDA graph capture for debugging/padding behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
9de66db to
38c9bc7
Compare
| except (asyncio.CancelledError, GeneratorExit): | ||
| remove_session_once() | ||
| raise |
There was a problem hiding this comment.
Can it be merged to except Exception branch for the sake of metrics_processor
There was a problem hiding this comment.
Good point. I kept the CancelledError/GeneratorExit branch separate from the generic Exception branch so cancellation still propagates correctly instead of being converted into an error GenOut.
But the metrics concern is valid: generate() has already incremented num_total_reqs, so prompt-processing cancellation should also update failed metrics. I added metrics_processor.increase_failed_requests('cancel') before cleanup/re-raise, and added a unit test to verify num_total_reqs, num_cancelled_reqs, and num_uncompleted_reqs stay balanced after prompt cancellation.
Powered by codex
# Conflicts: # lmdeploy/serve/anthropic/endpoints/messages.py # tests/test_lmdeploy/serve/anthropic/test_endpoints.py
Summary
Fix session leaks when API clients disconnect or Ctrl+C during PyTorch MP serving.
This patch makes MP stream startup cancellation-safe and makes serve-side terminal session cleanup idempotent, so dropped requests no longer leave backend sessions or API session mappings alive forever.
Bug to fix
Changes
SessionState.init_done.async_end()waits until backendADD_MESSAGEhas been enqueued before sendingEND_SESSION.INFO; real exception paths still log tracebacks.Requirement