Description
When using the in-process Python API with a model served by the Python backend, `Server.stop()` deadlocks: the core enters the unload loop and logs `<model> v<ver>: UNLOADING` once per second until `exit_timeout` elapses, at which point it force-terminates the stub and raises `tritonserver.InternalError: Exit timeout expired. Exiting immediately.`
During the entire unload-polling window the stub process is alive and idle, and instrumentation shows it never receives the finalize request: our `finalize()` log fires only after the server issues the non-graceful termination. This looks like the same class of deadlock addressed by core#381, but on the Python-backend model-unload path rather than the request-release callback path.
`model.infer(...)` itself works correctly; the forward pass returns the expected output. The bug is strictly in the graceful-teardown path of `Server.stop()`.
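For orientation, here is the failing sequence in miniature (a sketch only; `repo` is the model-repository path built in the runnable reproducer under "To Reproduce" below):

```python
import numpy as np
import tritonserver

# Sketch of the failing sequence; repo and "simple" come from the reproducer below.
srv = tritonserver.Server(model_repository=repo, exit_timeout=5).start(wait_until_ready=True)
for r in srv.model("simple").infer(inputs={"INPUT": np.ones(4, dtype=np.float32)}):
    pass  # inference succeeds
srv.stop()  # enters the UNLOADING poll loop, raises InternalError after exit_timeout
```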
Triton Information
- Triton version: 2.67.0
- Image: `nvcr.io/nvidia/tritonserver:26.03-py3` (official NGC container, not a custom build)
- Python: 3.12.3 (system Python inside the image)
- Host: GCE `g2-standard-4` (1× NVIDIA L4, 16 GB RAM, Ubuntu 24.04 in the container)
To Reproduce
Minimal reproducer: it creates a single Python-backend model whose `execute` does a small numpy matmul, and whose `finalize` logs a marker so we can tell whether it was ever invoked.
```python
import os, tempfile, textwrap, time

import numpy as np
import tritonserver

CFG = textwrap.dedent("""
    name: "simple"
    backend: "python"
    max_batch_size: 0
    input [ { name: "INPUT" data_type: TYPE_FP32 dims: [ 4 ] } ]
    output [ { name: "OUTPUT" data_type: TYPE_FP32 dims: [ 4 ] } ]
    instance_group [ { kind: KIND_CPU } ]
""").strip()

MDL = textwrap.dedent("""
    import numpy as np, triton_python_backend_utils as pb_utils
    class TritonPythonModel:
        def initialize(self, args):
            self.W = np.random.default_rng(42).standard_normal((4,4)).astype(np.float32)
        def execute(self, requests):
            return [pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT",
                    (pb_utils.get_input_tensor_by_name(r,"INPUT").as_numpy() @ self.W).astype(np.float32))
            ]) for r in requests]
        def finalize(self):
            pb_utils.Logger.log_info("FINALIZE_ENTERED")
""").strip()

with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "simple", "1"))
    open(os.path.join(repo, "simple", "config.pbtxt"), "w").write(CFG)
    open(os.path.join(repo, "simple", "1", "model.py"), "w").write(MDL)

    srv = tritonserver.Server(
        model_repository=repo,
        log_info=True, log_verbose=1,
        exit_timeout=5,
    ).start(wait_until_ready=True)

    x = np.array([1, 2, 3, 4], dtype=np.float32)
    for r in srv.model("simple").infer(inputs={"INPUT": x}):
        print("infer ok:", np.from_dlpack(r.outputs["OUTPUT"]))

    t0 = time.time()
    try:
        srv.stop()
        print(f"stop OK in {time.time()-t0:.2f}s")
    except Exception as e:
        print(f"stop raised after {time.time()-t0:.2f}s: {type(e).__name__}: {e}")
```
Observed output
stdout from the reproducer interleaved with the relevant Triton log lines (`log_verbose=1`, as set in the reproducer; timestamps kept so the 5-second stall is visible; unrelated startup banners elided):
```
I0422 18:39:46.027531 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027561 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027610 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0422 18:39:46.063623 7399 infer_response.cc:193] "add response output: output: OUTPUT, type: FP32, shape: [4]"
I0422 18:39:46.063851 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0422 18:39:46.063873 7399 python_be.cc:2662] "TRITONBACKEND_ModelInstanceExecute: model instance name simple_0_0 released 1 requests"
infer ok: [-3.3836339 -1.6945097 5.5143633 -0.7957144]
==> stub PID: 7479
I0422 18:39:46.271175 7399 server.cc:312] "Waiting for in-flight requests to complete."
I0422 18:39:46.271221 7399 server.cc:328] "Timeout 5: Found 0 model versions that have in-flight inferences"
I0422 18:39:46.271286 7399 server.cc:343] "All models are stopped, unloading models"
I0422 18:39:46.271302 7399 server.cc:352] "Timeout 5: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:46.271308 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:47.271454 7399 server.cc:352] "Timeout 4: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:47.271461 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:48.271760 7399 server.cc:352] "Timeout 3: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:48.271771 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:49.271893 7399 server.cc:352] "Timeout 2: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:49.271900 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:50.272023 7399 server.cc:352] "Timeout 1: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:50.272028 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:51.272157 7399 server.cc:352] "Timeout 0: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:51.272164 7399 server.cc:358] "simple v1: UNLOADING"
==> stop raised after 5.00s: InternalError: Exit timeout expired. Exiting immediately.
I0422 18:39:53.509387 7479 pb_stub.cc:2177] Non-graceful termination detected.
I0422 18:39:53.549340 7479 model.py:11] "FINALIZE_ENTERED"
```
Additional evidence
- The Python backend stub process remains alive throughout the entire UNLOADING poll loop; it is not stuck inside user code but idle, waiting for a message from the main process.
- Calling `Server.stop()` from a separate thread (so the main thread can release the GIL periodically) does not help; the stub still does not receive a finalize request until the force-termination path runs. See the sketch after this list.
- Setting `model_control_mode = ModelControlMode.EXPLICIT` (vs. the default `NONE`) does not help either; same UNLOADING poll loop.
- Increasing `exit_timeout` only lengthens the wait before the force-kill; `finalize()` still does not run until it triggers.
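For completeness, a minimal sketch of the two variants from the list above (reusing `repo` from the reproducer; we assume `Server.load()` is the way to load a model under explicit control mode):

```python
import threading

import tritonserver
from tritonserver import ModelControlMode

# Variant 1: explicit model control (assumes Server.load() for loading).
srv = tritonserver.Server(
    model_repository=repo,  # repo as in the reproducer
    model_control_mode=ModelControlMode.EXPLICIT,
    exit_timeout=5,
).start(wait_until_ready=True)
srv.load("simple")

# Variant 2: stop() from a worker thread; the main thread stays free
# (and keeps releasing the GIL), yet the stub still never sees finalize.
stopper = threading.Thread(target=srv.stop)  # InternalError surfaces inside the thread
stopper.start()
stopper.join()  # returns only after exit_timeout elapses and the force-kill runs
```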
Expected behavior
`Server.stop()` should:
- Invoke `finalize()` on each loaded Python-backend model within a bounded time window.
- Reap the backend stub processes gracefully.
- Return without raising `InternalError: Exit timeout expired.` in the common case.
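Concretely, with the reproducer above we would expect the following to hold (a hedged sketch, reusing `srv` and `time` from the script):

```python
t0 = time.time()
srv.stop()                     # expected: no InternalError
assert time.time() - t0 < 5.0  # returns well inside exit_timeout
# and "FINALIZE_ENTERED" should appear in the log *before* stop() returns,
# emitted by the graceful unload rather than by the forced stub termination
```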