In-process Python API: Server.stop() hangs until exit_timeout and never invokes Python-backend finalize() #8755

@bilelomrani1

Description

When using the in-process Python API with a model served by the Python backend, Server.stop() deadlocks: the core enters the unload loop and repeatedly logs "<model> v<ver>: UNLOADING" every second until exit_timeout elapses, at which point it force-terminates the stub and raises tritonserver.InternalError: "Exit timeout expired. Exiting immediately."

During the entire unload-polling window, the stub process is alive and idle; instrumentation proves the stub never receives the finalize request: our finalize() log only fires after the server issues the non-graceful termination. This looks like the same class of deadlock addressed by core#381, but on the Python-backend model-unload path rather than the request-release callback path.
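
For completeness, the instrumentation behind the "==> stub PID" line in the output below is roughly the following sketch (it assumes psutil is installed in the container and that the stub binary name contains "triton_python_backend_stub"; it is not part of the minimal reproducer):

import psutil

def find_stub_process():
    # The Python backend stub runs as a child of the in-process server,
    # so it shows up among this process's children.
    for child in psutil.Process().children(recursive=True):
        if "triton_python_backend_stub" in child.name():
            return child
    return None

stub = find_stub_process()
print("==> stub PID:", stub.pid if stub else "not found")
# During the UNLOADING poll loop the stub stays in a sleeping state with no
# growing CPU time, i.e. it is idle rather than stuck in user code.
if stub is not None:
    print(stub.status(), stub.cpu_times())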

model.infer(...) itself works correctly; the forward pass returns the expected output. The bug is strictly in the graceful-teardown path of Server.stop().

Triton Information

  • Triton version: 2.67.0
  • Image: nvcr.io/nvidia/tritonserver:26.03-py3
  • Python: 3.12.3 (system Python inside the image)
  • Container: using the official NGC container, not a custom build.
  • Host: GCE g2-standard-4 (1× NVIDIA L4, 16 GB RAM, Ubuntu 24.04 in the container).

To Reproduce

Minimal reproducer: it creates a single Python-backend model whose execute() does a small NumPy matmul, and whose finalize() logs a marker so we can tell whether it was invoked.

import os, tempfile, textwrap, time
import numpy as np
import tritonserver

CFG = textwrap.dedent("""
    name: "simple"
    backend: "python"
    max_batch_size: 0
    input  [ { name: "INPUT"  data_type: TYPE_FP32 dims: [ 4 ] } ]
    output [ { name: "OUTPUT" data_type: TYPE_FP32 dims: [ 4 ] } ]
    instance_group [ { kind: KIND_CPU } ]
""").strip()

MDL = textwrap.dedent("""
    import numpy as np, triton_python_backend_utils as pb_utils
    class TritonPythonModel:
        def initialize(self, args):
            self.W = np.random.default_rng(42).standard_normal((4,4)).astype(np.float32)
        def execute(self, requests):
            return [pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT",
                    (pb_utils.get_input_tensor_by_name(r,"INPUT").as_numpy() @ self.W).astype(np.float32))
            ]) for r in requests]
        def finalize(self):
            pb_utils.Logger.log_info("FINALIZE_ENTERED")
""").strip()

with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "simple", "1"))
    with open(os.path.join(repo, "simple", "config.pbtxt"), "w") as f:
        f.write(CFG)
    with open(os.path.join(repo, "simple", "1", "model.py"), "w") as f:
        f.write(MDL)

    srv = tritonserver.Server(
        model_repository=repo,
        log_info=True, log_verbose=1,
        exit_timeout=5,
    ).start(wait_until_ready=True)

    x = np.array([1, 2, 3, 4], dtype=np.float32)
    for r in srv.model("simple").infer(inputs={"INPUT": x}):
        print("infer ok:", np.from_dlpack(r.outputs["OUTPUT"]))

    t0 = time.time()
    try:
        srv.stop()
        print(f"stop OK in {time.time()-t0:.2f}s")
    except Exception as e:
        print(f"stop raised after {time.time()-t0:.2f}s: {type(e).__name__}: {e}")

Observed output

stdout from the reproducer interleaved with the relevant verbose Triton log lines (timestamps kept so the 5-second stall is visible; unrelated Triton startup banners elided):

I0422 18:39:46.027531 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027561 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027610 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0422 18:39:46.063623 7399 infer_response.cc:193] "add response output: output: OUTPUT, type: FP32, shape: [4]"
I0422 18:39:46.063851 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0422 18:39:46.063873 7399 python_be.cc:2662] "TRITONBACKEND_ModelInstanceExecute: model instance name simple_0_0 released 1 requests"
infer ok: [-3.3836339 -1.6945097  5.5143633 -0.7957144]
==> stub PID: 7479
I0422 18:39:46.271175 7399 server.cc:312] "Waiting for in-flight requests to complete."
I0422 18:39:46.271221 7399 server.cc:328] "Timeout 5: Found 0 model versions that have in-flight inferences"
I0422 18:39:46.271286 7399 server.cc:343] "All models are stopped, unloading models"
I0422 18:39:46.271302 7399 server.cc:352] "Timeout 5: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:46.271308 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:47.271454 7399 server.cc:352] "Timeout 4: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:47.271461 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:48.271760 7399 server.cc:352] "Timeout 3: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:48.271771 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:49.271893 7399 server.cc:352] "Timeout 2: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:49.271900 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:50.272023 7399 server.cc:352] "Timeout 1: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:50.272028 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:51.272157 7399 server.cc:352] "Timeout 0: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:51.272164 7399 server.cc:358] "simple v1: UNLOADING"
==> stop raised after 5.00s: InternalError: Exit timeout expired. Exiting immediately.
I0422 18:39:53.509387 7479 pb_stub.cc:2177]  Non-graceful termination detected.
I0422 18:39:53.549340 7479 model.py:11]  "FINALIZE_ENTERED"

Additional evidence

  • The Python backend stub process remains alive throughout the entire UNLOADING poll loop; it is not stuck inside user code but idle, waiting for a message from the main process.
  • Calling Server.stop() from a separate thread (so the main thread can release the GIL periodically) does not help; the stub still does not receive a finalize request until the force-termination path runs (see the sketch after this list).
  • model_control_mode = ModelControlMode.EXPLICIT (vs. the default NONE) does not help either; the same UNLOADING poll loop occurs.
  • Increasing exit_timeout only increases the wait before the force-kill. finalize() still doesn't run until the force-kill triggers.
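
The threaded-stop attempt from the second bullet was, roughly, the following sketch (names are illustrative; it only moves Server.stop() off the main thread and changes nothing else):

import threading

def stop_in_background(srv):
    # Run Server.stop() on a worker thread so the main thread keeps running
    # (and periodically releases the GIL) while the shutdown is in progress.
    outcome = {}
    def _stop():
        try:
            srv.stop()
            outcome["ok"] = True
        except Exception as e:
            outcome["error"] = e
    worker = threading.Thread(target=_stop, daemon=True)
    worker.start()
    worker.join()
    return outcome

# Same result as calling srv.stop() directly: the UNLOADING poll loop runs
# until exit_timeout elapses and the stub is force-terminated.
print(stop_in_background(srv))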

Expected behavior

Server.stop() should:

  1. Invoke finalize() on each loaded Python-backend model within a bounded time window.
  2. Reap the backend stub processes gracefully.
  3. Return without raising InternalError ("Exit timeout expired.") in the common case.
