In-process Python API: Server.stop() hangs until exit_timeout and never invokes Python-backend finalize() #8755

@bilelomrani1

Description

When using the in-process Python API with a model served by the Python backend, Server.stop() deadlocks: the core enters the unload loop and repeatedly logs "<model> v<ver>: UNLOADING" every second until exit_timeout elapses, at which point it force-terminates the stub and raises tritonserver.InternalError: "Exit timeout expired. Exiting immediately."

During the entire unload-polling window, the stub process is alive and idle; instrumentation proves the stub never receives the finalize request: our finalize() log only fires after the server issues the non-graceful termination. This looks like the same class of deadlock addressed by core#381, but on the Python-backend model-unload path rather than the request-release callback path.
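
For completeness, the instrumentation behind the "==> stub PID" line in the output below is roughly the following sketch (it assumes psutil is installed in the container and that the stub binary name contains "triton_python_backend_stub"; it is not part of the minimal reproducer):

import psutil

def find_stub_process():
    # The Python backend stub runs as a child of the in-process server,
    # so it shows up among this process's children.
    for child in psutil.Process().children(recursive=True):
        if "triton_python_backend_stub" in child.name():
            return child
    return None

stub = find_stub_process()
print("==> stub PID:", stub.pid if stub else "not found")
# During the UNLOADING poll loop the stub stays in a sleeping state with no
# growing CPU time, i.e. it is idle rather than stuck in user code.
if stub is not None:
    print(stub.status(), stub.cpu_times())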

model.infer(...) itself works correctly; the forward pass returns the expected output. The bug is strictly in the graceful-teardown path of Server.stop().

Triton Information

  • Triton version: 2.67.0
  • Image: nvcr.io/nvidia/tritonserver:26.03-py3
  • Python: 3.12.3 (system Python inside the image)
  • Container: using the official NGC container, not a custom build.
  • Host: GCE g2-standard-4 (1× NVIDIA L4, 16 GB RAM, Ubuntu 24.04 in the container).

To Reproduce

Minimal reproducer: it creates a single Python-backend model whose execute() does a small NumPy matmul, and whose finalize() logs a marker so we can tell whether it was invoked.

import os, tempfile, textwrap, time
import numpy as np
import tritonserver

CFG = textwrap.dedent("""
    name: "simple"
    backend: "python"
    max_batch_size: 0
    input  [ { name: "INPUT"  data_type: TYPE_FP32 dims: [ 4 ] } ]
    output [ { name: "OUTPUT" data_type: TYPE_FP32 dims: [ 4 ] } ]
    instance_group [ { kind: KIND_CPU } ]
""").strip()

MDL = textwrap.dedent("""
    import numpy as np, triton_python_backend_utils as pb_utils
    class TritonPythonModel:
        def initialize(self, args):
            self.W = np.random.default_rng(42).standard_normal((4,4)).astype(np.float32)
        def execute(self, requests):
            return [pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT",
                    (pb_utils.get_input_tensor_by_name(r,"INPUT").as_numpy() @ self.W).astype(np.float32))
            ]) for r in requests]
        def finalize(self):
            pb_utils.Logger.log_info("FINALIZE_ENTERED")
""").strip()

with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "simple", "1"))
    with open(os.path.join(repo, "simple", "config.pbtxt"), "w") as f:
        f.write(CFG)
    with open(os.path.join(repo, "simple", "1", "model.py"), "w") as f:
        f.write(MDL)

    srv = tritonserver.Server(
        model_repository=repo,
        log_info=True, log_verbose=1,
        exit_timeout=5,
    ).start(wait_until_ready=True)

    x = np.array([1, 2, 3, 4], dtype=np.float32)
    for r in srv.model("simple").infer(inputs={"INPUT": x}):
        print("infer ok:", np.from_dlpack(r.outputs["OUTPUT"]))

    t0 = time.time()
    try:
        srv.stop()
        print(f"stop OK in {time.time()-t0:.2f}s")
    except Exception as e:
        print(f"stop raised after {time.time()-t0:.2f}s: {type(e).__name__}: {e}")

Observed output

stdout from the reproducer interleaved with the relevant verbose Triton log lines (timestamps kept so the 5-second stall is visible; unrelated Triton startup banners elided):

I0422 18:39:46.027531 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027561 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from INITIALIZED to PENDING"
I0422 18:39:46.027610 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from PENDING to EXECUTING"
I0422 18:39:46.063623 7399 infer_response.cc:193] "add response output: output: OUTPUT, type: FP32, shape: [4]"
I0422 18:39:46.063851 7399 infer_request.cc:133] "[request id: <id_unknown>] Setting state from EXECUTING to RELEASED"
I0422 18:39:46.063873 7399 python_be.cc:2662] "TRITONBACKEND_ModelInstanceExecute: model instance name simple_0_0 released 1 requests"
infer ok: [-3.3836339 -1.6945097  5.5143633 -0.7957144]
==> stub PID: 7479
I0422 18:39:46.271175 7399 server.cc:312] "Waiting for in-flight requests to complete."
I0422 18:39:46.271221 7399 server.cc:328] "Timeout 5: Found 0 model versions that have in-flight inferences"
I0422 18:39:46.271286 7399 server.cc:343] "All models are stopped, unloading models"
I0422 18:39:46.271302 7399 server.cc:352] "Timeout 5: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:46.271308 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:47.271454 7399 server.cc:352] "Timeout 4: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:47.271461 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:48.271760 7399 server.cc:352] "Timeout 3: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:48.271771 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:49.271893 7399 server.cc:352] "Timeout 2: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:49.271900 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:50.272023 7399 server.cc:352] "Timeout 1: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:50.272028 7399 server.cc:358] "simple v1: UNLOADING"
I0422 18:39:51.272157 7399 server.cc:352] "Timeout 0: Found 1 live models and 0 in-flight non-inference requests"
I0422 18:39:51.272164 7399 server.cc:358] "simple v1: UNLOADING"
==> stop raised after 5.00s: InternalError: Exit timeout expired. Exiting immediately.
I0422 18:39:53.509387 7479 pb_stub.cc:2177]  Non-graceful termination detected.
I0422 18:39:53.549340 7479 model.py:11]  "FINALIZE_ENTERED"

Additional evidence

  • The Python backend stub process remains alive throughout the entire UNLOADING poll loop; it is not stuck inside user code but idle, waiting for a message from the main process.
  • Calling Server.stop() from a separate thread (so the main thread can release the GIL periodically) does not help; the stub still does not receive a finalize request until the force-termination path runs (see the sketch after this list).
  • model_control_mode = ModelControlMode.EXPLICIT (vs. the default NONE) does not help either; the same UNLOADING poll loop occurs.
  • Increasing exit_timeout only increases the wait before the force-kill. finalize() still doesn't run until the force-kill triggers.
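
The threaded-stop attempt from the second bullet was, roughly, the following sketch (names are illustrative; it only moves Server.stop() off the main thread and changes nothing else):

import threading

def stop_in_background(srv):
    # Run Server.stop() on a worker thread so the main thread keeps running
    # (and periodically releases the GIL) while the shutdown is in progress.
    outcome = {}
    def _stop():
        try:
            srv.stop()
            outcome["ok"] = True
        except Exception as e:
            outcome["error"] = e
    worker = threading.Thread(target=_stop, daemon=True)
    worker.start()
    worker.join()
    return outcome

# Same result as calling srv.stop() directly: the UNLOADING poll loop runs
# until exit_timeout elapses and the stub is force-terminated.
print(stop_in_background(srv))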

Expected behavior

Server.stop() should:

  1. Invoke finalize() on each loaded Python-backend model within a bounded time window.
  2. Reap the backend stub processes gracefully.
  3. Return without raising InternalError ("Exit timeout expired.") in the common case.
