refactor(sdk-py): wrap relaycast-sdk in hosted communicate transport#1187
Conversation
Replace RelayTransport's hand-rolled aiohttp HTTP/WS client with calls into the published relaycast-sdk (PyPI `relaycast-sdk`, module `relay_sdk`): - register/list/unregister agents via AsyncRelay (workspace key); unregister uses the client escape hatch since the agents namespace has no delete() - post/reply/dm/inbox via the AsyncAgentClient from relay.as_agent(token) - WebSocket via relay_sdk WsClient with on(event, handler); a minimal local pong workaround answers server keepalive pings (WsClient has no pong primitive). check_inbox reads the raw /v1/inbox payload through the agent http client because relay_sdk's typed InboxResponse drops the deliverable `messages` array. Envelope unwrap, 5xx retry and 401 handling now live in relay_sdk; RelayError is translated to the SDK's RelayAuthError / RelayConnectionError. The default base URL (https://api.relaycast.dev) is preserved and passed explicitly into the relay_sdk clients so behavior is unchanged. Public RelayTransport method signatures and the agent_id/token attributes are kept stable; RelayConfig and AgentRelayClient are untouched. Add relaycast-sdk to the communicate/dev extras (aiohttp stays for the A2A modules). Update the mock relay server to return SDK-contract-compliant payloads and adapt the retry/reconnect transport tests to the SDK's behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe communicate transport now delegates agent registration, messaging, WebSocket handling, and HTTP routing to ChangesRelayTransport migration to relaycast-sdk
Sequence Diagram(s)sequenceDiagram
participant RelayTransport
participant AsyncRelay
participant AsyncAgentClient
participant WsClient
RelayTransport->>AsyncRelay: agents.register()
AsyncRelay-->>RelayTransport: agent_id, token
RelayTransport->>AsyncRelay: as_agent(token)
AsyncRelay-->>RelayTransport: AsyncAgentClient
RelayTransport->>WsClient: connect()
WsClient-->>RelayTransport: open
RelayTransport->>AsyncAgentClient: dm() / send() / reply()
RelayTransport->>WsClient: disconnect()
RelayTransport->>AsyncRelay: agents.unregister()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2743f00e7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| with suppress(asyncio.TimeoutError): | ||
| await asyncio.wait_for(self._ws_open.wait(), timeout=5) |
There was a problem hiding this comment.
Restore polling fallback when WebSocket never opens
When HTTP registration succeeds but the WebSocket endpoint is blocked or unavailable, this suppressed timeout lets RelayTransport.connect() return successfully even though no socket is open. Relay._ensure_connected() treats any successful return as _ws_connected=True and therefore never starts the polling fallback, so inbox()/peek() stop polling instead of receiving messages in the exact no-WebSocket environment the fallback is meant to support.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3bd2928. connect() now waits for the first WebSocket open event and distinguishes "opened then later closed" from "never opened": if the socket never opens (endpoint blocked/unavailable, or _ws_run() errors out before emitting open), it tears the socket down and raises RelayConnectionError instead of returning successfully. Relay._ensure_connected() then catches it, keeps the HTTP registration in place, and starts the polling loop. Added a regression test (test_connect_raises_when_websocket_never_opens) using a mock server that refuses the WS upgrade.
| self.agent_id = None | ||
| self.token = None | ||
| await self._close_session_if_idle() | ||
| self._agent = None |
There was a problem hiding this comment.
Close the agent client before clearing it
disconnect() calls unregister_agent() before _close_clients(), so setting self._agent = None here discards the AsyncAgentClient created by relay.as_agent() before _close_clients() can await agent.client.close(). Repeated communicate sessions will leak the underlying httpx.AsyncClient/connections and produce unclosed-client warnings even when cleanup otherwise succeeds.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3bd2928. disconnect() now closes the per-agent client before unregister_agent() clears self._agent. Extracted a _close_agent_client() helper and call it first; the workspace relay client is deliberately kept alive at that point because unregister_agent() still needs it to issue the cleanup DELETE, then _close_clients() closes the relay client afterward. No more leaked httpx.AsyncClient.
| # path, so swallow any error (including the empty-body decode error | ||
| # relay_sdk raises when the server replies 204 No Content). | ||
| with suppress(Exception): | ||
| await relay._client.delete(f"/v1/agents/{self.agent_name}") |
There was a problem hiding this comment.
Encode agent names in cleanup requests
For agent names containing path/query-reserved characters such as /, ?, or #, embedding the raw name in the DELETE path changes the route that the HTTP client sends; the previous implementation percent-encoded the name. Because this cleanup path suppresses all exceptions, those agents are silently left registered after auto-cleanup instead of being removed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 3bd2928. The cleanup DELETE now uses urllib.parse.quote(self.agent_name, safe="") so reserved characters (/, ?, #, etc.) are percent-encoded and the request hits /v1/agents/{encoded} instead of silently changing the route. Added a regression test (test_unregister_agent_percent_encodes_name) that registers an agent named team/agent?x#y, asserts it is actually removed, and checks the raw wire path contains team%2Fagent%3Fx%23y.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
packages/sdk-py/src/agent_relay/communicate/transport.py (1)
303-312: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
_ws_runswallows all connect failures silently.
except Exception: passdiscards every websocket failure with no log. Sinceconnect()only waits 5s for "open" and then proceeds regardless, a persistent connect error becomes invisible — operators can't distinguish "WS healthy" from "WS dead, silently falling back to polling". Log at debug/warning before swallowing.🔧 Add observability before swallowing
try: await ws.connect() except asyncio.CancelledError: raise - except Exception: - pass + except Exception: + logger.debug("relay websocket run loop exited with error", exc_info=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/sdk-py/src/agent_relay/communicate/transport.py` around lines 303 - 312, The `_ws_run` method silently swallows all websocket connection exceptions without any logging, making it impossible to diagnose connection failures. In the except block that catches generic Exception (after the asyncio.CancelledError check), add a logger call at debug or warning level to record the exception details before the pass statement, so operators can observe when and why websocket connections are failing rather than having failures disappear silently.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/sdk-py/src/agent_relay/communicate/transport.py`:
- Around line 23-25: The import statement is attempting to load symbols that do
not exist in relay_sdk 0.1.1, specifically AsyncRelay, RelayError, WsClient, and
AsyncAgentClient. Replace these imports with the correct symbols from relay_sdk
0.1.1 where Relay is the main entry point and async functionality is integrated
into standard client classes rather than provided as separate Async-prefixed
variants. Verify the correct class names and error handling mechanisms available
in relay_sdk 0.1.1 documentation and update the import statement accordingly.
- Around line 337-340: The asyncio.ensure_future(result) call in the
isawaitable(result) block schedules the coroutine but discards the returned task
object, which can lead to garbage collection before the coroutine completes due
to weak references in the event loop. Store the task returned by
asyncio.ensure_future(result) in an instance variable or a persistent collection
(such as a set of pending tasks), and add a done callback to the task that
removes it from the collection when it completes. This ensures the task
reference is retained throughout its execution lifecycle.
- Around line 319-326: In the `_on_ws_ping` method, the task returned by
`asyncio.ensure_future()` is not retained, allowing it to be garbage-collected
before the pong coroutine executes, especially under load. Fix this by
initializing a `self._pending_tasks` set in the class `__init__` method to hold
strong references to pending tasks, then in `_on_ws_ping`, create the task from
`ws._send_json()` and add it to this set, then attach a callback to the task
that removes it from the set when completed using `task.add_done_callback(lambda
t: self._pending_tasks.discard(t))`. This ensures the pong message is guaranteed
to be sent before the task reference is released.
- Line 119: When agent deletion fails during best-effort cleanup (the exception
handler around line 138 where the broad exception is caught and suppressed), add
a logger.debug() statement to log the failure details. This ensures that genuine
server-side failures during the cleanup attempt are not completely silent, which
aids in debugging while still allowing the program to continue gracefully since
this is best-effort cleanup.
---
Nitpick comments:
In `@packages/sdk-py/src/agent_relay/communicate/transport.py`:
- Around line 303-312: The `_ws_run` method silently swallows all websocket
connection exceptions without any logging, making it impossible to diagnose
connection failures. In the except block that catches generic Exception (after
the asyncio.CancelledError check), add a logger call at debug or warning level
to record the exception details before the pass statement, so operators can
observe when and why websocket connections are failing rather than having
failures disappear silently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: c9fd9c70-9a0e-408d-b170-a55c0364833f
📒 Files selected for processing (4)
packages/sdk-py/pyproject.tomlpackages/sdk-py/src/agent_relay/communicate/transport.pypackages/sdk-py/tests/communicate/conftest.pypackages/sdk-py/tests/communicate/test_transport.py
| try: | ||
| import aiohttp | ||
| from aiohttp import WSMsgType | ||
| except ImportError: | ||
| from relay_sdk import AsyncRelay, RelayError, WsClient | ||
| from relay_sdk.agent import AsyncAgentClient |
There was a problem hiding this comment.
These imports are unavailable in relay_sdk 0.1.1 and will cause an ImportError. Withdrawn — verified incorrect. All four symbols exist in the published relaycast-sdk 0.1.1 wheel.
There was a problem hiding this comment.
Verified against the actual published relaycast-sdk 0.1.1 (== relaycast origin/main packages/sdk-python, version 0.1.1; also confirmed against the installed wheel relay_sdk.__version__ == '0.1.1'). The web-search claim here is incorrect — all four symbols exist and are exported:
relay_sdk/__init__.pyexportsAsyncRelay,Relay,RelayError,WsClient,AgentClient,AsyncAgentClient(in__all__).AsyncRelay—relay.py:394;__init__(self, api_key=None, *, base_url=None, ...)atrelay.py:405.RelayError—errors.py, with the.statusattribute_translate_errorreads.WsClient—ws.py:24;__init__(self, token, *, base_url=None, ...)atws.py:36.AsyncAgentClient—agent.py;from relay_sdk.agent import AsyncAgentClientresolves.
No code change needed for this import. (relaycast is a Rust crate too, but the Python package genuinely re-exports these names — the SDK author mirrored the Rust surface.)
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve the review comment by minimizing it
gh api \
--method PATCH \
-H "Accept: application/vnd.github+json" \
repos/AgentWorkforce/relay/pulls/comments/3459829357 \
-f body="~~These imports are unavailable in relay_sdk 0.1.1 and will cause an ImportError.~~ Withdrawn — verified incorrect. All four symbols exist in the published relaycast-sdk 0.1.1 wheel." 2>/dev/null || true
# Try to resolve the review comment thread
gh api graphql -f query='
mutation {
resolveReviewThread(input: {threadId: "PRRT_kwDOO9aWIs6A4rrl"}) {
thread { isResolved }
}
}' 2>/dev/null || trueLength of output: 4785
The review comment has been updated to mark it as withdrawn. The GraphQL thread resolution failed (I guessed the thread ID incorrectly), but the comment body now reflects the correction. Apologies for the noise from the inaccurate web search result — thanks for taking the time to verify directly against the installed wheel.
…oding Address three Codex review findings in the communicate transport: - Restore the no-WebSocket polling fallback. connect() now waits for the first "open" event and distinguishes "opened then later closed" from "never opened". If the socket never opens it tears down and raises RelayConnectionError so Relay._ensure_connected() falls back to HTTP polling (agent registration is preserved for polling). - Close the per-agent httpx client before unregister_agent() clears the reference. disconnect() now closes the agent client first, then unregisters (the workspace relay client stays alive for the DELETE), then closes the relay client. - Percent-encode the agent name in the cleanup DELETE so reserved characters (/, ?, #) don't change the route and silently leave the agent registered. Adds regression tests for the WS-blocked fallback and name encoding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address CodeRabbit review on PR #1187: - :379 / :393 — asyncio only weakly references tasks, so the pong reply (_on_ws_ping) and async message-callback coroutine (_on_ws_event) scheduled via asyncio.ensure_future could be garbage-collected before running, dropping keepalive pongs and intermittently losing delivered messages. Add a _pending_tasks set plus a _track() helper that retains a strong reference and discards on completion. Clears Ruff RUF006. - :126 — log a logger.debug() when the best-effort agent-unregister DELETE fails, so a genuine server-side failure is no longer completely silent. - :26 — verified (no code change): AsyncRelay, RelayError, WsClient and AsyncAgentClient and every method/signature the wrapper calls all exist in the published relaycast-sdk 0.1.1 (== relaycast origin/main packages/sdk-python). The CodeRabbit web-search claim was incorrect. Add tests covering the retained-task behavior for both the async callback and the pong path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…unds
Bump the communicate/dev extras to relaycast-sdk>=0.2.0, which adds the
methods the wrapper previously worked around:
- unregister_agent now calls relay.agents.delete(name) instead of the
raw relay._client.delete("/v1/agents/{name}") escape hatch. agents.delete
percent-encodes the name internally, so the wrapper's manual quote() (and
the urllib import) is dropped. Best-effort suppression + debug log kept.
- Remove the manual ws.on("ping") -> pong shim and _on_ws_ping: WsClient
auto-replies pong to a server {"type":"ping"} as of 0.2.0.
The _pending_tasks/_track retention for async USER message callbacks is an
independent correctness fix and is kept as-is.
Tests: replace the pong-retention test with one that verifies the SDK
auto-pongs end-to-end (new push_ws_ping mock helper). Inbox left unchanged
(engine /v1/inbox contract divergence is out of scope for this bump).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Bumped
Verified in a fresh uv venv with |
Default to cast.agentrelay.com (canonical host). gateway/api.relaycast.dev now only persist in already-shipped versions, served by the legacy-router strangler. Existing pre-cutover users with old-DB workspaces who rely on the default should set RELAY_BASE_URL or migrate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Updated default base URL to https://cast.agentrelay.com (canonical host); legacy api.relaycast.dev now only served by the legacy-router strangler. |
…-sdk # Conflicts: # packages/sdk-py/src/agent_relay/communicate/transport.py # packages/sdk-py/src/agent_relay/communicate/types.py # packages/sdk-py/tests/communicate/test_types.py
Summary
Refactors the Python SDK's hosted communicate transport (
agent_relay.communicate.transport.RelayTransport) to wrap the publishedrelaycast-sdk(PyPIrelaycast-sdk, modulerelay_sdk) instead of hand-rolling aiohttp HTTP/WS.relay_sdknow owns the wire protocol (envelope unwrap, 5xx retry, auth errors, the WebSocket lifecycle).Mapping
AsyncRelay(...).agents.register(name, type="agent")(reads.id/.token)relay._client.delete("/v1/agents/{name}")escape hatch (noagents.delete()exists)AsyncAgentClient.send(channel, text)AsyncAgentClient.reply(id, text)AsyncAgentClient.dm(to, text)agent.client.get("/v1/inbox")(typedInboxResponsedropsmessages)AsyncRelay(...).agents.list()relay_sdk.WsClientwithon(event, handler)What mapped cleanly vs needed workarounds
Clean: register, list, post, reply, dm, error translation (
RelayError→RelayAuthError/RelayConnectionError), 5xx retry, 401 handling, base-URL pass-through.Workarounds:
relay_sdkhas noagents.delete(); used the http-client escape hatch. The cleanup path also swallows the empty-body decode errorrelay_sdkraises on204 No Content.AgentClient.inbox()returns the typedInboxResponse(unread metadata only) and discards the deliverablemessagesarray; read the raw/v1/inboxpayload through the agent http client (still envelope-unwrapped/retried/auth-checked) to preserve behavior.WsClienthas no pong-on-server-ping primitive; kept a minimal local workaround (on("ping", ...)→_send_json({"type":"pong"})).WsClient.connect()blocks for the socket lifetime, so it runs in a background task;connect()waits on the"open"event so the socket is live on return.Constraints honored
https://api.relaycast.devpreserved and passed explicitly into therelay_sdkclients.RelayConfig/ public types andAgentRelayClientbroker behavior untouched.RelayTransportpublic method signatures andagent_id/tokenattributes kept stable.relaycast-sdkadded to thecommunicate(anddev) extras;aiohttpretained for the A2A modules.Tests
relaycast-sdkwas installed and inspected to verify the surface. The in-process mock relay server was updated to return SDK-contract-compliant payloads, and the retry/reconnect transport tests were adapted torelay_sdk's behavior (retry = initial + 3; reconnect owned byWsClient).python -m pytest tests/communicate→ 211 passed, 2 skipped.🤖 Generated with Claude Code