Skip to content

Add Redis-backed challenge coordination for cross-node machine setup#45

Draft
Codex wants to merge 3 commits into
mainfrom
codex/make-ssh-sync-server-scalable
Draft

Add Redis-backed challenge coordination for cross-node machine setup#45
Codex wants to merge 3 commits into
mainfrom
codex/make-ssh-sync-server-scalable

Conversation

@Codex

@Codex Codex AI commented Feb 28, 2026

Copy link
Copy Markdown
Contributor

Challenge handshakes were kept in an in-memory map, preventing responses from being resolved when requests hit different nodes behind a load balancer.

  • Distributed coordination: Introduced ChallengeBus (Redis-based) to register challenges, publish/subscribe acceptance, challenger keys, and encrypted master keys with per-node channels; bus now initializes via DI, fails fast when Redis is configured but unavailable, and shuts down cleanly.
  • Handler integration: NewMachineChallengeHandler registers challenges in Redis (with TTL) and records owner/responder nodes; MachineChallengeResponseHandler now serves remote challenges by relaying through Redis with non-blocking dispatch, falling back to local in-memory flow when Redis is absent.
  • Reliability: Pub/sub events are delivered via asynchronous listeners with guarded channel sends to avoid subscriber stalls; duplicate remote waits are rejected and unused encrypted-key paths removed.
  • Tests: Added a Redis-backed cross-node integration test using miniredis to exercise the distributed handshake flow.
  • Configuration docs: Documented REDIS_ADDR, REDIS_PASSWORD, REDIS_DB, and NODE_ID for multi-node deployments; promoted Redis dependency to main requirements.

Example flow (cross-node):

// Node A registers challenge
bus := resolveChallengeBus(injector)
_ = bus.RegisterChallenge(ctx, challengePhrase, user.Username, time.Minute)

// Node B responds; publishes acceptance and waits for challenger key via Redis
if handled := handleRemoteChallengeResponse(bus, user, challengePhrase, conn); handled {
    return
}
Original prompt

This section details on the original issue you should resolve

<issue_title>Make ssh-sync-server horizontally scalable</issue_title>
<issue_description>For the challenges, currently we store state in ChallengeResponseDict. This is thread-safe but single node.

This state is responsible for responding to challenges, but the problem with this approach is that it is not horizontally scalable...

If a user instantiates a challenge, that pops a thread on node A (which is probably fine). But then if we had a challenge response submitted to node B, node B would not be able to resolve the challenge as it isn't sharing the state that node A has.

If we used Redis to have a distributed state this might help. So node B checks redis, finds the state. It would also need to locate the machine that has the thread active, and notify it somehow. Could probably do this with a pub/sub system.

The app would still utilize channels, but would be receiving signals from Redis directly.

This may not be very necessary but would improve the scalability of ssh-sync.</issue_description>

Comments on the Issue (you are @codex[agent] in this section)

@therealpaulgg https://redis.uptrace.dev/guide/go-redis-pubsub.html @therealpaulgg In order to test this we would need some sort of load balancer to route traffic correctly...infra hell.

K8s, docker swarm

this may be the easiest way?</comment_new>

@Codex Codex AI self-assigned this Feb 28, 2026
@Codex Codex AI changed the title [WIP] Make ssh-sync-server horizontally scalable using Redis Add Redis-backed challenge coordination for cross-node machine setup Feb 28, 2026
@therealpaulgg

therealpaulgg commented Feb 28, 2026

Copy link
Copy Markdown
Owner

@codex[agent]

The overall architecture is correct — using Redis pub/sub as the inter-node bus and local Go channels as the intra-node bus is exactly the right pattern. However there are several bugs and design issues that need to be addressed before this can be merged.


🔴 Critical: synchronous event dispatch blocks the entire subscriber goroutine

startListener dispatches events synchronously:

for msg := range sub.Channel() {
    c.handleEvent(event)  // blocking send inside
}

handleEvent calls sendBool/sendKey/sendBytes which are blocking channel sends. If any of those channels isn't being consumed (e.g. the owner goroutine already timed out), the subscriber goroutine stalls and every subsequent Redis event for this node is dropped. All in-flight challenges on the node hang indefinitely.

Fix: dispatch each event in its own goroutine:

for msg := range sub.Channel() {
    event := challengeEvent{}
    if err := json.Unmarshal([]byte(msg.Payload), &event); err != nil {
        log.Warn().Err(err).Msg("Failed to unmarshal challenge event")
        continue
    }
    go c.handleEvent(event)
}

🔴 sendBool/sendKey/sendBytes use recover() as control flow

func sendBool(ch chan bool, val bool) {
    defer func() { _ = recover() }()
    ch <- val
}

Silently swallowing panics from closed-channel sends hides real bugs. If a send goes to the wrong channel or happens after cleanup, you'll never know.

Fix: use a select with a default or a dedicated done channel to guard sends, rather than relying on panic recovery.


🔴 sync.Once means a Redis startup failure is permanent for the process lifetime

If Redis is briefly unavailable when the first request arrives, newChallengeBus() returns nil, and challengeBusOnce records that — every future call to getChallengeBus() silently returns nil forever, falling back to single-node mode with no recovery path short of restarting all nodes. This defeats the intent of "optional Redis."

Fix: either retry connection on each getChallengeBus() call with a timeout, or fail fast at startup when REDIS_ADDR is set and Redis is unreachable (rather than silently degrading).


🟡 remoteWait.encryptedKey channel is dead code

handleRemoteChallengeResponse (Node B) reads the encrypted key directly from the WebSocket and publishes it to Node A via Redis. Node A's listener handles eventTypeEncryptedMasterKey by writing to session.ResponderChannel. The wait.encryptedKey channel is only ever written to in handleEvent when there is a remoteWait but no local session — a scenario that never occurs in the normal cross-node flow.

Fix: remove encryptedKey from remoteWait and the corresponding branch in handleEvent.


🟡 bus.cancel() is never called — goroutine and subscription leak

cancel is stored on ChallengeBus but never invoked anywhere. The subscriber goroutine and Redis subscription run until the process exits with no graceful shutdown path.

Fix: wire bus.cancel() into the application shutdown sequence (e.g. via the do injector's shutdown hooks).


🟡 ChallengeBus doesn't integrate with the DI container

Everything else in this codebase uses github.com/samber/do for dependency injection and lifecycle management. ChallengeBus is a package-level global initialized with sync.Once, making it inconsistent and hard to test.

Fix: register ChallengeBus as a service in the do injector so its lifecycle (startup ping, shutdown cancel) is managed alongside the rest of the app.


🟡 registerRemoteWait always returns true — misleading API

if wait, exists := c.remoteWaits[challenge]; exists {
    return wait, true  // ← same bool as "freshly created"
}
...
return wait, true

There's no way for the caller to distinguish "just registered" from "already existed." If two responders race on the same challenge phrase, the second silently reuses the first's wait channels.

Fix: return false (or an error) when a wait already exists for that challenge, so the caller can handle the conflict explicitly.


🟡 No tests for the Redis/cross-node path

The existing main_test.go only covers single-node scenarios. The entire distributed coordination flow has zero test coverage.

Fix: add integration tests that spin up two ChallengeBus instances against a real (or miniredis) Redis and exercise the full cross-node challenge flow.

@therealpaulgg

Copy link
Copy Markdown
Owner

@codex[agent] implement the changes requested in the comment above (PR reviewed)

@Codex

Codex AI commented Feb 28, 2026

Copy link
Copy Markdown
Contributor Author

@codex[agent]

The overall architecture is correct — using Redis pub/sub as the inter-node bus and local Go channels as the intra-node bus is exactly the right pattern. However there are several ...

Implemented the requested fixes in 9e73fa2: listener now dispatches events asynchronously with non-blocking sends, ChallengeBus is DI-managed with fail-fast Redis init and proper Close(), remote wait state is simplified (no encryptedKey, duplicate registrations rejected), and session metadata drives cross-node routing. Added a Redis-backed cross-node challenge test to exercise the distributed path. Responder events now prioritize remote waits so mixed-node flows proceed without blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make ssh-sync-server horizontally scalable

2 participants