Skip to content

fix: rotate dead reused heartbeat sessions#4

Open
Gradata wants to merge 1 commit into
masterfrom
gra-1736-dead-reused-session
Open

fix: rotate dead reused heartbeat sessions#4
Gradata wants to merge 1 commit into
masterfrom
gra-1736-dead-reused-session

Conversation

@Gradata
Copy link
Copy Markdown
Owner

@Gradata Gradata commented Jun 6, 2026

Summary

  • Adds a heartbeat circuit-breaker for dead reused persisted sessions.
  • Rotates after 3 consecutive failed liveness runs with session reuse and zero raw/normalized input, cached, and output tokens on the same persisted session.
  • Keeps existing policy-driven session compaction behavior intact.

Tests

  • /home/olive/.hermes/node/bin/pnpm exec vitest run server/src/__tests__/heartbeat-workspace-session.test.ts — 41 passed
  • /home/olive/.hermes/node/bin/pnpm --filter @paperclipai/server typecheck — passed

Fixes Paperclip GRA-1736.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 6, 2026

Review Change Stack

📝 Walkthrough
  • New feature: Adds heartbeat circuit-breaker that rotates dead persisted sessions when they are being reused after 3 consecutive failed liveness checks with zero tokens
  • Rotation criteria: Triggers when session reuse is active, raw/normalized input counts are zero, and cached/output tokens are zero for the same persisted session
  • Preserves existing behavior: Session compaction thresholds and policy-driven compaction remain unchanged
  • New public API:
    • DEAD_REUSED_SESSION_FAILURE_THRESHOLD constant (value: 3)
    • detectDeadReusedSessionCircuitBreaker() function for detecting dead reused session conditions
  • Implementation updates: evaluateSessionCompaction now computes circuit-breaker outcome; session runs query extended to include livenessState data
  • Test coverage: 41 tests passing in heartbeat-workspace-session.test.ts with new circuit-breaker test suite
  • Verification: Typechecking passed
  • Issue resolved: Fixes GRA-1736

Walkthrough

This PR adds a circuit breaker to the heartbeat session compaction logic that forces session rotation after three consecutive failed zero-token reused runs for the same persisted session, protecting against sessions that repeatedly fail without producing tokens.

Changes

Dead Reused Session Circuit Breaker

Layer / File(s) Summary
Circuit breaker detection logic
server/src/services/heartbeat.ts
Exports DEAD_REUSED_SESSION_FAILURE_THRESHOLD constant and implements detectDeadReusedSessionCircuitBreaker function that inspects consecutive heartbeat runs for failed liveness state, matched persisted session ID, reused flag, and zero token counts to determine if rotation is needed.
Session compaction integration
server/src/services/heartbeat.ts
evaluateSessionCompaction now fetches livenessState in the runs query, incorporates the circuit-breaker threshold into fetch limits, prioritizes circuit-breaker outcomes in rotation decision logic, and uses circuit-breaker-derived previousRunId in both rotation and non-rotation return paths.
Test coverage
server/src/__tests__/heartbeat-workspace-session.test.ts
Adds import of detectDeadReusedSessionCircuitBreaker and new test suite with helper to construct dead-reused runs, verifying rotation occurs after three consecutive failures and that progress (non-zero tokens or advanced liveness state) prevents rotation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

bug

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: rotate dead reused heartbeat sessions' directly describes the main change: implementing a circuit-breaker to rotate dead reused persisted sessions.
Description check ✅ Passed The description is clearly related to the changeset, explaining the circuit-breaker logic, rotation threshold, token conditions, and test verification.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch gra-1736-dead-reused-session

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the bug Something isn't working label Jun 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/src/services/heartbeat.ts (1)

3538-3550: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Scope the compaction history queries by company.

This new session-history path still reads heartbeatRuns by agent.id/sessionId only. Please add heartbeatRuns.companyId = agent.companyId here and thread companyId into getOldestRunForSession() as well, otherwise the circuit breaker depends on global identifier uniqueness instead of enforcing the tenant boundary in the service.

As per coding guidelines, "Keep changes company-scoped. Every domain entity should be scoped to a company and company boundaries must be enforced in routes/services."

Suggested fix
-  async function getOldestRunForSession(agentId: string, sessionId: string) {
+  async function getOldestRunForSession(companyId: string, agentId: string, sessionId: string) {
     return db
       .select({
         id: heartbeatRuns.id,
         createdAt: heartbeatRuns.createdAt,
       })
       .from(heartbeatRuns)
-      .where(and(eq(heartbeatRuns.agentId, agentId), eq(heartbeatRuns.sessionIdAfter, sessionId)))
+      .where(and(
+        eq(heartbeatRuns.companyId, companyId),
+        eq(heartbeatRuns.agentId, agentId),
+        eq(heartbeatRuns.sessionIdAfter, sessionId),
+      ))
       .orderBy(asc(heartbeatRuns.createdAt), asc(heartbeatRuns.id))
       .limit(1)
       .then((rows) => rows[0] ?? null);
   }
...
     const runs = await db
       .select({
         id: heartbeatRuns.id,
         createdAt: heartbeatRuns.createdAt,
         usageJson: heartbeatRuns.usageJson,
         error: heartbeatRuns.error,
         livenessState: heartbeatRuns.livenessState,
         ...heartbeatRunListResultColumns,
       })
       .from(heartbeatRuns)
-      .where(and(eq(heartbeatRuns.agentId, agent.id), eq(heartbeatRuns.sessionIdAfter, sessionId)))
+      .where(and(
+        eq(heartbeatRuns.companyId, agent.companyId),
+        eq(heartbeatRuns.agentId, agent.id),
+        eq(heartbeatRuns.sessionIdAfter, sessionId),
+      ))
       .orderBy(desc(heartbeatRuns.createdAt))
       .limit(fetchLimit);
...
       policy.maxSessionAgeHours > 0
-        ? await getOldestRunForSession(agent.id, sessionId)
+        ? await getOldestRunForSession(agent.companyId, agent.id, sessionId)
         : runs[runs.length - 1] ?? latestRun;

Also applies to: 3563-3565

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/src/services/heartbeat.ts` around lines 3538 - 3550, The query
selecting heartbeatRuns is missing tenant scoping; add a where condition
requiring heartbeatRuns.companyId equal agent.companyId in the query that builds
"runs" (the chain using
db.select(...).from(heartbeatRuns).where(...).orderBy(...).limit(fetchLimit)),
and likewise update the call sites and signature of getOldestRunForSession() to
accept and use companyId (pass agent.companyId into getOldestRunForSession() and
ensure its internal queries filter on heartbeatRuns.companyId) so all
session-history/compaction logic is enforced per company.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/src/__tests__/heartbeat-workspace-session.test.ts`:
- Around line 582-590: The test is confusing because a run created via
deadRun("run-2") sets rawInputTokens/rawOutputTokens to zero while the override
sets inputTokens/outputTokens to non-zero; update the test to be explicit:
either remove the usageJson.inputTokens and usageJson.outputTokens overrides so
the break in the streak is clearly due to livenessState: "advanced", or make the
usageJson consistent by also setting rawInputTokens and rawOutputTokens to match
the normalized inputTokens/outputTokens; adjust the object created from
deadRun("run-2") accordingly (references: deadRun, livenessState, usageJson,
rawInputTokens, rawOutputTokens, inputTokens, outputTokens).
- Around line 595-597: The test currently asserts decision.rotate and
decision.reason but omits verifying decision.previousRunId; add an assertion to
check the full contract by asserting decision.previousRunId equals "run-3"
(e.g., add expect(decision.previousRunId).toBe("run-3") right after the existing
expect(decision.reason).toBeNull()); reference the test subject `decision` and
the property `previousRunId`.

In `@server/src/services/heartbeat.ts`:
- Around line 3575-3580: The circuit-breaker rotation only suppresses the
session for the current invocation but leaves the old fallback in place, so
resolveNextSessionState() still uses the original previousSessionParams
(persisting the dead session into sessionIdAfter/agentTaskSessions). Update the
handling around detectDeadReusedSessionCircuitBreaker
(deadSessionCircuitBreaker, reason, previousRunId) so that when the circuit
indicates a rotate/suppression you clear the previous-session fallback passed
downstream (e.g., set previousRunId and any previousSessionParams to null or
remove the previousSessionParams object) before calling
resolveNextSessionState() — ensure this change touches the variables named
deadSessionCircuitBreaker, previousRunId, previousSessionParams, and any code
that sets sessionIdAfter/agentTaskSessions so the dead session cannot be
reintroduced on the next wake.

---

Outside diff comments:
In `@server/src/services/heartbeat.ts`:
- Around line 3538-3550: The query selecting heartbeatRuns is missing tenant
scoping; add a where condition requiring heartbeatRuns.companyId equal
agent.companyId in the query that builds "runs" (the chain using
db.select(...).from(heartbeatRuns).where(...).orderBy(...).limit(fetchLimit)),
and likewise update the call sites and signature of getOldestRunForSession() to
accept and use companyId (pass agent.companyId into getOldestRunForSession() and
ensure its internal queries filter on heartbeatRuns.companyId) so all
session-history/compaction logic is enforced per company.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6d52b386-0d18-46ef-8072-55875fe5dddf

📥 Commits

Reviewing files that changed from the base of the PR and between c831623 and ce78cc1.

📒 Files selected for processing (2)
  • server/src/__tests__/heartbeat-workspace-session.test.ts
  • server/src/services/heartbeat.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
server/**/*.{ts,js}

📄 CodeRabbit inference engine (AGENTS.md)

server/**/*.{ts,js}: Keep changes company-scoped. Every domain entity should be scoped to a company and company boundaries must be enforced in routes/services.
Apply company access checks in API endpoints.
Enforce actor permissions (board vs agent) in API endpoints.
Write activity log entries for mutations in API endpoints.
Return consistent HTTP errors (400/401/403/404/409/422/500) from API endpoints.
createServerAdapter() must include ALL optional fields, especially detectModel, when creating adapter instances.

Files:

  • server/src/__tests__/heartbeat-workspace-session.test.ts
  • server/src/services/heartbeat.ts
🔇 Additional comments (3)
server/src/__tests__/heartbeat-workspace-session.test.ts (2)

6-6: LGTM!

Also applies to: 11-11


548-575: LGTM!

server/src/services/heartbeat.ts (1)

239-239: LGTM!

Also applies to: 1499-1544

Comment on lines +582 to +590
{
...deadRun("run-2"),
livenessState: "advanced",
usageJson: {
...deadRun("run-2").usageJson,
inputTokens: 12,
outputTokens: 3,
},
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Clarify the test by updating raw token counts or removing the token override.

The test overrides inputTokens and outputTokens to non-zero values, but leaves rawInputTokens and rawOutputTokens at zero (inherited from the spread). Since the circuit-breaker check returns false as soon as livenessState !== "failed" (per context snippet 3), the token changes are redundant and create potential confusion about which condition breaks the streak.

Consider either:

  1. Removing the token overrides entirely and relying solely on livenessState: "advanced" to break the streak (clearer intent), or
  2. Also updating rawInputTokens and rawOutputTokens to match the normalized counts (more realistic run).
Option 1: Remove redundant token overrides
       {
         ...deadRun("run-2"),
         livenessState: "advanced",
-        usageJson: {
-          ...deadRun("run-2").usageJson,
-          inputTokens: 12,
-          outputTokens: 3,
-        },
       },
Option 2: Update raw token counts for realism
       {
         ...deadRun("run-2"),
         livenessState: "advanced",
         usageJson: {
           ...deadRun("run-2").usageJson,
           inputTokens: 12,
+          rawInputTokens: 12,
           outputTokens: 3,
+          rawOutputTokens: 3,
         },
       },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{
...deadRun("run-2"),
livenessState: "advanced",
usageJson: {
...deadRun("run-2").usageJson,
inputTokens: 12,
outputTokens: 3,
},
},
{
...deadRun("run-2"),
livenessState: "advanced",
},
Suggested change
{
...deadRun("run-2"),
livenessState: "advanced",
usageJson: {
...deadRun("run-2").usageJson,
inputTokens: 12,
outputTokens: 3,
},
},
{
...deadRun("run-2"),
livenessState: "advanced",
usageJson: {
...deadRun("run-2").usageJson,
inputTokens: 12,
rawInputTokens: 12,
outputTokens: 3,
rawOutputTokens: 3,
},
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/src/__tests__/heartbeat-workspace-session.test.ts` around lines 582 -
590, The test is confusing because a run created via deadRun("run-2") sets
rawInputTokens/rawOutputTokens to zero while the override sets
inputTokens/outputTokens to non-zero; update the test to be explicit: either
remove the usageJson.inputTokens and usageJson.outputTokens overrides so the
break in the streak is clearly due to livenessState: "advanced", or make the
usageJson consistent by also setting rawInputTokens and rawOutputTokens to match
the normalized inputTokens/outputTokens; adjust the object created from
deadRun("run-2") accordingly (references: deadRun, livenessState, usageJson,
rawInputTokens, rawOutputTokens, inputTokens, outputTokens).

Comment on lines +595 to +597
expect(decision.rotate).toBe(false);
expect(decision.reason).toBeNull();
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Assert previousRunId to verify the full contract.

When the consecutive streak is broken, the circuit breaker still returns previousRunId pointing to the latest dead run encountered before the break. Per context snippet 2, this should be "run-3" in this test case. Adding this assertion would provide more complete contract verification.

✅ Add previousRunId assertion
   expect(decision.rotate).toBe(false);
   expect(decision.reason).toBeNull();
+  expect(decision.previousRunId).toBe("run-3");
 });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
expect(decision.rotate).toBe(false);
expect(decision.reason).toBeNull();
});
expect(decision.rotate).toBe(false);
expect(decision.reason).toBeNull();
expect(decision.previousRunId).toBe("run-3");
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/src/__tests__/heartbeat-workspace-session.test.ts` around lines 595 -
597, The test currently asserts decision.rotate and decision.reason but omits
verifying decision.previousRunId; add an assertion to check the full contract by
asserting decision.previousRunId equals "run-3" (e.g., add
expect(decision.previousRunId).toBe("run-3") right after the existing
expect(decision.reason).toBeNull()); reference the test subject `decision` and
the property `previousRunId`.

Comment on lines +3575 to +3580
const deadSessionCircuitBreaker = detectDeadReusedSessionCircuitBreaker({
sessionId,
runs,
});
let reason: string | null = deadSessionCircuitBreaker.reason;
let previousRunId = deadSessionCircuitBreaker.previousRunId ?? latestRun?.id ?? null;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear the old session fallback when the circuit breaker rotates.

This new rotate path only suppresses the session for the current adapter invocation. resolveNextSessionState() later still receives the original previousSessionParams, so any adapter that does not emit explicit session metadata will persist the dead session back into sessionIdAfter/agentTaskSessions, which defeats the breaker on the next wake.

Suggested fix
-    const previousSessionParams =
+    const previousSessionParams =
       explicitResumeSessionParams ??
       (explicitResumeSessionDisplayId ? { sessionId: explicitResumeSessionDisplayId } : null) ??
       normalizeSessionParams(sessionCodec.deserialize(taskSessionForRun?.sessionParamsJson ?? null));
+    let sessionFallbackParams = previousSessionParams;
...
     if (sessionCompaction.rotate) {
       context.paperclipSessionHandoffMarkdown = sessionCompaction.handoffMarkdown;
       context.paperclipSessionRotationReason = sessionCompaction.reason;
       context.paperclipPreviousSessionId = previousSessionDisplayId ?? runtimeSessionIdForAdapter;
       runtimeSessionIdForAdapter = null;
       runtimeSessionParamsForAdapter = null;
       previousSessionDisplayId = null;
+      sessionFallbackParams = null;
       if (sessionCompaction.reason) {
         runtimeWorkspaceWarnings.push(
           `Starting a fresh session because ${sessionCompaction.reason}.`,
         );
       }
@@
       const nextSessionState = resolveNextSessionState({
         codec: sessionCodec,
         adapterResult,
-        previousParams: previousSessionParams,
+        previousParams: sessionFallbackParams,
         previousDisplayId: runtimeForAdapter.sessionDisplayId,
         previousLegacySessionId: runtimeForAdapter.sessionId,
       });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/src/services/heartbeat.ts` around lines 3575 - 3580, The
circuit-breaker rotation only suppresses the session for the current invocation
but leaves the old fallback in place, so resolveNextSessionState() still uses
the original previousSessionParams (persisting the dead session into
sessionIdAfter/agentTaskSessions). Update the handling around
detectDeadReusedSessionCircuitBreaker (deadSessionCircuitBreaker, reason,
previousRunId) so that when the circuit indicates a rotate/suppression you clear
the previous-session fallback passed downstream (e.g., set previousRunId and any
previousSessionParams to null or remove the previousSessionParams object) before
calling resolveNextSessionState() — ensure this change touches the variables
named deadSessionCircuitBreaker, previousRunId, previousSessionParams, and any
code that sets sessionIdAfter/agentTaskSessions so the dead session cannot be
reintroduced on the next wake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant