fix: rotate dead reused heartbeat sessions#4
Conversation
📝 Walkthrough
WalkthroughThis PR adds a circuit breaker to the heartbeat session compaction logic that forces session rotation after three consecutive failed zero-token reused runs for the same persisted session, protecting against sessions that repeatedly fail without producing tokens. ChangesDead Reused Session Circuit Breaker
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/src/services/heartbeat.ts (1)
3538-3550:⚠️ Potential issue | 🟠 Major | ⚡ Quick winScope the compaction history queries by company.
This new session-history path still reads
heartbeatRunsbyagent.id/sessionIdonly. Please addheartbeatRuns.companyId = agent.companyIdhere and threadcompanyIdintogetOldestRunForSession()as well, otherwise the circuit breaker depends on global identifier uniqueness instead of enforcing the tenant boundary in the service.As per coding guidelines, "Keep changes company-scoped. Every domain entity should be scoped to a company and company boundaries must be enforced in routes/services."
Suggested fix
- async function getOldestRunForSession(agentId: string, sessionId: string) { + async function getOldestRunForSession(companyId: string, agentId: string, sessionId: string) { return db .select({ id: heartbeatRuns.id, createdAt: heartbeatRuns.createdAt, }) .from(heartbeatRuns) - .where(and(eq(heartbeatRuns.agentId, agentId), eq(heartbeatRuns.sessionIdAfter, sessionId))) + .where(and( + eq(heartbeatRuns.companyId, companyId), + eq(heartbeatRuns.agentId, agentId), + eq(heartbeatRuns.sessionIdAfter, sessionId), + )) .orderBy(asc(heartbeatRuns.createdAt), asc(heartbeatRuns.id)) .limit(1) .then((rows) => rows[0] ?? null); } ... const runs = await db .select({ id: heartbeatRuns.id, createdAt: heartbeatRuns.createdAt, usageJson: heartbeatRuns.usageJson, error: heartbeatRuns.error, livenessState: heartbeatRuns.livenessState, ...heartbeatRunListResultColumns, }) .from(heartbeatRuns) - .where(and(eq(heartbeatRuns.agentId, agent.id), eq(heartbeatRuns.sessionIdAfter, sessionId))) + .where(and( + eq(heartbeatRuns.companyId, agent.companyId), + eq(heartbeatRuns.agentId, agent.id), + eq(heartbeatRuns.sessionIdAfter, sessionId), + )) .orderBy(desc(heartbeatRuns.createdAt)) .limit(fetchLimit); ... policy.maxSessionAgeHours > 0 - ? await getOldestRunForSession(agent.id, sessionId) + ? await getOldestRunForSession(agent.companyId, agent.id, sessionId) : runs[runs.length - 1] ?? latestRun;Also applies to: 3563-3565
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/src/services/heartbeat.ts` around lines 3538 - 3550, The query selecting heartbeatRuns is missing tenant scoping; add a where condition requiring heartbeatRuns.companyId equal agent.companyId in the query that builds "runs" (the chain using db.select(...).from(heartbeatRuns).where(...).orderBy(...).limit(fetchLimit)), and likewise update the call sites and signature of getOldestRunForSession() to accept and use companyId (pass agent.companyId into getOldestRunForSession() and ensure its internal queries filter on heartbeatRuns.companyId) so all session-history/compaction logic is enforced per company.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/src/__tests__/heartbeat-workspace-session.test.ts`:
- Around line 582-590: The test is confusing because a run created via
deadRun("run-2") sets rawInputTokens/rawOutputTokens to zero while the override
sets inputTokens/outputTokens to non-zero; update the test to be explicit:
either remove the usageJson.inputTokens and usageJson.outputTokens overrides so
the break in the streak is clearly due to livenessState: "advanced", or make the
usageJson consistent by also setting rawInputTokens and rawOutputTokens to match
the normalized inputTokens/outputTokens; adjust the object created from
deadRun("run-2") accordingly (references: deadRun, livenessState, usageJson,
rawInputTokens, rawOutputTokens, inputTokens, outputTokens).
- Around line 595-597: The test currently asserts decision.rotate and
decision.reason but omits verifying decision.previousRunId; add an assertion to
check the full contract by asserting decision.previousRunId equals "run-3"
(e.g., add expect(decision.previousRunId).toBe("run-3") right after the existing
expect(decision.reason).toBeNull()); reference the test subject `decision` and
the property `previousRunId`.
In `@server/src/services/heartbeat.ts`:
- Around line 3575-3580: The circuit-breaker rotation only suppresses the
session for the current invocation but leaves the old fallback in place, so
resolveNextSessionState() still uses the original previousSessionParams
(persisting the dead session into sessionIdAfter/agentTaskSessions). Update the
handling around detectDeadReusedSessionCircuitBreaker
(deadSessionCircuitBreaker, reason, previousRunId) so that when the circuit
indicates a rotate/suppression you clear the previous-session fallback passed
downstream (e.g., set previousRunId and any previousSessionParams to null or
remove the previousSessionParams object) before calling
resolveNextSessionState() — ensure this change touches the variables named
deadSessionCircuitBreaker, previousRunId, previousSessionParams, and any code
that sets sessionIdAfter/agentTaskSessions so the dead session cannot be
reintroduced on the next wake.
---
Outside diff comments:
In `@server/src/services/heartbeat.ts`:
- Around line 3538-3550: The query selecting heartbeatRuns is missing tenant
scoping; add a where condition requiring heartbeatRuns.companyId equal
agent.companyId in the query that builds "runs" (the chain using
db.select(...).from(heartbeatRuns).where(...).orderBy(...).limit(fetchLimit)),
and likewise update the call sites and signature of getOldestRunForSession() to
accept and use companyId (pass agent.companyId into getOldestRunForSession() and
ensure its internal queries filter on heartbeatRuns.companyId) so all
session-history/compaction logic is enforced per company.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6d52b386-0d18-46ef-8072-55875fe5dddf
📒 Files selected for processing (2)
server/src/__tests__/heartbeat-workspace-session.test.tsserver/src/services/heartbeat.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
server/**/*.{ts,js}
📄 CodeRabbit inference engine (AGENTS.md)
server/**/*.{ts,js}: Keep changes company-scoped. Every domain entity should be scoped to a company and company boundaries must be enforced in routes/services.
Apply company access checks in API endpoints.
Enforce actor permissions (board vs agent) in API endpoints.
Write activity log entries for mutations in API endpoints.
Return consistent HTTP errors (400/401/403/404/409/422/500) from API endpoints.
createServerAdapter() must include ALL optional fields, especially detectModel, when creating adapter instances.
Files:
server/src/__tests__/heartbeat-workspace-session.test.tsserver/src/services/heartbeat.ts
🔇 Additional comments (3)
server/src/__tests__/heartbeat-workspace-session.test.ts (2)
6-6: LGTM!Also applies to: 11-11
548-575: LGTM!server/src/services/heartbeat.ts (1)
239-239: LGTM!Also applies to: 1499-1544
| { | ||
| ...deadRun("run-2"), | ||
| livenessState: "advanced", | ||
| usageJson: { | ||
| ...deadRun("run-2").usageJson, | ||
| inputTokens: 12, | ||
| outputTokens: 3, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Clarify the test by updating raw token counts or removing the token override.
The test overrides inputTokens and outputTokens to non-zero values, but leaves rawInputTokens and rawOutputTokens at zero (inherited from the spread). Since the circuit-breaker check returns false as soon as livenessState !== "failed" (per context snippet 3), the token changes are redundant and create potential confusion about which condition breaks the streak.
Consider either:
- Removing the token overrides entirely and relying solely on
livenessState: "advanced"to break the streak (clearer intent), or - Also updating
rawInputTokensandrawOutputTokensto match the normalized counts (more realistic run).
Option 1: Remove redundant token overrides
{
...deadRun("run-2"),
livenessState: "advanced",
- usageJson: {
- ...deadRun("run-2").usageJson,
- inputTokens: 12,
- outputTokens: 3,
- },
},Option 2: Update raw token counts for realism
{
...deadRun("run-2"),
livenessState: "advanced",
usageJson: {
...deadRun("run-2").usageJson,
inputTokens: 12,
+ rawInputTokens: 12,
outputTokens: 3,
+ rawOutputTokens: 3,
},
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| { | |
| ...deadRun("run-2"), | |
| livenessState: "advanced", | |
| usageJson: { | |
| ...deadRun("run-2").usageJson, | |
| inputTokens: 12, | |
| outputTokens: 3, | |
| }, | |
| }, | |
| { | |
| ...deadRun("run-2"), | |
| livenessState: "advanced", | |
| }, |
| { | |
| ...deadRun("run-2"), | |
| livenessState: "advanced", | |
| usageJson: { | |
| ...deadRun("run-2").usageJson, | |
| inputTokens: 12, | |
| outputTokens: 3, | |
| }, | |
| }, | |
| { | |
| ...deadRun("run-2"), | |
| livenessState: "advanced", | |
| usageJson: { | |
| ...deadRun("run-2").usageJson, | |
| inputTokens: 12, | |
| rawInputTokens: 12, | |
| outputTokens: 3, | |
| rawOutputTokens: 3, | |
| }, | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/src/__tests__/heartbeat-workspace-session.test.ts` around lines 582 -
590, The test is confusing because a run created via deadRun("run-2") sets
rawInputTokens/rawOutputTokens to zero while the override sets
inputTokens/outputTokens to non-zero; update the test to be explicit: either
remove the usageJson.inputTokens and usageJson.outputTokens overrides so the
break in the streak is clearly due to livenessState: "advanced", or make the
usageJson consistent by also setting rawInputTokens and rawOutputTokens to match
the normalized inputTokens/outputTokens; adjust the object created from
deadRun("run-2") accordingly (references: deadRun, livenessState, usageJson,
rawInputTokens, rawOutputTokens, inputTokens, outputTokens).
| expect(decision.rotate).toBe(false); | ||
| expect(decision.reason).toBeNull(); | ||
| }); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Assert previousRunId to verify the full contract.
When the consecutive streak is broken, the circuit breaker still returns previousRunId pointing to the latest dead run encountered before the break. Per context snippet 2, this should be "run-3" in this test case. Adding this assertion would provide more complete contract verification.
✅ Add previousRunId assertion
expect(decision.rotate).toBe(false);
expect(decision.reason).toBeNull();
+ expect(decision.previousRunId).toBe("run-3");
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| expect(decision.rotate).toBe(false); | |
| expect(decision.reason).toBeNull(); | |
| }); | |
| expect(decision.rotate).toBe(false); | |
| expect(decision.reason).toBeNull(); | |
| expect(decision.previousRunId).toBe("run-3"); | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/src/__tests__/heartbeat-workspace-session.test.ts` around lines 595 -
597, The test currently asserts decision.rotate and decision.reason but omits
verifying decision.previousRunId; add an assertion to check the full contract by
asserting decision.previousRunId equals "run-3" (e.g., add
expect(decision.previousRunId).toBe("run-3") right after the existing
expect(decision.reason).toBeNull()); reference the test subject `decision` and
the property `previousRunId`.
| const deadSessionCircuitBreaker = detectDeadReusedSessionCircuitBreaker({ | ||
| sessionId, | ||
| runs, | ||
| }); | ||
| let reason: string | null = deadSessionCircuitBreaker.reason; | ||
| let previousRunId = deadSessionCircuitBreaker.previousRunId ?? latestRun?.id ?? null; |
There was a problem hiding this comment.
Clear the old session fallback when the circuit breaker rotates.
This new rotate path only suppresses the session for the current adapter invocation. resolveNextSessionState() later still receives the original previousSessionParams, so any adapter that does not emit explicit session metadata will persist the dead session back into sessionIdAfter/agentTaskSessions, which defeats the breaker on the next wake.
Suggested fix
- const previousSessionParams =
+ const previousSessionParams =
explicitResumeSessionParams ??
(explicitResumeSessionDisplayId ? { sessionId: explicitResumeSessionDisplayId } : null) ??
normalizeSessionParams(sessionCodec.deserialize(taskSessionForRun?.sessionParamsJson ?? null));
+ let sessionFallbackParams = previousSessionParams;
...
if (sessionCompaction.rotate) {
context.paperclipSessionHandoffMarkdown = sessionCompaction.handoffMarkdown;
context.paperclipSessionRotationReason = sessionCompaction.reason;
context.paperclipPreviousSessionId = previousSessionDisplayId ?? runtimeSessionIdForAdapter;
runtimeSessionIdForAdapter = null;
runtimeSessionParamsForAdapter = null;
previousSessionDisplayId = null;
+ sessionFallbackParams = null;
if (sessionCompaction.reason) {
runtimeWorkspaceWarnings.push(
`Starting a fresh session because ${sessionCompaction.reason}.`,
);
}
@@
const nextSessionState = resolveNextSessionState({
codec: sessionCodec,
adapterResult,
- previousParams: previousSessionParams,
+ previousParams: sessionFallbackParams,
previousDisplayId: runtimeForAdapter.sessionDisplayId,
previousLegacySessionId: runtimeForAdapter.sessionId,
});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@server/src/services/heartbeat.ts` around lines 3575 - 3580, The
circuit-breaker rotation only suppresses the session for the current invocation
but leaves the old fallback in place, so resolveNextSessionState() still uses
the original previousSessionParams (persisting the dead session into
sessionIdAfter/agentTaskSessions). Update the handling around
detectDeadReusedSessionCircuitBreaker (deadSessionCircuitBreaker, reason,
previousRunId) so that when the circuit indicates a rotate/suppression you clear
the previous-session fallback passed downstream (e.g., set previousRunId and any
previousSessionParams to null or remove the previousSessionParams object) before
calling resolveNextSessionState() — ensure this change touches the variables
named deadSessionCircuitBreaker, previousRunId, previousSessionParams, and any
code that sets sessionIdAfter/agentTaskSessions so the dead session cannot be
reintroduced on the next wake.
Summary
Tests
/home/olive/.hermes/node/bin/pnpm exec vitest run server/src/__tests__/heartbeat-workspace-session.test.ts— 41 passed/home/olive/.hermes/node/bin/pnpm --filter @paperclipai/server typecheck— passedFixes Paperclip GRA-1736.