Fix watcher degradation on watch exhaustion and prolonged lock contention #877

Closed
thismilktea wants to merge 3 commits into colbymchenry:main from thismilktea:fix/watcher-degrade-lock-contention

Conversation

@thismilktea

Contributor

Closes #876

Summary

This PR hardens the live file watcher around two reliability failure modes:

  1. Watch-resource exhaustion (EMFILE / ENFILE) now disables live watching cleanly instead of leaving the watcher half-broken.
  2. Prolonged sync lock contention (LockUnavailableError) no longer retries forever at the normal debounce cadence; it now uses bounded backoff and eventually degrades auto-sync explicitly.

The goal is to fail closed and clearly once live watching is no longer trustworthy, while preserving the current behavior for normal edits and short-lived contention.

What changed

Watch exhaustion

  • Detects watch-resource exhaustion more explicitly
  • Degrades/stops the watcher instead of just logging forever
  • Emits a single actionable warning
  • Adds an onDegraded callback so callers can observe permanent watcher degradation

Lock contention

  • Keeps the current quiet behavior for brief lock contention
  • Adds bounded retry backoff for repeated LockUnavailableError
  • Stops infinite normal-debounce retries under long-lived contention
  • Degrades auto-sync after the retry threshold is crossed
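The retry policy described in the list above can be sketched as a small pure function. This is only an illustration under stated assumptions: the constants, the `nextLockRetry` name, and the `RetryDecision` type are hypothetical and do not come from the PR, and the doubling schedule is assumed from the "bounded exponential-backoff" wording of the follow-up commit message.

```typescript
// Hypothetical constants; the PR does not state the actual values.
const BASE_RETRY_MS = 500;
const MAX_RETRY_MS = 30_000;
const MAX_LOCK_RETRIES = 6;

type RetryDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "degrade"; reason: string };

// `attempt` is 1 for the first consecutive LockUnavailableError.
function nextLockRetry(attempt: number): RetryDecision {
  if (attempt > MAX_LOCK_RETRIES) {
    // Retry budget spent: give up explicitly instead of retrying forever.
    return {
      kind: "degrade",
      reason: `sync lock unavailable after ${MAX_LOCK_RETRIES} retries`,
    };
  }
  // Double the delay on each consecutive failure, capped at MAX_RETRY_MS.
  const delayMs = Math.min(BASE_RETRY_MS * 2 ** (attempt - 1), MAX_RETRY_MS);
  return { kind: "retry", delayMs };
}
```

A caller would presumably reset its attempt counter after any successful sync, so that brief contention never accumulates toward the degrade threshold.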

Internal cleanup

  • Separates normal debounce scheduling from retry scheduling
  • Tightens exhaustion detection so message matching is only used as a fallback when no err.code is available
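As a rough sketch of the "code first, message as fallback" rule for exhaustion detection: the helper name `isWatchExhaustion` and the message patterns below are assumptions for illustration, not the PR's actual implementation.

```typescript
// Hypothetical helper; the name and message patterns are assumptions.
const EXHAUSTION_CODES = new Set(["EMFILE", "ENFILE"]);

function isWatchExhaustion(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as { code?: unknown; message?: unknown };
  // Prefer the structured error code whenever one is present.
  if (typeof code === "string" && code.length > 0) {
    return EXHAUSTION_CODES.has(code);
  }
  // Fall back to message matching only when no code is available.
  return (
    typeof message === "string" &&
    /\b(EMFILE|ENFILE)\b|too many open files/i.test(message)
  );
}
```

With this ordering, an error that carries an unrelated code is not misclassified just because its message happens to mention open files.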

Why

Before this change, the watcher could remain "alive" after it had effectively stopped being trustworthy:

  • EMFILE / ENFILE could leave live watching unusable without a clean degraded/off transition
  • Prolonged lock contention could keep the watcher retrying forever with no terminal state
  • Callers could continue assuming auto-sync was still working even while the index drifted stale

This is especially problematic for long-running MCP/daemon sessions.

Tests

Added / extended watcher tests for:

  • Startup watch exhaustion
  • Runtime recursive watcher exhaustion
  • Prolonged LockUnavailableError degradation
  • Degraded-state callback notification

Verified with:

npx vitest run __tests__/watcher.test.ts __tests__/watch-policy.test.ts

colbymchenry added a commit that referenced this pull request Jun 15, 2026
…contention (#891)

The live file watcher could stay "alive" after it had stopped being
trustworthy. EMFILE/ENFILE watch-resource exhaustion only logged (and was
silently tolerated on the Linux per-directory path), and prolonged
LockUnavailableError retried forever at the normal debounce cadence — both
left auto-sync dead while the index silently drifted stale. Especially bad
for long-running MCP/daemon sessions.

Add a one-way degrade(): on watch-resource exhaustion (any watch strategy)
or on lock contention past a bounded exponential-backoff budget, log once,
fire a new onDegraded callback, and stop. start() now returns false
consistently when the per-directory path degrades at startup — it previously
returned true on Linux, so the MCP server reported the watcher "active" when
it had degraded. Wire onDegraded into the MCP server so callers are actually
told, and expose isDegraded()/getDegradedReason().
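A minimal sketch of what such a one-way degrade() could look like. Only the names degrade, onDegraded, isDegraded and getDegradedReason come from the commit message; the class itself, its constructor shape, and the log text are hypothetical, and closing the underlying watch handles is omitted.

```typescript
type DegradedCallback = (reason: string) => void;

// Hypothetical class for illustration; not the project's actual watcher.
class DegradableWatcher {
  private degradedReason: string | null = null;
  private readonly onDegraded?: DegradedCallback;

  constructor(onDegraded?: DegradedCallback) {
    this.onDegraded = onDegraded;
  }

  // One-way transition: the first call wins, later calls are no-ops,
  // so the warning and the callback each fire at most once.
  degrade(reason: string): void {
    if (this.degradedReason !== null) return;
    this.degradedReason = reason;
    console.warn(`watcher degraded: ${reason}`);
    this.onDegraded?.(reason);
  }

  isDegraded(): boolean {
    return this.degradedReason !== null;
  }

  getDegradedReason(): string | null {
    return this.degradedReason;
  }
}
```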

Builds on the approach in #877 by @thismilktea. Validated on macOS
(recursive), Linux (per-directory, Docker) and Windows (recursive) — 30/30
watcher + watch-policy tests on each.

Closes #876

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@colbymchenry

Owner

Thanks for this, @thismilktea — solid diagnosis and the degrade/backoff approach was the right one. I've merged it into main (with a couple of corrections) as #891, and credited you in the changelog.

Two things I adjusted on top of your branch:

  1. Linux start() consistency. On the per-directory watch path (Linux), a startup exhaustion degraded the watcher but start() still returned true, so the MCP server would report the watcher "active" on a watcher that had just disabled itself, and the new "should not start when fs.watch setup exhausts" test would fail there. Both watch strategies now return false consistently (verified natively in Docker).
  2. onDegraded wiring. The callback wasn't consumed anywhere, so MCP/daemon callers still weren't told. It's now wired into the MCP server (File watcher degraded — …).
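To illustrate the two adjustments together, here is a sketch of a startup path that returns false and notifies the caller when watch setup is exhausted. It is an assumption-laden example: `startWatching` and `WatchFn` are hypothetical names, and `WatchFn` stands in for fs.watch so the sketch runs without touching the filesystem.

```typescript
// Hypothetical stand-in for fs.watch: returns a closable handle or throws.
type WatchFn = (dir: string) => { close(): void };

function startWatching(
  dirs: string[],
  watch: WatchFn,
  onDegraded: (reason: string) => void,
): boolean {
  const handles: { close(): void }[] = [];
  for (const dir of dirs) {
    try {
      handles.push(watch(dir));
    } catch (err) {
      const code = (err as { code?: string } | null)?.code;
      if (code === "EMFILE" || code === "ENFILE") {
        // Release what was opened, notify once, and report "not started".
        for (const h of handles) h.close();
        onDegraded(`watch resources exhausted (${code})`);
        return false;
      }
      throw err; // Unrelated failures are not treated as degradation.
    }
  }
  return true;
}
```

A caller such as a server process could then base its reported watcher status on the boolean result and surface the degraded reason from the callback.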

Also validated the recursive path on Windows (Parallels) in addition to macOS/Linux. Closing in favor of #891 — thanks again for driving this. 🙏

Development

Successfully merging this pull request may close these issues.

Watcher can stay half-broken after watch exhaustion or prolonged lock contention

2 participants