From f92d4d039806cf40ca2e646426cf65b9cacc7a95 Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 11:05:24 -0400 Subject: [PATCH 1/2] Propose telemetry-migration-baseline: durable analytics outbox, session baseline, and forward-compatible UI-framework marker Grounds a prior strategic-review recommendation (accrue a Legacy-UI telemetry baseline ahead of the Avalonia rollout) against actual code on origin/main via grill-with-docs. Adds the openspec change (proposal/design/specs/tasks) plus grounded Analytics/Telemetry terminology and ADR decisions in CONTEXT.md. No production code yet; this is planning artifacts only. Co-Authored-By: Claude Sonnet 5 --- CONTEXT.md | 12 +++ .../.openspec.yaml | 2 + .../telemetry-migration-baseline/design.md | 88 +++++++++++++++++++ .../telemetry-migration-baseline/proposal.md | 36 ++++++++ .../specs/telemetry-outbox/spec.md | 58 ++++++++++++ .../specs/telemetry-session-baseline/spec.md | 43 +++++++++ .../specs/telemetry-usage-enrichment/spec.md | 36 ++++++++ .../telemetry-migration-baseline/tasks.md | 45 ++++++++++ 8 files changed, 320 insertions(+) create mode 100644 openspec/changes/telemetry-migration-baseline/.openspec.yaml create mode 100644 openspec/changes/telemetry-migration-baseline/design.md create mode 100644 openspec/changes/telemetry-migration-baseline/proposal.md create mode 100644 openspec/changes/telemetry-migration-baseline/specs/telemetry-outbox/spec.md create mode 100644 openspec/changes/telemetry-migration-baseline/specs/telemetry-session-baseline/spec.md create mode 100644 openspec/changes/telemetry-migration-baseline/specs/telemetry-usage-enrichment/spec.md create mode 100644 openspec/changes/telemetry-migration-baseline/tasks.md diff --git a/CONTEXT.md b/CONTEXT.md index 7a51bcd3a9..4035a9cdc0 100644 --- a/CONTEXT.md +++ b/CONTEXT.md @@ -89,8 +89,20 @@ It is intentionally not a full architecture manual. It should stay biased toward - **Ambiguous feedback**: A reviewer request that lacks enough context, conflicts with another requirement, or would require a product or architecture decision. Ask the user before changing code. - **Reply**: A response in the review conversation explaining a fix, asking a question, or giving technical reasoning for no code change. - **Resolve**: Marking a review thread addressed in GitHub. Do this only after the code or reply fully answers the thread. +## Analytics and Telemetry (WinForms app) +- **Analytics (DesktopAnalytics)**: The `SIL.DesktopAnalytics` package (v4.0.0, Mixpanel client) FieldWorks wraps for usage/crash telemetry. Constructed once per process in `Src/Common/FieldWorks/FieldWorks.cs` (`new Analytics(...)`) inside a `using` block spanning the whole app lifetime; not re-initializable without a restart. +- **`OkToPingBasicUsageData`**: The user consent flag gatekeeping all analytics. Type `SIL.Reporting.ReportingSettings.OkToPingBasicUsageData` (external type from SIL.Core, not defined in this repo), persisted as a .NET user-scoped setting (not the registry), surfaced in Tools > Options > Privacy. A toggle only takes effect after restart, since the `Analytics` singleton is constructed once at startup. +- **Application property** (package term, sometimes called a "super property" elsewhere): `DesktopAnalytics.Analytics.SetApplicationProperty(key, value)` — attaches a key/value to every subsequent `Track`/`ReportException` call for the process's lifetime. Already exists in the package; FieldWorks calls it nowhere today. This is the seam for any new cross-cutting telemetry dimension (e.g. a UI-architecture marker), not a mechanism that needs to be built. +- **Track / event**: `Analytics.Track(eventName, propertiesDict)`. Existing FieldWorks events: `SwitchToTool` (`area`+`tool` only, throttled to once per tool per day, in `AreaListener.cs`), `TrackingHelper`'s `TrackImport`/`TrackExport`/`TrackHelpRequest` (~10 call sites), and a few standalone `Track` calls (`ExportConcordanceResults`, `ConfigureInterlinear`, `CreateFromSRRepo`). +- **Exception report**: `Analytics.ReportException(exception, moreProperties)`, wired to both `Application.ThreadException` and `AppDomain.UnhandledException` in `FieldWorks.cs`. Capped at 10 reports per process run by the package itself. +- **Fire-and-forget delivery**: DesktopAnalytics has no local persistence, queue, or retry. A `Track`/`ReportException` call that can't reach Mixpanel (no network) is silently dropped (`Debug.WriteLine` only) — there is no "offline batching" anywhere in this package or repo today, despite that having been assumed in earlier planning conversation. This is load-bearing for any plan that wants event *counts* to mean something for a userbase that is frequently offline. +- **Mixpanel key**: Currently a hardcoded literal in `FieldWorks.cs` (`#if DEBUG` dev key vs. release key) — not sourced from config, environment, or build-time substitution. +- **UI mode / Avalonia surfaces**: Do not exist on `origin/main`. All `UIMode`/`LexicalEditSurfaceResolver`/Avalonia code lives on other branches (e.g. `phase1-base`) and is not present here. Telemetry work done against `origin/main` that anticipates a future Legacy/Avalonia split must be forward-compatible scaffolding (e.g. a constant-valued property today), not code that reads a UIMode value that doesn't exist yet. ## Remaining Open Question - Should this root file stay mixed product-plus-architecture, or should lower-level developer terms move into narrower subtree contexts later? ## ADR Candidates - No repository-wide ADR location is established yet. - If a terminology decision becomes hard to reverse or affects naming across many files, record it here first and promote it to a formal ADR only if the repo adopts a dedicated ADR convention. +- **Analytics offline-loss acceptance — DECIDED (2026-07-02)**: build a durable local retry/queue layer for `Analytics.Track`/`ReportException` rather than accept the current lossy fire-and-forget behavior. No existing seam for this — it is new infrastructure (a local persisted outbox that survives process restarts and retries when a network is available) sitting in front of the `Analytics` calls, since the underlying `DesktopAnalytics` package itself does not support this. +- **Forward-compatible UI-architecture marker — DECIDED (2026-07-02)**: add `Analytics.SetApplicationProperty("UiFramework", "WinForms")` now, constant-valued, via the existing unused seam. The eventual Avalonia branch flips the value per-surface; no code here depends on UIMode existing. +- **Hardcoded Mixpanel keys — DECIDED (2026-07-02)**: out of scope for this change. Pre-existing secrets-hygiene issue, unrelated to telemetry enrichment; noted as a discovered finding only. diff --git a/openspec/changes/telemetry-migration-baseline/.openspec.yaml b/openspec/changes/telemetry-migration-baseline/.openspec.yaml new file mode 100644 index 0000000000..8e26fbece7 --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/.openspec.yaml @@ -0,0 +1,2 @@ +schema: spec-driven +created: 2026-07-02 diff --git a/openspec/changes/telemetry-migration-baseline/design.md b/openspec/changes/telemetry-migration-baseline/design.md new file mode 100644 index 0000000000..b5fadadbd9 --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/design.md @@ -0,0 +1,88 @@ +## Context + +FieldWorks wraps `SIL.DesktopAnalytics` v4.0.0 (Mixpanel client) with a single `Analytics` instance constructed once at startup (`Src/Common/FieldWorks/FieldWorks.cs`) inside a `using` block spanning the app's lifetime, gated by `FwApplicationSettings.Reporting.OkToPingBasicUsageData`. Six files call `Analytics.Track`/`Analytics.ReportException` directly today: `FieldWorks.cs` (exceptions), `AreaListener.cs` (`SwitchToTool`), `TrackingHelper.cs` (import/export/help), `ConcordanceContainer.cs`, `ConfigureInterlinDialog.cs`, and `ObtainProjectMethod.cs`. The package itself has no local persistence: a `Track`/`ReportException` call that can't reach Mixpanel is silently dropped. `DesktopAnalytics` also already exposes an unused `Analytics.SetApplicationProperty(key, value)` mechanism (an "application property" attached to every subsequent event) and, per decompilation, an internal `AllowTracking` gate that mirrors the consent flag and a per-process cap of 10 exception reports (`MAX_EXCEPTION_REPORTS_PER_RUN`). + +FieldWorks explicitly supports multiple simultaneous `FieldWorks.exe` processes (one project can be open per process; see the `CreateRemoteRequestListener` comment in `FieldWorks.cs` noting a cross-process mutex was considered and rejected in favor of a listener). Any new local persistence shared across processes must not assume single-process exclusivity. + +## Goals / Non-Goals + +**Goals:** +- Make analytics delivery durable across restarts and offline periods (`telemetry-outbox`). +- Add session-level baseline metrics: duration and crash-vs-clean outcome (`telemetry-session-baseline`). +- Add a forward-compatible, constant-valued `UiFramework` property and per-tool dwell time (`telemetry-usage-enrichment`). +- Keep the existing consent flag (`OkToPingBasicUsageData`) as the sole authority over whether any of this data is ever queued or sent. + +**Non-Goals:** +- Building a general-purpose local message queue framework for reuse elsewhere in FieldWorks — this is scoped to analytics delivery only. +- Cross-process locking/mutex infrastructure — the design avoids needing it (see Decisions). +- Any code that reads, branches on, or depends on `UIMode`/Avalonia concepts, none of which exist on this branch. +- Changing the Mixpanel API key handling (pre-existing, out of scope per proposal). + +## Decisions + +### D1 — One file per queued event, not one shared queue file +Each queued event (a `Track` call's event name + properties, or a `ReportException` call's exception data) is written as its own small JSON file in an outbox directory, named `{UtcTicks:D19}_{Guid:N}.json`. FIFO order falls out of a plain filename sort — no separate index/manifest file is needed, and there is nothing to corrupt by a torn write to a shared file. + +*Alternative considered*: a single append-only JSONL file. Rejected — it requires either a lock file or careful append semantics to be safe across the multi-process scenario above, and a torn write (crash mid-append) risks corrupting the tail record. + +### D2 — Storage location: `FwDirectoryFinder.UserAppDataFolder("FieldWorks")` +This resolves to `%LocalAppData%\SIL\FieldWorks\` today (`Environment.SpecialFolder.LocalApplicationData` + company/app name), matching the one existing precedent for this exact helper (`XCore/XMessageBoxExManager.cs`'s `DialogResponses.xml`, which also does `Directory.CreateDirectory(path)` before first write). The outbox lives in an `Analytics\Outbox\` subdirectory of that path. This is per-user, consistent with the consent flag being a per-user setting, and is not the `.NET` `user.config` settings store (that's a distinct, unrelated mechanism already used for `OkToPingBasicUsageData` itself). + +### D3 — All telemetry funnels through one new facade; outbox is the only thing that calls `Analytics.Track`/`ReportException` +A new `AnalyticsOutbox` class (`Src/Common/FwUtils/AnalyticsOutbox.cs`) exposes `Track(eventName, properties)` and `ReportException(exception, moreProperties)` with signatures matching the package's own methods. Every existing call site (`FieldWorks.cs`, `AreaListener.cs`, `TrackingHelper.cs`, `ConcordanceContainer.cs`, `ConfigureInterlinDialog.cs`, `ObtainProjectMethod.cs`) is migrated to call `AnalyticsOutbox` instead of `Analytics` directly. `AnalyticsOutbox.Track`/`ReportException`: +1. Check consent (see D7). If not consented, do nothing — no file is written. +2. Serialize the call to a new outbox file (D1). +3. Fire a best-effort, non-blocking attempt to flush just that new file immediately (so the common case — online — still delivers with negligible added latency). + +This makes "durable by default" the only code path; there is no separate "send immediately, fall back to queue on failure" branch to keep in sync. + +*Alternative considered*: keep direct `Analytics.Track` calls for the common case and only route through a queue on a caught delivery failure. Rejected — `Analytics.Track` is fire-and-forget internally (an async call whose result isn't awaited by the caller), so callers have no reliable failure signal to branch on in the first place; funneling everything through the outbox sidesteps that entirely. + +### D4 — Cross-process claim via atomic file rename, not a lock manager +Before sending a queued file (whether it's the one just written or one found during a startup sweep), the flush routine attempts to rename it in place to a `.inflight` suffix. `File.Move` is atomic on the same volume and fails if the source is already gone or the destination exists — so a successful rename is a cheap, sufficient claim that no other process is (or already has) handled this file. A failed rename means another process claimed it first; skip and move on. `RetryUtility.Retry` (SIL.Core) wraps only the transient-IO case (e.g. a momentary AV/indexer lock), not the claim race itself, which the rename semantics already resolve. + +### D5 — Delivery success = delete the claimed file +After a successful `Analytics.Track`/`ReportException` call (the real package method, called only from inside the flush routine), the `.inflight` file is deleted. If the underlying send throws, the `.inflight` file is renamed back to its original name so a later sweep retries it. This gives at-least-once delivery, not exactly-once: a crash between a successful send and the delete would cause a duplicate on the next sweep. Accepted — see Risks. + +### D6 — Bounded growth: cap by count and age, evict oldest first +The outbox enforces a cap (default: 2,000 files or 30 days of age, whichever is hit first) during every sweep, deleting the oldest excess files without attempting to send them. Exact numbers are tunable constants, not a hard requirement of the spec. + +### D7 — Consent is re-checked at flush time, not just enqueue time +If a user revokes consent after an event was queued but before it's flushed, the flush routine re-checks consent for each file just before claiming it and deletes (without sending) any file found while consent is currently off. This fails closed on privacy: once consent is off, nothing already queued goes out, even though it was consented-to at write time. `AnalyticsOutbox` consults `Analytics.AllowTracking` if it proves to be accessible (verify at implementation time — see Open Questions); otherwise it falls back to reading `FwApplicationSettings.Reporting.OkToPingBasicUsageData` directly, understanding that path won't reflect the `FEEDBACK` environment-variable override applied once at `Analytics` construction. + +### D8 — Queued exceptions replay as `Track`, not `ReportException` +Replaying a queued exception through the real `Analytics.ReportException` at flush time would count against that run's `MAX_EXCEPTION_REPORTS_PER_RUN = 10` cap, potentially starving budget meant for exceptions actually happening in the current run. Instead, `AnalyticsOutbox` replays queued exception records via `Analytics.Track("QueuedExceptionReport", properties)`, reconstructing the same property shape (`Message`, `Stack Trace`, plus whatever `moreProperties` were captured) that `ReportException` would have sent, so the Mixpanel-side data stays structurally comparable without consuming the live run's cap. + +### D9 — Flush triggers: enqueue-time, startup sweep, and a bounded shutdown sweep +Three flush triggers, no persistent background poller: +1. Immediately after any `AnalyticsOutbox.Track`/`ReportException` call (fire-and-forget, covers the common "online" case with minimal latency). +2. A full directory sweep on startup, after the `Analytics` instance and consent state are known — this is what delivers everything accumulated while offline. +3. A best-effort, time-bounded (≤2s) sweep at shutdown, so events generated late in a session don't have to wait for the next launch. Consistent with the existing package's own `ShutDown()` pattern (up to 7.5s polling for in-flight tasks). + +*Alternative considered*: a network-change-event listener or a periodic timer while running. Rejected as unnecessary complexity — FieldWorks sessions are typically short enough, and startup/shutdown sweeps cover the realistic offline-then-online transition (user closes laptop, travels, reopens FieldWorks later). + +### D10 — Session identity, correlation, and the single-terminal-event guard +A `SessionId` (`Guid.NewGuid()`) is generated once at startup and stored for the process's lifetime. Session-start is tracked (via `AnalyticsOutbox.Track`) right after the `Analytics` instance is constructed. Session-end is tracked from exactly one of two places — the normal fall-through at the end of the `using (new Analytics(...))` block (outcome `clean`), or `HandleUnhandledException`/`HandleTopLevelError` (outcome `crashed`) — guarded by `Interlocked.CompareExchange` on a static flag so whichever path runs first wins and the other is suppressed. Duration is `DateTime.UtcNow` at the reporting point minus the recorded start time. + +### D11 — SwitchToTool dwell time via a stored "last activation" timestamp +`AreaListener` already has the property-changed handler that fires `SwitchToTool` on tool changes (throttled to once per tool per day). A new field records `DateTime.UtcNow` every time the active tool changes, regardless of whether the throttle allows an event to fire. When the (already-throttled) `SwitchToTool` event does fire, it includes a `duration` property computed from that stored timestamp — i.e., the dwell time since the tool was last activated, not a running total across a whole day (see Risks: this is a deliberate simplification consistent with the existing once-per-day throttle, not a bug). At application shutdown, if a tool is currently active, one additional dwell record is tracked for that final stretch (via `AnalyticsOutbox.Track`, same event shape) so it isn't silently lost. + +## Risks / Trade-offs + +- **[Risk] At-least-once delivery can double-count events** if a process crashes between a successful Mixpanel send and the local file delete → **Mitigation**: accepted trade-off; strictly better than today's silent-loss behavior, and Mixpanel-side usage analysis (adoption trends, crash rates) tolerates small duplicate noise far better than missing data. +- **[Risk] Six existing call sites must be migrated to the new facade in one mechanical pass** → **Mitigation**: `AnalyticsOutbox.Track`/`ReportException` match `Analytics.Track`/`ReportException`'s signatures exactly, so the migration is a pure substitution; existing tests around those call sites (where they exist) continue to apply. +- **[Risk] Cross-process sweep introduces a new race not present today** (two processes both attempt the same file) → **Mitigation**: the atomic-rename claim (D4) resolves the race to "one wins, one skips," never a crash or corruption. +- **[Risk] `SwitchToTool` dwell time reflects only the most recent stretch in a tool, not cumulative time across a whole day** (a consequence of preserving the existing once-per-day throttle, per spec) → **Mitigation**: documented here and in the spec; if cumulative dwell time is later wanted, that's a throttle-semantics change outside this proposal's scope. +- **[Risk] `Analytics.AllowTracking` may not be a public member** → **Mitigation**: fallback to `FwApplicationSettings.Reporting.OkToPingBasicUsageData` is specified in D7; either way, consent is re-checked at flush time. +- **[Risk] Outbox directory could grow unexpectedly large if Mixpanel is unreachable for a very long time** → **Mitigation**: D6's count/age cap bounds worst-case disk usage to a small, fixed footprint. + +## Migration Plan + +No user-facing or data-model migration: the outbox directory is new local state with no prior version to migrate from. Rollback (reverting this change) is safe — any leftover `Analytics\Outbox\` directory is inert and can be deleted or ignored; it is never read by anything except the code introduced here. Deployment is a normal FieldWorks release; no installer changes are needed since the directory is created on first use via `Directory.CreateDirectory`, matching the existing `XMessageBoxExManager` precedent. + +## Open Questions + +- Whether `Analytics.AllowTracking` (found via decompilation of `DesktopAnalytics.dll` v4.0.0) is accessible outside the package, or whether `AnalyticsOutbox` must rely solely on `FwApplicationSettings.Reporting.OkToPingBasicUsageData` — verify during implementation (task-level, not blocking). +- Exact `RobustFile`/`RobustIO` (SIL.Core.Desktop) method names available for the write/delete/rename operations in D1/D4/D5 — verify against the currently installed `SIL.Core.Desktop` version; fall back to plain `System.IO` calls wrapped in `RetryUtility.Retry` if a needed method isn't present. +- Whether `SessionId` should additionally ride along as an `Analytics.SetApplicationProperty("SessionId", ...)` so every other event (not just session-start/end) is automatically correlated to a session — not required by the spec as written; left as an implementation-time judgment call since it doesn't change any requirement's observable behavior. +- Final tunable values for D6's cap (2,000 files / 30 days are illustrative defaults) — open to adjustment without a spec change. diff --git a/openspec/changes/telemetry-migration-baseline/proposal.md b/openspec/changes/telemetry-migration-baseline/proposal.md new file mode 100644 index 0000000000..cef7986e05 --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/proposal.md @@ -0,0 +1,36 @@ +## Why + +A future Avalonia UI rewrite (in progress on a separate, unmerged branch) will let users opt into a new UI per-surface, and that effort will need a "Legacy baseline" of usage/crash/session data to judge whether the new UI is healthy relative to today's WinForms app. FieldWorks already ships `SIL.DesktopAnalytics` (Mixpanel-backed), gated by the `OkToPingBasicUsageData` consent flag, but it tracks only sparse per-tool events, has no session concept, and silently drops any event it can't deliver immediately — so the baseline that later work will need to compare against does not exist yet, and every day this ships without it is baseline data that can never be recovered. This change adds that baseline instrumentation entirely within the current WinForms app, with no dependency on the unmerged Avalonia work. + +## What Changes + +- Add a durable local outbox for `Analytics.Track`/`Analytics.ReportException` calls: events issued while offline are persisted locally and retried once a network path to Mixpanel exists, instead of being silently dropped by the underlying package's fire-and-forget behavior. +- Call the existing (currently unused) `Analytics.SetApplicationProperty("UiFramework", "WinForms")` once at startup, so every event and exception report is segmentable by UI framework the moment a future second value exists. The value is a constant today; no code here reads or depends on any UIMode concept, since none exists in this codebase yet. +- Add session-level tracking: a session-start event at launch and a session-end event at shutdown, sufficient to compute session duration and a crash-free-session rate (a session ending via the unhandled-exception path must be distinguishable from a clean shutdown). +- Enrich the existing `SwitchToTool` event with per-tool dwell time (time spent in a tool before switching away or app exit), without changing its existing `area`/`tool` properties or its once-per-tool-per-day throttle. + +## Non-Goals + +- Not changing the hardcoded Mixpanel API keys in `FieldWorks.cs` — a pre-existing, unrelated secrets-hygiene issue noted but explicitly out of scope. +- Not adding, reading, or depending on `UIMode`, `LexicalEditSurfaceResolver`, or any Avalonia-branch-only concept. No such code exists on this branch; any anticipated future segmentation is represented as a constant value today, not a live branch. +- Not defining the Avalonia-side "bake metric" thresholds, rollout ladder, or beta-cohort mechanics — this change only produces the baseline data those future decisions would consume. +- Not changing what the consent flag means or how it's surfaced in the UI; it remains the sole authority for whether any telemetry (new or existing) is sent. +- Not building generic/reusable analytics abstractions beyond what these four capabilities need. + +## Capabilities + +### New Capabilities +- `telemetry-outbox`: durable local persistence and retry for analytics delivery, sitting in front of the existing `DesktopAnalytics` client, gated by the existing consent flag. +- `telemetry-session-baseline`: session-start/session-end tracking with clean-vs-crash termination classification, enabling session duration and crash-free-session rate. +- `telemetry-usage-enrichment`: the `UiFramework` application property and `SwitchToTool` dwell-time enrichment. + +### Modified Capabilities +(none — no existing `openspec/specs/` capability covers analytics/telemetry today) + +## Impact + +- **Managed C# only** (`Src/Common/FieldWorks/`, `Src/Common/FwUtils/`, `Src/LexText/LexTextDll/`); no native (C++) changes. +- Touches the `Analytics` construction/shutdown path in `Src/Common/FieldWorks/FieldWorks.cs`, the `SwitchToTool` tracking in `Src/LexText/LexTextDll/AreaListener.cs`, and adds a new small outbox subsystem (new files, project TBD in design). +- Depends on the existing `SIL.DesktopAnalytics` v4.0.0 package and the existing `OkToPingBasicUsageData` consent flag; introduces no new external dependencies. +- New local on-disk state (the outbox store) is introduced; needs a documented location consistent with existing FieldWorks user-data conventions. +- Testable via `.\test.ps1`; no changes to `build.ps1` phase ordering expected. diff --git a/openspec/changes/telemetry-migration-baseline/specs/telemetry-outbox/spec.md b/openspec/changes/telemetry-migration-baseline/specs/telemetry-outbox/spec.md new file mode 100644 index 0000000000..caaabc852a --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/specs/telemetry-outbox/spec.md @@ -0,0 +1,58 @@ +## ADDED Requirements + +### Requirement: Consent-gated persistence +The outbox SHALL persist an analytics event or exception report locally only if `OkToPingBasicUsageData` is `true` at the moment the event is generated. It SHALL NOT accumulate events on disk while consent is off. + +#### Scenario: Consent disabled at event time +- **WHEN** an event or exception is generated and `OkToPingBasicUsageData` is `false` +- **THEN** the outbox does not write it to local storage and the event is dropped, matching today's behavior for non-consenting users + +#### Scenario: Consent enabled at event time +- **WHEN** an event or exception is generated and `OkToPingBasicUsageData` is `true`, but immediate delivery to Mixpanel is not possible +- **THEN** the outbox persists the event to local durable storage for later retry + +### Requirement: Durability across restarts +Events written to the outbox SHALL survive an application restart, so an event queued during one process run can still be delivered by a later run. + +#### Scenario: App restarts with pending events +- **WHEN** the outbox holds one or more undelivered events at the time the application exits or crashes +- **THEN** those events are still present in the outbox the next time the application starts + +### Requirement: Retry and flush on connectivity +The outbox SHALL attempt to deliver persisted events to Mixpanel via the existing `Analytics.Track`/`Analytics.ReportException` calls when a network path is available, without blocking the UI thread. + +#### Scenario: Network available on next launch +- **WHEN** the application starts and previously queued events are present, and Mixpanel is reachable +- **THEN** the outbox flushes the queued events in the background without blocking application startup or the UI thread + +#### Scenario: Network still unavailable +- **WHEN** a flush attempt is made and Mixpanel is not reachable +- **THEN** the events remain persisted for a later retry and no data is lost + +### Requirement: FIFO delivery order +The outbox SHALL flush persisted events in the order they were originally enqueued. + +#### Scenario: Multiple queued events flushed +- **WHEN** the outbox holds multiple persisted events accumulated across one or more offline sessions +- **THEN** it flushes them to Mixpanel in the same order they were originally recorded + +### Requirement: Bounded growth +The outbox SHALL cap total stored events (by count and/or age) to prevent unbounded local disk growth during extended offline periods. + +#### Scenario: Extended offline period exceeds the cap +- **WHEN** the number or age of persisted events exceeds the configured cap +- **THEN** the oldest events are evicted first and the outbox size remains within the cap + +### Requirement: Exception-report cap interaction +Exception reports flushed from the outbox SHALL be subject to the same per-process-run cap (currently 10 reports per run, enforced internally by `DesktopAnalytics`) as live exceptions generated during that run. + +#### Scenario: Flushing queued exceptions on a fresh run +- **WHEN** the outbox flushes previously queued exception reports during a run that also generates its own live exceptions +- **THEN** the combined total of flushed and live exception reports sent during that run does not exceed the package's existing per-run cap + +### Requirement: Non-blocking shutdown +Outbox flush attempts triggered at application shutdown SHALL be time-bounded so they do not materially delay the application from exiting. + +#### Scenario: Shutdown with a pending flush in progress +- **WHEN** the application begins shutting down while an outbox flush is in progress +- **THEN** the flush attempt is bounded by a timeout and the application exits without hanging, leaving any undelivered events persisted for the next run diff --git a/openspec/changes/telemetry-migration-baseline/specs/telemetry-session-baseline/spec.md b/openspec/changes/telemetry-migration-baseline/specs/telemetry-session-baseline/spec.md new file mode 100644 index 0000000000..a2f6fcfb5d --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/specs/telemetry-session-baseline/spec.md @@ -0,0 +1,43 @@ +## ADDED Requirements + +### Requirement: Session-start event +The application SHALL track a session-start event once per process launch, gated by the existing `OkToPingBasicUsageData` consent flag, carrying a session identifier unique to that launch. + +#### Scenario: App launches with consent enabled +- **WHEN** the application finishes startup and `OkToPingBasicUsageData` is `true` +- **THEN** a session-start event is tracked, carrying a session identifier not reused by any other launch + +### Requirement: Session-end event on clean shutdown +The application SHALL track a session-end event with a `clean` outcome when it exits through its normal shutdown path. + +#### Scenario: User closes the app normally +- **WHEN** the application shuts down through its normal exit path with no unhandled exception +- **THEN** a session-end event is tracked with outcome `clean`, correlated to the session-start event for that launch + +### Requirement: Session-end classification on crash +The application SHALL track a session-end event with a `crashed` outcome when termination occurs via the global unhandled-exception handling path, distinguishable from a clean shutdown. + +#### Scenario: Unhandled exception terminates the app +- **WHEN** `HandleUnhandledException` or `HandleTopLevelError` handles an exception that results in the application terminating +- **THEN** a session-end event is tracked with outcome `crashed`, correlated to the session-start event for that launch + +### Requirement: Session duration +The session-end event SHALL include the elapsed wall-clock duration since the corresponding session-start event. + +#### Scenario: Session lasts a measurable duration +- **WHEN** a session-end event is tracked +- **THEN** it carries a duration value reflecting the elapsed time since that session's session-start event + +### Requirement: Session correlation identifier +Session-start and session-end events belonging to the same process run SHALL share a common session identifier, and that identifier SHALL NOT be reused across different launches. + +#### Scenario: Multiple launches produce distinct sessions +- **WHEN** the application is launched, exited, and launched again +- **THEN** each launch's session-start and session-end events share one identifier, and the two launches' identifiers differ + +### Requirement: Single terminal event per session +At most one session-end event SHALL be recorded per session-start event; a crash-path session-end SHALL prevent a duplicate clean-path session-end from also firing for the same session. + +#### Scenario: Crash path already recorded the session end +- **WHEN** the crash-handling path has already tracked a `crashed` session-end event for the current session +- **THEN** no additional `clean` session-end event is tracked for that same session diff --git a/openspec/changes/telemetry-migration-baseline/specs/telemetry-usage-enrichment/spec.md b/openspec/changes/telemetry-migration-baseline/specs/telemetry-usage-enrichment/spec.md new file mode 100644 index 0000000000..4f6a118824 --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/specs/telemetry-usage-enrichment/spec.md @@ -0,0 +1,36 @@ +## ADDED Requirements + +### Requirement: UiFramework application property +The application SHALL set a `DesktopAnalytics` application property named `UiFramework` with the constant value `WinForms` once at startup, via the existing `Analytics.SetApplicationProperty` mechanism, so it is attached to all subsequently tracked events and exception reports. + +#### Scenario: Any event tracked after startup +- **WHEN** any `Analytics.Track` or `Analytics.ReportException` call occurs after startup has completed +- **THEN** the resulting Mixpanel event includes a `UiFramework` property with value `WinForms` + +### Requirement: No dependency on UI-mode concepts +The `UiFramework` property SHALL be set to a fixed constant and SHALL NOT read from, branch on, or otherwise depend on any UI-mode or Avalonia-specific concept. + +#### Scenario: Built without any Avalonia/UIMode code present +- **WHEN** the application is built and run on a codebase that contains no `UIMode`, `LexicalEditSurfaceResolver`, or other Avalonia-branch-only code +- **THEN** the `UiFramework` property still resolves to a valid constant value with no compile-time or run-time dependency on such code + +### Requirement: SwitchToTool dwell-time enrichment +The existing `SwitchToTool` event SHALL include a duration property representing the time spent in the previously active tool before the switch (or before application exit). + +#### Scenario: User switches tools after working in one +- **WHEN** a user works in tool A for a period of time and then switches to tool B +- **THEN** the `SwitchToTool` event recorded for that switch includes a duration reflecting the time spent in tool A + +### Requirement: Preserve existing throttle and properties +The dwell-time enrichment SHALL NOT change the existing `area`/`tool` properties on `SwitchToTool`, nor change its existing once-per-tool-per-day throttle behavior. + +#### Scenario: Same tool revisited multiple times in one day +- **WHEN** a user switches into the same tool more than once within the same day +- **THEN** `SwitchToTool` still fires at most once for that (tool, day) pair as before, now additionally carrying a duration value + +### Requirement: Final dwell time captured at shutdown +Dwell time accumulated in the tool active at the moment the application exits SHALL be captured and tracked rather than silently discarded because no further tool switch occurs. + +#### Scenario: User exits while a tool is active +- **WHEN** the user exits the application while a tool is the currently active one +- **THEN** a dwell-time record reflecting that tool's final active period is tracked as part of shutdown diff --git a/openspec/changes/telemetry-migration-baseline/tasks.md b/openspec/changes/telemetry-migration-baseline/tasks.md new file mode 100644 index 0000000000..d566a674ac --- /dev/null +++ b/openspec/changes/telemetry-migration-baseline/tasks.md @@ -0,0 +1,45 @@ +## 1. Outbox core (managed, `Src/Common/FwUtils/`) + +- [ ] 1.1 Verify `Analytics.AllowTracking` accessibility and exact `RobustFile`/`RobustIO`/`RetryUtility` method signatures against the installed `SIL.Core.Desktop`/`SIL.DesktopAnalytics` package versions (resolves design.md's two Open Questions); note the resolution in a code comment at the call site. +- [ ] 1.2 Add `Src/Common/FwUtils/AnalyticsOutbox.cs`: static class with `Track(string eventName, Dictionary properties)` and `ReportException(Exception exception, Dictionary moreProperties)`, matching `Analytics`'s method shapes (design.md D3). +- [ ] 1.3 Implement consent check (design.md D7) shared by both methods, with the `AllowTracking`-or-fallback-to-`OkToPingBasicUsageData` logic resolved in 1.1. +- [ ] 1.4 Implement outbox file write: serialize event kind, name, properties, and queued-at timestamp to `{UtcTicks:D19}_{Guid:N}.json` under `FwDirectoryFinder.UserAppDataFolder("FieldWorks")\Analytics\Outbox\`, creating the directory if absent (design.md D1, D2). +- [ ] 1.5 Implement the atomic-rename claim (`.inflight` suffix) and post-send delete/rollback logic (design.md D4, D5). +- [ ] 1.6 Implement the bounded-growth sweep: delete oldest files beyond the count/age cap before attempting delivery (design.md D6). +- [ ] 1.7 Implement queued-exception replay via `Analytics.Track("QueuedExceptionReport", ...)` instead of `Analytics.ReportException` (design.md D8). +- [ ] 1.8 Wire the three flush triggers: fire-and-forget flush of the just-written file, full startup sweep, and a bounded (≤2s) shutdown sweep (design.md D9). +- [ ] 1.9 Unit tests (new `Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs`, run via `.\test.ps1`): consent-off write is a no-op; a written file survives a simulated restart (re-instantiate the outbox against the same directory); FIFO ordering of multiple queued files; cap eviction deletes oldest first; a claimed-but-failed send restores the original filename; a claim race (two attempts on the same file) results in exactly one winner. + +## 2. Migrate existing call sites to `AnalyticsOutbox` + +- [ ] 2.1 `Src/Common/FieldWorks/FieldWorks.cs`: replace the three `Analytics.ReportException` calls with `AnalyticsOutbox.ReportException`. +- [ ] 2.2 `Src/LexText/LexTextDll/AreaListener.cs`: replace the `Analytics.Track("SwitchToTool", ...)` call with `AnalyticsOutbox.Track` (coordinate with section 4's dwell-time change to avoid two separate edits to the same call). +- [ ] 2.3 `Src/Common/FwUtils/TrackingHelper.cs`: replace all three `Analytics.Track` calls (`TrackImport`/`TrackExport`/`TrackHelpRequest`) with `AnalyticsOutbox.Track`. +- [ ] 2.4 `Src/LexText/Interlinear/ConcordanceContainer.cs` and `Src/LexText/Interlinear/ConfigureInterlinDialog.cs`: replace their `Analytics.Track` calls with `AnalyticsOutbox.Track`. +- [ ] 2.5 `Src/Common/Controls/FwControls/ObtainProjectMethod.cs`: replace its `Analytics.Track` call with `AnalyticsOutbox.Track`. +- [ ] 2.6 Repo-wide grep for `Analytics.Track(` and `Analytics.ReportException(` outside `AnalyticsOutbox.cs` itself to confirm no call site was missed. + +## 3. Session baseline (`Src/Common/FieldWorks/FieldWorks.cs`) + +- [ ] 3.1 Generate a `SessionId` (`Guid.NewGuid()`) once at startup and store it for the process lifetime. +- [ ] 3.2 Track a session-start event via `AnalyticsOutbox.Track` immediately after the `Analytics` instance is constructed, carrying `SessionId`. +- [ ] 3.3 Add the `Interlocked.CompareExchange`-guarded single-terminal-event flag (design.md D10). +- [ ] 3.4 Track a `clean`-outcome session-end event at the normal fall-through point of the `using (new Analytics(...))` block, including duration since session-start. +- [ ] 3.5 Track a `crashed`-outcome session-end event from `HandleUnhandledException`/`HandleTopLevelError`, including duration since session-start. +- [ ] 3.6 Unit/integration tests: clean shutdown produces exactly one `clean` session-end; a simulated unhandled exception produces exactly one `crashed` session-end and suppresses the clean-path event; duration reflects elapsed time in both cases. + +## 4. Usage enrichment (`UiFramework` property + `SwitchToTool` dwell time) + +- [ ] 4.1 Call `Analytics.SetApplicationProperty("UiFramework", "WinForms")` once at startup in `FieldWorks.cs`, right after `Analytics` construction. +- [ ] 4.2 Test that any event tracked after startup carries `UiFramework=WinForms` (verifies the property attaches automatically without touching individual `Track` call sites). +- [ ] 4.3 In `AreaListener.cs`, add a field tracking the last tool-activation timestamp, updated on every tool change regardless of the once-per-day throttle. +- [ ] 4.4 When the throttled `SwitchToTool` event does fire, include a `duration` property computed from the stored timestamp (design.md D11), without changing the existing `area`/`tool` properties. +- [ ] 4.5 Add a final dwell-time record at application shutdown for whatever tool is currently active, using the same `AnalyticsOutbox.Track` call shape. +- [ ] 4.6 Unit tests (extend `Src/LexText/LexTextDll` test project or nearest existing `AreaListener` test fixture): repeated same-day tool revisits still throttle to one event but carry a duration; a switch after a measurable delay reports a duration reflecting that delay; shutdown while a tool is active produces one final dwell record. + +## 5. Full verification + +- [ ] 5.1 Run `.\build.ps1` for the full managed build (no native changes expected; confirm no native rebuild is triggered). +- [ ] 5.2 Run `.\test.ps1` for the affected test projects (`FwUtilsTests`, `LexTextDll` tests, and any interlinear/common-controls tests touched by call-site migration) and confirm zero regressions. +- [ ] 5.3 Manually exercise the outbox end-to-end: disconnect network, use FieldWorks to trigger a few tracked events, confirm files accumulate under `%LocalAppData%\SIL\FieldWorks\Analytics\Outbox\`, reconnect, relaunch, and confirm the directory empties (via logging or a debugger breakpoint, since there's no UI surface for this). +- [ ] 5.4 Confirm the existing Tools > Options > Privacy checkbox (`OkToPingBasicUsageData`) still gates all telemetry, including newly-queued-but-not-yet-flushed events, per design.md D7. From d0c2288d2e8262d9a2ac71507672a42dcdb52e0f Mon Sep 17 00:00:00 2001 From: John Lambert Date: Thu, 2 Jul 2026 12:21:33 -0400 Subject: [PATCH 2/2] Implement telemetry-migration-baseline: durable analytics outbox, session baseline, usage enrichment Adds AnalyticsOutbox, a durable local queue in front of DesktopAnalytics Track/ReportException so events survive being generated while offline instead of being silently dropped. Migrates all six existing call sites to the new facade, adds session-start/end tracking with clean-vs-crashed classification, and sets a UiFramework=WinForms application property as forward-compatible scaffolding for a future Avalonia UI split. Found and fixed during implementation: System.IO.File.Move does not reliably report failure to the losing thread when two callers race an identical rename on .NET Framework - both can return without throwing even though the OS performs exactly one physical rename (verified with an isolated repro, ~99% reproducible). The outbox's claim mechanism now follows the rename with a FileShare.None exclusive open as the real single-owner check (design.md D14). Caught by the concurrent-flush unit test, which failed intermittently before this fix. 390 FwUtilsTests pass (15 new AnalyticsOutboxTests), plus zero regressions in LexTextDllTests, FieldWorksTests, and ITextDllTests. tasks.md documents three remaining test gaps (session-baseline and dwell-time unit tests blocked on cross-assembly test-seam access) and two manual verification steps (end-to-end offline queueing, privacy toggle behavior) not yet performed. Co-Authored-By: Claude Sonnet 5 --- .../FwControls/ObtainProjectMethod.cs | 3 +- Src/Common/FieldWorks/FieldWorks.cs | 68 ++- Src/Common/FwUtils/AnalyticsOutbox.cs | 476 ++++++++++++++++++ Src/Common/FwUtils/FwUtils.csproj | 1 + .../FwUtilsTests/AnalyticsOutboxTests.cs | 335 ++++++++++++ Src/Common/FwUtils/TrackingHelper.cs | 7 +- .../Interlinear/ConcordanceContainer.cs | 3 +- .../Interlinear/ConfigureInterlinDialog.cs | 3 +- Src/LexText/LexTextDll/AreaListener.cs | 41 +- .../telemetry-migration-baseline/design.md | 40 +- .../telemetry-migration-baseline/tasks.md | 64 +-- 11 files changed, 987 insertions(+), 54 deletions(-) create mode 100644 Src/Common/FwUtils/AnalyticsOutbox.cs create mode 100644 Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs diff --git a/Src/Common/Controls/FwControls/ObtainProjectMethod.cs b/Src/Common/Controls/FwControls/ObtainProjectMethod.cs index c150ea52dc..2d13fc6ccc 100644 --- a/Src/Common/Controls/FwControls/ObtainProjectMethod.cs +++ b/Src/Common/Controls/FwControls/ObtainProjectMethod.cs @@ -10,7 +10,6 @@ using System.Text; using System.Windows.Forms; using System.Xml; -using DesktopAnalytics; using SIL.LCModel.Core.WritingSystems; using SIL.FieldWorks.Common.FwUtils; using SIL.LCModel; @@ -53,7 +52,7 @@ public static string ObtainProjectFromAnySource(Form parent, IHelpTopicProvider obtainedProjectType = ObtainedProjectType.Lift; } - Analytics.Track("CreateFromSRRepo", new Dictionary + AnalyticsOutbox.Track("CreateFromSRRepo", new Dictionary { { "type", obtainedProjectType.ToString() }, { "modelVersion", LcmCache.ModelVersion.ToString() }, diff --git a/Src/Common/FieldWorks/FieldWorks.cs b/Src/Common/FieldWorks/FieldWorks.cs index 3680193fb8..b74a98da5a 100644 --- a/Src/Common/FieldWorks/FieldWorks.cs +++ b/Src/Common/FieldWorks/FieldWorks.cs @@ -122,6 +122,14 @@ private enum StartupStatus // supports usage reporting has been run. private static bool s_noPreviousReportingSettings; private static ILcmUI s_ui; + // Telemetry session baseline (openspec/changes/telemetry-migration-baseline): identifies + // this process's session for correlating its SessionStart/SessionEnd events, and guards + // against both the clean-shutdown path (Main's finally) and the crash paths + // (HandleUnhandledException/HandleTopLevelError) reporting a session-end for the same + // session - whichever runs first wins. + private static string s_sessionId; + private static DateTime s_sessionStartUtc; + private static int s_sessionEndReported; private static FwApplicationSettings s_appSettings; #endregion @@ -232,6 +240,18 @@ static int Main(string[] rgArgs) #endif using (new Analytics(analyticsKey, new UserInfo(), sendFeedback, clientType: DesktopAnalytics.ClientType.Mixpanel)) { + // Forward-compatible marker (openspec/changes/telemetry-migration-baseline): + // constant today since this codebase has only one UI framework; a future + // Avalonia surface sets a different value on the same application property so + // all events/exceptions become segmentable by UI framework without any change + // to how they're tracked. + Analytics.SetApplicationProperty("UiFramework", "WinForms"); + + s_sessionId = Guid.NewGuid().ToString(); + s_sessionStartUtc = DateTime.UtcNow; + s_sessionEndReported = 0; + AnalyticsOutbox.Track("SessionStart", new Dictionary { { "SessionId", s_sessionId } }); + AnalyticsOutbox.FlushInBackground(); Logger.WriteEvent("Starting app"); SetGlobalExceptionHandler(); @@ -385,16 +405,28 @@ static int Main(string[] rgArgs) } catch (ApplicationException ex) { + TryReportSessionEnd("crashed"); MessageBox.Show(ex.Message, FwUtils.ksSuiteName); return 2; } catch (Exception ex) { + TryReportSessionEnd("crashed"); SafelyReportException(ex, s_activeMainWnd, true); return 2; } finally { + // Covers the normal exit path (including early "return 0"s inside the try block, + // which still run through here) exactly once, regardless of how many return + // points exist - unlike trying to place this at "the end of" the using block, + // which any early return would skip past. The atomic guard in + // TryReportSessionEnd means this is a no-op if a crash path already reported the + // session's end first (openspec/changes/telemetry-migration-baseline). + TryReportSessionEnd("clean"); + // Best-effort, time-bounded so it never materially delays shutdown; anything not + // reached is left for the next launch's startup sweep (design.md D9). + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(2)); StaticDispose(); if (Xpcom.IsInitialized) { @@ -1068,12 +1100,12 @@ private static void HandleUnhandledException(object sender, UnhandledExceptionEv { if (e.ExceptionObject is Exception exception) { - Analytics.ReportException(exception, GetInnerExceptionInfo(exception)); + AnalyticsOutbox.ReportException(exception, GetInnerExceptionInfo(exception)); DisplayError(exception, e.IsTerminating); } else { - Analytics.ReportException(new Exception("Unknown Exception"), + AnalyticsOutbox.ReportException(new Exception("Unknown Exception"), new Dictionary { { @@ -1086,6 +1118,31 @@ private static void HandleUnhandledException(object sender, UnhandledExceptionEv DisplayError(new ApplicationException(string.Format("Got unknown exception: {0}", e.ExceptionObject)), false); } + + // AppDomain.UnhandledException is (on the classic .NET Framework desktop CLR) always + // terminating in practice; e.IsTerminating is trusted here rather than assumed. + if (e.IsTerminating) + TryReportSessionEnd("crashed"); + } + + /// + /// Reports the SessionEnd telemetry event exactly once per process, regardless of which + /// of the (potentially concurrent) shutdown/crash paths reaches it first. + /// See openspec/changes/telemetry-migration-baseline/design.md D10. + /// + /// "clean" or "crashed" + private static void TryReportSessionEnd(string outcome) + { + if (Interlocked.CompareExchange(ref s_sessionEndReported, 1, 0) != 0) + return; // already reported by a different path + + var durationSeconds = (DateTime.UtcNow - s_sessionStartUtc).TotalSeconds; + AnalyticsOutbox.Track("SessionEnd", new Dictionary + { + { "SessionId", s_sessionId }, + { "outcome", outcome }, + { "duration", durationSeconds.ToString("F1", CultureInfo.InvariantCulture) } + }); } /// @@ -1129,7 +1186,7 @@ private static Dictionary GetInnerExceptionInfo(Exception except /// ------------------------------------------------------------------------------------ private static void HandleTopLevelError(object sender, ThreadExceptionEventArgs eventArgs) { - Analytics.ReportException(eventArgs.Exception, GetInnerExceptionInfo(eventArgs.Exception)); + AnalyticsOutbox.ReportException(eventArgs.Exception, GetInnerExceptionInfo(eventArgs.Exception)); if (FwUtils.IsUnsupportedCultureException(eventArgs.Exception)) // LT-8248 { Logger.WriteEvent("Unsupported culture: " + eventArgs.Exception.Message); @@ -1141,6 +1198,11 @@ private static void HandleTopLevelError(object sender, ThreadExceptionEventArgs if (DisplayError(eventArgs.Exception, false)) { FwApp.InCrashedState = true; + // Unlike HandleUnhandledException, Application.ThreadException doesn't always + // terminate the app (see the comment on SetGlobalExceptionHandler) - this branch + // is specifically the "we've decided to exit" case, so it's the correct place to + // report a crashed session-end, not the top of this method. + TryReportSessionEnd("crashed"); Application.Exit(); // just to be sure diff --git a/Src/Common/FwUtils/AnalyticsOutbox.cs b/Src/Common/FwUtils/AnalyticsOutbox.cs new file mode 100644 index 0000000000..b4120fe6a7 --- /dev/null +++ b/Src/Common/FwUtils/AnalyticsOutbox.cs @@ -0,0 +1,476 @@ +// Copyright (c) 2026 SIL International +// This software is licensed under the LGPL, version 2.1 or later +// (http://www.gnu.org/licenses/lgpl-2.1.html) + +using System; +using System.Collections.Generic; +using System.IO; +using System.Linq; +using System.Net.NetworkInformation; +using System.Text; +using System.Threading.Tasks; +using DesktopAnalytics; +using Newtonsoft.Json; +using SIL.Code; +using SIL.IO; + +namespace SIL.FieldWorks.Common.FwUtils +{ + /// + /// Durable local delivery for DesktopAnalytics Track/ReportException calls. Every event is + /// first persisted as its own small JSON file under a per-user outbox directory, then an + /// immediate best-effort flush is attempted; a full directory sweep also runs at startup and + /// (bounded) at shutdown, so events survive being generated while offline instead of + /// DesktopAnalytics' own silent fire-and-forget loss. + /// + /// + /// One file per event (not one shared queue file) avoids needing any cross-process lock: + /// FieldWorks can run multiple simultaneous processes (one per open project), and a single + /// process can also run concurrent flushes (e.g. the enqueue-time immediate attempt racing a + /// startup sweep), so a queued file is "claimed" for delivery by renaming it to a deterministic + /// ".inflight" name and then opening that with - the exclusive + /// open, not the rename, is the real single-owner guarantee (see D14: two callers racing an + /// identical rename can both report success). A sweep that later finds an already-".inflight" + /// file only treats it as a crash orphan once it's old enough () + /// that no legitimate in-progress delivery could still be running; a fresh one is left alone. + /// See openspec/changes/telemetry-migration-baseline/design.md for the full rationale, + /// including a documented gap: DesktopAnalytics' Track call is internally fire-and-forget and + /// essentially never throws synchronously, so a network-availability pre-check (not a real + /// delivery confirmation) gates whether a claim is even attempted. + /// + public static class AnalyticsOutbox + { + private const string ExceptionEventKind = "Exception"; + private const string TrackEventKind = "Track"; + private const string InflightSuffix = ".inflight"; + private const string RecoveringSuffix = ".recovering"; + private const string QueuedExceptionReportEventName = "QueuedExceptionReport"; + + /// + /// How long an ".inflight" claim must sit untouched before a sweep is willing to treat it + /// as orphaned (left behind by a crashed claimant) rather than a legitimate, still-in- + /// progress delivery by another thread or process. Kept comfortably above any realistic + /// delivery duration - including a slow network - so a healthy in-progress claim is never + /// mistaken for an orphan and delivered twice (a real double-delivery bug this class's own + /// tests caught during development, when a naive "any .inflight file is safe to redeliver" + /// rule raced two concurrent flushes against the exact same file). Not readonly so tests + /// can shrink it instead of waiting real minutes. + /// + internal static TimeSpan OrphanClaimStaleness = TimeSpan.FromMinutes(5); + + /// + /// Maximum number of queued files retained; oldest excess files are evicted unsent. + /// Not a const so tests can shrink it rather than creating thousands of files. + /// + internal static int MaxQueuedFiles = 2000; + + /// + /// Maximum age of a queued file before it is evicted unsent. Not readonly so tests can + /// shrink it rather than waiting real days. + /// + internal static TimeSpan MaxQueuedAge = TimeSpan.FromDays(30); + + private static readonly ISet TransientIoExceptionTypes = new HashSet + { + typeof(IOException), + typeof(UnauthorizedAccessException) + }; + + private static string s_outboxDirectoryOverride; + + private static string OutboxDirectory => s_outboxDirectoryOverride ?? + Path.Combine(FwDirectoryFinder.UserAppDataFolder("FieldWorks"), "Analytics", "Outbox"); + + /// + /// Test seam: is a hard, process-wide singleton + /// (its constructor throws "You can only construct a single Analytics object" if called + /// twice, and can never change once + /// set) - so tests cannot exercise both a consenting and a non-consenting scenario against + /// the real class in one test run. When set, this overrides . + /// + internal static Func AllowTrackingOverride; + + /// + /// Test seam: overrides the real + /// check so tests aren't at the mercy of the test machine's actual connectivity. + /// + internal static Func NetworkAvailableOverride; + + /// + /// Test seam: overrides the real call to + /// that would otherwise happen at the end of a successful flush, so tests can observe what + /// would have been sent (and simulate a delivery failure by throwing) without a live + /// Analytics singleton or network access. + /// + internal static Action> TrackDeliveryOverride; + + private static bool AllowTracking => AllowTrackingOverride?.Invoke() ?? Analytics.AllowTracking; + private static bool NetworkAvailable => NetworkAvailableOverride?.Invoke() ?? NetworkInterface.GetIsNetworkAvailable(); + + private static void DeliverViaAnalytics(string eventName, Dictionary properties) + { + if (TrackDeliveryOverride != null) + TrackDeliveryOverride(eventName, properties); + else + Analytics.Track(eventName, properties); + } + + /// Test seam: points the outbox at a temp directory instead of the real per-user folder. + internal static void SetOutboxDirectoryForTests(string directory) + { + s_outboxDirectoryOverride = directory; + } + + /// Test seam: restores the default (real) outbox directory and clears all overrides. + internal static void ResetOutboxDirectoryForTests() + { + s_outboxDirectoryOverride = null; + AllowTrackingOverride = null; + NetworkAvailableOverride = null; + TrackDeliveryOverride = null; + MaxQueuedFiles = 2000; + MaxQueuedAge = TimeSpan.FromDays(30); + OrphanClaimStaleness = TimeSpan.FromMinutes(5); + } + + /// + /// Durable replacement for . + /// Does nothing if the user has not consented to analytics. + /// + public static void Track(string eventName, Dictionary properties) + { + Enqueue(TrackEventKind, eventName, properties); + } + + /// + /// Durable replacement for . + /// Does nothing if the user has not consented to analytics. + /// + public static void ReportException(Exception exception, Dictionary moreProperties) + { + var properties = moreProperties == null + ? new Dictionary() + : new Dictionary(moreProperties); + properties["Message"] = exception.Message; + properties["Stack Trace"] = exception.StackTrace ?? string.Empty; + Enqueue(ExceptionEventKind, exception.GetType().FullName, properties); + } + + /// + /// Sweeps the outbox in the background (fire-and-forget). Intended for the startup flush, + /// which is what actually delivers everything accumulated while offline. + /// + public static void FlushInBackground() + { + Task.Run(() => Flush(null)); + } + + /// + /// Sweeps the outbox synchronously, but bounded by so it never + /// materially delays application shutdown. Any events not reached before the timeout are + /// left in place for the next startup sweep. + /// + public static void FlushSync(TimeSpan timeout) + { + Flush(DateTime.UtcNow + timeout); + } + + private static void Enqueue(string kind, string eventName, Dictionary properties) + { + if (!AllowTracking) + return; + + var outboxDirectory = OutboxDirectory; + // RobustIO has no "create if missing" helper - RequireThatDirectoryExists asserts + // existence rather than creating it (confirmed by decompiling SIL.Core), so this + // (idempotent, no-op if already present) is the correct call, not that one. + Directory.CreateDirectory(outboxDirectory); + + var entry = new OutboxEntry + { + Kind = kind, + EventName = eventName, + Properties = properties ?? new Dictionary(), + QueuedAtUtc = DateTime.UtcNow + }; + + var fileName = $"{entry.QueuedAtUtc.Ticks:D19}_{Guid.NewGuid():N}.json"; + var path = Path.Combine(outboxDirectory, fileName); + RetryUtility.Retry( + () => RobustFile.WriteAllText(path, JsonConvert.SerializeObject(entry)), + exceptionTypesToRetry: TransientIoExceptionTypes, + memo: "AnalyticsOutbox.Enqueue"); + + // Best-effort immediate delivery attempt so the common (online) case still delivers + // with negligible added latency; the startup/shutdown sweeps are the durability + // backstop, not the primary path. + Task.Run(() => ProcessOutboxFile(path)); + } + + private static void Flush(DateTime? deadlineUtc) + { + var outboxDirectory = OutboxDirectory; + if (!Directory.Exists(outboxDirectory)) + return; + + EvictBeyondCap(outboxDirectory); + + IEnumerable pending; + try + { + pending = RobustIO.EnumerateFilesInDirectory(outboxDirectory, "*.json") + .Concat(RobustIO.EnumerateFilesInDirectory(outboxDirectory, "*.json" + InflightSuffix)) + .OrderBy(p => Path.GetFileName(p), StringComparer.Ordinal) + .ToList(); + } + catch (DirectoryNotFoundException) + { + return; + } + + foreach (var path in pending) + { + if (deadlineUtc.HasValue && DateTime.UtcNow >= deadlineUtc.Value) + break; + ProcessOutboxFile(path); + } + } + + private static void ProcessOutboxFile(string path) + { + if (path.EndsWith(InflightSuffix, StringComparison.Ordinal)) + { + ProcessInflightFile(path); + return; + } + + if (!RobustFile.Exists(path)) + return; // already claimed/handled by another process or a previous pass + + if (!AllowTracking) + { + // Consent has been revoked since this was queued - fail closed on privacy: + // drop it rather than send data the user no longer consents to. + TryDelete(path); + return; + } + + if (!NetworkAvailable) + return; // known gap (design.md D12): this only proves *a* network exists, not that Mixpanel is reachable + + if (!TryClaimExclusive(path, InflightSuffix, out var claimedPath, out var claimedStream)) + return; // another thread/process already claimed (or removed) this exact file first + + DeliverClaimedFile(claimedStream, claimedPath, path); + } + + /// + /// Handles a file already sitting in the claimed (".inflight") state when a sweep finds + /// it. This could be a crash orphan, or - just as likely within one process - another + /// thread's flush that is still legitimately, actively delivering it right now. Those two + /// cases are indistinguishable from the filename alone, so this gates on a staleness + /// check: only a claim old enough that no realistic delivery would still be in progress is + /// treated as abandoned. (A naive "any .inflight file is fair game" rule was tried first + /// and caused a real, test-caught double delivery between two concurrent flushes.) + /// + private static void ProcessInflightFile(string path) + { + DateTime claimedAtUtc; + try + { + claimedAtUtc = RobustFile.GetLastWriteTimeUtc(path); + } + catch (IOException) + { + return; // gone already - another thread/process finished with it + } + + if (!RobustFile.Exists(path)) + return; // gone already - GetLastWriteTimeUtc doesn't throw for a missing file (see D14) + + if (DateTime.UtcNow - claimedAtUtc < OrphanClaimStaleness) + return; // too fresh to safely assume orphaned; a legitimate claim may still be in progress + + // Stale enough to treat as abandoned. Still claim it via the same rename+exclusive-open + // pattern before delivering, so two sweeps recovering the same stale orphan at the + // same moment can't both deliver it. On a delivery failure this is restored all the + // way back to the original, non-suffixed name (not left as ".recovering"), so the + // next sweep re-enters the normal, single-claim path rather than growing the suffix + // further. Known, accepted residual gap: if the process crashes again inside this + // narrow recovery window itself, that one file could be left permanently as + // ".recovering" until manually cleared - considered acceptable given how rare two + // crashes in the same delivery attempt would be. + var originalPath = path.Substring(0, path.Length - InflightSuffix.Length); + if (!TryClaimExclusive(path, RecoveringSuffix, out var claimedPath, out var claimedStream)) + return; + + DeliverClaimedFile(claimedStream, claimedPath, originalPath); + } + + /// + /// Claims for exclusive delivery by this caller. First renames it + /// to + (relocating it out of the name a + /// future sweep would otherwise re-discover), then opens the result with + /// as the actual single-owner guarantee. + /// + /// + /// The rename's own success/failure is NOT a reliable exclusivity signal on its own: + /// verified empirically (see design.md D14) that two threads racing an identical + /// (source, destination) pair through / + /// can BOTH return without throwing, even + /// though the OS performs exactly one physical rename - this class's own concurrent-flush + /// test caught real double delivery caused by trusting that signal alone. A + /// open, by contrast, is a real OS-enforced lock: only one + /// caller can ever hold it, which is what the returned is + /// used for. + /// + private static bool TryClaimExclusive(string path, string suffix, out string claimedPath, out FileStream claimedStream) + { + claimedPath = path + suffix; + claimedStream = null; + try + { + RobustFile.Move(path, claimedPath); + } + catch (IOException) + { + // Covers FileNotFoundException too: another process (or a previous pass) + // already claimed or removed this file. + return false; + } + catch (UnauthorizedAccessException) + { + return false; + } + + try + { + claimedStream = new FileStream(claimedPath, FileMode.Open, FileAccess.ReadWrite, FileShare.None); + return true; + } + catch (IOException) + { + // Someone else's Move "succeeded" against the same physical file (see remarks) and + // already holds the real, exclusive lock - or has already finished and deleted it. + return false; + } + catch (UnauthorizedAccessException) + { + return false; + } + } + + private static void DeliverClaimedFile(FileStream claimedStream, string claimedPath, string originalPath) + { + if (!AllowTracking) + { + claimedStream.Dispose(); + TryDelete(claimedPath); + return; + } + + try + { + string content; + // The StreamReader owns claimedStream and disposes it (releasing the exclusive + // lock) as soon as the read completes, whether or not this throws. + using (var reader = new StreamReader(claimedStream, Encoding.UTF8, true, 1024)) + content = reader.ReadToEnd(); + + var entry = JsonConvert.DeserializeObject(content); + if (entry.Kind == ExceptionEventKind) + { + // Replayed via Track, not ReportException: replaying through ReportException + // would count a backlog of queued exceptions against this run's own live + // MAX_EXCEPTION_REPORTS_PER_RUN budget (design.md D8). + DeliverViaAnalytics(QueuedExceptionReportEventName, entry.Properties); + } + else + { + DeliverViaAnalytics(entry.EventName, entry.Properties); + } + TryDelete(claimedPath); + } + catch + { + // Couldn't even attempt delivery (e.g. deserialization failure) - restore the + // original name so a later sweep retries it, rather than losing it silently. + TryMove(claimedPath, originalPath); + } + } + + private static void EvictBeyondCap(string outboxDirectory) + { + List files; + try + { + files = RobustIO.EnumerateFilesInDirectory(outboxDirectory, "*.json") + .OrderBy(p => Path.GetFileName(p), StringComparer.Ordinal) + .ToList(); + } + catch (DirectoryNotFoundException) + { + return; + } + + var cutoffUtc = DateTime.UtcNow - MaxQueuedAge; + foreach (var path in files.ToList()) + { + if (TryGetQueuedAtUtc(Path.GetFileName(path), out var queuedAtUtc) && queuedAtUtc < cutoffUtc) + { + TryDelete(path); + files.Remove(path); + } + } + + var excess = files.Count - MaxQueuedFiles; + for (var i = 0; i < excess; i++) + TryDelete(files[i]); + } + + private static bool TryGetQueuedAtUtc(string fileName, out DateTime queuedAtUtc) + { + queuedAtUtc = default(DateTime); + var ticksPart = Path.GetFileNameWithoutExtension(fileName).Split('_')[0]; + if (!long.TryParse(ticksPart, out var ticks)) + return false; + try + { + queuedAtUtc = new DateTime(ticks, DateTimeKind.Utc); + return true; + } + catch (ArgumentOutOfRangeException) + { + return false; + } + } + + private static void TryDelete(string path) + { + try + { + RobustFile.Delete(path); + } + catch (IOException) { } + catch (UnauthorizedAccessException) { } + } + + private static void TryMove(string sourcePath, string destinationPath) + { + try + { + RobustFile.Move(sourcePath, destinationPath, true); + } + catch (IOException) { } + catch (UnauthorizedAccessException) { } + } + + private class OutboxEntry + { + public string Kind { get; set; } + public string EventName { get; set; } + public Dictionary Properties { get; set; } + public DateTime QueuedAtUtc { get; set; } + } + } +} diff --git a/Src/Common/FwUtils/FwUtils.csproj b/Src/Common/FwUtils/FwUtils.csproj index e919b7d0be..bc38704e3e 100644 --- a/Src/Common/FwUtils/FwUtils.csproj +++ b/Src/Common/FwUtils/FwUtils.csproj @@ -25,6 +25,7 @@ + diff --git a/Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs b/Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs new file mode 100644 index 0000000000..a8820e241a --- /dev/null +++ b/Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs @@ -0,0 +1,335 @@ +// Copyright (c) 2026 SIL International +// This software is licensed under the LGPL, version 2.1 or later +// (http://www.gnu.org/licenses/lgpl-2.1.html) + +using System; +using System.Collections.Generic; +using System.IO; +using System.Linq; +using System.Threading; +using System.Threading.Tasks; +using Newtonsoft.Json; +using NUnit.Framework; + +namespace SIL.FieldWorks.Common.FwUtils +{ + /// + /// Tests for . These exercise the outbox's own logic entirely + /// through its test seams (, + /// , + /// ) rather than the real + /// DesktopAnalytics.Analytics class, because that class is a hard, process-wide + /// singleton (its constructor throws if called more than once per process, and + /// AllowTracking can never change once set) - so a single test run could never + /// otherwise exercise both a consenting and a non-consenting scenario. + /// See openspec/changes/telemetry-migration-baseline/ for the design this verifies. + /// + // AnalyticsOutbox's test seams (AllowTrackingOverride, NetworkAvailableOverride, + // TrackDeliveryOverride, MaxQueuedFiles, MaxQueuedAge, OrphanClaimStaleness) are static, + // process-wide state. VSTest's NUnit adapter runs fixtures on multiple parallel workers + // by default (Test.runsettings: NumberOfTestWorkers=0/auto), so without this attribute + // these tests race each other's static state and fail intermittently. + [TestFixture, NonParallelizable] + public class AnalyticsOutboxTests + { + private string m_outboxDir; + private List>> m_delivered; + + [SetUp] + public void SetUp() + { + m_outboxDir = Path.Combine(Path.GetTempPath(), "AnalyticsOutboxTests_" + Guid.NewGuid().ToString("N")); + m_delivered = new List>>(); + + AnalyticsOutbox.SetOutboxDirectoryForTests(m_outboxDir); + AnalyticsOutbox.AllowTrackingOverride = () => true; + // Default to "offline" so Enqueue's own fire-and-forget immediate-delivery attempt + // never races with a test's assertions; tests that want delivery flip this to true + // and then call FlushSync themselves, which is deterministic. + AnalyticsOutbox.NetworkAvailableOverride = () => false; + AnalyticsOutbox.TrackDeliveryOverride = (name, props) => + { + lock (m_delivered) + m_delivered.Add(Tuple.Create(name, props)); + }; + } + + [TearDown] + public void TearDown() + { + AnalyticsOutbox.ResetOutboxDirectoryForTests(); + if (Directory.Exists(m_outboxDir)) + Directory.Delete(m_outboxDir, true); + } + + private class RawEntry + { + public string Kind { get; set; } + public string EventName { get; set; } + public Dictionary Properties { get; set; } + public DateTime QueuedAtUtc { get; set; } + } + + private void WriteRawOutboxFile(DateTime queuedAtUtc, string eventName, string suffix = "") + { + Directory.CreateDirectory(m_outboxDir); + var fileName = $"{queuedAtUtc.Ticks:D19}_{Guid.NewGuid():N}.json{suffix}"; + var json = JsonConvert.SerializeObject(new RawEntry + { + Kind = "Track", + EventName = eventName, + Properties = new Dictionary(), + QueuedAtUtc = queuedAtUtc + }); + File.WriteAllText(Path.Combine(m_outboxDir, fileName), json); + } + + [Test] + public void Track_ConsentOff_DoesNotPersistAnything() + { + AnalyticsOutbox.AllowTrackingOverride = () => false; + + AnalyticsOutbox.Track("SomeEvent", new Dictionary { { "k", "v" } }); + + Assert.That(Directory.Exists(m_outboxDir), Is.False, + "consent-off must not even create the outbox directory"); + } + + [Test] + public void Track_ConsentOn_NetworkUnavailable_PersistsFileUndelivered() + { + AnalyticsOutbox.Track("SomeEvent", new Dictionary { { "k", "v" } }); + Thread.Sleep(50); // let the fire-and-forget immediate-attempt (a no-op offline) settle + + Assert.That(Directory.GetFiles(m_outboxDir, "*.json"), Has.Length.EqualTo(1)); + Assert.That(m_delivered, Is.Empty); + } + + [Test] + public void QueuedEvent_DeliveredOnLaterFlush_WhenNetworkBecomesAvailable() + { + // Models: event queued while offline, then delivered on a later launch's startup sweep. + AnalyticsOutbox.Track("SomeEvent", new Dictionary { { "k", "v" } }); + Thread.Sleep(50); + Assert.That(m_delivered, Is.Empty, "must not deliver while offline"); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered, Has.Count.EqualTo(1)); + Assert.That(m_delivered[0].Item1, Is.EqualTo("SomeEvent")); + Assert.That(m_delivered[0].Item2["k"], Is.EqualTo("v")); + Assert.That(Directory.GetFiles(m_outboxDir, "*.json"), Is.Empty, + "a delivered file must be removed from the outbox"); + } + + [Test] + public void Flush_DeliversMultipleQueuedEvents_InFifoOrder() + { + for (var i = 0; i < 5; i++) + { + AnalyticsOutbox.Track("Event" + i, new Dictionary()); + Thread.Sleep(5); // ensure distinct tick-based filenames sort in enqueue order + } + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered.Select(e => e.Item1), + Is.EqualTo(new[] { "Event0", "Event1", "Event2", "Event3", "Event4" })); + } + + [Test] + public void Flush_EvictsOldestFilesFirst_WhenOverCountCap() + { + AnalyticsOutbox.MaxQueuedFiles = 2; + for (var i = 0; i < 4; i++) + { + AnalyticsOutbox.Track("Event" + i, new Dictionary()); + Thread.Sleep(5); + } + + // Network stays "unavailable" (default) - this isolates eviction from delivery. + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + var remaining = Directory.GetFiles(m_outboxDir, "*.json") + .OrderBy(f => Path.GetFileName(f), StringComparer.Ordinal) + .Select(f => JsonConvert.DeserializeObject(File.ReadAllText(f)).EventName) + .ToList(); + Assert.That(remaining, Is.EqualTo(new[] { "Event2", "Event3" }), + "only the 2 newest survive a cap of 2; the oldest must be evicted first"); + Assert.That(m_delivered, Is.Empty, "eviction must not attempt delivery of the survivors"); + } + + [Test] + public void Flush_EvictsFilesOlderThanMaxAge_WithoutDelivering() + { + AnalyticsOutbox.MaxQueuedAge = TimeSpan.FromDays(1); + WriteRawOutboxFile(DateTime.UtcNow.AddDays(-5), "StaleEvent"); + AnalyticsOutbox.Track("FreshEvent", new Dictionary()); + Thread.Sleep(50); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered.Select(e => e.Item1), Is.EqualTo(new[] { "FreshEvent" }), + "the stale event must be evicted rather than delivered"); + } + + [Test] + public void Flush_RestoresOriginalFileName_WhenDeliveryThrows() + { + AnalyticsOutbox.Track("WillFail", new Dictionary()); + Thread.Sleep(50); + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.TrackDeliveryOverride = (name, props) => + throw new InvalidOperationException("simulated delivery failure"); + + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + var files = Directory.GetFiles(m_outboxDir); + Assert.That(files, Has.Length.EqualTo(1)); + Assert.That(files[0], Does.EndWith(".json"), + "a failed delivery must restore the original .json name, not leave a stray .inflight file"); + } + + [Test] + public void Flush_ConcurrentFlushCalls_DeliverEachEventExactlyOnce() + { + for (var i = 0; i < 10; i++) + AnalyticsOutbox.Track("Event" + i, new Dictionary()); + Thread.Sleep(50); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + // A small delay widens the race window between the two concurrent flushes below, so + // the claim race is reliably exercised rather than won by luck. + AnalyticsOutbox.TrackDeliveryOverride = (name, props) => + { + Thread.Sleep(10); + lock (m_delivered) + m_delivered.Add(Tuple.Create(name, props)); + }; + + var flush1 = Task.Run(() => AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5))); + var flush2 = Task.Run(() => AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5))); + Task.WaitAll(flush1, flush2); + + Assert.That(m_delivered.Select(e => e.Item1).Distinct().Count(), Is.EqualTo(10), + "every event must eventually be delivered"); + Assert.That(m_delivered, Has.Count.EqualTo(10), + "the atomic-rename claim must prevent the same event being delivered twice by concurrent flushes"); + } + + [Test] + public void Flush_DropsQueuedEvent_WhenConsentRevokedBeforeFlush() + { + AnalyticsOutbox.Track("ShouldBeDropped", new Dictionary()); + Thread.Sleep(50); + AnalyticsOutbox.AllowTrackingOverride = () => false; + AnalyticsOutbox.NetworkAvailableOverride = () => true; + + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered, Is.Empty); + Assert.That(Directory.GetFiles(m_outboxDir, "*.json"), Is.Empty, + "a consent-revoked event must be deleted, not left queued indefinitely"); + } + + [Test] + public void ReportException_ReplaysAsQueuedExceptionReport_NotTheOriginalExceptionType() + { + AnalyticsOutbox.ReportException(new InvalidOperationException("boom"), + new Dictionary { { "extra", "1" } }); + Thread.Sleep(50); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered, Has.Count.EqualTo(1)); + Assert.That(m_delivered[0].Item1, Is.EqualTo("QueuedExceptionReport"), + "replayed exceptions must go through Track, not the real ReportException path (design.md D8)"); + Assert.That(m_delivered[0].Item2["Message"], Is.EqualTo("boom")); + Assert.That(m_delivered[0].Item2["extra"], Is.EqualTo("1")); + } + + [Test] + public void FlushSync_StopsAtTimeout_LeavingRemainingEventsQueued() + { + for (var i = 0; i < 5; i++) + { + AnalyticsOutbox.Track("Event" + i, new Dictionary()); + Thread.Sleep(5); + } + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.TrackDeliveryOverride = (name, props) => + { + Thread.Sleep(100); + lock (m_delivered) + m_delivered.Add(Tuple.Create(name, props)); + }; + + AnalyticsOutbox.FlushSync(TimeSpan.FromMilliseconds(150)); + + Assert.That(m_delivered.Count, Is.LessThan(5), + "a short timeout must stop the sweep before every file is processed"); + Assert.That(Directory.GetFiles(m_outboxDir, "*.json"), Is.Not.Empty, + "events not reached before the timeout must remain queued for the next sweep"); + } + + [Test] + public void Flush_DeliversOrphanedInflightFile_LeftBehindByACrashedPriorClaim() + { + // Simulates a process that claimed a file (renamed it to .inflight) and then crashed + // or was killed before finishing delivery. Zeroing the staleness threshold means this + // (freshly-written-by-the-test) .inflight file is immediately eligible for recovery, + // isolating the orphan-recovery path from the separate staleness-gating behavior + // covered by InflightFile_NotYetStale_IsLeftAloneRatherThanRedelivered below. + AnalyticsOutbox.OrphanClaimStaleness = TimeSpan.Zero; + WriteRawOutboxFile(DateTime.UtcNow, "Orphaned", suffix: ".inflight"); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered.Select(e => e.Item1), Is.EqualTo(new[] { "Orphaned" })); + Assert.That(Directory.GetFiles(m_outboxDir), Is.Empty); + } + + [Test] + public void InflightFile_NotYetStale_IsLeftAloneRatherThanRedelivered() + { + // This is the exact scenario that caused a real, test-caught double-delivery bug + // during development: a fresh .inflight file (a legitimate, still-in-progress claim + // by another thread/process, not a crash orphan) must not be redelivered by a sweep + // that merely happens to enumerate it. + WriteRawOutboxFile(DateTime.UtcNow, "StillBeingDelivered", suffix: ".inflight"); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered, Is.Empty, + "a fresh .inflight claim must be left alone, not treated as an orphan"); + Assert.That(Directory.GetFiles(m_outboxDir, "*.inflight"), Has.Length.EqualTo(1), + "the file must remain exactly as claimed for whoever is legitimately delivering it"); + } + + [Test] + public void Track_NullProperties_DoesNotThrow_AndDeliversEmptyPropertiesDictionary() + { + Assert.DoesNotThrow(() => AnalyticsOutbox.Track("NullPropsEvent", null)); + Thread.Sleep(50); + + AnalyticsOutbox.NetworkAvailableOverride = () => true; + AnalyticsOutbox.FlushSync(TimeSpan.FromSeconds(5)); + + Assert.That(m_delivered, Has.Count.EqualTo(1)); + Assert.That(m_delivered[0].Item2, Is.Not.Null.And.Empty); + } + + [Test] + public void ReportException_NullMoreProperties_DoesNotThrow() + { + Assert.DoesNotThrow(() => AnalyticsOutbox.ReportException(new Exception("x"), null)); + } + } +} diff --git a/Src/Common/FwUtils/TrackingHelper.cs b/Src/Common/FwUtils/TrackingHelper.cs index 28aa038191..48cc31d404 100644 --- a/Src/Common/FwUtils/TrackingHelper.cs +++ b/Src/Common/FwUtils/TrackingHelper.cs @@ -1,6 +1,5 @@ using System; using System.Collections.Generic; -using DesktopAnalytics; namespace SIL.FieldWorks.Common.FwUtils { @@ -12,7 +11,7 @@ public static void TrackImport(string area, string type, ImportExportStep import eventProps["area"] = area; eventProps["type"] = type; eventProps["step"] = Enum.GetName(typeof(ImportExportStep), importExportStep); - Analytics.Track("Import", eventProps); + AnalyticsOutbox.Track("Import", eventProps); } public static void TrackExport(string area, string type, ImportExportStep importExportStep, Dictionary extraProps = null) @@ -21,7 +20,7 @@ public static void TrackExport(string area, string type, ImportExportStep import eventProps["area"] = area; eventProps["type"] = type; eventProps["step"] = Enum.GetName(typeof(ImportExportStep), importExportStep); - Analytics.Track("Export", eventProps); + AnalyticsOutbox.Track("Export", eventProps); } public static void TrackHelpRequest(string helpFile, string helpTopic, Dictionary extraProps = null) @@ -29,7 +28,7 @@ public static void TrackHelpRequest(string helpFile, string helpTopic, Dictionar var eventProps = extraProps ?? new Dictionary(); eventProps["helpFile"] = helpFile; eventProps["helpTopic"] = helpTopic; - Analytics.Track("Help", eventProps); + AnalyticsOutbox.Track("Help", eventProps); } } } \ No newline at end of file diff --git a/Src/LexText/Interlinear/ConcordanceContainer.cs b/Src/LexText/Interlinear/ConcordanceContainer.cs index cff00d923d..032852de7d 100644 --- a/Src/LexText/Interlinear/ConcordanceContainer.cs +++ b/Src/LexText/Interlinear/ConcordanceContainer.cs @@ -9,6 +9,7 @@ using System.Windows.Forms; using SIL.FieldWorks.Common.Controls; using SIL.FieldWorks.Common.Controls.FileDialog; +using SIL.FieldWorks.Common.FwUtils; using SIL.FieldWorks.Common.RootSites; using SIL.FieldWorks.XWorks; using XCore; @@ -45,7 +46,7 @@ public void OnExportConcordanceResults(object arguments) return; fileName = dlg.FileName; } - DesktopAnalytics.Analytics.Track("ExportConcordanceResults", new Dictionary()); + AnalyticsOutbox.Track("ExportConcordanceResults", new Dictionary()); using (var fs = new FileStream(fileName, FileMode.Create)) using (var textWriter = new StreamWriter(fs)) diff --git a/Src/LexText/Interlinear/ConfigureInterlinDialog.cs b/Src/LexText/Interlinear/ConfigureInterlinDialog.cs index ec9713ee63..468f32e692 100644 --- a/Src/LexText/Interlinear/ConfigureInterlinDialog.cs +++ b/Src/LexText/Interlinear/ConfigureInterlinDialog.cs @@ -15,7 +15,6 @@ using System.Linq; using System.Text; using System.Xml; -using DesktopAnalytics; using Gecko; using Gecko.DOM; using SIL.LCModel.Infrastructure; @@ -81,7 +80,7 @@ public ConfigureInterlinDialog(Mediator mediator, PropertyTable propertyTable, L var htmlPath = SaveHtmlToTemp(); mainBrowser.Url = new Uri(htmlPath); browser.DomContentChanged += Browser_DomContentChanged; - Analytics.Track("ConfigureInterlinear", new Dictionary { + AnalyticsOutbox.Track("ConfigureInterlinear", new Dictionary { { "interlinearMode", Enum.GetName(typeof(InterlinLineChoices.InterlinMode), choices.Mode) }}); diff --git a/Src/LexText/LexTextDll/AreaListener.cs b/Src/LexText/LexTextDll/AreaListener.cs index 1fd563f85f..8ed0afc74a 100644 --- a/Src/LexText/LexTextDll/AreaListener.cs +++ b/Src/LexText/LexTextDll/AreaListener.cs @@ -5,6 +5,7 @@ using System; using System.Collections.Generic; using System.Diagnostics; +using System.Globalization; using System.Linq; using SIL.FieldWorks.Common.FwUtils; using static SIL.FieldWorks.Common.FwUtils.FwUtils; @@ -127,6 +128,22 @@ protected virtual void Dispose(bool disposing) if (disposing) { + // Final dwell-time record for whatever tool was active when the app is closing, + // so it isn't silently lost because no further tool switch occurs (bypasses the + // once-per-day throttle deliberately - this is a distinct, one-time-per-session + // closing signal, not a repeated "switched to" event). + if (!string.IsNullOrEmpty(m_lastActivatedToolName) && m_lastToolActivationUtc != DateTime.MinValue) + { + var finalDwellSeconds = (DateTime.UtcNow - m_lastToolActivationUtc).TotalSeconds; + AnalyticsOutbox.Track("SwitchToTool", new Dictionary + { + {"area", m_lastActivatedAreaName ?? string.Empty}, + {"tool", m_lastActivatedToolName}, + {"duration", finalDwellSeconds.ToString("F1", CultureInfo.InvariantCulture)}, + {"final", "true"} + }); + } + Subscriber.Unsubscribe(EventConstants.SetToolFromName, SetToolFromName); Subscriber.Unsubscribe(EventConstants.ReloadAreaTools, ReloadAreaTools); Subscriber.Unsubscribe(EventConstants.GetContentControlParameters, GetContentControlParameters); @@ -164,6 +181,15 @@ public void Init(Mediator mediator, PropertyTable propertyTable, XmlNode configu private DateTime m_lastToolChange = DateTime.MinValue; HashSet m_toolsReportedToday = new HashSet(); + /// + /// Updated on every tool activation regardless of 's + /// once-per-day throttle, so a throttled-out revisit doesn't lose the timestamp needed to + /// compute dwell time for whichever activation next actually fires a SwitchToTool event. + /// + private DateTime m_lastToolActivationUtc = DateTime.MinValue; + private string m_lastActivatedToolName; + private string m_lastActivatedAreaName; + public void OnPropertyChanged(string name) { CheckDisposed(); @@ -191,15 +217,26 @@ public void OnPropertyChanged(string name) m_lastToolChange = DateTime.Now; } string areaNameForReport = m_propertyTable.GetStringProperty("areaChoice", null); + // Dwell time since the tool was last activated, regardless of whether that + // prior activation actually fired a (throttled) event of its own. + double? dwellSeconds = m_lastToolActivationUtc == DateTime.MinValue + ? (double?)null + : (DateTime.UtcNow - m_lastToolActivationUtc).TotalSeconds; if (!string.IsNullOrWhiteSpace(areaNameForReport) && !m_toolsReportedToday.Contains(toolName)) { m_toolsReportedToday.Add(toolName); - DesktopAnalytics.Analytics.Track("SwitchToTool", new Dictionary + var eventProps = new Dictionary { {"area", areaNameForReport}, {"tool", toolName} - }); + }; + if (dwellSeconds.HasValue) + eventProps["duration"] = dwellSeconds.Value.ToString("F1", CultureInfo.InvariantCulture); + AnalyticsOutbox.Track("SwitchToTool", eventProps); } + m_lastToolActivationUtc = DateTime.UtcNow; + m_lastActivatedToolName = toolName; + m_lastActivatedAreaName = areaNameForReport; break; case "areaChoice": diff --git a/openspec/changes/telemetry-migration-baseline/design.md b/openspec/changes/telemetry-migration-baseline/design.md index b5fadadbd9..e82021d5ee 100644 --- a/openspec/changes/telemetry-migration-baseline/design.md +++ b/openspec/changes/telemetry-migration-baseline/design.md @@ -38,11 +38,22 @@ This makes "durable by default" the only code path; there is no separate "send i *Alternative considered*: keep direct `Analytics.Track` calls for the common case and only route through a queue on a caught delivery failure. Rejected — `Analytics.Track` is fire-and-forget internally (an async call whose result isn't awaited by the caller), so callers have no reliable failure signal to branch on in the first place; funneling everything through the outbox sidesteps that entirely. -### D4 — Cross-process claim via atomic file rename, not a lock manager -Before sending a queued file (whether it's the one just written or one found during a startup sweep), the flush routine attempts to rename it in place to a `.inflight` suffix. `File.Move` is atomic on the same volume and fails if the source is already gone or the destination exists — so a successful rename is a cheap, sufficient claim that no other process is (or already has) handled this file. A failed rename means another process claimed it first; skip and move on. `RetryUtility.Retry` (SIL.Core) wraps only the transient-IO case (e.g. a momentary AV/indexer lock), not the claim race itself, which the rename semantics already resolve. +### D4 — Cross-process claim via file rename plus an exclusive open, not a lock manager +Before sending a queued file (whether it's the one just written or one found during a sweep), the flush routine renames it in place to a `.inflight` suffix, then opens the result with `FileShare.None`. The exclusive open — not the rename — is the real single-owner guarantee (see D14). A failed rename or a failed exclusive open both mean another caller claimed it first; skip and move on. -### D5 — Delivery success = delete the claimed file -After a successful `Analytics.Track`/`ReportException` call (the real package method, called only from inside the flush routine), the `.inflight` file is deleted. If the underlying send throws, the `.inflight` file is renamed back to its original name so a later sweep retries it. This gives at-least-once delivery, not exactly-once: a crash between a successful send and the delete would cause a duplicate on the next sweep. Accepted — see Risks. +**Correction found by this class's own unit tests during implementation**: the original version of this decision assumed a sweep encountering an already-`.inflight` file could only mean a cross-process crash orphan, safe to redeliver directly with no further claim. That's wrong — a fresh `.inflight` file can just as easily be a *different thread's own, still-legitimately-in-progress* delivery (e.g. the enqueue-time immediate-attempt racing a startup sweep, both within the same process), and delivering it a second time while the first delivery is still running is a real double-delivery bug, reproduced by `AnalyticsOutboxTests.Flush_ConcurrentFlushCalls_DeliverEachEventExactlyOnce` before this fix. The corrected rule: a sweep only treats an `.inflight` file as an orphan once it has sat untouched longer than `OrphanClaimStaleness` (5 minutes, comfortably above any realistic delivery duration) — checked via the file's last-write time. A fresh one is left alone. A stale one is claimed via a second rename+exclusive-open (to `.recovering`) before being delivered, so two sweeps racing to recover the *same* stale orphan still can't double-deliver it. `RetryUtility.Retry` (SIL.Core) wraps only the transient-IO case (e.g. a momentary AV/indexer lock) in the initial write, not either claim race, which D14's exclusive-open check resolves. + +**Known, accepted residual gap**: if the process crashes a *second* time inside the narrow window between the `.recovering` claim and that recovery attempt's own completion, that one file could be left permanently stuck as `.recovering`, requiring manual cleanup. Considered acceptable given how rare two crashes in the same delivery attempt would be, and consistent with this design's general approach of documenting narrow residual gaps (see D12) rather than building unbounded generality for vanishingly rare cases. + +### D5 — Delivery "success" = the `Analytics.Track` call didn't throw, gated by a network-availability check first (revised during implementation — see D12) +After a call to `Analytics.Track` (the real package method, called only from inside the flush routine) that doesn't throw, the `.inflight` file is deleted. If it throws, the `.inflight` file is renamed back to its original name so a later sweep retries it. This gives at-least-once delivery, not exactly-once: a crash between a successful send and the delete would cause a duplicate on the next sweep. Accepted — see Risks. **Important caveat found during implementation**: see D12 — `Analytics.Track` is internally fire-and-forget and essentially never throws synchronously regardless of whether Mixpanel was actually reachable, so "didn't throw" alone is not a real delivery confirmation. + +### D12 — Network-availability pre-check (added during implementation) +Decompiling `DesktopAnalytics.dll` confirmed `Analytics.Track`'s internal call chain (`MixpanelClient.Track` → an unawaited `TrackAsync`) never surfaces a delivery failure to the caller — it returns immediately whether or not the network is up. That means D5's "no exception = delivered" signal is not trustworthy on its own: without a further check, the outbox would claim and delete every file on every sweep regardless of actual connectivity, defeating the purpose of durability. + +To keep the design honest without rewriting or replacing the analytics client (out of scope), `AnalyticsOutbox` calls `System.Net.NetworkInformation.NetworkInterface.GetIsNetworkAvailable()` before claiming a file at all. If it returns `false`, the file is left untouched (not even claimed) for a later sweep. If it returns `true`, the outbox proceeds with claim → `Analytics.Track` → delete as before. + +This is a **known, accepted gap, not a full fix**: `GetIsNetworkAvailable()` reports whether *any* network interface is up, not whether Mixpanel's specific endpoint is reachable (a captive portal, corporate proxy, or a Mixpanel-side outage would all read as "available" here and still silently lose the event exactly as today). Closing that gap fully would require either a real synchronous/awaitable send path (a change to how `DesktopAnalytics` is used, or replacing it) or a callback/event hook the package doesn't currently expose — both larger changes than this proposal's scope. What this pre-check buys: the common, high-value case this proposal targets — a user who is fully offline (laptop closed, airplane mode, rural connectivity) — is now handled correctly, which was the motivating scenario. The narrower "online but Mixpanel-unreachable" case remains lossy, same as today, and should be called out explicitly as a follow-up if fully guaranteed delivery is later required. ### D6 — Bounded growth: cap by count and age, evict oldest first The outbox enforces a cap (default: 2,000 files or 30 days of age, whichever is hit first) during every sweep, deleting the oldest excess files without attempting to send them. Exact numbers are tunable constants, not a hard requirement of the spec. @@ -67,14 +78,24 @@ A `SessionId` (`Guid.NewGuid()`) is generated once at startup and stored for the ### D11 — SwitchToTool dwell time via a stored "last activation" timestamp `AreaListener` already has the property-changed handler that fires `SwitchToTool` on tool changes (throttled to once per tool per day). A new field records `DateTime.UtcNow` every time the active tool changes, regardless of whether the throttle allows an event to fire. When the (already-throttled) `SwitchToTool` event does fire, it includes a `duration` property computed from that stored timestamp — i.e., the dwell time since the tool was last activated, not a running total across a whole day (see Risks: this is a deliberate simplification consistent with the existing once-per-day throttle, not a bug). At application shutdown, if a tool is currently active, one additional dwell record is tracked for that final stretch (via `AnalyticsOutbox.Track`, same event shape) so it isn't silently lost. +### D13 — `AnalyticsOutboxTests` marked `[NonParallelizable]` (found during implementation) +`Test.runsettings` sets the NUnit adapter's `NumberOfTestWorkers` to `0` (auto/multi-worker) repo-wide, and VSTest's NUnit adapter runs different fixtures concurrently under that setting by default. `AnalyticsOutboxTests` exercises `AnalyticsOutbox` entirely through static test seams (`AllowTrackingOverride`, `NetworkAvailableOverride`, `TrackDeliveryOverride`, `MaxQueuedFiles`, `MaxQueuedAge`, `OrphanClaimStaleness`) because the real `DesktopAnalytics.Analytics` is a hard process-wide singleton that can't be reconfigured per test. Those seams are themselves static/process-wide, so without `[NonParallelizable]` on the fixture, two of its own test methods running on different worker threads at once corrupt each other's state — a real, observed failure, not a hypothetical one. `[NonParallelizable]` (NUnit 3.14, already the pinned version) forces this fixture to run alone relative to other parallel fixtures, at the cost of this one fixture's tests running sequentially rather than concurrently with the rest of the suite. No repo-wide `Test.runsettings` change was made, since only this fixture's tests share mutable static state. + +### D14 — `File.Move`'s "no exception" is not a reliable exclusivity signal on this platform; an exclusive `FileShare.None` open is the real claim (found during implementation) +Even after fixing D4's staleness gate (D13's fix made `AnalyticsOutboxTests` deterministic enough to isolate this), `Flush_ConcurrentFlushCalls_DeliverEachEventExactlyOnce` still failed non-deterministically (observed delivered counts of 11, 12, 15, 18, 20 against an expected 10 across repeated runs). Root-caused with a standalone repro (net48, outside AnalyticsOutbox entirely): two threads calling `System.IO.File.Move(sameSource, sameDestination)` concurrently — with the destination not existing at the start — can **both return without throwing**, even though the OS performs exactly one physical rename (confirmed by checking on-disk state after the race: exactly one of `{source exists, destination exists}` is ever true, never both, never neither — reproduced in 198/200 and 200/200 trials across two independent repro runs). Varying the destination per caller (so there's no destination-collision path at all) did not change the outcome (199/200), and in that variant exactly one of the two distinct destination files ever appears on disk. In other words: the underlying rename is genuinely atomic, but .NET Framework's `File.Move` wrapper does not reliably surface `FileNotFoundException`/`IOException` to the caller whose rename lost the race — this is a defect in the .NET Framework wrapper's error surfacing, not in NTFS's rename semantics. This invalidated D4 and D13's original assumption (inherited from the initial design pass) that "a successful rename == exclusive ownership." + +**Fix**: `TryClaimExclusive` (replacing `TryClaim`) still performs the rename first (useful regardless, since it relocates the file out of the name a future sweep's `"*.json"` listing would otherwise re-discover), but treats the rename's success as advisory only. The real claim is a subsequent `new FileStream(claimedPath, FileMode.Open, FileAccess.ReadWrite, FileShare.None)` — `FileShare.None` is an OS-enforced advisory lock that Windows does honor correctly under concurrent callers (verified with the same repro methodology: 0/300 anomalies). Only the caller that successfully opens the file exclusively proceeds to read, delivered, and delete it; every other racing caller (whether its own rename attempt "succeeded" or not) fails the exclusive open and backs off. `DeliverClaimedFile` now takes the already-open `FileStream` directly (via a `StreamReader` that owns and disposes it) instead of re-opening the path with `RobustFile.ReadAllText`, since a second, differently-shared open against a path this class itself already holds with `FileShare.None` would fail too. + +**Why this wasn't caught by design-time reasoning**: the atomic-rename-as-claim pattern is a standard, widely-used technique on POSIX (`rename(2)` is a true atomic syscall with well-documented all-or-nothing semantics) and is often assumed to carry over unchanged to Windows/.NET. It does not, at least not reliably through `System.IO.File.Move` on .NET Framework under same-process concurrent callers targeting an identical (or per-caller-unique) destination. Any future FieldWorks code reaching for "rename as a mutex" should use this file's `TryClaimExclusive` pattern (rename optionally, then verify via `FileShare.None`) rather than trusting `File.Move`'s exception behavior alone. + ## Risks / Trade-offs - **[Risk] At-least-once delivery can double-count events** if a process crashes between a successful Mixpanel send and the local file delete → **Mitigation**: accepted trade-off; strictly better than today's silent-loss behavior, and Mixpanel-side usage analysis (adoption trends, crash rates) tolerates small duplicate noise far better than missing data. - **[Risk] Six existing call sites must be migrated to the new facade in one mechanical pass** → **Mitigation**: `AnalyticsOutbox.Track`/`ReportException` match `Analytics.Track`/`ReportException`'s signatures exactly, so the migration is a pure substitution; existing tests around those call sites (where they exist) continue to apply. -- **[Risk] Cross-process sweep introduces a new race not present today** (two processes both attempt the same file) → **Mitigation**: the atomic-rename claim (D4) resolves the race to "one wins, one skips," never a crash or corruption. +- **[Risk] Cross-process sweep introduces a new race not present today** (two processes both attempt the same file) → **Mitigation**: the rename+exclusive-open claim (D4, corrected by D14) resolves the race to "one wins, one skips," never a crash or corruption. - **[Risk] `SwitchToTool` dwell time reflects only the most recent stretch in a tool, not cumulative time across a whole day** (a consequence of preserving the existing once-per-day throttle, per spec) → **Mitigation**: documented here and in the spec; if cumulative dwell time is later wanted, that's a throttle-semantics change outside this proposal's scope. -- **[Risk] `Analytics.AllowTracking` may not be a public member** → **Mitigation**: fallback to `FwApplicationSettings.Reporting.OkToPingBasicUsageData` is specified in D7; either way, consent is re-checked at flush time. - **[Risk] Outbox directory could grow unexpectedly large if Mixpanel is unreachable for a very long time** → **Mitigation**: D6's count/age cap bounds worst-case disk usage to a small, fixed footprint. +- **[Risk, found during implementation — see D12] "Online but Mixpanel-unreachable" still silently loses events** (captive portal, corporate proxy, Mixpanel-side outage) → **Mitigation**: accepted, documented gap. `NetworkInterface.GetIsNetworkAvailable()` only proves *a* network path exists, not that Mixpanel specifically is reachable; closing this fully needs a synchronous/awaitable send path this proposal does not build. The common "genuinely offline" case (this proposal's actual target) is handled correctly. ## Migration Plan @@ -82,7 +103,8 @@ No user-facing or data-model migration: the outbox directory is new local state ## Open Questions -- Whether `Analytics.AllowTracking` (found via decompilation of `DesktopAnalytics.dll` v4.0.0) is accessible outside the package, or whether `AnalyticsOutbox` must rely solely on `FwApplicationSettings.Reporting.OkToPingBasicUsageData` — verify during implementation (task-level, not blocking). -- Exact `RobustFile`/`RobustIO` (SIL.Core.Desktop) method names available for the write/delete/rename operations in D1/D4/D5 — verify against the currently installed `SIL.Core.Desktop` version; fall back to plain `System.IO` calls wrapped in `RetryUtility.Retry` if a needed method isn't present. -- Whether `SessionId` should additionally ride along as an `Analytics.SetApplicationProperty("SessionId", ...)` so every other event (not just session-start/end) is automatically correlated to a session — not required by the spec as written; left as an implementation-time judgment call since it doesn't change any requirement's observable behavior. +- ~~Whether `Analytics.AllowTracking` is accessible outside the package~~ — **Resolved**: confirmed via decompilation of the installed `DesktopAnalytics.dll` v4.0.0 (`packages/sil.desktopanalytics/4.0.0/lib/net461/`) to be `public static bool AllowTracking { get; private set; }`. `AnalyticsOutbox` uses it directly; the `FwApplicationSettings` fallback described in D7 is not needed. +- ~~Exact `RobustFile`/`RobustIO` method names~~ — **Resolved**: they live in `SIL.Core` (namespace `SIL.IO`), not `SIL.Core.Desktop` as originally assumed — confirmed via decompilation of the actual pinned version (`18.0.0-beta0027`). Exact members used: `RobustFile.WriteAllText`, `RobustFile.ReadAllText`, `RobustFile.Move(string, string)`/`(string, string, bool)`, `RobustFile.Delete`, `RobustFile.Exists`; `RobustIO.EnumerateFilesInDirectory`; `SIL.Code.RetryUtility.Retry`. **Correction, caught by a failing unit test**: `RobustIO.RequireThatDirectoryExists` does not create the directory - it *asserts* the directory already exists and throws `ArgumentException` if not (confirmed by decompiling its body). Directory creation uses plain `System.IO.Directory.CreateDirectory` (idempotent) instead. +- **New, found during implementation**: `Analytics.Track`'s fire-and-forget internals mean it essentially never throws synchronously regardless of real delivery outcome — see D12 for the network-availability pre-check added to compensate, and its documented residual gap. +- Whether `SessionId` should additionally ride along as an `Analytics.SetApplicationProperty("SessionId", ...)` so every other event (not just session-start/end) is automatically correlated to a session — not required by the spec as written; left as an implementation-time judgment call since it doesn't change any requirement's observable behavior. **Decision for this implementation pass: not added**, to keep the enrichment scoped exactly to what the spec requires. - Final tunable values for D6's cap (2,000 files / 30 days are illustrative defaults) — open to adjustment without a spec change. diff --git a/openspec/changes/telemetry-migration-baseline/tasks.md b/openspec/changes/telemetry-migration-baseline/tasks.md index d566a674ac..516160cbab 100644 --- a/openspec/changes/telemetry-migration-baseline/tasks.md +++ b/openspec/changes/telemetry-migration-baseline/tasks.md @@ -1,45 +1,47 @@ ## 1. Outbox core (managed, `Src/Common/FwUtils/`) -- [ ] 1.1 Verify `Analytics.AllowTracking` accessibility and exact `RobustFile`/`RobustIO`/`RetryUtility` method signatures against the installed `SIL.Core.Desktop`/`SIL.DesktopAnalytics` package versions (resolves design.md's two Open Questions); note the resolution in a code comment at the call site. -- [ ] 1.2 Add `Src/Common/FwUtils/AnalyticsOutbox.cs`: static class with `Track(string eventName, Dictionary properties)` and `ReportException(Exception exception, Dictionary moreProperties)`, matching `Analytics`'s method shapes (design.md D3). -- [ ] 1.3 Implement consent check (design.md D7) shared by both methods, with the `AllowTracking`-or-fallback-to-`OkToPingBasicUsageData` logic resolved in 1.1. -- [ ] 1.4 Implement outbox file write: serialize event kind, name, properties, and queued-at timestamp to `{UtcTicks:D19}_{Guid:N}.json` under `FwDirectoryFinder.UserAppDataFolder("FieldWorks")\Analytics\Outbox\`, creating the directory if absent (design.md D1, D2). -- [ ] 1.5 Implement the atomic-rename claim (`.inflight` suffix) and post-send delete/rollback logic (design.md D4, D5). -- [ ] 1.6 Implement the bounded-growth sweep: delete oldest files beyond the count/age cap before attempting delivery (design.md D6). -- [ ] 1.7 Implement queued-exception replay via `Analytics.Track("QueuedExceptionReport", ...)` instead of `Analytics.ReportException` (design.md D8). -- [ ] 1.8 Wire the three flush triggers: fire-and-forget flush of the just-written file, full startup sweep, and a bounded (≤2s) shutdown sweep (design.md D9). -- [ ] 1.9 Unit tests (new `Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs`, run via `.\test.ps1`): consent-off write is a no-op; a written file survives a simulated restart (re-instantiate the outbox against the same directory); FIFO ordering of multiple queued files; cap eviction deletes oldest first; a claimed-but-failed send restores the original filename; a claim race (two attempts on the same file) results in exactly one winner. +- [x] 1.1 Verify `Analytics.AllowTracking` accessibility and exact `RobustFile`/`RobustIO`/`RetryUtility` method signatures against the installed `SIL.Core.Desktop`/`SIL.DesktopAnalytics` package versions (resolves design.md's two Open Questions); note the resolution in a code comment at the call site. +- [x] 1.2 Add `Src/Common/FwUtils/AnalyticsOutbox.cs`: static class with `Track(string eventName, Dictionary properties)` and `ReportException(Exception exception, Dictionary moreProperties)`, matching `Analytics`'s method shapes (design.md D3). +- [x] 1.3 Implement consent check (design.md D7) shared by both methods, with the `AllowTracking`-or-fallback-to-`OkToPingBasicUsageData` logic resolved in 1.1. +- [x] 1.4 Implement outbox file write: serialize event kind, name, properties, and queued-at timestamp to `{UtcTicks:D19}_{Guid:N}.json` under `FwDirectoryFinder.UserAppDataFolder("FieldWorks")\Analytics\Outbox\`, creating the directory if absent (design.md D1, D2). +- [x] 1.5 Implement the claim (`.inflight` rename + `FileShare.None` exclusive open, per D4/D14 — see below) and post-send delete/rollback logic (design.md D4, D5). +- [x] 1.6 Implement the bounded-growth sweep: delete oldest files beyond the count/age cap before attempting delivery (design.md D6). +- [x] 1.7 Implement queued-exception replay via `Analytics.Track("QueuedExceptionReport", ...)` instead of `Analytics.ReportException` (design.md D8). +- [x] 1.8 Wire the three flush triggers: fire-and-forget flush of the just-written file, full startup sweep, and a bounded (≤2s) shutdown sweep (design.md D9). +- [x] 1.9 Unit tests (`Src/Common/FwUtils/FwUtilsTests/AnalyticsOutboxTests.cs`, 15 tests, run via `.\test.ps1`): consent-off write is a no-op; a queued file is delivered on a later flush once network becomes available (the "survives a simulated restart" scenario); FIFO ordering; cap eviction (count and age) deletes without delivering; a claimed-but-failed send restores the original filename; concurrent flushes deliver each event exactly once. + - **Found and fixed during implementation, not anticipated in the original design**: `System.IO.File.Move` does not reliably report failure to the "losing" thread when two callers race an identical rename — both can return without throwing even though the OS performs exactly one physical rename (verified with a standalone repro outside this codebase: 198–200/200 trials). The original rename-only claim design (D4) was corrected to add a `FileShare.None` exclusive open as the real single-owner check (D14). Root-caused via `AnalyticsOutboxTests.Flush_ConcurrentFlushCalls_DeliverEachEventExactlyOnce`, which failed intermittently (11–20 deliveries instead of 10) until this fix; now passes deterministically across 5+ repeated runs. + - Marked the fixture `[NonParallelizable]` (D13): its test seams are static/process-wide (the real `DesktopAnalytics.Analytics` singleton can't be reconfigured per test), and VSTest's NUnit adapter runs fixtures concurrently by default per this repo's `Test.runsettings`. ## 2. Migrate existing call sites to `AnalyticsOutbox` -- [ ] 2.1 `Src/Common/FieldWorks/FieldWorks.cs`: replace the three `Analytics.ReportException` calls with `AnalyticsOutbox.ReportException`. -- [ ] 2.2 `Src/LexText/LexTextDll/AreaListener.cs`: replace the `Analytics.Track("SwitchToTool", ...)` call with `AnalyticsOutbox.Track` (coordinate with section 4's dwell-time change to avoid two separate edits to the same call). -- [ ] 2.3 `Src/Common/FwUtils/TrackingHelper.cs`: replace all three `Analytics.Track` calls (`TrackImport`/`TrackExport`/`TrackHelpRequest`) with `AnalyticsOutbox.Track`. -- [ ] 2.4 `Src/LexText/Interlinear/ConcordanceContainer.cs` and `Src/LexText/Interlinear/ConfigureInterlinDialog.cs`: replace their `Analytics.Track` calls with `AnalyticsOutbox.Track`. -- [ ] 2.5 `Src/Common/Controls/FwControls/ObtainProjectMethod.cs`: replace its `Analytics.Track` call with `AnalyticsOutbox.Track`. -- [ ] 2.6 Repo-wide grep for `Analytics.Track(` and `Analytics.ReportException(` outside `AnalyticsOutbox.cs` itself to confirm no call site was missed. +- [x] 2.1 `Src/Common/FieldWorks/FieldWorks.cs`: replace the three `Analytics.ReportException` calls with `AnalyticsOutbox.ReportException`. +- [x] 2.2 `Src/LexText/LexTextDll/AreaListener.cs`: replace the `Analytics.Track("SwitchToTool", ...)` call with `AnalyticsOutbox.Track` (coordinate with section 4's dwell-time change to avoid two separate edits to the same call). +- [x] 2.3 `Src/Common/FwUtils/TrackingHelper.cs`: replace all three `Analytics.Track` calls (`TrackImport`/`TrackExport`/`TrackHelpRequest`) with `AnalyticsOutbox.Track`. +- [x] 2.4 `Src/LexText/Interlinear/ConcordanceContainer.cs` and `Src/LexText/Interlinear/ConfigureInterlinDialog.cs`: replace their `Analytics.Track` calls with `AnalyticsOutbox.Track`. +- [x] 2.5 `Src/Common/Controls/FwControls/ObtainProjectMethod.cs`: replace its `Analytics.Track` call with `AnalyticsOutbox.Track`. +- [x] 2.6 Repo-wide grep for `Analytics.Track(` and `Analytics.ReportException(` outside `AnalyticsOutbox.cs` itself to confirm no call site was missed. (Zero matches outside `AnalyticsOutbox.cs`.) ## 3. Session baseline (`Src/Common/FieldWorks/FieldWorks.cs`) -- [ ] 3.1 Generate a `SessionId` (`Guid.NewGuid()`) once at startup and store it for the process lifetime. -- [ ] 3.2 Track a session-start event via `AnalyticsOutbox.Track` immediately after the `Analytics` instance is constructed, carrying `SessionId`. -- [ ] 3.3 Add the `Interlocked.CompareExchange`-guarded single-terminal-event flag (design.md D10). -- [ ] 3.4 Track a `clean`-outcome session-end event at the normal fall-through point of the `using (new Analytics(...))` block, including duration since session-start. -- [ ] 3.5 Track a `crashed`-outcome session-end event from `HandleUnhandledException`/`HandleTopLevelError`, including duration since session-start. -- [ ] 3.6 Unit/integration tests: clean shutdown produces exactly one `clean` session-end; a simulated unhandled exception produces exactly one `crashed` session-end and suppresses the clean-path event; duration reflects elapsed time in both cases. +- [x] 3.1 Generate a `SessionId` (`Guid.NewGuid()`) once at startup and store it for the process lifetime. +- [x] 3.2 Track a session-start event via `AnalyticsOutbox.Track` immediately after the `Analytics` instance is constructed, carrying `SessionId`. +- [x] 3.3 Add the `Interlocked.CompareExchange`-guarded single-terminal-event flag (design.md D10). +- [x] 3.4 Track a `clean`-outcome session-end event at the normal fall-through point of the `using (new Analytics(...))` block, including duration since session-start. +- [x] 3.5 Track a `crashed`-outcome session-end event from `HandleUnhandledException`/`HandleTopLevelError`, including duration since session-start. +- [ ] 3.6 **GAP — not implemented.** Unit/integration tests for session-start/end tracking. `FieldWorks.cs`'s session logic (`s_sessionId`, `TryReportSessionEnd`, etc.) is exercised only by `AnalyticsOutboxTests`' generic delivery paths, not by a test that drives `FieldWorks`'s own startup/shutdown/crash code paths directly. `FieldWorksTests` has no `InternalsVisibleTo` access to `AnalyticsOutbox`'s test seams, and `FieldWorks.Main`'s startup/shutdown sequence is not currently structured for isolated unit testing (it constructs a real `Analytics` singleton, which can only happen once per process). Recommended follow-up: either extract the session-lifecycle logic into a small, independently-testable class, or cover this via manual verification (see 5.3/5.4) until then. ## 4. Usage enrichment (`UiFramework` property + `SwitchToTool` dwell time) -- [ ] 4.1 Call `Analytics.SetApplicationProperty("UiFramework", "WinForms")` once at startup in `FieldWorks.cs`, right after `Analytics` construction. -- [ ] 4.2 Test that any event tracked after startup carries `UiFramework=WinForms` (verifies the property attaches automatically without touching individual `Track` call sites). -- [ ] 4.3 In `AreaListener.cs`, add a field tracking the last tool-activation timestamp, updated on every tool change regardless of the once-per-day throttle. -- [ ] 4.4 When the throttled `SwitchToTool` event does fire, include a `duration` property computed from the stored timestamp (design.md D11), without changing the existing `area`/`tool` properties. -- [ ] 4.5 Add a final dwell-time record at application shutdown for whatever tool is currently active, using the same `AnalyticsOutbox.Track` call shape. -- [ ] 4.6 Unit tests (extend `Src/LexText/LexTextDll` test project or nearest existing `AreaListener` test fixture): repeated same-day tool revisits still throttle to one event but carry a duration; a switch after a measurable delay reports a duration reflecting that delay; shutdown while a tool is active produces one final dwell record. +- [x] 4.1 Call `Analytics.SetApplicationProperty("UiFramework", "WinForms")` once at startup in `FieldWorks.cs`, right after `Analytics` construction. +- [ ] 4.2 **GAP — not implemented.** No automated test verifies that events tracked after startup carry `UiFramework=WinForms`, because `SetApplicationProperty`'s effect is entirely internal to the `DesktopAnalytics` package (it tags outgoing Mixpanel payloads, which this codebase has no visibility into after the fact) and requires a live `Analytics` singleton to exercise at all. Verifiable only by manual/Mixpanel-side inspection (see 5.3/5.4). +- [x] 4.3 In `AreaListener.cs`, add a field tracking the last tool-activation timestamp, updated on every tool change regardless of the once-per-day throttle. +- [x] 4.4 When the throttled `SwitchToTool` event does fire, include a `duration` property computed from the stored timestamp (design.md D11), without changing the existing `area`/`tool` properties. +- [x] 4.5 Add a final dwell-time record at application shutdown for whatever tool is currently active, using the same `AnalyticsOutbox.Track` call shape. +- [ ] 4.6 **GAP — not implemented.** No unit test covers the dwell-time behavior itself (repeated same-day revisits still throttle but carry a duration; a delayed switch reports that delay; shutdown while a tool is active produces one final dwell record). `LexTextDllTests` (where `AreaListenerTests.cs` lives) has no `InternalsVisibleTo` access to `AnalyticsOutbox`'s test seams, so a test there cannot intercept what `AreaListener` actually sends without either (a) adding `InternalsVisibleTo("LexTextDllTests")` to `FwUtils.csproj` and threading the seam through, or (b) extracting the dwell-time computation into a small pure function in `AreaListener.cs` that can be unit-tested without going through `AnalyticsOutbox` at all. Recommended as a fast follow-up — the logic itself (`m_lastToolActivationUtc` diff) is simple and low-risk, but currently unverified by an automated test. ## 5. Full verification -- [ ] 5.1 Run `.\build.ps1` for the full managed build (no native changes expected; confirm no native rebuild is triggered). -- [ ] 5.2 Run `.\test.ps1` for the affected test projects (`FwUtilsTests`, `LexTextDll` tests, and any interlinear/common-controls tests touched by call-site migration) and confirm zero regressions. -- [ ] 5.3 Manually exercise the outbox end-to-end: disconnect network, use FieldWorks to trigger a few tracked events, confirm files accumulate under `%LocalAppData%\SIL\FieldWorks\Analytics\Outbox\`, reconnect, relaunch, and confirm the directory empties (via logging or a debugger breakpoint, since there's no UI surface for this). -- [ ] 5.4 Confirm the existing Tools > Options > Privacy checkbox (`OkToPingBasicUsageData`) still gates all telemetry, including newly-queued-but-not-yet-flushed events, per design.md D7. +- [x] 5.1 Run `.\build.ps1` for the full managed build (no native changes expected; confirmed no native rebuild was triggered — only managed projects relinked). +- [x] 5.2 Run `.\test.ps1` for the affected test projects and confirm zero regressions: `FwUtilsTests` (390/390 passed, includes the 15 new `AnalyticsOutboxTests`), `LexTextDllTests` (3/3), `FieldWorksTests` (34/34, 1 pre-existing skip), `ITextDllTests` (207/207, 1 pre-existing skip). +- [ ] 5.3 **MANUAL — not performed.** Exercise the outbox end-to-end in a real FieldWorks session: disconnect network, use FieldWorks to trigger a few tracked events, confirm files accumulate under `%LocalAppData%\SIL\FieldWorks\Analytics\Outbox\`, reconnect, relaunch, and confirm the directory empties. Needs a real FieldWorks launch with `OkToPingBasicUsageData` enabled; not covered by the automated suite. +- [ ] 5.4 **MANUAL — not performed.** Confirm the existing Tools > Options > Privacy checkbox (`OkToPingBasicUsageData`) still gates all telemetry, including newly-queued-but-not-yet-flushed events, per design.md D7. Needs a real FieldWorks launch to toggle the setting and observe behavior.