From e080255742b368bed16412a8e91ba271a6bcd800 Mon Sep 17 00:00:00 2001 From: Aaron Boodman Date: Wed, 13 May 2026 06:30:58 -1000 Subject: [PATCH 1/4] wip --- contents/docs/release-notes/1.6.mdx | 47 +++++++++++++++++++++++++++ contents/docs/release-notes/index.mdx | 1 + 2 files changed, 48 insertions(+) create mode 100644 contents/docs/release-notes/1.6.mdx diff --git a/contents/docs/release-notes/1.6.mdx b/contents/docs/release-notes/1.6.mdx new file mode 100644 index 00000000..a2a1bc65 --- /dev/null +++ b/contents/docs/release-notes/1.6.mdx @@ -0,0 +1,47 @@ +--- +title: Zero 1.6 +description: Postgres 17 Replication Failover and Performance +--- + +## Installation + +```bash +npm install @rocicorp/zero@1.6 +``` + +## Features + +- [**Postgres 17 Logical Replication Failover:**](https://github.com/rocicorp/mono/pull/5934) `zero-cache` now creates replication slots with an ordinal naming scheme (e.g. `zero_0_a`, `zero_0_b`) so they can be registered with [`synchronized_standby_slots`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONIZED-STANDBY-SLOTS) for [logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html). _TODO: add docs link._ +- [**Litestream Region:**](https://github.com/rocicorp/mono/pull/5933) Added [`ZERO_LITESTREAM_REGION`](/docs/zero-cache-config#litestream-region) for deployments in non-standard AWS partitions like GovCloud (thanks [@ericykim](https://github.com/ericykim)!). +- [**Optional `args` in custom queries/mutators:**](https://github.com/rocicorp/mono/pull/5945) Custom query and mutator execution functions now treat `args` as optional when the args type already allows `undefined` (thanks [@0xcadams](https://github.com/0xcadams)!). 
+ +## Performance + +- [Faster `EXISTS` subqueries via the new `Cap` operator, which lets SQLite skip `ORDER BY` for non-flipped `EXISTS` children](https://github.com/rocicorp/mono/pull/5943) +- [Bulk-insertion optimization in Replicache via `putMany`, speeding up large sync patches (3-5x faster for typical sync batches, up to 53x for construction)](https://github.com/rocicorp/mono/pull/5380) +- [Batch deletes and upserts in `SQLiteStore` writes (~7-9x faster on 1000-put commits)](https://github.com/rocicorp/mono/pull/5915) +- [Parallelize I/O during pull and rebase](https://github.com/rocicorp/mono/pull/5926) +- [Heap-based k-way merge in `fetchMergeSort` (O(log K) per row vs O(K))](https://github.com/rocicorp/mono/pull/5921), [with a new `mergeSortedStreams` utility](https://github.com/rocicorp/mono/pull/5917) +- [Initial sync progress reporting uses `pg_class` estimates instead of full table scans](https://github.com/rocicorp/mono/pull/5932) +- [De-dupe SQLite requests in flip-join when children want the same parent](https://github.com/rocicorp/mono/pull/5918) + +## Fixes + +- [Returning to an app after stale-tab GC or CVR purge caused a full page reload; now the Zero instance rotates in place](https://github.com/rocicorp/mono/pull/5903) +- [`"Row already exists"` errors after an IVM advance failure could mask the original error and continue through corrupt branch state](https://github.com/rocicorp/mono/pull/5910), [also fixed for `IVMBranch.fork()`](https://github.com/rocicorp/mono/pull/5916) +- [`"Row already exists"` assertion failures during poke processing caused by `putMany` rebalancing duplicating entries across adjacent BTree children](https://github.com/rocicorp/mono/pull/5923) +- [Initial sync could fail or take hours on large databases because progress reporting did full `COUNT(*)` and `SUM(pg_column_size(...))` scans](https://github.com/rocicorp/mono/pull/5932) +- [Deadlock between post-initial-sync `changeLog` reset and a live replication-manager 
during non-disruptive resync](https://github.com/rocicorp/mono/pull/5953) +- [Zombie `ViewSyncer`s could accumulate in the `active-client-groups` metric when clients disconnected before `initConnection` resolved](https://github.com/rocicorp/mono/pull/5907) +- [`ConcurrentModificationException` is now classified as a Rehome so the client reconnects instead of erroring](https://github.com/rocicorp/mono/pull/5930) +- [`zero-cache` startup errors during change-streamer init were not published to subscribers](https://github.com/rocicorp/mono/pull/5956) +- [`TypeError: Expected string at context.query. Got null` when handling DDL events with `NULL current_query()`](https://github.com/rocicorp/mono/pull/5944) +- [Repeated initial-sync failures could exhaust the replication-slot name pool; cleanup now runs preemptively under the management lock](https://github.com/rocicorp/mono/pull/5947), [and inactive slots are deleted together with their `replicas` row so a stuck slot doesn't keep claiming a name](https://github.com/rocicorp/mono/pull/5948) +- [Replication slot creation timeouts crashed the server during backfill retries; backfill timeouts now only error after the maximum retry backoff is reached](https://github.com/rocicorp/mono/pull/5901) +- [Shadow sync threw when a synced table could not be queried by ZQL; it now silently ignores the table to match prod behavior](https://github.com/rocicorp/mono/pull/5950) +- [WebSocket errors are now logged as warnings instead of errors, since they reflect client or upstream issues rather than server faults](https://github.com/rocicorp/mono/pull/5842) +- [Inspector now caches AST and metrics for deleted queries so they remain accessible after eviction](https://github.com/rocicorp/mono/pull/5924) + +## Breaking Changes + +- [**Inspector per-query hydration metrics format changed:**](https://github.com/rocicorp/mono/pull/5924) Per-query hydration metrics (`query-hydration-server-ms`) are now reported as a plain number (most-recent 
hydration time in ms) instead of a TDigest histogram, and the per-query metrics type was renamed from `ServerMetrics` to `QueryServerMetrics`. If you have custom tooling reading inspector metrics, you'll need to update it. The protocol version was bumped from 50 to 51 to reflect this; `MIN_SERVER_SUPPORTED_SYNC_PROTOCOL` remains at 30, so 1.6 servers remain compatible with older clients. diff --git a/contents/docs/release-notes/index.mdx b/contents/docs/release-notes/index.mdx index a60cf62e..dde5594a 100644 --- a/contents/docs/release-notes/index.mdx +++ b/contents/docs/release-notes/index.mdx @@ -2,6 +2,7 @@ title: Release Notes --- +- [Zero 1.6: Postgres 17 Replication Failover and Performance](/docs/release-notes/1.6) - [Zero 1.5: Schema Change Improvements and Client Group Auth](/docs/release-notes/1.5) - [Zero 1.4: Performance and Reliability Improvements](/docs/release-notes/1.4) - [Zero 1.3: Faster Initial Sync and Other Perf Improvements](/docs/release-notes/1.3) From 864de0760c5bb28fcbd7d14ce7a9272652ce4e95 Mon Sep 17 00:00:00 2001 From: Aaron Boodman Date: Tue, 2 Jun 2026 15:26:07 -1000 Subject: [PATCH 2/4] spruce --- contents/docs/connecting-to-postgres.mdx | 39 ++++++++++++++-- contents/docs/otel.mdx | 2 + contents/docs/release-notes/1.6.mdx | 58 ++++++++++++++---------- contents/docs/zero-cache-config.mdx | 15 ++++++ 4 files changed, 88 insertions(+), 26 deletions(-) diff --git a/contents/docs/connecting-to-postgres.mdx b/contents/docs/connecting-to-postgres.mdx index ac447948..17b7f053 100644 --- a/contents/docs/connecting-to-postgres.mdx +++ b/contents/docs/connecting-to-postgres.mdx @@ -71,13 +71,40 @@ This configuration can cause problems like `slot has been invalidated because it ### PlanetScale for Postgres -You should use the `default` role that PlanetScale provides, because PlanetScale user-defined roles cannot create replication slots. 
+#### Roles -Planetscale Postgres defaults `max_connections` to 25, which can easily be exhausted by Zero's connection pools. This will result in an error like `remaining connection slots are reserved for roles with the SUPERUSER attribute`. -You should increase this value in the Parameters section of the PlanetScale dashboard to 100 or more. +`zero-cache` should connect using the `default` role that PlanetScale provides, because PlanetScale user-defined roles cannot create replication slots. + +#### Connection Limits + +Change `max_connections` to at least 100. The default is 25, which is too low for Zero in most configurations. + +#### Connections Make sure to only use a direct connection for the `ZERO_UPSTREAM_DB`, and use pooled URLs for `ZERO_CVR_DB`, `ZERO_CHANGE_DB`, and your API (see [Deployment](/docs/self-host)). +#### High Availability and Failover + +PlanetScale Postgres can fail over to a standby (during maintenance, switchover, or an outage). By default a logical replication slot does **not** survive promotion of a standby, so after a failover zero-cache would find its slot missing and re-sync every replica from scratch. + +To avoid this, register Zero's replication slots with PlanetScale's failover-slot preservation, which is built on [Postgres 17 failover slots](https://www.postgresql.org/docs/current/logical-replication-failover.html). PlanetScale keeps a synced copy of each registered slot on the standby, so after a failover the slot is already present on the new primary and zero-cache reconnects without re-syncing. + +First, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover) (Postgres 17+) so it creates replication slots with the Postgres `failover` flag set. 
Slots are only flagged when they are created, so if you are upgrading an existing deployment, see [PlanetScale Replication Failover](/docs/release-notes/1.6#planetscale-replication-failover) in the 1.6 release notes for how to roll this out. + +Zero names its slots `{ZERO_APP_ID}_{shard}_a` through `_z` — for the default app ID and shard, that's `zero_0_a`, `zero_0_b`, … `zero_0_z`. In **Cluster configuration → Parameters** in the PlanetScale dashboard: + +1. Under the **Failover** section, add Zero's slot names as a comma-delimited list. Registering the full `zero_0_a` … `zero_0_z` range covers slot rotation. +2. Set `sync_replication_slots = on` and `hot_standby_feedback = on`. +3. Apply the queued configuration changes. + +After zero-cache has connected, confirm the slots are marked for failover: + +```sql +SELECT slot_name, failover, synced FROM pg_replication_slots; +``` + +`failover` should be `true` for Zero's active slot. A slot only becomes failover-eligible after its consumer has advanced it at least once while the standby is syncing, so a brand-new or idle slot can still be lost if a failover races it. + ### Neon #### Logical Replication @@ -151,6 +178,10 @@ difficult. [Hetzner](https://www.hetzner.com/) offers cheap hosted VPS that supp IPv4 addresses are only supported on the Pro plan and are an extra $4/month. +#### High Availability + +Zero does not support Supabase's high-availability automatic failover. Supabase does not currently expose the replication-slot failover configuration Zero needs, so a promotion would orphan Zero's replication slot and force a full resync. If you need this, [reach out on Discord](https://discord.rocicorp.dev/). + ### Render Render _can_ work with Zero, but requires admin/support-side setup, and does not support a few core Zero features. @@ -161,6 +192,8 @@ You also must ensure `wal_level=logical` by creating a Render support ticket. 
Render does not provide superuser access, but you can submit another support ticket to ask Render to create a publication with `FOR ALL TABLES` for you, and then set that publication in [App Publications](/docs/zero-cache-config#app-publications). +Zero does not support Render's high availability (HA). Render's standby replicates asynchronously, so a failover can drop the most recent writes — which is incompatible with a sync engine like Zero that must never miss a change. Do not enable HA for a database used as a Zero upstream. + ### Google Cloud SQL Zero works with Google Cloud SQL out of the box. In many configurations, when you connect with a user that has sufficient privileges, `zero-cache` will create its default publication automatically. diff --git a/contents/docs/otel.mdx b/contents/docs/otel.mdx index 6acea036..f40a86ef 100644 --- a/contents/docs/otel.mdx +++ b/contents/docs/otel.mdx @@ -147,6 +147,8 @@ This callback is called before sending WebSocket messages that trigger API serve | `total_lag` | Gauge | ms | End-to-end replication latency. Grows as an estimate if the next report hasn't arrived | | `events` | Counter | | Number of replication events processed | | `transactions` | Counter | | Count of replicated transactions | +| `shadow-sync-runs` | Counter | | Number of [shadow initial-sync](/docs/zero-cache-config#shadow-sync-enabled) runs. Has a `result` attribute: `success`, `error` | +| `shadow-sync-duration` | Histogram | s | Wall-clock duration of a shadow initial-sync run. 
Has a `result` attribute: `success`, `error` | ### zero.sync diff --git a/contents/docs/release-notes/1.6.mdx b/contents/docs/release-notes/1.6.mdx index a2a1bc65..f2f2acc1 100644 --- a/contents/docs/release-notes/1.6.mdx +++ b/contents/docs/release-notes/1.6.mdx @@ -9,39 +9,51 @@ description: Postgres 17 Replication Failover and Performance npm install @rocicorp/zero@1.6 ``` +## Upgrading + +### PlanetScale Replication Failover + +Previous Zero versions lost replication slots after a PlanetScale Postgres failover, forcing resync. Zero 1.6 fixes this problem. To enable support, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover). You must then do a resync, and register the full range of slots in PlanetScale's failover configuration. See [High Availability and Failover](/docs/connecting-to-postgres#high-availability-and-failover) for the full setup. + ## Features -- [**Postgres 17 Logical Replication Failover:**](https://github.com/rocicorp/mono/pull/5934) `zero-cache` now creates replication slots with an ordinal naming scheme (e.g. `zero_0_a`, `zero_0_b`) so they can be registered with [`synchronized_standby_slots`](https://www.postgresql.org/docs/current/runtime-config-replication.html#GUC-SYNCHRONIZED-STANDBY-SLOTS) for [logical replication failover](https://www.postgresql.org/docs/current/logical-replication-failover.html). _TODO: add docs link._ -- [**Litestream Region:**](https://github.com/rocicorp/mono/pull/5933) Added [`ZERO_LITESTREAM_REGION`](/docs/zero-cache-config#litestream-region) for deployments in non-standard AWS partitions like GovCloud (thanks [@ericykim](https://github.com/ericykim)!). -- [**Optional `args` in custom queries/mutators:**](https://github.com/rocicorp/mono/pull/5945) Custom query and mutator execution functions now treat `args` as optional when the args type already allows `undefined` (thanks [@0xcadams](https://github.com/0xcadams)!). 
+- [**Litestream Region:**](/docs/zero-cache-config#litestream-region) Added support for deployments in non-standard AWS partitions like GovCloud (thanks [@ericykim](https://github.com/ericykim)!). ## Performance -- [Faster `EXISTS` subqueries via the new `Cap` operator, which lets SQLite skip `ORDER BY` for non-flipped `EXISTS` children](https://github.com/rocicorp/mono/pull/5943) -- [Bulk-insertion optimization in Replicache via `putMany`, speeding up large sync patches (3-5x faster for typical sync batches, up to 53x for construction)](https://github.com/rocicorp/mono/pull/5380) -- [Batch deletes and upserts in `SQLiteStore` writes (~7-9x faster on 1000-put commits)](https://github.com/rocicorp/mono/pull/5915) +- [~5% CPU reduction in replication-manager benchmarks by reusing stringified payloads across subscribers](https://github.com/rocicorp/mono/pull/5900) +- [Faster `EXISTS` subqueries via the new `Cap` operator — can turn exists-heavy queries that previously timed out into ones that complete in milliseconds](https://github.com/rocicorp/mono/pull/5943) +- [Bulk-insertion optimization in Replicache via `putMany`, speeding up large sync patches (3-5x faster for typical sync batches, up to 50x+ for large preloads)](https://github.com/rocicorp/mono/pull/5380) +- [Batch deletes/upserts in `SQLiteStore` writes (~8x faster on 1k-put commits)](https://github.com/rocicorp/mono/pull/5915) +- [Batch concurrent `SQLiteStore` `get`/`has` reads into fewer database queries](https://github.com/rocicorp/mono/pull/5958) - [Parallelize I/O during pull and rebase](https://github.com/rocicorp/mono/pull/5926) -- [Heap-based k-way merge in `fetchMergeSort` (O(log K) per row vs O(K))](https://github.com/rocicorp/mono/pull/5921), [with a new `mergeSortedStreams` utility](https://github.com/rocicorp/mono/pull/5917) -- [Initial sync progress reporting uses `pg_class` estimates instead of full table scans](https://github.com/rocicorp/mono/pull/5932) +- [Heap-based k-way merge in 
`fetchMergeSort` (O(log K) per row vs O(K))](https://github.com/rocicorp/mono/pull/5921) +- [Initial sync progress reporting uses `pg_class` estimates instead of full scans](https://github.com/rocicorp/mono/pull/5932) - [De-dupe SQLite requests in flip-join when children want the same parent](https://github.com/rocicorp/mono/pull/5918) +- [`zero-sqlite3` now gathers up to 128 STAT4 samples per index, improving SQLite query planning for skewed indexed data](https://github.com/rocicorp/mono/pull/5913) ## Fixes -- [Returning to an app after stale-tab GC or CVR purge caused a full page reload; now the Zero instance rotates in place](https://github.com/rocicorp/mono/pull/5903) -- [`"Row already exists"` errors after an IVM advance failure could mask the original error and continue through corrupt branch state](https://github.com/rocicorp/mono/pull/5910), [also fixed for `IVMBranch.fork()`](https://github.com/rocicorp/mono/pull/5916) -- [`"Row already exists"` assertion failures during poke processing caused by `putMany` rebalancing duplicating entries across adjacent BTree children](https://github.com/rocicorp/mono/pull/5923) -- [Initial sync could fail or take hours on large databases because progress reporting did full `COUNT(*)` and `SUM(pg_column_size(...))` scans](https://github.com/rocicorp/mono/pull/5932) -- [Deadlock between post-initial-sync `changeLog` reset and a live replication-manager during non-disruptive resync](https://github.com/rocicorp/mono/pull/5953) -- [Zombie `ViewSyncer`s could accumulate in the `active-client-groups` metric when clients disconnected before `initConnection` resolved](https://github.com/rocicorp/mono/pull/5907) -- [`ConcurrentModificationException` is now classified as a Rehome so the client reconnects instead of erroring](https://github.com/rocicorp/mono/pull/5930) -- [`zero-cache` startup errors during change-streamer init were not published to subscribers](https://github.com/rocicorp/mono/pull/5956) -- [`TypeError: Expected 
string at context.query. Got null` when handling DDL events with `NULL current_query()`](https://github.com/rocicorp/mono/pull/5944) -- [Repeated initial-sync failures could exhaust the replication-slot name pool; cleanup now runs preemptively under the management lock](https://github.com/rocicorp/mono/pull/5947), [and inactive slots are deleted together with their `replicas` row so a stuck slot doesn't keep claiming a name](https://github.com/rocicorp/mono/pull/5948) -- [Replication slot creation timeouts crashed the server during backfill retries; backfill timeouts now only error after the maximum retry backoff is reached](https://github.com/rocicorp/mono/pull/5901) -- [Shadow sync threw when a synced table could not be queried by ZQL; it now silently ignores the table to match prod behavior](https://github.com/rocicorp/mono/pull/5950) -- [WebSocket errors are now logged as warnings instead of errors, since they reflect client or upstream issues rather than server faults](https://github.com/rocicorp/mono/pull/5842) -- [Inspector now caches AST and metrics for deleted queries so they remain accessible after eviction](https://github.com/rocicorp/mono/pull/5924) +- [Returning to an app after stale-tab GC or CVR purge caused a full page reload](https://github.com/rocicorp/mono/pull/5903) +- [`"Row already exists"` assertion failures during poke processing](https://github.com/rocicorp/mono/pull/5923) +- [Deadlock during post-initial-sync `changeLog` reset](https://github.com/rocicorp/mono/pull/5953) +- [Zombie `ViewSyncer`s inflated `active-client-groups` metric](https://github.com/rocicorp/mono/pull/5907) +- [`ConcurrentModificationException` now reconnects instead of erroring](https://github.com/rocicorp/mono/pull/5930) +- [`zero-cache` startup errors during change-streamer init not published](https://github.com/rocicorp/mono/pull/5956) +- [`TypeError: Expected string at context.query. 
Got null`](https://github.com/rocicorp/mono/pull/5944) +- [Replication slots were lost after a PlanetScale Postgres failover](https://github.com/rocicorp/mono/pull/5934) +- [Replication slot creation timeouts crashed the server during backfill retries](https://github.com/rocicorp/mono/pull/5901) +- [Shadow sync threw when a synced table could not be queried by ZQL](https://github.com/rocicorp/mono/pull/5950) +- [Shadow sync now reports metrics and logs verification-success counts](https://github.com/rocicorp/mono/pull/5941) +- [WebSocket errors are now logged as warnings instead of errors](https://github.com/rocicorp/mono/pull/5842) +- [Inspector now reports last query hydration time instead of histogram](https://github.com/rocicorp/mono/pull/5924) +- [Query/mutator functions now allow omitting `args`](https://github.com/rocicorp/mono/pull/5945) +- [Reduce log volume at `INFO` level](https://github.com/rocicorp/mono/pull/5946) +- [Reclassify common Postgres config errors as warnings instead of errors](https://github.com/rocicorp/mono/pull/5981) +- [Union-fan-in queries could ignore reverse ordering](https://github.com/rocicorp/mono/pull/5980) +- [Abort `zero-cache` when ChangeDB CDC tables go missing](https://github.com/rocicorp/mono/pull/5989) +- [CVR purge failures retried immediately in a tight loop instead of backing off](https://github.com/rocicorp/mono/pull/5988) ## Breaking Changes -- [**Inspector per-query hydration metrics format changed:**](https://github.com/rocicorp/mono/pull/5924) Per-query hydration metrics (`query-hydration-server-ms`) are now reported as a plain number (most-recent hydration time in ms) instead of a TDigest histogram, and the per-query metrics type was renamed from `ServerMetrics` to `QueryServerMetrics`. If you have custom tooling reading inspector metrics, you'll need to update it. 
The protocol version was bumped from 50 to 51 to reflect this; `MIN_SERVER_SUPPORTED_SYNC_PROTOCOL` remains at 30, so 1.6 servers remain compatible with older clients. +None. diff --git a/contents/docs/zero-cache-config.mdx b/contents/docs/zero-cache-config.mdx index dbd5acff..386824b5 100644 --- a/contents/docs/zero-cache-config.mdx +++ b/contents/docs/zero-cache-config.mdx @@ -390,6 +390,13 @@ flag: `--litestream-port`
env: `ZERO_LITESTREAM_PORT`
default: `--port + 2` +### Litestream Region + +The AWS region for the litestream backup bucket. Required for non-standard AWS partitions (e.g. GovCloud `us-gov-west-1`) where Litestream cannot auto-detect the region. The replication-manager and view-syncers must have the same region. + +flag: `--litestream-region`
+env: `ZERO_LITESTREAM_REGION`
+ ### Litestream Restore Parallelism The number of WAL files to download in parallel when performing the initial restore of the replica from the backup. @@ -522,6 +529,14 @@ flag: `--per-user-mutation-limit-window-ms`
env: `ZERO_PER_USER_MUTATION_LIMIT_WINDOW_MS`
default: `60000` +### PG Replication Slot Failover + +For upstream Postgres 17+, creates replication slots with the `failover` flag enabled so they can be synchronized to a standby and survive a failover. This requires additional Postgres-side configuration on your provider; see [High Availability and Failover](/docs/connecting-to-postgres#high-availability-and-failover). Has no effect on Postgres versions before 17. + +flag: `--upstream-pg-replication-slot-failover`
+env: `ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER`
+default: `false` + ### Port The port for sync connections. From a04863766009967c6247580e515a445b458708fe Mon Sep 17 00:00:00 2001 From: Aaron Boodman Date: Tue, 2 Jun 2026 15:35:22 -1000 Subject: [PATCH 3/4] improve description --- contents/docs/release-notes/1.6.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contents/docs/release-notes/1.6.mdx b/contents/docs/release-notes/1.6.mdx index f2f2acc1..eaf3201c 100644 --- a/contents/docs/release-notes/1.6.mdx +++ b/contents/docs/release-notes/1.6.mdx @@ -1,6 +1,6 @@ --- title: Zero 1.6 -description: Postgres 17 Replication Failover and Performance +description: Failover Support for PlanetScale --- ## Installation From 3002fcf555233f163ad10d89a1fdcf7a5d155d66 Mon Sep 17 00:00:00 2001 From: Aaron Boodman Date: Tue, 2 Jun 2026 15:58:57 -1000 Subject: [PATCH 4/4] spruce --- contents/docs/connecting-to-postgres.mdx | 22 ++++++++-------------- contents/docs/release-notes/1.6.mdx | 6 +++--- contents/docs/release-notes/index.mdx | 2 +- 3 files changed, 12 insertions(+), 18 deletions(-) diff --git a/contents/docs/connecting-to-postgres.mdx b/contents/docs/connecting-to-postgres.mdx index 17b7f053..14376499 100644 --- a/contents/docs/connecting-to-postgres.mdx +++ b/contents/docs/connecting-to-postgres.mdx @@ -79,32 +79,26 @@ This configuration can cause problems like `slot has been invalidated because it Change `max_connections` to at least 100. The default is 25, which is too low for Zero in most configurations. -#### Connections +#### Pooling Make sure to only use a direct connection for the `ZERO_UPSTREAM_DB`, and use pooled URLs for `ZERO_CVR_DB`, `ZERO_CHANGE_DB`, and your API (see [Deployment](/docs/self-host)). -#### High Availability and Failover +#### High Availability -PlanetScale Postgres can fail over to a standby (during maintenance, switchover, or an outage). 
By default a logical replication slot does **not** survive promotion of a standby, so after a failover zero-cache would find its slot missing and re-sync every replica from scratch. +PlanetScale Postgres can fail over to a standby during maintenance or an outage. By default a logical replication slot does **not** survive promotion of a standby, so after a failover zero-cache would find its slot missing and re-sync every replica from scratch. To avoid this, register Zero's replication slots with PlanetScale's failover-slot preservation, which is built on [Postgres 17 failover slots](https://www.postgresql.org/docs/current/logical-replication-failover.html). PlanetScale keeps a synced copy of each registered slot on the standby, so after a failover the slot is already present on the new primary and zero-cache reconnects without re-syncing. -First, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover) (Postgres 17+) so it creates replication slots with the Postgres `failover` flag set. Slots are only flagged when they are created, so if you are upgrading an existing deployment, see [PlanetScale Replication Failover](/docs/release-notes/1.6#planetscale-replication-failover) in the 1.6 release notes for how to roll this out. +First, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover). -Zero names its slots `{ZERO_APP_ID}_{shard}_a` through `_z` — for the default app ID and shard, that's `zero_0_a`, `zero_0_b`, … `zero_0_z`. In **Cluster configuration → Parameters** in the PlanetScale dashboard: +Zero's slot names have the form `{ZERO_APP_ID}_{shard}_{a-z}`. [`ZERO_APP_ID`](/docs/zero-cache-config#app-id) defaults to `zero` when self-hosting (on Zero Cloud it's your instance ID), and the shard number is currently always `0` — so with the default app ID the slots are `zero_0_a` … `zero_0_z`. -1. 
Under the **Failover** section, add Zero's slot names as a comma-delimited list. Registering the full `zero_0_a` … `zero_0_z` range covers slot rotation. +In **Cluster configuration → Parameters** in the PlanetScale dashboard: + +1. Under the **Failover** section, add the full range of slot names as a comma-delimited list (e.g. `zero_0_a` … `zero_0_z` for the default app ID). The 26-name range covers slot rotation. 2. Set `sync_replication_slots = on` and `hot_standby_feedback = on`. 3. Apply the queued configuration changes. -After zero-cache has connected, confirm the slots are marked for failover: - -```sql -SELECT slot_name, failover, synced FROM pg_replication_slots; -``` - -`failover` should be `true` for Zero's active slot. A slot only becomes failover-eligible after its consumer has advanced it at least once while the standby is syncing, so a brand-new or idle slot can still be lost if a failover races it. - ### Neon #### Logical Replication diff --git a/contents/docs/release-notes/1.6.mdx b/contents/docs/release-notes/1.6.mdx index eaf3201c..a082932d 100644 --- a/contents/docs/release-notes/1.6.mdx +++ b/contents/docs/release-notes/1.6.mdx @@ -1,6 +1,6 @@ --- title: Zero 1.6 -description: Failover Support for PlanetScale +description: PlanetScale Failover Support --- ## Installation @@ -11,9 +11,9 @@ npm install @rocicorp/zero@1.6 ## Upgrading -### PlanetScale Replication Failover +### PlanetScale Failover -Previous Zero versions lost replication slots after a PlanetScale Postgres failover, forcing resync. Zero 1.6 fixes this problem. To enable support, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover). You must then do a resync, and register the full range of slots in PlanetScale's failover configuration. See [High Availability and Failover](/docs/connecting-to-postgres#high-availability-and-failover) for the full setup. 
+Previous Zero versions lost replication slots after a PlanetScale Postgres failover, forcing a resync. Zero 1.6 fixes this problem. To enable support, run `zero-cache` with [`ZERO_UPSTREAM_PG_REPLICATION_SLOT_FAILOVER=true`](/docs/zero-cache-config#pg-replication-slot-failover). You must then register the full range of slots in PlanetScale's failover configuration and do a resync. See [High Availability](/docs/connecting-to-postgres#high-availability) for the full setup. ## Features diff --git a/contents/docs/release-notes/index.mdx b/contents/docs/release-notes/index.mdx index dde5594a..3cb0ddbb 100644 --- a/contents/docs/release-notes/index.mdx +++ b/contents/docs/release-notes/index.mdx @@ -2,7 +2,7 @@ title: Release Notes --- -- [Zero 1.6: Postgres 17 Replication Failover and Performance](/docs/release-notes/1.6) +- [Zero 1.6: PlanetScale Failover Support](/docs/release-notes/1.6) - [Zero 1.5: Schema Change Improvements and Client Group Auth](/docs/release-notes/1.5) - [Zero 1.4: Performance and Reliability Improvements](/docs/release-notes/1.4) - [Zero 1.3: Faster Initial Sync and Other Perf Improvements](/docs/release-notes/1.3)
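Reviewer note on the PlanetScale setup step in this series: the docs ask users to register the full range of slot names as a comma-delimited list, which is tedious to type by hand. The following is a minimal sketch of how that list could be generated, assuming only the naming form stated in the patch (`{ZERO_APP_ID}_{shard}_{a-z}`, default app ID `zero`, shard `0`). The helper `failoverSlotNames` is hypothetical and is not part of Zero or PlanetScale.

```typescript
// Sketch only: `failoverSlotNames` is a hypothetical helper, not a Zero API.
// It assumes the slot-name form described in the docs: {appID}_{shard}_{a-z}.
function failoverSlotNames(appID: string = "zero", shard: number = 0): string[] {
  const names: string[] = [];
  // Walk the character codes from "a" to "z" and build one name per letter.
  for (let code = "a".charCodeAt(0); code <= "z".charCodeAt(0); code++) {
    names.push(`${appID}_${shard}_${String.fromCharCode(code)}`);
  }
  return names;
}

// Join into a comma-delimited list, the shape the Failover section asks for.
const slotList: string = failoverSlotNames().join(",");
console.log(slotList);
```

If `ZERO_APP_ID` is set to something other than the default, pass that value as the first argument so the generated names match the slots `zero-cache` actually creates.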