DOC-13858 Server File-Based Rebalance for Data Service#4105
…yet because UI still has not been merged.
There was a problem hiding this comment.
Pull request overview
Adds documentation for the new Data Service File-Based Rebalance (FBR) feature in Couchbase Server 8.1, including a new REST API reference page, navigation entry, conceptual coverage in the rebalance learn page, a new bucket-level dataServiceRebalanceType parameter, and updates to general settings, node management, and the 8.1 new-features page. Also performs cosmetic cleanup (removing italic emphasis) in the rebalance learn page and updates the shared user/pwd/host/port REST parameter partial.
Changes:
- New REST reference page file-based-data-rebalance.adoc documenting GET/POST /internalSettings usage for FBR, plus nav and rebalance-table entries.
- New dataServiceRebalanceType bucket parameter documentation added to rest-bucket-create.adoc, with cross-references from general settings, add-node-and-rebalance, and the rebalance learn page.
- Conceptual FBR section added to rebalance.adoc and an 8.1 new-features entry, plus formatting cleanup of italic markers across rebalance.adoc and rest-rebalance-overview.adoc.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| modules/ROOT/nav.adoc | Adds new FBR REST page to the navigation. |
| modules/rest-api/partials/user_pwd_host_port_params.adoc | Renames placeholders to uppercase (HOST, PORT, USER, PASSWORD). |
| modules/rest-api/partials/rest-rebalance-table.adoc | Adds GET/POST /internalSettings rows for FBR. |
| modules/rest-api/pages/rest-rebalance-overview.adoc | Removes italic emphasis from prose. |
| modules/rest-api/pages/rest-bucket-create.adoc | Documents new dataServiceRebalanceType bucket parameter with examples. |
| modules/rest-api/pages/file-based-data-rebalance.adoc | New REST reference page for configuring FBR via /internalSettings. |
| modules/manage/pages/manage-settings/general-settings.adoc | Adds description of the new FBR concurrent-moves UI setting. |
| modules/manage/pages/manage-nodes/add-node-and-rebalance.adoc | Adds note that FBR is used automatically during node addition (EE). |
| modules/learn/pages/clusters-and-availability/rebalance.adoc | Adds conceptual FBR section and removes italic emphasis throughout. |
| modules/introduction/partials/new-features-81.adoc | Adds the FBR entry under Data Service new features. |
Committing some of Copilot's suggestions
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
hyunjuV left a comment
- Information about the File-Based Rebalance throttle rate setting, snapshot_download_throttle_bytes, should be added, along with the metric kv_ep_snapshot_read_bytes, which can be monitored to view the throughput.
- The rest of the comments are mostly about places where the text suggests that the rebalance type is decided for each vBucket (not true), and that the decision depends on scenarios such as one of file-based or DCP rebalance being likely to perform x percent faster than the other, or all data being 100% memory resident (not true as far as I'm aware, but should double check with Ben Huddleston).
While it is true that ephemeral buckets have no backfill phase (so, no file-based rebalance), I do not believe that memory residency is considered in the decision between file-based and DCP rebalance for non-ephemeral buckets.
| * *Automatic rebalance type selection*: The server automatically determines whether FBR or DCP is more efficient for each vBucket move. | ||
| When FBR is not applicable or not expected to be faster, the server falls back to DCP automatically. |
There was a problem hiding this comment.
I do not believe that this is true the way it's written (where the server automatically determines which method is more efficient for each vBucket move).
When FBR is enabled (which it is, by default), the backfill is done using the same method for all vBuckets. Generally, it is always done using file-based rebalance unless a storage migration or eviction policy change is pending. So, the "Automatic rebalance type selection" does apply in scenarios where DCP rebalance is required for all phases -- for example, when a storage migration or eviction policy change is pending. If DCP rebalance is required for all phases, then that is what the Server will do (even with FBR enabled).
There was a problem hiding this comment.
The server does not know whether FBR or DCP is more efficient for any given vBucket move. The server will use FBR if possible for any given vBucket; otherwise DCP is used. FBR can only be used for new vBucket builds (i.e. the vBucket was not previously on the node on which it is being built), only if configured to do so (FBR settings are enabled), and only if no storage/eviction policy migration is currently being done.
Some of this confusion probably comes from calling this "file based rebalance", as the actual feature is file based backfill (one particular rebalance step).
There was a problem hiding this comment.
Suggest:
- Automatic rebalance type selection: The server automatically determines whether FBR can be used for a given vBucket move.
When FBR is not applicable, the server falls back to DCP automatically.
There was a problem hiding this comment.
Thanks for the clarification @BenHuddleston .
The below summary is to correct any mistaken comments I may have made in my original review comments:
- The server does automatically determine whether FBR can be used for a given vBucket move (or, as in the documentation phrasing, for each vBucket move).
- However, this determination is not made based on perf considerations (like whether file copy or DCP would be faster).
- Also, this determination is not made based on memory residency (like whether data is 100% memory resident).
- But, for ephemeral buckets, DCP is always used (i.e. file copy is not used since there's no persistent storage).
So, as noted by Ben in his suggestion above, best to just say that the server automatically determines whether file-based or DCP rebalance is applicable and not go into specifics.
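To make the corrected understanding concrete, here is a minimal Python sketch of the eligibility rule as described in this thread. It is an illustration only, not the server's implementation: `MoveContext`, its field names, and `backfill_method` are hypothetical names invented for this example. The point it shows is what the check does not include: there is no speed estimate and no memory-residency input.

```python
from dataclasses import dataclass


@dataclass
class MoveContext:
    """Hypothetical inputs for one vBucket move (not a real Couchbase structure)."""
    fbr_enabled: bool           # FBR settings are enabled
    is_new_vbucket_build: bool  # the vBucket was not previously on the target node
    migration_pending: bool     # storage or eviction policy migration in progress
    bucket_is_ephemeral: bool   # ephemeral buckets have no persistent storage


def backfill_method(ctx: MoveContext) -> str:
    """Return "FBR" only when every eligibility condition holds, else "DCP".

    The decision is about eligibility only; performance is never considered.
    """
    eligible = (
        ctx.fbr_enabled
        and ctx.is_new_vbucket_build
        and not ctx.migration_pending
        and not ctx.bucket_is_ephemeral
    )
    return "FBR" if eligible else "DCP"


if __name__ == "__main__":
    # A new vBucket build with FBR enabled and no migration pending.
    print(backfill_method(MoveContext(True, True, False, False)))  # FBR
    # The same move while an eviction policy change is pending.
    print(backfill_method(MoveContext(True, True, True, False)))   # DCP
```

Framing the rule as a pure eligibility check is what supports the suggested wording "determines whether FBR can be used" rather than "determines which is more efficient".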
|
|
||
| * *Separate vBucket move concurrency for FBR*: A new setting, `dataServiceFileBasedRebalanceMovesPerNode`, controls the maximum number of concurrent file-based vBucket moves per node. | ||
| This is independent of the existing `rebalanceMovesPerNode` setting, which applies to DCP rebalance. | ||
|
|
There was a problem hiding this comment.
There is also a new File-Based Rebalance throttle rate setting (snapshot_download_throttle_bytes), which sets the maximum rate at which file-based rebalance snapshots are transferred between nodes. Very high transfer rates mean that rebalance proceeds very quickly but may have a negative impact on KV operation latencies during rebalance. A value of 0 means that snapshot transfer is unthrottled.
Additional details:
- snapshot_download_throttle_bytes is an option in GET, POST /pools/default/settings/memcached/global
- In the UI (under Advanced Rebalance Settings for the Data Service), the setting of 150 MiB/s translates to snapshot_download_throttle_bytes=157286400. By default, the value is 0 (unthrottled).
- The rate of FBR snapshots transfer can be seen with a rate function applied to the metric kv_ep_snapshot_read_bytes.
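The MiB/s to bytes conversion mentioned above can be checked with a short Python sketch. The helper names `throttle_bytes` and `throttle_form_body` are hypothetical, and the form-encoded `key=value` body is an assumption about how the value would be sent; the comment only states that the setting is an option of GET, POST /pools/default/settings/memcached/global.

```python
from urllib.parse import urlencode

MIB = 1024 * 1024  # bytes in one mebibyte


def throttle_bytes(mib_per_second: int) -> int:
    """Convert a UI-style MiB/s value to bytes per second; 0 means unthrottled."""
    if mib_per_second < 0:
        raise ValueError("throttle rate cannot be negative")
    return mib_per_second * MIB


def throttle_form_body(mib_per_second: int) -> str:
    """Build a key=value body for the setting (form encoding is an assumption)."""
    return urlencode(
        {"snapshot_download_throttle_bytes": throttle_bytes(mib_per_second)}
    )


if __name__ == "__main__":
    print(throttle_bytes(150))      # 157286400
    print(throttle_form_body(150))  # snapshot_download_throttle_bytes=157286400
    print(throttle_form_body(0))    # snapshot_download_throttle_bytes=0
```

This confirms the figure quoted for the UI: 150 x 1024 x 1024 = 157286400.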
| image::clusters-and-availability/replicaVbucketMove.png[,640,align=left] | ||
|
|
||
| The move has two principal phases. Phase 1 is _Backfill_. Phase 2 is _Book-keeping_. | ||
| The move has two principal phases. Phase 1 is Backfill. Phase 2 is Book-keeping. |
There was a problem hiding this comment.
The picture and the description are slightly different for file-based vs. DCP rebalance.
The existing pictures are for DCP rebalance.
FYI -- this Google doc has info on how the picture would look different for file-based rebalance.
|
|
||
| The move has four principal phases. | ||
| Phase 1, _Backfill_, and Phase 2, _Book-keeping_, are identical to those required for replica vBuckets; except that the _Book-keeping_ phase includes additional _Persistence Time_. | ||
| Phase 1, Backfill, and Phase 2, Book-keeping, are identical to those required for replica vBuckets; except that the Book-keeping phase includes additional Persistence Time. |
There was a problem hiding this comment.
Same in this section as in the "Rebalance Phases for Replica vBuckets".
The picture and the description are slightly different for file-based vs. DCP rebalance.
The existing pictures are for DCP rebalance.
FYI -- this Google doc has info on how the picture would look different for file-based rebalance.
|
|
||
| Since vBucket moves are highly resource-intensive, Couchbase Server allows the concurrency of such moves to be _limited_: a setting is provided that determines the maximum number of concurrent vBucket moves permitted on any node. | ||
| Since vBucket moves are highly resource-intensive, Couchbase Server allows the concurrency of such moves to be limited: a setting is provided that determines the maximum number of concurrent vBucket moves permitted on any node. | ||
| The minimum value for the setting is `1`, the maximum `64`, the default `4`. |
There was a problem hiding this comment.
For DCP rebalance, the minimum value for the setting is 1, the maximum 64, the default 4.
For FBR, the minimum value for the setting is 1, the maximum 1024, the default 4. (I see that this info is presented in the file-based rebalance section.)
There was a problem hiding this comment.
Both settings can be set as high as 1024.
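A small Python sketch of a range check for the two concurrency settings may help when reconciling the numbers in this thread. It assumes the bounds from the reply above (minimum 1, maximum 1024 for both settings) and the default of 4 quoted earlier; the draft text under review still says 64 for the DCP setting, so treat the constant as something to confirm. The function name `validate_moves_per_node` is hypothetical; the two setting names are the ones used in the PR.

```python
from typing import Dict

MOVES_PER_NODE_MIN = 1
MOVES_PER_NODE_MAX = 1024  # per the review reply; the draft text says 64 for DCP
MOVES_PER_NODE_DEFAULT = 4

# The two settings are independent: changing one does not affect the other.
KNOWN_SETTINGS = (
    "rebalanceMovesPerNode",                      # DCP rebalance
    "dataServiceFileBasedRebalanceMovesPerNode",  # file-based rebalance
)


def validate_moves_per_node(requested: Dict[str, int]) -> Dict[str, int]:
    """Return both settings, filling in defaults and rejecting out-of-range values."""
    result = {name: MOVES_PER_NODE_DEFAULT for name in KNOWN_SETTINGS}
    for name, value in requested.items():
        if name not in KNOWN_SETTINGS:
            raise KeyError(f"unknown setting: {name}")
        if not MOVES_PER_NODE_MIN <= value <= MOVES_PER_NODE_MAX:
            raise ValueError(
                f"{name} must be between {MOVES_PER_NODE_MIN} "
                f"and {MOVES_PER_NODE_MAX}, got {value}"
            )
        result[name] = value
    return result


if __name__ == "__main__":
    # Raise only the file-based setting; the DCP setting keeps its default.
    print(validate_moves_per_node({"dataServiceFileBasedRebalanceMovesPerNode": 16}))
```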
| Scenarios where DCP may be faster:: | ||
| For example, DCP can be faster when the data resident ratio is 100%. |
There was a problem hiding this comment.
I'm not sure that the Server automatically does DCP rebalance (for the backfill) in this scenario -- should check with @BenHuddleston
There was a problem hiding this comment.
We do not; FBR is used whenever it is possible to do so. It's quite hard to determine which type of backfill would be faster, as it likely depends on disk characteristics, storage backend, fragmentation, and item sizes.
| ==== Performance | ||
|
|
||
| The primary goal of FBR is to deliver significant improvements to rebalance speed for large datasets. | ||
| The target throughput is 1 TB of data movement in 30 minutes. |
There was a problem hiding this comment.
I don't think that we should be specific since the actual rebalance time depends on too many variables. Please remove line 198.
| Changing one setting does not affect the other. | ||
|
|
||
| The setting may be established by means of the xref:manage:manage-settings/general-settings.adoc#rebalance-settings[Couchbase Web Console] or the xref:manage:manage-settings/general-settings.adoc#rebalance-settings-via-rest[REST API]. | ||
|
|
There was a problem hiding this comment.
Need to add info about FBR throttling option -- see info below.
=== File-Based Rebalance Throttle Rate Setting
Since file-based rebalance can increase network usage, there's a way to throttle the file transfer rate, if needed. By default, there is no throttling.
The File-Based Rebalance throttle rate setting is called snapshot_download_throttle_bytes, and it sets the maximum rate at which file-based rebalance snapshots are transferred between nodes. Very high transfer rates mean that rebalance proceeds very quickly but may have a negative impact on KV operation latencies during rebalance. A value of 0 means that snapshot transfer is unthrottled.
Additional details:
- snapshot_download_throttle_bytes is an option in GET, POST /pools/default/settings/memcached/global
- In the UI (under Advanced Rebalance Settings for the Data Service), the setting of 150 MiB/s translates to snapshot_download_throttle_bytes=157286400. By default, the value is 0 (unthrottled).
- The rate of FBR snapshots transfer can be seen with a rate function applied to the metric kv_ep_snapshot_read_bytes.
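As an illustration of "a rate function applied to the metric", the following Python sketch computes an average transfer rate from two samples of a cumulative byte counter. It assumes kv_ep_snapshot_read_bytes behaves as a monotonically increasing counter, which is implied by the suggestion to apply a rate function but not stated outright; `bytes_per_second` and the sample values are hypothetical.

```python
from typing import List, Tuple

MIB = 1024 * 1024  # bytes in one mebibyte


def bytes_per_second(samples: List[Tuple[float, int]]) -> float:
    """Average rate between the first and last (timestamp_seconds, counter) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    if v1 < v0:
        # A counter reset (for example, a process restart); not handled here.
        raise ValueError("counter decreased between samples")
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical samples of the counter taken 60 seconds apart.
    samples = [(0.0, 0), (60.0, 9437184000)]
    rate = bytes_per_second(samples)
    print(rate)        # 157286400.0
    print(rate / MIB)  # 150.0
```

With these made-up samples the observed rate equals the 150 MiB/s throttle value, which is the kind of comparison an operator would make when checking whether the throttle is the limiting factor.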
| Therefore, replication has successfully distributed the contents of `travel-sample` across both nodes, providing a single replica vBucket for each active vBucket. | ||
|
|
||
| NOTE: By default, Couchbase Server Enterprise Edition automatically uses File-Based Rebalance (FBR) to move data for eligible vBuckets during node addition. | ||
| The server selects the optimal rebalance method for each vBucket move transparently. |
There was a problem hiding this comment.
Line 190 should be removed -- the rebalance method is not selected for each vBucket move.
There was a problem hiding this comment.
See previous comment, it is selected on a per-vBucket basis.
There was a problem hiding this comment.
Per Ben's comment, line 190 is OK.
There was a problem hiding this comment.
IMO "optimal" implies "the most performant" which we do not consider at all (see other comments). I would remove line 190 or at least the word optimal.
|
|
||
| The valid values are: | ||
|
|
||
| * `auto` (default): The server automatically selects File-Based Rebalance (FBR) or DCP for each vBucket move based on which is estimated to be at least 10% faster. |
There was a problem hiding this comment.
As far as I'm aware:
- Not true that file-based rebalance or DCP is chosen for each vBucket move.
- Also not true that file-based or DCP decision is made based on estimates of which is likely to be faster.
Should just say:
The server automatically selects File-Based Rebalance (FBR) or DCP.
There was a problem hiding this comment.
See previous, it is selected on a per-vBucket basis and we do not consider the efficiency of the move, only the eligibility.
There was a problem hiding this comment.
Per Ben's comment:
The server does automatically select File-Based rebalance (FBR) or DCP for each vBucket move, but it's not based on which is estimated to be faster (not based on any performance estimates). So, best to just say that server automatically selects without going into details on how.
Added FBR-specific versions of the rebalance phase diagrams.
This is my reformatting/editing of Schewtha's draft of the FBR docs.
Important Note: This draft does not contain documentation for the Web Console's FBR settings. The UI wasn't ready when this draft was written and revised. This draft is mainly for review before inclusion, ahead of the early Totoro release. Other features are taking priority over getting the GUI documentation done.
Cribbing Copilot's summary of these changes:
Adds documentation for the new Data Service File-Based Rebalance (FBR) feature in Couchbase Server Totoro, including a new REST API reference page, navigation entry, conceptual coverage in the rebalance learn page, a new bucket-level dataServiceRebalanceType parameter, and updates to general settings, node management, and the 8.1 new-features page. Also performs cosmetic cleanup (removing italic emphasis) in the rebalance learn page and updates the shared user/pwd/host/port REST parameter partial.
Changes (with links to the preview):
You will need the Docs Team credentials on Confluence to view the preview.