
Clarify duplicate host management in federated reporting docs#3645

Open
larsewi wants to merge 3 commits into cfengine:master from larsewi:federated-reporting-duplicate-hosts

Conversation

Contributor

@larsewi larsewi commented May 12, 2026

Summary

  • Reframe the "Duplicate host management" intro around the approach distinction (source-side vs destination-side) rather than treating distributed cleanup and handle-duplicate-hostkeys as if they addressed different lifecycle scenarios. Both mechanisms address the same underlying problem: the same hostkey appearing on multiple feeders.
  • Distributed cleanup is now described as a source-side cleanup (deletes stale records from the feeders via API, before re-import).
  • Handle duplicate hostkeys is described as a destination-side filter (moves older duplicates into a dup schema on the superhub during import).
  • Added a caveat that neither mechanism deduplicates clones reporting to the same feeder, since the feeder's __hosts table is keyed on hostkey (verified against nova/db/schema.sql).
  • Dropped "fail over" from the list of causes — there is no automatic failover for federated reporting agents, only operator-driven re-bootstrap.
  • Updated the distributed_cleanup.py prompts in the bullet list and the example console session to match the current script: separate admin username/password prompts (defaulting to admin), and 2FA prompts when enabled.

Example output

```command
/opt/cfengine/federation/bin/distributed_cleanup.py
```

```output
Enter admin username for superhub ip-172-31-12-171 [admin]: larsewi
Enter admin password for superhub ip-172-31-12-171:
Enter email for fr_distributed_cleanup accounts: larsewi@doofus.com

Enter admin username for ip-172-31-6-8 [admin]: larsewi
Enter admin password for ip-172-31-6-8:
```

@larsewi larsewi force-pushed the federated-reporting-duplicate-hosts branch from 571c83a to 1d4687e on May 12, 2026 13:08
@larsewi larsewi marked this pull request as ready for review May 12, 2026 13:11
- hosts are able to "float", re-bootstrap or failover to several different feeder hubs
- hosts may be cloned and not have their hostkey refreshed by running `cf-key` and refreshing `$(sys.workdir)/ppkeys/localhost.pub`.
CFEngine provides two mechanisms for resolving these cross-feeder duplicates.
They address the same underlying problem with different approaches, and they can be enabled together:
Contributor


Can they be enabled together? The destination-side mechanism would only add the newest record into the superhub database, and without the duplicate entries in the superhub the source-side process would have nothing to delete on the feeders. I don't think this comment is accurate.

Contributor Author

@larsewi larsewi May 12, 2026


I don't see why not. But I guess it would not make any difference, so might as well remove it ;)

larsewi and others added 2 commits May 12, 2026 16:18
Reframe the "Duplicate host management" intro around the approach
distinction rather than separate lifecycle scenarios: both mechanisms
address the same underlying problem (same hostkey across multiple
feeders), with distributed cleanup as a source-side cleanup and
handle duplicate hostkeys as a destination-side filter. Note that
neither deduplicates clones reporting to the same feeder, since a
feeder's __hosts table is keyed on hostkey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script has since been changed to ask for admin username and
password separately (defaulting to admin) and to prompt for a 2FA
code when the admin account has 2FA enabled. Update the bullet list
and the example console session to match the current prompts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@larsewi larsewi force-pushed the federated-reporting-duplicate-hosts branch from 1d4687e to e21408e on May 12, 2026 14:19
@larsewi larsewi requested a review from craigcomstock May 12, 2026 14:21
Comment thread on content/web-ui/federated-reporting.markdown (Outdated)
They address the same underlying problem with different approaches:

In the first case you will likely want to remove entries for hosts which are not the latest since the latest data will be most accurate.
- [Distributed cleanup][Federated reporting#Distributed cleanup] is a _source-side_ cleanup.
Member

@olehermanse olehermanse May 12, 2026


I know that we are deleting hosts on the feeder (the source), but I think it's confusing / inaccurate to call this a source-side cleanup when the script runs on the superhub (the destination).

Also, since we have terminology (superhub and feeder) we should probably use those words instead of source vs destination.

There are situations where feeder hubs may have hosts with duplicate hostkeys:
In a federated deployment, the superhub can end up importing the same hostkey from more than one feeder.
This typically happens when a host is re-bootstrapped to a different feeder, when a VM is re-spawned from an image that preserves the existing key material, or when a host is cloned without regenerating its key pair with `cf-key` and refreshing `$(sys.workdir)/ppkeys/localhost.pub`.
In all of these cases, multiple feeders hold a record under the same hostkey, and only the most recently reporting one represents accurate data.
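
For reference, a minimal re-keying sketch, assuming the default workdir `/var/cfengine` and a hypothetical feeder address `10.0.0.10`; regenerating the key pair and re-bootstrapping gives the host a fresh hostkey:

```command
# Remove the old key pair so cf-key generates a fresh one
rm /var/cfengine/ppkeys/localhost.pub /var/cfengine/ppkeys/localhost.priv
# Generate a new key pair under $(sys.workdir)/ppkeys/
cf-key
# Re-bootstrap to the feeder so the host reports under the new hostkey
cf-agent --bootstrap 10.0.0.10
```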
Member


This is not really true. If a host is cloned with its key, there will be 2 active hosts, both with recent and accurate data. The same applies to a host spawned from an image (if the original host or another host spawned from the same image is also active).

In those cases, deleting one of the hosts actually does more harm than good, and the correct fix is to re-key one or both of them.

During each import cycle on the superhub, duplicate rows for the same hostkey are compared by `__hosts.lastreporttimestamp`, and all but the most recent are moved out of the per-feeder schemas into a separate `dup` schema for later analysis, so only the most recently reporting host remains visible in Mission Portal.

Note that neither mechanism deduplicates hosts that share a hostkey but report to the _same_ feeder — a feeder's database is keyed on hostkey, so the most recent report simply overwrites the previous one there.
If cloned or re-spawned hosts are reporting to the same feeder, the right fix is still to regenerate their key pairs.
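
For illustration, a hedged sketch of how one could inspect cross-feeder duplicates on the superhub by hand, assuming the standard `cfdb` database and hypothetical per-feeder schema names `feeder1` and `feeder2` (the real schema names depend on your deployment):

```command
# Hostkeys appearing under more than one feeder schema are cross-feeder
# duplicates; the rows with older lastreporttimestamp are the stale ones.
psql cfdb -c "
  SELECT hostkey, 'feeder1' AS feeder, lastreporttimestamp FROM feeder1.__hosts
  UNION ALL
  SELECT hostkey, 'feeder2' AS feeder, lastreporttimestamp FROM feeder2.__hosts
  ORDER BY hostkey, lastreporttimestamp DESC;"
```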
Member


This fix is correct whether it's federated reporting or not.

Also this part:

> the right fix is still to regenerate their key pairs.

is a bit weird when you haven't already mentioned this fix above.

- [Handle duplicate hostkeys][Federated reporting#Handle duplicate hostkeys] is a _destination-side_ filter.
During each import cycle on the superhub, duplicate rows for the same hostkey are compared by `__hosts.lastreporttimestamp`, and all but the most recent are moved out of the per-feeder schemas into a separate `dup` schema for later analysis, so only the most recently reporting host remains visible in Mission Portal.

Note that neither mechanism deduplicates hosts that share a hostkey but report to the _same_ feeder — a feeder's database is keyed on hostkey, so the most recent report simply overwrites the previous one there.
Member


Here it would be appropriate to mention the health diagnostic alert we have in Mission Portal for duplicate hostkey.

A script on the superhub identifies the feeder with the most recent contact for each hostkey and then calls back into the other feeders to delete the stale records at the source, so they are never re-imported.
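
As a hedged illustration of that callback step (the feeder hostname and hostkey below are hypothetical, and the script handles authentication itself through the `fr_distributed_cleanup` accounts it sets up), the deletion amounts to an Enterprise API host-deletion request against each feeder holding a stale record:

```command
# Delete the stale record for one hostkey directly on a feeder
curl -X DELETE -u fr_distributed_cleanup:PASSWORD \
  "https://feeder1.domain/api/host/SHA=93f1d9..."
```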

There are two options available for handling these situations depending on your environment: Distributed Cleanup or Handle Duplicate Hostkeys.
- [Handle duplicate hostkeys][Federated reporting#Handle duplicate hostkeys] is a _destination-side_ filter.
Member

@olehermanse olehermanse May 12, 2026


Out of scope for this PR, but: Handle sounds a bit vague and might be confusing, especially when there is also distributed cleanup, I guess we could rename this mechanism and section to simply Filter duplicate hostkeys.

@olehermanse olehermanse requested a review from nickanderson May 12, 2026 23:10
Comment on lines +231 to +235
- admin username and password for the superhub (the username defaults to `admin` if left blank)
- a 2FA code for the superhub, if 2FA is enabled for the admin account
- email address for the `fr_distributed_cleanup` limited privileges user
- admin username and password for each feeder (the username defaults to `admin` if left blank)
- a 2FA code for each feeder, if 2FA is enabled for the admin account
Member


👍

Comment on lines +239 to +260
```command
ls /opt/cfengine/federation/cftransport/distributed_cleanup/
```

```output
superhub.pub feeder1.cert feeder1.pub feeder2.cert feeder2.pub
```

# /opt/cfengine/federation/bin/distributed_cleanup.py
Enter admin credentials for superhub https://superhub.domain/api:
Enter admin credentials for feeder1 at https://feeder1.domain/api:
Enter admin credentials for feeder2 at https://feeder2.domain/api:

```command
/opt/cfengine/federation/bin/distributed_cleanup.py
```

```output
Enter admin username for superhub superhub.domain [admin]:
Enter admin password for superhub superhub.domain:
Enter email for fr_distributed_cleanup accounts:

Enter admin username for feeder1 [admin]:
Enter admin password for feeder1:

Enter admin username for feeder2 [admin]:
Enter admin password for feeder2:
```
Member


👍

Co-authored-by: Ole Herman Schumacher Elgesem <4048546+olehermanse@users.noreply.github.com>