
Feature/multiserver plugin #3421

Open
Muddyblack wants to merge 8 commits into ipspace:dev from Muddyblack:feature/multiserver-plugin

Conversation

@Muddyblack

Collaborator

Reference: #3420

Summary

This PR adds the multiserver plugin, which distributes a single Netlab topology across multiple physical servers.
For now, it supports only the containerlab provider.

Key Details

  • Self-contained: The implementation is entirely within netsim/extra/multiserver/ and doesn't modify any core Netlab engine logic.
  • Consistent Allocations: IP, interface, and VNI allocations are computed once on the workstation. The plugin then generates a per-server directory with a filtered clab.yml and netlab.snapshot.pickle.
  • Native Remote Deployments: Remote servers launch using standard sudo netlab up --snapshot -vv without needing custom CLI options.

I am not sure whether the test files make sense, but at least they show that the plugin does not interfere with the normal netlab workflow.

Explanations on how it works can be found in docs/plugins/multiserver.md.


@ipspace ipspace left a comment

Owner

Super-awesome-job!!! Thanks a million.

Tons of comments (as you expected ;). Some of them are just suggestions or pointers to existing helper functions; in other cases, I think we can make the whole thing a lot more streamlined with significant rewrites.

@Muddyblack

Collaborator Author

I guess about 99% of them are valid 😁
I will have a look at it next week 👍

@Muddyblack Muddyblack self-assigned this Jun 1, 2026
@Muddyblack Muddyblack added the enhancement New feature or request label Jun 1, 2026
@Muddyblack

Collaborator Author

Refactor & review-response changelog

clab.yml generation

  • Dropped ~200 lines of hand-rolled Python clab generation (_build_server_clab & co).
  • output() now filters the topology and renders via the standard clab.j2 template.
  • Cross-server P2P links carry a link.clab.vxlan attribute; clab.j2 got one extra
    branch under node_count == 2. Regular topologies are unaffected.

Data types

  • Added must_be_group_id and must_be_node_or_group (mirroring must_be_node_id).
  • Schema now validates groups (group_id), members (node_id), replicate (node_or_group).
  • Removed the manual existence-check loops in _resolve_assignments /
    _resolve_replicated; they only expand references now.
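
A minimal sketch of what "only expand references" could look like once the schema has validated every name. The function name `expand_members` and the plain-dict layout are assumptions made for illustration; this is not the plugin's actual code, and it does not handle nested groups.

```python
from typing import Dict, List


def expand_members(refs: List[str], groups: Dict[str, List[str]]) -> List[str]:
    """Expand node or group names into a flat node list without duplicates.

    Assumes the schema already validated every name, so anything that is
    not a group name is treated as a node name.
    """
    nodes: List[str] = []
    for ref in refs:
        # A group reference expands to its members; a node name stays as-is.
        for name in groups.get(ref, [ref]):
            if name not in nodes:
                nodes.append(name)
    return nodes


if __name__ == "__main__":
    groups = {"site1": ["r1", "r2"], "site2": ["r3"]}
    print(expand_members(["site1", "r3", "r1"], groups))
```

Because validation happens earlier, the expansion step needs no error handling of its own.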

Servers as a dictionary

  • multiserver.servers is a name-keyed dict — duplicate names impossible by construction.
  • Dropped the seen_ids loop; _validate_servers uses _dataplane IDs (like VLANs/VRFs).
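
As a rough illustration of the dictionary format, the sketch below gives every server without an explicit ID the lowest unused one. It is a standalone stand-in, not the `_dataplane` helper the plugin uses; `assign_server_ids` and the sample data are hypothetical, and the sketch does not detect two servers that request the same explicit ID.

```python
from typing import Any, Dict


def assign_server_ids(servers: Dict[str, Dict[str, Any]]) -> None:
    """Give every server lacking an explicit 'id' the lowest unused positive ID."""
    used = {data["id"] for data in servers.values() if "id" in data}
    next_id = 1
    for data in servers.values():
        if "id" in data:
            continue  # keep IDs the user set explicitly
        while next_id in used:
            next_id += 1
        data["id"] = next_id
        used.add(next_id)


if __name__ == "__main__":
    servers = {
        "alpha": {"host": "192.0.2.1"},
        "beta": {"host": "192.0.2.2", "id": 1},
        "gamma": {"host": "192.0.2.3"},
    }
    assign_server_ids(servers)
    print({name: data["id"] for name, data in servers.items()})
```

Server names are dictionary keys, so only the IDs need an allocation pass.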

CLI hooks for VXLAN setup/teardown

  • Hooks (post_start_clab / pre_stop_clab) are baked into each server's snapshot with a
    relative vxlan-setup.sh path, so they fire locally where netlab up --snapshot runs.

Single-server netlab up --snapshot fixes

  • _classify_links marked a link cross-server if endpoints spanned >1 server,
    ignoring clab.uplink — uplink NIC + VXLAN on the same bridge, no STP →
    broadcast storm → 100 % CPU. Fix: clab.uplink links are now forced cross: False.
  • set -e + bare ip link add vxlanN aborted on "File exists" after a crash.
    Fix: ip link del vxlanN 2>/dev/null || true before each add in
    vxlan-setup.j2 / generated script.
  • New multiserver.vxlan.auto_start bool (default true). When false, scripts
    are generated but CLI hooks are not registered — tunnels stay down until run
    manually (staged convergence or single-server local testing).
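
The "File exists" fix can be pictured as a small generator that emits a delete before every add. This is a sketch under assumptions: the function name, its parameters, and the bridge name are hypothetical, and the real output comes from vxlan-setup.j2. Only the `ip link` commands themselves are standard iproute2 syntax.

```python
from typing import List


def vxlan_setup_commands(
    index: int, vni: int, remote: str, bridge: str, dev: str, dstport: int = 4789
) -> List[str]:
    """Return shell commands that (re)create one host-level VXLAN tunnel."""
    ifname = f"vxlan{index}"
    return [
        # Remove a stale interface left behind by a crash; a missing
        # interface is not an error, so 'set -e' does not abort the script.
        f"ip link del {ifname} 2>/dev/null || true",
        f"ip link add {ifname} type vxlan id {vni} remote {remote} "
        f"dstport {dstport} dev {dev}",
        # Attach the tunnel to the local Linux bridge and bring it up.
        f"ip link set {ifname} master {bridge}",
        f"ip link set {ifname} up",
    ]


if __name__ == "__main__":
    script = ["#!/bin/bash", "set -e"]
    script += vxlan_setup_commands(0, vni=10000, remote="192.0.2.2",
                                   bridge="br-lan1", dev="eth0")
    print("\n".join(script))
```

Running the generated script twice gives the same result as running it once, which is what makes a restart after a crash safe.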

Remote-server snapshots (superseded by _resolve_snapshot_paths below)

  • make_paths_absolute() no longer touches per-server snapshots.

Bug fixes

  • Fixed two Box crashes in _build_server_topo (default_box=True missing → BoxKeyError).
  • Box cleanup per AGENTS.md: to_dict() and dotted-path .get().
  • Module config render failed in per-server dirs ('frr.j2' not found, OSPF/etc.):
    snapshot load skips make_paths_absolute(), so a bare relative templates dir
    reached render_template(), whose "not .//" branch re-based it onto the netlab
    install dir → in-template {% import %} lookups missed. Fix: _resolve_snapshot_paths
    now anchors relative entries with ./ via _cwd_relative() (stays portable across hosts).
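
One way to read the `./` anchoring is sketched below. This is a guess at the idea, not the code in the PR: the function name `cwd_relative` and the decision to leave `package:` entries untouched are assumptions.

```python
def cwd_relative(entry: str) -> str:
    """Mark a bare relative path as relative to the current directory.

    Absolute paths, paths that are already anchored ('./', '../', '.', '..'),
    and 'package:' entries are returned unchanged (the last one is an assumption).
    """
    if entry in (".", "..") or entry.startswith(("/", "./", "../", "package:")):
        return entry
    return "./" + entry


if __name__ == "__main__":
    for item in ["templates", "./templates", "/opt/lab/templates", "package:extra"]:
        print(f"{item!r} -> {cwd_relative(item)!r}")
```

The anchored form stays valid on any host as long as netlab runs from the per-server directory.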

Tests & docs

  • Merged srlinux/frr coverage into multiserver-explicit.yml; dropped the separate test.
  • docs/plugins/multiserver.md updated for the dict format, CLI hooks, and clab.j2 flow.

Open / outlook

  • Waiting on a core post-output callback (offered by maintainer) to replace the atexit
    file-copy handler.
  • Future: optional remote orchestration — control node pushes dirs and runs netlab
    remotely (SSH), so the whole lab comes up from one command. Out of scope here.

I am not sure about package: (the netlab installation path): is it baked into the snapshot file?

That would require netlab to be installed at the same path on every machine,
although I would say you usually set up your machines similarly anyway (by cloning them).

if "package:" in entry:
  resolved.append(str(moddir / entry.replace("package:", "")))
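
To make the concern concrete, here is a self-contained sketch of prefix resolution against a given module directory. It differs from the snippet above on purpose: it assumes `package:` only ever appears as a leading prefix, and `resolve_search_path` and both install paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List


def resolve_search_path(entries: List[str], moddir: PurePosixPath) -> List[str]:
    """Replace a leading 'package:' prefix with the local installation directory."""
    prefix = "package:"
    resolved: List[str] = []
    for entry in entries:
        if entry.startswith(prefix):
            # Strip leading slashes so the join cannot escape moddir.
            resolved.append(str(moddir / entry[len(prefix):].lstrip("/")))
        else:
            resolved.append(entry)
    return resolved


if __name__ == "__main__":
    entries = ["package:extra", "./templates"]
    # The same entries resolve differently on differently installed hosts.
    print(resolve_search_path(entries, PurePosixPath("/usr/lib/python3/site-packages/netsim")))
    print(resolve_search_path(entries, PurePosixPath("/home/user/venv/lib/netsim")))
```

If the prefix is resolved where the snapshot is loaded rather than where it is created, the install paths on the two machines do not have to match.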

Also unhappy with _cwd_relative and _resolve_snapshot_paths

@Muddyblack Muddyblack requested a review from ipspace June 1, 2026 16:12
@ipspace ipspace mentioned this pull request Jun 3, 2026
@ipspace

ipspace commented Jun 4, 2026

Owner

A few quick thoughts -- more details after I implement the hooks this needs and we rebase it.

Added must_be_group_id and must_be_node_or_group (mirroring must_be_node_id).

Will cherry-pick this change into a new PR. Don't want to have this hidden in a large blob of unrelated code.

  • Hooks (post_start_clab / pre_stop_clab) are baked into each server's snapshot with a
    relative vxlan-setup.sh path, so they fire locally where netlab up --snapshot runs.

👍

  • New multiserver.vxlan.auto_start bool (default true). When false, scripts
    are generated but CLI hooks are not registered — tunnels stay down until run
    manually (staged convergence or single-server local testing).

👍

  • make_paths_absolute() no longer touches per-server snapshots.

That's a huge can of worms. We need the absolute paths in the snapshot and in the Ansible inventory. I think it would be best to have a plugin hook executed very early in "netlab up" so it can adjust the topology data and recreate the snapshot and Ansible inventory before "netlab up" does some real work.

I am not sure about package: (the netlab installation path): is it baked into the snapshot file?

It's resolved from the current installation path in make_paths_absolute

That would require netlab to be installed at the same path on every machine, although I would say you usually set up your machines similarly anyway (by cloning them).

Unless you install netlab as root on lab VMs and in virtual environment on your local machine.

Also unhappy with _cwd_relative and _resolve_snapshot_paths

Agreed. That whole thing has to be solved in a different way.

@ipspace ipspace left a comment

Owner

A mixed bag of nits, things that should be fixed (more info on replicated nodes, default server uplink interface name...), and Another Grand Idea (😜). I'm perfectly fine if you tell me to defer the Grand Idea to a later time and just merge this thing.

{% endfor %}
{% endfor %}
{% elif l.node_count == 2 %}
{% if l.clab is defined and l.clab.vxlan is defined %}

Owner

Due to the way we implement undefined Jinja2 values, you only need to test l.clab.vxlan is defined. The Jinja2 environment uses a custom `undefined` method that can handle dictionary hierarchy. For example, `a.b.c is defined` returns False instead of crashing even when `a.b` does not exist. There is no need for an extra `a.b is defined` guard.
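
The behavior is easy to model outside Jinja2. The sketch below is a plain-Python analogy, not netlab's implementation (which is not shown in this thread): `Missing`, `Node`, and `is_defined` are hypothetical names. Jinja2's built-in `ChainableUndefined` class offers comparable chaining.

```python
class Missing:
    """Stands in for an undefined value; attribute access keeps returning Missing."""

    def __getattr__(self, name: str) -> "Missing":
        return Missing()


class Node:
    """Attribute-style, read-only view of a nested dictionary."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def __getattr__(self, name: str):
        # Only called for names not found normally, so '_data' itself is safe.
        value = self._data.get(name, Missing())
        return Node(value) if isinstance(value, dict) else value


def is_defined(value) -> bool:
    """Rough equivalent of the Jinja2 'is defined' test for this model."""
    return not isinstance(value, Missing)


if __name__ == "__main__":
    plain_link = Node({"node_count": 2})
    vxlan_link = Node({"node_count": 2, "clab": {"vxlan": {"vni": 10000}}})
    # No guard on '.clab' is needed: the lookup chain never raises.
    print(is_defined(plain_link.clab.vxlan), is_defined(vxlan_link.clab.vxlan))
```

With chaining in place, a single test on the innermost attribute covers every missing level above it.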

|-----------|------|---------|
| **vni_base** | integer | Starting VNI for cross-server links (default: `10000`) |
| **dstport** | integer | UDP destination port for VXLAN traffic (default: `4789`) |
| **dev** | string | Default physical interface to bind VXLAN tunnels (default: `ens33`) |

Owner

I think it would make more sense to make this eth0 (that's what you get on less-opinionated distros ;) or leave it undefined but make it a required attribute so the user is forced to define it. ens33 is oddly specific.

  site2:
    members: [ site2-r1, site2-r2, site2-r3, site2-r4, site2-r5 ]
  sites:
    members: [ site1-r1, site1-r2, site1-r3, site1-r4, site1-r5,

Owner

You might want to point out in a comment that it's even better to use members: [ site1, site2 ]

(multiserver-replicate)=
### Replicated Nodes

Nodes listed in **multiserver.replicate** are instantiated on every server. This is useful for infrastructure services that need local access on each physical host — for example, monitoring collectors, route reflectors, or DNS resolvers.

Owner

I wonder how well the inevitably overlapping IP addresses work. Also, I don't think route reflectors are a good example (I can easily see how that would result in split routing).

It might be best to move this section to the end of the document and use your specific example, including an explanation of how the overlapping IP addresses are resolved as you're effectively deploying an implicit anycast service.

On second thought, maybe that's a better way to go -- require an explicit anycast service?

I'm fine with whatever you decide is best, and this is not a showstopper. It's just that I can see too many unexpected consequences, so there should be a large enough "THERE BE DRAGONS" sign attached to this concept ;)


* **Local links** connecting nodes on the same server remain as regular containerlab veth pairs or bridges.
* **Cross-server point-to-point links** are provisioned via containerlab's native VXLAN link endpoints (`type: vxlan` in `clab.yml`).
* **Cross-server multi-access links** use a local Linux bridge on each server, interconnected via host-level VXLAN tunnels configured by generated setup scripts.
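
The three cases can be summarized in a short classifier. This is a sketch, not the plugin's `_classify_links`: the link layout, the `node_server` map, and the returned labels are assumptions, replicated nodes are ignored, and the `clab.uplink` exception is taken from the changelog earlier in this conversation.

```python
from typing import Dict


def classify_link(link: dict, node_server: Dict[str, str]) -> str:
    """Return 'local', 'p2p-vxlan', or 'bridge-vxlan' for a single link."""
    endpoints = [intf["node"] for intf in link["interfaces"]]
    servers = {node_server[node] for node in endpoints}
    # An uplink already rides on a physical NIC, so it is never treated as
    # a cross-server link, even when its endpoints span several servers.
    if len(servers) <= 1 or link.get("clab", {}).get("uplink"):
        return "local"
    # Two endpoints: containerlab's native VXLAN link endpoint.
    # More than two: local bridge plus host-level VXLAN tunnels.
    return "p2p-vxlan" if len(endpoints) == 2 else "bridge-vxlan"


if __name__ == "__main__":
    node_server = {"r1": "s1", "r2": "s1", "r3": "s2"}
    lan = {"interfaces": [{"node": "r1"}, {"node": "r2"}, {"node": "r3"}]}
    print(classify_link(lan, node_server))
```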

Owner

Here's a crazy idea: what if you implemented cross-server multi-access links with Linux bridge nodes (https://netlab.tools/node/roles/#implementing-multi-access-links-with-bridge-nodes) -- when analyzing the topology, you could add necessary bridges to nodes and the bridge attribute to links, totally removing the need for extra provisioning scripts.

OTOH, while this would make the end result simpler, you would need a very careful orchestration of steps between this plugin and the bridge code (https://github.com/ipspace/netlab/blob/dev/netsim/roles/bridge.py). You'd have to create the bridge nodes before the pre_transform hook is executed in the bridge role. However, looking at the code, it seems that the roles are the last plugins in the list, so just doing this in the pre_transform hook is probably good enough if you're OK with initializing all the required node data. However, we should not do this too early, or you'd have to crawl through VLAN and VRF links (plus there are topology components and other stuff).

Worst case, I could add another plugin hook ;)

Collaborator

This is starting to look more and more like a core feature, not a plugin

Perhaps we should add the notion of 'servers' to the topology, and add vxlan support for containerlab links?

vxlan:
  vni_base: int
  dstport: int
  dev: str

Owner

This should be a required attribute. At the moment, it's set from the system defaults anyway, so no harm done.

vxlan:
  vni_base: 10000
  dstport: 4789
  dev: ens33

Owner

I would change this to eth0, or we could omit this default which would (together with _required on the topology attribute) force the user to specify it either in the topology or in the defaults.

# When true, 'netlab up --snapshot' auto-runs vxlan-setup.sh via a CLI hook.
# Set false to keep cross-server tunnels inactive until you run the script
# manually (e.g. to stage convergence or connect servers on your own schedule).
auto_start: true

Owner

Another minor detail that would be unnecessary with the "bridge" nodes ;))

# Register atexit handler to copy node_files, host_vars, etc. into each server
# folder after netlab writes all output files.
if server_folders:
    import atexit

Owner

The "post-output" callback is there. Please use that.

for name, s in servers.items():
    if "id" not in s:
        s.id = _dataplane.get_next_id("multiserver_server")
    if "host" not in s:

Owner

If you set the "host" attribute to be required, it will eventually be checked, and if you don't use the "host" value very early, we should be OK.

@ipspace

ipspace commented Jun 12, 2026

Owner

@Muddyblack -- I'm sorry for the long delay. I wanted to push the June release out before Autocon5 (and it turns out one has a limited daily amount of mental energy as one gets older), and then Autocon5 struck (just joking, it was a great conference).

Anyway, I think we should implement multi-access links with bridge nodes in the long run to make your life easier. I could even make that functionality part of netlab core (in the "bridge" role) to have it available regardless of whether the user uses this plugin. Hey, maybe I could even do that by default just to simplify containerlab provisioning (no need for "manual" creation of Linux bridges).

Whether we do that as part of this PR, or merge this and then work on that, is completely up to you. Just let me know ;)

@ipspace

ipspace commented Jun 12, 2026

Owner

Update: the "implement clab multi-access networks with bridge nodes" idea is much harder to implement than I thought if we want to support VLAN trunks or routed VLANs across multi-access networks -- we would have to split the multi-access link into P2P links very late in the transformation process.

For the moment, let's fix the other stuff and move forward with this. When I implement that other idea, the multi-access part of the code in this PR will become irrelevant, at which point we can do a cleanup.

@Muddyblack

Collaborator Author

No problem. Some time away from the computer to recharge is good for me too ;)
Autocon would have been nice, especially as it was in Munich this year, but I forgot about it after a while 😅
Maybe I can find some material about it online, though.

