fix(network): bind strict-PQ peer identity to staking ML-DSA key so validators produce blocks#131

Draft
Darkhorse7stars wants to merge 1 commit into main from fix/pq-peer-nodeid-block-production

@Darkhorse7stars
Member

Summary

On a strict-PQ chain a peer's consensus identity is its ML-DSA-65 NodeID (StakingConfig.DeriveNodeID), but the network layer kept every peer on the TLS-cert NodeID derived during the transport upgrade (peer/upgrader.go: ids.NodeIDFromCert). The validator set is keyed by the ML-DSA NodeID, so every peer was classified as a non-validator → the P-chain saw zero connected validators → consensus never formed → no block was ever produced (the built-in EVM/C-Chain stays at height 0; RPC serves reads, but eth_blockNumber never advances).

This was observed on the Liquidity strict-PQ devnet: 3 validators, all healthy, all BLS-correct, peer mesh formed — but P-chain height stuck at 0.
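
The lookup miss described above can be sketched in a few lines of Go. This is an illustration only, not the node's code: NodeID, deriveID, buildDevnet and connectedValidators are hypothetical names, and the SHA-256 based derivation merely stands in for StakingConfig.DeriveNodeID and ids.NodeIDFromCert, whose real implementations are not shown in this PR.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// NodeID is a hypothetical 20-byte identifier used only in this sketch.
type NodeID [20]byte

// deriveID is an assumed stand-in for both derivations: it hashes a label
// plus key material and truncates to 20 bytes. The real functions differ.
func deriveID(label string, material []byte) NodeID {
	h := sha256.New()
	h.Write([]byte(label))
	h.Write(material)
	var id NodeID
	copy(id[:], h.Sum(nil))
	return id
}

// buildDevnet returns a validator set keyed by the ML-DSA-style ID, plus the
// TLS-style and ML-DSA-style IDs of the same n nodes.
func buildDevnet(n int) (map[NodeID]uint64, []NodeID, []NodeID) {
	validators := make(map[NodeID]uint64, n)
	var tlsIDs, pqIDs []NodeID
	for i := 0; i < n; i++ {
		pq := deriveID("mldsa", []byte(fmt.Sprintf("staking-pubkey-%d", i)))
		validators[pq] = 100 // illustrative weight
		pqIDs = append(pqIDs, pq)
		tlsIDs = append(tlsIDs, deriveID("tls", []byte(fmt.Sprintf("tls-cert-%d", i))))
	}
	return validators, tlsIDs, pqIDs
}

// connectedValidators counts the peer IDs that appear in the validator set.
func connectedValidators(validators map[NodeID]uint64, peerIDs []NodeID) int {
	n := 0
	for _, id := range peerIDs {
		if _, ok := validators[id]; ok {
			n++
		}
	}
	return n
}

func main() {
	validators, tlsIDs, pqIDs := buildDevnet(3)
	fmt.Println("peers keyed by TLS-cert ID:", connectedValidators(validators, tlsIDs))
	fmt.Println("peers keyed by ML-DSA ID:", connectedValidators(validators, pqIDs))
}
```

With peers tracked under the TLS-style ID, none of the three nodes is found in the validator set, which matches the "zero connected validators" symptom; keyed by the ML-DSA-style ID, all three are found.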

Root cause — two coupled defects

1. The PQ handshake signed with an ephemeral key, not the staking key.
network.NewNetwork built the handshake identity with peer.NewLocalIdentity(MyNodeID), which generates a fresh ML-DSA keypair per process (see its doc-comment). So the handshake signature proved possession of a throwaway key with no relationship to the staking key that MyNodeID derives from. The wire carried the right NodeID, but nothing bound it to a key the validator set knows. (Corollary: the handshake never actually authenticated the validator identity — a peer could assert any NodeID.)

2. The verified peer NodeID was discarded.
peer.runPQHandshakeIfRequired used HandshakeResult.AEADKey but dropped HandshakeResult.PeerNodeID, leaving p.id on the transport TLS-cert NodeID. Consensus then looked up that TLS NodeID in an ML-DSA-keyed validator set and found nothing.
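
Defect 1 can be illustrated with a small sketch. Go's standard library has no ML-DSA implementation, so Ed25519 is used here purely as a stand-in signature scheme; signHandshake, verifiesUnderStakingKey and mustKey are hypothetical helpers and the transcript bytes are made up. The point is only that a signature made with a fresh per-process key proves nothing about the staking key.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// mustKey generates a fresh keypair (Ed25519 as a stand-in for ML-DSA-65).
func mustKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signHandshake signs the handshake transcript with the given private key.
func signHandshake(priv ed25519.PrivateKey, transcript []byte) []byte {
	return ed25519.Sign(priv, transcript)
}

// verifiesUnderStakingKey reports whether sig over transcript verifies
// against the staking public key that the validator set knows about.
func verifiesUnderStakingKey(stakingPub ed25519.PublicKey, transcript, sig []byte) bool {
	return ed25519.Verify(stakingPub, transcript, sig)
}

func main() {
	stakingPub, stakingPriv := mustKey() // persistent staking key
	_, ephemeralPriv := mustKey()        // fresh per-process key (old behavior)

	transcript := []byte("illustrative handshake transcript")

	oldSig := signHandshake(ephemeralPriv, transcript)
	newSig := signHandshake(stakingPriv, transcript)

	fmt.Println("ephemeral-key signature verifies under staking key:",
		verifiesUnderStakingKey(stakingPub, transcript, oldSig))
	fmt.Println("staking-key signature verifies under staking key:",
		verifiesUnderStakingKey(stakingPub, transcript, newSig))
}
```

Only the signature made with the staking key verifies against the staking public key, which is why a handshake signed by a throwaway key cannot authenticate the NodeID carried on the wire.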

Fix

  • Thread the node's persistent staking ML-DSA keypair (StakingConfig.StakingMLDSA{,Pub}) onto network.Config (mirrored in node.Node.initNetworking) and build the handshake LocalIdentity from it via the new peer.NewLocalIdentityFromStakingKey. The handshake now signs with the same key that derives MyNodeID.
  • peer.adoptVerifiedPQIdentity (new): after a successful handshake, re-derive the NodeID from the peer's presented ML-DSA key under the node-identity domain (ids.Empty — the exact domain DeriveNodeID uses for MyNodeID) and require it to equal the presented NodeID, then adopt that ML-DSA NodeID as p.id.

This both fixes block production and closes an identity-impersonation gap: a peer can no longer claim a NodeID it cannot derive from the key it proved possession of.
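
A minimal sketch of the bind-then-adopt check, under stated assumptions: the types below are simplified stand-ins, the PeerPubKey field is assumed (the PR only names AEADKey and PeerNodeID on HandshakeResult), and deriveNodeID uses SHA-256 over domain plus key as a placeholder for the real DeriveNodeID. It is not the code added by this PR.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// NodeID is a hypothetical 20-byte identifier used only in this sketch.
type NodeID [20]byte

// handshakeResult is a simplified stand-in; PeerPubKey is an assumed field.
type handshakeResult struct {
	PeerNodeID NodeID
	PeerPubKey []byte
}

type peer struct{ id NodeID }

var (
	errEmptyResult = errors.New("missing handshake result or public key")
	errUnbound     = errors.New("presented NodeID is not derived from presented key")
)

// deriveNodeID is a placeholder derivation: hash(domain || pubKey), truncated.
func deriveNodeID(domain [32]byte, pubKey []byte) NodeID {
	h := sha256.New()
	h.Write(domain[:])
	h.Write(pubKey)
	var id NodeID
	copy(id[:], h.Sum(nil))
	return id
}

// adoptVerifiedIdentity re-derives the NodeID from the presented key under
// the all-zero domain (standing in for ids.Empty) and adopts it as p.id only
// if it equals the NodeID the peer presented. On error p.id is left as-is.
func (p *peer) adoptVerifiedIdentity(res *handshakeResult) error {
	if res == nil || len(res.PeerPubKey) == 0 {
		return errEmptyResult
	}
	var nodeIdentityDomain [32]byte
	derived := deriveNodeID(nodeIdentityDomain, res.PeerPubKey)
	if derived != res.PeerNodeID {
		return errUnbound
	}
	p.id = derived
	return nil
}

func main() {
	var domain [32]byte
	pub := []byte("peer-mldsa-public-key")
	p := &peer{id: NodeID{0xAA}} // starts on the transport (TLS-cert) ID

	bound := &handshakeResult{PeerNodeID: deriveNodeID(domain, pub), PeerPubKey: pub}
	forged := &handshakeResult{PeerNodeID: NodeID{0x01}, PeerPubKey: pub}

	fmt.Println("forged:", p.adoptVerifiedIdentity(forged))
	fmt.Println("bound:", p.adoptVerifiedIdentity(bound))
}
```

The three cases mirror the ones the PR's tests are described as covering: a bound identity is adopted, a forged NodeID is rejected with the peer's identity untouched, and a nil or empty result is rejected.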

Because peer.Start runs the handshake synchronously before the message-pump goroutines and before network.upgrade adds the peer to connectingPeers/connectedPeers, p.id is already the ML-DSA NodeID by the time any peer-set bookkeeping keys by it — no re-keying race.

Blast radius

Entirely inside the strict-PQ path (SecurityProfile != nil && profileRequiresPQHandshake). Classical / permissive chains skip the PQ handshake and are unaffected; p.id stays the TLS-cert NodeID exactly as before. No wire-format change (same INIT/RESP frames); only which key signs, plus an added local verification.

Rollout / review notes

  • Coordinated upgrade for strict-PQ networks. The binding check rejects the old ephemeral-key handshake, so all nodes on a strict-PQ chain must run this together.
  • Needs a devnet soak (3-validator strict-PQ: confirm P-chain height advances + getCurrentValidators lists all NodeIDs with weight + EVM eth_blockNumber increments) before any production rollout. Drafted for chain-team review; do not blind-merge/deploy to a live network.

Tests

go build ./... clean; go vet clean; go test ./network/peer/... green. Adds network/peer/pq_identity_adopt_test.go covering: bound identity → adopted; unbound/forged NodeID → rejected + identity untouched; nil/empty result → rejected.

Follow-ups (not in this PR)

  • The pre-handshake MyNodeID self-check and AllowConnection in network.upgrade still evaluate the TLS-cert NodeID (best-effort pre-filters; authoritative gating is post-handshake on p.id). Worth migrating to the ML-DSA NodeID for completeness.
  • The PQ handshake runs under peersLock in network.upgrade; a slow peer serializes connection establishment. Pre-existing; orthogonal to this fix.

…alidators produce blocks

On a strict-PQ chain a peer's consensus identity is its ML-DSA-65 NodeID
(StakingConfig.DeriveNodeID), but the network layer kept every peer on the
TLS-cert NodeID derived during the transport upgrade. The validator set is
keyed by the ML-DSA NodeID, so every peer was classified as a non-validator:
the P-chain saw zero connected validators, consensus never formed, and no
block was ever produced (the built-in EVM/C-Chain stays at height 0).

Two coupled defects:

1. network.NewNetwork built the PQ handshake identity with
   peer.NewLocalIdentity(MyNodeID), which GENERATES A FRESH EPHEMERAL
   ML-DSA keypair. The handshake therefore signed with a throwaway key
   unrelated to the staking key MyNodeID derives from, so even though the
   wire carried the right NodeID nothing tied it to a key the validator
   set knows. (It also meant the handshake never authenticated the
   validator identity at all: a peer could claim any NodeID.)

2. peer.runPQHandshakeIfRequired discarded HandshakeResult.PeerNodeID and
   left p.id on the transport TLS-cert NodeID.

Fix:

- Thread the node's persistent staking ML-DSA keypair
  (StakingConfig.StakingMLDSA{,Pub}) onto network.Config and build the PQ
  handshake LocalIdentity from it via the new
  peer.NewLocalIdentityFromStakingKey. The handshake now signs with the
  same key that derives MyNodeID.
- After a successful handshake, peer.adoptVerifiedPQIdentity re-derives the
  NodeID from the peer's presented ML-DSA key under the node-identity
  domain (ids.Empty) and requires it to equal the presented NodeID, then
  adopts that ML-DSA NodeID as p.id. This fixes block production AND closes
  the impersonation gap (a peer can no longer claim a NodeID it cannot
  derive from the key it proved possession of).

Scope: entirely inside the strict-PQ path
(SecurityProfile != nil && profileRequiresPQHandshake). Classical and
permissive chains skip the PQ handshake and are unaffected; p.id stays the
TLS-cert NodeID exactly as before. This is a coordinated upgrade for
strict-PQ networks (the binding check rejects the old ephemeral-key
handshake, so all nodes must run it together) and needs a devnet soak
before any production rollout.

Adds white-box tests for the bind / adopt / reject paths.