Skip to content

Remove mailbox for now#268

Merged
sjmiller609 merged 4 commits into
mainfrom
revert-260-hypeship/network-handoff-v2
Jun 3, 2026
Merged

Remove mailbox for now#268
sjmiller609 merged 4 commits into
mainfrom
revert-260-hypeship/network-handoff-v2

Conversation

@sjmiller609

@sjmiller609 sjmiller609 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Reverts #260


Note

Medium Risk
Changes networked fork/restore behavior on Firecracker by dropping the fast mailbox/ack path; restores depend on vsock guest-agent reconfigure, which affects timing and failure modes for forked standby restores.

Overview
Reverts the guest-initiated resume network mailbox flow (PR #260): the host no longer patches Firecracker snapshot memory, waits for guest UDP acks, or arms mailbox env on create/start.

Restore and fork networking after standby/running resume now rely only on the existing host-initiated path—reconfigureGuestNetwork via guest-agent gRPC (with shell ip exec fallback)—when a fresh allocation is applied for forked standby restores.

Removed code spans lib/mailbox, instance handoff helpers (prepareResumeNetworkHandoff, mailbox patching/UDP waiter), guest-agent VMGenID/mailbox watcher, related tests/docs, and a ForkSnapshot API unit test. Linux test env no longer passes HYPEMAN_TEST_NETWORK_TMPDIR; parallel test network locks/leases use the system temp dir only.

Reviewed by Cursor Bugbot for commit 71309cb. Bugbot is set up for automated code reviews on this repo. Configure here.

@sjmiller609 sjmiller609 requested a review from rgarcia June 2, 2026 22:15
@firetiger-agent

Copy link
Copy Markdown

Created a monitoring plan for this PR.

What this PR does: Removes the guest-initiated resume-network mailbox — a fast-path where hypeman pre-patched VM snapshot memory with new network identity before restore. Network reconfiguration after standby→running transitions now uses only the shell-exec path (already deployed as the fallback).

Intended effect:

  • Spawn success rate: baseline 99.9–100% (310K–754K successes/hr active hours); confirmed if rate holds at ≥99.9%
  • POST /browsers 500 errors: baseline 0–6/hr during clean active hours (10K–14K req/hr); confirmed if no sustained increase above ~50/hr post-deploy

Risks:

  • Restore failures (exec path less reliable than mailbox)kernel_invocation_spawn_total{success=false}, alert if >1,000/hr sustained for 2+ hours (clean baseline: 0–240/hr; infra-noise ceiling ~774/hr)
  • Restore latency increase — POST /browsers p95 latency, alert if sustained above ~12s (baseline 8–10s p95)
  • Fork readiness regression — spawn failures or "wait for forked guest agent readiness" ERROR logs; alert on any new occurrence
  • failed to configure guest network after restore ERROR logs — any occurrence post-deploy is actionable

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 58ce89f. Configure here.

Comment thread lib/instances/manager.go Outdated
@sjmiller609 sjmiller609 changed the title Revert "Add mailbox resume network handoff" Remove mailbox for now Jun 2, 2026
@sjmiller609

Copy link
Copy Markdown
Collaborator Author

CI server firewall was messed up dropping guest -> host communication, which is why CI was failing:

TASK [nftables_firewall : Deploy nftables ruleset] ********************************************************************************************************************************
--- before: /etc/nftables.conf
+++ after: /Users/steven/.ansible/tmp/ansible-local-67508yu0zar6_/tmpbq_ycng0/nftables.conf.j2
@@ -20,6 +20,9 @@
...
+
+        # Trusted host-local interfaces
+        iifname "hm*" accept

@sjmiller609 sjmiller609 merged commit 43d1210 into main Jun 3, 2026
17 of 20 checks passed
@sjmiller609 sjmiller609 deleted the revert-260-hypeship/network-handoff-v2 branch June 3, 2026 14:56
rgarcia added a commit that referenced this pull request Jun 10, 2026
The Linux integration tests create per-bridge iptables FORWARD/NAT rules.
On the shared self-hosted runner these leak: the test-harness cleanup
called iptables without the -w lock-wait flag and ignored all errors, so
under xtables-lock contention from concurrent CI jobs cleanup failed
silently. Rules whose bridge interface no longer exists then accumulated
indefinitely (cleanupStaleLinkDownRoutes only handles still-present
linkdown routes), inflating every iptables operation over time and
contributing to flaky "instance did not reach Running within 45s"
timeouts.

- Add -w 5 to all harness iptables calls (matches lib/network convention)
- Surface cleanup errors to stderr instead of swallowing them; retry
  deletes on transient lock contention; treat already-gone routes/links
  as benign
- Sweep orphaned test iptables rules (interface gone) once per test
  binary under the existing subnet lock, scoped to the "hm" test prefix
  so a non-test hypeman process's "ha" rules are never touched
- Remove dead HYPEMAN_TEST_NETWORK_TMPDIR plumbing from test.yml (its Go
  reader was reverted in #268; re-wiring it would reintroduce per-run
  lock/lease isolation and break cross-run subnet coordination)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants