If something else is also listening on :3452 (a leftover
container, a systemd-managed socket, a proxy from an earlier
session), the kernel can route new connects to it and that
listener may RST. Curl from outside gets a clean response
from the real listener; the agent's WS dial lands on the
stale one and gets RST.
Add ss -tlnp + lsof + systemctl list-sockets to the
diagnostic ladder.
When 186's control plane comes up via systemctl daemon-reexec
(the 226/NAMESPACE fix), the listening socket is briefly
dropped and re-created. Any WebSocket connection that was
in flight at that moment gets RST. The agent retries every
2s, but if the dial happened exactly during the reexec
window the agent can sit in a tight RST loop until restarted.
Document the restart-agent-micro as the first thing to try,
and demote the iptables/fail2ban/curl diagnostic steps to
'if the agent still RSTs after restart'.
'connection reset by peer' on the WS dial is at the TCP layer,
not the application layer. Almost always a firewall on 186
iptables REJECT, fail2ban ban, or stale conntrack state.
Document the diagnostic ladder: iptables -L INPUT, fail2ban
status, then a plain HTTP curl from 92 to verify the network
path, then a WS upgrade curl on 186 itself to verify the
control plane's upgrader.
The control plane binary will create the data dir on first
start, but doing it before systemd starts the service means
the ReadWritePaths scope has somewhere to point at, and
faster diagnosis if anything else is wrong.
The 226/NAMESPACE with 'Failed to set up mount namespacing' error
is misleading — the binary is fine, but systemd can't build the
mount namespace for the service. The binary runs fine when
launched directly as administrator; the bug is in the systemd
manager's runtime state.
Document the diagnostic (run the binary manually) and the two
fixes: systemctl daemon-reexec (recreate /run/systemd) as the
first attempt, reboot as the last resort.
The user has made it clear (twice now) that they don't want
sudo advice in the runbook — they can type the password
themselves and don't want a script or sudoers change.
Delete the 'Diagnose sudo' step and the 'Sudo on the company
VMs' reminder step. Sudo is just expected behavior; when
the user runs 'sudo systemctl ...' and gets a prompt, they
type the password. No commentary needed.
Renumber the remaining steps so they're sequential 0-8.
Company VM — no sudoers changes. Replace the 'set up sudoers
NOPASSWD' step with a brief note that every sudo call will
prompt for the password and the user types it. The 15-minute
sudo timestamp means the user only types it once per shell
session, but they will see the prompt several times across
the deploy as they run multiple sudo commands.
Update the step-1 diagnostic outcomes to point at the new
no-policy-change reality: NOPASSWD or different passwords
both still work, the user just types the right one at each
sudo prompt.
A hand-typed manual deploy guide. Every step is a single
ssh or scp from the laptop, or a one-shot block of commands
inside the VM. No sshpass, no env-var passwords, no
sudo -S password piping. The user types their passwords
interactively when prompted.
The old deploy.sh had grown into a tangle of -tt / sudo -S
/ PAGER=cat workarounds that hid what was actually happening
and was fragile across systemd versions. The runbook trades
that off for explicit per-step commands that the user can
verify by reading the output.
Troubleshooting section at the bottom covers the four most
likely first-deploy failures: SDP_CP_URL expansion, micro
agent can't reach the control plane, login auth rejection,
and missing runtime images.