10 Commits

Author SHA1 Message Date
Achmad d5d5e5467d DEPLOY.md: two more WS-upgrade tests for the curl-works-but-RST case
When plain HTTP from 92 reaches the control plane but the
WebSocket dial RSTs, test the upgrader on each side:

1. curl from 186 to 127.0.0.1:3452 with WS upgrade headers:
   - 101 → control plane is fine, network is the issue.
   - RST/4xx → control plane is broken.

2. curl from 92 to 186:3452 with WS upgrade headers:
   - 101 → firewall allows WS traffic, agent's client is the issue.
   - RST → some middlebox matches on the Upgrade header.
   - 4xx → control plane rejects the upgrade.
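
A minimal sketch of the two probes and how their results map to
the outcomes above. The function names are hypothetical; port 3452
and the /ws/agent path are taken from the agent's dial URL, and the
Sec-WebSocket-Key is just a sample value.

```shell
# Sketch only: ws_probe and ws_verdict are illustrative names,
# not commands from DEPLOY.md.

# Send a WebSocket upgrade request and print the HTTP status code.
# Prints 000 when no HTTP response arrived (RST, refused, timeout);
# curl's exit code and stderr tell those cases apart. After a 101,
# curl keeps the connection open until --max-time expires.
ws_probe() {
  host="$1"
  curl -sS -o /dev/null -w '%{http_code}\n' --http1.1 --max-time 5 \
    -H 'Connection: Upgrade' \
    -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' \
    -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
    "http://${host}:3452/ws/agent"
}

# Map a probe result to the conclusions listed above.
# $1 is "local" (run on 186 against 127.0.0.1) or "remote" (run on 92).
ws_verdict() {
  side="$1"; code="$2"
  case "$side:$code" in
    local:101)  echo "control plane is fine, network is the issue" ;;
    local:*)    echo "control plane is broken" ;;
    remote:101) echo "firewall allows WS traffic, agent's client is the issue" ;;
    remote:000) echo "no HTTP response, possibly a middlebox matching on the Upgrade header" ;;
    remote:4*)  echo "control plane rejects the upgrade" ;;
    *)          echo "unexpected result: $code" ;;
  esac
}
```

On 186 this would be run as ws_verdict local "$(ws_probe 127.0.0.1)",
and on 92 as ws_verdict remote "$(ws_probe 172.18.139.186)".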
2026-06-24 05:52:04 +00:00
Achmad f3da975eb7 DEPLOY.md: remove the 'agent on 92 can't reach 186:3452' section
The user's micro agent dials ws://172.18.139.186:3452/ws/agent
directly, bypassing nginx. The 'add a /ws/agent nginx proxy
block' workaround was for a different topology where the
agent goes through nginx on 80, which doesn't apply here.

That section was confusing — it suggested an nginx change
the user doesn't need. Delete it.
2026-06-24 05:48:10 +00:00
Achmad b8736f4ac3 DEPLOY.md: also check what's bound to port 3452 when curl works
If something else is also listening on :3452 (a leftover
container, a systemd-managed socket, a proxy from an earlier
session), the kernel can route new connections to it and that
listener may RST. Curl from outside gets a clean response
from the real listener; the agent's WS dial lands on the
stale one and gets RST.

Add ss -tlnp + lsof + systemctl list-sockets to the
diagnostic ladder.
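
A rough sketch of that check, assuming the local address is the
fourth column of ss -tlnH output (as in current iproute2). The
helper names are hypothetical; only the three tools come from the
runbook.

```shell
# Count TCP listeners on a port, reading `ss -tlnH` output on stdin.
# More than one listener on :3452 points at a stale second socket.
count_listeners() {
  port="$1"
  awk -v port="$port" '$4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# Show which process or systemd unit owns each listener.
# Defined here but not run automatically; sudo will prompt as usual.
show_port_owners() {
  port="$1"
  sudo ss -tlnp "sport = :${port}"
  sudo lsof -nP -iTCP:"${port}" -sTCP:LISTEN
  systemctl list-sockets
}
```

Typical use on 186 would be ss -tlnH | count_listeners 3452,
followed by show_port_owners 3452 if the count is above 1.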
2026-06-24 05:43:53 +00:00
Achmad fc768dbd85 DEPLOY.md: stale WS connection after daemon-reexec is the usual RST cause
When 186's control plane comes up via systemctl daemon-reexec
(the 226/NAMESPACE fix), the listening socket is briefly
dropped and re-created. Any WebSocket connection that was
in flight at that moment gets RST. The agent retries every
2s, but if the dial happened exactly during the reexec
window the agent can sit in a tight RST loop until restarted.

Document restarting agent-micro as the first thing to try,
and demote the iptables/fail2ban/curl diagnostic steps to
'if the agent still RSTs after restart'.
2026-06-24 05:40:42 +00:00
Achmad 8c598ad69f DEPLOY.md: troubleshooting for agent-micro WS RST
'connection reset by peer' on the WS dial is at the TCP layer,
not the application layer. It is almost always a firewall issue
on 186: an iptables REJECT, a fail2ban ban, or stale conntrack state.

Document the diagnostic ladder: iptables -L INPUT, fail2ban
status, then a plain HTTP curl from 92 to verify the network
path, then a WS upgrade curl on 186 itself to verify the
control plane's upgrader.
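
A sketch of the firewall rung of that ladder, assuming the rules
are dumped with iptables -S INPUT. The blocking_rules helper is a
hypothetical name, and it deliberately errs on the side of showing
too much.

```shell
# Filter `iptables -S INPUT` output (on stdin) down to REJECT/DROP
# rules that could affect the given port.
blocking_rules() {
  port="$1"
  awk -v port="$port" '
    /-j (REJECT|DROP)/ {
      # Keep rules with no --dport match (they are not limited to one port)
      # and rules whose --dport is exactly the port of interest.
      if ($0 !~ /--dport /) { print; next }
      if ($0 ~ ("--dport " port "( |$)")) print
    }'
}

# The remaining rungs, printed rather than run so the user can
# type each one and read the output.
ladder_commands() {
  echo "sudo iptables -L INPUT -n -v --line-numbers"
  echo "sudo fail2ban-client status"
  echo "sudo conntrack -L -p tcp --dport 3452"
}
```

On 186 this would be used as sudo iptables -S INPUT | blocking_rules 3452.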
2026-06-24 05:36:18 +00:00
Achmad 0569cede43 DEPLOY.md: pre-create ~/SDP/data on 186; create it in step 6a too
The control plane binary will create the data dir on first
start, but creating it before systemd starts the service means
the ReadWritePaths scope has somewhere to point at, and it
makes diagnosis faster if anything else is wrong.
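
A small sketch of the pre-create step. ensure_data_dir is a
hypothetical helper; the default path is the ~/SDP/data named in
the subject line.

```shell
# Create the data directory if it is missing and print its path.
# mkdir -p is a no-op when the directory already exists.
ensure_data_dir() {
  dir="${1:-$HOME/SDP/data}"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

Running ensure_data_dir with no argument on 186 covers the default
location; passing a path is only there to make the helper easy to
try elsewhere.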
2026-06-24 05:31:51 +00:00
Achmad 92354252e5 DEPLOY.md: troubleshooting for status=226/NAMESPACE failure
The 226/NAMESPACE failure with the 'Failed to set up mount namespacing'
error is misleading: the binary is fine, but systemd can't build the
mount namespace for the service. The binary runs fine when
launched directly as administrator; the bug is in the systemd
manager's runtime state.

Document the diagnostic (run the binary manually) and the two
fixes: systemctl daemon-reexec (recreate /run/systemd) as the
first attempt, reboot as the last resort.
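
A sketch of how the failure can be recognised and the fixes
ordered. Both helper names are hypothetical, and the unit name is a
parameter because the service name is not given here. The manual
run of the binary is left out for the same reason: its path is not
stated.

```shell
# Succeeds when stdin (for example `systemctl status <unit>` output)
# reports the 226/NAMESPACE exit status.
has_namespace_failure() {
  grep -q 'status=226/NAMESPACE'
}

# Print the fixes in the order described above. They are printed,
# not executed, so the user types each sudo command themselves.
namespace_fix_steps() {
  unit="$1"
  echo "sudo systemctl daemon-reexec"
  echo "sudo systemctl restart ${unit}"
  echo "systemctl status ${unit} --no-pager"
  echo "sudo reboot   # last resort, only if the status is still 226/NAMESPACE"
}
```

For example, systemctl status UNIT --no-pager | has_namespace_failure
followed by namespace_fix_steps UNIT, where UNIT is the control
plane's service name.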
2026-06-24 05:27:59 +00:00
Achmad 1f1ff2f173 DEPLOY.md: drop sudo discussion entirely
The user has made it clear (twice now) that they don't want
sudo advice in the runbook — they can type the password
themselves and don't want a script or sudoers change.

Delete the 'Diagnose sudo' step and the 'Sudo on the company
VMs' reminder step. Sudo is just expected behavior; when
the user runs 'sudo systemctl ...' and gets a prompt, they
type the password. No commentary needed.

Renumber the remaining steps so they're sequential 0-8.
2026-06-24 05:19:25 +00:00
Achmad d11723ee63 DEPLOY.md: drop NOPASSWD advice, document interactive sudo
Company VM — no sudoers changes. Replace the 'set up sudoers
NOPASSWD' step with a brief note that every sudo call will
prompt for the password and the user types it. The 15-minute
sudo timestamp means the user only types it once per shell
session while successive sudo calls stay within that window,
but they will still see the prompt several times across the
deploy as they run multiple sudo commands.

Update the step-1 diagnostic outcomes to point at the new
no-policy-change reality: NOPASSWD or different passwords
both still work, the user just types the right one at each
sudo prompt.
2026-06-24 05:16:52 +00:00
Achmad 1eddef9f65 Add DEPLOY.md: copy-pasteable runbook, no deploy.sh
A hand-typed manual deploy guide. Every step is a single
ssh or scp from the laptop, or a one-shot block of commands
inside the VM. No sshpass, no env-var passwords, no
sudo -S password piping. The user types their passwords
interactively when prompted.

The old deploy.sh had grown into a tangle of -tt / sudo -S
/ PAGER=cat workarounds that hid what was actually happening
and was fragile across systemd versions. The runbook trades
that off for explicit per-step commands that the user can
verify by reading the output.

Troubleshooting section at the bottom covers the four most
likely first-deploy failures: SDP_CP_URL expansion, micro
agent can't reach the control plane, login auth rejection,
and missing runtime images.
2026-06-24 05:14:06 +00:00