DEPLOY.md: stale WS connection after daemon-reexec is the usual RST cause

When 186's control plane comes up via systemctl daemon-reexec (the 226/NAMESPACE fix), the listening socket is briefly dropped and re-created. Any WebSocket connection in flight at that moment gets an RST. The agent retries every 2s, but if the dial landed exactly in the reexec window the agent can sit in a tight RST loop until restarted. Document restarting agent-micro as the first thing to try, and demote the iptables/fail2ban/curl diagnostic steps to 'if the agent still RSTs after restart'.
2026-06-24 05:40:42 +00:00
parent 8c598ad69f
commit fc768dbd85
1 changed files with 18 additions and 16 deletions
@@ -291,9 +291,20 @@ If still failing, the systemd manager itself is in a bad state. Reboot the VM (l

 ### Agent-micro on 92 gets `connection reset by peer` connecting to 186:3452

-`connection reset by peer` is at the TCP layer — the SYN reaches the host kernel but something RSTs the connection before the control plane sees it. Common causes:
+`connection reset by peer` is a TCP-layer error. The most common cause is a **stale WebSocket connection from before 186's `daemon-reexec`** (step 6b). The kernel briefly drops the listening socket during the re-exec, and any in-flight connection that tries to use it gets an RST. The agent retries every 2s, but if it connected just before the re-exec it can keep hammering the broken fd, so restart it first:

-1. **iptables on 186 has a REJECT rule for 3452.** Check:
+```bash
+ssh administrator@172.18.136.92
+sudo systemctl restart sdp-agent-micro.service
+sudo journalctl -u sdp-agent-micro.service -n 20 --no-pager -f
+# Look for: "agent-micro connected as micro"
+# Ctrl-C to exit journalctl
+exit
+```
+
+If the journal shows the agent connecting and immediately seeing RST again (a tight loop with no successful connect), the path itself is being blocked. Run these on 186 to find the cause:
+
+1. **iptables on 186 has a REJECT rule for 3452:**

   ```bash
   ssh administrator@172.18.139.186
@@ -303,7 +314,7 @@ If still failing, the systemd manager itself is in a bad state. Reboot the VM (l

   If you see a REJECT rule for port 3452, delete or modify it. The control plane is on the same host, so there's no reason to filter loopback or local-subnet traffic to it.

-2. **fail2ban has banned the agent's IP.** Check:
+2. **fail2ban has banned 92's IP:**

   ```bash
   ssh administrator@172.18.139.186
@@ -314,16 +325,7 @@ If still failing, the systemd manager itself is in a bad state. Reboot the VM (l

   If 92's IP is in the banned list, add the SDP subnet to `ignoreip` in `/etc/fail2ban/jail.local`, then run `sudo fail2ban-client reload`.

-3. **The kernel's connection tracking has stale state.** Restart it (last resort):
-
-   ```bash
-   ssh administrator@172.18.139.186
-   sudo systemctl restart nftables 2>/dev/null
-   sudo systemctl restart firewalld 2>/dev/null
-   exit
-   ```
-
-4. **Verify the network path works at all** before debugging firewall rules. From 92, a plain HTTP request to the control plane's port:
+3. **Verify the network path works at all** (rules out firewall entirely). From 92, a plain HTTP request to the control plane's port:

   ```bash
   ssh administrator@172.18.136.92
@@ -331,9 +333,9 @@ If still failing, the systemd manager itself is in a bad state. Reboot the VM (l
   exit
   ```

-   If you get *any* HTTP response (even a Go HTTP 400 for "missing node query") → the path is open and the problem is the WebSocket upgrade. If you get `Connection reset by peer` again → the path is being blocked, look at the iptables/fail2ban angle.
+   If you get *any* HTTP response (a Go HTTP 404 "page not found" is the expected answer for `GET /`) → the path is open and the problem is the WebSocket connection state. Restart the agent as above. If you get `Connection reset by peer` again → the path is being blocked, look at iptables/fail2ban.

-5. **Verify the WS endpoint works on 186 itself** (rules out the network and confirms the upgrade logic):
+4. **Verify the WS endpoint accepts an upgrade on 186 itself** (rules out the control plane binary):

   ```bash
   ssh administrator@172.18.139.186
@@ -345,4 +347,4 @@ If still failing, the systemd manager itself is in a bad state. Reboot the VM (l
   exit
   ```

-   Should return HTTP 101 Switching Protocols. If it does, the network from 92 is the issue. If it doesn't, the control plane binary has a problem.
+   Should return HTTP 101 Switching Protocols. If it does, the control plane is healthy and the problem is in the network path. If it doesn't, the control plane binary has a problem (check the `sdp-control-plane.service` journal for a startup error).
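
The decision in step 3 (any HTTP response means the path is open, a reset means it is blocked) can be sketched as a small shell helper. This is only an illustration, not part of DEPLOY.md: `classify_probe` is a hypothetical name, and because the hunk does not show the exact curl invocation, the usage line assumes a plain `GET /` against `172.18.139.186:3452`. It relies on curl's documented exit codes 7 (could not connect) and 56 (failure receiving network data, which is how a reset usually surfaces).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in DEPLOY.md): map a curl exit code and HTTP status
# to the next runbook step.
classify_probe() {
  local curl_exit="$1" http_code="$2"
  if [ "$curl_exit" -eq 0 ] && [ "$http_code" != "000" ]; then
    # Any HTTP status (404 is expected for GET /) proves the TCP path is open.
    echo "path open (HTTP $http_code): restart sdp-agent-micro.service on 92"
  elif [ "$curl_exit" -eq 7 ] || [ "$curl_exit" -eq 56 ]; then
    # 7 = could not connect, 56 = receive failure such as "reset by peer".
    echo "reset or refused: check iptables/fail2ban on 186"
  else
    echo "unclassified curl exit $curl_exit: inspect manually"
  fi
}

# Usage from 92 (URL is an assumption, see above):
#   http_code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://172.18.139.186:3452/)
#   classify_probe "$?" "$http_code"
```

curl prints `000` for `%{http_code}` when no response was received, which is why the first branch checks both values.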