DEPLOY.md: troubleshooting for agent-micro WS RST

'connection reset by peer' on the WS dial is at the TCP layer,
not the application layer. Almost always a firewall on 186:
iptables REJECT, fail2ban ban, or stale conntrack state.

Document the diagnostic ladder: iptables -L INPUT, fail2ban
status, then a plain HTTP curl from 92 to verify the network
path, then a WS upgrade curl on 186 itself to verify the
control plane's upgrader.
Achmad
2026-06-24 05:36:18 +00:00
parent 0569cede43
commit 8c598ad69f
@@ -288,3 +288,61 @@ exit
```
If still failing, the systemd manager itself is in a bad state. Reboot the VM (last resort; will interrupt any other work on it).
### Agent-micro on 92 gets `connection reset by peer` connecting to 186:3452
`connection reset by peer` is at the TCP layer — the SYN reaches the host kernel but something RSTs the connection before the control plane sees it. Common causes:
1. **iptables on 186 has a REJECT rule for 3452.** Check:
```bash
ssh administrator@172.18.139.186
sudo iptables -L INPUT -n | head -30
exit
```
If you see a REJECT rule for port 3452, delete or modify it. The control plane runs on 186 itself, so there is no reason to filter loopback or local-subnet traffic to it.
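On a host with a long INPUT chain, the rule is easy to miss in the `-L` listing. The following is a minimal sketch, not part of the control plane tooling: `rules_blocking_port` is a hypothetical helper that filters `iptables -S INPUT` output for REJECT or DROP rules naming a given destination port. It assumes rules written with a plain `--dport`; a blanket REJECT with no port, or a `multiport` rule, would not be matched and still needs a manual read of the chain.

```bash
# Hypothetical helper: reads `iptables -S INPUT` output on stdin and prints
# only the REJECT/DROP rules that name the given destination port.
rules_blocking_port() {
  local port="$1"
  # First grep: exact port match (3452 must not match 34520).
  # Second grep: keep only rules whose target is REJECT or DROP.
  grep -E -- "--dport ${port}( |\$)" | grep -E -- '-j (REJECT|DROP)'
}

# Usage on 186 (assumes sudo access):
#   sudo iptables -S INPUT | rules_blocking_port 3452
```

Empty output means no port-specific rule was found, not that the port is definitely open.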
2. **fail2ban has banned the agent's IP.** Check:
```bash
ssh administrator@172.18.139.186
sudo fail2ban-client status
sudo fail2ban-client status sshd 2>/dev/null
exit
```
If 92's IP is in the banned list, add the SDP subnet to `ignoreip` in `/etc/fail2ban/jail.local` and `sudo fail2ban-client reload`.
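To check one address without reading the whole status output, something like the sketch below may help. `ip_is_banned` is a hypothetical helper, and it assumes the jail status output contains a `Banned IP list:` line with whitespace-separated addresses, which is how `fail2ban-client status <jail>` usually prints it; verify against the output on 186 before relying on it.

```bash
# Hypothetical helper: reads `fail2ban-client status <jail>` output on stdin
# and succeeds (exit 0) if the given IP appears on the "Banned IP list" line.
ip_is_banned() {
  local ip="$1"
  # Split the line into one token per line, then require an exact match so
  # that 172.18.136.9 does not match 172.18.136.92.
  grep -F 'Banned IP list:' | tr -s ' \t' '\n' | grep -Fxq -- "$ip"
}

# Usage on 186 (assumes the sshd jail; repeat for other jails):
#   sudo fail2ban-client status sshd | ip_is_banned 172.18.136.92 && echo banned
# An existing ban can also be lifted directly:
#   sudo fail2ban-client set sshd unbanip 172.18.136.92
```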
3. **The kernel's connection tracking has stale state.** Restart whichever firewall service is in use (last resort; the error from the service that is not installed is silenced):
```bash
ssh administrator@172.18.139.186
sudo systemctl restart nftables 2>/dev/null
sudo systemctl restart firewalld 2>/dev/null
exit
```
4. **Verify the network path works at all** before debugging firewall rules. From 92, a plain HTTP request to the control plane's port:
```bash
ssh administrator@172.18.136.92
curl -v http://172.18.139.186:3452/
exit
```
If you get *any* HTTP response (even a Go HTTP 400 for "missing node query"), the path is open and the problem is the WebSocket upgrade. If you get `Connection reset by peer` again, the path is being blocked; go back to the iptables and fail2ban checks.
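When this check is scripted rather than read by eye, curl's exit code carries the same information. The sketch below maps the documented curl exit codes 0, 7, 28, 52 and 56 to the next step; `explain_curl_exit` is a hypothetical helper, and the advice strings are this runbook's interpretation, not curl output.

```bash
# Hypothetical helper: map a curl exit code to the next rung of the ladder.
# Without -f, curl exits 0 for any HTTP response, including a 400.
explain_curl_exit() {
  case "$1" in
    0)     echo "HTTP response received: path is open, look at the WebSocket upgrade" ;;
    7)     echo "could not connect: check that the listener is up and the firewall on 186" ;;
    28)    echo "timed out: packets are probably being dropped silently" ;;
    52|56) echo "connection closed or reset mid-request: check iptables/fail2ban on 186" ;;
    *)     echo "curl exit $1: see the EXIT CODES section of the curl man page" ;;
  esac
}

# Usage from 92:
#   curl -sS -o /dev/null --max-time 5 http://172.18.139.186:3452/
#   explain_curl_exit $?
```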
5. **Verify the WS endpoint works on 186 itself** (rules out the network and confirms the upgrade logic):
```bash
ssh administrator@172.18.139.186
curl -i \
-H "Connection: Upgrade" -H "Upgrade: websocket" \
-H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
-H "Sec-WebSocket-Version: 13" \
"http://127.0.0.1:3452/ws/agent?node=micro"
exit
```
This should return HTTP 101 Switching Protocols. If it does, the upgrader works and the network path from 92 is the issue. If it doesn't, the control plane binary has a problem.
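To turn this last check into a pass/fail result, the status line of the response can be tested directly. This is a minimal sketch: `ws_upgrade_ok` is a hypothetical helper that reads `curl -i` output on stdin and succeeds only if the first line carries status 101. Adding `--max-time` to the curl call is a suggestion, since curl may keep the connection open after a successful upgrade.

```bash
# Hypothetical helper: reads `curl -i` output on stdin and succeeds (exit 0)
# only if the first line is an HTTP status line with code 101.
ws_upgrade_ok() {
  head -n 1 | grep -Eq '^HTTP/[0-9.]+ 101([^0-9]|$)'
}

# Usage on 186 (same request as above, with a timeout added):
#   curl -si --max-time 5 \
#     -H "Connection: Upgrade" -H "Upgrade: websocket" \
#     -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
#     -H "Sec-WebSocket-Version: 13" \
#     "http://127.0.0.1:3452/ws/agent?node=micro" | ws_upgrade_ok \
#     && echo "upgrader OK" || echo "upgrader FAILED"
```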