# SDP — manual deploy

A copy-pasteable runbook. The principle: anything that runs on a VM is done from inside that VM (just `ssh` in and run it). Anything that pushes files from your laptop to a VM uses `scp` and prompts for the password.

No `deploy.sh` is involved. No `sshpass`. You type your passwords.

## 0. Pull the repo on your laptop

```bash
cd ~/wherever/bri-sandbox-development-platform
git pull origin main
```

Confirm the artifacts are present:

```bash
ls bin/control-plane bin/agent-micro bin/agent-gateway dashboard/out/index.html systemd/sdp-*.service
```

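If you would rather get an explicit list of what is missing than read `ls` errors, the same check can be sketched as a small function. The name `check_artifacts` is made up for this example; the paths are the ones from the `ls` above.

```bash
# check_artifacts ROOT: print every expected build artifact that is missing
# under ROOT (default: current directory); succeed only if nothing is missing.
# Hypothetical helper, not part of the repo.
check_artifacts() (
  root=${1:-.}
  missing=0
  for path in \
    bin/control-plane \
    bin/agent-micro \
    bin/agent-gateway \
    dashboard/out/index.html
  do
    if [ ! -f "$root/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  # At least one systemd/sdp-*.service unit file must exist as well.
  # An unmatched glob stays literal, so the -f test fails as intended.
  set -- "$root"/systemd/sdp-*.service
  if [ ! -f "$1" ]; then
    echo "missing: systemd/sdp-*.service"
    missing=1
  fi
  [ "$missing" -eq 0 ]
)
# From the repo root:
#   check_artifacts . && echo 'all artifacts present'
```

The function body runs in a subshell, so its variables do not leak into your session.
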
## 1. Kill old SDP processes on each VM (skip on a fresh VM)

On 92:

```bash
ssh administrator@172.18.136.92
pkill -f 'bin/agent-micro' 2>/dev/null; echo done
exit
```

On 186:

```bash
ssh administrator@172.18.139.186
pkill -f 'bin/control-plane' 2>/dev/null
pkill -f 'bin/agent-gateway' 2>/dev/null
echo done
exit
```

## 2. Sanity-check nginx and docker on 186

```bash
ssh administrator@172.18.139.186
sudo nginx -t
sudo systemctl is-active docker
ls -la ~/SDP/dashboard/index.html 2>/dev/null || echo 'dashboard will be created in step 4'
exit
```

- `nginx -t` says `syntax is ok` → good.
- `docker` is `active` → good.
- Dashboard missing is fine; step 4 pushes it.

## 3. Configure nginx on 186 (only on first deploy, or after editing)

Splice the four `location` blocks from `nginx/sandbox.conf` into `/etc/nginx/sites-available/default` inside the existing `server { }`. Read the file from your laptop first:

```bash
cat nginx/sandbox.conf
```

On 186:

```bash
ssh administrator@172.18.139.186
sudo vim /etc/nginx/sites-available/default
# paste the four blocks somewhere inside the server { }
sudo nginx -t
sudo systemctl reload nginx
exit
```

## 4. Push the binaries and dashboard to the VMs

From your laptop. `scp` will prompt for the password.

**To 92 (micro):**

```bash
scp bin/agent-micro administrator@172.18.136.92:~/SDP/bin/agent-micro
```

**To 186 (gateway):**

```bash
scp bin/control-plane bin/agent-gateway administrator@172.18.139.186:~/SDP/bin/
scp -r dashboard/out/. administrator@172.18.139.186:~/SDP/dashboard/
```

**Make binaries executable** (on each VM):

```bash
ssh administrator@172.18.136.92 "chmod +x ~/SDP/bin/agent-micro"
ssh administrator@172.18.139.186 "chmod +x ~/SDP/bin/control-plane ~/SDP/bin/agent-gateway"
```

**Pre-create the control plane's data dir on 186** (SQLite + log files live here):

```bash
ssh administrator@172.18.139.186 "mkdir -p ~/SDP/data && ls -ld ~/SDP/data"
```

Should print `drwxr-xr-x ... administrator administrator ... /home/administrator/SDP/data`. The control plane binary creates it on first run too, but doing it now means the systemd unit's `ReadWritePaths` check has somewhere to point at.

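If you want to confirm that a pushed binary arrived intact, comparing checksums is one option. This is a sketch, not part of the runbook proper: it assumes `sha256sum` exists on both your laptop and the VM (on macOS the equivalent is `shasum -a 256`), and `same_digest` is a made-up helper name.

```bash
# same_digest LOCAL_LINE REMOTE_LINE: succeed when two `sha256sum` output
# lines carry the same digest. Only the first field (the hex digest) is
# compared, because the file paths differ between laptop and VM.
# Hypothetical helper, not part of the repo.
same_digest() (
  local_digest=${1%% *}
  remote_digest=${2%% *}
  [ -n "$local_digest" ] && [ "$local_digest" = "$remote_digest" ]
)
# From your laptop (ssh prompts for the password):
#   same_digest "$(sha256sum bin/agent-micro)" \
#     "$(ssh administrator@172.18.136.92 'sha256sum ~/SDP/bin/agent-micro')" \
#     && echo 'agent-micro matches' || echo 'agent-micro differs'
```
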
## 5. Push the systemd unit files

From your laptop. `scp` will prompt for the password.

```bash
scp systemd/sdp-agent-micro.service administrator@172.18.136.92:/tmp/sdp-agent-micro.service
scp systemd/sdp-control-plane.service systemd/sdp-agent-gateway.service administrator@172.18.139.186:/tmp/
```

## 6. Install the unit files and start the services

### 6a. 92 (micro agent only)

```bash
ssh administrator@172.18.136.92
sudo install -m 644 -o root -g root /tmp/sdp-agent-micro.service /etc/systemd/system/sdp-agent-micro.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-micro.service
sudo systemctl restart sdp-agent-micro.service
sudo systemctl --no-pager status sdp-agent-micro.service | head -10
sudo journalctl -u sdp-agent-micro.service -n 10 --no-pager
exit
```

Status should be `active (running)`. Journal should show a clean startup, then either a `dial: ws://...` reconnect loop (waiting for the control plane) or `agent-micro connected as micro`.

### 6b. 186 (control plane FIRST, then gateway agent)

```bash
ssh administrator@172.18.139.186
sudo install -m 644 -o root -g root /tmp/sdp-control-plane.service /etc/systemd/system/sdp-control-plane.service
sudo mkdir -p /home/administrator/SDP/data
sudo chown administrator:administrator /home/administrator/SDP/data
sudo systemctl daemon-reload
sudo systemctl enable sdp-control-plane.service
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
sudo journalctl -u sdp-control-plane.service -n 10 --no-pager
```

The control plane must be up before the gateway agent starts (or the agent just retries). Wait for `active (running)`, then continue in the same SSH session:

```bash
sudo install -m 644 -o root -g root /tmp/sdp-agent-gateway.service /etc/systemd/system/sdp-agent-gateway.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-gateway.service
sudo systemctl restart sdp-agent-gateway.service
sudo systemctl --no-pager status sdp-agent-gateway.service | head -10
sudo journalctl -u sdp-agent-gateway.service -n 10 --no-pager
exit
```

The journal should show `agent-gateway connected as gateway` after a beat.

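The "wait for `active (running)`" step on 186 can be polled instead of eyeballed. A minimal sketch, assuming a generic retry helper; `wait_until` is a made-up name and the try count and delay are arbitrary:

```bash
# wait_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds between attempts. Succeeds as soon as CMD
# does; fails if every attempt fails. Hypothetical helper, not part of the repo.
wait_until() (
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      exit 0   # leaves only this function's subshell
    fi
    # No need to sleep after the final attempt.
    [ "$attempt" -lt "$tries" ] && sleep "$delay"
    attempt=$((attempt + 1))
  done
  exit 1
)
# On 186, after restarting the control plane:
#   wait_until 15 2 systemctl is-active --quiet sdp-control-plane.service \
#     && echo 'control plane is up' || echo 'control plane did not start'
```

`systemctl is-active --quiet` exits 0 only when the unit is active, which is why it works as the polled command. A unit that is active but crash-looping can still pass once, so keep reading the journal as described above.
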
## 7. Browser smoke test (from your laptop)

Visit: `http://172.18.139.186/sandbox/credit-card/`

- HTML renders (CSS + JS load) → nginx `try_files` is right.
- Login form submits → `/sandbox/credit-card/api/login` proxies to `:3452`.
- Login with valid Bitbucket creds returns 200 → the gateway agent ran `git ls-remote` successfully.
- After login, dashboard renders. Click **Sandboxes** → empty list (SQLite is fresh).

## 8. Following logs in real time

On 92 (micro agent):

```bash
ssh administrator@172.18.136.92
sudo journalctl -u sdp-agent-micro.service -f
# Ctrl-C to exit
exit
```

On 186 (control plane + gateway agent):

```bash
ssh administrator@172.18.139.186
sudo journalctl -u sdp-control-plane.service -u sdp-agent-gateway.service -f
# Ctrl-C to exit
exit
```

## Common one-time fixes (apply, then re-run from step 6)

### `${SDP_CP_URL}` doesn't expand in the unit's ExecStart

Symptom: agent logs `flag: invalid value "${SDP_CP_URL}" for -cp`.

Fix: hardcode the URL in the unit. On your laptop, edit `systemd/sdp-agent-micro.service`:

```ini
ExecStart=/home/administrator/SDP/bin/agent-micro -node micro -cp ws://172.18.139.186:3452/ws/agent
```

(Remove the `Environment=` / `EnvironmentFile=` / `${SDP_CP_URL}` lines.) Do the same for `systemd/sdp-agent-gateway.service` (URL is `ws://127.0.0.1:3452/ws/agent`). Re-do steps 5 and 6.

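Before pushing the edited units again, you can check that no `ExecStart` line still carries a literal `${...}` reference. A sketch with a made-up helper name:

```bash
# units_with_placeholder FILE...: print each unit file whose ExecStart line
# still contains a literal ${NAME} reference. Exits non-zero when no file
# matches (standard `grep -l` behaviour). Hypothetical helper.
units_with_placeholder() {
  grep -l 'ExecStart=.*\${[A-Za-z0-9_]*}' "$@"
}
# From the repo root on your laptop:
#   units_with_placeholder systemd/sdp-*.service \
#     && echo 'fix the files listed above' || echo 'no placeholders left'
```
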
### Login returns "git ls-remote rejected"

Either:

- The gateway agent isn't connected (re-run step 6b and check the journal).
- Your Bitbucket creds are wrong.
- The api-gateway repo path on 186 is wrong. The agent looks at `/var/www/html/erangel-ocean` by default. On 186:

```bash
ls -d /var/www/html/erangel-ocean
```

If the repo is at a different path, edit `agent-gateway/cmd/agent-gateway/main.go`:

```go
var repos = map[string]string{
	"api-gateway": "/your/actual/path",
}
```

Then `./scripts/build.sh`, re-do steps 4 and 6b.

### Service containers can't be created (alpine:3.20 or php:8.3-apache not loaded)

Symptom: a deploy event stream shows `DEPLOY FAILED` with `image not found`.

The runtime images must be pre-loaded on the host (the VMs have no internet). On 92:

```bash
ssh administrator@172.18.136.92
docker load -i /path/to/alpine-3.20.tar
exit
```

On 186:

```bash
ssh administrator@172.18.139.186
docker load -i /path/to/php-8.3-apache.tar
docker load -i /path/to/alpine-3.20.tar
exit
```

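To confirm the images are actually loaded after `docker load`, one option is to filter the output of `docker images`. A sketch with a made-up helper name; the image names are the ones from the heading above:

```bash
# missing_images IMAGE...: read "repository:tag" lines on stdin (the format
# printed by `docker images --format '{{.Repository}}:{{.Tag}}'`) and print
# every required IMAGE that is absent. Succeeds only if none is missing.
# Hypothetical helper, not part of the repo.
missing_images() (
  loaded=$(cat)
  status=0
  for image in "$@"; do
    # -x: whole-line match, -F: fixed string, so "alpine:3.20" does not
    # accidentally match "alpine:3.20.1".
    if ! printf '%s\n' "$loaded" | grep -qxF "$image"; then
      echo "not loaded: $image"
      status=1
    fi
  done
  [ "$status" -eq 0 ]
)
# On 186 (on 92 only alpine:3.20 is needed):
#   docker images --format '{{.Repository}}:{{.Tag}}' \
#     | missing_images alpine:3.20 php:8.3-apache
```
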
### Service fails with `status=226/NAMESPACE` and `Failed to set up mount namespacing: No such file or directory`

Your binary is fine; systemd's service-execution environment is broken. Diagnose by running the binary manually as `administrator`:

```bash
ssh administrator@172.18.139.186
./SDP/bin/control-plane -addr :3452 -data ./SDP/data
# Should print "control-plane listening on :3452 (data=./SDP/data)"
# Ctrl-C to exit
exit
```

If that works, the binary is fine. systemd's namespace setup is failing; a common cause on this Ubuntu is that `/run/systemd` is missing. Force it to be recreated:

```bash
ssh administrator@172.18.139.186
sudo systemctl daemon-reexec
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
exit
```

If still failing, the systemd manager itself is in a bad state. Reboot the VM (last resort; will interrupt any other work on it).

### Agent-micro on 92 gets `connection reset by peer` connecting to 186:3452

`connection reset by peer` is at the TCP layer. The most common cause is a **stale WebSocket connection from before 186's `daemon-reexec`** (from the `status=226/NAMESPACE` fix above). The kernel briefly drops the listening socket during re-exec; any in-flight connection that tries to use it gets RST. The agent retries every 2s, but if it connected right before the reexec and has been hammering the broken fd, just restart it:

```bash
ssh administrator@172.18.136.92
sudo systemctl restart sdp-agent-micro.service
sudo journalctl -u sdp-agent-micro.service -n 20 --no-pager -f
# Look for: "agent-micro connected as micro"
# Ctrl-C to exit journalctl
exit
```

If the journal shows the agent connecting and immediately seeing RST again (a tight loop with no successful connect), the path itself is being blocked. Work through these checks to find the cause:

1. **iptables on 186 has a REJECT rule for 3452:**

```bash
ssh administrator@172.18.139.186
sudo iptables -L INPUT -n | head -30
exit
```

If you see a REJECT rule for port 3452, drop or modify it. The control plane is on the same host, so there's no reason to filter loopback or local-subnet traffic to it.

2. **fail2ban has banned 92's IP:**

```bash
ssh administrator@172.18.139.186
sudo fail2ban-client status
sudo fail2ban-client status sshd 2>/dev/null
exit
```

If 92's IP is in the banned list, add the SDP subnet to `ignoreip` in `/etc/fail2ban/jail.local` and `sudo fail2ban-client reload`.

3. **Verify the network path works at all** (rules out firewall entirely). From 92, a plain HTTP request to the control plane's port:

```bash
ssh administrator@172.18.136.92
curl -v http://172.18.139.186:3452/
exit
```

If you get *any* HTTP response (a Go HTTP 404 "page not found" is the expected answer for `GET /`) → the path is open and the problem is the WebSocket connection state. Restart the agent as above. If you get `Connection reset by peer` again → the path is being blocked, look at iptables/fail2ban.

4. **Verify the WS endpoint accepts an upgrade on 186 itself** (rules out the control plane binary):

```bash
ssh administrator@172.18.139.186
curl -i \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13" \
  "http://127.0.0.1:3452/ws/agent?node=micro"
exit
```

Should return HTTP 101 Switching Protocols. If it does, the network is the issue. If it doesn't, the control plane binary has a problem (look at `sdp-control-plane.service` journal for a startup error).

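For the iptables check in item 1, `head -30` can cut off a long INPUT chain. A sketch of a filter that scans the whole listing for rules on one port; `blocking_rules` is a made-up name, and it only recognises the plain `dpt:PORT` form that `iptables -L -n` prints for single-port matches (a `multiport` rule would not be caught):

```bash
# blocking_rules PORT: read `iptables -L INPUT -n` output on stdin and print
# REJECT or DROP rules whose destination port is exactly PORT. Exits non-zero
# when no such rule is found. Hypothetical helper, not part of the repo.
blocking_rules() {
  grep -E "^(REJECT|DROP)[[:space:]].*dpt:$1([[:space:]]|\$)"
}
# On 186:
#   sudo iptables -L INPUT -n | blocking_rules 3452 \
#     || echo 'no REJECT/DROP rule names port 3452'
```
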
### Still failing after the curl test? Check what's bound to port 3452

The control plane is one listener on port 3452; if systemd or another process is also bound, the kernel can RST connections to the "wrong" socket. On 186:

```bash
ssh administrator@172.18.139.186
sudo ss -tlnp 'sport = :3452'
sudo lsof -i :3452
sudo systemctl list-sockets --no-legend 2>/dev/null | grep 3452
exit
```

You should see only one listener: `control-plane` PID, IPv6 `*:3452` (dual-stack). If you see anything else (another systemd socket, a leftover container, a proxy), kill it. Then `sudo systemctl restart sdp-control-plane.service` on 186 and try the agent again.

### One more thing to try: IPv4 vs IPv6

Go's dual-stack listen (`:3452`) registers an IPv6 socket that also accepts IPv4 via IPv4-mapped addresses. Some networks route IPv4 and IPv6 differently, and a corporate firewall might allow one but not the other. From 92:

```bash
ssh administrator@172.18.136.92
getent ahosts 172.18.139.186
curl -4 -v http://172.18.139.186:3452/ 2>&1 | head -10
curl -6 -v http://172.18.139.186:3452/ 2>&1 | head -10
exit
```

Note that `172.18.139.186` is an IPv4 literal, so `curl -6` has no IPv6 address to connect to and fails locally without reaching 186. The `-6` line only tells you something if you substitute an IPv6 address (or a hostname with an AAAA record) for 186.

If `-4` works and `-6` RSTs, or vice versa, the network is treating them differently. Fix by either:

- Pin the control plane to IPv4 only. Changing `ExecStart` in `sdp-control-plane.service` from `-addr :3452` to `-addr 0.0.0.0:3452` is not enough if the binary calls Go's `net.Listen("tcp", ...)`: Go treats both as an unspecified address and listens dual-stack. IPv4-only needs the `tcp4` network in the listen call.
- Or pin the agent-micro to IPv4: `Environment=SDP_CP_URL=ws://172.18.139.186:3452/ws/agent` already uses the IPv4 literal, so the agent dials over IPv4 and no name resolution is involved. `GODEBUG=netdns=go+1` (Go's pure-Go resolver with debug output) only matters if the URL uses a hostname; in that case, switch the URL to the literal IPv4 address.

If both `-4` and `-6` work, the network is fine. Re-run the agent-micro restart and re-check the journal.

### Curl from 92 works but the agent still RSTs: test the WebSocket upgrade

The control plane is up, plain HTTP from 92 reaches it, but the WebSocket dial RSTs. The TCP connection succeeds (so it's not a firewall on the port) but the server-side read of the request triggers a RST. Two tests, one from each side:

**1. WS upgrade on 186 itself (rules out the control plane binary):**

```bash
ssh administrator@172.18.139.186
curl -i \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13" \
  "http://127.0.0.1:3452/ws/agent?node=micro"
exit
```

- 101 Switching Protocols → control plane is fine. Network between 92 and 186 is the issue.
- RST or 4xx → control plane is broken. Check `journalctl -u sdp-control-plane.service` for errors after the listen line.

**2. WS upgrade from 92 (rules out a header-aware firewall):**

```bash
ssh administrator@172.18.136.92
curl -i \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13" \
  "http://172.18.139.186:3452/ws/agent?node=micro"
exit
```

- 101 → firewall allows WS-shaped traffic. The agent's client is the issue (unlikely; same gorilla/websocket).
- RST → some middlebox (iptables, corporate firewall, fail2ban) is matching on `Upgrade: websocket` and RSTing. Find the rule.
- 4xx → control plane reachable but rejecting the upgrade.

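The two tests and their outcomes can be sketched as a pair of functions, so the conclusion is printed rather than read off the headers. Both names are made up for this example. The sketch relies on curl's `-w '%{http_code}'`, which prints `000` when no HTTP response arrived; that covers a reset but also a refusal or a timeout, so treat the `000` branch as a hint only. `--max-time 5` is an added assumption so curl does not sit on the upgraded connection after a 101.

```bash
# ws_upgrade_code URL: send the same upgrade request as the curl commands
# above and print only the HTTP status code (000 = no HTTP response).
# curl may exit non-zero (e.g. on timeout after a 101) yet still print the code.
ws_upgrade_code() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -H "Connection: Upgrade" -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
    -H "Sec-WebSocket-Version: 13" \
    "$1"
}

# diagnose_ws SIDE CODE: map a status code to the conclusions listed above.
# SIDE is "local" (run on 186 against 127.0.0.1) or "remote" (run on 92).
diagnose_ws() {
  case "$1:$2" in
    local:101)  echo "control plane is fine; suspect the network between 92 and 186" ;;
    local:*)    echo "control plane is broken; check its journal" ;;
    remote:101) echo "upgrade passes the network; suspect the agent's client" ;;
    remote:000) echo "no HTTP response; suspect a middlebox resetting the upgrade" ;;
    remote:4*)  echo "control plane is reachable but rejects the upgrade" ;;
    *)          echo "unexpected result: $1 $2" ;;
  esac
}
# On 186:
#   diagnose_ws local "$(ws_upgrade_code 'http://127.0.0.1:3452/ws/agent?node=micro')"
# On 92:
#   diagnose_ws remote "$(ws_upgrade_code 'http://172.18.139.186:3452/ws/agent?node=micro')"
```
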