# SDP — manual deploy

A copy-pasteable runbook. The principle: anything that runs on a VM is done from inside that VM (just `ssh` in and run it). Anything that pushes files from your laptop to a VM uses `scp` and prompts for the password.

No `deploy.sh` is involved. No `sshpass`. You type your passwords.

## 0. Pull the repo on your laptop

```bash
cd ~/wherever/bri-sandbox-development-platform
git pull origin main
```

Confirm the artifacts are present:

```bash
ls bin/control-plane bin/agent-micro bin/agent-gateway dashboard/out/index.html systemd/sdp-*.service
```

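If a single `ls` error line is easy to miss, the same check can be sketched as a small shell function that names each missing artifact. The function name `check_artifacts` is hypothetical; the paths are the ones listed above, with the three unit files from step 5 spelled out, and the repo root is assumed to be the argument (default: the current directory).

```bash
# Hypothetical helper: report which deploy artifacts are missing under a repo root.
# Prints one "missing: <path>" line per absent file and returns non-zero if any is absent.
check_artifacts() {
  ca_root="${1:-.}"
  ca_missing=0
  for ca_path in \
    bin/control-plane \
    bin/agent-micro \
    bin/agent-gateway \
    dashboard/out/index.html \
    systemd/sdp-agent-micro.service \
    systemd/sdp-control-plane.service \
    systemd/sdp-agent-gateway.service
  do
    if [ ! -e "$ca_root/$ca_path" ]; then
      echo "missing: $ca_path"
      ca_missing=1
    fi
  done
  return "$ca_missing"
}
```

Run it from the repo root as `check_artifacts && echo 'all artifacts present'`.
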
## 1. Kill old SDP processes on each VM (skip on a fresh VM)

On 92:

```bash
ssh administrator@172.18.136.92
pkill -f 'bin/agent-micro' 2>/dev/null; echo done
exit
```

On 186:

```bash
ssh administrator@172.18.139.186
pkill -f 'bin/control-plane' 2>/dev/null
pkill -f 'bin/agent-gateway' 2>/dev/null
echo done
exit
```

## 2. Sanity-check nginx and docker on 186

```bash
ssh administrator@172.18.139.186
sudo nginx -t
sudo systemctl is-active docker
ls -la ~/SDP/dashboard/index.html 2>/dev/null || echo 'dashboard will be created in step 4'
exit
```

- `nginx -t` says `syntax is ok` → good.
- `docker` is `active` → good.
- Dashboard missing is fine; step 4 pushes it.

## 3. Configure nginx on 186 (only on first deploy, or after editing)

Splice the four `location` blocks from `nginx/sandbox.conf` into `/etc/nginx/sites-available/default` inside the existing `server { }`. Read the file from your laptop first:

```bash
cat nginx/sandbox.conf
```

On 186:

```bash
ssh administrator@172.18.139.186
sudo vim /etc/nginx/sites-available/default
# paste the four blocks somewhere inside the server { }
sudo nginx -t
sudo systemctl reload nginx
exit
```

## 4. Push the binaries and dashboard to the VMs

From your laptop. `scp` will prompt for the password. On a fresh VM, make sure the target directories (`~/SDP/bin` on both VMs, `~/SDP/dashboard` on 186) exist first; `scp` does not create missing parent directories.

**To 92 (micro):**

```bash
scp bin/agent-micro administrator@172.18.136.92:~/SDP/bin/agent-micro
```

**To 186 (gateway):**

```bash
scp bin/control-plane bin/agent-gateway administrator@172.18.139.186:~/SDP/bin/
scp -r dashboard/out/. administrator@172.18.139.186:~/SDP/dashboard/
```

**Make binaries executable** (on each VM):

```bash
ssh administrator@172.18.136.92 "chmod +x ~/SDP/bin/agent-micro"
ssh administrator@172.18.139.186 "chmod +x ~/SDP/bin/control-plane ~/SDP/bin/agent-gateway"
```

**Pre-create the control plane's data dir on 186** (SQLite + log files live here):

```bash
ssh administrator@172.18.139.186 "mkdir -p ~/SDP/data && ls -ld ~/SDP/data"
```

This should print `drwxr-xr-x ... administrator administrator ... /home/administrator/SDP/data`. The control plane binary also creates the directory on first run, but creating it now means the path named in the systemd unit's `ReadWritePaths` already exists when the service starts.

## 5. Push the systemd unit files

From your laptop. `scp` will prompt for the password.

```bash
scp systemd/sdp-agent-micro.service administrator@172.18.136.92:/tmp/sdp-agent-micro.service
scp systemd/sdp-control-plane.service systemd/sdp-agent-gateway.service administrator@172.18.139.186:/tmp/
```

## 6. Install the unit files and start the services

### 6a. 92 (micro agent only)

```bash
ssh administrator@172.18.136.92
sudo install -m 644 -o root -g root /tmp/sdp-agent-micro.service /etc/systemd/system/sdp-agent-micro.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-micro.service
sudo systemctl restart sdp-agent-micro.service
sudo systemctl --no-pager status sdp-agent-micro.service | head -10
sudo journalctl -u sdp-agent-micro.service -n 10 --no-pager
exit
```

Status should be `active (running)`. The journal should show a clean startup, then either a `dial: ws://...` reconnect loop (waiting for the control plane) or `agent-micro connected as micro`.

### 6b. 186 (control plane FIRST, then gateway agent)

```bash
ssh administrator@172.18.139.186
sudo install -m 644 -o root -g root /tmp/sdp-control-plane.service /etc/systemd/system/sdp-control-plane.service
sudo mkdir -p /home/administrator/SDP/data
sudo chown administrator:administrator /home/administrator/SDP/data
sudo systemctl daemon-reload
sudo systemctl enable sdp-control-plane.service
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
sudo journalctl -u sdp-control-plane.service -n 10 --no-pager
```

The control plane should be up before the gateway agent starts (otherwise the agent just retries until it is). Wait for `active (running)`, then continue in the same session:

```bash
sudo install -m 644 -o root -g root /tmp/sdp-agent-gateway.service /etc/systemd/system/sdp-agent-gateway.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-gateway.service
sudo systemctl restart sdp-agent-gateway.service
sudo systemctl --no-pager status sdp-agent-gateway.service | head -10
sudo journalctl -u sdp-agent-gateway.service -n 10 --no-pager
exit
```

The journal should show `agent-gateway connected as gateway` after a moment.

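Waiting for the control plane to reach `active (running)` before starting the gateway agent can be sketched as a small polling loop instead of re-reading `status` by hand. The function name `wait_for_active` and the default of 30 one-second attempts are assumptions for illustration, not part of the SDP tooling.

```bash
# Hypothetical helper: poll `systemctl is-active` once per second until the
# unit reports active, giving up after a fixed number of attempts.
wait_for_active() {
  wa_unit="$1"
  wa_attempts="${2:-30}"
  wa_i=0
  while [ "$wa_i" -lt "$wa_attempts" ]; do
    if systemctl is-active --quiet "$wa_unit"; then
      echo "$wa_unit is active"
      return 0
    fi
    wa_i=$((wa_i + 1))
    sleep 1
  done
  echo "$wa_unit did not become active after $wa_attempts attempts" >&2
  return 1
}
```

On 186 this could gate the gateway restart: `wait_for_active sdp-control-plane.service && sudo systemctl restart sdp-agent-gateway.service`.
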
## 7. Browser smoke test (from your laptop)

Visit: `http://172.18.139.186/sandbox/credit-card/`

- HTML renders (CSS + JS load) → nginx `try_files` is right.
- Login form submits → `/sandbox/credit-card/api/login` proxies to `:3452`.
- Login with any valid Bitbucket credentials returns 200 → the gateway agent ran `git ls-remote` successfully.
- After login, the dashboard renders. Click **Sandboxes** → empty list (SQLite is fresh).

## 8. Following logs in real time

On 92 (micro agent):

```bash
ssh administrator@172.18.136.92
sudo journalctl -u sdp-agent-micro.service -f
# Ctrl-C to exit
exit
```

On 186 (control plane + gateway agent):

```bash
ssh administrator@172.18.139.186
sudo journalctl -u sdp-control-plane.service -u sdp-agent-gateway.service -f
# Ctrl-C to exit
exit
```

## Common one-time fixes (apply, then re-run from step 4)

### `${SDP_CP_URL}` doesn't expand in the unit's ExecStart

Symptom: agent logs `flag: invalid value "${SDP_CP_URL}" for -cp`.

Fix: hardcode the URL in the unit. On your laptop, edit `systemd/sdp-agent-micro.service`:

```ini
ExecStart=/home/administrator/SDP/bin/agent-micro -node micro -cp ws://172.18.139.186:3452/ws/agent
```

(Remove the `Environment=` / `EnvironmentFile=` / `${SDP_CP_URL}` lines.) Do the same for `systemd/sdp-agent-gateway.service` (URL is `ws://127.0.0.1:3452/ws/agent`). Re-do steps 5 and 6.

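Before pushing the edited units, it may help to confirm that no `ExecStart` line still carries the placeholder. A minimal sketch, assuming the unit files live under `systemd/` as in step 5; the function name `find_unexpanded_cp_url` is hypothetical.

```bash
# Hypothetical helper: list ExecStart lines that still reference ${SDP_CP_URL}.
# Prints file:line:text for each hit; exits 0 if at least one hit was found,
# non-zero if the given files are clean.
find_unexpanded_cp_url() {
  grep -n -H 'ExecStart=.*\${SDP_CP_URL}' "$@"
}
```

For example, `find_unexpanded_cp_url systemd/sdp-agent-*.service || echo 'no placeholders left'` on your laptop.
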
### Micro agent on 92 can't reach the control plane on 186:3452

Symptom: `sdp-agent-micro.service` journal shows `dial: ... connection refused` or `i/o timeout` to `172.18.139.186:3452`.

Fix: add a `/ws/agent` proxy block to 186's nginx (alongside the four from `nginx/sandbox.conf`):

```nginx
location /ws/agent {
    proxy_pass http://127.0.0.1:3452;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 3600s;
}
```

On your laptop, edit `systemd/sdp-agent-micro.service` to dial through nginx on port 80:

```ini
Environment=SDP_CP_URL=ws://172.18.139.186/ws/agent
```

(Port 80, no `:3452`.) Then on 186, reload nginx and re-do steps 5 and 6a.

### Login returns "git ls-remote rejected"

Either:
- The gateway agent isn't connected (re-run step 6b and check the journal).
- Your Bitbucket creds are wrong.
- The api-gateway repo path on 186 is wrong. The agent looks at `/var/www/html/erangel-ocean` by default. On 186:

```bash
ls -d /var/www/html/erangel-ocean
```

If the repo is at a different path, edit `agent-gateway/cmd/agent-gateway/main.go`:

```go
var repos = map[string]string{
    "api-gateway": "/your/actual/path",
}
```

Then run `./scripts/build.sh` and re-do steps 4 and 6b.

### Service containers can't be created (alpine:3.20 or php:8.3-apache not loaded)

Symptom: a deploy event stream shows `DEPLOY FAILED` with `image not found`.

The runtime images must be pre-loaded on the host (the VMs have no internet). On 92:

```bash
ssh administrator@172.18.136.92
docker load -i /path/to/alpine-3.20.tar
exit
```

On 186:

```bash
ssh administrator@172.18.139.186
docker load -i /path/to/php-8.3-apache.tar
docker load -i /path/to/alpine-3.20.tar
exit
```

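To confirm the images are actually present after `docker load`, a small check can be sketched around `docker image inspect`, which exits non-zero when an image is not in the local store. The function name `missing_images` is hypothetical; the image tags are the ones named in the heading above.

```bash
# Hypothetical helper: print each image that `docker image inspect` cannot
# find locally; returns non-zero if any image is missing.
missing_images() {
  mi_rc=0
  for mi_image in "$@"; do
    if ! docker image inspect "$mi_image" >/dev/null 2>&1; then
      echo "not loaded: $mi_image"
      mi_rc=1
    fi
  done
  return "$mi_rc"
}
```

On 92 that would be `missing_images alpine:3.20`; on 186, `missing_images alpine:3.20 php:8.3-apache`.
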
### Service fails with `status=226/NAMESPACE` and `Failed to set up mount namespacing: No such file or directory`

This usually means the binary is fine and systemd's service-execution environment is broken. Confirm by running the binary manually as `administrator`:

```bash
ssh administrator@172.18.139.186
./SDP/bin/control-plane -addr :3452 -data ./SDP/data
# Should print "control-plane listening on :3452 (data=./SDP/data)"
# Ctrl-C to exit
exit
```

If that works, the binary is fine and systemd's namespace setup is what is failing. First confirm that every directory listed in the unit's `ReadWritePaths=` exists (step 4 pre-creates `~/SDP/data` for this reason), since a missing path there produces the same error. Another common cause on this Ubuntu is that `/run/systemd` is missing. Force it to be recreated:

```bash
ssh administrator@172.18.139.186
sudo systemctl daemon-reexec
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
exit
```

If it is still failing, the systemd manager itself is in a bad state. Reboot the VM (last resort; this interrupts any other work on it).

### Agent-micro on 92 gets `connection reset by peer` connecting to 186:3452

`connection reset by peer` is a TCP-layer error: the SYN reaches the host kernel, but something sends an RST before the control plane sees the connection. Common causes and checks:

1. **iptables on 186 has a REJECT rule for 3452.** Check:

```bash
ssh administrator@172.18.139.186
sudo iptables -L INPUT -n | head -30
exit
```

If you see a REJECT rule for port 3452, delete or modify it. The control plane is on the same host, so there's no reason to filter loopback or local-subnet traffic to it.

2. **fail2ban has banned the agent's IP.** Check:

```bash
ssh administrator@172.18.139.186
sudo fail2ban-client status
sudo fail2ban-client status sshd 2>/dev/null
exit
```

If 92's IP is in the banned list, add the SDP subnet to `ignoreip` in `/etc/fail2ban/jail.local` and run `sudo fail2ban-client reload`.

3. **The kernel's connection tracking has stale state.** Restart the firewall service so its rules and state are reloaded (last resort; whichever of the two services is not installed fails silently):

```bash
ssh administrator@172.18.139.186
sudo systemctl restart nftables 2>/dev/null
sudo systemctl restart firewalld 2>/dev/null
exit
```

4. **Verify the network path works at all** before debugging firewall rules. From 92, send a plain HTTP request to the control plane's port:

```bash
ssh administrator@172.18.136.92
curl -v http://172.18.139.186:3452/
exit
```

If you get *any* HTTP response (even a Go HTTP 400 for "missing node query") → the path is open and the problem is the WebSocket upgrade. If you get `Connection reset by peer` again → the path is being blocked; go back to the iptables and fail2ban checks.

5. **Verify the WS endpoint works on 186 itself** (rules out the network and confirms the upgrade logic):

```bash
ssh administrator@172.18.139.186
curl -i \
  -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  -H "Sec-WebSocket-Version: 13" \
  "http://127.0.0.1:3452/ws/agent?node=micro"
exit
```

This should return `HTTP 101 Switching Protocols`. If it does, the control plane's upgrader works and the network path from 92 is the issue. If it doesn't, the control plane binary has a problem.

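The outcome of check 5 can also be tested mechanically rather than read off the screen. The sketch below, with the hypothetical name `is_ws_upgrade`, reads an HTTP response as printed by `curl -i` on stdin and succeeds only when the status line carries code 101; it looks at the first line only, so it does not wait for the connection to close.

```bash
# Hypothetical helper: read an HTTP response (as printed by `curl -i`) on stdin
# and succeed only if the second field of the first line is 101.
is_ws_upgrade() {
  awk 'NR == 1 { found = ($2 == "101"); exit } END { exit !found }'
}
```

On 186, pipe the `curl -i` command from check 5 into it (adding `-s` to silence curl's progress meter) and branch on the result, for example `... | is_ws_upgrade && echo 'upgrader OK' || echo 'upgrade failed'`. Adding curl's `--max-time` option keeps a successful upgrade from holding the connection open.
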