Achmad 8c598ad69f DEPLOY.md: troubleshooting for agent-micro WS RST
'connection reset by peer' on the WS dial is at the TCP layer,
not the application layer. Almost always a firewall on 186
iptables REJECT, fail2ban ban, or stale conntrack state.

Document the diagnostic ladder: iptables -L INPUT, fail2ban
status, then a plain HTTP curl from 92 to verify the network
path, then a WS upgrade curl on 186 itself to verify the
control plane's upgrader.
2026-06-24 05:36:18 +00:00

SDP — manual deploy

A copy-pasteable runbook. The principle: anything that runs on a VM is done from inside that VM (just ssh in and run it). Anything that pushes files from your laptop to a VM uses scp and prompts for the password.

No deploy.sh is involved. No sshpass. You type your passwords.

0. Pull the repo on your laptop

cd ~/wherever/bri-sandbox-development-platform
git pull origin main

Confirm the artifacts are present:

ls bin/control-plane bin/agent-micro bin/agent-gateway dashboard/out/index.html systemd/sdp-*.service

1. Kill old SDP processes on each VM (skip on a fresh VM)

On 92:

ssh administrator@172.18.136.92
pkill -f 'bin/agent-micro' 2>/dev/null; echo done
exit

On 186:

ssh administrator@172.18.139.186
pkill -f 'bin/control-plane' 2>/dev/null
pkill -f 'bin/agent-gateway' 2>/dev/null
echo done
exit

2. Sanity-check nginx and docker on 186

ssh administrator@172.18.139.186
sudo nginx -t
sudo systemctl is-active docker
ls -la ~/SDP/dashboard/index.html 2>/dev/null || echo 'dashboard will be created in step 4'
exit
  • nginx -t says syntax is ok → good.
  • docker is active → good.
  • Dashboard missing is fine; step 4 pushes it.

3. Configure nginx on 186 (only on first deploy, or after editing)

Splice the four location blocks from nginx/sandbox.conf into /etc/nginx/sites-available/default inside the existing server { }. Read the file from your laptop first:

cat nginx/sandbox.conf

On 186:

ssh administrator@172.18.139.186
sudo vim /etc/nginx/sites-available/default
# paste the four blocks somewhere inside the server { }
sudo nginx -t
sudo systemctl reload nginx
exit

4. Push the binaries and dashboard to the VMs

From your laptop. scp will prompt for the password.
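
On a fresh VM the target directories may not exist yet, in which case scp fails with "No such file or directory". A minimal sketch for creating them first, assuming the ~/SDP layout used throughout this runbook; the prepare_dirs name and the SSH dry-run variable are illustrative and not part of the repo:

```shell
# Hypothetical helper: create the directories that the scp commands in this step write into.
# Set SSH=echo to print the commands instead of running them (dry run).
SSH="${SSH:-ssh}"

prepare_dirs() {
    # 92 only receives the micro agent binary.
    "$SSH" administrator@172.18.136.92 'mkdir -p ~/SDP/bin'
    # 186 receives the control plane, the gateway agent, and the dashboard.
    "$SSH" administrator@172.18.139.186 'mkdir -p ~/SDP/bin ~/SDP/dashboard'
}
```

Run prepare_dirs once from your laptop. Like scp, each ssh call prompts for the password.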

To 92 (micro):

scp bin/agent-micro administrator@172.18.136.92:~/SDP/bin/agent-micro

To 186 (gateway):

scp bin/control-plane bin/agent-gateway administrator@172.18.139.186:~/SDP/bin/
scp -r dashboard/out/. administrator@172.18.139.186:~/SDP/dashboard/

Make binaries executable (on each VM):

ssh administrator@172.18.136.92 "chmod +x ~/SDP/bin/agent-micro"
ssh administrator@172.18.139.186 "chmod +x ~/SDP/bin/control-plane ~/SDP/bin/agent-gateway"

Pre-create the control plane's data dir on 186 (SQLite + log files live here):

ssh administrator@172.18.139.186 "mkdir -p ~/SDP/data && ls -ld ~/SDP/data"

Should print drwxr-xr-x ... administrator administrator ... /home/administrator/SDP/data. The control plane binary creates it on first run too, but doing it now means the systemd unit's ReadWritePaths check has somewhere to point at.

5. Push the systemd unit files

From your laptop. scp will prompt for the password.

scp systemd/sdp-agent-micro.service administrator@172.18.136.92:/tmp/sdp-agent-micro.service
scp systemd/sdp-control-plane.service systemd/sdp-agent-gateway.service administrator@172.18.139.186:/tmp/

6. Install the unit files and start the services

6a. 92 (micro agent only)

ssh administrator@172.18.136.92
sudo install -m 644 -o root -g root /tmp/sdp-agent-micro.service /etc/systemd/system/sdp-agent-micro.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-micro.service
sudo systemctl restart sdp-agent-micro.service
sudo systemctl --no-pager status sdp-agent-micro.service | head -10
sudo journalctl -u sdp-agent-micro.service -n 10 --no-pager
exit

Status should be active (running). Journal should show a clean startup, then either a dial: ws://... reconnect loop (waiting for the control plane) or agent-micro connected as micro.

6b. 186 (control plane FIRST, then gateway agent)

ssh administrator@172.18.139.186
sudo install -m 644 -o root -g root /tmp/sdp-control-plane.service /etc/systemd/system/sdp-control-plane.service
sudo mkdir -p /home/administrator/SDP/data
sudo chown administrator:administrator /home/administrator/SDP/data
sudo systemctl daemon-reload
sudo systemctl enable sdp-control-plane.service
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
sudo journalctl -u sdp-control-plane.service -n 10 --no-pager

The control plane must be up before the gateway agent starts (or the agent just retries). Wait for active (running), then continue:

sudo install -m 644 -o root -g root /tmp/sdp-agent-gateway.service /etc/systemd/system/sdp-agent-gateway.service
sudo systemctl daemon-reload
sudo systemctl enable sdp-agent-gateway.service
sudo systemctl restart sdp-agent-gateway.service
sudo systemctl --no-pager status sdp-agent-gateway.service | head -10
sudo journalctl -u sdp-agent-gateway.service -n 10 --no-pager
exit

The journal should show agent-gateway connected as gateway after a beat.
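
The wait between starting the control plane and starting the gateway agent can be scripted rather than eyeballed. A minimal sketch, to be run on 186; the wait_active helper and its default of 15 tries are assumptions for illustration:

```shell
# Hypothetical helper: poll until a unit reports active, or give up after N tries (default 15).
wait_active() {
    unit="$1"
    tries="${2:-15}"
    while [ "$tries" -gt 0 ]; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "$unit did not become active" >&2
    return 1
}
```

For example, `wait_active sdp-control-plane.service && sudo systemctl restart sdp-agent-gateway.service` starts the gateway agent only once the control plane reports active. A unit that is crash-looping can report active briefly, so the journal check is still worth doing.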

7. Browser smoke test (from your laptop)

Visit: http://172.18.139.186/sandbox/credit-card/

  • HTML renders (CSS + JS load) → nginx try_files is right.
  • Login form submits → /sandbox/credit-card/api/login proxies to :3452.
  • Login with any valid Bitbucket creds returns 200 → the gateway agent ran git ls-remote successfully.
  • After login, dashboard renders. Click Sandboxes → empty list (SQLite is fresh).
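
Before opening a browser, the first bullet can be checked from the laptop's shell. A rough sketch using curl; http_status is a hypothetical helper, and only the dashboard URL is probed because the login API's request format is not documented here:

```shell
# Hypothetical helper: print only the HTTP status code for a URL ("000" if no response arrived).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}
```

`http_status http://172.18.139.186/sandbox/credit-card/` would be expected to print 200 when nginx serves the dashboard. This does not replace the browser test, since it does not load the CSS and JS or exercise the login.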

8. Following logs in real time

On 92 (micro agent):

ssh administrator@172.18.136.92
sudo journalctl -u sdp-agent-micro.service -f
# Ctrl-C to exit
exit

On 186 (control plane + gateway agent):

ssh administrator@172.18.139.186
sudo journalctl -u sdp-control-plane.service -u sdp-agent-gateway.service -f
# Ctrl-C to exit
exit

Common one-time fixes (apply, then re-run from step 4)

${SDP_CP_URL} doesn't expand in the unit's ExecStart

Symptom: agent logs flag: invalid value "${SDP_CP_URL}" for -cp.

Fix: hardcode the URL in the unit. On your laptop, edit systemd/sdp-agent-micro.service:

ExecStart=/home/administrator/SDP/bin/agent-micro -node micro -cp ws://172.18.139.186:3452/ws/agent

(Remove the Environment= / EnvironmentFile= / ${SDP_CP_URL} lines.) Do the same for systemd/sdp-agent-gateway.service (URL is ws://127.0.0.1:3452/ws/agent). Re-do steps 5 and 6.

Micro agent on 92 can't reach the control plane on 186:3452

Symptom: sdp-agent-micro.service journal shows dial: ... connection refused or i/o timeout to 172.18.139.186:3452.

Fix: add a /ws/agent proxy block to 186's nginx (alongside the four from nginx/sandbox.conf):

location /ws/agent {
    proxy_pass http://127.0.0.1:3452;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 3600s;
}

On your laptop, edit systemd/sdp-agent-micro.service to dial through nginx on 80:

Environment=SDP_CP_URL=ws://172.18.139.186/ws/agent

(Port 80, no :3452.) Then on 186, reload nginx and re-do steps 5 and 6a.
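
Before redeploying the unit, it can help to confirm that nginx actually forwards the upgrade. A sketch that reuses the handshake headers from the WS endpoint check at the end of this document; ws_probe is a hypothetical helper, and node=micro is taken from that same check:

```shell
# Hypothetical helper: send a WebSocket upgrade request and print only the response status line.
ws_probe() {
    curl -s -i --max-time 5 \
        -H "Connection: Upgrade" -H "Upgrade: websocket" \
        -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
        -H "Sec-WebSocket-Version: 13" \
        "$1" | head -n 1
}
```

From 92, run `ws_probe "http://172.18.139.186/ws/agent?node=micro"`. A 101 status line suggests the proxy block is working. The --max-time flag stops curl from holding the upgraded connection open indefinitely.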

Login returns "git ls-remote rejected"

Either:

  • The gateway agent isn't connected (re-run step 6b and check the journal).

  • Your Bitbucket creds are wrong.

  • The api-gateway repo path on 186 is wrong. The agent looks at /var/www/html/erangel-ocean by default. On 186:

    ls -d /var/www/html/erangel-ocean
    

    If the repo is at a different path, edit agent-gateway/cmd/agent-gateway/main.go:

    var repos = map[string]string{
        "api-gateway": "/your/actual/path",
    }
    

    Then run ./scripts/build.sh and re-do steps 4 and 6b.

Service containers can't be created (alpine:3.20 or php:8.3-apache not loaded)

Symptom: a deploy event stream shows DEPLOY FAILED with image not found.

The runtime images must be pre-loaded on the host (the VMs have no internet). On 92:

ssh administrator@172.18.136.92
docker load -i /path/to/alpine-3.20.tar
exit

On 186:

ssh administrator@172.18.139.186
docker load -i /path/to/php-8.3-apache.tar
docker load -i /path/to/alpine-3.20.tar
exit
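
The tar files have to be produced on a machine that does have internet access, and it is worth confirming that the load succeeded before retrying the deploy. A minimal sketch; the tar file names mirror the ones above, and require_images is a hypothetical helper:

```shell
# On an internet-connected machine (not the VMs), export each image to a tar file:
#   docker pull alpine:3.20 && docker save -o alpine-3.20.tar alpine:3.20
#   docker pull php:8.3-apache && docker save -o php-8.3-apache.tar php:8.3-apache

# Hypothetical helper: report whether each named image is present in the local Docker image store.
require_images() {
    rc=0
    for img in "$@"; do
        if docker image inspect "$img" >/dev/null 2>&1; then
            echo "ok: $img"
        else
            echo "not loaded: $img"
            rc=1
        fi
    done
    return "$rc"
}
```

On 92, run `require_images alpine:3.20`. On 186, run `require_images php:8.3-apache alpine:3.20`. Any "not loaded" line means the matching docker load still needs to be run on that host.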

Service fails with status=226/NAMESPACE and Failed to set up mount namespacing: No such file or directory

The binary is probably fine; the failure is in systemd's service-execution environment. Confirm this by running the binary manually as administrator:

ssh administrator@172.18.139.186
./SDP/bin/control-plane -addr :3452 -data ./SDP/data
# Should print "control-plane listening on :3452 (data=./SDP/data)"
# Ctrl-C to exit
exit

If that works, the binary is fine. systemd's namespace setup is failing — common cause on this Ubuntu: /run/systemd is missing. Force it to be recreated:

ssh administrator@172.18.139.186
sudo systemctl daemon-reexec
sudo systemctl restart sdp-control-plane.service
sudo systemctl --no-pager status sdp-control-plane.service | head -10
exit

If still failing, the systemd manager itself is in a bad state. Reboot the VM (last resort; will interrupt any other work on it).
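
Independently of the /run/systemd cause above, a 226/NAMESPACE failure with "No such file or directory" can also occur when a path named in the unit's sandboxing directives does not exist. Step 4 mentions that the control-plane unit uses ReadWritePaths, so the following sketch lists those paths and flags missing ones. check_unit_paths is a hypothetical helper, and whether this applies depends on the actual contents of the unit file:

```shell
# Hypothetical helper: check that every path in a unit's ReadWritePaths= exists on disk.
check_unit_paths() {
    unit="$1"
    rc=0
    # systemctl show prints the directive's paths as a space-separated list.
    for p in $(systemctl show -p ReadWritePaths --value "$unit"); do
        p="${p#-}"   # entries may carry a leading "-" (ignore if missing)
        p="${p#+}"   # or a leading "+" (relative to the unit's root directory)
        if [ -e "$p" ]; then
            echo "exists: $p"
        else
            echo "missing: $p"
            rc=1
        fi
    done
    return "$rc"
}
```

On 186, run `check_unit_paths sdp-control-plane.service`. A "missing" line for a path that was not marked optional with "-" is worth fixing (for example with mkdir -p) before resorting to a reboot.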

Agent-micro on 92 gets connection reset by peer connecting to 186:3452

connection reset by peer is at the TCP layer — the SYN reaches the host kernel but something RSTs the connection before the control plane sees it. Common causes:

  1. iptables on 186 has a REJECT rule for 3452. Check:

    ssh administrator@172.18.139.186
    sudo iptables -L INPUT -n | head -30
    exit
    

    If you see a REJECT rule for port 3452, delete or modify it. The control plane is on the same host, so there's no reason to filter loopback or local-subnet traffic to it.

  2. fail2ban has banned the agent's IP. Check:

    ssh administrator@172.18.139.186
    sudo fail2ban-client status
    sudo fail2ban-client status sshd 2>/dev/null
    exit
    

    If 92's IP is in the banned list, add the SDP subnet to ignoreip in /etc/fail2ban/jail.local and sudo fail2ban-client reload.

  3. The kernel's connection tracking has stale state. Restart whichever firewall service is installed (last resort):

    ssh administrator@172.18.139.186
    sudo systemctl restart nftables 2>/dev/null
    sudo systemctl restart firewalld 2>/dev/null
    exit
    
  4. Verify the network path works at all before debugging firewall rules. From 92, a plain HTTP request to the control plane's port:

    ssh administrator@172.18.136.92
    curl -v http://172.18.139.186:3452/
    exit
    

    If you get any HTTP response (even a Go HTTP 400 for "missing node query") → the path is open and the problem is the WebSocket upgrade. If you get Connection reset by peer again → the path is being blocked, look at the iptables/fail2ban angle.

  5. Verify the WS endpoint works on 186 itself (rules out the network and confirms the upgrade logic):

    ssh administrator@172.18.139.186
    curl -i --max-time 5 \
      -H "Connection: Upgrade" -H "Upgrade: websocket" \
      -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
      -H "Sec-WebSocket-Version: 13" \
      "http://127.0.0.1:3452/ws/agent?node=micro"
    exit
    

    Should return HTTP 101 Switching Protocols. If it does, the network from 92 is the issue. If it doesn't, the control plane binary has a problem.
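
The decision in item 4 (any HTTP response means the path is open; a reset means it is blocked) can be captured in a small helper to run from 92. A sketch under the assumption that curl is installed there; classify_path is a hypothetical name:

```shell
# Hypothetical helper: decide whether the TCP/HTTP path to a URL is open or blocked.
classify_path() {
    if code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"); then
        # curl exits 0 whenever it received an HTTP response, including 4xx and 5xx.
        echo "path open (HTTP $code): look at the WebSocket upgrade"
    else
        rc=$?
        # A non-zero exit means no HTTP response arrived (reset, refused, or timeout).
        echo "no HTTP response (curl exit $rc): look at iptables/fail2ban"
    fi
}
```

For example, `classify_path http://172.18.139.186:3452/` prints one of the two verdicts. The curl exit code in the second message helps tell a reset from a refusal or a timeout.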