When plain HTTP from 92 reaches the control plane but the
WebSocket dial RSTs, test the upgrader on each side:
1. curl from 186 to 127.0.0.1:3452 with WS upgrade headers:
   - 101 → control plane is fine, network is the issue.
   - RST/4xx → control plane is broken.
2. curl from 92 to 186:3452 with WS upgrade headers:
   - 101 → firewall allows WS traffic, agent's client is the issue.
   - RST → some middlebox matches on the Upgrade header.
   - 4xx → control plane rejects the upgrade.
The user's micro agent dials ws://172.18.139.186:3452/ws/agent
directly, bypassing nginx. The 'add a /ws/agent nginx proxy
block' workaround was for a different topology where the
agent goes through nginx on 80, which doesn't apply here.
That section was confusing — it suggested an nginx change
the user doesn't need. Delete it.
If something else is also listening on :3452 (a leftover
container, a systemd-managed socket, a proxy from an earlier
session), the kernel can route new connects to it and that
listener may RST. Curl from outside gets a clean response
from the real listener; the agent's WS dial lands on the
stale one and gets RST.
Add ss -tlnp + lsof + systemctl list-sockets to the
diagnostic ladder.
When 186's control plane comes up via systemctl daemon-reexec
(the 226/NAMESPACE fix), the listening socket is briefly
dropped and re-created. Any WebSocket connection that was
in flight at that moment gets RST. The agent retries every
2s, but if the dial happened exactly during the reexec
window the agent can sit in a tight RST loop until restarted.
Document restarting sdp-agent-micro as the first thing to try,
and demote the iptables/fail2ban/curl diagnostic steps to
'if the agent still RSTs after restart'.
'connection reset by peer' on the WS dial is at the TCP layer,
not the application layer. It is almost always a firewall issue on
186: an iptables REJECT rule, a fail2ban ban, or stale conntrack state.
Document the diagnostic ladder: iptables -L INPUT, fail2ban
status, then a plain HTTP curl from 92 to verify the network
path, then a WS upgrade curl on 186 itself to verify the
control plane's upgrader.
The control plane binary will create the data dir on first
start, but creating it before systemd starts the service means
the ReadWritePaths entry points at a directory that already
exists, and makes diagnosis faster if anything else is wrong.
The 226/NAMESPACE error ('Failed to set up mount namespacing')
is misleading: the binary is fine, but systemd can't build the
mount namespace for the service. The binary runs when launched
directly as administrator, so the bug is in the systemd
manager's runtime state.
Document the diagnostic (run the binary manually) and the two
fixes: systemctl daemon-reexec (re-executes the systemd manager
and rebuilds its runtime state) as the first attempt, reboot as
the last resort.
The user has made it clear (twice now) that they don't want
sudo advice in the runbook — they can type the password
themselves and don't want a script or sudoers change.
Delete the 'Diagnose sudo' step and the 'Sudo on the company
VMs' reminder step. Sudo is just expected behavior; when
the user runs 'sudo systemctl ...' and gets a prompt, they
type the password. No commentary needed.
Renumber the remaining steps so they're sequential 0-8.
Company VM: no sudoers changes. Replace the 'set up sudoers
NOPASSWD' step with a brief note that sudo will prompt for the
password and the user types it. The 15-minute sudo timestamp
means the user types it only once per terminal while the
timestamp is fresh, but they will still see the prompt several
times across the deploy, in each new ssh session and whenever
the timestamp has expired.
Update the step-1 diagnostic outcomes to point at the new
no-policy-change reality: NOPASSWD or different passwords
both still work, the user just types the right one at each
sudo prompt.
A hand-typed manual deploy guide. Every step is a single
ssh or scp from the laptop, or a one-shot block of commands
inside the VM. No sshpass, no env-var passwords, no
sudo -S password piping. The user types their passwords
interactively when prompted.
The old deploy.sh had grown into a tangle of -tt / sudo -S
/ PAGER=cat workarounds that hid what was actually happening
and was fragile across systemd versions. The runbook trades
that off for explicit per-step commands that the user can
verify by reading the output.
Troubleshooting section at the bottom covers the four most
likely first-deploy failures: SDP_CP_URL expansion, micro
agent can't reach the control plane, login auth rejection,
and missing runtime images.
With -tt allocating a remote PTY, systemctl and journalctl would
sometimes open a pager (less/more) even with --no-pager, leaving
the script blocked until the user hits q or Ctrl-C.
Force PAGER=cat and SYSTEMD_PAGER=cat inside every remote sudo
call and inside the status_block journalctl command. Add
--output=cat to journalctl too as belt-and-suspenders.
Status output is also piped through head -3 / head -20 to
keep it bounded even if the pager or color escape handling
misbehaves.
Adds password piping so the script works without a sudoers NOPASSWD
rule, on the assumption that the SSH login password is the same as
the sudo password (common on these VMs).
- ssh -tt now forces a TTY allocation; sudo -S requires one and
was failing with 'sudo: no tty present' over plain non-interactive
ssh.
- New run_remote_sudo helper pipes the per-VM password to
'sudo -S -p ""' so each remote call authenticates without a
prompt. The empty -p suppresses '[sudo] password for ...' from
appearing in journal tail output.
- install_unit, restart_unit, and the journalctl call in
status_block all go through run_remote_sudo. systemctl status
no longer needs sudo (the unit is owned by administrator and
status doesn't require root for it).
- If your sudo password differs from the login password, the
script will silently no-op the install/restart steps. Fix by
setting the right password via SDP_92_PASS / SDP_186_PASS, or
add a NOPASSWD rule in /etc/sudoers.d/sdp-deploy and revert
this change.
- systemd/sdp-control-plane.service: plain host process on 186,
listens on :3452, data dir ~/SDP/data. MemoryMax=512M,
Restart=always, ReadWritePaths scoped to the data dir.
- systemd/sdp-agent-micro.service: plain host process on 92,
default SDP_CP_URL=ws://172.18.139.186:3452/ws/agent. Operator
can drop /etc/default/sdp-agent-micro to override. Depends on
docker.service so the dockerd is up before the agent starts.
- systemd/sdp-agent-gateway.service: plain host process on 186,
default SDP_CP_URL=ws://127.0.0.1:3452/ws/agent (loopback since
both live on the same VM). Same env-file override pattern.
- All three use Type=simple, Restart=always, RestartSec=2s. The
agents already reconnect on transient network drops, so
restart-on-crash is the right policy.
- The agents talk to the host dockerd via /var/run/docker.sock to
spawn the actual service containers (sdp-<repo>). Service
containers are managed by docker, not systemd — only the
long-running agents and the control plane are under systemd.
- scripts/deploy.sh: now a one-shot — scp's binaries, dashboard,
and unit files; systemctl daemon-reload + enable --now + restart
each service in the right order (control plane first on 186 so
the gateway agent has something to dial). Prints status + last
10 journal lines per service so the user can see it came up.
- AGENTS.md, README.md: layout tree updated, deploy section
rewritten, the systemd units documented alongside the agents
and control plane.
The original auth commit shipped the in-memory session store with
just Issue and Valid. The Slice-2 /api/logout handler and the
audit-trail (user column on each deployment) need:
- User(tok): look up the username for a valid session.
- Revoke(tok): drop a session; used by /api/logout.
Tiny follow-up — kept as its own commit because the rest of the
auth work had already shipped in the parent commit by the time the
dashboard's logout button and the deployment-audit-trail surfaced
the need for these methods.
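
For reference, the four methods can be sketched as below. This is a
minimal illustration, not the committed code: the Sessions type, the
constructor, the exact signatures, and the token format are all
assumptions, and the sketch has no expiry.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// Sessions is a hypothetical in-memory token -> username store.
type Sessions struct {
	mu    sync.RWMutex
	byTok map[string]string
}

func NewSessions() *Sessions {
	return &Sessions{byTok: make(map[string]string)}
}

// Issue creates a random token for user and remembers it.
func (s *Sessions) Issue(user string) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	tok := hex.EncodeToString(buf)
	s.mu.Lock()
	s.byTok[tok] = user
	s.mu.Unlock()
	return tok, nil
}

// Valid reports whether tok belongs to a live session.
func (s *Sessions) Valid(tok string) bool {
	_, ok := s.User(tok)
	return ok
}

// User looks up the username for a valid session.
func (s *Sessions) User(tok string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	u, ok := s.byTok[tok]
	return u, ok
}

// Revoke drops a session; revoking an unknown token is a no-op.
func (s *Sessions) Revoke(tok string) {
	s.mu.Lock()
	delete(s.byTok, tok)
	s.mu.Unlock()
}

func main() {
	s := NewSessions()
	tok, err := s.Issue("alice")
	if err != nil {
		panic(err)
	}
	user, _ := s.User(tok)
	fmt.Println("logged in as", user)
	s.Revoke(tok) // what a logout handler would do
	fmt.Println("valid after logout:", s.Valid(tok))
}
```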
- control-plane default listen addr is now :3452 (was :8080), an
unusual port chosen to avoid collisions on the VM.
- agent-micro and agent-gateway default SDP_CP_URL points at
ws://localhost:3452/ws/agent. docker-compose.yml updates the
control plane command, host port mapping, and agent -cp URLs.
- nginx/nginx.conf (the legacy root-mount reference) uses
127.0.0.1:3452 for the upstream. nginx/sandbox.conf is the new
deployment config: four location blocks for the /sandbox/credit-card
mount — _next/static serves cached chunks, /api/ and /ws/ proxy
to 127.0.0.1:3452, /sandbox/credit-card serves the static
dashboard with try_files for SPA routing.
- scripts/patch-nginx.sh: deleted. The user configures nginx on 186
by hand. scripts/deploy.sh no longer calls it.
- AGENTS.md: new file. Documents the build/lint/test commands
(with the golang:1.24-alpine container — local Go can't fetch
the toolchain), the wire protocol, the Slice-2 conventions
(sdp-<repo> container naming, snapshot persistence,
PreGitReset/AfterStart hooks), the repo-path gotcha, and the
build-artifacts-in-git rationale.
- dashboard/out: now tracked in git, alongside bin/. The dashboard
static export is scp'd to 186 on deploy; the VMs have no
internet so they can't regenerate it. .gitignore comment
explains this and warns against re-ignoring.
- README.md / REQUIREMENTS.md: status updated to 'Slice 2 done',
per-feature checklist marked. Erangel repo path corrected to
/var/www/html/erangel-ocean (was wrongly ~/SDP in earlier docs).
- New /dashboard layout with a top nav (Quick Deploy / Sandboxes /
Templates / Environments / History) and a Logout button that
invalidates the session.
- Quick Deploy: stage list switches per repo (Go vs PHP, so the
composer-install stage is shown for the gateway), env-var textarea,
host-port input.
- Sandboxes: list, create, clone-from-template, delete.
- Sandbox detail: live <key>_url map from the gateway's config.php,
per-route toggle (OCP / sandbox override with a URL input),
microservice deploys with per-service host port and env, branch
picker.
- Templates / Environments: list + create + delete.
- History: filterable deployment list with state badges.
- Sandbox detail page is a server component with generateStaticParams
that delegates to a client component; required for the static export.
- API client: prefix all /api and /ws URLs with NEXT_PUBLIC_BASE_PATH
(set in next.config.js) so the dashboard works under a non-root
basePath.
- next.config.js: basePath and assetPrefix set to /sandbox/credit-card
so asset URLs and internal Link hrefs resolve under the sub-path.
NEXT_PUBLIC_BASE_PATH env is exposed to the browser bundle for the
fetch() prefix.
- store: add tables and CRUD for sandboxes (with services), templates
(with services, clone-into-sandbox), environments (named key/value
sets), and routes (per-sandbox <service>_url overrides).
- api: split into one file per resource. handleSandboxes/handleSandboxByID
covers CRUD + 'clone from template' + 'deploy one service in a sandbox'
(which merges the sandbox's env into the request, picks the port,
and dispatches the deploy frame to the right node).
handleTemplates/handleTemplateByID, handleEnvironments/handleEnvironmentByID,
handlePushRoutes cover the rest. The control plane's repo->node
resolution still lives in resolveNode (api-gateway -> gateway,
everything else -> micro).
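
The repo->node rule mentioned above is small enough to sketch. The
signature and the string node names are assumptions for
illustration; only the mapping itself comes from this commit.

```go
package main

import "fmt"

// resolveNode maps a repo name to the agent node that deploys it:
// api-gateway goes to the gateway agent, everything else to micro.
// (Signature assumed; the real function may differ.)
func resolveNode(repo string) string {
	if repo == "api-gateway" {
		return "gateway"
	}
	return "micro"
}

func main() {
	// "billing" is a made-up repo name used only as an example.
	for _, repo := range []string{"api-gateway", "billing"} {
		fmt.Printf("%s -> %s\n", repo, resolveNode(repo))
	}
}
```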
- protocol: add RepoInfo, RouteOverride; add HostPort, SandboxID to DeployRequest.
- ws hub: add CallAgent for sync request/response RPCs over the agent WS,
and DeliverAgentReply to route {op:reply} frames back to the caller.
UnregisterAgent now also fails any pending RPCs so callers don't hang.
- agent-micro: new op handlers list_repos, list_branches, probe.
Wire protocol.Event frames use json.RawMessage so each op decodes
its own data shape.
- agent-gateway: same op handlers (list_repos/list_branches/probe) plus
push_routes, which the gateway uses to rewrite the api-gateway
config.php. Detailed in a later commit.
- control-plane login: validateViaAgent now calls CallAgent('probe')
against the gateway agent (git ls-remote), replacing the
accept-any-creds stub.
- control-plane repos: handleListRepos and handleListBranches forward
to the agents via list_repos / list_branches RPCs, replacing the
hardcoded fixtures.
- control-plane deployments: split into its own file. handleListDeployments
reads from SQLite (was hardcoded []). handleCreateDeployment now
supports sandbox-scoped deploys with a host port + env merge.
handleStopDeployment looks up the node from the deployment row.
- store: split into store.go + deployments.go. The Deployment type
adds sandboxId, containerId, hostPort. StartDeploymentInSandbox,
SetContainerID, ListDeployments, GetDeployment, LatestDeploymentBySandboxService
are new.
- store_test.go: round-trips every Slice-2 path (env, sandbox,
template, clone, routes, deployment).
- .gitignore: track bin/ — the build runs on a separate Linux box
with the golang:1.24 toolchain, and the binaries are SCPed from
there to the company VMs (92 / 186). The VMs have no internet.
- Tracked bin/{control-plane,agent-micro,agent-gateway}.
- New agentlib module (gitutil + deployer with NewGo / NewPHP) replaces
agent-micro/internal so both agents can share it (Go's internal/ rule
was blocking agent-gateway from importing agent-micro's packages).
- Migrate agents from legacy github.com/docker/docker/client to the
current github.com/moby/moby/client v0.5.0 / moby/moby/api v1.55.0.
- Fix compile errors in the original committed code: missing
gorilla/websocket import in control-plane/internal/ws/handlers.go,
unaliased dockerclient reference, wrong SQLite driver name
(sqlite3 -> sqlite), Dialer.Dial 3-return-value mismatch.
- scripts/build.sh: Go 1.23 -> 1.24, apk add git, safe.directory for
bind-mounted host tree, chmod inside container (host can't chmod
files owned by container root).
- README and REQUIREMENTS updated to reflect the actual architecture
(Go + SQLite, no Spring Boot, moby SDK, no per-deploy image build)
with a per-feature status checklist at the end of REQUIREMENTS.
Build now uses scripts/build.sh (Docker cross-compile, no Go install
needed). Add Prereqs, docker-compose dev section, Architecture notes,
and a list of intentional MVP stubs so reviewers know what's still
scaffolded vs what's real.
The go.work file enables workspace mode, which only allows -mod=readonly
or -mod=vendor. Setting -mod=mod fails the build with:
  go: -mod may only be set to readonly or vendor when in workspace mode
Drop the GOFLAGS line and let workspace mode pick the default
(readonly). Update go.work.sum to track module checksums.
Sandbox Deployment Platform: Go control plane + agents, Next.js dashboard,
nginx reverse proxy. Cross-compile via Docker; deploy via sshpass to
172.18.136.92 (micro) and 172.18.139.186 (gateway).
- control-plane: HTTP API, WS hub, SQLite (modernc.org/sqlite) for
progress, .log files for log persistence
- agent-micro / agent-gateway: alpine:3.20 + bind-mounted repo,
binary exec'd in container, no Dockerfile build step
- dashboard: Next.js static export + shadcn/ui components, single
WebSocket hook
- docker-compose.yml: three services on alpine:latest with docker
socket bind for agents
- scripts/: build.sh (golang:1.23-alpine cross-compile), deploy.sh,
patch-nginx.sh (idempotent nginx splice), ssh wrappers
Runtime model: pass-through Bitbucket creds per deploy, never logged or
persisted on the agent. Control plane never touches git or docker
directly — agents do all the work locally.