Achmad 574e6d207b Slice 2: agents and control plane run under systemd
- systemd/sdp-control-plane.service: plain host process on 186,
  listens on :3452, data dir ~/SDP/data. MemoryMax=512M,
  Restart=always, ReadWritePaths scoped to the data dir.
- systemd/sdp-agent-micro.service: plain host process on 92,
  default SDP_CP_URL=ws://172.18.139.186:3452/ws/agent. Operator
  can drop /etc/default/sdp-agent-micro to override. Depends on
  docker.service so the dockerd is up before the agent starts.
- systemd/sdp-agent-gateway.service: plain host process on 186,
  default SDP_CP_URL=ws://127.0.0.1:3452/ws/agent (loopback since
  both live on the same VM). Same env-file override pattern.
- All three use Type=simple, Restart=always, RestartSec=2s. The
  agents already reconnect on transient network drops, so
  restart-on-crash is the right policy.
- The agents talk to the host dockerd via /var/run/docker.sock to
  spawn the actual service containers (sdp-<repo>). Service
  containers are managed by docker, not systemd — only the
  long-running agents and the control plane are under systemd.
- scripts/deploy.sh: now a one-shot — scp's binaries, dashboard,
  and unit files; systemctl daemon-reload + enable --now + restart
  each service in the right order (control plane first on 186 so
  the gateway agent has something to dial). Prints status + last
  10 journal lines per service so the user can see it came up.
- AGENTS.md, README.md: layout tree updated, deploy section
  rewritten, the systemd units documented alongside the agents
  and control plane.
2026-06-24 04:54:28 +00:00


AGENTS.md — Sandbox Deployment Platform

Build, lint, test

The build script is the only way to compile — local Go can't fetch the 1.24 toolchain. Run:

./scripts/build.sh        # cross-compiles 3 Go binaries + builds the Next.js dashboard
./scripts/deploy.sh       # SSHes artifacts + systemd units to 92 and 186, then enables+starts them; needs sshpass

The script uses a golang:1.24-alpine container with a persistent sdp-gocache named volume. GO_IMAGE=... overrides the image. Outputs: bin/{control-plane,agent-micro,agent-gateway} (Linux/amd64, static) and dashboard/out/.

Per-module Go work uses the same container:

docker run --rm -v "$PWD:/src" -w /src/<module> golang:1.24-alpine sh -c \
  "apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go vet ./..."

To test a single package (add -run <TestName> to narrow it to one test):

docker run --rm -v "$PWD:/src" -w /src/control-plane golang:1.24-alpine sh -c \
  "apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go test ./internal/store/..."

There is one test file today: control-plane/internal/store/store_test.go (round-trips all Slice-2 CRUD).

The dashboard has no separate typecheck or lint script; npm run build runs both. Running cd dashboard && npm run build locally is fine; node_modules is gitignored.

Layout

Five Go modules in a workspace (go.work), plus the systemd/ unit files:

  • protocol/ — wire types shared by CP and agents. Keep small.
  • agentlib/: shared agent packages gitutil (askpass-via-stdin credential helper; git ls-remote, fetch, checkout, pull, for-each-ref, reset --hard) and deployer (per-deployment state machine; NewGo for microservices, NewPHP for erangel).
  • control-plane/ — HTTP API + WS hub + SQLite. Routes split across internal/api/{login,sandboxes,templates,environments,routes,deployments,repos}.go. internal/ws/hub.go exposes CallAgent for sync RPCs.
  • agent-micro/ — runs on 172.18.136.92.
  • agent-gateway/ — runs on 172.18.139.186; owns erangel at /var/www/html/erangel-ocean and the <service>_url patching.
  • systemd/ — unit files for the three long-running services (sdp-control-plane.service, sdp-agent-micro.service, sdp-agent-gateway.service). All three are plain host processes managed by systemd; the agents talk to the host's dockerd via /var/run/docker.sock to spawn the actual service containers (sdp-<repo>) for each deploy. Service containers are NOT managed by systemd — that's docker's job.

The dashboard is a separate next build static export whose pages live under dashboard/src/app/. Because it is a static export, dynamic routes need generateStaticParams (see the sandboxes/[id] page for the pattern).

Wire protocol

Agent → control plane: one protocol.Event per WS text message. Control plane → agent: an ad-hoc envelope {op, id, data}, where op is one of deploy, stop, list_repos, list_branches, list_routes, probe, push_routes. The agent answers each RPC on its own channel with {op:"reply", id, ok, error?, data}, so the control plane tells the two shapes apart by which key is present: kind for an event, op for an RPC reply. There is no central registry: a new op needs a case in the switch of both agents' main.go (agent-micro/.../main.go and agent-gateway/.../main.go) plus a handler in the control plane's repos.go, sandboxes.go or routes.go.
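
A sketch of how a receiver can tell the two shapes apart and how an agent-side switch answers an op, assuming the JSON keys are literally kind, op, id, ok, error and data as listed above and that ids are strings. The envelope type and the classify and handle functions are hypothetical, not the types in protocol/ or the agents.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// envelope overlays the keys of both shapes so a single Unmarshal can tell
// them apart. Hypothetical type.
type envelope struct {
	Kind  string          `json:"kind,omitempty"` // present on protocol.Event messages
	Op    string          `json:"op,omitempty"`   // present on RPC requests and replies
	ID    string          `json:"id,omitempty"`
	OK    bool            `json:"ok"`
	Error string          `json:"error,omitempty"`
	Data  json.RawMessage `json:"data,omitempty"`
}

// classify labels an agent -> control-plane message as "reply" or "event".
func classify(msg []byte) (string, envelope, error) {
	var env envelope
	if err := json.Unmarshal(msg, &env); err != nil {
		return "", env, err
	}
	switch {
	case env.Op == "reply":
		return "reply", env, nil
	case env.Kind != "":
		return "event", env, nil
	default:
		return "", env, errors.New("message has neither kind nor op=reply")
	}
}

// handle is an agent-side switch over control-plane ops. Each agent's
// main.go carries its own copy, so a new op needs a case in both.
func handle(req envelope) envelope {
	rep := envelope{Op: "reply", ID: req.ID}
	switch req.Op {
	case "deploy", "stop", "list_repos", "list_branches", "list_routes", "probe", "push_routes":
		// A real agent runs the op here and fills Data; the sketch only acknowledges it.
		rep.OK = true
		rep.Data = json.RawMessage(`{}`)
	default:
		rep.Error = fmt.Sprintf("unknown op %q", req.Op)
	}
	return rep
}

func main() {
	kind, _, _ := classify([]byte(`{"op":"reply","id":"7","ok":true,"data":{}}`))
	fmt.Println(kind) // reply
	kind, _, _ = classify([]byte(`{"kind":"example"}`)) // "example" is a made-up kind
	fmt.Println(kind) // event
	out, _ := json.Marshal(handle(envelope{Op: "probe", ID: "9"}))
	fmt.Println(string(out))
}
```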

Conventions

  • ponytail: comments mark intentional shortcuts and "TODO: real impl"-style carve-outs. They survive into main. Don't remove without fixing the underlying limitation.
  • Slice-2 stable container name: sdp-<repo> (no deployment id). The next deploy force-removes the existing one. One live container per repo at a time.
  • Gateway agent persists the per-branch OCP-default snapshot to <repoPath>/.sdp/ocp-defaults.json. Re-captured on every deploy so branch switches don't break "Restore OCP" buttons.
  • NewPHP runs git reset --hard before fetch (via Spec.PreGitReset), and the agent passes an AfterStart closure that re-applies active route overrides after the container is up. This is what survives git reset --hard + checkout.
  • protocol.Event.ContainerID is set on the deployer side; the deployer writes it back via Store.SetContainerID. (Currently the field on the event is unused; container id is recorded in SQLite.)
  • Cookie auth: sdp_session HttpOnly cookie; the withAuth middleware skips /api/login. WebSocket endpoints are NOT auth-gated by the middleware — they rely on the agent being on a private network.
  • Credentials travel with each deploy/probe/push_routes frame from control plane to agent. They are never logged and never persisted on the agent.

Gotchas

  • Host Go (/usr/bin/go) is older than the go 1.24 modules require and the toolchain download is blocked. Use the golang:1.24-alpine container. Do not edit code expecting go build to work locally.
  • The micro agent and gateway agent main.go files duplicate most logic (dial / writer / readLoop / runDeploy). The shared code is in agentlib/. When adding a new op, both files need a switch case.
  • moby/moby/client v0.5.0 uses netip.Addr for PortBinding.HostIP, not a string.
  • sdp-<repo> containers must be in a state where docker rm -f works (the Slice-2 "one live per repo" rule). Don't manually docker run a second container with the same name.
  • The erangel repo path is /var/www/html/erangel-ocean on 186, NOT ~/SDP (README's earlier value is wrong; the spec was fixed in Slice 2). APACHE_DOCUMENT_ROOT is set to the same path so the gateway is served at /erangel/.
  • agent-gateway/.../main.go re-imports the routesState type and uses rs as both a value and a parameter name in some helpers. Compiles fine; just be aware when grepping.
  • Static-export dynamic routes: generateStaticParams must return at least one placeholder; the actual id is read at runtime in the client component. See dashboard/src/app/dashboard/sandboxes/[id]/.

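The "one live container per repo" rule comes down to a fixed name plus force-remove-then-create. A minimal sketch of that ordering, with docker hidden behind a hypothetical containerRuntime interface (the agents actually call the moby client, whose signatures are not reproduced here). The fake runtime rejects a duplicate name the way docker does, which is what a manual second docker run with the same name runs into; "billing" is a made-up repo name.

```go
package main

import (
	"errors"
	"fmt"
)

// containerRuntime is a hypothetical slice of the docker API the deployer needs.
type containerRuntime interface {
	RemoveForce(name string) error // like `docker rm -f`; a missing container is fine
	Run(name, image string) (id string, err error)
}

// containerName is the Slice-2 stable name: no deployment id.
func containerName(repo string) string { return "sdp-" + repo }

// redeploy replaces whatever sdp-<repo> is running with a new container.
func redeploy(rt containerRuntime, repo, image string) (string, error) {
	name := containerName(repo)
	if err := rt.RemoveForce(name); err != nil {
		return "", fmt.Errorf("remove %s: %w", name, err)
	}
	return rt.Run(name, image)
}

// fakeRuntime keeps at most one container per name and records calls.
type fakeRuntime struct {
	live  map[string]string // name -> image
	calls []string
	next  int
}

func (f *fakeRuntime) RemoveForce(name string) error {
	f.calls = append(f.calls, "rm -f "+name)
	delete(f.live, name)
	return nil
}

func (f *fakeRuntime) Run(name, image string) (string, error) {
	f.calls = append(f.calls, "run "+name)
	if _, exists := f.live[name]; exists {
		return "", errors.New("conflict: container name already in use")
	}
	f.live[name] = image
	f.next++
	return fmt.Sprintf("c%d", f.next), nil
}

func main() {
	rt := &fakeRuntime{live: map[string]string{}}
	for _, img := range []string{"img:v1", "img:v2"} {
		if _, err := redeploy(rt, "billing", img); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println(rt.calls) // [rm -f sdp-billing run sdp-billing rm -f sdp-billing run sdp-billing]
	fmt.Println(rt.live)  // map[sdp-billing:img:v2]
}
```
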
Verifying changes locally

# Typecheck + build everything
./scripts/build.sh

# Run the only Go test
docker run --rm -v "$PWD:/src" -w /src/control-plane golang:1.24-alpine sh -c \
  "apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go test ./..."

# Smoke the control plane
./bin/control-plane -addr :3452 -data /tmp/sdp-data &
curl -i -X POST http://127.0.0.1:3452/api/login -d '{"username":"x","password":"y"}'
# Expects 401 ("login failed — git ls-remote rejected") when no gateway agent is connected.

Out of scope

RBAC, suspend/resume, sandbox cloning beyond "clone template into sandbox", per-sandbox Docker networks, per-sandbox resource limits, health monitoring, the 172.18.136.93 infra agent, notifications. REQUIREMENTS.md lists these as later work.

Do not

  • Do not commit or push unless the user explicitly says "commit" or "push".
  • Do not change the gateway repo path back to ~/SDP (old docs say so; reality is /var/www/html/erangel-ocean).
  • Do not serve the dashboard with next start in production; the static output is served by nginx on 186. Configure nginx by hand: the reference config is nginx/nginx.conf, which uses root /home/administrator/SDP/dashboard; (the path deploy.sh scp's the static export to).
  • Do not log or persist Bitbucket creds anywhere.