Achmad 574e6d207b Slice 2: agents and control plane run under systemd
- systemd/sdp-control-plane.service: plain host process on 186,
  listens on :3452, data dir ~/SDP/data. MemoryMax=512M,
  Restart=always, ReadWritePaths scoped to the data dir.
- systemd/sdp-agent-micro.service: plain host process on 92,
  default SDP_CP_URL=ws://172.18.139.186:3452/ws/agent. Operator
  can drop /etc/default/sdp-agent-micro to override. Depends on
  docker.service so the dockerd is up before the agent starts.
- systemd/sdp-agent-gateway.service: plain host process on 186,
  default SDP_CP_URL=ws://127.0.0.1:3452/ws/agent (loopback since
  both live on the same VM). Same env-file override pattern.
- All three use Type=simple, Restart=always, RestartSec=2s. The
  agents already reconnect on transient network drops, so
  restart-on-crash is the right policy.
- The agents talk to the host dockerd via /var/run/docker.sock to
  spawn the actual service containers (sdp-<repo>). Service
  containers are managed by docker, not systemd — only the
  long-running agents and the control plane are under systemd.
- scripts/deploy.sh: now a one-shot — scp's binaries, dashboard,
  and unit files; systemctl daemon-reload + enable --now + restart
  each service in the right order (control plane first on 186 so
  the gateway agent has something to dial). Prints status + last
  10 journal lines per service so the user can see it came up.
- AGENTS.md, README.md: layout tree updated, deploy section
  rewritten, the systemd units documented alongside the agents
  and control plane.
2026-06-24 04:54:28 +00:00

# AGENTS.md — Sandbox Deployment Platform
## Build, lint, test
The build script is the only way to compile — local Go can't fetch the
1.24 toolchain. Run:
```
./scripts/build.sh # cross-compiles 3 Go binaries + builds the Next.js dashboard
./scripts/deploy.sh # SSHes artifacts + systemd units to 92 and 186, then enables+starts them; needs sshpass
```
The build script uses a `golang:1.24-alpine` container with a persistent
`sdp-gocache` named volume. `GO_IMAGE=...` overrides the image. Outputs:
`bin/{control-plane,agent-micro,agent-gateway}` (Linux/amd64, static) and
`dashboard/out/`.
Per-module Go work uses the same container:
```
docker run --rm -v "$PWD:/src" -w /src/<module> golang:1.24-alpine sh -c \
"apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go vet ./..."
```
For a single test:
```
docker run --rm -v "$PWD:/src" -w /src/control-plane golang:1.24-alpine sh -c \
"apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go test ./internal/store/..."
```
There is one test file today: `control-plane/internal/store/store_test.go`
(round-trips all Slice-2 CRUD).
The dashboard has no separate typecheck or lint script — `npm run build`
runs both. `cd dashboard && npm run build` locally is fine; node_modules
is gitignored.
## Layout
Five Go modules in a workspace (`go.work`), plus the `systemd/` unit directory:
- `protocol/` — wire types shared by CP and agents. Keep small.
- `agentlib/` — `gitutil` (askpass-via-stdin credential helper;
`git ls-remote`, `fetch`, `checkout`, `pull`, `for-each-ref`,
`reset --hard`) and `deployer` (per-deployment state machine; `NewGo`
for microservices, `NewPHP` for erangel).
- `control-plane/` — HTTP API + WS hub + SQLite. Routes split across
`internal/api/{login,sandboxes,templates,environments,routes,deployments,repos}.go`.
`internal/ws/hub.go` exposes `CallAgent` for sync RPCs.
- `agent-micro/` — runs on 172.18.136.92.
- `agent-gateway/` — runs on 172.18.139.186; owns erangel at
`/var/www/html/erangel-ocean` and the `<service>_url` patching.
- `systemd/` — unit files for the three long-running services
(`sdp-control-plane.service`, `sdp-agent-micro.service`,
`sdp-agent-gateway.service`). All three are plain host processes
managed by systemd; the agents talk to the host's dockerd via
`/var/run/docker.sock` to spawn the actual service containers
(`sdp-<repo>`) for each deploy. Service containers are NOT
managed by systemd — that's docker's job.
Dashboard is a separate `next build` static export under
`dashboard/src/app/`. Static export means dynamic routes need
`generateStaticParams` (see the `sandboxes/[id]` page for the pattern).
## Wire protocol
The agent → control-plane channel is one `protocol.Event` per WS text
message. The control-plane → agent channel is an ad-hoc envelope
`{op, id, data}`. `op` values: `deploy`, `stop`, `list_repos`,
`list_branches`, `list_routes`, `probe`, `push_routes`. RPC replies have
`{op:"reply", id, ok, error?}` and a `data` field. The two shapes are
disambiguated by `kind` (event) vs `op` (rpc reply). New ops go in
each agent's `main.go` switch (`agent-micro/` and `agent-gateway/`, see
Gotchas) and the control-plane's `repos.go` / `sandboxes.go` /
`routes.go` handlers; there is no central registry.
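
As a rough illustration of the `kind` vs `op` disambiguation described above, a receiver can decode only the two discriminator fields before picking a full type. This is a minimal sketch and not the repo's code: `frameProbe` and `classifyFrame` are hypothetical names, the sample `kind` value is made up, and the string-typed `id` in the sample frames is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// frameProbe decodes only the discriminator fields; the rest of the frame
// is ignored at this stage. Hypothetical helper type, not part of the repo.
type frameProbe struct {
	Kind string `json:"kind"`
	Op   string `json:"op"`
}

// classifyFrame reports which shape a WS text message has:
// "event" (has kind), "reply" (op == "reply"), or "request" (any other op).
func classifyFrame(raw []byte) (string, error) {
	var p frameProbe
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", fmt.Errorf("decode frame: %w", err)
	}
	switch {
	case p.Kind != "":
		return "event", nil
	case p.Op == "reply":
		return "reply", nil
	case p.Op != "":
		return "request", nil
	default:
		return "", errors.New("frame has neither kind nor op")
	}
}

func main() {
	frames := []string{
		`{"kind":"example"}`, // made-up kind value
		`{"op":"reply","id":"1","ok":true,"data":{}}`,
		`{"op":"list_repos","id":"2","data":{}}`,
	}
	for _, f := range frames {
		shape, err := classifyFrame([]byte(f))
		fmt.Println(shape, err)
	}
}
```

Probing first keeps the full `protocol.Event` and envelope decoders separate, so a malformed frame fails before either one runs.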
## Conventions
- `ponytail:` comments mark intentional shortcuts and "TODO: real
impl"-style carve-outs. They survive into main. Don't remove without
fixing the underlying limitation.
- Slice-2 stable container name: `sdp-<repo>` (no deployment id). The
next deploy force-removes the existing one. One live container per
repo at a time.
- Gateway agent persists the per-branch OCP-default snapshot to
`<repoPath>/.sdp/ocp-defaults.json`. Re-captured on every deploy so
branch switches don't break "Restore OCP" buttons.
- `NewPHP` runs `git reset --hard` before fetch (via
`Spec.PreGitReset`), and the agent passes an `AfterStart` closure
that re-applies active route overrides after the container is up.
This is how route overrides survive `git reset --hard` + checkout.
- `protocol.Event.ContainerID` is set on the deployer side; the
deployer writes it back via `Store.SetContainerID`. (Currently the
field on the event is unused; container id is recorded in SQLite.)
- Cookie auth: `sdp_session` HttpOnly cookie; the `withAuth` middleware
skips `/api/login`. WebSocket endpoints are NOT auth-gated by the
middleware — they rely on the agent being on a private network.
- Credentials travel with each deploy/probe/push_routes frame from
control plane to agent. Never logged. Never persisted on the agent.
## Gotchas
- Host Go (`/usr/bin/go`) is older than the `go 1.24` modules require
and the toolchain download is blocked. Use the `golang:1.24-alpine`
container. Do not edit code expecting `go build` to work locally.
- The micro agent and gateway agent `main.go` files duplicate most
logic (dial / writer / readLoop / runDeploy). The shared code is in
`agentlib/`. When adding a new op, both files need a switch case.
- `moby/moby/client` v0.5.0 uses `netip.Addr` for `PortBinding.HostIP`,
not a string.
- `sdp-<repo>` containers must be in a state where `docker rm -f` works
(the Slice-2 "one live container per repo" rule). Don't manually
`docker run` a second container with the same name.
- The erangel repo path is `/var/www/html/erangel-ocean` on 186, NOT
`~/SDP` (README's earlier value is wrong; the spec was fixed in
Slice 2). `APACHE_DOCUMENT_ROOT` is set to the same path so the
gateway is served at `/erangel/`.
- `agent-gateway/.../main.go` re-imports the `routesState` type and
uses `rs` as both a value and a parameter name in some helpers.
Compiles fine; just be aware when grepping.
- Static-export dynamic routes: `generateStaticParams` must return at
least one placeholder; the actual id is read at runtime in the
client component. See `dashboard/src/app/dashboard/sandboxes/[id]/`.
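
For the `netip.Addr` gotcha above, the practical consequence is that a host IP held as a string has to be parsed before it can go into a binding. The sketch below uses only the standard library. `portBinding` is a local stand-in rather than the moby type, `newBinding` is a hypothetical helper, and treating an empty host IP as `0.0.0.0` is an assumption of this example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// portBinding is a local stand-in for the client's binding type; only the
// netip.Addr-typed HostIP field matters here. Hypothetical, not the moby type.
type portBinding struct {
	HostIP   netip.Addr
	HostPort string
}

// newBinding parses hostIP up front so a bad address fails early.
// An empty hostIP is treated as "all IPv4 interfaces" (sketch assumption).
func newBinding(hostIP, hostPort string) (portBinding, error) {
	addr := netip.IPv4Unspecified()
	if hostIP != "" {
		var err error
		if addr, err = netip.ParseAddr(hostIP); err != nil {
			return portBinding{}, fmt.Errorf("host ip %q: %w", hostIP, err)
		}
	}
	return portBinding{HostIP: addr, HostPort: hostPort}, nil
}

func main() {
	b, err := newBinding("127.0.0.1", "8080")
	fmt.Println(b.HostIP, b.HostPort, err)

	// Hostnames are not addresses, so this returns an error.
	_, err = newBinding("localhost", "8080")
	fmt.Println(err != nil)
}
```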
## Verifying changes locally
```
# Typecheck + build everything
./scripts/build.sh
# Run the only Go test
docker run --rm -v "$PWD:/src" -w /src/control-plane golang:1.24-alpine sh -c \
"apk add --no-cache git >/dev/null && git config --global --add safe.directory /src && go test ./..."
# Smoke the control plane
./bin/control-plane -addr :3452 -data /tmp/sdp-data &
curl -i -X POST http://127.0.0.1:3452/api/login -d '{"username":"x","password":"y"}'
# Expects 401 ("login failed — git ls-remote rejected") when no gateway agent is connected.
```
## Out of scope
RBAC, suspend/resume, sandbox cloning beyond "clone template into
sandbox", per-sandbox Docker networks, per-sandbox resource limits,
health monitoring, the 172.18.136.93 infra agent, notifications.
These are listed as `later` in REQUIREMENTS.md.
## Do not
- Do not commit or push unless the user explicitly says "commit" or
"push".
- Do not change the gateway repo path back to `~/SDP` (old docs say
so; reality is `/var/www/html/erangel-ocean`).
- Do not serve the dashboard via `next start` in production; the
static export is served by nginx on 186. Configure nginx by hand; the
reference config is in `nginx/nginx.conf` and uses
`root /home/administrator/SDP/dashboard;` (i.e. the path
`deploy.sh` scp's the static export to).
- Do not log or persist Bitbucket creds anywhere.