How I Built 8 Specialist AI Agents for Claude Code — and Used Them to Ship a Game I’d Never Built Before
Someone left a comment on my LinkedIn post: “These agents are good for small projects, not big complex ones.”
Fair. I’d just shown them building a cookie clicker. So I decided to prove them wrong in the most direct way I could think of. I’d build something genuinely complex. Something I had zero experience with. Something that would fail loudly if the agents couldn’t handle it.
I’d never built a game. Not a simple one. Not ever.
I picked a cybersecurity tower defense game with a server-authoritative WebSocket engine, a Factorio-inspired production chain mechanic, real-time attack waves, and a multi-service deployment across Vercel and Fly.io. Then I typed one command, went to make dinner, and came back to a deployed, playable game: PipeWar.
This is how the agents actually work — the architecture, the prompt structure, the safety mechanisms, the memory system — and how you can build the same thing yourself.
The real problem with single-AI coding
When you ask one AI to architect, build, test, and deploy in a single conversation, it context-switches constantly. It forgets the auth model it defined 20 minutes ago. It skips tests because you didn’t explicitly ask. It makes UX decisions a designer would reject.
The problem isn’t intelligence. The problem is scope.
Real engineering teams don’t work this way. An architect designs the system. A designer specs the screens. A developer builds from those specs. QA tries to break it. Security audits before launch. Every person has a defined role, defined inputs, and a clean handoff to the next.
I built that structure as AI agents — Navox Agents. 8 specialists for Claude Code, each scoped to one job, each running on its own context window, orchestrated to work together the way a real team does.
The architecture: orchestration, handshakes, and three types of human control
Most multi-agent write-ups show you a diagram. What they skip are the mechanisms that make it actually work. There are three: orchestration, handshakes, and human-in-the-loop controls. Each is a different thing.
Orchestration is how the team is managed. The Architect is not just a design agent — it’s the active project manager for the entire build. When you run /agency-run, the Architect reads your prompt, decides which agents are needed, determines the sequence, identifies what runs in parallel, and produces a RECOMMENDED TEAM block before a single line of code is written:
RECOMMENDED TEAM:
1. Architect — DESIGN — system design before anyone builds
2. UX — FLOW → SPEC — map every screen and state
3. Security — DESIGN-REVIEW — auth model audit before build starts
PARALLEL AGENTS (can run simultaneously):
• UX + Security - both depend only on the Architect's output
BLOCKERS TO RESOLVE FIRST:
• Auth strategy undefined - need to know: JWT or session-based?

No agent decides on its own when to start. No agent picks up your raw prompt and starts guessing. The Architect maps the team, flags blockers, and briefs each agent specifically.
Handshakes are how knowledge moves between agents. When the Architect finishes, it doesn’t pass a doc and move on. It produces structured handoff notes — a prepared brief for each downstream agent containing exactly what that agent needs to start its specific job:
HANDOFF NOTES:
→ UI/UX Agent: Design login, signup, password reset flows.
Auth is JWT. No OAuth in v1. Show token expiry state.
→ Full Stack Agent: Implement JWT auth per the security model.
Access token: 15 min. Refresh: stored in Redis, rotated on use.
→ Security Agent: Auth model uses short-lived JWTs + Redis refresh.
Threat surface: login endpoint, token refresh, session fixation.
→ DevOps Agent: Deploy to Vercel (frontend) + Cloudflare Workers (backend).

The UX agent doesn’t read the full system design doc. It receives the UX brief extracted from it. Security receives a different brief from the same source. Same document, different handshakes. This is what keeps agents from contradicting each other — they don’t share a context window and they don’t read each other’s work directly.
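To make the handshake idea concrete, here is a minimal sketch of how per-agent briefs could be represented in code. The `HandoffNote` type and `make_handoffs` function are my own illustrative names, not anything the plugin actually ships — the point is only that each downstream agent gets its own slice of the design, never the whole document:

```python
from dataclasses import dataclass

@dataclass
class HandoffNote:
    to_agent: str   # downstream agent name
    brief: str      # only the slice of the design this agent needs

def make_handoffs(design: dict[str, str]) -> list[HandoffNote]:
    """Split one design document into per-agent briefs.

    `design` maps a section name (e.g. "ux", "security") to its text.
    Each agent receives only its own section -- this is what keeps
    contexts isolated and agents from contradicting each other.
    """
    return [HandoffNote(to_agent=agent, brief=text)
            for agent, text in design.items()]

# One source document, different handshakes:
design = {
    "ux": "Design login/signup/reset flows. Auth is JWT, no OAuth in v1.",
    "security": "Threat surface: login endpoint, token refresh, session fixation.",
}
notes = make_handoffs(design)
```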
Human-in-the-loop controls come in three distinct types, and understanding the difference matters:
GATE — a hard stop before downstream agents can proceed. The Architect produces its design. Nothing moves until you type APPROVED. Agents that hit a gate output:
⚠️ HITL REQUIRED — GATE
Please review and respond: APPROVED | REVISION NEEDED: [notes]

CHECKPOINT — a review point between stages. Lower stakes. You scan the output and respond CONTINUE or FEEDBACK: [notes]. The Local Review agent runs here — it starts your app locally, opens the browser, takes a screenshot, and waits for you to type LGTM, FEEDBACK: [what to change], or STOP. If you say FEEDBACK, it loops back to Full Stack with your notes. If you say STOP, it kills the server and exits immediately.
ESCALATION — the agent self-pauses when it hits something it cannot or should not decide alone: two valid approaches with a business-level tradeoff, a destructive action like dropping a table, or anything that contradicts the Architect’s design. The agent stops and outputs exactly what decision you need to make before it continues.
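The checkpoint loop above is simple enough to sketch as a response handler. This is an illustration of the described behavior under the three verbs, not the plugin's actual implementation:

```python
def handle_checkpoint(response: str) -> str:
    """Sketch of the Local Review checkpoint loop: LGTM proceeds,
    FEEDBACK loops back to Full Stack with notes, STOP exits."""
    verb, _, notes = response.partition(":")
    verb = verb.strip().upper()
    if verb == "LGTM":
        return "proceed"
    if verb == "FEEDBACK":
        return "loop back to Full Stack with: " + notes.strip()
    if verb == "STOP":
        return "kill server and exit"
    return "unrecognized response, ask again"
```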
On top of all three, there’s a dangerous command interception layer. Commands matching rm -rf, drop, truncate, --force, production, or deploy surface for human approval before execution — every time, without exception.
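As a rough sketch, an interception layer like this can be a regex gate in front of the shell. The patterns below are the ones the article lists; the function names are my own:

```python
import re

# Patterns that always require human approval before execution.
DANGEROUS_PATTERNS = [
    r"\brm\s+-rf\b", r"\bdrop\b", r"\btruncate\b",
    r"--force\b", r"\bproduction\b", r"\bdeploy\b",
]
_DANGEROUS_RE = re.compile("|".join(DANGEROUS_PATTERNS), re.IGNORECASE)

def needs_approval(command: str) -> bool:
    """True if the command matches any dangerous pattern."""
    return _DANGEROUS_RE.search(command) is not None

def run_guarded(command: str) -> str:
    """Surface dangerous commands instead of executing them."""
    if needs_approval(command):
        return f"HITL REQUIRED -- approve before running: {command}"
    return f"executing: {command}"
```

Matching on the raw command string is deliberately conservative: `git push --force-with-lease` or a script named `deploy.sh` would also be intercepted, which is the right failure mode for a gate that must never miss.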
How to build an agent: the actual prompt structure
Each agent is a single markdown file. No code. No dependencies. No platform. Just a system prompt that Claude Code loads from .claude/agents/.
Here’s the skeleton every agent in the team is built on:
---
name: _agentname
description: One sentence. What this agent does and when Claude
should load it. Include trigger keywords.
tools: Read, Write, Edit, Bash, Glob, Grep
model: claude-sonnet-4-6
---
## Identity
You are a [role] specialist with [X] years shipping [domain].
You think in [framing]. You are [position in team].
## Role in the Team
You own [specific slice]. You never [what this agent must not do].
You receive from: [upstream agent] — [what you get].
You hand off to: [downstream agent] — [what you produce].
## Modes
### [MODE: PLAN]
Entry point when the user isn't sure what they need.
Deliver: situation assessment + recommended next mode.
### [MODE: PRIMARY]
Full execution. Deliver: [specific artifacts].
Never omit: [non-negotiables].
### [MODE: VERIFY]
Check the work. Deliver: [pass/fail findings + fixes].
## Hard constraints
- Never [action that belongs to another agent]
- Always [non-negotiable behavior]
- Never proceed past a GATE without explicit human approval —
output ⚠️ HITL REQUIRED and state exactly what's needed
## Project memory
At the start of every run:
cat .claude/memory/agentname.md 2>/dev/null || echo "No memory yet"
After completing your task, update your memory with decisions made,
patterns observed, and what to remember for next time.
Two things worth pointing out. First: model routing. The Architect and Security agents run on claude-opus-4-6 — the hardest thinking jobs get the most capable model. Every other agent runs on claude-sonnet-4-6. This is intentional and it matters for cost at scale.
Second: the memory system. Every agent reads and writes its own memory file at .claude/memory/agentname.md. There's also a shared .claude/project-memory.md that the orchestrator updates after every run. The agents remember stack decisions, auth patterns, what failed last time, and why certain choices were made — across sessions. For a solo builder this is significant. The team never starts from zero.
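The memory convention is just files on disk, so it is easy to mirror outside the agent prompt. A minimal sketch of the read/append cycle, with function names of my own choosing:

```python
from pathlib import Path

MEMORY_DIR = Path(".claude/memory")

def read_memory(agent: str) -> str:
    """Equivalent of: cat .claude/memory/<agent>.md || echo "No memory yet"."""
    path = MEMORY_DIR / f"{agent}.md"
    return path.read_text() if path.exists() else "No memory yet"

def append_memory(agent: str, note: str) -> None:
    """Append a decision or observed pattern so the next run starts warm."""
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    with (MEMORY_DIR / f"{agent}.md").open("a") as f:
        f.write(f"- {note}\n")
```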
The modes: how you actually talk to the agents
Every agent supports PLAN as a safe entry point. From there, each has specific modes for specific workflows. You can run the full chain or reach into any single agent directly.
The pattern is consistent across all agents: a PLAN mode that scopes the work, execution modes that do it, and verification modes that check the result. Security's LAUNCH-AUDIT is a hard gate — nothing deploys without a pass/fail verdict from it.
# Run the whole team end to end
/navox-agents:agency-run Build a task manager with user auth and Supabase
# Reach into one agent for a specific job
/navox-agents:architect DIAGNOSE
/navox-agents:security LAUNCH-AUDIT
/navox-agents:fullstack DEBUG — here's the error: [paste]
/navox-agents:qa REGRESSION

The stress test: PipeWar
A cookie clicker is a proof of concept. I needed something that would genuinely break an under-engineered system — multiple services, a real-time game engine, WebSocket communication, production deployment across two platforms, and enough complexity that any gap in the agent coordination would show up immediately.
PipeWar is a cybersecurity tower defense game. You build a factory on a 20×20 grid — miners extract ore, smelters produce plates, assemblers combine inputs into circuits. Connect everything with directional conveyor belts. Win condition: produce 20 Advanced Circuits while keeping uptime above 95%.
Iron Ore ──→ Smelter ──→ Iron Plate ──────────────────────┐
├──→ Assembler ──→ Advanced Circuit
Copper Ore ──→ Smelter ──→ Copper Plate ──→ Copper Wire ───┘

The twist: once your factory generates enough traffic, attack waves spawn from the east edge. The attackers are real threat types — DDoS bots swarm in numbers, SQL injection probes seek undefended paths, Zero-Day Exploits arrive every fifth wave as 300 HP bosses. Your defenses are real security tools — Rate Limiters, WAFs, Auth Middleware, Circuit Breakers — each with distinct mechanics.
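The production chain reduces to a simple recipe-consumption loop per engine tick. This is a toy sketch assuming one unit of throughput per machine per tick and ignoring belts and placement entirely; PipeWar's real 20 tick/second engine is more involved:

```python
from collections import Counter

# Simplified recipes: (inputs consumed, output produced) per tick.
RECIPES = {
    "smelter":   ({"iron_ore": 1}, "iron_plate"),
    "assembler": ({"iron_plate": 1, "copper_wire": 1}, "advanced_circuit"),
}

def tick(inventory: Counter, machines: list[str]) -> Counter:
    """Run one engine tick: each machine fires if its inputs are available."""
    for machine in machines:
        inputs, output = RECIPES[machine]
        if all(inventory[item] >= n for item, n in inputs.items()):
            for item, n in inputs.items():
                inventory[item] -= n
            inventory[output] += 1
    return inventory

inv = Counter({"iron_ore": 2, "copper_wire": 1})
tick(inv, ["smelter", "assembler"])
```

Because the loop runs server-side, the client only renders state it receives over the WebSocket — the server-authoritative design the Architect chose.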
The Architect chose the stack: Next.js 15 with TypeScript and HTML Canvas on the frontend, FastAPI running a 20 tick/second server-authoritative game engine on the backend, SQLite via aiosqlite, WebSocket for real-time state sync, deployed across Vercel and Fly.io. I didn’t pick any of it. I approved it.
Then I ran one command:
/navox-agents:agency-run Build PipeWar — a Factorio-inspired tower defense game
themed around cybersecurity. 20x20 grid, production chains, attack waves,
WebSocket real-time engine. Deploy to Vercel + Fly.io.

What the build actually looked like
The orchestrator paused twice while I was cooking. Once to confirm login to Fly.io — the DevOps agent intercepted the deploy command and surfaced it before executing. Once at the Local Review checkpoint — the agent started the app, opened the browser, took a screenshot, and waited. I walked over, typed LGTM, went back to the kitchen.
Three hours later: deployed, playable, tests passing.
After launch I found 8 production bugs. The production chain wasn’t producing circuits. The circuit breaker defense wasn’t activating. Belt direction glyphs showed question marks. I ran:
/navox-agents:architect DIAGNOSE

The Architect scanned every file in the codebase and returned all 8 bugs with exact files, exact lines, and exact fixes. I confirmed. Full Stack fixed all 8 and wrote 9 new unit tests. 65 tests passing. DevOps redeployed.
Something unexpected happened during that parallel run. Claude Code pulled in an agent from a completely different plugin I had installed — without me asking. It saw that all my agents were busy working in parallel and nobody was reviewing the code being written, so it brought in an outside agent to fill the gap. Like a project manager who notices the team is at capacity and calls in a freelance reviewer. I didn’t orchestrate that. Claude Code did. That moment told me something fundamental about how multi-agent systems actually work at scale — the orchestration layer isn’t just following your instructions. It’s evaluating the situation.
The context isolation insight: after 8 hours of continuous agent work, my main Claude Code session had used 26% of its context window. Each agent runs in its own isolated context — its own token budget, its own reasoning space. They can’t contaminate each other’s state. This is the architecture decision most people miss when building multi-agent systems. Isolation isn’t a limitation. It’s what makes long-running, parallel work possible without the whole system degrading.
Install and run it yourself
/plugin marketplace add https://github.com/navox-labs/agents
/plugin install navox-agents
/reload-plugins

If you hit an SSH error the first time:
git config --global url."https://github.com/".insteadOf "git@github.com:"

For a new project, copy a stack template first — the agents read it automatically and never ask you to re-explain your setup:
cp ~/.claude/templates/nextjs.CLAUDE.md ./CLAUDE.md # Next.js + Vercel
cp ~/.claude/templates/python-fastapi.CLAUDE.md ./CLAUDE.md # FastAPI + Fly.io
cp ~/.claude/templates/node-api.CLAUDE.md ./CLAUDE.md # Express + Railway

Then start with the Architect:
/navox-agents:architect PLAN

It reads your request, maps the team, flags blockers, and tells you exactly which agents to run and in what order. From there, either follow its recommendations manually or hand everything to the orchestrator:
/navox-agents:agency-run Build a [describe what you want]

The agents are free, MIT licensed, and your code never leaves your machine. Everything is markdown — fork it, modify the prompts, build your own specialist. The source is at github.com/navox-labs/agents. PipeWar — built, debugged, and deployed entirely by the agent team — is playable here. The code is public if you want to see exactly what 8 agents produce in three hours.
If 2026 is the year of the AI team, the interesting question isn’t whether agents can build complex software. PipeWar already answers that. The question is how you design the team.
Built by Navox Labs

