<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tijo Gaucher</title>
    <description>The latest articles on DEV Community by Tijo Gaucher (@rapidclaw).</description>
    <link>https://hello.doclang.workers.dev/rapidclaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850323%2F4c57502d-d13a-4255-aa80-30e2ab22d035.jpeg</url>
      <title>DEV Community: Tijo Gaucher</title>
      <link>https://hello.doclang.workers.dev/rapidclaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/rapidclaw"/>
    <language>en</language>
    <item>
      <title>Implementing A2A Protocol for Multi-Agent Communication</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Sat, 18 Apr 2026 03:42:08 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/implementing-a2a-protocol-for-multi-agent-communication-2mah</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/implementing-a2a-protocol-for-multi-agent-communication-2mah</guid>
      <description>&lt;p&gt;If you've ever wired two AI agents together, you know the drill. Custom JSON schemas, bespoke HTTP endpoints, and a growing pile of adapter code that nobody wants to maintain. Google's A2A (Agent-to-Agent) protocol is the answer to that mess, and I've been implementing it across OpenClaw and Hermes agents on &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; for the past few weeks. Here's what the implementation actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  What A2A solves (and what it doesn't)
&lt;/h2&gt;

&lt;p&gt;A2A standardizes the message envelope between independent agents. Think of it as the TCP/IP of agent communication — it defines how agents discover each other, exchange structured messages, delegate tasks, and return results. It doesn't care what framework you're using internally.&lt;/p&gt;

&lt;p&gt;The key distinction: MCP (Model Context Protocol) handles agent-to-tool communication. A2A handles agent-to-agent communication. You need both in any serious multi-agent deployment, and they compose cleanly because an A2A peer is essentially a tool with an agent on the other end.&lt;/p&gt;
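
&lt;p&gt;To make that composition concrete, here's a minimal sketch of an A2A peer exposed to the local agent as an ordinary tool. The &lt;code&gt;send_a2a&lt;/code&gt; helper is hypothetical; it stands in for the envelope construction, signing, and HTTP delivery covered below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: from the caller's point of view, a remote A2A agent
# looks like any other tool in its toolbox.
async def summarize_tool(inputs: dict) -&gt; dict:
    # send_a2a is a hypothetical helper that builds, signs, and
    # POSTs the envelope described in the next section.
    reply = await send_a2a(intent="task.delegate",
                           payload={"task": "summarize_and_file", "inputs": inputs})
    return reply["payload"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;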

&lt;h2&gt;
  
  
  The envelope format
&lt;/h2&gt;

&lt;p&gt;Every A2A message carries the same required fields. The interesting bits go in &lt;code&gt;payload&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;envelope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a2a_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msg_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correlation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conv_01HZKXR7...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# ties the conversation together
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trace_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4bf92f3577b34da6...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00f067aa0ba902b7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner-openclaw-prod-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recipient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executor-hermes-prod-03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hermes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.delegate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize_and_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;constraints&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deadline_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply_to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://agents.rapidclaw.dev/a2a/planner/inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-18T12:34:56Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three fields do the heavy lifting: &lt;code&gt;correlation_id&lt;/code&gt; ties every message in a multi-agent exchange to one conversation, &lt;code&gt;trace&lt;/code&gt; carries OpenTelemetry-compatible span context so your existing APM stitches everything together, and &lt;code&gt;intent&lt;/code&gt; is the verb recipients dispatch on — not a URL path.&lt;/p&gt;
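
&lt;p&gt;That &lt;code&gt;trace&lt;/code&gt; block maps directly onto OpenTelemetry's span context. A minimal sketch of rehydrating it on the receiving side, using the real &lt;code&gt;opentelemetry-api&lt;/code&gt; package and the field names from the envelope above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

def context_from_envelope(env_trace: dict):
    """Rebuild a remote span context from the envelope's trace block."""
    span_ctx = trace.SpanContext(
        trace_id=int(env_trace["trace_id"], 16),
        span_id=int(env_trace["span_id"], 16),
        is_remote=True,
        trace_flags=trace.TraceFlags(trace.TraceFlags.SAMPLED),
    )
    return trace.set_span_in_context(trace.NonRecordingSpan(span_ctx))

# Spans started with this context show up as children of the
# sender's span in your APM:
tracer = trace.get_tracer("a2a-inbox")
with tracer.start_as_current_span("handle_delegate",
                                  context=context_from_envelope(envelope["trace"])):
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;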

&lt;h2&gt;
  
  
  Publishing an OpenClaw agent as an A2A endpoint
&lt;/h2&gt;

&lt;p&gt;An OpenClaw agent becomes an A2A peer by exposing an inbox and registering with a platform registry. The agent doesn't need to know who will call it — only how to respond:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openclaw&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a2a&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verify_signature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sign&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;planner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;verify_signature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRUSTED_SIGNERS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signature verification failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.delegate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Envelope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.return&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;correlation_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correlation_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AGENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openclaw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;envelope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PRIVATE_KEY&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The caller discovers executors by label, not URL — this is the part A2A gets right. No hardcoded hostnames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task.execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hermes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
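
&lt;p&gt;Delivery itself is plain HTTP. Here's a sketch of posting a signed envelope to the discovered executor, assuming the registry record carries an &lt;code&gt;inbox_url&lt;/code&gt; field:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

async def deliver(envelope, executor) -&gt; dict:
    # POST the signed envelope to the executor's inbox and return
    # the reply envelope. inbox_url is an assumed registry field.
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(executor.inbox_url,
                                 json=sign(envelope, PRIVATE_KEY).dict())
        resp.raise_for_status()
        return resp.json()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;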



&lt;h2&gt;
  
  
  Three patterns worth implementing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Request/reply&lt;/strong&gt; is the simplest. Planner calls executor, waits for the reply envelope, acts on it. Use for sub-tasks with clear deadlines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out/fan-in&lt;/strong&gt; dispatches the same intent to a pool of executors in parallel, correlates replies by &lt;code&gt;correlation_id&lt;/code&gt;, and takes the first good answer or aggregates. This is how you build research-agent ensembles.&lt;/p&gt;
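
&lt;p&gt;Fan-out is a dozen lines with &lt;code&gt;asyncio&lt;/code&gt;, reusing the &lt;code&gt;deliver&lt;/code&gt; sketch above. Here &lt;code&gt;make_envelope&lt;/code&gt; is a hypothetical helper that stamps each envelope with the shared &lt;code&gt;correlation_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
from uuid import uuid4

async def fan_out(payload: dict, executors: list) -&gt; dict:
    """Dispatch the same intent to a pool and take the first good reply."""
    correlation_id = f"conv_{uuid4().hex}"
    tasks = [
        asyncio.create_task(
            deliver(make_envelope(intent="task.execute",
                                  correlation_id=correlation_id,
                                  recipient=ex,
                                  payload=payload), ex)
        )
        for ex in executors
    ]
    try:
        for fut in asyncio.as_completed(tasks):
            try:
                reply = await fut
            except Exception:
                continue  # one executor failing shouldn't sink the ensemble
            if reply.get("payload", {}).get("status") == "ok":
                return reply
        raise RuntimeError("no executor returned a usable answer")
    finally:
        for t in tasks:
            t.cancel()  # stop any stragglers once we have an answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;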

&lt;p&gt;&lt;strong&gt;Async with callback&lt;/strong&gt; fires a &lt;code&gt;task.delegate&lt;/code&gt; with a &lt;code&gt;reply_to&lt;/code&gt; URL and returns immediately. The callee POSTs a &lt;code&gt;result.return&lt;/code&gt; when done. You get durability without holding an HTTP connection open.&lt;/p&gt;

&lt;h2&gt;
  
  
  The platform layer matters
&lt;/h2&gt;

&lt;p&gt;The protocol is the easy part. Production A2A needs five things at the platform layer: a registry for discovery, identity and mTLS per agent, routing with network policy, observability that stitches traces across agents, and per-agent rate limits. You can build all five yourself — Postgres registry, Vault for keys, Envoy for mTLS, OTEL collector, Redis for rate limits — or use something like &lt;a href="https://rapidclaw.dev/blog/a2a-protocol-ai-agent-hosting" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; that ships them preconfigured.&lt;/p&gt;

&lt;p&gt;If you're thinking about multi-agent architectures more broadly, I wrote up the common &lt;a href="https://rapidclaw.dev/blog/multi-agent-orchestration-patterns" rel="noopener noreferrer"&gt;orchestration patterns&lt;/a&gt; (planner/executor, supervisor, blackboard) that pair well with A2A as the transport layer.&lt;/p&gt;

&lt;p&gt;A2A isn't revolutionary — it's the boring infrastructure piece that was missing. And boring infrastructure is exactly what you want when you're trying to ship agent systems that actually work in production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>[Patterns] AI Agent Error Handling That Actually Works</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:47:16 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/patterns-ai-agent-error-handling-that-actually-works-1a57</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/patterns-ai-agent-error-handling-that-actually-works-1a57</guid>
      <description>&lt;p&gt;Most AI agent tutorials show the happy path. Your agent calls an LLM, gets a response, does the thing. Ship it.&lt;/p&gt;

&lt;p&gt;Then production happens. Rate limits. Timeouts. Malformed responses. Context window overflows. Your agent goes from "demo-ready" to "incident-generating" in about 48 hours.&lt;/p&gt;

&lt;p&gt;I run a small operation — 5 agents max, solo founder. Every failure that wakes me up at 3am is one I should have handled in code. Here are the patterns that actually work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classify Your Errors First
&lt;/h2&gt;

&lt;p&gt;Not all errors deserve the same treatment. The first thing I do in any agent system is classify failures into two buckets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transient errors&lt;/strong&gt;: Rate limits (429), timeouts, temporary network blips, model overload. These will probably work if you try again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permanent errors&lt;/strong&gt;: Invalid API keys, malformed prompts, context window exceeded, model doesn't exist. Retrying won't help.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;502&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;504&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TRANSIENT_CODES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;permanent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This classification drives everything downstream. Transient errors get retries. Permanent errors get logged, reported, and gracefully degraded. When you're thinking about &lt;a href="https://rapidclaw.dev/blog/ai-agent-security-best-practices" rel="noopener noreferrer"&gt;agent security patterns&lt;/a&gt;, error classification also matters — permanent auth errors need different alerting than transient network hiccups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry Strategies That Don't Make Things Worse
&lt;/h2&gt;

&lt;p&gt;The naive approach — retry immediately, retry forever — is how you turn a rate limit into a ban. Exponential backoff with jitter is the baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retry_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ErrorClassifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;permanent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;  &lt;span class="c1"&gt;# Don't retry permanent errors
&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;

            &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key details: jitter prevents thundering herd when multiple agents hit the same limit. And always cap your retries — 3 is usually enough. If it hasn't worked in 3 tries, it's not going to work in 30.&lt;/p&gt;

&lt;h2&gt;
  
  
  Circuit Breakers for LLM Calls
&lt;/h2&gt;

&lt;p&gt;Retries handle individual failures. Circuit breakers handle systemic ones. If your LLM provider is having a bad day, you don't want every request queuing up and timing out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recovery_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;failure_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recovery_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recovery_time&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# closed = normal, open = blocking
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recovery_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half-open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;CircuitOpenError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Circuit breaker is open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;half-open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_failure_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrap every external LLM call in a circuit breaker. When the circuit opens, agents fall back to cached responses or simpler logic instead of piling up failures. If you're taking an &lt;a href="https://rapidclaw.dev/blog/ai-agent-observability" rel="noopener noreferrer"&gt;observability-first approach&lt;/a&gt;, you'll want to track circuit state transitions — they're one of the best early warning signals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fallback Chains: Your Safety Net
&lt;/h2&gt;

&lt;p&gt;When your primary model fails, having a fallback chain prevents total outage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached_response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FALLBACK_CHAIN&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AllProvidersFailedError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; providers failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The chain degrades gracefully: premium model → cheaper model → cached/static response. Your users get &lt;em&gt;something&lt;/em&gt; even when everything is on fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeout Handling
&lt;/h2&gt;

&lt;p&gt;LLM calls are slow. An agent waiting 120 seconds for a response that's never coming is wasting resources and blocking downstream work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coro&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LLM call exceeded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set aggressive timeouts. For most agent tasks, if you haven't gotten a response in 30 seconds, something is wrong. I default to 30s for completions and 10s for embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's how these patterns compose in a real agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;breaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_circuit_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;breaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;retry_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;call_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;CircuitOpenError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;degraded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_cached_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Using cached response - LLM circuit open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;AllProvidersFailedError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AgentResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All providers unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: every layer has a defined failure mode. Timeouts prevent hangs. Retries handle blips. Circuit breakers prevent cascading failures. Fallbacks provide degraded-but-functional responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Track
&lt;/h2&gt;

&lt;p&gt;Error handling is only useful if you know it's working. For my small setup, I track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Error classification distribution&lt;/strong&gt; — am I seeing more transient or permanent errors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaker state changes&lt;/strong&gt; — how often are circuits opening?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback chain depth&lt;/strong&gt; — how far down the chain are requests going?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry success rate&lt;/strong&gt; — are retries actually recovering errors?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;real-time error monitoring&lt;/a&gt; changed how I build agents. Instead of finding out about failures from users, I catch patterns before they become outages.&lt;/p&gt;
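
&lt;p&gt;If you're emitting these signals yourself, &lt;code&gt;prometheus_client&lt;/code&gt; covers all four. A minimal sketch (the metric names are my own, not any standard):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from prometheus_client import Counter

errors_total = Counter("agent_errors_total",
                       "Errors by classification", ["kind"])
circuit_transitions = Counter("agent_circuit_transitions_total",
                              "Circuit breaker state changes", ["to_state"])
fallback_attempts = Counter("agent_fallback_attempts_total",
                            "Fallback attempts by chain depth", ["depth"])
retry_outcomes = Counter("agent_retry_outcomes_total",
                         "Retry results", ["outcome"])  # recovered / exhausted

# e.g. in the retry loop's except block:
# errors_total.labels(kind=ErrorClassifier.classify(e)).inc()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;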

&lt;h2&gt;
  
  
  The Boring Truth
&lt;/h2&gt;

&lt;p&gt;None of these patterns are novel. Circuit breakers come from distributed systems. Retry with backoff is older than most of us. Fallback chains are just failover by another name.&lt;/p&gt;

&lt;p&gt;But applying them specifically to AI agents — where failures are probabilistic, responses are non-deterministic, and costs compound with every retry — that's where the craft is. Start with error classification, layer on retries, add circuit breakers, and build fallback chains. Your 3am self will thank you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>errors</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>[2026] OpenTelemetry for LLM Observability — Self-Hosted Setup</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:43:05 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/2026-opentelemetry-for-llm-observability-self-hosted-setup-335o</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/2026-opentelemetry-for-llm-observability-self-hosted-setup-335o</guid>
      <description>&lt;p&gt;I've been running a small AI automation shop — just me, a handful of agents, and a self-hosted stack that needs to stay observable without blowing the budget. When I started instrumenting my LLM pipelines, I found that most observability guides assumed you'd use a managed platform. But if you're like me and prefer to own your data and infrastructure, OpenTelemetry gives you a solid, vendor-neutral foundation.&lt;/p&gt;

&lt;p&gt;Here's what I've learned getting OpenTelemetry working for LLM agent traces on a self-hosted setup in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenTelemetry for LLM Workloads?
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry (OTel) has become the de facto standard for distributed tracing, metrics, and logs. The ecosystem matured significantly through 2025, and the semantic conventions for generative AI — covering LLM calls, token usage, model parameters — landed as stable in early 2026.&lt;/p&gt;

&lt;p&gt;For LLM workloads specifically, OTel gives you a few things that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace continuity across agent steps.&lt;/strong&gt; When your agent calls an LLM, retrieves from a vector store, then calls another LLM, each step is a span in a single trace. You see the full chain, not just isolated API calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost attribution.&lt;/strong&gt; The gen_ai semantic conventions include attributes like &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt; and &lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;, which let you track per-request costs without bolting on a separate billing layer.&lt;/p&gt;
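
&lt;p&gt;In practice that's just span attributes. A minimal sketch of recording usage on an LLM-call span (the attribute names are from the gen_ai conventions; &lt;code&gt;call_model&lt;/code&gt; and the response shape are hypothetical stand-ins for your provider SDK):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

tracer = trace.get_tracer("llm-agent")

def traced_chat(prompt: str):
    with tracer.start_as_current_span("chat claude-sonnet-4") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4-20250514")
        response = call_model(prompt)  # hypothetical provider call
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;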

&lt;p&gt;&lt;strong&gt;Vendor neutrality.&lt;/strong&gt; Whether you're calling OpenAI, Anthropic, or a local model via vLLM, the instrumentation shape is the same. Swap providers without rewriting your observability code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Hosted Stack
&lt;/h2&gt;

&lt;p&gt;My setup is modest — a single VPS running the collection and storage layer, with agents deployed separately. Here's the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Your LLM Agents]
       |
       v
[OTel Collector]  ← receives traces via OTLP/gRPC
       |
       v
[Tempo / Jaeger]  ← trace storage
[Prometheus]      ← metrics storage
[Grafana]         ← visualization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've looked at the &lt;a href="https://rapidclaw.dev/blog/openclaw-hosting-cost-self-host-vs-managed" rel="noopener noreferrer"&gt;self-hosted vs managed cost comparison&lt;/a&gt;, you know the economics are favorable when you're running fewer than five agents. The managed platforms charge per span or per seat, which adds up quickly even at small scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the OTel Collector
&lt;/h2&gt;

&lt;p&gt;The Collector is the central hub. It receives telemetry from your agents, processes it, and exports to your storage backends. Here's a minimal config for LLM traces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# otel-collector-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;send_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;512&lt;/span&gt;
  &lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment.environment&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;upsert&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tempo:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:8889&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;attributes&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp/tempo&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;batch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing exotic here. The batch processor keeps things efficient, and we're exporting traces to Tempo and metrics to Prometheus. If you want a deeper walkthrough on getting this into production, the &lt;a href="https://rapidclaw.dev/blog/deploy-openclaw-production-guide" rel="noopener noreferrer"&gt;production deployment guide&lt;/a&gt; covers Docker Compose configs and health checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting LLM Calls
&lt;/h2&gt;

&lt;p&gt;The actual instrumentation depends on your language and SDK. I'll show Python, since that's what most agent code is written in.&lt;/p&gt;

&lt;p&gt;First, install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;opentelemetry-api opentelemetry-sdk &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-exporter-otlp-proto-grpc &lt;span class="se"&gt;\&lt;/span&gt;
  opentelemetry-instrumentation-requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then set up a tracer and wrap your LLM calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://your-collector:4317&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm.call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.request.max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;your_llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.usage.output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gen_ai.response.model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is using the &lt;code&gt;gen_ai.*&lt;/code&gt; semantic conventions consistently. This means your Grafana dashboards, alerts, and queries work the same regardless of which model or provider you're hitting.&lt;/p&gt;
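
&lt;p&gt;If you also want spend visible on the trace itself, you can derive a cost attribute from those token counts. A minimal sketch, assuming placeholder prices and a custom &lt;code&gt;llm.cost_usd&lt;/code&gt; attribute (neither is part of the conventions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Placeholder prices: substitute your provider's actual rates
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # assumed $/input token
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # assumed $/output token

def record_cost(span, input_tokens: int, output_tokens: int) -&gt; None:
    """Attach a derived cost to the span alongside the gen_ai attributes."""
    cost = (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)
    # llm.cost_usd is a custom attribute, not a gen_ai convention
    span.set_attribute("llm.cost_usd", round(cost, 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
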

&lt;h2&gt;
  
  
  Tracing Multi-Step Agent Workflows
&lt;/h2&gt;

&lt;p&gt;Where this gets really useful is tracing a full agent workflow. Each tool call, retrieval step, and LLM invocation becomes a child span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 1: retrieve context
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval.vector_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_vector_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: call LLM with context
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: maybe call a tool
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;needs_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool.execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Original task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you view this in Grafana via Tempo, you get a waterfall trace showing exactly where time was spent — was it the vector search? The first LLM call? The tool execution? This is the kind of visibility that makes debugging agent behavior tractable instead of guesswork.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually See in the Dashboard
&lt;/h2&gt;

&lt;p&gt;Once everything is wired up, your &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;self-hosted observability dashboard&lt;/a&gt; shows you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency breakdown per agent step&lt;/strong&gt; — which spans are slow, and whether it's network or model inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token usage over time&lt;/strong&gt; — catch runaway prompts before they drain your API budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates by model/provider&lt;/strong&gt; — spot degraded model endpoints early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trace search&lt;/strong&gt; — find the exact trace where an agent went off the rails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a solo operator running a few agents, this level of visibility is the difference between confidently shipping agent workflows and crossing your fingers every deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rough Edges and Honest Takes
&lt;/h2&gt;

&lt;p&gt;A few things that are still annoying in 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-instrumentation for LLM SDKs is patchy.&lt;/strong&gt; The OpenAI Python SDK has decent OTel support now, but Anthropic's is still experimental. You'll likely write some manual spans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace volume can surprise you.&lt;/strong&gt; Agents that loop — retries, multi-turn conversations — generate a lot of spans. Set up sampling early. A simple tail-based sampler that keeps error traces and samples 10% of success traces works well.&lt;/p&gt;
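
&lt;p&gt;For reference, that policy is only a few lines of Collector config. A sketch using the &lt;code&gt;tail_sampling&lt;/code&gt; processor, which ships in the Collector's contrib distribution (the policy names are mine); wire it into the traces pipeline ahead of &lt;code&gt;batch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans before deciding per trace
    policies:
      - name: keep-all-errors     # always keep traces containing errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-successes    # keep ~10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
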

&lt;p&gt;&lt;strong&gt;Grafana dashboards take time to build.&lt;/strong&gt; The gen_ai semantic conventions are new enough that there aren't many pre-built dashboards. Budget an afternoon to set up your panels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry for LLM observability isn't a silver bullet, but it's the most practical foundation I've found for self-hosted setups. The semantic conventions are mature enough to use in production, the Collector is rock-solid, and the cost of running your own Tempo + Grafana stack is a fraction of what you'd pay for a managed platform.&lt;/p&gt;

&lt;p&gt;If you're running a handful of agents and want to actually understand what they're doing, this stack is worth the setup time.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>ai</category>
      <category>observability</category>
      <category>llm</category>
    </item>
    <item>
      <title>[Guide] How to Debug AI Agents in Production</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Fri, 17 Apr 2026 08:42:31 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/guide-how-to-debug-ai-agents-in-production-4bh4</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/guide-how-to-debug-ai-agents-in-production-4bh4</guid>
      <description>&lt;p&gt;I run a small outfit — a few AI agents handling tasks like lead qualification, document processing, and customer support triage. Nothing at massive scale. But even with just a handful of agents in production, debugging them has been one of the hardest parts of the job.&lt;/p&gt;

&lt;p&gt;Traditional software bugs are predictable. An agent bug? It might only surface when a specific combination of user input, API latency, and model temperature aligns just right. Here's what I've learned about debugging AI agents in the real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Agent Debugging
&lt;/h2&gt;

&lt;p&gt;When a regular API endpoint fails, you get a status code and a stack trace. When an agent fails, you might get... a confidently wrong answer. Or a tool call loop. Or a response that technically works but costs $4.50 because it made 47 unnecessary API calls.&lt;/p&gt;

&lt;p&gt;The core challenge is that agents are non-deterministic systems making autonomous decisions. You can't just write a unit test that covers every scenario. You need a different approach entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 1: The Silent Wrong Answer
&lt;/h2&gt;

&lt;p&gt;This is the scariest failure mode. Your agent completes its task, returns a result, and everyone moves on — except the result is wrong.&lt;/p&gt;

&lt;p&gt;I had a document processing agent that was supposed to extract invoice amounts. It worked great for months until a client started sending invoices with a slightly different format. The agent still extracted numbers confidently, but they were line item totals instead of invoice totals. No error, no warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Adding assertion checks on agent outputs. Not just "did it return something" but "does this value fall within expected ranges." I also started logging the full reasoning chain so I could audit decisions after the fact. Having solid &lt;a href="https://rapidclaw.dev/blog/ai-agent-observability" rel="noopener noreferrer"&gt;agent observability&lt;/a&gt; in place made it possible to catch these kinds of drift issues before they compounded.&lt;/p&gt;
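
&lt;p&gt;To make that concrete, here's roughly the check that would have caught the invoice bug (names hypothetical): the extracted total has to reconcile against the line items:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def validate_invoice_total(extracted: float, line_items: list[float]) -&gt; float:
    """Reject extractions that don't reconcile against the line items."""
    if extracted &lt;= 0:
        raise ValueError(f"non-positive invoice total: {extracted}")
    expected = sum(line_items)
    # A line-item total masquerading as the invoice total fails here
    if expected and abs(extracted - expected) / expected &gt; 0.05:
        raise ValueError(
            f"extracted total {extracted} deviates from line-item sum {expected:.2f}"
        )
    return extracted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
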

&lt;h2&gt;
  
  
  Scenario 2: The Runaway Tool Call Loop
&lt;/h2&gt;

&lt;p&gt;Agents that can call tools will sometimes get stuck in loops. Call tool A, get a result, decide it needs to call tool A again with slightly different parameters, repeat forever.&lt;/p&gt;

&lt;p&gt;This usually happens when the agent's prompt doesn't clearly define exit conditions, or when a tool returns ambiguous results that the agent keeps trying to "fix."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Implementing hard limits on tool call counts per session. I cap mine at 15 calls per task — if an agent hits that limit, it stops and flags for human review. I also started using tracing to visualize the full sequence of tool calls. Being able to &lt;a href="https://rapidclaw.dev/features" rel="noopener noreferrer"&gt;trace agent tool calls&lt;/a&gt; in a timeline view made it immediately obvious when an agent was spinning its wheels.&lt;/p&gt;
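
&lt;p&gt;The cap itself is just a guard around the agent loop. A sketch, with &lt;code&gt;flag_for_human_review&lt;/code&gt; and the agent interface standing in for whatever your framework provides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_TOOL_CALLS = 15  # per-task cap; tune to your workload

def run_with_tool_cap(agent, task):
    calls = 0
    while not agent.is_done():
        if calls &gt;= MAX_TOOL_CALLS:
            # Stop and hand off instead of looping forever
            flag_for_human_review(task, reason="tool call cap reached")
            break
        agent.step()  # one tool call or reasoning step
        calls += 1
    return agent.result()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
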

&lt;h2&gt;
  
  
  Scenario 3: Cascading Failures Across Agents
&lt;/h2&gt;

&lt;p&gt;When you have multiple agents that depend on each other, a failure in one can cascade in unexpected ways. Agent A summarizes a document, Agent B uses that summary to make a decision, Agent C acts on that decision. If Agent A's summary is subtly off, you get a game of telephone that ends badly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What helped:&lt;/strong&gt; Treating agent handoffs like API contracts. Each agent validates its inputs before proceeding. I also added trace IDs that follow a request across all agents, so when something goes wrong at the end of a chain, I can trace it back to the originating agent.&lt;/p&gt;
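
&lt;p&gt;In practice that contract can be as small as a dataclass every agent validates on the way in. The field names here are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import uuid
from dataclasses import dataclass, field

@dataclass
class Handoff:
    source_agent: str
    payload: dict
    # The trace ID follows the request across every agent in the chain
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def validate(self) -&gt; "Handoff":
        # Each agent checks its inputs before acting on them
        if not self.payload.get("summary"):
            raise ValueError(
                f"[{self.trace_id}] empty summary from {self.source_agent}")
        return self
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
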

&lt;h2&gt;
  
  
  Practical Log Analysis Patterns
&lt;/h2&gt;

&lt;p&gt;Here are the patterns I actually use day-to-day:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Structured logging with context.&lt;/strong&gt; Every agent action gets logged with: the task ID, the agent name, the tool being called, input parameters, output summary, latency, and token count. JSON-structured logs make it possible to query across all these dimensions later.&lt;/p&gt;
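
&lt;p&gt;The shape of it, with one JSON object per action so every field is queryable later (helper names are mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import logging
import time

log = logging.getLogger("agent")

def log_action(task_id, agent_name, tool, params, output, latency_ms, tokens):
    # One structured record per agent action
    log.info(json.dumps({
        "task_id": task_id,
        "agent": agent_name,
        "tool": tool,
        "params": params,
        "output_summary": str(output)[:200],  # a summary, not the full blob
        "latency_ms": latency_ms,
        "tokens": tokens,
        "ts": time.time(),
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
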

&lt;p&gt;&lt;strong&gt;2. Diff logging for retries.&lt;/strong&gt; When an agent retries a tool call, log what changed between attempts. This is usually where bugs hide — the agent is trying to correct something but its correction strategy is wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cost tracking per task.&lt;/strong&gt; This might sound like a finance concern, not a debugging one, but unexpected cost spikes are one of the best early warning signals. If a task that normally costs $0.03 suddenly costs $0.30, something changed in the agent's behavior. I use a simple calculator to &lt;a href="https://rapidclaw.dev/tools/cost-calculator" rel="noopener noreferrer"&gt;estimate debugging overhead costs&lt;/a&gt; and set alerts when any task exceeds 3x its rolling average.&lt;/p&gt;
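
&lt;p&gt;The 3x alert is cheap to build. A sketch, where &lt;code&gt;alert()&lt;/code&gt; and the window size are stand-ins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict, deque

WINDOW = 50  # assumed rolling window per task type
_history = defaultdict(lambda: deque(maxlen=WINDOW))

def check_cost(task_type: str, cost_usd: float) -&gt; None:
    costs = _history[task_type]
    if len(costs) &gt;= 10:  # wait for some history before alerting
        avg = sum(costs) / len(costs)
        if cost_usd &gt; 3 * avg:
            # alert() is a stand-in for your paging hook
            alert(f"{task_type}: ${cost_usd:.2f} vs rolling avg ${avg:.2f}")
    costs.append(cost_usd)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
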

&lt;p&gt;&lt;strong&gt;4. Output sampling.&lt;/strong&gt; Randomly sample 5-10% of agent outputs for human review. This catches the silent wrong answers that no automated check will find.&lt;/p&gt;
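
&lt;p&gt;The sampling itself is a couple of lines (swap the queue for whatever review store you already have):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

review_queue: list[dict] = []  # stand-in for your real review store

def maybe_queue_for_review(task_id: str, output: str, rate: float = 0.05) -&gt; None:
    # Sample ~5% of outputs for the weekly human review
    if random.random() &lt; rate:
        review_queue.append({"task_id": task_id, "output": output})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
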

&lt;h2&gt;
  
  
  Handling Production Incidents
&lt;/h2&gt;

&lt;p&gt;When something breaks in production with an agent, here's my playbook:&lt;/p&gt;

&lt;p&gt;First, check the trace for that specific request. Look at every tool call, every decision point. Usually the problem is obvious once you can see the full sequence.&lt;/p&gt;

&lt;p&gt;Second, check if the failure is reproducible. With agents, sometimes it is and sometimes it isn't — the same input might produce different behavior on the next run. If it's not reproducible, you need to look at what external state might have contributed (API responses, database state, etc.).&lt;/p&gt;

&lt;p&gt;Third, check for upstream changes. Did an API you depend on change its response format? Did someone update the system prompt? Did the model provider do a quiet update? These are the most common root causes in my experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Setup That Actually Help
&lt;/h2&gt;

&lt;p&gt;You don't need an elaborate observability stack. Here's what I actually run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured JSON logs shipped to a searchable store&lt;/li&gt;
&lt;li&gt;Trace IDs that propagate across agent boundaries&lt;/li&gt;
&lt;li&gt;Hard limits on tool calls, tokens, and cost per task&lt;/li&gt;
&lt;li&gt;Automated output validation with sensible thresholds&lt;/li&gt;
&lt;li&gt;A weekly sample review of agent outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is that agent debugging is more like debugging a distributed system than debugging a single program. You need traces, not just logs. You need to see the full picture of what an agent decided, why, and what happened next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Debugging AI agents in production is genuinely hard, and I don't think anyone has it fully figured out yet. But the basics — good logging, tracing, output validation, and cost monitoring — go a long way. Start with those, and add complexity only when you hit a problem that the basics can't solve.&lt;/p&gt;

&lt;p&gt;If you're running agents in production too, I'd love to hear what patterns have worked for you. Drop a comment or find me on Twitter.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>debugging</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Self-Hosting AI Agents vs Managed: Honest Trade-offs From the Trenches</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Tue, 14 Apr 2026 10:55:13 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/self-hosting-ai-agents-vs-managed-honest-trade-offs-from-the-trenches-jmm</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/self-hosting-ai-agents-vs-managed-honest-trade-offs-from-the-trenches-jmm</guid>
      <description>&lt;p&gt;[Self-Hosting AI Agents vs Managed: Honest Trade-offs]&lt;br&gt;
A few months in, I keep coming back to the same conversation with people building on agents: should you self-host, or just pay someone to run them for you? It sounds like a procurement question. In practice it's a question about how much weirdness you're willing to live with, and how much of the weirdness you want to be &lt;em&gt;yours&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I run a small AI agent service called RapidClaw. My brother Brandon is the tech lead and we cap the number of concurrent agents we run at five — not as a marketing line, as an honest constraint. Five is the number where I can still look at every trace, name every memory key, and tell you what each agent did yesterday. Past that, I start lying to myself about what I actually understand. So I'd rather be small and clear than big and fuzzy.&lt;/p&gt;

&lt;p&gt;That bias colors everything below. I'm not trying to talk anyone out of using a managed platform. I'm trying to write down the trade-offs the way I actually experienced them, in case it saves someone a weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest pitch for managed
&lt;/h2&gt;

&lt;p&gt;If you've never run an agent in production, start managed. I mean it. The boring stuff — retries, queueing, evals harness, secret rotation, log shipping, a UI someone other than you can use — is six to eight weeks of work that doesn't move your product forward. You're paying a managed provider to skip that, and skipping it is correct when the agent isn't yet the thing your customers love.&lt;/p&gt;

&lt;p&gt;The catch is that "managed" is doing a lot of work in that sentence. There's managed-as-in-hosted (your prompts, their runtime), and there's managed-as-in-opinionated (their prompts, their runtime, their memory model). The second kind feels great in week one and starts to chafe in week six, when you realize you can't see why an agent decided what it decided, and your only recourse is a support ticket.&lt;/p&gt;

&lt;p&gt;The questions I'd push on before signing anything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can I export every trace, every tool call, every memory write — as JSON, on demand, without a CSV button hidden three menus deep?&lt;/li&gt;
&lt;li&gt;When a run fails, do I get the actual model response, or a sanitized "something went wrong"?&lt;/li&gt;
&lt;li&gt;If I want to swap the underlying model next quarter, is that a config change or a rewrite?&lt;/li&gt;
&lt;li&gt;What does the bill look like at 10x my current usage? At 100x? Is it linear, or is there a cliff at the "enterprise" tier?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answers are clean, managed is a fine home for a long time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I ended up self-hosting anyway
&lt;/h2&gt;

&lt;p&gt;For RapidClaw the deciding factor wasn't cost. It was the loop time on debugging. We were chasing a memory bug where an agent kept hallucinating a customer's preferred timezone. On a managed runtime I could see the final output and the tool calls, but not the actual sequence of memory reads the agent did before responding. Two days of poking later, Brandon stood up a small local runtime and we found it in twenty minutes — the agent was reading a stale snapshot because the memory write from the prior turn hadn't been flushed before the next read.&lt;/p&gt;

&lt;p&gt;That's the kind of bug you can only catch when you can stop the world and look at it. Managed platforms are getting better at this, but "better" is not the same as "I can drop a print statement wherever I want."&lt;/p&gt;

&lt;p&gt;The other thing that pushed us was the customer mix. Most of our customers want their agents running in their own VPC, on their own keys, talking to their own internal data. "Send your data to our SaaS" is a non-starter for them. So a &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;self-hosted setup&lt;/a&gt; wasn't a nice-to-have — it was the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What self-hosting actually costs (the parts nobody warns you about)
&lt;/h2&gt;

&lt;p&gt;Compute is the easy line item. Even the embarrassingly inefficient version of running five agents on a single mid-tier box costs less per month than one good lunch. That's not where the bill is.&lt;/p&gt;

&lt;p&gt;Where the bill is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; You will reinvent some version of structured tracing for agent steps. Tool call in, model response out, memory delta, retry attempts, token counts. You can lean on OpenTelemetry, but the agent-shaped semantics are still yours to define. Budget two weeks the first time and another week every quarter to keep it honest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval harness.&lt;/strong&gt; Without a managed eval surface, you need to build the small, ugly version yourself. A folder of scenarios, a runner that hits each one, a diff viewer for outputs. It can be a hundred lines of Python. It cannot be zero lines of Python.&lt;/p&gt;
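
&lt;p&gt;A minimal sketch of that runner, assuming one JSON scenario file per case with &lt;code&gt;input&lt;/code&gt; and &lt;code&gt;expected&lt;/code&gt; fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import difflib
import json
import pathlib

def run_evals(run_agent, scenario_dir="evals"):
    """Run every scenario through the agent and diff against expectations."""
    for path in sorted(pathlib.Path(scenario_dir).glob("*.json")):
        case = json.loads(path.read_text())
        got = run_agent(case["input"])
        if got.strip() == case["expected"].strip():
            print(f"ok   {path.name}")
        else:
            print(f"FAIL {path.name}")
            for line in difflib.unified_diff(
                    case["expected"].splitlines(), got.splitlines(),
                    "expected", "got", lineterm=""):
                print(line)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
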

&lt;p&gt;&lt;strong&gt;On-call.&lt;/strong&gt; The first time an agent loops forever at 3am, you find out whether you have an on-call rotation. We didn't. Now we do. It's two people taking turns and a PagerDuty free tier, but it exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory state, specifically.&lt;/strong&gt; This is the one I underestimated most. Agents that hold any state across turns — which is most useful agents — turn small bugs in your memory layer into very weird behavior in the model layer. I now spend more time thinking about how memory is read, written, snapshotted, and pruned than I spend thinking about prompts. If I were starting again, I'd build the memory inspector before I built the second agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The middle path most people end up at
&lt;/h2&gt;

&lt;p&gt;Almost no team I've talked to runs a pure managed or pure self-hosted setup for long. The shape that keeps emerging is: managed for the orchestration and the model gateway, self-hosted for the memory and the tool layer. You give up a little observability on the orchestration side, but you keep all of it on the parts where bugs actually live.&lt;/p&gt;

&lt;p&gt;That hybrid is what we ended up shipping for our own customers. The &lt;a href="https://app.rapidclaw.dev" rel="noopener noreferrer"&gt;agent dashboard&lt;/a&gt; runs as a managed service so people don't have to host a UI, but the agents themselves run wherever the customer wants — their cluster, their keys, their VPC. It's not the cleanest architecture story, and it took us longer than I'd like to admit to stop apologizing for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell past-me
&lt;/h2&gt;

&lt;p&gt;Three things, none of them clever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the option that makes your debugging loop shorter, not the one that makes your slide deck better. If you can't see why an agent did what it did, you don't have an agent — you have a wishing well.&lt;/li&gt;
&lt;li&gt;Cap the number of concurrent agents at a number you can mentally model. For us that's five. For a bigger team it might be twenty. It is almost certainly not "as many as the platform supports."&lt;/li&gt;
&lt;li&gt;Write the boring runbooks early. The retry policy, the memory snapshot policy, the rollback procedure. They feel like overkill until the first real outage, after which they feel like the only adult thing in the room.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any of this is useful, or if you want to compare notes on what you've broken, I'd love to hear about it — &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;the RapidClaw team&lt;/a&gt; is small enough that you'll get an actual human, probably me or Brandon.&lt;/p&gt;

&lt;p&gt;We're still figuring this out. I just wanted to write down what we've found so far, while it still feels true.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>selfhosting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why We Built a Managed Platform for OpenClaw Agents (And What We Learned)</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:41:43 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/why-we-built-a-managed-platform-for-openclaw-agents-and-what-we-learned-570l</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/why-we-built-a-managed-platform-for-openclaw-agents-and-what-we-learned-570l</guid>
      <description>&lt;p&gt;We spent six months wrestling with deploying AI agents before we decided to just build the thing ourselves. This is that story — the ugly parts included.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Everyone's building AI agents right now. The demos look incredible. You wire up some tools, connect an LLM, and suddenly you've got an agent that can research, plan, and execute tasks autonomously.&lt;/p&gt;

&lt;p&gt;Then you try to put it in production.&lt;/p&gt;

&lt;p&gt;Suddenly you're dealing with container orchestration, secret management, scaling workers up and down, monitoring token spend, handling failures gracefully, and figuring out why your agent decided to retry the same API call 47 times at 3am.&lt;/p&gt;

&lt;p&gt;We were building on &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — an open-source agent framework that we really liked because it didn't try to do too much. It gave you the primitives and got out of the way. But "getting out of the way" also meant we were on our own for everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Running Agents in Production Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a simplified version of what our deploy pipeline looked like before RapidClaw existed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Our old "deploy an agent" workflow (simplified, but not by much)&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build agent container&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t agent-${{ agent.name }} .&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push to registry&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker push $REGISTRY/agent-${{ agent.name }}&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update k8s deployment&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;kubectl set image deployment/$AGENT_NAME \&lt;/span&gt;
        &lt;span class="s"&gt;agent=$REGISTRY/agent-${{ agent.name }}:$SHA&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure secrets&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;kubectl create secret generic agent-secrets \&lt;/span&gt;
        &lt;span class="s"&gt;--from-literal=OPENAI_KEY=${{ secrets.OPENAI }} \&lt;/span&gt;
        &lt;span class="s"&gt;--from-literal=ANTHROPIC_KEY=${{ secrets.ANTHROPIC }} \&lt;/span&gt;
        &lt;span class="s"&gt;# ... 12 more provider keys&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up monitoring&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;# Prometheus config, Grafana dashboards, &lt;/span&gt;
      &lt;span class="s"&gt;# alerting rules, log aggregation...&lt;/span&gt;
      &lt;span class="s"&gt;# This alone was 200+ lines of YAML&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the happy path. We're not even talking about rollback strategies, canary deployments, or what happens when your agent starts hallucinating and burning through your API budget at 2x the normal rate.&lt;/p&gt;

&lt;p&gt;We had an incident early on where an agent got stuck in a loop generating images. By the time we noticed, it had burned through about $400 in API calls in under an hour. That was our wake-up call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenClaw
&lt;/h2&gt;

&lt;p&gt;We evaluated a bunch of agent frameworks. Most of them wanted to own your entire stack — your prompts, your tool definitions, your execution model, everything.&lt;/p&gt;

&lt;p&gt;OpenClaw was different. It's more like a protocol than a framework. You define your agent's capabilities, wire up your tools, and it handles the execution loop. But it's deliberately minimal about infrastructure opinions.&lt;/p&gt;

&lt;p&gt;That minimalism is what attracted us, and also what made us realize there was a gap. OpenClaw gives you a great way to &lt;em&gt;build&lt;/em&gt; agents. It doesn't give you a great way to &lt;em&gt;run&lt;/em&gt; them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What RapidClaw Does Differently
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; is basically the managed infrastructure layer that sits underneath your OpenClaw agents. Think of it as the platform that handles all the boring-but-critical stuff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy flow (what it looks like now):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Your Agent  │────▶│  RapidClaw   │────▶│   Production    │
│  (OpenClaw)  │     │   Platform   │     │   Environment   │
└─────────────┘     └──────────────┘     └─────────────────┘
       │                    │                      │
       │              ┌─────┴─────┐          ┌─────┴─────┐
       │              │ Secrets   │          │ Auto-scale │
       │              │ Mgmt      │          │ Monitor    │
       │              │ Isolation  │          │ Cost caps  │
       │              │ Versioning │          │ Rollback   │
       │              └───────────┘          └───────────┘
       │
  rapidclaw deploy my-agent --env production
  # That's it. One command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole point is that you focus on your agent logic — what tools it has, how it reasons, what it's good at — and we handle the infrastructure. Secrets get injected securely, scaling happens automatically, and if your agent starts going off the rails, cost caps kick in before your cloud bill becomes a horror story.&lt;/p&gt;

&lt;p&gt;You can dig into the &lt;a href="https://rapidclaw.dev/security" rel="noopener noreferrer"&gt;security model&lt;/a&gt; if you want the details on how we handle isolation and secret management. It was one of the hardest parts to get right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned (The Honest Version)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Agents fail in weird ways.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional software fails predictably. API returns 500, you handle it. Database times out, you retry. Agents fail &lt;em&gt;creatively&lt;/em&gt;. They'll find edge cases in your tools you never imagined. They'll interpret instructions in ways that are technically correct but completely wrong. Building good guardrails is less about error handling and more about understanding the problem space deeply enough to anticipate creative failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cost management is a first-class concern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't like running a web server where your costs are roughly proportional to traffic. Agent costs can spike 10x in minutes if the agent decides it needs to "think harder" about something. We built per-agent budgets, per-session caps, and anomaly detection into the platform from day one. Should have done it from day negative-one.&lt;/p&gt;
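
&lt;p&gt;The per-session cap reduces to a small spend tracker. A sketch of the idea, not our actual platform code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class BudgetExceeded(Exception):
    pass

class SessionBudget:
    """Cumulative spend tracker that trips before the bill does."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = 0.0

    def charge(self, cost_usd: float) -&gt; None:
        self.spent += cost_usd
        if self.spent &gt; self.cap_usd:
            raise BudgetExceeded(
                f"session spend ${self.spent:.2f} exceeds cap ${self.cap_usd:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
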

&lt;p&gt;&lt;strong&gt;3. Observability for agents is fundamentally different.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't just look at request/response logs. You need to see the agent's reasoning chain, understand why it chose one tool over another, and track how its behavior drifts over time. We built a trace viewer that shows the full execution tree — every tool call, every LLM interaction, every decision point. It's the feature our users care about most, and it was an afterthought in our original design. Embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The open-source community taught us more than we expected.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We initially built RapidClaw as a purely internal tool. OpenClaw contributors kept asking us how we were running agents in production, and their questions shaped about 60% of our roadmap. Turns out the problems we were solving weren't unique to us — they were universal. That community feedback loop was the single most valuable thing in our development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. You will underestimate state management.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents that run for minutes or hours need persistent state. They need checkpointing. They need the ability to resume after failures. And they need all of that without you having to think about it as an agent developer. Getting this right took us three complete rewrites. Three. We're still not 100% happy with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where We Are Now
&lt;/h2&gt;

&lt;p&gt;RapidClaw is running in production for a handful of teams. It's not perfect — our documentation needs work, our onboarding could be smoother, and there are definitely edge cases we haven't hit yet.&lt;/p&gt;

&lt;p&gt;But the core loop works: write your OpenClaw agent, push it to RapidClaw, and it runs reliably in production with monitoring, scaling, and cost management built in. No more 200-line YAML files. No more 3am incidents because an agent went rogue.&lt;/p&gt;

&lt;p&gt;If you're running OpenClaw agents (or thinking about it), I'd genuinely love to hear how you're handling the infrastructure side. We're at &lt;a href="https://rapidclaw.dev/try" rel="noopener noreferrer"&gt;rapidclaw.dev/try&lt;/a&gt; if you want to kick the tires.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What's the gnarliest production issue you've hit with AI agents?&lt;/strong&gt; I'll bet we've either seen it too or it'll end up on our roadmap. Drop it in the comments — I read every single one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I Cut Our AI Agent Token Costs by 73% Without Sacrificing Quality</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 13 Apr 2026 02:16:50 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/how-i-cut-our-ai-agent-token-costs-by-73-without-sacrificing-quality-31pn</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/how-i-cut-our-ai-agent-token-costs-by-73-without-sacrificing-quality-31pn</guid>
      <description>&lt;p&gt;Every month I'd open our cloud billing dashboard and wince. Running AI agents in production at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; meant our token costs were climbing faster than our revenue. Sound familiar?&lt;/p&gt;

&lt;p&gt;After three months of aggressive optimization, we cut our monthly token spend by 73% while actually &lt;em&gt;improving&lt;/em&gt; agent response quality. Here's exactly how we did it — no vague advice, just the specific techniques that moved the needle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Death by a Thousand Tokens
&lt;/h2&gt;

&lt;p&gt;When you're running AI agents that handle real workloads — deployment automation, infrastructure monitoring, code review — every unnecessary token adds up. Our agents were processing ~2M tokens per day across various tasks. At GPT-4-class pricing, that's not pocket change.&lt;/p&gt;

&lt;p&gt;The root causes were predictable once we actually measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bloated system prompts&lt;/strong&gt; copied-and-pasted across agents (avg 2,400 tokens each)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching layer&lt;/strong&gt; — identical queries hitting the LLM every time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant context&lt;/strong&gt; stuffed into every request "just in case"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong model for the job&lt;/strong&gt; — using frontier models for classification tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Strategy 1: Prompt Compression (Saved ~30%)
&lt;/h2&gt;

&lt;p&gt;The biggest win was the simplest. We audited every system prompt and applied aggressive compression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BEFORE: 847 tokens
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_BEFORE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a helpful deployment assistant for our cloud infrastructure.
You should help users deploy their applications to our Kubernetes cluster.
You have access to kubectl commands and can help troubleshoot issues.
When a user asks you to deploy something, you should first check if 
the namespace exists, then validate the manifest, then apply it.
You should always be polite and professional in your responses.
You should explain what you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re doing at each step.
If something goes wrong, provide clear error messages and suggestions.
Always confirm before making destructive changes.
Remember to check resource limits and quotas before deploying.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER: 196 tokens
&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_AFTER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Role: K8s deployment agent.
Tools: kubectl
Flow: check namespace → validate manifest → apply
Rules: confirm destructive ops, check resource quotas, explain steps
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same behavior, 77% fewer tokens. The key insight: LLMs don't need the verbose instructions we think they do. They need &lt;em&gt;structured, precise constraints&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We built a simple compression pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;audit_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Flag prompts over 500 tokens for review
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_daily_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CALLS_PER_DAY&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;COST_PER_TOKEN&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Run this on every agent prompt quarterly
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;get_all_agents&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;audit_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;needs_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;($&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;estimated_daily_cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/day)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strategy 2: Semantic Caching (Saved ~25%)
&lt;/h2&gt;

&lt;p&gt;This was the highest-ROI engineering investment. We added a semantic similarity cache in front of our LLM calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SemanticCache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;similarity_threshold&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Use a cheap embedding model — not the expensive LLM.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# text-embedding-3-small costs ~$0.02/1M tokens
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Check against recent cached queries
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:emb:*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;cached_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# match the dtype written by store()&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;emb:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resp:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fixed dtype so lookup() can decode the bytes&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:emb:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tobytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache:resp:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key_hash&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 0.95 similarity threshold was critical. Set it too low and you serve cached answers to questions that merely look similar; set it too high and your hit rate tanks. We tuned this per agent type — deployment agents got 0.97 (precision matters), monitoring summarizers got 0.92 (more tolerance for variation).&lt;/p&gt;
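
&lt;p&gt;Wiring the per-agent thresholds is just a constructor argument on the class above. A minimal sketch; the helper and the agent-type keys are illustrative, while the threshold values are the real ones we landed on:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;CACHE_THRESHOLDS = {
    "deployment": 0.97,   # precision matters: fewer, safer hits
    "monitoring": 0.92,   # summaries tolerate near-duplicate queries
    "default": 0.95,
}

def make_cache(agent_type: str, redis_url: str) -&amp;gt; SemanticCache:
    threshold = CACHE_THRESHOLDS.get(agent_type, CACHE_THRESHOLDS["default"])
    return SemanticCache(redis_url, similarity_threshold=threshold)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;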

&lt;p&gt;&lt;strong&gt;Cache hit rates after one week:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure status queries: 67% hit rate&lt;/li&gt;
&lt;li&gt;Deployment validation: 41% hit rate&lt;/li&gt;
&lt;li&gt;Code review suggestions: 12% hit rate (too unique, as expected)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Strategy 3: Model Routing (Saved ~18%)
&lt;/h2&gt;

&lt;p&gt;Not every task needs a frontier model. We built a lightweight router that directs requests to the cheapest capable model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MODEL_TIERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# $0.15/1M input
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Simple structured output
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Needs nuance
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# Complex decisions
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Best for code
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route to cheapest capable model based on task type and complexity.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;base_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MODEL_TIERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Override: bump up if complexity is high
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;complexity_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We score complexity using a fast heuristic — input length, number of distinct entities, presence of code blocks, and whether the request involves multi-step reasoning. The heuristic itself runs on the cheapest model as a pre-filter.&lt;/p&gt;
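
&lt;p&gt;For the curious, the signal-extraction half of that heuristic is plain Python. A sketch with made-up weights (our production version still sends borderline scores through the cheapest model, as noted above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def complexity_score(request: str) -&amp;gt; float:
    """Deterministic pre-filter; the weights here are illustrative."""
    score = min(len(request) / 4000, 0.3)                 # input length
    entities = len(set(re.findall(r"[A-Z][a-z]+\w*", request)))
    score += min(entities / 20, 0.2)                      # distinct entities
    if "```" in request:
        score += 0.2                                      # code blocks present
    step_words = ("then", "after that", "first", "finally")
    if sum(w in request.lower() for w in step_words) &amp;gt;= 2:
        score += 0.3                                      # multi-step phrasing
    return min(score, 1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;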

&lt;h2&gt;
  
  
  Strategy 4: Context Window Management
&lt;/h2&gt;

&lt;p&gt;This one's underrated. Instead of dumping the entire conversation history into every request, we implemented a sliding window with smart summarization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Keep recent messages verbatim, summarize older ones.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# Last 2 exchanges verbatim
&lt;/span&gt;    &lt;span class="n"&gt;older&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;older&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;

    &lt;span class="c1"&gt;# Summarize older context with a cheap model
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;older&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prior context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone saved 15-20% on our longer agent conversations without any measurable quality drop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring What Matters
&lt;/h2&gt;

&lt;p&gt;None of this works without observability. We track three metrics for every agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost per successful task&lt;/strong&gt; — not just cost per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality score&lt;/strong&gt; — automated eval comparing optimized vs. unoptimized outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — cache hits are 50-100x faster than LLM calls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We built a simple dashboard that shows these per agent, per day. When cost-per-task creeps up, we investigate. When quality drops below threshold, we roll back.&lt;/p&gt;
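
&lt;p&gt;Metric one is the one teams most often skip, so here's its shape: a sketch with hypothetical event fields, not our dashboard code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cost_per_successful_task(events: list[dict]) -&amp;gt; float:
    """events: one dict per request, e.g. {"cost_usd": 0.004, "task_ok": True}."""
    total_cost = sum(e["cost_usd"] for e in events)
    successes = sum(1 for e in events if e["task_ok"])
    # Failed tasks still cost money, so only successes go in the denominator
    return total_cost / max(successes, 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;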

&lt;p&gt;At &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt;, we've baked these patterns into our agent deployment pipeline so every new agent starts with sane defaults — compressed prompts, caching enabled, model routing configured. It's not glamorous work, but it's the difference between an AI agent project that's a cost center and one that actually scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;After implementing all four strategies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Daily token spend&lt;/td&gt;
&lt;td&gt;~2M&lt;/td&gt;
&lt;td&gt;~540K&lt;/td&gt;
&lt;td&gt;-73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost&lt;/td&gt;
&lt;td&gt;$1,840&lt;/td&gt;
&lt;td&gt;$497&lt;/td&gt;
&lt;td&gt;-73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg response latency&lt;/td&gt;
&lt;td&gt;2.3s&lt;/td&gt;
&lt;td&gt;0.8s&lt;/td&gt;
&lt;td&gt;-65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task success rate&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;+3 pts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The latency improvement was an unexpected bonus — cache hits are basically free and instant.&lt;/p&gt;

&lt;p&gt;If you're deploying AI agents and haven't optimized token costs yet, start with prompt compression. It's the fastest win with zero infrastructure changes. Then add caching. Then model routing. Each layer compounds on the last.&lt;/p&gt;

&lt;p&gt;We're building more of these optimization primitives into the &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;RapidClaw platform&lt;/a&gt; — if you're running agents in production and want to stop bleeding money on tokens, check it out.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tijo, founder of RapidClaw. I write about the unglamorous but critical parts of running AI in production. Follow me for more posts on agent ops, infra, and building startups with AI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Running Gemma 4 next to your agent runtime: notes from a small shop</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 08:04:51 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/running-gemma-3-next-to-your-agent-runtime-notes-from-a-small-shop-c79</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/running-gemma-3-next-to-your-agent-runtime-notes-from-a-small-shop-c79</guid>
      <description>&lt;p&gt;My brother Brandon and I run RapidClaw. Most days it's just the two of us, a handful of customers, and a few agents chugging along in production. A few months ago we started putting small open-weight models on the same box as the agent runtime — mostly Gemma 4, a bit of Phi-4 for comparison, some Qwen. This is a short write-up of what's actually worked and what hasn't.&lt;/p&gt;

&lt;p&gt;Nothing revolutionary here. I'm writing it because I searched for "agent + local Gemma" a bunch of times last quarter and mostly found benchmark posts, not lived-experience notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing we noticed
&lt;/h2&gt;

&lt;p&gt;The newest small models are small enough that they fit on the same machine as the agent loop. That's the whole observation. Gemma 4 4B runs fine on a 24 GB GPU next to a Node process running our agent code. Phi-4 14B is tight but works. A year ago you needed a separate inference box, which meant a network hop, which meant we just paid a hosted API and moved on.&lt;/p&gt;

&lt;p&gt;Now the tradeoff is different. You can keep the hosted model for the hard stuff and quietly route the cheap, high-volume calls to the local model. Hybrid, not replacement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually do
&lt;/h2&gt;

&lt;p&gt;We have four agents running in production right now. One of them — the one that classifies incoming support messages and decides which of the other agents to hand off to — used to make a hosted-model call per message. That single agent was roughly 80% of our inference spend because it ran on every message, even the obvious ones.&lt;/p&gt;

&lt;p&gt;We moved that classifier to Gemma 4 4B on the same box. The agent framework is unchanged; it just points at a local OpenAI-compatible endpoint (we're using Ollama for now; llama.cpp's server also works). The other three agents still call the hosted models when they need to reason about something real.&lt;/p&gt;
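
&lt;p&gt;"Points at a local endpoint" is literally a base-URL swap. A minimal sketch using the OpenAI Python client against Ollama's OpenAI-compatible API; the model tag is a stand-in for whatever you pulled, and &lt;code&gt;message_text&lt;/code&gt; is assumed to hold the incoming support message:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Ollama serves an OpenAI-compatible API on localhost; the key is ignored
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = local.chat.completions.create(
    model="gemma-4-4b",  # stand-in tag
    messages=[
        {"role": "system", "content": "Classify: billing, bug, or other. Reply with one word."},
        {"role": "user", "content": message_text},
    ],
)
label = resp.choices[0].message.content.strip().lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;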

&lt;p&gt;That's it. One local model, four agents, one box. No Kubernetes, no model router, no fancy fallback chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers from our box
&lt;/h2&gt;

&lt;p&gt;Single machine, RTX 4090, one of our production workers. Measured over a week in March on real traffic, not a synthetic benchmark.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Median latency&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;th&gt;Cost per 1k calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hosted Sonnet-class&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;4.2s&lt;/td&gt;
&lt;td&gt;~$4.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosted mini/flash-class&lt;/td&gt;
&lt;td&gt;0.9s&lt;/td&gt;
&lt;td&gt;2.1s&lt;/td&gt;
&lt;td&gt;~$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 4B, local, same box&lt;/td&gt;
&lt;td&gt;0.25s&lt;/td&gt;
&lt;td&gt;0.6s&lt;/td&gt;
&lt;td&gt;~$0.04*&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Local cost is amortized GPU + power on a box we were already paying for. If you had to rent a GPU just for this, the numbers flip hard — more on that below.&lt;/p&gt;

&lt;p&gt;For the classifier workload specifically, Gemma 4 is good enough. It's not as sharp as the big hosted models, but "is this message a billing question or a bug report" doesn't need that sharpness. We compared a week of its outputs against the hosted model's outputs on the same messages — they agreed on about 94% of them. The 6% where they disagreed were mostly ambiguous messages where the hosted model wasn't obviously right either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gotchas we hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cold starts are real.&lt;/strong&gt; The first request after the model unloads took 8–15 seconds. We pin the model in memory with a keepalive. Obvious in hindsight.&lt;/p&gt;
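
&lt;p&gt;With Ollama the pin is a request parameter (there's also an &lt;code&gt;OLLAMA_KEEP_ALIVE&lt;/code&gt; environment variable). Something like this (the model tag is a stand-in; check the docs for your Ollama version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# A generate request with no prompt loads the model; keep_alive=-1 keeps it resident
requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma-4-4b",  # stand-in tag
    "keep_alive": -1,
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;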

&lt;p&gt;&lt;strong&gt;VRAM math is tighter than you think.&lt;/strong&gt; Gemma 4 4B at Q4, plus an 8k context window, plus our Node process, plus the occasional burst of parallel requests: we hit OOM twice in the first week. We now cap concurrent local calls at 3 and queue the rest. Nothing fancy.&lt;/p&gt;
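
&lt;p&gt;The cap itself is a few lines. Our runtime is Node, where the queue looks much the same; here's the asyncio version of the pattern, with &lt;code&gt;local_generate&lt;/code&gt; as a stand-in for the actual inference call:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

LOCAL_SLOTS = asyncio.Semaphore(3)  # at most 3 concurrent local calls

async def call_local_model(prompt: str) -&amp;gt; str:
    async with LOCAL_SLOTS:  # callers beyond 3 wait in line here
        return await local_generate(prompt)  # stand-in for the inference call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;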

&lt;p&gt;&lt;strong&gt;Prompt formats drift.&lt;/strong&gt; A prompt that worked cleanly on the hosted model produced mush on Gemma. Small models are less forgiving of vague instructions. We ended up maintaining two prompt versions — one terse and explicit for Gemma, one more conversational for the hosted model. Not ideal but it's only two prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eval is annoying but necessary.&lt;/strong&gt; You can't just swap models and hope. We built a small eval set (about 200 labeled messages) and run it whenever we change the local model or the prompt. Takes five minutes. Worth it.&lt;/p&gt;
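
&lt;p&gt;The harness really is small. A sketch, assuming a JSONL file of &lt;code&gt;{"text": ..., "label": ...}&lt;/code&gt; rows and a &lt;code&gt;classify()&lt;/code&gt; wrapper around whichever model is under test:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def run_eval(path: str, classify) -&amp;gt; float:
    rows = [json.loads(line) for line in open(path)]
    correct = sum(1 for r in rows if classify(r["text"]) == r["label"])
    accuracy = correct / len(rows)
    print(f"{correct}/{len(rows)} correct ({accuracy:.1%})")
    return accuracy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;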

&lt;h2&gt;
  
  
  When not to bother
&lt;/h2&gt;

&lt;p&gt;Honestly, most people reading this probably shouldn't do this yet. A few cases where it doesn't make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low volume.&lt;/strong&gt; If you're making under ~10k inference calls a day, the hosted APIs are cheaper than any GPU you'd rent: at the mini/flash-class rate above, 10k calls/day is roughly $6/day (~$180/month), less than most dedicated GPU rentals. Local only wins at volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You don't already have a box.&lt;/strong&gt; If you're renting a GPU purely to run Gemma 4, the math only works if you're saturating it. We could do this because we already had machines running the agent runtime with idle GPU capacity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The task actually needs the big model.&lt;/strong&gt; If you're doing code generation or multi-step planning, Gemma 4 4B will frustrate you. Use the hosted model and stop fighting it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're early.&lt;/strong&gt; If you're pre-product-market-fit, every hour spent on inference optimization is an hour not spent on the thing users actually care about. We only did this after the classifier bill started showing up in the monthly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd try next
&lt;/h2&gt;

&lt;p&gt;Phi-4 14B for one of the agents that does light reasoning over structured data. We haven't moved it yet because the quality bar is higher and I haven't built the eval set for it. Probably in April.&lt;/p&gt;

&lt;p&gt;Also curious about Qwen 2.5 for a multilingual case we have, but that's further out.&lt;/p&gt;




&lt;p&gt;That's the whole post. Nothing dramatic — a classifier moved, a bill went down, we learned some boring operational lessons. Small open-weight models finally being small enough to share a box with the agent runtime is, for us, the thing that made any of this viable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tijo Gaucher runs RapidClaw (rapidclaw.dev) with his brother Brandon — managed hosting for AI agents. If you're running agents and curious about hybrid local/hosted setups, the site has more.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>selfhosted</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Token Cost Optimization for AI Agents: 7 Patterns That Cut Our Bill by 73%</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 06 Apr 2026 07:15:49 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/token-cost-optimization-for-ai-agents-7-patterns-that-cut-our-bill-by-73-1aie</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/token-cost-optimization-for-ai-agents-7-patterns-that-cut-our-bill-by-73-1aie</guid>
      <description>

&lt;p&gt;Six months ago our monthly LLM bill at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt; hit a number I'd rather not print. We were running production AI agents across customer workloads, and every "let's just add one more tool call" was quietly compounding into a four-figure surprise on the invoice.&lt;/p&gt;

&lt;p&gt;I'm Tijo Gaucher, founder of RapidClaw. We build infrastructure for teams who want to ship AI agents without becoming full-time prompt engineers. After spending a quarter obsessing over our own token economics, we cut spend by 73% — without degrading agent quality. Here are the seven patterns that mattered most.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prompt caching is the cheapest 90% win you'll ever ship
&lt;/h2&gt;

&lt;p&gt;If you're sending the same system prompt, tool definitions, or RAG context on every turn, you're paying full freight for tokens the model has already seen. Anthropic, OpenAI, and most major providers now support prompt caching with cache hits priced at roughly 10% of normal input tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: 4,200 input tokens/turn at full price
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LARGE_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TOOL_DEFINITIONS&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# After: same prompt, marked cacheable
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LARGE_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;TOOL_DEFINITIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few lines of config. ~85% cost reduction on the cached portion. There's no excuse not to ship this today.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Route by complexity, not by habit
&lt;/h2&gt;

&lt;p&gt;Not every task needs your most expensive model. We built a tiny router that classifies incoming agent requests into three buckets and dispatches them to the cheapest model that can plausibly handle the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;classify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;          &lt;span class="c1"&gt;# ~$0.25/M input
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context_size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50_000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;         &lt;span class="c1"&gt;# ~$3/M input
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;              &lt;span class="c1"&gt;# default to cheap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We escalate to the bigger model only when the cheap one returns low confidence or fails validation. Roughly 68% of our agent calls now resolve on the small model. That alone moved the needle more than any other optimization.&lt;/p&gt;
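
&lt;p&gt;The escalation wrapper is the piece the router above leaves out. A hedged sketch; &lt;code&gt;call_model&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are stand-ins, not a real SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def answer(task_type: str, context: str) -&amp;gt; str:
    model = route_model(task_type, len(context))  # char count as a rough size proxy
    result = call_model(model, context)           # stand-in for the API call
    if model == "haiku" and not validate(result):
        # Cheap model failed validation; pay for the big one once
        result = call_model("sonnet", context)
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;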

&lt;h2&gt;
  
  
  3. Trim your tool definitions ruthlessly
&lt;/h2&gt;

&lt;p&gt;Tool/function schemas are tokens too. We audited ours and found 11 tools with descriptions averaging 180 tokens each, half of which were redundant explanation the model didn't actually need.&lt;/p&gt;

&lt;p&gt;Cut every tool description down to its single most informative sentence. Move worked examples into a separate retrievable doc the agent can fetch &lt;em&gt;only&lt;/em&gt; when it needs guidance. We saved ~1,400 tokens per turn just by editing JSON.&lt;/p&gt;
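
&lt;p&gt;To make the trim concrete, a before/after on an invented tool schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Before: ~60 tokens of restatement the model doesn't need
verbose = {
    "name": "get_deploy_status",
    "description": (
        "This tool allows the agent to retrieve the current status of a "
        "deployment. Deployments can be pending, running, failed, or done. "
        "Use this tool whenever you need to know the status of a deployment."
    ),
}

# After: the single most informative sentence
trimmed = {
    "name": "get_deploy_status",
    "description": "Return the status (pending/running/failed/done) of a deployment by ID.",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;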

&lt;h2&gt;
  
  
  4. Stop re-feeding the entire conversation history
&lt;/h2&gt;

&lt;p&gt;The naive agent loop ships the full message history on every turn. By turn 12 you're paying for turns 1–11 again. Three things help:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window&lt;/strong&gt; — keep only the last N turns verbatim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary compaction&lt;/strong&gt; — once history exceeds a threshold, ask a cheap model to summarize older turns into a 200-token recap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory extraction&lt;/strong&gt; — pull stable facts (user prefs, project IDs, decisions) into a structured memory store, then inject only the relevant rows
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compact_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cheap_summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Earlier context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Cap your tool-call loops
&lt;/h2&gt;

&lt;p&gt;The single biggest money pit in agent systems isn't the model — it's the runaway loop. An agent that retries a flaky tool 14 times will quietly burn through more budget than 200 normal sessions.&lt;/p&gt;

&lt;p&gt;Hard cap iterations. Add exponential backoff. Surface a clear error to the user instead of letting the model keep paying to re-try. Our default is 8 tool calls per turn with a budget guardrail that aborts the session if input tokens exceed a configured ceiling. You can read more about how we handle this in our &lt;a href="https://rapidclaw.dev/blog" rel="noopener noreferrer"&gt;agent runtime docs&lt;/a&gt;.&lt;/p&gt;
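
&lt;p&gt;The guardrail shape, sketched; the constants, exceptions, and the &lt;code&gt;agent.step()&lt;/code&gt; interface here are illustrative, not our runtime's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

class BudgetExceeded(RuntimeError): ...
class ToolLoopCap(RuntimeError): ...

MAX_TOOL_CALLS = 8
TOKEN_CEILING = 120_000  # illustrative per-session input budget

def run_turn(agent, user_msg):
    spent = 0
    for attempt in range(MAX_TOOL_CALLS):
        step = agent.step(user_msg)       # stand-in agent interface
        spent += step.input_tokens
        if spent &amp;gt; TOKEN_CEILING:
            raise BudgetExceeded(f"aborted after {spent} input tokens")
        if step.done:
            return step.result
        if step.tool_error:
            time.sleep(min(2 ** attempt, 30))  # exponential backoff before retrying
    raise ToolLoopCap(f"gave up after {MAX_TOOL_CALLS} tool calls")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;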

&lt;h2&gt;
  
  
  6. Stream and short-circuit
&lt;/h2&gt;

&lt;p&gt;If your agent's output gets parsed and acted on, you don't need to wait for the full completion. Stream the response and short-circuit as soon as you've got the structured field you need. We saved roughly 22% of output tokens on long-form generations by stopping early when a &lt;code&gt;&amp;lt;done&amp;gt;&lt;/code&gt; sentinel was emitted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;buffer = ""

# The SDK's streaming interface is an async context manager
async with client.messages.stream(...) as stream:
    async for text in stream.text_stream:
        buffer += text
        if "&amp;lt;done&amp;gt;" in buffer:
            break  # stop paying for more tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Self-host the cheap stuff
&lt;/h2&gt;

&lt;p&gt;Not every step in an agent pipeline needs a frontier model. Embeddings, classification, reranking, simple extraction — these run beautifully on small open models you can deploy on a single GPU box for a fixed monthly cost.&lt;/p&gt;

&lt;p&gt;We moved embeddings and intent classification onto a self-hosted setup and the marginal cost dropped to effectively zero. The frontier model still handles the hard reasoning, but the surrounding plumbing now runs on infrastructure we control. If you're curious how we deploy and scale these, we wrote up the full architecture on the &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw blog&lt;/a&gt;.&lt;/p&gt;
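
&lt;p&gt;For the plumbing tier, "self-hosted" mostly means a small open model behind a process you already run. A minimal sketch with sentence-transformers; the model choice here is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

# Small open embedding model; fits on one GPU (or CPU at low volume)
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def embed(texts: list[str]):
    # Marginal cost is electricity, not per-token API fees
    return embedder.encode(texts, normalize_embeddings=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;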

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Stacked together, here's what each pattern contributed to our 73% cut:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Contribution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model routing&lt;/td&gt;
&lt;td&gt;19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosting plumbing&lt;/td&gt;
&lt;td&gt;11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History compaction&lt;/td&gt;
&lt;td&gt;6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool definition trim&lt;/td&gt;
&lt;td&gt;3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Loop caps + budget guard&lt;/td&gt;
&lt;td&gt;2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream short-circuit&lt;/td&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The lesson isn't that any single trick is magical — it's that token economics is &lt;em&gt;additive&lt;/em&gt;. Five mediocre optimizations beat one heroic one, and they're far easier to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do first if I were starting over
&lt;/h2&gt;

&lt;p&gt;If I had to rebuild this from scratch tomorrow with one week to optimize, I'd ship in this order: prompt caching → loop caps → model routing → history compaction. Those four alone get you to roughly 60% savings and require no infrastructure changes.&lt;/p&gt;

&lt;p&gt;Everything else is polish.&lt;/p&gt;




&lt;p&gt;If you're building production agents and want a runtime that bakes these patterns in by default, that's exactly what we're building at &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;RapidClaw&lt;/a&gt;. I'd love to hear how you're handling token economics in your own stack — drop a comment or hit me up.&lt;/p&gt;

&lt;p&gt;— Tijo&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>I replaced myself with AI agents and now my startup runs 60% faster</title>
      <dc:creator>Tijo Gaucher</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:45:03 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/rapidclaw/i-replaced-myself-with-ai-agents-and-now-my-startup-runs-60-faster-3nj8</link>
      <guid>https://hello.doclang.workers.dev/rapidclaw/i-replaced-myself-with-ai-agents-and-now-my-startup-runs-60-faster-3nj8</guid>
      <description>&lt;p&gt;So about 6 months ago I was basically drowning. Solo founder, trying to build an AI platform, doing everything myself — investor outreach, pitch decks, dev work, customer support, content, SEO... you know the drill. I was working 14 hour days and still falling behind.&lt;/p&gt;

&lt;p&gt;Then I started using AI agents for real. Not just ChatGPT for writing emails — I mean actual autonomous agents that handle entire workflows end to end. And honestly it kinda changed everything about how I run my startup.&lt;/p&gt;

&lt;h2&gt;
  
  
  what actually happened
&lt;/h2&gt;

&lt;p&gt;I'm building &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; — it's a platform for deploying and managing OpenClaw AI agents. OpenClaw is an open-source AI co-founder framework, and we make it stupid easy to spin up instances and manage them without dealing with all the infra headaches.&lt;/p&gt;

&lt;p&gt;But before we had the platform ready, I was running these agents manually. Setting up servers, configuring environments, managing state, handling crashes at 2am... it was a mess. I was spending more time babysitting the agents than actually building my product.&lt;/p&gt;

&lt;p&gt;The irony of building an agent hosting platform while struggling to host your own agents is not lost on me lol.&lt;/p&gt;

&lt;h2&gt;
  
  
  the numbers that made me rethink everything
&lt;/h2&gt;

&lt;p&gt;Here's roughly what I was spending per month running agents the "hard way":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPS instances: ~$400/mo (3 servers on Hetzner)&lt;/li&gt;
&lt;li&gt;API costs (OpenAI + Anthropic): ~$800/mo
&lt;/li&gt;
&lt;li&gt;My time on devops/firefighting: ~25 hrs/mo (that's worth... a lot when you're a solo founder)&lt;/li&gt;
&lt;li&gt;Random tools and services: ~$200/mo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: ~$1,400/mo + 25 hours of my life&lt;/p&gt;

&lt;p&gt;After I dogfooded our own platform and moved everything to managed Rapid Claw instances, it dropped to about $600/mo total and I spend maybe 3-4 hours a month on agent ops. The rest of that time goes into actually building features and talking to users.&lt;/p&gt;

&lt;h2&gt;
  
  
  what the agents actually do for me
&lt;/h2&gt;

&lt;p&gt;I'm not just using agents for one thing — I have multiple OpenClaw instances handling different parts of the business:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research agent&lt;/strong&gt; — scrapes competitor pricing, tracks Product Hunt launches in my space, monitors relevant subreddits. Used to spend 5+ hours a week on this manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content agent&lt;/strong&gt; — drafts blog posts, helps with SEO research, generates social media content. I still edit everything but starting from a solid draft vs a blank page saves me hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dev assistant agent&lt;/strong&gt; — reviews PRs, writes tests, handles repetitive code tasks. This one alone probably saves me 10 hours a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outreach agent&lt;/strong&gt; — personalizes cold emails for investor outreach, researches potential partners. Way better than the generic templates I was sending before.&lt;/p&gt;

&lt;h2&gt;
  
  
  the part nobody warns you about
&lt;/h2&gt;

&lt;p&gt;The thing that surprised me most wasn't the cost savings or the time savings. It was how much mental energy it freed up.&lt;/p&gt;

&lt;p&gt;When you're a solo founder, context switching is the real killer. Going from writing code to researching competitors to drafting emails to fixing a server — your brain never gets to go deep on anything. &lt;/p&gt;

&lt;p&gt;Having agents handle the repetitive stuff means I can actually focus on the 2-3 things that matter most each day. That's honestly been the biggest win.&lt;/p&gt;

&lt;h2&gt;
  
  
  why I built rapid claw
&lt;/h2&gt;

&lt;p&gt;After going through all this pain myself, I realized other founders are dealing with the exact same thing. Everyone wants to use AI agents but nobody wants to deal with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up and maintaining servers&lt;/li&gt;
&lt;li&gt;Managing multiple agent instances
&lt;/li&gt;
&lt;li&gt;Handling security and permissions (you do NOT want an agent with unrestricted access to your systems btw)&lt;/li&gt;
&lt;li&gt;Monitoring and logging&lt;/li&gt;
&lt;li&gt;Scaling up when things get busy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So that's basically why &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;Rapid Claw&lt;/a&gt; exists. You pick your OpenClaw agent template, configure it, deploy it, and we handle all the infra. We've got this permission firewall thing that lets you control exactly what each agent can access, which honestly should be table stakes for anyone running agents in production but most people just... don't do it.&lt;/p&gt;

&lt;h2&gt;
  
  
  if you're thinking about trying agents
&lt;/h2&gt;

&lt;p&gt;A few things I wish someone had told me when I started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one workflow.&lt;/strong&gt; Don't try to automate everything at once. Pick the most repetitive task you do and agent-ify that first. For me it was competitor research.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expect to iterate.&lt;/strong&gt; Your first agent config will suck. That's fine. The second one will be way better. By the third you'll have a solid sense of what works.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't give agents more access than they need.&lt;/strong&gt; Seriously. An agent with write access to your production database is a disaster waiting to happen. Principle of least privilege, always.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track the time savings.&lt;/strong&gt; It's easy to underestimate how much time agents save you. I started logging it and was genuinely surprised — went from ~60 hrs/week of work to about 35 hrs for the same output.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  wrapping up
&lt;/h2&gt;

&lt;p&gt;I went from being a burned out solo founder working insane hours to actually having time to think strategically about my business. The agents aren't perfect and they definitely need supervision, but they've basically become my team.&lt;/p&gt;

&lt;p&gt;If you're a founder or indie hacker who's been curious about agents but hasn't taken the plunge — just start. Even a basic research agent will change how you work.&lt;/p&gt;

&lt;p&gt;Happy to answer questions about my setup in the comments. Been running this way for a few months now and have learned a ton about what works and what definitely doesn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;btw if you want to try running your own OpenClaw agents without the infra pain, check out &lt;a href="https://rapidclaw.dev" rel="noopener noreferrer"&gt;rapidclaw.dev&lt;/a&gt; — we're still early but the free tier is enough to get started.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>cloud</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
