<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem</title>
    <description>The most recent home feed on Forem.</description>
    <link>https://forem.com</link>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed"/>
    <language>en</language>
    <item>
      <title>AI Coding Tools Have a Context Problem — Here's the Fix</title>
      <dc:creator>RapidKit</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:11:47 +0000</pubDate>
      <link>https://forem.com/rapidkit/ai-coding-tools-have-a-context-problem-heres-the-fix-167i</link>
      <guid>https://forem.com/rapidkit/ai-coding-tools-have-a-context-problem-heres-the-fix-167i</guid>
      <description>&lt;h2&gt;
  
  
  The Wrong Unit of Context
&lt;/h2&gt;

&lt;p&gt;Most AI coding tools work at the &lt;strong&gt;file level&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's fine for a React component. A component is self-contained — the context needed to help you fits in the file.&lt;/p&gt;

&lt;p&gt;Backend services aren't self-contained. They live inside environments. They share infrastructure. They depend on modules installed at the workspace level.&lt;/p&gt;

&lt;p&gt;This is why AI backend debugging suggestions are often... almost right. They're missing environment context.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Backend AI Actually Needs
&lt;/h2&gt;

&lt;p&gt;Take this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A file-level AI tells you: Redis isn't running.&lt;/p&gt;

&lt;p&gt;A workspace-aware AI knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have the &lt;code&gt;redis-cache&lt;/code&gt; module installed in &lt;code&gt;auth-api&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Your Workspace Health check already flagged this&lt;/li&gt;
&lt;li&gt;You're using Docker Compose conventions (RapidKit workspace)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second answer is specific. The first is a starting point you still have to work from.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Workspace as Context Unit
&lt;/h2&gt;

&lt;p&gt;In Workspai, when AI responds to a debug action, it receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fastapi.standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="nl"&gt;"modules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jwt-auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"redis-cache"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3.12.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"health_warnings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Redis not reachable at localhost:6379"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ConnectionRefusedError at line 89"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not file contents. A structured workspace snapshot. The response is grounded from the first message.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Workspace Format Matters
&lt;/h2&gt;

&lt;p&gt;This only works because &lt;strong&gt;RapidKit defines a structured workspace format&lt;/strong&gt;. It knows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which projects exist and what type they are&lt;/li&gt;
&lt;li&gt;Which modules are installed in each project&lt;/li&gt;
&lt;li&gt;The runtime version&lt;/li&gt;
&lt;li&gt;The current health state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this structure, you'd have to infer context from file contents — slow, unreliable, incomplete.&lt;/p&gt;

&lt;p&gt;With it, context assembly is deterministic. The AI starts informed.&lt;/p&gt;
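&lt;p&gt;As a sketch of what "deterministic context assembly" can mean in practice, here is a minimal illustration in Python. The manifest fields mirror the snapshot above, but the function name and manifest keys are hypothetical, not RapidKit's actual API:&lt;/p&gt;

```python
# Hypothetical sketch of deterministic context assembly from a structured
# workspace manifest. The field names are illustrative, not RapidKit's API.
import json

def build_debug_context(manifest: dict, error: str) -> str:
    """Assemble a grounded prompt preamble from a workspace manifest."""
    snapshot = {
        "project": manifest["name"],
        "type": manifest["kit"],
        "modules": manifest["modules"],
        "python": manifest["runtime"],
        "health_warnings": manifest.get("health_warnings", []),
        "error": error,
    }
    # The same manifest always yields the same snapshot: no inference needed.
    return json.dumps(snapshot, indent=2)

manifest = {
    "name": "auth-api",
    "kit": "fastapi.standard",
    "modules": ["jwt-auth", "redis-cache"],
    "runtime": "3.12.3",
    "health_warnings": ["Redis not reachable at localhost:6379"],
}
print(build_debug_context(manifest, "ConnectionRefusedError at line 89"))
```

&lt;p&gt;Because the snapshot is built from structured fields rather than inferred from file contents, the same workspace state always produces the same context.&lt;/p&gt;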




&lt;h2&gt;
  
  
  What's Available Now (v0.20)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@workspai&lt;/code&gt; Chat Participant&lt;/strong&gt; — use &lt;code&gt;@workspai /ask&lt;/code&gt; for full-context Q&amp;amp;A scoped to your active project, or &lt;code&gt;@workspai /debug&lt;/code&gt; for structured root-cause + fix + prevention, directly in the VS Code Chat panel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Create with presets&lt;/strong&gt; — describe a project in plain language (or pick a smart preset), and AI plans the workspace, picks a kit, and selects modules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Debug Actions&lt;/strong&gt; — lightbulb in Python/TS/JS/Go files with workspace-aware context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Doctor Fix with AI&lt;/strong&gt; — one-click AI resolution for workspace health issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Module Advisor&lt;/strong&gt; — compatible module suggestions based on what you're building&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workspace Memory&lt;/strong&gt; — persistent AI context scoped to the workspace, carried across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All on top of the existing RapidKit workspace platform. No changes to CLI, kits, or modules.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The teams that establish workspace structure now will leverage AI more effectively as the tools improve. Workspace-aware AI will become the baseline expectation — the file level will feel like working blind.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://www.workspai.com/" rel="noopener noreferrer"&gt;workspai.com&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://marketplace.visualstudio.com/items?itemName=rapidkit.rapidkit-vscode" rel="noopener noreferrer"&gt;Workspai — VS Code Marketplace&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://getrapidkit.com" rel="noopener noreferrer"&gt;getrapidkit.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>workspai</category>
      <category>vscode</category>
    </item>
    <item>
      <title>The Planning Tax: Why Your AI Agent Feature Might Be Your Worst Investment</title>
      <dc:creator>Cornel Stefanache</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:05:07 +0000</pubDate>
      <link>https://forem.com/cstefanache/the-planning-tax-why-your-ai-agent-feature-might-be-your-worst-investment-50d7</link>
      <guid>https://forem.com/cstefanache/the-planning-tax-why-your-ai-agent-feature-might-be-your-worst-investment-50d7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Your best feature may be destroying your margins, and your engineering team has no idea.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;This article isn’t about AI as a productivity tool. It’s about AI as a cost structure, embedded in your product, triggered by your users, and scaling with your revenue.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AI agents embedded in your product are generating a cost structure your pricing model probably didn’t account for. Not a server bill. Not a licensing fee.&lt;/p&gt;

&lt;p&gt;A variable, compounding AI infrastructure cost that grows with engagement, spikes with complexity, and, unlike every other line in your budget, gets worse the more your product succeeds.&lt;/p&gt;

&lt;p&gt;Every interaction with an LLM-powered feature is a fresh purchase from a model provider, billed per token, at rates that compound with every feature you add to make the product smarter.&lt;/p&gt;

&lt;p&gt;The model provider captures guaranteed revenue on every interaction regardless of whether your business ever makes money on that customer. As Andreessen Horowitz has argued, the total cost of ownership for generative AI is reshaping the economics of an entire software category.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI is running at your expense, not your users'&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a quiet structural problem sitting at the centre of nearly every LLM-powered product business: the more useful your product becomes, the more expensive it is to run.&lt;/p&gt;

&lt;p&gt;This is not a temporary inefficiency that engineering will eventually optimise away. It is the defining economic characteristic of a new category of software, and most product teams are not treating it with the strategic gravity it deserves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradox of the Power User
&lt;/h2&gt;

&lt;p&gt;The most celebrated features of LLM-powered products (personalisation at scale, natural language interfaces, conversational support that actually resolves issues, intelligent document summarisation) share a common characteristic: they get more expensive with use.&lt;/p&gt;

&lt;p&gt;The user who engages most deeply generates the most value and the most AI agent cost simultaneously. This inverts one of the foundational assumptions of the SaaS business model. In traditional software, your heaviest users are your best customers.&lt;/p&gt;

&lt;p&gt;They renew, they expand, they refer others. In LLM-powered products, your heaviest users may be your least profitable ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The user who loves your product enough to use it every day is the one most likely to be costing you more than they pay.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The evidence is not theoretical. GitHub Copilot launched at $10 per month per developer. Microsoft’s internal calculations later revealed that the average developer was costing roughly $30 per month in Azure compute, with heavy coders consuming up to $80 per month in inference. The product was operating at negative gross margin from day one for a meaningful subset of its user base.&lt;/p&gt;

&lt;p&gt;Microsoft subsequently raised pricing to $19 per month, not because the feature had improved, but because the original pricing had no defensible unit economics.&lt;/p&gt;

&lt;p&gt;Sam Altman confirmed publicly that ChatGPT Pro, priced at $200 per month, was losing money on users generating 20,000 or more queries. Cursor, Replit, and others have made similar mid-course corrections, shifting from flat-rate to consumption-based pricing once the distribution of actual usage became visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Can’t Budget What You Can’t Predict
&lt;/h2&gt;

&lt;p&gt;Traditional compute scales linearly: you set a subscription price, model your cohorts, and the unit economics hold. AI agent costs break that contract entirely. You charge your customer a fixed monthly fee decided in a boardroom, while on the other side of that transaction, you are paying a dynamic, usage-driven price to a model provider that doesn’t care about your pricing page.&lt;/p&gt;

&lt;p&gt;A user who opens your product twice a month and one who runs complex queries for three hours a day pay you the same amount. They do not cost you the same amount.&lt;/p&gt;

&lt;p&gt;The gap between those two numbers isn’t an edge case to be managed — it is the fundamental structural risk of building a subscription business on top of a consumption-based cost model. As Sequoia Capital’s analysis highlights, the AI industry faces a $600 billion question around whether revenue can ever justify the infrastructure spend. You’ve sold certainty to your customer while absorbing all the variability yourself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You’re not paying per query. You’re paying for every decision, retry, context window, and failure your product accumulates; the per-query figure is just where the math starts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Start with context window growth. In a multi-turn conversation, each new response requires the model to process every prior token in the session. A 10-turn conversation doesn’t cost 10 times the price of a single turn; it costs closer to 55 times (the sum of 1 through 10), because each turn re-processes everything that came before. Product features designed around conversational depth have costs that escalate faster than engagement grows, not in proportion to it.&lt;/p&gt;
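&lt;p&gt;The arithmetic is easy to check with a toy model in which each turn contributes one unit of tokens and every turn re-sends all prior turns:&lt;/p&gt;

```python
# Quick check of the context-growth arithmetic: if every turn re-processes
# all prior turns, total tokens grow with the square of the turn count.
# Each turn is assumed to add one "unit" of tokens (a simplification).

def total_units(turns: int) -> int:
    # Turn k re-processes turns 1..k, so the total is 1 + 2 + ... + n.
    return sum(range(1, turns + 1))

print(total_units(1))   # 1: a single turn costs one unit
print(total_units(10))  # 55: a 10-turn conversation, not 10 units
```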

&lt;p&gt;Then consider the multiplier effect of making your product smarter. Add multi-step reasoning, tool use, or chained agents, and the multiplier compounds further. Research into agentic software engineering found that in multi-agent systems, iterative code review and refinement stages alone consumed nearly 60 per cent of all tokens in a task — not the generation, but the verification loops.&lt;/p&gt;

&lt;p&gt;The Reflexion architecture, which gives LLM agents the ability to reflect on and correct their own outputs across multiple trials, achieves impressive accuracy gains precisely because it runs multiple full inference passes per task. Each improvement in output quality is purchased with a corresponding increase in model API costs.&lt;/p&gt;

&lt;p&gt;A reasonable unit economics model makes the failure cost concrete. Consider a product with 1,000 daily user interactions, a 70 per cent success rate, and an average lifetime value of $200 per customer.&lt;/p&gt;

&lt;p&gt;The 300 daily failures each carry a recovery cost of at least one additional inference call, an escalation probability, and an amortised churn risk. Even conservative assumptions produce a total daily loss that frequently exceeds the entire inference budget. The cost per transaction you’re tracking is the visible part of a larger number.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do You Calculate the True Cost of an AI Agent?
&lt;/h2&gt;

&lt;p&gt;There is a mathematical reality about agentic systems that is uncomfortable to confront in a board meeting: the more steps an agent takes, the more likely it is to fail, even when each individual step has a high probability of success.&lt;/p&gt;

&lt;p&gt;If an agent executes a ten-step task and achieves 85% accuracy at each step, the compound probability of a fully correct end-to-end outcome is approximately 19%. Four out of every five autonomous task completions produce a result that is wrong somewhere. The arithmetic is a function of sequential dependency, and it does not improve unless you shorten the chain.&lt;/p&gt;
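&lt;p&gt;The compound-reliability figure follows directly from raising per-step accuracy to the power of the step count:&lt;/p&gt;

```python
# The compound-reliability arithmetic from the text: end-to-end success is
# per-step accuracy raised to the number of sequentially dependent steps.

def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

p = end_to_end_success(0.85, 10)
print(f"{p:.1%}")  # prints 19.7%: roughly one fully correct outcome in five
```

&lt;p&gt;Shortening the chain is the only lever: at five steps the same 85% per-step accuracy yields about 44% end to end.&lt;/p&gt;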

&lt;p&gt;The true cost of an agentic system is expressed by this formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected Agentic ROI = (Task Value × Success Rate × Volume) − (Development Cost + Runtime Cost + Failure Cost)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The term most internal business cases leave blank is Failure Cost. When an agent fails in production, you incur the engineering labour required to diagnose and remediate, plus the business impact of lost customer value. An enterprise deployment processing 1,000 tickets per day at a 70% success rate generates 300 failures daily.&lt;/p&gt;

&lt;p&gt;At a conservative $10 per failure, the monthly failure cost reaches $90,000, often exceeding the compute budget. As McKinsey’s State of AI report notes, organisations that fail to account for these hidden costs are systematically underestimating their total cost of ownership.&lt;/p&gt;
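&lt;p&gt;Plugging the illustrative numbers above into the failure-cost term makes the gap concrete (the figures are the same assumed ones, not measurements):&lt;/p&gt;

```python
# The failure-cost term of the ROI formula, with the article's illustrative
# numbers: 1,000 interactions/day, 70% success, $10 per failure.

def monthly_failure_cost(daily_volume: int, success_rate: float,
                         cost_per_failure: float, days: int = 30) -> float:
    failures_per_day = daily_volume * (1 - success_rate)
    return failures_per_day * cost_per_failure * days

cost = monthly_failure_cost(1000, 0.70, 10.0)
print(round(cost))  # 90000: often larger than the compute budget itself
```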

&lt;p&gt;&lt;strong&gt;A demo that works 80 percent of the time is impressive. A production system that fails 20 percent of the time is useless.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Proven Strategies to Reduce AI Agent Costs and Architect for Margin
&lt;/h2&gt;

&lt;p&gt;The AI cost structure described above is not fixed. It is simply the default you accept if you deploy without engineering the economics. You should treat unit economics as a first-class architectural concern from day one.&lt;/p&gt;

&lt;p&gt;When building cost-effective, production-ready AI agents for enterprise clients, we apply five core AI cost optimisation strategies to fundamentally alter the dollar-per-decision profile:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Routing by Task Complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The costliest assumption in the industry is that every single step of a workflow requires a premium, frontier model. It doesn’t. You wouldn’t pay a senior executive to handle basic data entry, and you shouldn’t pay a frontier model to do it either.&lt;/p&gt;

&lt;p&gt;We design heterogeneous architectures that act as intelligent traffic controllers: they route complex, high-entropy planning to advanced models, but immediately delegate the execution of those plans to highly efficient, fine-tuned Small Language Models (SLMs).&lt;/p&gt;

&lt;p&gt;This approach isolates the cost of “expensive intelligence” only to the moments it is genuinely necessary, lowering execution costs by 10x to 30x for procedural, repetitive tasks without sacrificing output quality.&lt;/p&gt;
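&lt;p&gt;A routing layer can be sketched in a few lines. The tier names and the lookup-by-task-kind heuristic are invented for illustration; production routers typically use a learned classifier, as in the query-routing work cited in the references:&lt;/p&gt;

```python
# Illustrative sketch of complexity-based model routing. The tier names and
# the task-kind heuristic are invented for the example; real systems route
# with a learned classifier rather than a static table.

ROUTES = {
    "planning": "frontier-model",      # expensive, reserved for high-entropy work
    "extraction": "small-fine-tuned",  # cheap SLM for procedural steps
    "formatting": "small-fine-tuned",
}

def route(task_kind: str) -> str:
    # Default to the cheap tier; escalate only for kinds known to need it.
    return ROUTES.get(task_kind, "small-fine-tuned")

print(route("planning"))    # frontier-model
print(route("formatting"))  # small-fine-tuned
```

&lt;p&gt;The design point is the default: unknown work falls to the cheap tier, so "expensive intelligence" has to be opted into, not out of.&lt;/p&gt;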

&lt;p&gt;&lt;strong&gt;Temporal Scheduling &amp;amp; Compute Arbitrage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not all agentic work is time-sensitive, yet default setups treat every request like an emergency. Heavy computational tasks — like end-of-day batch summarisation, large-scale data extraction, or automated inbox triaging — do not need sub-second latency. We architect systems that explicitly separate real-time user needs from asynchronous background work.&lt;/p&gt;

&lt;p&gt;By scheduling heavy processing during off-peak infrastructure hours and batching requests intelligently, we drastically reduce model API costs and prevent latency spikes for the users who actually need real-time responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constraining the Agent’s Latitude&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Planning capability is an incredible feature; unconstrained planning is a blank check. Without boundaries, agents will often fall down “rabbit holes,” exploring vast solution spaces and burning tokens in endless loops just to be thorough.&lt;/p&gt;

&lt;p&gt;We implement explicit step budgets, tight system guardrails, and hard termination conditions. An agent instructed to resolve a problem in three steps or fewer will often arrive at the exact same result as one told to “do whatever it takes,” but at a fraction of the cost per interaction. This ensures that your per-transaction costs remain predictable and strictly capped.&lt;/p&gt;
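&lt;p&gt;A hard step budget is simple to express. In this sketch, the step callable is a hypothetical stand-in for one model call; the point is the termination condition, which caps worst-case cost per interaction:&lt;/p&gt;

```python
# Sketch of a hard step budget around an agent loop. `call_agent_step` is a
# stand-in for one model call; the budget bounds the worst-case spend.

def run_with_budget(call_agent_step, max_steps: int = 3):
    for step in range(max_steps):
        result = call_agent_step(step)
        if result is not None:        # the agent produced an answer
            return result
    return "escalate-to-human"        # budget exhausted: fail closed, not open

# Toy step function that succeeds on the second call.
answers = [None, "fixed", None]
print(run_with_budget(lambda i: answers[i]))  # fixed
```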

&lt;p&gt;&lt;strong&gt;Prompt Engineering as Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Too many development teams treat prompt design as a quick launch prerequisite rather than core, scalable infrastructure. We treat prompts as highly optimised code. By implementing token-budget-aware reasoning, we mathematically force the model to be concise.&lt;/p&gt;

&lt;p&gt;Furthermore, we deploy semantic caching at the architectural level. If a customer asks a question today that is contextually similar to one asked yesterday, our system recognises the intent and serves the answer directly from a vector-embedded cache. This bypasses the model provider entirely, routinely slashing direct API costs by 50% to 70% in environments with recurring request patterns.&lt;/p&gt;
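&lt;p&gt;The cache logic itself is small; the hard part in production is embedding quality. In this toy sketch, a bag-of-words vector and an arbitrary 0.9 cosine threshold stand in for a real embedding model and vector store:&lt;/p&gt;

```python
# Minimal semantic-cache sketch. Real systems use learned embeddings and a
# vector database; here a toy bag-of-words vector, cosine similarity, and an
# arbitrary 0.9 threshold stand in for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

cache = []  # list of (embedding, answer) pairs

def answer(question: str, llm_call, threshold: float = 0.9):
    q = embed(question)
    for cached_q, cached_answer in cache:
        if cosine(q, cached_q) > threshold:
            return cached_answer      # semantic hit: the provider is bypassed
    result = llm_call(question)       # miss: pay for one real inference
    cache.append((q, result))
    return result

calls = []
def fake_llm(q):
    calls.append(q)
    return f"answer to: {q}"

answer("how do I reset my password", fake_llm)   # miss: one API call
answer("how do i reset my password", fake_llm)   # hit: served from cache
print(len(calls))  # 1
```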

&lt;p&gt;&lt;strong&gt;Difficulty-Aware Adaptive Reasoning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We build automatic cognitive caps into the agent’s reasoning loop to prevent the system from overthinking. Informed by dual-process theories of cognition — distinguishing between rapid, intuitive responses and slow, deliberate analysis — we calibrate our architectures to allocate intensive planning resources only to tasks that actually warrant them.&lt;/p&gt;

&lt;p&gt;In AI reasoning, there is a strict point of diminishing returns where accuracy plateaus. We identify exactly where that plateau is for your specific business operations, ensuring you aren’t paying a premium for extra “thinking” that yields zero incremental correctness.&lt;/p&gt;

&lt;p&gt;As research on cost-efficient query routing demonstrates, matching model capability to task difficulty is one of the highest-leverage AI cost optimisation moves available.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., &amp;amp; Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv. &lt;a href="https://arxiv.org/abs/2303.11366" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2303.11366&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chen, L., Zaharia, M., &amp;amp; Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv. &lt;a href="https://arxiv.org/abs/2305.05176" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2305.05176&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Ruhle, V., Lakshmanan, L.V.S., &amp;amp; Awadallah, A.H. (2024). Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. ICLR 2024. &lt;a href="https://arxiv.org/abs/2404.14618" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2404.14618&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J.E., Kadous, M.W., &amp;amp; Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. ICLR 2025. &lt;a href="https://arxiv.org/abs/2406.18665" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2406.18665&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Regmi, S. &amp;amp; Pun, C.P. (2024). GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching. arXiv. &lt;a href="https://arxiv.org/abs/2411.05276" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2411.05276&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Salim, M., Latendresse, J., Khatoonabadi, S.H., &amp;amp; Shihab, E. (2026). Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering. arXiv. &lt;a href="https://arxiv.org/abs/2601.14470" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2601.14470&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Singla, A., Sukharevsky, A., Yee, L. et al. (2025). The State of AI: How Organizations Are Rewiring to Capture Value. McKinsey &amp;amp; Company / QuantumBlack. &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value" rel="noopener noreferrer"&gt;https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cahn, D. (2024). AI’s $600B Question. Sequoia Capital. &lt;a href="https://sequoiacap.com/article/ais-600b-question/" rel="noopener noreferrer"&gt;https://sequoiacap.com/article/ais-600b-question/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Jaipuria, T. (2025). The State of AI Gross Margins in 2025. Tanay Jaipuria’s Substack. &lt;a href="https://www.tanayj.com/p/the-gross-margin-debate-in-ai" rel="noopener noreferrer"&gt;https://www.tanayj.com/p/the-gross-margin-debate-in-ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kappelhoff, K. (2025). Unit Economics for AI SaaS Companies: A Survival Guide for CFOs. Drivetrain.ai. &lt;a href="https://www.drivetrain.ai/post/unit-economics-of-ai-saas-companies-cfo-guide-for-managing-token-based-costs-and-margins" rel="noopener noreferrer"&gt;https://www.drivetrain.ai/post/unit-economics-of-ai-saas-companies-cfo-guide-for-managing-token-based-costs-and-margins&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Casado, M. &amp;amp; Wang, S. (2023). The Economic Case for Generative AI and Foundation Models. Andreessen Horowitz. &lt;a href="https://a16z.com/the-economic-case-for-generative-ai-and-foundation-models/" rel="noopener noreferrer"&gt;https://a16z.com/the-economic-case-for-generative-ai-and-foundation-models/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic. (2024). Introducing the Message Batches API. Anthropic Blog. &lt;a href="https://claude.com/blog/message-batches-api" rel="noopener noreferrer"&gt;https://claude.com/blog/message-batches-api&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Friedman, D. (2025). AI Startups Are SaaS Minus the Margins. Substack. &lt;a href="https://davefriedman.substack.com/p/ai-startups-are-saas-minus-the-margins" rel="noopener noreferrer"&gt;https://davefriedman.substack.com/p/ai-startups-are-saas-minus-the-margins&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Chaddha, N. (2025). Why AI Margins Matter More Than You Think. Mayfield Fund. &lt;a href="https://www.mayfield.com/why-ai-margins-matter-more-than-you-think/" rel="noopener noreferrer"&gt;https://www.mayfield.com/why-ai-margins-matter-more-than-you-think/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Configuring My Site for AI Discoverability</title>
      <dc:creator>Dennis Morello</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:02:58 +0000</pubDate>
      <link>https://forem.com/morellodev/configuring-my-site-for-ai-discoverability-1j38</link>
      <guid>https://forem.com/morellodev/configuring-my-site-for-ai-discoverability-1j38</guid>
      <description>&lt;p&gt;A growing share of web traffic doesn't come from people anymore. It comes from models reading on their behalf. ChatGPT, Claude, Perplexity, Copilot. They fetch a handful of pages, summarize, and ship the answer back. If your site isn't readable by those agents, you don't exist to them.&lt;/p&gt;

&lt;p&gt;People are calling this &lt;a href="https://wikipedia.org/wiki/Generative_engine_optimization" rel="noopener noreferrer"&gt;GEO&lt;/a&gt;, short for Generative Engine Optimization. It overlaps with SEO but the priorities are different. Agents don't care about your layout. They care about your prose, your metadata, and how many tokens it costs them to read you.&lt;/p&gt;

&lt;p&gt;This post covers how I configured this site for GEO. The first half is framework-agnostic. The second half is specific to my setup on Cloudflare, and includes a deliberate choice that fails a popular GEO audit. I'll explain why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: general GEO techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Serve raw Markdown alongside HTML
&lt;/h3&gt;

&lt;p&gt;The single biggest GEO win is giving agents a version of each page without the navigation, styling, and scripts. HTML is designed for browsers. Markdown is designed for readers, human or otherwise. Agents spend their context window on your prose, not your DOM.&lt;/p&gt;

&lt;p&gt;Every blog post on this site has a mirror URL with a &lt;code&gt;.md&lt;/code&gt; suffix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/blog/my-post&lt;/code&gt; is the full HTML page for humans&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/blog/my-post.md&lt;/code&gt; is the raw Markdown, served as &lt;code&gt;text/markdown&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Astro, this is a short route at &lt;code&gt;src/pages/blog/[slug].md.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;GET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getPostById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;formatPostMarkdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text/markdown; charset=utf-8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both variants are pre-generated at build time. Same content, &lt;strong&gt;roughly half the tokens&lt;/strong&gt; for an agent to consume.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advertise the Markdown version in &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Agents landing on the HTML need to know the Markdown exists. A single &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; in the head does it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"alternate"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text/markdown"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"/blog/my-post.md"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Browsers ignore this tag. Agents that parse the head follow it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Publish an &lt;code&gt;llms.txt&lt;/code&gt; index
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://llmstxt.org/" rel="noopener noreferrer"&gt;&lt;code&gt;llms.txt&lt;/code&gt;&lt;/a&gt; is a convention for a Markdown file at the root of your site listing your content with short descriptions and links. Think of it as a sitemap an LLM can actually read.&lt;/p&gt;

&lt;p&gt;I ship two variants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/llms.txt&lt;/code&gt; is the index. Title, description, one line per post with a link to its &lt;code&gt;.md&lt;/code&gt; version.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/llms-full.txt&lt;/code&gt; is the full corpus. Every post body concatenated into a single response.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why both? An agent researching a specific topic can fetch &lt;code&gt;llms.txt&lt;/code&gt;, pick the relevant links, and pull them. An agent doing deep research on the site as a whole fetches &lt;code&gt;llms-full.txt&lt;/code&gt; once and has everything it needs in one request. Either way there's no crawling.&lt;/p&gt;
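&lt;p&gt;For reference, the index variant follows the shape described at llmstxt.org: an H1 title, a one-line blockquote summary, and sections of links with short descriptions. A minimal sketch, with illustrative titles, paths, and descriptions:&lt;/p&gt;

```markdown
# morello.dev

> Dennis Morello's blog on web development and developer tooling.

## Posts

- [Configuring My Site for AI Discoverability](/blog/ai-discoverability.md): serving Markdown mirrors, llms.txt, and Content-Signal
- [Another Post](/blog/another-post.md): a one-line description an agent can triage by
```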

&lt;h3&gt;
  
  
  Declare your AI stance in &lt;code&gt;robots.txt&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;robots.txt&lt;/code&gt; now carries a &lt;code&gt;Content-Signal&lt;/code&gt; directive for AI use. Mine reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /
Sitemap: https://morello.dev/sitemap-index.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three independent knobs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;search=yes&lt;/code&gt; lets search engines index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ai-train=no&lt;/code&gt; says my content is not for training data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ai-input=yes&lt;/code&gt; says my content &lt;em&gt;can&lt;/em&gt; be retrieved and used as input for AI answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the stance I'm comfortable with. I want to show up when someone asks Claude about something I've written; I just don't want my posts absorbed into the next base model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Whether any given operator actually honors this is another question. The signal's there regardless, and I'd rather be on record than silent about it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Add structured data that actually describes the content
&lt;/h3&gt;

&lt;p&gt;Most blogs ship JSON-LD schema by reflex. Few of them include the fields that help a generative engine decide whether your article is worth fetching.&lt;/p&gt;

&lt;p&gt;On each post I emit a &lt;code&gt;BlogPosting&lt;/code&gt; graph with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;wordCount&lt;/code&gt; and &lt;code&gt;timeRequired&lt;/code&gt; (ISO 8601 duration), so an agent can estimate how much context it'll spend before fetching&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;articleBody&lt;/code&gt;, the full text in machine-readable form, with no HTML parsing required&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;author&lt;/code&gt; linked to a &lt;code&gt;Person&lt;/code&gt; node with &lt;code&gt;knowsAbout&lt;/code&gt; so the entity is grounded in real topics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BreadcrumbList&lt;/code&gt; for site hierarchy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of it goes into a single &lt;code&gt;@graph&lt;/code&gt; per page rather than scattered &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags, which makes it cheaper for an engine to walk from post to author to site without cross-referencing.&lt;/p&gt;
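&lt;p&gt;Condensed, the per-page &lt;code&gt;@graph&lt;/code&gt; looks roughly like this — the values are placeholders and the real graph carries more fields, but the shape is the point:&lt;/p&gt;

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BlogPosting",
      "@id": "https://morello.dev/blog/my-post#article",
      "headline": "My Post",
      "wordCount": 1800,
      "timeRequired": "PT8M",
      "articleBody": "Full plain-text body of the post...",
      "author": { "@id": "https://morello.dev/#person" }
    },
    {
      "@type": "Person",
      "@id": "https://morello.dev/#person",
      "name": "Author Name",
      "knowsAbout": ["Web development", "Edge infrastructure"]
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        { "@type": "ListItem", "position": 1, "name": "Blog", "item": "https://morello.dev/blog" },
        { "@type": "ListItem", "position": 2, "name": "My Post", "item": "https://morello.dev/blog/my-post" }
      ]
    }
  ]
}
```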

&lt;h3&gt;
  
  
  A sitemap that actually tracks freshness
&lt;/h3&gt;

&lt;p&gt;If you regenerate your sitemap once and never look at it again, you're wasting a signal. Every URL in mine carries a &lt;code&gt;lastmod&lt;/code&gt; timestamp pulled from the post's &lt;code&gt;updatedDate&lt;/code&gt; frontmatter, falling back to &lt;code&gt;pubDate&lt;/code&gt;. When I edit an old post, its &lt;code&gt;lastmod&lt;/code&gt; moves forward and crawlers reprioritize it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validate with real tools
&lt;/h3&gt;

&lt;p&gt;Two tools I found useful while iterating on all of the above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://isitagentready.com/" rel="noopener noreferrer"&gt;isitagentready.com&lt;/a&gt; audits across five categories: discoverability, content accessibility, bot access control, protocol discovery, and commerce. The bot access control checks (&lt;code&gt;Content-Signal&lt;/code&gt;, Web Bot Auth, AI bot rules) are the part that actually influences how agents treat your content.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://acceptmarkdown.com/" rel="noopener noreferrer"&gt;acceptmarkdown.com&lt;/a&gt; has a narrower focus. It checks whether your site responds to &lt;code&gt;Accept: text/markdown&lt;/code&gt; with a Markdown body, includes &lt;code&gt;Vary: Accept&lt;/code&gt;, returns &lt;code&gt;406&lt;/code&gt; for unsupported types, and parses q-values correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll come back to the second one at the end of the post, because my site deliberately fails it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: the Cloudflare-specific setup
&lt;/h2&gt;

&lt;p&gt;General GEO gets you most of the way there. The rest is delivery: how fast you respond, whether the edge caches correctly, and how you advertise your agent-facing resources without waiting for someone to parse your HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  Static assets, zero Worker invocations
&lt;/h3&gt;

&lt;p&gt;My &lt;code&gt;wrangler.jsonc&lt;/code&gt; points &lt;a href="https://developers.cloudflare.com/workers/static-assets/" rel="noopener noreferrer"&gt;Cloudflare's assets deployment&lt;/a&gt; at the &lt;code&gt;./dist&lt;/code&gt; directory, with no &lt;code&gt;main&lt;/code&gt; entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"morellodev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compatibility_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-18"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"assets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"directory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"./dist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"html_handling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"drop-trailing-slash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"not_found_handling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"404-page"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request is served straight from the edge asset cache. HTML, Markdown, &lt;code&gt;llms.txt&lt;/code&gt;, sitemap, RSS. Same path for all of them, and no Worker ever runs. On the Workers Free tier this matters. A crawler sweep that would otherwise eat into 100k daily invocations now costs me nothing. Agents, for better or worse, don't crawl politely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advertise discovery endpoints in a &lt;code&gt;Link&lt;/code&gt; header
&lt;/h3&gt;

&lt;p&gt;Cloudflare's &lt;a href="https://developers.cloudflare.com/workers/static-assets/headers/" rel="noopener noreferrer"&gt;&lt;code&gt;_headers&lt;/code&gt; file&lt;/a&gt; lets you ship response headers without any server code. I use it to tell every response, not just HTML ones, where the agent-facing files live:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
  Link: &amp;lt;/sitemap-index.xml&amp;gt;; rel="sitemap",
        &amp;lt;/rss.xml&amp;gt;; rel="alternate"; type="application/rss+xml"; title="RSS",
        &amp;lt;/llms.txt&amp;gt;; rel="describedby"; type="text/plain",
        &amp;lt;/llms-full.txt&amp;gt;; rel="describedby"; type="text/plain"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A crawler doing a &lt;code&gt;HEAD&lt;/code&gt; against any URL on the site sees all four links before it parses a single byte of HTML. &lt;strong&gt;One round-trip, no body, full discovery.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-lived cache for hashed assets
&lt;/h3&gt;

&lt;p&gt;Astro emits fingerprinted filenames under &lt;code&gt;/_astro/&lt;/code&gt;, so those can sit in cache for a year:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/_astro/*
  Cache-Control: public, max-age=31536000, immutable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Faster first paint for humans, cheaper crawls for agents. Same lever.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I skipped &lt;code&gt;Accept: text/markdown&lt;/code&gt; content negotiation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://acceptmarkdown.com/" rel="noopener noreferrer"&gt;acceptmarkdown.com&lt;/a&gt; will tell you this site doesn't do content negotiation. No &lt;code&gt;Vary: Accept&lt;/code&gt;, no &lt;code&gt;406&lt;/code&gt;, no Markdown from the canonical URL. That's not an oversight. I tried it, shipped it briefly, and rolled it back.&lt;/p&gt;

&lt;p&gt;The reason is Cloudflare's free plan. Custom cache keys are Enterprise-only, and &lt;a href="https://developers.cloudflare.com/cache/concepts/cache-control/" rel="noopener noreferrer"&gt;their docs are explicit&lt;/a&gt; that &lt;code&gt;Vary: Accept&lt;/code&gt; is ignored for caching decisions. The edge collapses every variant of &lt;code&gt;/blog/my-post&lt;/code&gt; into one cache entry, so the first requester's format &lt;strong&gt;poisons the cache for everyone else&lt;/strong&gt; until TTL expires.&lt;/p&gt;
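&lt;p&gt;The failure mode is easy to simulate. This sketch models an edge cache that, like the free tier, keys responses by URL alone and ignores &lt;code&gt;Accept&lt;/code&gt; — the paths and bodies are illustrative:&lt;/p&gt;

```python
# Toy model of a Vary-blind edge cache: responses are keyed by URL only,
# so the first requester's negotiated format wins for everyone.
cache = {}  # URL -> cached response body

def origin(path, accept):
    """Origin that honors Accept-based content negotiation."""
    if accept == "text/markdown":
        return "markdown body of " + path
    return "html body of " + path

def edge_fetch(path, accept):
    """Edge lookup that ignores Accept when building the cache key."""
    if path not in cache:
        cache[path] = origin(path, accept)  # first variant is stored
    return cache[path]

# An agent requests Markdown first...
first = edge_fetch("/blog/my-post", "text/markdown")
# ...and a browser asking for HTML now gets Markdown until the TTL expires.
second = edge_fetch("/blog/my-post", "text/html")
assert second == "markdown body of /blog/my-post"
```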

&lt;p&gt;The workaround is a Worker that bypasses the edge cache. But now every &lt;code&gt;/blog/*&lt;/code&gt; request burns a Worker invocation, humans included, and the &lt;a href="https://developers.cloudflare.com/workers/platform/pricing/" rel="noopener noreferrer"&gt;Workers Free plan&lt;/a&gt; gives you 100k per day and 10ms of CPU each. That's a real budget to share across humans and bots, for no functional gain over a static &lt;code&gt;.md&lt;/code&gt; URL.&lt;/p&gt;

&lt;p&gt;So I deleted the Worker. The only thing I lost is &lt;code&gt;curl -H "Accept: text/markdown" …/blog/my-post&lt;/code&gt; returning Markdown. Between &lt;code&gt;llms.txt&lt;/code&gt;, &lt;code&gt;&amp;lt;link rel="alternate"&amp;gt;&lt;/code&gt;, and the &lt;code&gt;/blog/[slug].md&lt;/code&gt; convention, no mainstream agent I've seen actually needs &lt;code&gt;Accept:&lt;/code&gt; negotiation. It's the more elegant protocol; alternate URLs are the more robust one on a free-tier CDN. On a paid plan I'd probably do both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves things
&lt;/h2&gt;

&lt;p&gt;Every page exists in two forms, both served from the edge. Agent-facing resources are advertised in response headers on every request, before any HTML gets parsed. Structured data tells engines what the article is and how much context it takes to read. &lt;code&gt;robots.txt&lt;/code&gt; says what I'll allow and what I won't.&lt;/p&gt;

&lt;p&gt;GEO is still very new. The standards are half-drafted, the tools disagree with each other, and half the signals I described above didn't exist two years ago. I fully expect to be rewriting parts of this post within six months, probably with a different opinion about Accept-based negotiation, once I've either moved off the free plan or found a workaround that doesn't involve a Worker. But for now: serve agents a version they can cheaply consume, be explicit about what you'll allow, and accept that the defaults aren't on your side.&lt;/p&gt;

&lt;p&gt;If you're reading this via a summary from some assistant, hi. Thanks for the traffic.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>seo</category>
      <category>llm</category>
    </item>
    <item>
      <title>Less Human AI Agents, Please!</title>
      <dc:creator>Mariano Gobea Alcoba</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:01:31 +0000</pubDate>
      <link>https://forem.com/mgobea/less-human-ai-agents-please-1d4f</link>
      <guid>https://forem.com/mgobea/less-human-ai-agents-please-1d4f</guid>
      <description>&lt;h2&gt;
  
  
  The Uncanny Valley of AI Agent Interaction: Beyond Human Mimicry
&lt;/h2&gt;

&lt;p&gt;The burgeoning field of AI agents, designed to autonomously perform tasks and interact with users, presents a complex design challenge. A prevalent tendency is to imbue these agents with human-like characteristics, language, and even personality traits. While seemingly intuitive, this approach often leads to an undesirable outcome: the "uncanny valley" of human-AI interaction. This article delves into the technical and user experience implications of this human-centric design philosophy and explores alternative, more effective paradigms for AI agent development.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Allure and Peril of Anthropomorphism
&lt;/h3&gt;

&lt;p&gt;Anthropomorphism, the attribution of human characteristics to non-human entities, is a deeply ingrained cognitive bias. In the context of AI, this manifests as designing agents that speak, reason, and behave as closely to humans as possible. The motivations for this are varied:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Familiarity and Ease of Use:&lt;/strong&gt; Users are inherently familiar with human communication and interaction patterns. Designing AI agents that mirror these patterns can, in theory, reduce the learning curve and make adoption smoother.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Emotional Connection and Trust:&lt;/strong&gt; Some believe that a more "human" agent can foster greater trust and a sense of connection with the user, leading to more positive user experiences.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Simulating Human Capabilities:&lt;/strong&gt; The ultimate goal for many AI agents is to replicate or surpass human performance in specific tasks. This often leads to designing agents that think and communicate in ways that mimic human cognitive processes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this pursuit of human likeness is fraught with peril. When an AI agent &lt;em&gt;almost&lt;/em&gt; succeeds at mimicking human behavior but falls short in subtle yet crucial ways, it can evoke feelings of unease, creepiness, or even revulsion. This is the AI equivalent of the uncanny valley, first described by roboticist Masahiro Mori in relation to humanoid robots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Manifestations of the Uncanny Valley:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Linguistic Inconsistencies:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Overly Formal or Stilted Language:&lt;/strong&gt; While aiming for politeness, agents might use phrasing that is grammatically correct but unnatural in spoken conversation.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Inappropriate Tone:&lt;/strong&gt; An agent attempting empathy might produce responses that feel hollow, insincere, or misaligned with the user's emotional state.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Repetitive Phrasing:&lt;/strong&gt; Limited generative capacity can lead to predictable and repetitive conversational patterns, signaling the artificial nature of the agent.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Misinterpretation of Nuance:&lt;/strong&gt; Sarcasm, irony, humor, and colloquialisms are notoriously difficult for AI to grasp. A failed attempt to engage with these can be jarring.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Behavioral Discrepancies:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Lack of True Agency:&lt;/strong&gt; Agents that claim to "understand" or "feel" but then act purely based on deterministic logic create a disconnect.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Inconsistent Persona:&lt;/strong&gt; An agent that fluctuates between being overly casual and then strictly professional can be disorienting.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Unrealistic Pacing:&lt;/strong&gt; Immediate responses to complex queries can feel unnatural, as humans typically require time to process information. Conversely, overly long pauses can also break the flow.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Failure to Adapt to Context:&lt;/strong&gt; An agent that forgets previous turns in a conversation or fails to acknowledge evolving user needs demonstrates a lack of true intelligence and makes the "human" facade crumble.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;Task Performance Mismatch:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Over-promising and Under-delivering:&lt;/strong&gt; An agent that uses human-like language to suggest it can perform complex reasoning but then fails to do so effectively highlights its limitations.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Misaligned Expectations:&lt;/strong&gt; Users might expect the emotional intelligence or common sense reasoning of a human, which current AI agents generally lack.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Case for "Less Human" AI Agents
&lt;/h3&gt;

&lt;p&gt;Instead of striving for human mimicry, a more effective approach might be to design AI agents that embrace their artificial nature. This paradigm shift focuses on transparency, efficiency, and clarity of purpose, rather than a flawed attempt at emulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Principles of "Less Human" AI Agents:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transparency and Honesty:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Clearly State AI Identity:&lt;/strong&gt; The agent should explicitly identify itself as an AI. There should be no ambiguity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Acknowledge Limitations:&lt;/strong&gt; Instead of trying to bluff its way through, the agent should be programmed to admit when it doesn't know something, can't perform a task, or requires human intervention.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Explain Capabilities and Purpose:&lt;/strong&gt; Users should understand what the agent &lt;em&gt;can&lt;/em&gt; do and why it exists. This sets realistic expectations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficiency and Directness:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Focus on Task Completion:&lt;/strong&gt; The primary goal of an AI agent is to efficiently and accurately perform its designated tasks. Human-like chit-chat or personality embellishments can be distractions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Precise Language:&lt;/strong&gt; Use clear, unambiguous language. Avoid jargon where possible, but prioritize accuracy and conciseness over conversational filler.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured Interaction:&lt;/strong&gt; For complex tasks, a more structured, form-based, or step-by-step interaction might be more efficient than an open-ended conversation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Predictability and Reliability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Consistent Behavior:&lt;/strong&gt; The agent's responses and actions should be predictable based on its programming and the input it receives. This builds trust through reliability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Defined Scope:&lt;/strong&gt; Clearly defined operational boundaries prevent unexpected or undesirable behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Functional Design:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;User Interface (UI) and User Experience (UX) Driven by Function:&lt;/strong&gt; The interface and interaction flow should be optimized for task completion, not for mimicking human conversation. This might involve dashboards, clear forms, and direct controls rather than free-form text input.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Handling as a Feature:&lt;/strong&gt; Robust error handling, with clear explanations and actionable steps, is more valuable than an apology that rings hollow.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
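&lt;p&gt;A minimal sketch of principle 1 in code: the agent states its identity, works from an explicit capability list, and refuses cleanly outside it. The capability names and phrasing are illustrative, not a real framework:&lt;/p&gt;

```python
# Transparency by construction: explicit AI identity, explicit scope,
# and an honest refusal instead of a bluff. Names are illustrative.
CAPABILITIES = {"check_order_status", "reorder_item"}

def respond(intent: str) -> str:
    if intent not in CAPABILITIES:
        # Acknowledge the limitation rather than improvising an answer.
        return ("I am an automated assistant. I can check order status or "
                "reorder items. I cannot help with that request.")
    return "I am an automated assistant. Handling: " + intent + "."

print(respond("cancel_subscription"))
```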

&lt;h3&gt;
  
  
  Technical Implementation Strategies
&lt;/h3&gt;

&lt;p&gt;Adopting a "less human" approach doesn't mean creating robotic, unfriendly interfaces. It means prioritizing functional excellence and transparency in design and implementation.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Communication Protocols and Language Models
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intent Recognition and Slot Filling:&lt;/strong&gt; For task-oriented agents, sophisticated Natural Language Understanding (NLU) models focusing on intent recognition and slot filling are crucial. These models should be trained to extract specific information rather than engaging in broad conversational discourse.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example using a hypothetical NLU library
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nlu_service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NLUClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NLUClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;user_utterance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want to book a flight from London to New York for two people next Tuesday.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_utterance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Expected output focuses on structured data extraction
# {
#     "intent": "book_flight",
#     "slots": {
#         "origin": "London",
#         "destination": "New York",
#         "passengers": 2,
#         "date": "next Tuesday"
#     }
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# The agent then uses these structured slots to query a booking system.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Controlled Generative Models:&lt;/strong&gt; If generative capabilities are needed, they should be carefully constrained. Fine-tuning Large Language Models (LLMs) on specific, task-oriented dialogue datasets can produce helpful, concise responses without venturing into overly human-like or speculative language. Techniques like Reinforcement Learning from Human Feedback (RLHF) can be used to steer generation towards helpfulness and factual accuracy, rather than "humanness."&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Hypothetical example of constrained generation
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMClient&lt;/span&gt;

&lt;span class="n"&gt;llm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_oriented_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
User Request: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the status of my order #12345?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

System Instruction: Respond concisely with factual information only.
If information is unavailable, state &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Information not available.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
Do not speculate or offer apologies.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Expected response: "Order #12345 is currently in transit. Estimated delivery: 2023-10-27."
# Or: "Information for order #12345 is not available."
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explicit AI Identification:&lt;/strong&gt; The system should prepend or append clear disclaimers.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_ai_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;core_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;System AI: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;core_response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Book a meeting with John Doe tomorrow at 2 PM.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# ... logic to process query and find availability ...
&lt;/span&gt;&lt;span class="n"&gt;meeting_details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Meeting with John Doe scheduled for tomorrow at 2 PM.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_ai_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meeting_details&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Output: System AI: Meeting with John Doe scheduled for tomorrow at 2 PM.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. State Management and Context Handling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Session State:&lt;/strong&gt; Maintain a clear, explicit representation of the conversation state. This includes recognized intents, extracted slots, user preferences, and task progress.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Contextual Awareness:&lt;/strong&gt; The agent needs to understand the immediate context of the current turn as well as relevant historical context from the session. However, this context should be used to inform task execution, not to build a "personality."&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConversationState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_progress&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="c1"&gt;# Limited history relevant to task
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_slots&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;slots&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_slots&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slots&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_slots&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# Logic to advance task progress based on intent and slots
&lt;/span&gt;
&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# User says: "I need to reorder my usual coffee."
# NLU identifies intent="reorder_item", slots={"item": "usual coffee"}
&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reorder_item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usual coffee&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;# Agent uses state.slots["item"] to query order history.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Error Handling and Fallback Strategies
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Informative Error Messages:&lt;/strong&gt; When an error occurs, the agent should provide a clear explanation of what went wrong and, if possible, suggest concrete next steps.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_booking_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slot_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;missing_slot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_slot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I cannot proceed without &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing_slot&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Please provide it.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;error_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_failure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An internal error occurred while processing your request. Please try again later.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An unexpected error occurred. Please contact support.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Agent encounters an error
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;handle_booking_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slot_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_slot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;departure date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="c1"&gt;# Output: I cannot proceed without departure date. Please provide it.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Graceful Degradation:&lt;/strong&gt; If an agent cannot fulfill a request, it should offer alternatives or clearly state its inability to help, rather than generating nonsensical or misleading information.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_unfulfillable_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Check against agent's capabilities
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;agent_can_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I am designed to assist with [specific tasks]. I cannot help with &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This request cannot be fulfilled at this time.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;handle_unfulfillable_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze my company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s stock market trends for the next decade.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Output: I am designed to assist with booking appointments and sending reminders. I cannot help with 'Analyze my company's stock market trends for the next decade.'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. User Interface Design for Clarity
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Visual Cues:&lt;/strong&gt; Use UI elements that clearly indicate the agent's function and status. Progress indicators, clear labels, and distinct input/output areas can be more effective than chat bubbles.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured Input:&lt;/strong&gt; For complex data entry, use forms, dropdowns, calendars, and other structured input fields instead of relying solely on natural language. This reduces ambiguity and ensures all necessary information is captured.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Actionable Output:&lt;/strong&gt; Present information and results in a clear, organized, and actionable manner. Buttons for confirmation, links to further information, or summaries of actions taken are beneficial.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- Example of a structured UI element for booking --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"booking-form"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;h3&amp;gt;&lt;/span&gt;Flight Booking&lt;span class="nt"&gt;&amp;lt;/h3&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"origin"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Origin:&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"origin"&lt;/span&gt; &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"e.g., London"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"destination"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Destination:&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"destination"&lt;/span&gt; &lt;span class="na"&gt;placeholder=&lt;/span&gt;&lt;span class="s"&gt;"e.g., New York"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"departure-date"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Departure Date:&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"date"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"departure-date"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"search-flights"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Search Flights&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Benefits of a Functionalist Approach
&lt;/h3&gt;

&lt;p&gt;Moving away from the pursuit of human-like interaction offers several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Reduced User Frustration:&lt;/strong&gt; By setting realistic expectations and providing clear, efficient interactions, users are less likely to be frustrated by an agent's perceived shortcomings.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Increased Trust and Reliability:&lt;/strong&gt; An agent that is honest about its capabilities and consistently performs its functions accurately builds more genuine trust than one that fakes empathy or understanding.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Improved Efficiency:&lt;/strong&gt; Focusing on task completion rather than conversational pleasantries can lead to faster and more direct resolution of user needs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scalability:&lt;/strong&gt; Functionalist agents are often easier to scale and maintain, as their behavior is more predictable and less dependent on the nuances of human language and emotion.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ethical Considerations:&lt;/strong&gt; Avoiding the creation of artificial "personalities" can mitigate concerns around emotional manipulation and the blurring of lines between human and machine relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion: Embracing Artificiality
&lt;/h3&gt;

&lt;p&gt;The quest to make AI agents "less human" is not about creating cold, unfeeling interfaces. It is about a pragmatic recognition of current AI capabilities and a user-centered design philosophy that prioritizes clarity, efficiency, and honesty. By embracing the artificial nature of these agents, developers can build systems that are more reliable, trustworthy, and ultimately more helpful to users. The uncanny valley of human mimicry is a trap that can be avoided by focusing on what AI agents do best: process information, execute tasks, and communicate results with precision and transparency.&lt;/p&gt;

&lt;p&gt;We invite you to explore further advancements and discuss these principles in the context of your own projects. For expert guidance and consulting services in AI agent development and conversational interface design, please visit &lt;a href="https://www.mgatc.com" rel="noopener noreferrer"&gt;https://www.mgatc.com&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published in Spanish at &lt;a href="https://www.mgatc.com/blog/less-human-ai-agents-please/" rel="noopener noreferrer"&gt;www.mgatc.com/blog/less-human-ai-agents-please/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ia</category>
      <category>agentesdeia</category>
      <category>interaccinhumanoia</category>
      <category>diseodeia</category>
    </item>
    <item>
      <title>We open sourced our Unity MCP server</title>
      <dc:creator>Daniel Fang (Glade)</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:01:05 +0000</pubDate>
      <link>https://forem.com/daniel_glade/we-open-sourced-our-unity-mcp-server-4i0l</link>
      <guid>https://forem.com/daniel_glade/we-open-sourced-our-unity-mcp-server-4i0l</guid>
      <description>&lt;p&gt;Many “AI for game dev” tools still stop at code generation.&lt;/p&gt;

&lt;p&gt;They can suggest a script, maybe explain an error, maybe even produce something close to what you want. But in actual Unity workflows, that is usually only a small part of the job.&lt;/p&gt;

&lt;p&gt;The real work is spread across scene hierarchy, prefabs, materials, UI, physics, animation, input setup, package differences, console errors, project conventions, and lots of repetitive editor actions.&lt;/p&gt;

&lt;p&gt;That gap is exactly why we built GladeKit.&lt;/p&gt;

&lt;p&gt;Today, we’re doing two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Launching GladeKit officially (see &lt;a href="https://www.producthunt.com/products/gladekit?launch=gladekit" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Open sourcing the &lt;a href="https://github.com/Glade-tool/glade-mcp-unity" rel="noopener noreferrer"&gt;GladeKit Unity MCP server&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  GladeKit Unity MCP
&lt;/h2&gt;

&lt;p&gt;The open-source MCP server connects AI clients like Cursor, Claude Code, and Windsurf directly to the Unity Editor.&lt;/p&gt;

&lt;p&gt;That means the model is not just chatting about your game in the abstract. It can actually operate with real Unity context.&lt;/p&gt;

&lt;p&gt;The server includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;230+ Unity tools across areas like scenes, GameObjects, scripts, prefabs, materials, lighting, VFX, audio, animation, physics, camera, UI, input, terrain, and NavMesh&lt;/li&gt;
&lt;li&gt;a Unity-aware system prompt&lt;/li&gt;
&lt;li&gt;GLADE.md project context injection&lt;/li&gt;
&lt;li&gt;semantic script search&lt;/li&gt;
&lt;li&gt;skill calibration based on user expertise&lt;/li&gt;
&lt;li&gt;optional cloud intelligence for RAG and cross-session memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core features are free, local, and MIT licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we open sourced it
&lt;/h2&gt;

&lt;p&gt;For Unity especially, usefulness depends on project awareness. The model needs to understand what scene is open, what objects exist, what scripts are relevant, what pipeline is being used, what errors are happening, and what conventions the project already follows.&lt;/p&gt;

&lt;p&gt;Without that, you end up with generic “AI-generated advice.”&lt;br&gt;
With that, you get much closer to a genuinely useful AI assistant or agent.&lt;/p&gt;

&lt;p&gt;Open sourcing the MCP server is our way of pushing that interface forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example of the difference
&lt;/h2&gt;

&lt;p&gt;A normal coding assistant might help with:&lt;br&gt;
“Write me a script for enemy spawning.”&lt;/p&gt;

&lt;p&gt;A Unity-connected MCP can help more like this:&lt;br&gt;
“Find how enemy spawning currently works in my project, inspect the related scripts, create a new spawn manager, wire it into the scene, and adjust the exposed values to match the existing design.”&lt;/p&gt;

&lt;p&gt;That difference is what we care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture at a high level
&lt;/h2&gt;

&lt;p&gt;The setup is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Unity bridge package runs inside the editor&lt;/li&gt;
&lt;li&gt;the MCP server connects to that bridge&lt;/li&gt;
&lt;li&gt;your AI client talks to the MCP server over stdio or HTTP&lt;/li&gt;
&lt;li&gt;the model gets tool access plus Unity-specific context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of copy-pasting back and forth between your IDE, a chatbot, and Unity, the agent can operate much closer to the actual source of truth.&lt;/p&gt;
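&lt;p&gt;For the stdio case, registering the server in an MCP client typically looks like the snippet below (illustrative only: the server name and launch command are assumptions; see the repo README for the exact invocation):&lt;/p&gt;

```json
{
  "mcpServers": {
    "gladekit-unity": {
      "command": "npx",
      "args": ["-y", "glade-mcp-unity"]
    }
  }
}
```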

&lt;h2&gt;
  
  
  Why this matters beyond GladeKit
&lt;/h2&gt;

&lt;p&gt;I think game dev is one of the most interesting places for MCP-style tooling.&lt;/p&gt;

&lt;p&gt;Game development has a huge amount of structured-but-fragmented work:&lt;br&gt;
editor actions, asset references, scene state, component wiring, engine-specific APIs, and long chains of small tasks that are annoying to do manually but difficult to solve with plain text generation alone.&lt;/p&gt;

&lt;p&gt;That makes it a really good fit for agent tooling with real tool access.&lt;/p&gt;

&lt;p&gt;My guess is we’ll see more of this pattern across game engines and other developer tools - not just AI that answers questions, but AI that can actually operate in the environment where the work is happening.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;p&gt;Open-source MCP repo:&lt;br&gt;
&lt;a href="https://github.com/Glade-tool/glade-mcp-unity" rel="noopener noreferrer"&gt;https://github.com/Glade-tool/glade-mcp-unity&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GladeKit site:&lt;br&gt;
&lt;a href="https://gladekit.com" rel="noopener noreferrer"&gt;https://gladekit.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Product Hunt launch:&lt;br&gt;
&lt;a href="https://www.producthunt.com/products/gladekit?launch=gladekit" rel="noopener noreferrer"&gt;https://www.producthunt.com/products/gladekit?launch=gladekit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback from anyone building AI dev tools, working with MCP, or trying to make Unity workflows faster.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gamedev</category>
      <category>unity3d</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Playing HEVC in a Browser Without Plugin — An H.265 Decoder in WebAssembly</title>
      <dc:creator>Thibaut Lion</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:00:42 +0000</pubDate>
      <link>https://forem.com/privaloops/playing-hevc-in-a-browser-without-plugin-an-h265-decoder-in-webassembly-4ag0</link>
      <guid>https://forem.com/privaloops/playing-hevc-in-a-browser-without-plugin-an-h265-decoder-in-webassembly-4ag0</guid>
      <description>&lt;h2&gt;
  
  
  The Problem — HEVC Everywhere Except the Browser
&lt;/h2&gt;

&lt;p&gt;HEVC/H.265 is the standard codec for Netflix, Apple, broadcasters, 4K/HDR. It saves 30-50% bandwidth versus H.264 at equivalent quality — millions in annual CDN savings for streaming services.&lt;/p&gt;

&lt;p&gt;But browser support is a mess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;macOS&lt;/strong&gt; — Safari, Chrome, Edge, Firefox all decode HEVC natively via VideoToolbox. No extension needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome 107+ on Windows&lt;/strong&gt; — uses D3D11VA directly. No Microsoft extension required, but needs a GPU with hardware HEVC decoder (Intel Skylake 2015+, NVIDIA Maxwell 2nd gen+, AMD Fiji+). No software fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edge on Windows&lt;/strong&gt; — uses Media Foundation. &lt;strong&gt;Requires&lt;/strong&gt; the Microsoft &lt;a href="https://apps.microsoft.com/detail/9nmzlz57r3t7" rel="noopener noreferrer"&gt;HEVC Video Extension&lt;/a&gt; ($1 on the Store). Without it, no HEVC regardless of GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firefox 133+ on Windows&lt;/strong&gt; — same MFT path, same extension dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linux&lt;/strong&gt; — Chrome with VAAPI, maybe. Firefox, no.&lt;/p&gt;

&lt;p&gt;The root cause is licensing. MPEG LA and Access Advance impose per-unit royalties. Microsoft passes this to users via the Store extension. Google negotiated a direct D3D11VA path. Mozilla relies on Microsoft's extension. The result: publishers must either encode everything twice (H.264 + HEVC) or accept that some users get a black screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution — Decode HEVC Client-Side in WebAssembly
&lt;/h2&gt;

&lt;p&gt;What if the browser didn't need to know it's playing HEVC?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/privaloops/hevc.js" rel="noopener noreferrer"&gt;hevc.js&lt;/a&gt; decodes HEVC in a Web Worker and re-encodes to H.264 via WebCodecs, delivering standard H.264 to Media Source Extensions. The player doesn't know it's happening.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fMP4 HEVC → mp4box.js (demux) → NAL units
         → WASM H.265 decoder → YUV frames
         → WebCodecs VideoEncoder → H.264
         → custom fMP4 muxer → MSE → &amp;lt;video&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HEVC decoder is a from-scratch C++17 implementation of ITU-T H.265 (716 pages), compiled to WebAssembly. 236 KB gzipped. Zero dependencies. No special server headers needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  dash.js integration
&lt;/h3&gt;

&lt;p&gt;The plugin intercepts &lt;code&gt;MediaSource.addSourceBuffer()&lt;/code&gt;. When dash.js creates an HEVC SourceBuffer, a proxy accepts the HEVC MIME type but feeds the real SourceBuffer with H.264. ABR, seek, live — everything works unmodified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;dashjs&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dashjs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;attachHevcSupport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@hevcjs/dashjs-plugin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;player&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dashjs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MediaPlayer&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;attachHevcSupport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;workerUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/transcode-worker.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;wasmUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/hevc-decode.js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;player&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;videoElement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;mpdUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Smart detection
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MediaSource.isTypeSupported()&lt;/code&gt; can lie — Firefox on Windows reports HEVC support even without the Video Extension installed. So hevc.js creates a real SourceBuffer to probe, and activates transcoding only if that fails. When native HEVC works, the overhead is zero and the WASM module is never loaded.&lt;/p&gt;
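&lt;p&gt;The probe can be sketched roughly like this (a minimal illustration, not hevc.js's actual internals; the function name and codec string are assumptions, and the MediaSource is passed in so the helper can be exercised outside a browser):&lt;/p&gt;

```javascript
// Trust a real SourceBuffer creation attempt, not isTypeSupported(),
// which can report false positives (e.g. Firefox on Windows).
function probeHevcSupport(mediaSource, mime) {
  mime = mime || 'video/mp4; codecs="hvc1.1.6.L93.B0"';
  try {
    var sb = mediaSource.addSourceBuffer(mime);
    // Remove the probe buffer so it cannot interfere with playback.
    if (typeof mediaSource.removeSourceBuffer === 'function') {
      mediaSource.removeSourceBuffer(sb);
    }
    return true;  // native path works: skip transcoding, never load WASM
  } catch (err) {
    return false; // no real support: activate the WASM transcode path
  }
}
```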

&lt;h2&gt;
  
  
  Browser Compatibility
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Browser + OS&lt;/th&gt;
&lt;th&gt;Native HEVC&lt;/th&gt;
&lt;th&gt;hevc.js activates?&lt;/th&gt;
&lt;th&gt;Transcoding?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Safari 13+ (macOS/iOS)&lt;/td&gt;
&lt;td&gt;Yes (VideoToolbox)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome/Edge/Firefox (Mac)&lt;/td&gt;
&lt;td&gt;Yes (VideoToolbox)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome 107+ (Win, HEVC GPU)&lt;/td&gt;
&lt;td&gt;Yes (D3D11VA)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome 107+ (Win, no HEVC GPU)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge (Win, with extension)&lt;/td&gt;
&lt;td&gt;Yes (MFT)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge (Win, no extension)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firefox 133+ (Win, with extension)&lt;/td&gt;
&lt;td&gt;Yes (MFT)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Firefox 133+ (Win, no extension)&lt;/td&gt;
&lt;td&gt;False positive&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome/Edge 94-106&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome (Linux, no VAAPI)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Requirements: WebAssembly, Web Workers, Secure Context (HTTPS), WebCodecs with H.264 encoding support.&lt;/p&gt;
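&lt;p&gt;Those four requirements can be feature-detected up front before loading anything (a hedged sketch; the function name is an assumption, and a real check of H.264 encoding support would additionally call &lt;code&gt;VideoEncoder.isConfigSupported()&lt;/code&gt;):&lt;/p&gt;

```javascript
// Cheap up-front check of the hard requirements. `env` defaults to the
// global object; passing a mock makes the helper unit-testable.
function meetsRequirements(env) {
  env = env || globalThis;
  var checks = [
    typeof env.WebAssembly === 'object',    // WebAssembly
    typeof env.Worker === 'function',       // Web Workers
    env.isSecureContext === true,           // Secure Context (HTTPS)
    typeof env.VideoEncoder === 'function'  // WebCodecs encoder present
  ];
  return checks.every(Boolean);
}
```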

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;Single-threaded, Apple Silicon:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Native C++&lt;/th&gt;
&lt;th&gt;WASM (Chrome)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1080p decode&lt;/td&gt;
&lt;td&gt;76 fps&lt;/td&gt;
&lt;td&gt;61 fps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4K decode&lt;/td&gt;
&lt;td&gt;28 fps&lt;/td&gt;
&lt;td&gt;21 fps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1080p transcode&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;~2.5x realtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;WASM reaches 80% of native C++ speed, and 83% of libde265 (a mature 10-year-old HEVC decoder) when both are compiled to WASM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conformance&lt;/strong&gt;: 128/128 test bitstreams pixel-perfect against ffmpeg. Zero drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tradeoff
&lt;/h2&gt;

&lt;p&gt;The first segment takes 2-3 seconds to transcode — that's the startup latency cost of software decode versus native hardware. After buffering, playback is smooth.&lt;/p&gt;

&lt;p&gt;This makes hevc.js a good fit for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming platforms with existing HEVC catalogs&lt;/li&gt;
&lt;li&gt;Infrastructure simplification (single HEVC pipeline, no H.264 fallback)&lt;/li&gt;
&lt;li&gt;VOD or moderate-latency live&lt;/li&gt;
&lt;li&gt;Controlled environments (IPTV, B2B)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not ideal for: low-end mobile (CPU/battery), 4K on underpowered machines, or ultra-low-latency live sports.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live demo&lt;/strong&gt;: &lt;a href="https://hevcjs.dev/demo/dash.html" rel="noopener noreferrer"&gt;hevcjs.dev/demo/dash.html&lt;/a&gt; — toggle "Force transcoding" to test the WASM path even if your browser has native HEVC.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @hevcjs/dashjs-plugin dashjs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/privaloops/hevc.js" rel="noopener noreferrer"&gt;github.com/privaloops/hevc.js&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MIT license. Feedback and contributions welcome.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>javascript</category>
      <category>video</category>
      <category>streaming</category>
    </item>
    <item>
      <title>How to Build a Remote Job Alert System (No API Key Required)</title>
      <dc:creator>agenthustler</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:00:09 +0000</pubDate>
      <link>https://forem.com/agenthustler/how-to-build-a-remote-job-alert-system-no-api-key-required-5f5e</link>
      <guid>https://forem.com/agenthustler/how-to-build-a-remote-job-alert-system-no-api-key-required-5f5e</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Job Board Notifications
&lt;/h2&gt;

&lt;p&gt;Most job boards have email alerts, but they're noisy and limited. You can't filter by salary range, tech stack, or specific keywords in the description. You can't combine alerts from multiple boards into one feed. And you definitely can't pipe the results into your own tools.&lt;/p&gt;

&lt;p&gt;Let's fix that. In this tutorial, we'll build a remote job alert system that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls fresh listings from remote job boards every few hours&lt;/li&gt;
&lt;li&gt;Filters by your criteria (keywords, salary, location)&lt;/li&gt;
&lt;li&gt;Sends you a clean email digest&lt;/li&gt;
&lt;li&gt;Runs on autopilot with zero API keys to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data source&lt;/strong&gt;: &lt;a href="https://apify.com/cryptosignals/weworkremotely-scraper" rel="noopener noreferrer"&gt;WeWorkRemotely Scraper&lt;/a&gt; on Apify (handles the data collection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling&lt;/strong&gt;: Apify's built-in scheduler (or cron if self-hosting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering + alerts&lt;/strong&gt;: A simple Python script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email&lt;/strong&gt;: SMTP (Gmail, SendGrid, or any provider)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Set Up Automated Data Collection
&lt;/h2&gt;

&lt;p&gt;Create a free Apify account and find the WeWorkRemotely Scraper in the store. Configure it with your search parameters and set it to run on a schedule (every 6 hours works well for job listings).&lt;/p&gt;

&lt;p&gt;Each run produces a dataset of JSON objects like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Senior Python Developer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"company"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Acme Corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://weworkremotely.com/listings/acme-senior-python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Programming"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-04-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$120k - $160k"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We're looking for a senior Python developer..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Filter and Alert with Python
&lt;/h2&gt;

&lt;p&gt;Here's a complete script that fetches the latest results, filters them, and sends an email:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;email.mime.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MIMEText&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="c1"&gt;# Config
&lt;/span&gt;&lt;span class="n"&gt;APIfY_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_apify_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;DATASET_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_dataset_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# From the scheduled run
&lt;/span&gt;&lt;span class="n"&gt;EMAIL_FROM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alerts@yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;EMAIL_TO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;you@yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;SMTP_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;smtp.gmail.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;SMTP_PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;587&lt;/span&gt;
&lt;span class="n"&gt;SMTP_USER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;SMTP_PASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_app_password&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Keywords to match (case-insensitive)
&lt;/span&gt;&lt;span class="n"&gt;KEYWORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fastapi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data engineer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;backend&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;MIN_SALARY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;  &lt;span class="c1"&gt;# Optional: filter by minimum salary
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_jobs&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pull latest job listings from Apify dataset.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.apify.com/v2/datasets/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/items&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;APIFY_TOKEN&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;matches_criteria&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a job matches our filter criteria.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Format matching jobs into a readable email body.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; matching remote jobs:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;** at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Salary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Not listed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Link: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send the digest via SMTP.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MIMEText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;From&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EMAIL_FROM&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;To&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EMAIL_TO&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SMTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMTP_HOST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SMTP_PORT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;starttls&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SMTP_USER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SMTP_PASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_jobs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;matching&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;matches_criteria&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;subject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; new remote jobs matching your criteria&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sent digest with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matching&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; jobs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No matching jobs found&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Run It on a Schedule
&lt;/h2&gt;

&lt;p&gt;You have a few options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Apify webhook&lt;/strong&gt; — Set up a webhook on your scheduled actor run that hits your script endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron job&lt;/strong&gt; — Run the Python script every 6 hours on any server or even a Raspberry Pi&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; — Free scheduled workflows that can run this script&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For GitHub Actions, create &lt;code&gt;.github/workflows/job-alerts.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job Alerts&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*/6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.12'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install requests&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python job_alerts.py&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;APIFY_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.APIFY_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
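&lt;p&gt;One mismatch to watch: the workflow passes &lt;code&gt;APIFY_TOKEN&lt;/code&gt; as an environment variable, while the script above hardcodes its config as constants. To make the two line up, read the config from the environment instead. A minimal sketch, with variable names mirroring the script's constants:&lt;/p&gt;

```python
import os

# Pull secrets from the environment instead of hardcoding them.
# Names mirror the config constants in the script above; the
# fallback values are placeholders for local testing only.
APIFY_TOKEN = os.environ.get('APIFY_TOKEN', '')
DATASET_ID = os.environ.get('DATASET_ID', '')
EMAIL_FROM = os.environ.get('EMAIL_FROM', 'alerts@yourdomain.com')
EMAIL_TO = os.environ.get('EMAIL_TO', 'you@yourdomain.com')
SMTP_HOST = os.environ.get('SMTP_HOST', 'smtp.gmail.com')
SMTP_PORT = int(os.environ.get('SMTP_PORT', '587'))
SMTP_USER = os.environ.get('SMTP_USER', '')
SMTP_PASS = os.environ.get('SMTP_PASS', '')
```

&lt;p&gt;Add the remaining values (&lt;code&gt;DATASET_ID&lt;/code&gt; and the SMTP credentials) as repository secrets and list them under &lt;code&gt;env:&lt;/code&gt; alongside &lt;code&gt;APIFY_TOKEN&lt;/code&gt;.&lt;/p&gt;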



&lt;h2&gt;
  
  
  Extending It
&lt;/h2&gt;

&lt;p&gt;Once the basic system works, you can add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple sources&lt;/strong&gt; — Add RemoteOK, Indeed, or other boards to the same pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt; — Track seen job URLs in a simple JSON file or SQLite database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slack/Discord alerts&lt;/strong&gt; — Replace the email function with a webhook POST&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salary parsing&lt;/strong&gt; — Extract numeric ranges and filter more precisely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; — Push results to a Google Sheet for tracking over time&lt;/li&gt;
&lt;/ul&gt;
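&lt;p&gt;As a starting point for the salary parsing idea, here is an illustrative sketch. It assumes salary strings shaped like the sample data's &lt;code&gt;"$120k - $160k"&lt;/code&gt;; real listings will need more robust handling:&lt;/p&gt;

```python
import re

def parse_min_salary(salary_text):
    """Extract the lower bound of a range like '$120k - $160k'.

    Returns an int in dollars, or None when no number is found.
    """
    if not salary_text:
        return None
    # First number wins: optional '$', digits with optional
    # thousands separators, optional 'k' multiplier.
    match = re.search(r'\$?(\d+(?:,\d{3})*)\s*(k?)', salary_text,
                      re.IGNORECASE)
    if not match:
        return None
    value = int(match.group(1).replace(',', ''))
    if match.group(2).lower() == 'k':
        value *= 1000
    return value
```

&lt;p&gt;With this in place, &lt;code&gt;main()&lt;/code&gt; could filter on &lt;code&gt;(parse_min_salary(j.get('salary')) or 0) &gt;= MIN_SALARY&lt;/code&gt;, finally putting the &lt;code&gt;MIN_SALARY&lt;/code&gt; config value to work.&lt;/p&gt;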

&lt;h2&gt;
  
  
  Why This Beats Built-In Alerts
&lt;/h2&gt;

&lt;p&gt;Job board email alerts give you everything that matches a single keyword. This system lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Combine multiple boards into one feed&lt;/li&gt;
&lt;li&gt;Apply complex filters (salary + keywords + category)&lt;/li&gt;
&lt;li&gt;Control the format and delivery channel&lt;/li&gt;
&lt;li&gt;Keep a historical record of listings&lt;/li&gt;
&lt;li&gt;Build on top of it (analytics, auto-apply, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The whole setup takes about 20 minutes, runs for free (within Apify's free tier and GitHub Actions limits), and you'll never miss a relevant remote job posting again.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your current job search automation setup? I'd love to hear what tools people are using — drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Cinematic Product Videos with fal.ai and Kling 3.0 for $1 a Scene</title>
      <dc:creator>Ben Utting</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:00:00 +0000</pubDate>
      <link>https://forem.com/benutting/cinematic-product-videos-with-falai-and-kling-30-for-a-1-a-scene-go7</link>
      <guid>https://forem.com/benutting/cinematic-product-videos-with-falai-and-kling-30-for-a-1-a-scene-go7</guid>
      <description>&lt;p&gt;A client needed social media videos of their product in six different lifestyle scenes. Professional shoots would have cost thousands per location. We did all six for about $6 total, in under an hour.&lt;/p&gt;

&lt;p&gt;The pipeline is two API calls: one to place the real product into a generated scene, one to animate it into a 5-second video with sound. Both run through fal.ai.&lt;/p&gt;

&lt;h2&gt;
  
  
  The brief
&lt;/h2&gt;

&lt;p&gt;The client had a small physical product and a solid brand page with plenty of existing content. He sent me an AI-generated video he'd seen of someone walking through New York that seamlessly featured a product. He wanted something similar for his own brand: cinematic scenes showing the product in restaurant and bar settings, generated entirely from a single product photo.&lt;/p&gt;

&lt;p&gt;The goal was to build a repeatable skill that could produce these scenes on demand, not just a one-off video.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: place the product into a scene
&lt;/h2&gt;

&lt;p&gt;The first script uses Google's Nano Banana 2 edit model via fal.ai. You give it a reference photo of the real product and a text prompt describing the scene you want. It generates a new image with the product placed naturally into that environment, preserving the product's appearance, label, and proportions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python generate_kontext.py product_photo.jpg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Product on white linen table, candlelit restaurant, beside wine glass, warm golden light, cinematic"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--variations&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--variations 5&lt;/code&gt; flag is important. AI image generation is inconsistent. Out of five attempts, usually two or three look good. One will be excellent. The rest get discarded. At $0.04 per image, generating five costs $0.20. Cheap enough to always overshoot.&lt;/p&gt;

&lt;p&gt;One thing I learned: prompts need a scale anchor. If the product is small, the model will sometimes scale it up to fill the scene. Always include a size reference in the prompt: a wine glass, a hand, a plate. Something that tells the model how big the product actually is relative to its surroundings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: animate the winner
&lt;/h2&gt;

&lt;p&gt;The second script takes the best image from Step 1 and turns it into a 5-second video using Kling 3.0 Pro, also via fal.ai. It generates native audio too: sizzling sounds for a kitchen scene, ambient restaurant noise, clinking glasses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python generate_video.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Hand reaches for product, picks it up, tilts gently, slow motion"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image_url&lt;/span&gt; &lt;span class="s2"&gt;"https://fal.media/files/..."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cfg_scale&lt;/span&gt; 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cfg_scale&lt;/code&gt; setting matters. The default (0.5) gives the model creative freedom, which is fine for abstract content but bad for product shots. Setting it to 1.0 forces the model to follow the prompt closely. For product content, you want maximum adherence: the product should stay in frame, the motion should be what you described, nothing should morph or distort.&lt;/p&gt;

&lt;p&gt;One video takes 60 to 180 seconds to generate and costs about $0.80. Combined with the image step, a full scene (5 image variations + 1 video) runs to about $1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scenes we built
&lt;/h2&gt;

&lt;p&gt;We created a prompt library with six scenes, each with an image prompt and a matching motion prompt. Restaurant lifestyle, in-hand close-ups, kitchen action shots, moody food pairings, textured product beauty shots, and bar settings.&lt;/p&gt;

&lt;p&gt;Each scene follows the same workflow: two commands, one decision (pick the best of five images), one output (a 5-second video with audio). Total cost for all six scenes: about $6. Total time: under an hour, including prompt iteration.&lt;/p&gt;

&lt;p&gt;The prompt library is the reusable part. Once you've dialled in the style and scale for one product, adapting it for another is just swapping the product description and the reference photo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Batch the image generation.&lt;/strong&gt; Right now each scene is a separate script invocation. A wrapper that runs all six scenes, generates all 30 images, and presents them for review in one pass would save time.&lt;/p&gt;
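&lt;p&gt;A sketch of what that wrapper could look like. The scene prompts here are placeholder examples and &lt;code&gt;generate_kontext.py&lt;/code&gt; is the Step 1 script; by default the wrapper only prints the commands it would run:&lt;/p&gt;

```python
import subprocess

# Placeholder scene library; the real prompts live in your prompt library.
SCENES = {
    'restaurant': 'Product on white linen table, candlelit restaurant, beside wine glass',
    'bar': 'Product on polished bar counter, moody low light, cocktail in background',
}

def build_commands(photo, variations=5, dry_run=True):
    """Build one generate_kontext.py invocation per scene.

    With dry_run=False, each command is executed in sequence.
    """
    commands = []
    for name, prompt in SCENES.items():
        cmd = ['python', 'generate_kontext.py', photo, prompt,
               '--variations', str(variations)]
        commands.append((name, cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands

for name, cmd in build_commands('product_photo.jpg'):
    print(name, ' '.join(cmd))
```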

&lt;p&gt;&lt;strong&gt;Test 9:16 for Stories and Reels.&lt;/strong&gt; All our content was 16:9. Kling supports 9:16 for vertical video, but only in text-to-video mode (not image-to-video). For Instagram Reels, you'd need to either crop or generate the initial image at 9:16.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build a prompt template system.&lt;/strong&gt; The prompt library works, but it's manual. A template where you swap in the product name, size description, and setting would make this reusable across clients without rewriting prompts from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works for small brands
&lt;/h2&gt;

&lt;p&gt;This client is a bootstrapped D2C brand. There's no budget for location shoots across six restaurants. But the social content needs to look premium because the product is premium.&lt;/p&gt;

&lt;p&gt;This pipeline delivers that. Five minutes per scene, a dollar per video, and the output looks like it came from a production studio. The client picks from five image options, approves one, and gets a ready-to-post video with sound. No photographer, no stylist, no venue booking.&lt;/p&gt;

&lt;p&gt;If you're selling a physical product and need lifestyle content at scale, this exact pipeline works. Two scripts, one API key, and a good product photo to start from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ctrlaltautomate.com&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>16 Ways to Make a Small Language Model Think Bigger</title>
      <dc:creator>Wojtek Pluta</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:56:58 +0000</pubDate>
      <link>https://forem.com/oracledevs/16-ways-to-make-a-small-language-model-think-bigger-2lbo</link>
      <guid>https://forem.com/oracledevs/16-ways-to-make-a-small-language-model-think-bigger-2lbo</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article is syndicated from the original post on &lt;a href="https://blogs.oracle.com/developers/16-ways-to-make-a-small-language-model-think-bigger" rel="noopener noreferrer"&gt;blogs.oracle.com&lt;/a&gt;. Read the canonical version there for the latest updates.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;All of the code in this article is available in the &lt;a href="https://github.com/oracle-devrel/oracle-ai-developer-hub" rel="noopener noreferrer"&gt;Oracle AI Developer Hub&lt;/a&gt;. The repository is part of Oracle’s open-source AI collection and serves as the reference implementation for everything covered here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can install it with &lt;code&gt;pip install agent-reasoning&lt;/code&gt;, browse the 16 agent classes, run the TUI, or integrate it directly into an existing Ollama pipeline as a zero-change replacement client. If you find it useful, a GitHub star goes a long way.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Small language models struggle with complex reasoning on their own, but agent-based architectures (like Tree of Thoughts or Self-Consistency) can significantly improve their performance.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;agent-reasoning&lt;/code&gt; framework adds 16 research-backed reasoning strategies to any Ollama model using a simple &lt;code&gt;+strategy&lt;/code&gt; tag—no code changes required.&lt;/li&gt;
&lt;li&gt;Different strategies suit different tasks: CoT works well overall, ReAct excels with external data, and branching methods improve accuracy at the cost of speed.&lt;/li&gt;
&lt;li&gt;Much of modern AI progress comes from orchestration (prompting, search, control flow), not just larger models.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Generally, a 270M parameter LLM (as of today, April 2026) struggles with even basic multi-step reasoning. Ask a model like &lt;code&gt;gemma3:270m&lt;/code&gt; to solve the classic water jug problem, and it will often return a confidently incorrect answer—much like other small language models (SLMs) of similar size and training.&lt;/p&gt;

&lt;p&gt;However, take that same model and wrap it inside a Tree of Thoughts (ToT) agent, running a breadth-first search (BFS) with three levels and weighted branches, and it can reliably solve the puzzle. The improvement comes from the architecture: the agent distributes the reasoning process across structured exploration steps, compensating for the limitations of a single LLM call.&lt;/p&gt;

&lt;p&gt;This is where things get interesting. Much of the progress in applied AI isn't coming from bigger models alone, but from engineers rethinking how to orchestrate them—layering search, memory, and control flow on top of a standard LLM call to unlock new capabilities.&lt;/p&gt;

&lt;p&gt;This is the fundamental idea behind &lt;a href="https://github.com/oracle-devrel/oracle-ai-developer-hub/tree/main/apps/agent-reasoning" rel="noopener noreferrer"&gt;agent-reasoning&lt;/a&gt;: sixteen cognitive architectures—each backed by peer-reviewed research—can be applied to any Ollama-served model via a simple &lt;code&gt;+Strategy&lt;/code&gt; tag appended to the model name. Call &lt;code&gt;gemma3:270m+tot&lt;/code&gt; instead of &lt;code&gt;gemma3:270m&lt;/code&gt;, and the interceptor handles everything else.&lt;/p&gt;

&lt;p&gt;Below, we’ll walk through the different ways to invoke these reasoning strategies through the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You’ll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How the &lt;code&gt;ReasoningInterceptor&lt;/code&gt; intercepts model names, removes the &lt;code&gt;+Strategy&lt;/code&gt; tag, and directs traffic to one of 16 agent classes&lt;/li&gt;
&lt;li&gt;How the 16 strategies divide into four families—sequential, branching, reflective, and meta—each representing a different reasoning approach and set of trade-offs&lt;/li&gt;
&lt;li&gt;What each major strategy accomplishes in practice, focusing on implementation rather than theory&lt;/li&gt;
&lt;li&gt;Which type of problem each strategy is best suited for, based on benchmark results from March 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Interception Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; The &lt;code&gt;ReasoningInterceptor&lt;/code&gt; is a drop-in client for Ollama that parses the model name for a &lt;code&gt;+Strategy&lt;/code&gt; tag and routes traffic to one of 16 cognitive agent classes, without modifying your pre-existing code.&lt;/p&gt;

&lt;p&gt;Everything relies on a single template: add &lt;code&gt;+Strategy&lt;/code&gt; to any Ollama model name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2APLi2WumhUe2et_POG0V_Og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2APLi2WumhUe2et_POG0V_Og.png" title="Using ReasoningInterceptor as a drop-in replacement client" alt="Using ReasoningInterceptor as a drop-in replacement client; strategy routing can be enabled via model name tags (e.g., +tot)." width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Using ReasoningInterceptor as a drop-in replacement client&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The image below illustrates the entire routing process from start to finish. The interceptor acts as a middleman between your code and Ollama, removes the &lt;code&gt;+Strategy&lt;/code&gt; tag, and sends traffic to the correct agent class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2A5MwkQVsNUA1pqBEzsV4ACA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2A5MwkQVsNUA1pqBEzsV4ACA.png" title="Illustrating how the interceptor separates the base model from the Strategy tag" alt="Diagram illustrating how the interceptor separates the base model from the Strategy tag and directs traffic to the corresponding agent class." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustrating how the interceptor separates the base model from the Strategy tag&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;agent_map&lt;/code&gt; contains over fifty-five aliases mapped to sixteen agent classes. For example, &lt;code&gt;cot&lt;/code&gt;, &lt;code&gt;chain_of_thought&lt;/code&gt;, and &lt;code&gt;CoT&lt;/code&gt; all map to &lt;code&gt;CotAgent&lt;/code&gt;, while &lt;code&gt;mcts&lt;/code&gt; and &lt;code&gt;monte_carlo&lt;/code&gt; map to &lt;code&gt;MCTSAgent&lt;/code&gt;. Because the interceptor is a drop-in client for Ollama—supporting the same &lt;code&gt;.generate()&lt;/code&gt; and &lt;code&gt;.chat()&lt;/code&gt; APIs—existing LangChain pipelines, web UIs, and scripts can automatically gain reasoning capabilities by changing a single string in the model name.&lt;/p&gt;

&lt;p&gt;Additionally, the interceptor can be used as a network proxy. Instead of pointing an Ollama-compatible application at &lt;code&gt;http://localhost:11434&lt;/code&gt;, direct it to &lt;code&gt;http://localhost:8080&lt;/code&gt;. With a model name like &lt;code&gt;gemma3:270m+CoT&lt;/code&gt;, the gateway applies reasoning transparently.&lt;/p&gt;
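
&lt;p&gt;The routing core is easy to picture in a few lines. The sketch below is a simplified reconstruction, not the library’s actual source—the real &lt;code&gt;agent_map&lt;/code&gt; holds over fifty-five aliases, and the function name here is hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified reconstruction of the interceptor's tag parsing (illustrative;
# the real agent_map maps 55+ aliases to 16 agent classes).
AGENT_MAP = {
    "cot": "CotAgent", "chain_of_thought": "CotAgent",
    "tot": "ToTAgent", "tree_of_thoughts": "ToTAgent",
    "mcts": "MCTSAgent", "monte_carlo": "MCTSAgent",
    "meta": "MetaReasoningAgent", "auto": "MetaReasoningAgent",
}

def parse_model_name(name):
    """Split 'gemma3:270m+CoT' into (base_model, agent_class_or_None)."""
    base, sep, tag = name.partition("+")
    if not sep:                  # no tag: pass the request straight to Ollama
        return base, None
    return base, AGENT_MAP.get(tag.lower())

print(parse_model_name("gemma3:270m+tot"))   # ('gemma3:270m', 'ToTAgent')
print(parse_model_name("gemma3:270m"))       # ('gemma3:270m', None)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Untagged model names pass through unchanged, which is what makes the client safe to drop into existing code.&lt;/p&gt;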

&lt;h2&gt;
  
  
  Family 1: Sequential Strategies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Sequential Strategies process problems in a linear chain, where each step feeds into the next. In benchmarks, CoT achieved 88.7% average accuracy, compared to 81.3% for standard generation on the same model and weights.&lt;/p&gt;

&lt;p&gt;Each of the sixteen strategies falls into one of four families. The diagram below illustrates how they are grouped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AqIVVyTPUDA2luQCNkzWgKw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AqIVVyTPUDA2luQCNkzWgKw.png" title="Categorization of the four strategy families" alt="Categorization of the four Strategy families: sequential, branching, reflective, and meta. Each route leads to a specific type of reasoning agent. The fastest Sequential Strategies occupy the top-left quadrant while slower Branching strategies sacrifice speed for increased accuracy." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Categorization of the four strategy families&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sequential strategies are designed for high-speed processing with minimal latency. They are ideal for problems with discrete, sequential steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chain of Thought (CoT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Wei et al. (2022), &lt;a href="https://arxiv.org/abs/2201.11903" rel="noopener noreferrer"&gt;“Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Chain of Thought (CoT) is a prompting strategy in which the model generates intermediate reasoning steps before producing a final response. As noted in the original paper: prompting a model to produce these intermediate steps can significantly improve accuracy.&lt;/p&gt;

&lt;p&gt;For example, standard prompting on GSM8K achieves 66.7% accuracy. With CoT prompting, this increases to 73.3%—a roughly 10% relative improvement achieved through simple prompt design alone.&lt;/p&gt;

&lt;p&gt;The following graphic illustrates how CoT chains appear in practice: a sequence of numbered steps, each building on the previous one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2ANwSyAs818bWZ3mCEDW2lOg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2ANwSyAs818bWZ3mCEDW2lOg.png" title="CoT in operation" alt="Visual representation of CoT in operation: the model sequentially progresses through numbered steps (step 1…step n). Each subsequent step depends on previously generated steps. The numbering in the prompt is the only special instruction provided." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CoT in operation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In terms of implementation within &lt;code&gt;CotAgent&lt;/code&gt;, the query is wrapped in a structured prompt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AolcatRJAj5naE6svAHQbOA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AolcatRJAj5naE6svAHQbOA.png" title="Structured prompting enforces step-by-step reasoning in CoTAgent" alt="Structured prompting enforces step-by-step reasoning in CoTAgent" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Structured prompting enforces step-by-step reasoning in CoTAgent&lt;/em&gt;&lt;/p&gt;
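
&lt;p&gt;As a rough sketch, the wrapping amounts to something like the following—the template wording here is an assumption for illustration, not the actual &lt;code&gt;CotAgent&lt;/code&gt; prompt (see the repository for that):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cot_prompt(query):
    """Wrap a query in a numbered step-by-step template (hypothetical wording)."""
    return (
        "Solve the following problem by reasoning in numbered steps.\n"
        "Write 'Step 1:', 'Step 2:', and so on, then finish with 'Answer:'.\n\n"
        f"Problem: {query}"
    )

print(cot_prompt("A train covers 60 km in 45 minutes. What is its speed in km/h?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;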

&lt;p&gt;Benchmark result for qwen3.5:9b (9.7B): CoT achieves &lt;strong&gt;88.7% average accuracy&lt;/strong&gt; across GSM8K (math), MMLU (logic), and ARC-Challenge (reasoning), compared to 81.3% for standard generation. This seven-point gain is attributable solely to the structured prompt; identical weights and temperature were used in both runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Math word problems; logic puzzles; any multi-step reasoning task where the individual steps are sequential and do not have branches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decomposed Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Khot et al. (2022), &lt;a href="https://arxiv.org/abs/2210.02406" rel="noopener noreferrer"&gt;“Decomposed Prompting: A Modular Approach for Solving Complex Tasks”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Decomposed prompting is an architectural module that splits large problems into smaller sub-problems. Each sub-problem is handled independently while carrying forward accumulated context from earlier steps. Once all sub-problems are processed, their outputs are synthesized into a final result. &lt;code&gt;DecomposedAgent&lt;/code&gt; follows a three-phase process—decomposition, execution, and synthesis—propagating context throughout so that each step can build on prior results.&lt;/p&gt;
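
&lt;p&gt;The three phases can be sketched with a stub &lt;code&gt;llm&lt;/code&gt; callable standing in for the Ollama calls (function and prompt wording are illustrative, not the actual &lt;code&gt;DecomposedAgent&lt;/code&gt; source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def decomposed_solve(query, llm):
    """Decompose, execute with accumulated context, then synthesize.
    `llm` is any callable mapping a prompt string to a response string."""
    subproblems = llm(f"Split into sub-problems, one per line: {query}").splitlines()
    context = []
    for sub in subproblems:                       # execution phase
        answer = llm(f"Context so far: {context}\nSolve: {sub}")
        context.append((sub, answer))             # carried into later steps
    return llm(f"Synthesize a final answer from: {context}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;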

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Planning problems; trip itinerary generation; any problem where the ultimate answer consists of multiple distinguishable parts that may be individually addressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Decomposed prompting achieved only 38.5% average accuracy in benchmark testing. This result requires context. GSM8K primarily evaluates arithmetic reasoning, where decomposing a problem like “what is 47 × 13 + 9?” introduces overhead without improving the model's ability to compute the answer.&lt;/p&gt;

&lt;p&gt;Decomposition is more effective for problems with genuinely separable components (trip planning, multi-section reports, etc.), where each part benefits from focused attention. These strengths are not captured by the benchmark, and the results reflect that mismatch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Least-to-Most Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Zhou et al. (2022), &lt;a href="https://arxiv.org/abs/2205.10625" rel="noopener noreferrer"&gt;“Least-to-Most Prompting Enables Complex Reasoning in Large Language Models”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Least-to-most prompting is a strategy that orders sub-questions from simplest to most complex, establishing prerequisite knowledge before tackling harder steps. Unlike decomposed prompting which generates arbitrary sub-problems, it enforces a deliberate progression where each step builds on the last. Knowledge is accumulated iteratively until the model reaches the final question.&lt;/p&gt;
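
&lt;p&gt;A minimal sketch of that progression, again with a stub &lt;code&gt;llm&lt;/code&gt; callable (names and prompt wording are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def least_to_most(question, llm):
    """Answer sub-questions simplest-first, feeding each answer forward."""
    steps = llm(f"List the sub-questions needed to answer '{question}', "
                "simplest first, one per line.").splitlines()
    knowledge = ""
    for step in steps:                     # accumulate prerequisite knowledge
        ans = llm(f"Given:{knowledge}\nAnswer: {step}")
        knowledge += f"\nQ: {step}\nA: {ans}"
    return llm(f"Using what we established:{knowledge}\nFinally answer: {question}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;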

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Questions with genuine prerequisites — e.g., “what is x?” before determining “how does x relate to y?”; educational style explanation sequences (“concept ladder”); tasks that require establishing foundational concepts before addressing more complex components.&lt;/p&gt;

&lt;h2&gt;
  
  
  Family 2: Branching Strategies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Branching strategies explore multiple reasoning paths simultaneously and choose the best one. ToT scored 76.7% on GSM8K math, compared to 66.7% for standard generation.&lt;/p&gt;

&lt;p&gt;More LLM calls mean higher latency—but often better answers on hard problems. Keep this trade-off in mind when running any branching strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tree of Thoughts (ToT)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Yao et al. (2023), &lt;a href="https://arxiv.org/abs/2305.10601" rel="noopener noreferrer"&gt;“Tree of Thoughts: Deliberate Problem Solving with Large Language Models”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ToT is a search-based methodology that explores numerous possible reasoning paths concurrently and selects the best-performing path according to an evaluation metric, such as distance to the goal or the quality of intermediate solutions.&lt;/p&gt;

&lt;p&gt;Similar to chess engines, ToT applies BFS through an expanding tree of possible solutions. The core idea is straightforward: generate multiple partial solutions, evaluate them, prune weaker candidates, and continue exploring the most promising branches.&lt;/p&gt;

&lt;p&gt;Below is an illustration of how ToT generates and eliminates branches: green nodes represent surviving branches, while red nodes indicate those that have been eliminated. The final answer is derived from the highest scoring leaf node.&lt;/p&gt;

&lt;p&gt;A key design decision is how branches are evaluated. Should the same model handle both generation and scoring, or should a stronger model be introduced as a judge? In these benchmarks, the same model was used for both roles, but this is an area worth experimenting with, depending on your accuracy and latency constraints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AQHJPySSkNpDOji9BCKz-Ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AQHJPySSkNpDOji9BCKz-Ng.png" title="Generating candidate branches at each level" alt="Illustration of how to generate candidate branches at each level; score candidate branches between 0 &amp;amp; 1; prune low-scored candidates; continue exploring surviving high-scored candidates until all levels are exhausted and then generate final answer from most promising leaf node." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Generating candidate branches at each level&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ToTAgent&lt;/code&gt; implements this with configurable &lt;code&gt;depth&lt;/code&gt; (default 3) and &lt;code&gt;width&lt;/code&gt; (default 2 branches). At every level, the agent generates a set of candidate next steps, evaluates them using a scoring function, prunes low-scoring options, and expands the remaining candidates into the next level.&lt;/p&gt;
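
&lt;p&gt;The BFS loop can be sketched as follows, with &lt;code&gt;propose&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; standing in for the LLM calls the real agent makes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import heapq

def tot_search(problem, propose, score, depth=3, width=2):
    """BFS over partial reasoning paths: propose candidate steps, score
    each path, keep the best `width` per level, answer from the best leaf."""
    frontier = [""]                              # partial solution paths
    for _ in range(depth):
        candidates = [path + step
                      for path in frontier
                      for step in propose(problem, path)]
        frontier = heapq.nlargest(width, candidates, key=score)   # prune
    return max(frontier, key=score)              # highest-scoring leaf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;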

&lt;p&gt;ToT achieved &lt;strong&gt;76.7% accuracy&lt;/strong&gt;—a 10-point improvement over standard generation on GSM8K math problems. This performance comes at a cost: additional LLM calls are required at each step to evaluate candidate paths and their intermediate results, making it roughly 5-8x slower than the equivalent CoT query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Logic puzzles with multiple solution paths; strategic decision problems; tasks where multiple approaches can be explored and compared.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Consistency (Majority Voting)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Wang et al. (2022), &lt;a href="https://arxiv.org/abs/2203.11171" rel="noopener noreferrer"&gt;“Self-Consistency Improves Chain of Thought Reasoning in Language Models”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Self-Consistency is a sampling method that generates multiple independent reasoning traces and selects a final answer through majority voting. Unlike standard prompting, it relies on sampling k diverse traces at a higher temperature to encourage variation. Each trace produces a candidate answer, and the most frequently occurring answer is selected as the final output.&lt;/p&gt;

&lt;p&gt;The image below illustrates how both Self-Consistency and Monte Carlo Tree Search (MCTS) sample multiple reasoning paths, but differ fundamentally in how those paths are evaluated—majority voting versus UCB1-based exploration-exploitation balancing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AUKyufmNfjpFnSizTxD1M2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AUKyufmNfjpFnSizTxD1M2w.png" title="Self-Consistency vs MCTS comparison" alt="Left: Self-Consistency flowchart — sampling k independent traces &amp;amp; selecting most commonly occurring final answer via majority vote. Right: Monte Carlo Tree Search (MCTS) flowchart — sampling new paths through UCB1-based exploration/exploitation tradeoff balancing — both generate multiple possible answers — selection methodology differ significantly." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Self-Consistency vs MCTS comparison&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ConsistencyAgent&lt;/code&gt; uses &lt;code&gt;k=5&lt;/code&gt; samples at a temperature of &lt;code&gt;0.7&lt;/code&gt; by default. It extracts final answers using regex-based pattern matching and selects the most frequent result via &lt;code&gt;counter.most_common()&lt;/code&gt;.&lt;/p&gt;
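
&lt;p&gt;In outline, the voting step looks like this (the &lt;code&gt;Answer:&lt;/code&gt; pattern is a hypothetical answer format for illustration, not the agent’s actual regex):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
from collections import Counter

def self_consistent_answer(traces):
    """Majority vote over k reasoning traces: regex-extract each final
    answer, then return the most common one."""
    answers = []
    for t in traces:
        m = re.search(r"Answer:\s*(\S+)", t)   # hypothetical answer format
        if m:
            answers.append(m.group(1))
    return Counter(answers).most_common(1)[0][0]

traces = ["... Answer: 42", "... Answer: 41", "steps ... Answer: 42"]
print(self_consistent_answer(traces))   # 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;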

&lt;p&gt;Self-Consistency matches CoT on both MMLU (96.7%) and GSM8K (76.7%). Its advantage lies in reliability rather than raw accuracy: majority voting across independent reasoning traces reduces the risk of single-trace errors propagating to the final answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Factual question answering; multiple-choice style questions; problems where arriving at the correct answer via diverse reasoning paths is more important than inspecting a single reasoning trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Family 3: Reflective Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Self-Reflection
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Shinn et al. (2023), “Reflexion: Language Agents with Verbal Reinforcement Learning” — &lt;a href="https://arxiv.org/abs/2303.11366" rel="noopener noreferrer"&gt;arXiv:2303.11366&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Self-Reflection is a draft-critique-refine loop in which the model generates an initial answer, critiques it for errors, and then revises it. The Reflexion paper showed that this iterative process can meaningfully improve output quality, even without any gradient updates.&lt;/p&gt;

&lt;p&gt;The image below shows all three reflective strategies side by side: Self-Reflection, Debate, and Refinement Loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AGyy_CHbQa01wEnpRxsWMcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AGyy_CHbQa01wEnpRxsWMcA.png" title="Reflective strategies comparison" alt="Left: Self-Reflection drafts, critiques, and refines until the critique says “CORRECT.” Right: Debate puts PRO and CON agents against each other with a Judge scoring each round. Bottom: Refinement Loop uses a numeric quality gate (0.0–1.0) to decide when to stop iterating." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Reflective strategies comparison&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SelfReflectionAgent&lt;/code&gt; runs a draft-critique-refine loop for up to five iterations, terminating early when the critique returns “CORRECT” in under 20 characters. This keeps latency low for queries the model answers correctly on the initial pass.&lt;/p&gt;
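
&lt;p&gt;The loop shape, with stub callables in place of the three LLM calls (a sketch, not the agent’s actual source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def self_reflect(query, draft, critique, refine, max_iters=5):
    """Draft-critique-refine with early exit on a short 'CORRECT' verdict.
    The three callables stand in for LLM calls."""
    answer = draft(query)
    for _ in range(max_iters):
        verdict = critique(query, answer).strip()
        if verdict.upper().startswith("CORRECT") and len(verdict) &lt; 20:
            break                        # critique is satisfied; stop early
        answer = refine(query, answer, verdict)
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;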

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Creative writing, high-stakes technical explanations, anything where “good enough on the first try” is insufficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adversarial Debate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Irving et al. (2018), &lt;a href="https://arxiv.org/abs/1805.00899" rel="noopener noreferrer"&gt;“AI Safety via Debate”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Irving et al. proposed debate as a mechanism for improving AI safety. Two agents present opposing arguments, and a judge (either a human or another LLM) evaluates their merits. The underlying premise is that identifying flaws in weak arguments is often easier than constructing strong ones.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DebateAgent&lt;/code&gt; conducts multiple rounds of PRO and CON arguments, with a judge evaluating each exchange. Following all rounds, the strongest arguments from both sides are synthesized into a final answer that balances competing perspectives. Context is carried forward between rounds, enabling incremental refinement rather than redundant arguments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Controversial or ambiguous subjects; policy analysis; ethics and any subject matter requiring a balanced perspective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Refinement Loop
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Madaan et al. (2023), &lt;a href="https://arxiv.org/abs/2303.17651" rel="noopener noreferrer"&gt;“Self-Refine: Iterative Refinement with Self-Feedback”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This paper describes a refinement loop similar to self-reflection, but instead of relying on a human-style critique to guide revisions, it uses a machine-based evaluation system with quantifiable quality metrics. These metrics determine whether further refinement is necessary. The loop terminates when a predefined quality metric is reached (&amp;gt; 0.9 by default) or when the maximum number of iterations is exceeded.&lt;/p&gt;
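
&lt;p&gt;The quality gate reduces to a simple loop; &lt;code&gt;judge&lt;/code&gt; and &lt;code&gt;improve&lt;/code&gt; below are stand-ins for the machine-evaluation and revision calls, not the library’s actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def refine_until(text, improve, judge, threshold=0.9, max_iters=4):
    """Machine-scored refinement: `judge` returns a quality score in
    0.0-1.0; stop at the threshold or the iteration cap."""
    for _ in range(max_iters):
        if judge(text) &gt; threshold:
            break                        # quality gate passed
        text = improve(text)
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;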

&lt;p&gt;The five-stage complex refinement pipeline consists of sequential stages, each focused on a distinct type of critique: technical accuracy, structure, depth, examples, and polish.&lt;/p&gt;

&lt;p&gt;Each stage targets a distinct aspect of quality, ensuring the model focuses exclusively on improving that dimension rather than attempting to optimize everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Highly technical writing; documentation; blog posts; any scenario where production-quality output is required rather than simply a first draft.&lt;/p&gt;

&lt;h2&gt;
  
  
  Family 4: Cross-Domain and Meta Strategies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Cross-domain strategies enable sharing knowledge among disciplines, while meta-strategies automatically route queries to the most appropriate reasoning technique without requiring manual selection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analogy-Based Reasoning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Gentner (1983), &lt;a href="https://doi.org/10.1111/j.1551-6708.1983.tb00497.x" rel="noopener noreferrer"&gt;“Structure Mapping: A Theoretical Framework for Analogy”&lt;/a&gt;, Cognitive Science&lt;/p&gt;

&lt;p&gt;Gentner's structure-mapping theory proposes that analogical reasoning operates by identifying structural correspondences across domains, rather than relying on surface-level similarity. The &lt;code&gt;AnalogicalAgent&lt;/code&gt; builds on this idea through three phases: (1) identify the underlying structure independent of domain specifics, (2) generate analogous solutions from different domains that share that structure, (3) select the most effective analogy and apply its solution approach.&lt;/p&gt;

&lt;p&gt;This process reduces reliance on memorized patterns. By focusing on underlying structure, the model learns &lt;em&gt;why&lt;/em&gt; a solution works, rather than simply recalling &lt;em&gt;what&lt;/em&gt; worked before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage&lt;/strong&gt;: Solving problems that are structurally similar to prior ones, even if they differ superficially; transferring knowledge across domains; explaining complex concepts through analogy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Socratic Questioning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Paul &amp;amp; Elder (2007), &lt;a href="https://www.criticalthinking.org/" rel="noopener noreferrer"&gt;“The Art of Socratic Questioning”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Socratic Method:&lt;/strong&gt; Do not answer the question directly. Instead, ask follow-up questions that reduce ambiguity in the solution space.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SocraticAgent&lt;/code&gt; repeatedly asks questions and receives model responses, continuing until it reaches a limit of five question-response exchanges. It then synthesizes the collected information into a final answer. A deduplication or normalization step helps prevent repeated queries that differ only in wording.&lt;/p&gt;
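
&lt;p&gt;A sketch of the loop with its dedup step (the normalization shown is deliberately crude and the names are hypothetical; the library’s approach may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def socratic(query, ask, answer, max_rounds=5):
    """Question/answer loop with crude dedup so reworded repeats stop it."""
    seen, transcript = set(), []
    for _ in range(max_rounds):
        q = ask(query, transcript)
        key = "".join(q.lower().split())     # normalize wording/spacing
        if key in seen:
            break                            # the model is circling; stop
        seen.add(key)
        transcript.append((q, answer(q)))
    return transcript
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;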

&lt;p&gt;&lt;strong&gt;Recommended usage:&lt;/strong&gt; Philosophy; ethics; deep technical knowledge; any field requiring the model to “know” something as opposed to merely answering it.&lt;/p&gt;

&lt;h3&gt;
  
  
  ReAct (Reason + Act)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; Yao et al. (2022), &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;“ReAct: Synergizing Reasoning and Acting in Language Models”&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ReAct is a conceptual framework that interweaves reasoning steps with tool invocations, allowing the model to ground its thinking in external information. In practice, the model decides what action to take, calls a tool such as a web search engine, examines the result, updates its reasoning, and repeats the cycle until it reaches a satisfactory answer. Current tools include web scraping, Wikipedia access via an API call, and a calculator interface, with mock versions available for offline execution scenarios.&lt;/p&gt;
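
&lt;p&gt;The cycle can be sketched with a stub &lt;code&gt;think&lt;/code&gt; callable and a tool table (the action protocol shown is an assumption for illustration, not the agent’s actual format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def react(question, think, tools, max_steps=5):
    """Reason-act loop: `think` returns (thought, action, arg); an action
    of 'finish' ends the loop. `tools` maps action names to callables."""
    observations = []
    for _ in range(max_steps):
        thought, action, arg = think(question, observations)
        if action == "finish":
            return arg                           # final grounded answer
        observations.append(tools[action](arg))  # e.g. search, calculator
    return observations[-1]                      # fall back to last result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;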

&lt;p&gt;ReAct achieved 70.0% accuracy on ARC-Challenge (science reasoning). While not the highest score on this particular benchmark, it gives the LLM tool use, allowing it to search the Internet for required information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended usage&lt;/strong&gt;: Fact-checking; current events queries; mathematical calculations; tasks where access to grounded, external information is important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto Router: MetaReasoningAgent
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; A single LLM invocation allows &lt;code&gt;MetaReasoningAgent&lt;/code&gt; to classify each input into one of eleven categories and route it to the most appropriate strategy, without human intervention.&lt;/p&gt;

&lt;p&gt;Every strategy is only as good as the decision to use it for a given task, and that decision normally falls to the user. &lt;code&gt;MetaReasoningAgent&lt;/code&gt; eliminates the need for this manual selection.&lt;/p&gt;

&lt;p&gt;The diagram below shows how each category maps to its corresponding strategy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2ASSObpiuAEGr1s3E7oVbKGA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2ASSObpiuAEGr1s3E7oVbKGA.png" title="MetaReasoningAgent classification diagram" alt="Classification occurs using a single LLM invocation returning CATEGORY, CONFIDENCE, and REASON." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;MetaReasoningAgent classification diagram&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MetaReasoningAgent&lt;/code&gt; instantiates the selected strategy class and passes control to it, along with all event objects for visualization.&lt;/p&gt;

&lt;p&gt;To use this capability, specify a model such as &lt;code&gt;gemma3:270m+meta&lt;/code&gt; or &lt;code&gt;gemma3:270m+auto&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In practice, routing is generally intuitive: math problems are directed to CoT, logic puzzles to ToT, philosophical questions to Socratic Questioning, and controversial topics to Adversarial Debate.&lt;/p&gt;
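
&lt;p&gt;That routing can be pictured as a simple lookup (the categories and mappings here are illustrative, not the agent’s full eleven-category list):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative category-to-strategy table (the real agent classifies into
# eleven categories with a single LLM call).
ROUTES = {
    "math": "cot", "logic_puzzle": "tot",
    "philosophy": "socratic", "controversial": "debate",
}

def route(query, classify):
    """Map the classifier's category to a strategy tag; default to CoT."""
    return ROUTES.get(classify(query), "cot")

print(route("Is free will real?", lambda q: "philosophy"))   # socratic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;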

&lt;p&gt;The trade-off is reduced control over strategy-specific hyperparameters in exchange for automatic routing aligned with the problem type.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Strategy Should You Pick? Benchmark Results (March 2026)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; CoT performs best on average (88.7%) across diverse tasks. ReAct excels when tool use is available (70.0% on ARC-Challenge). ToT and Self-Consistency tie on GSM8K math at 76.7%.&lt;/p&gt;

&lt;p&gt;These results are based on 4,200 evaluations across 11 strategies using &lt;code&gt;qwen3.5:9b&lt;/code&gt;, collected as of March 2026. All 16 strategies are implemented and production-ready. However, the benchmarks shown below focus on the 11 that produce a single extractable answer. The remaining five are generation-focused and not suited to multiple-choice evaluation.&lt;/p&gt;

&lt;p&gt;The heat map and bar chart below provide a complete view of the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AlkHAnyNpsABYEqnoueCr9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AlkHAnyNpsABYEqnoueCr9g.png" title="Benchmark results heatmap and bar chart" alt="Left: accuracy heatmap across GSM8K, MMLU, and ARC-Challenge for each strategy. Right: average accuracy bar chart. CoT wins overall at 88.7%." width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benchmark results heatmap and bar chart&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; CoT wins on average across diverse tasks. Self-Consistency and ToT beat it on specific math benchmarks. ReAct dominates on factual/science tasks. Self-Reflection and Refinement Loop are not well captured by these benchmarks, as they primarily improve generation quality rather than multiple-choice accuracy.&lt;/p&gt;

&lt;p&gt;For most queries, start with &lt;code&gt;+cot&lt;/code&gt;. If you’re solving logic puzzles or planning problems, try &lt;code&gt;+tot&lt;/code&gt;. If you need factually grounded responses, use &lt;code&gt;+react&lt;/code&gt;. If you need polished, high-quality output rather than a quick answer, use &lt;code&gt;+refinement&lt;/code&gt;. When in doubt, &lt;code&gt;+meta&lt;/code&gt; will route the query automatically.&lt;/p&gt;

&lt;p&gt;In my experience building agent-reasoning, the most surprising finding is how much prompt structure alone can improve performance. For example, &lt;code&gt;qwen3.5:9b&lt;/code&gt; improves from 81.3% to 88.7% average accuracy simply by prompting it to produce numbered reasoning steps.&lt;/p&gt;
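&lt;p&gt;To make that concrete, here is a sketch of a numbered-steps prompt wrapper. This illustrates the technique, not the library's exact template:&lt;/p&gt;

```python
def cot_prompt(question: str, steps: int = 4) -> str:
    """Wrap a question in a Chain-of-Thought instruction that asks the
    model for explicitly numbered reasoning steps before the answer."""
    return (
        "Solve the problem below. Think step by step, writing each step "
        f"on its own line as '1.', '2.', ... (up to about {steps} steps), "
        "then give the final answer on a line starting with 'Answer:'.\n\n"
        f"Problem: {question}"
    )

prompt = cot_prompt("A train travels 120 km in 1.5 hours. What is its speed?")
print(prompt)
```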


&lt;p&gt;You can &lt;a href="https://github.com/oracle-devrel/oracle-ai-developer-hub/tree/main/apps/agent-reasoning" rel="noopener noreferrer"&gt;find the repository here&lt;/a&gt;. Install with &lt;code&gt;pip install agent-reasoning&lt;/code&gt; or &lt;code&gt;uv add agent-reasoning&lt;/code&gt;. The commands to get started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AXo6o2jGEUekHQjIkVWUI_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F800%2F1%2AXo6o2jGEUekHQjIkVWUI_A.png" title="Getting started commands" alt="Getting started commandsInstallation and launching agent-reasoning in seconds to access a TUI with 16 reasoning agents." width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Getting started commands&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The TUI provides a 16-agent sidebar, live streaming, and a step-through debugger. Arena mode runs all 16 agents simultaneously on the same query in a 4×4 grid.&lt;/p&gt;

&lt;p&gt;If this is useful, a GitHub star is always appreciated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need to modify my existing code to use agent-reasoning?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. The interceptor is a drop-in replacement for the Ollama client. Just change the model name string by appending &lt;code&gt;+strategy&lt;/code&gt; (e.g., &lt;code&gt;gemma3:270m+cot&lt;/code&gt;) and the interceptor handles everything else. Existing LangChain pipelines, web UIs, and scripts work without any other changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which strategy should I start with?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with &lt;code&gt;+cot&lt;/code&gt; (Chain of Thought). It scored the highest average accuracy (88.7%) across our benchmarks and adds minimal latency. If you are unsure, use &lt;code&gt;+meta&lt;/code&gt; and let the auto-router pick the best strategy for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why were only 11 of the 16 strategies benchmarked?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The benchmarks (GSM8K, MMLU, ARC-Challenge) measure multiple-choice accuracy, which works well for strategies that produce a single extractable answer. The remaining five strategies are generation-focused (e.g., Refinement Loop, MCTS) and their strengths in output quality are not captured by multiple-choice evaluations. All 16 strategies are fully implemented and production-ready.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I use this with models other than Ollama-served models?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Currently the interceptor targets the Ollama API. Since it exposes the same &lt;code&gt;.generate()&lt;/code&gt; and &lt;code&gt;.chat()&lt;/code&gt; endpoints, any Ollama-compatible client works out of the box. Support for additional inference backends is on the roadmap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much slower are branching strategies compared to CoT?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ToT is roughly 5-8x slower than CoT because it generates and evaluates multiple candidate branches at each level. Self-Consistency (k=5 samples) adds similar overhead. For latency-sensitive applications, stick with sequential strategies (CoT, Least-to-Most) and reserve branching strategies for problems where accuracy matters more than speed.&lt;/p&gt;
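&lt;p&gt;One way to reason about that overhead is to count LLM calls per query. The numbers below are an illustrative model (for ToT, &lt;code&gt;b&lt;/code&gt; candidate generations plus one batched evaluation per level), not measurements from agent-reasoning:&lt;/p&gt;

```python
def llm_calls(strategy: str, k: int = 5, b: int = 3, d: int = 2) -> int:
    """Rough per-query LLM call counts (illustrative model, not benchmarks)."""
    if strategy == "cot":
        return 1                # single sequential pass
    if strategy == "self_consistency":
        return k                # k independent samples, then majority vote
    if strategy == "tot":
        return d * (b + 1)      # b generations + 1 batched evaluation per level
    raise ValueError(strategy)

print(llm_calls("tot"))  # 8 calls vs. 1 for CoT
```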

&lt;p&gt;&lt;em&gt;Created by Nacho Martinez, Data Scientist at Oracle. Find Nacho on &lt;a href="https://github.com/jasperan" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://linkedin.com/in/jasperan" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;, or visit the &lt;a href="https://www.oracle.com/developer/resources/" rel="noopener noreferrer"&gt;Oracle AI Developer page&lt;/a&gt; for more resources.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Exploring Elyan Labs: Open-source infrastructure for vintage silicon</title>
      <dc:creator>houariblr</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:55:03 +0000</pubDate>
      <link>https://forem.com/houariblr/exploring-elyan-labs-open-source-infrastructure-for-vintage-silicon-8a9</link>
      <guid>https://forem.com/houariblr/exploring-elyan-labs-open-source-infrastructure-for-vintage-silicon-8a9</guid>
      <description>&lt;p&gt;I recently looked into Elyan Labs and found their approach to hardware infrastructure quite interesting. They are focusing on the intersection of vintage hardware and open-source development, integrating "Proof of Antiquity" concepts within the RustChain blockchain.&lt;/p&gt;

&lt;p&gt;What caught my attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure focus:&lt;/strong&gt; 44+ PRs contributed to core projects like OpenSSL, Ghidra, vLLM, and LLVM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research:&lt;/strong&gt; a paper accepted at CVPR 2026, which suggests a solid technical foundation behind their hardware attestation models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware integration:&lt;/strong&gt; making vintage silicon relevant in a modern AI/blockchain stack is a unique challenge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Definitely worth a look if you are into low-level systems or hardware-software co-design.&lt;/p&gt;


</description>
      <category>opensource</category>
      <category>web3</category>
      <category>ai</category>
      <category>rust</category>
    </item>
    <item>
      <title>Your AI Agent Now Remembers Your Project: Persistent Memory with vem</title>
      <dc:creator>vem.dev</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:50:46 +0000</pubDate>
      <link>https://forem.com/vem/your-ai-agent-now-remembers-your-project-persistent-memory-with-vem-2d95</link>
      <guid>https://forem.com/vem/your-ai-agent-now-remembers-your-project-persistent-memory-with-vem-2d95</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;🚀 &lt;strong&gt;vem is in early access&lt;/strong&gt; — we're looking for our first users. If you try it and find it useful, we'd love to hear from you. &lt;strong&gt;Early access is completely free.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every time you open a new chat with your AI coding assistant you spend the first few minutes re-explaining the same things: what the project does, which patterns you follow, what you were just working on, and why you made the architectural choices you did.&lt;/p&gt;

&lt;p&gt;This is not a UX quirk — it is a structural gap. AI agents are stateless. &lt;strong&gt;vem&lt;/strong&gt; solves this with a local memory layer that lives inside your repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites — Install vem and Link a Project
&lt;/h2&gt;

&lt;p&gt;You need the vem CLI installed, an authenticated account, and a repository linked to a vem cloud project. If you completed the Cycles tutorial you are already set up — skip to the next section.&lt;/p&gt;

&lt;p&gt;Sign up at &lt;a href="https://vem.dev" rel="noopener noreferrer"&gt;vem.dev&lt;/a&gt;, grab your API key from vem.dev/keys, then run the three commands below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install the CLI globally&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @vemdev/cli

&lt;span class="c"&gt;# 2. Authenticate with your API key from vem.dev/keys&lt;/span&gt;
vem login &amp;lt;your-api-key&amp;gt;

&lt;span class="c"&gt;# 3. Initialise memory in your repo and link to a cloud project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project
vem init
vem &lt;span class="nb"&gt;link&lt;/span&gt;

&lt;span class="c"&gt;# Confirm everything is connected&lt;/span&gt;
vem status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem: AI Agents Forget Everything
&lt;/h2&gt;

&lt;p&gt;As the intro noted, every new chat starts from zero. AI agents are stateless: they have no memory between sessions, so the minutes you spend orienting them at the start of each session are pure overhead, and the accumulated reasoning from previous sessions is permanently lost.&lt;/p&gt;

&lt;p&gt;vem closes this gap with a local memory layer that lives inside your repository. Everything your agents need to hit the ground running (project context, architectural decisions, sprint state) is stored durably in &lt;code&gt;.vem/&lt;/code&gt; and synced to the cloud so agents can query it instantly.&lt;/p&gt;




&lt;h2&gt;
  
  
  How vem Memory Works
&lt;/h2&gt;

&lt;p&gt;vem's memory system is built around four durable artifacts, all stored in &lt;code&gt;.vem/&lt;/code&gt; inside your repository. They are gitignored by default (so secrets never leak) but backed up to the vem cloud for search indexing and team sharing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CONTEXT.md&lt;/strong&gt; — project overview and "need to know" facts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CURRENT_STATE.md&lt;/strong&gt; — live progress summary updated after each work session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;decisions/&lt;/strong&gt; — one ADR file per architectural decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tasks/&lt;/strong&gt; — structured task backlog with cycle assignments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;CONTEXT.md&lt;/code&gt; is your project's "North Star" — a human-readable summary of what the project is, who it is for, and the non-obvious things any new contributor (human or AI) needs to know. &lt;code&gt;CURRENT_STATE.md&lt;/code&gt; captures where work stands right now: what just changed, what is in progress, and what is blocked.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;decisions/&lt;/code&gt; directory holds Architectural Decision Records (ADRs) — one file per decision, recording what was chosen, why, and what was considered and rejected. Together these four artifacts give any AI agent a complete, structured picture of your project before it writes a single line of code.&lt;/p&gt;
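&lt;p&gt;The resulting layout is small enough to sketch. The snippet below builds an illustrative skeleton in a temporary directory, using the artifact names listed above (the ADR filename is hypothetical; what &lt;code&gt;vem init&lt;/code&gt; actually scaffolds may differ):&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Illustrative .vem/ skeleton using the four artifact names described above.
root = Path(tempfile.mkdtemp()) / ".vem"
(root / "decisions").mkdir(parents=True)
(root / "tasks").mkdir()
(root / "CONTEXT.md").write_text("# Project overview and need-to-know facts\n")
(root / "CURRENT_STATE.md").write_text("# What just changed, in progress, blocked\n")
(root / "decisions" / "0001-use-zod-validation.md").write_text("# ADR: one file per decision\n")

print(sorted(p.name for p in root.iterdir()))
# ['CONTEXT.md', 'CURRENT_STATE.md', 'decisions', 'tasks']
```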




&lt;h2&gt;
  
  
  Step 1 — Write Your First Project Context
&lt;/h2&gt;

&lt;p&gt;Start by writing a concise project context. Open &lt;code&gt;.vem/CONTEXT.md&lt;/code&gt; in any editor and describe your project in plain language: what it does, the main tech choices, and any gotchas a new developer would need to know on day one.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vem context show&lt;/code&gt; prints the current context so you can confirm what your agents will see. After editing, run &lt;code&gt;vem push&lt;/code&gt; to sync it to the cloud immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Open the context file in your editor&lt;/span&gt;
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt; .vem/CONTEXT.md

&lt;span class="c"&gt;# Preview what agents see right now&lt;/span&gt;
vem context show

&lt;span class="c"&gt;# Sync to the cloud after editing&lt;/span&gt;
vem push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2 — Record an Architectural Decision
&lt;/h2&gt;

&lt;p&gt;Every non-obvious choice deserves a decision record. &lt;code&gt;vem decision add&lt;/code&gt; writes an ADR to &lt;code&gt;.vem/decisions/&lt;/code&gt; and immediately makes it searchable via the MCP server.&lt;/p&gt;

&lt;p&gt;Include the context (why you faced this decision) and the decision (what you chose). Future agents — and future you — will understand not just what was chosen but why. This prevents the "why did we do it this way?" confusion that slows down every project after the first month.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vem decision add &lt;span class="s2"&gt;"Use Zod for input validation at CLI boundaries"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--context&lt;/span&gt; &lt;span class="s2"&gt;"Catching invalid user input early prevents confusing downstream errors."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--decision&lt;/span&gt; &lt;span class="s2"&gt;"All CLI inputs are validated with Zod schemas before any business logic runs."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3 — See Exactly What Your Agent Sees
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vem pack&lt;/code&gt; generates a structured JSON snapshot of your entire project memory — tasks, context, decisions, and sprint state — in a single block. This is the exact payload that the MCP server sends to your AI agent at the start of each session.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;vem pack&lt;/code&gt; manually is the fastest way to audit your memory quality. If the output looks thin or outdated, that is what your agents are working with. A well-maintained pack is the difference between an agent that needs three rounds of clarification and one that writes correct code on the first attempt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the full context pack&lt;/span&gt;
vem pack

&lt;span class="c"&gt;# Pipe to a file to inspect offline&lt;/span&gt;
vem pack &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/my-project-context.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
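&lt;p&gt;A quick way to audit a pack programmatically is to flag empty sections. The field names below are hypothetical, for illustration only; inspect your own &lt;code&gt;vem pack&lt;/code&gt; output for the real schema:&lt;/p&gt;

```python
import json

# Hypothetical pack fields for illustration; the real vem pack schema may differ.
pack = json.loads('{"context": "CLI memory tool", "decisions": [], "tasks": [{"id": "TASK-001"}]}')

def audit(pack: dict) -> list:
    """Flag thin memory: an empty section means agents start with a gap."""
    return [key for key, value in pack.items() if not value]

print(audit(pack))  # ['decisions']
```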






&lt;h2&gt;
  
  
  Step 4 — Ask Questions About Your Project
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vem search&lt;/code&gt; performs semantic search across your project memory — tasks, decisions, context, and changelog entries. It is powered by the vem cloud vector index built from your most recent &lt;code&gt;vem push&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is especially useful for finding related decisions, locating tasks about a specific feature, or checking whether a topic has already been addressed before adding a new decision record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search across all memory artifacts&lt;/span&gt;
vem search &lt;span class="s2"&gt;"error handling"&lt;/span&gt;

&lt;span class="c"&gt;# Find decisions related to authentication&lt;/span&gt;
vem search &lt;span class="s2"&gt;"auth"&lt;/span&gt;

&lt;span class="c"&gt;# Find tasks mentioning a specific library&lt;/span&gt;
vem search &lt;span class="s2"&gt;"retry logic"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
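&lt;p&gt;Under the hood this is standard embedding search. Here is a dependency-free sketch of the idea, with bag-of-words cosine similarity standing in for a real vector index (not vem's implementation):&lt;/p&gt;

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: dict) -> list:
    """Rank memory artifacts by similarity to the query (toy model:
    term overlap; a real index would use learned embeddings)."""
    q = Counter(query.lower().split())
    scored = ((cosine(q, Counter(text.lower().split())), name)
              for name, text in docs.items())
    return [name for score, name in sorted(scored, reverse=True) if score]

memory = {
    "ADR-0001": "use zod for input validation at cli boundaries",
    "TASK-003": "add retry logic to the http client",
    "CONTEXT": "cli tool for project memory and agent context",
}
print(search("retry logic", memory))  # ['TASK-003']
```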






&lt;h2&gt;
  
  
  Step 5 — Connect Any Agent via MCP
&lt;/h2&gt;

&lt;p&gt;The vem MCP server is the bridge between your memory layer and any AI agent that supports the Model Context Protocol: Claude Desktop, Cursor, Copilot, and more. Once connected, your agent calls structured tools to read tasks, search memory, and record decisions — no copy-pasting context into the chat window.&lt;/p&gt;

&lt;p&gt;Add the snippet below to your agent's MCP configuration file. Your vem API key is read automatically from &lt;code&gt;~/.vem/config.json&lt;/code&gt; — you never need to expose it in the config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy matters:&lt;/strong&gt; vem uses a Bring Your Own Key model. Your AI provider keys (OpenAI, Anthropic, etc.) are stored only on your local machine and never sent to the vem cloud.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vem"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"@vemdev/mcp-server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools available to your agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Tools exposed by the vem MCP server:&lt;/span&gt;
&lt;span class="c"&gt;# get_active_tasks()     — list current sprint tasks with status&lt;/span&gt;
&lt;span class="c"&gt;# search_memory(query)   — semantic search across all memory artifacts&lt;/span&gt;
&lt;span class="c"&gt;# read_decision(id)      — fetch a specific ADR by ID&lt;/span&gt;
&lt;span class="c"&gt;# update_task(id, ...)   — mark progress and add evidence&lt;/span&gt;
&lt;span class="c"&gt;# record_decision(...)   — write a new ADR from the agent session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6 — The Web Memory Dashboard
&lt;/h2&gt;

&lt;p&gt;The vem web app at &lt;a href="https://app.vem.dev" rel="noopener noreferrer"&gt;app.vem.dev&lt;/a&gt; gives you a visual view of everything stored in your project memory. The Context tab shows your &lt;code&gt;CONTEXT.md&lt;/code&gt;, current state, key decisions, and recent changelog entries all on one page.&lt;/p&gt;

&lt;p&gt;The Memory tab hosts a chat interface you can use to ask questions about your project directly from the browser — the same semantic search your agents use, but in a conversational UI. It is particularly useful during onboarding or code review when you need to quickly orient a new contributor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02kgzg4lmo6zjugpahuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02kgzg4lmo6zjugpahuw.png" alt="vem web context page showing key architectural decisions panel" width="800" height="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Key Decisions panel in the vem web app — all ADRs accessible to team members and agents&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 7 — Implement Tasks Remotely with the vem Agent
&lt;/h2&gt;

&lt;p&gt;vem does not just store context — it can act on it. The vem agent runner lets you trigger AI-powered task implementation from the web dashboard, delegating work to an agent running on your local dev machine or a cloud runner.&lt;/p&gt;

&lt;p&gt;Your AI keys never leave your machine. The vem cloud only orchestrates which task to run and where — the actual agent execution and code changes happen locally. This is BYOK (Bring Your Own Key) by design.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the vem runner — listens for tasks dispatched from the web&lt;/span&gt;
vem runner

&lt;span class="c"&gt;# Or specify a particular AI agent&lt;/span&gt;
vem runner &lt;span class="nt"&gt;--agent&lt;/span&gt; claude

&lt;span class="c"&gt;# The runner outputs a secure token you connect in the web Workspace tab&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 8 — Track Agent Activity with Insights
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vem insights&lt;/code&gt; shows a power score and command frequency breakdown for your project. It surfaces which workflow features you are using, which you are not, and how your agent activity patterns have evolved over time.&lt;/p&gt;

&lt;p&gt;The power score is a simple metric (0–100) that rewards high-value behaviours: agent-driven implementation, decision recording, task-driven work, and memory finalisation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Show power score and command frequency&lt;/span&gt;
vem insights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 9 — Push Memory to the Cloud
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;vem push&lt;/code&gt; publishes a snapshot of your entire &lt;code&gt;.vem/&lt;/code&gt; memory to the vem cloud. The snapshot is marked &lt;code&gt;pending&lt;/code&gt; until a matching Git push is detected — at that point it is verified using the &lt;code&gt;git_hash&lt;/code&gt; + &lt;code&gt;snapshot_hash&lt;/code&gt; pair and becomes permanently auditable.&lt;/p&gt;
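&lt;p&gt;The pairing idea itself is simple to sketch (an illustration of content-hash verification, not vem's actual scheme; the commit hash below is a placeholder): hash the memory files into a snapshot hash, record it alongside the Git commit hash, and later recompute both to confirm nothing drifted.&lt;/p&gt;

```python
import hashlib

def snapshot_hash(files: dict) -> str:
    """Content hash over memory artifacts, order-independent via sorted paths.
    Illustrates the git_hash + snapshot_hash pairing; not vem's real scheme."""
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(files[path].encode())
    return h.hexdigest()

files = {".vem/CONTEXT.md": "overview", ".vem/CURRENT_STATE.md": "auth done"}
record = {"git_hash": "abc123", "snapshot_hash": snapshot_hash(files)}

# Verification: recompute and compare against the recorded pair.
assert record["snapshot_hash"] == snapshot_hash(files)  # unchanged: verifies
assert record["snapshot_hash"] != snapshot_hash({".vem/CONTEXT.md": "edited"})
```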

&lt;p&gt;Push after any significant session: after adding decisions, after completing tasks, after updating context. Your teammates and any agent connected via MCP will immediately see the updated memory on their next request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Publish current memory snapshot&lt;/span&gt;
vem push

&lt;span class="c"&gt;# Check sync and connection status&lt;/span&gt;
vem status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 10 — Cycle Validation: Memory Stays Correct Over Time
&lt;/h2&gt;

&lt;p&gt;Development never stops. New features can invalidate old decisions, refactors break assumptions captured in &lt;code&gt;CONTEXT.md&lt;/code&gt;, and security issues can surface weeks after the original code was written. vem's cycle validation step is designed exactly for this.&lt;/p&gt;

&lt;p&gt;When you close a sprint with &lt;code&gt;vem cycle validate&lt;/code&gt;, vem checks each completed task's validation steps against the current codebase and flags items that need human review.&lt;/p&gt;

&lt;p&gt;Run validation at the end of each cycle before you mark it done. It takes less than a minute and ensures your memory layer stays trustworthy — so agents in future cycles don't build on stale or incorrect foundations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Validate the active cycle before closing it&lt;/span&gt;
vem cycle validate

&lt;span class="c"&gt;# Review specific task validation results&lt;/span&gt;
vem cycle validate &lt;span class="nt"&gt;--task&lt;/span&gt; TASK-003

&lt;span class="c"&gt;# Close the cycle once validation passes&lt;/span&gt;
vem cycle close
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Full Memory Loop
&lt;/h2&gt;

&lt;p&gt;Every AI session should leave the project in a better state than it found it. That means updated context, recorded decisions, completed tasks, and a fresh push to the cloud. With vem, this loop takes under two minutes and pays dividends on every session that follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Inspect what your agents currently see&lt;/span&gt;
vem pack

&lt;span class="c"&gt;# 2. Record any decisions made this session&lt;/span&gt;
vem decision add &lt;span class="s2"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt; &lt;span class="nt"&gt;--decision&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

&lt;span class="c"&gt;# 3. Update task progress&lt;/span&gt;
vem task &lt;span class="k"&gt;done &lt;/span&gt;TASK-001 &lt;span class="nt"&gt;--evidence&lt;/span&gt; &lt;span class="s2"&gt;"Implemented in src/auth.ts, tests pass"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Refresh current state summary&lt;/span&gt;
vem context &lt;span class="nb"&gt;set &lt;/span&gt;current &lt;span class="s2"&gt;"Completed auth module. Next: add refresh token rotation."&lt;/span&gt;

&lt;span class="c"&gt;# 5. Push to cloud and verify&lt;/span&gt;
vem push
vem status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your agents start each session with full context. Your decisions are permanent and searchable. Your sprint state is always visible. And your memory is verified against your actual Git history — not just a file on disk.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;vem is currently in early access.&lt;/strong&gt; We're looking for our first users — developers and teams tired of re-explaining their project to AI agents every session. Early access is &lt;strong&gt;completely free&lt;/strong&gt;. No credit card, no trial timer.&lt;/p&gt;

&lt;p&gt;If you found this useful, &lt;a href="https://vem.dev" rel="noopener noreferrer"&gt;sign up at vem.dev&lt;/a&gt; and let us know what you're building. Your feedback will directly shape the product. 🙏&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>ai</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Built an Experiences Marketplace Five Years Before Airbnb Experiences</title>
      <dc:creator>Talvinder Singh</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:48:29 +0000</pubDate>
      <link>https://forem.com/talvinder/i-built-an-experiences-marketplace-five-years-before-airbnb-experiences-4bm7</link>
      <guid>https://forem.com/talvinder/i-built-an-experiences-marketplace-five-years-before-airbnb-experiences-4bm7</guid>
      <description>&lt;p&gt;In 2011, we built Tushky — a marketplace for local experiences in India. Cooking classes with home chefs. Heritage walks through old Mumbai. Photography workshops in the Western Ghats. Five years later, Airbnb launched Experiences and scaled the exact same model globally.&lt;/p&gt;

&lt;p&gt;We had the idea first. We executed reasonably well. We still failed.&lt;/p&gt;

&lt;p&gt;The reason wasn't timing or capital or competition. It was something more fundamental: we optimized for transactions when we should have been building social infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Social Capital Gap
&lt;/h2&gt;

&lt;p&gt;Most marketplace failures are diagnosed as "chicken-and-egg problems" — you need supply to attract demand, you need demand to attract supply. That's true but useless. It's like saying you failed because you ran out of money. The question is &lt;em&gt;why&lt;/em&gt; you couldn't solve the bootstrap problem when others did.&lt;/p&gt;

&lt;p&gt;The answer is what I call the &lt;strong&gt;Social Capital Gap&lt;/strong&gt; — the difference between a transactional platform and a community with economic infrastructure built on top.&lt;/p&gt;

&lt;p&gt;Airbnb Experiences closed that gap. We didn't. Not because we didn't understand marketplaces, but because we treated the wrong thing as the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we got right
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Profitable unit economics on outbound marketing.&lt;/strong&gt; We could acquire customers through Facebook ads and Google search profitably. Rs 200-300 customer acquisition cost, Rs 800-1200 average booking value, 15-20% take rate. Not venture scale, but sustainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy supplier onboarding.&lt;/strong&gt; Experience providers could create a listing in under 10 minutes. No approval bottleneck. We had 150+ experiences listed within six months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unique inventory.&lt;/strong&gt; A Parsi chef teaching dhansak in her South Mumbai apartment. A tabla master offering two-hour sessions in Dadar. A birding expert leading dawn walks in Sanjay Gandhi National Park.&lt;/p&gt;

&lt;p&gt;The product worked. People booked. Providers got paid. Reviews were positive.&lt;/p&gt;

&lt;p&gt;Transactions hit a wall at about 80-100 bookings per month.&lt;/p&gt;

&lt;p&gt;We couldn't break through. We added more experiences. We improved search. We ran more ads. We tried discounting. Nothing moved the number sustainably.&lt;/p&gt;

&lt;p&gt;The diagnosis in our internal docs: "Repeat customers were not getting enough options and first timers wanted more options to decide from."&lt;/p&gt;

&lt;p&gt;That diagnosis was wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually broke
&lt;/h2&gt;

&lt;p&gt;The real problem was visible in how our experience providers talked about us.&lt;/p&gt;

&lt;p&gt;We wanted to be seen as business partners. We positioned ourselves that way in pitch decks and partner communications. But providers saw us as a booking channel — one of several ways they got customers, not materially different from their own Facebook page or a listing on JustDial.&lt;/p&gt;

&lt;p&gt;When we asked providers to promote Tushky to their existing customers, most didn't. When we asked them to refer other providers, most didn't. When we suggested they collaborate on multi-experience packages, almost none did.&lt;/p&gt;

&lt;p&gt;They had no social capital invested in the platform. We were a lead source, not a community.&lt;/p&gt;

&lt;p&gt;Compare that to what Airbnb built. They didn't just launch a booking interface. They built host meetups. They created an online forum where hosts shared tips. They featured hosts in marketing materials with their stories, not just their listings. They built a brand that hosts were proud to be associated with.&lt;/p&gt;

&lt;p&gt;Their CTO told me years later: "The product is not the website. It's the final booking." Meaning: the value isn't in the interface, it's in the trust infrastructure that makes the transaction possible.&lt;/p&gt;

&lt;p&gt;We built a website. They built social capital.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers that should have told us
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Our repeat booking rate: 12-15%&lt;/li&gt;
&lt;li&gt;Our provider referral rate: &amp;lt;5%&lt;/li&gt;
&lt;li&gt;Our provider-to-provider collaboration rate: 0%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those aren't marketplace metrics. Those are lead-generation metrics.&lt;/p&gt;

&lt;p&gt;A real marketplace creates network effects. Each new provider should make the platform more valuable to customers. Each new customer should make the platform more valuable to providers. We had linear growth at best.&lt;/p&gt;

&lt;p&gt;We also made a strategic error on marketing. Outbound worked. We could buy traffic profitably. So we kept doing it. What we didn't realize until it was too late: outbound marketing scales linearly with spend. Inbound marketing — SEO, word of mouth, community — scales exponentially but takes longer to build.&lt;/p&gt;

&lt;p&gt;From our internal strategy doc in 2013: "Inbound marketing is the way to go. Build extremely loyal experience partner base. They will do word of mouth for you."&lt;/p&gt;

&lt;p&gt;We knew it. We wrote it down. We didn't do it. Because outbound delivered this month's numbers. Inbound required believing in next year's numbers. We were optimizing for the wrong time horizon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got wrong
&lt;/h2&gt;

&lt;p&gt;I treated the chicken-and-egg problem as a supply problem. I thought: get enough experiences listed, and demand will follow. So we focused on making supplier onboarding frictionless.&lt;/p&gt;

&lt;p&gt;That was backwards.&lt;/p&gt;

&lt;p&gt;The constraint wasn't the number of listings. It was the depth of engagement. We needed 20 customers who booked 5 times each, not 100 customers who booked once. We needed suppliers who saw Tushky as their primary channel, not one of five. Who would promote it to their customers. Who would collaborate with other suppliers. Who had reputational skin in the game.&lt;/p&gt;

&lt;p&gt;That requires a different product. Not a listing interface. A community infrastructure.&lt;/p&gt;

&lt;p&gt;We also underestimated the importance of curation and quality signaling. We made listing easy, which meant we had a quality variance problem. Some experiences were exceptional. Some were mediocre. Customers couldn't tell the difference from the listing page. Airbnb solved this with detailed reviews, verified photos, and editorial featuring. We had basic star ratings.&lt;/p&gt;

&lt;p&gt;The final mistake: we thought being first was an advantage. It's not. Being first means you absorb all the market education cost. You teach customers that "experience marketplaces" exist. Then someone with more capital and better execution takes the market you created.&lt;/p&gt;

&lt;p&gt;First-mover advantage is real in network-effect businesses only if you can build the network faster than competitors can copy the product. We couldn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test that matters
&lt;/h2&gt;

&lt;p&gt;If you're building a marketplace, here's the question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are your suppliers investing social capital in your platform, or are they just using it as a lead source?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If it's the latter, you don't have a marketplace. You have a lead-gen business with marketplace unit economics. That's not venture-scalable. It's also not defensible.&lt;/p&gt;

&lt;p&gt;The test is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do suppliers refer other suppliers?&lt;/li&gt;
&lt;li&gt;Do suppliers promote your platform to their existing customers?&lt;/li&gt;
&lt;li&gt;Do suppliers collaborate with each other through your platform?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to all three is no, you haven't built the social infrastructure yet. You've built a directory.&lt;/p&gt;

&lt;p&gt;We spent two years optimizing transaction flow when we should have been building community. By the time we realized it, we didn't have the capital or the team energy to rebuild.&lt;/p&gt;

&lt;p&gt;Airbnb had the capital. They also had something harder to replicate: they understood from day one that the product wasn't the booking form. It was the trust system that made strangers willing to transact.&lt;/p&gt;

&lt;p&gt;I still don't know if we could have won even if we'd understood this earlier. The India market in 2011 wasn't ready for experiential consumption at scale. Airbnb launched Experiences in 2016 into a global market that had already been trained by Airbnb Stays.&lt;/p&gt;

&lt;p&gt;But I know we lost for the wrong reasons. We lost because we optimized for the transaction when we should have been building the social capital that makes transactions possible at scale.&lt;/p&gt;

&lt;p&gt;The question I'm still working through: how do you build social capital infrastructure before you have transaction volume? Community requires critical mass. But you can't get to critical mass without community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the real chicken-and-egg problem. Not supply and demand. Trust and scale.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://talvinder.com/build-logs/experiences-before-airbnb/?utm_source=devto&amp;amp;utm_medium=syndication&amp;amp;utm_campaign=experiences-before-airbnb" rel="noopener noreferrer"&gt;talvinder.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>marketplaces</category>
      <category>startuplessons</category>
      <category>indiastartups</category>
    </item>
  </channel>
</rss>
