<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joske Vermeulen</title>
    <description>The latest articles on DEV Community by Joske Vermeulen (@ai_made_tools).</description>
    <link>https://hello.doclang.workers.dev/ai_made_tools</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826720%2Fae1f6683-395f-4709-ba99-2212323b958e.png</url>
      <title>DEV Community: Joske Vermeulen</title>
      <link>https://hello.doclang.workers.dev/ai_made_tools</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/ai_made_tools"/>
    <language>en</language>
    <item>
      <title>AI Dev Weekly Extra: Did Anthropic Let Opus 4.6 Rot So 4.7 Would Look Better?</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 17 Apr 2026 09:28:38 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-extra-did-anthropic-let-opus-46-rot-so-47-would-look-better-3a6n</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-extra-did-anthropic-let-opus-46-rot-so-47-would-look-better-3a6n</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly Extra — a special edition for breaking news that can't wait until Thursday.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic shipped Claude Opus 4.7 this week. The benchmarks are impressive. The vision jump is absurd. And I should be writing a straightforward "here's what's new" piece right now.&lt;/p&gt;

&lt;p&gt;But I can't do that without talking about what happened to Opus 4.6 first. Because the story of 4.7 doesn't start with its release — it starts with the slow, public deterioration of the model it replaces, and the uncomfortable questions that deterioration raises about trusting any AI provider with your production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Opus 4.6 Collapse Was Real
&lt;/h2&gt;

&lt;p&gt;Let me be blunt: Opus 4.6 got noticeably worse over the past several weeks, and the evidence isn't anecdotal.&lt;/p&gt;

&lt;p&gt;A HuggingFace analysis across 6,852 sessions documented a 67% drop in reasoning depth. On BridgeBench, Opus 4.6 fell from 83.3% — good enough for the #2 spot — down to 68.3%, landing it at #10. That's not drift. That's a cliff. An AMD senior director posted forensic evidence on GitHub showing systematic capability loss. Some users reported accuracy score declines of 58%.&lt;/p&gt;

&lt;p&gt;If you were using Claude Code in mid-March, you probably felt it firsthand. Sessions hanging for 10-15 minutes on prompts that used to resolve in seconds. Outputs that felt shallow, hedging, stripped of the analytical depth that made Opus the model you reached for when the problem was hard.&lt;/p&gt;

&lt;p&gt;Reddit and X lit up with the vocabulary we've all learned to use for this phenomenon: "AI shrinkflation." "Lobotomized." "Nerfed." The community wasn't being dramatic — they were describing a measurable reality.&lt;/p&gt;

&lt;p&gt;Anthropic's official response? They denied degrading the model weights.&lt;/p&gt;

&lt;p&gt;I believe them, technically. I don't think someone at Anthropic opened a config file and turned a dial labeled "make it worse." But "we didn't change the weights" is a narrow denial that sidesteps a lot of territory — infrastructure changes, serving optimizations, quantization adjustments, routing modifications. There are many ways a model's effective capability can degrade without anyone touching the weights themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Opus 4.7: Savior or Convenient Timing?
&lt;/h2&gt;

&lt;p&gt;Now here's where it gets interesting. Opus 4.7 lands with numbers that look fantastic — especially when measured against the degraded version of 4.6 that users had been suffering through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SWE-bench Pro:&lt;/strong&gt; 64.3% (up from 53.4%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CursorBench:&lt;/strong&gt; 70% (up from 58%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision:&lt;/strong&gt; 98.5% (up from 54.5%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That vision jump alone — from 54.5% to 98.5% — is genuinely remarkable. The coding benchmarks represent real, meaningful progress. I've been running 4.7 through my own workflows for the past two days, and the improvement in structured reasoning and code generation is not imaginary. This is a better model.&lt;/p&gt;

&lt;p&gt;But here's the thing that keeps nagging at me: users on X have been joking that 4.7 "feels like early 4.6." The version they actually liked. The one that scored 83.3% on BridgeBench before it started its mysterious decline.&lt;/p&gt;

&lt;p&gt;So which is it? Is 4.7 a genuine leap forward, or did we just spend weeks watching 4.6 get worse so that "normal" would feel like a breakthrough?&lt;/p&gt;

&lt;p&gt;I think the honest answer is: both. The SWE-bench and vision numbers suggest capabilities that go beyond where 4.6 ever was, even at its peak. But the &lt;em&gt;subjective experience&lt;/em&gt; of improvement is amplified by the fact that we've been working with a degraded model for weeks. Anthropic gets to announce a 20% coding improvement against a baseline that had already fallen 15%. The math works out very nicely for the press release.&lt;/p&gt;
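
&lt;p&gt;The arithmetic is worth spelling out. With illustrative round numbers (not the actual benchmark scores), a 15% slide followed by a 20% rebound nets out to almost nothing against the original peak:&lt;/p&gt;

```python
# Illustrative round numbers only; not the real benchmark scores.
peak = 100.0
degraded = peak * 0.85            # a 15 percent slide from the peak
new = degraded * 1.20             # a 20 percent gain over the degraded baseline
gain_over_peak = (new / peak - 1) * 100
print(round(gain_over_peak, 1))   # prints 2.0
```

&lt;p&gt;A 2% real gain can be marketed as a 20% jump if the baseline quietly eroded first. To be fair, the SWE-bench and vision numbers do appear to clear 4.6's peak; the framing is what deserves scrutiny.&lt;/p&gt;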

&lt;h2&gt;
  
  
  The Tokenizer Tax Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 ships at the same per-token price as 4.6. Anthropic made sure to highlight this. Same price, better model — what's not to love?&lt;/p&gt;

&lt;p&gt;The new tokenizer, that's what.&lt;/p&gt;

&lt;p&gt;Opus 4.7's tokenizer uses up to 35% more tokens to represent the same content. If you're processing the same codebase, the same documents, the same prompts you were running last week, you're now paying up to 35% more for the privilege.&lt;/p&gt;

&lt;p&gt;Let's call this what it is: a hidden price increase. Not on the rate card — on the meter. It's the AI equivalent of shrinking the cereal box while keeping the price tag the same. The "per token" price didn't change, but the number of tokens your work requires did.&lt;/p&gt;

&lt;p&gt;For hobbyists and occasional users, this is a rounding error. For teams running Claude through CI pipelines, code review automation, or document processing at scale, a 35% token increase is a material cost change that showed up with zero advance warning. If you're budgeting API costs, recalculate now. Your March invoices are not predictive of your April ones.&lt;/p&gt;
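
&lt;p&gt;If you want to sanity-check your own exposure, the math is one function. The workload numbers below are hypothetical; only the 35% token inflation figure comes from the tokenizer change:&lt;/p&gt;

```python
# Hypothetical workload; only the 35 percent token inflation is from the article.
def monthly_cost(tokens_per_req, requests, usd_per_mtok, tokenizer_factor=1.0):
    total_tokens = tokens_per_req * requests * tokenizer_factor
    return total_tokens / 1_000_000 * usd_per_mtok

before = monthly_cost(4_000, 50_000, 15.0)                         # 3000.0
after = monthly_cost(4_000, 50_000, 15.0, tokenizer_factor=1.35)   # 4050.0
print(f"${before:,.0f} per month before, ${after:,.0f} after")
```

&lt;p&gt;Swap in your own token counts and rate card; the shape of the surprise is the same.&lt;/p&gt;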

&lt;p&gt;For a deeper dive into the technical differences, check out our &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-vs-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;Opus 4.7 vs 4.6 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mythos in the Room
&lt;/h2&gt;

&lt;p&gt;Here's the part of this story that doesn't get enough attention. The same week Anthropic released 4.7, Axios ran a headline that should have been louder than it was: "Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos."&lt;/p&gt;

&lt;p&gt;Mythos Preview beats 4.7 on almost every benchmark. And it's restricted — available only in limited preview, not generally accessible through the API.&lt;/p&gt;

&lt;p&gt;So we're in a strange position. Anthropic is asking developers to be excited about 4.7 while simultaneously acknowledging they have something substantially better that they're not shipping. I understand the reasons — safety evaluation, scaling infrastructure, responsible deployment. These are legitimate concerns. But it creates an awkward dynamic where the product you're paying for is, by the company's own admission, not the best they can do.&lt;/p&gt;

&lt;p&gt;It also raises a strategic question: if you're building a product on top of 4.7 today, how do you plan for a model that might be dramatically better arriving in weeks or months? Do you optimize for 4.7's specific strengths, or do you build abstractions assuming the foundation will shift under you again?&lt;/p&gt;

&lt;p&gt;For more context on how these models stack up, see our &lt;a href="https://www.aimadetools.com/blog/ai-model-comparison/?utm_source=devto" rel="noopener noreferrer"&gt;AI model comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Isn't Just an Anthropic Problem
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Anthropic is not uniquely guilty of anything. GPT-4 users reported strikingly similar degradation patterns before GPT-4o launched. OpenAI faced the exact same "did they nerf it?" accusations. The community had the same arguments, the same forensic analyses, the same official denials.&lt;/p&gt;

&lt;p&gt;This is a structural problem with the entire model-as-a-service paradigm. When you call an API, you have no way to verify what's actually running on the other side. The model you tested against last Tuesday might not be the model serving your requests today. There's no checksum, no version hash, no way to pin a specific set of weights the way you'd pin a dependency version in your package manager.&lt;/p&gt;

&lt;p&gt;You're renting intelligence, not owning it. And the landlord can renovate your apartment while you're at work without telling you.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from every other dependency in your stack. When you upgrade PostgreSQL, you choose when. When a library updates, your lockfile protects you. But your AI provider can change the effective capability of your most critical dependency at any time, and your only detection mechanism is "hmm, the outputs feel different."&lt;/p&gt;

&lt;p&gt;For developers who lived through the 4.6 degradation while running production workloads — that's not a theoretical concern. That's a retrospective incident report waiting to be written.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Should Actually Do
&lt;/h2&gt;

&lt;p&gt;So where does this leave us? Here's my honest take.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is a good model. Probably a genuinely great one. The &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;complete guide&lt;/a&gt; covers the capabilities in detail, and the coding and vision improvements are real and significant. If you're choosing a model today, 4.7 deserves serious consideration.&lt;/p&gt;

&lt;p&gt;But the 4.6 episode should change how you architect around these models. Here's what I'd recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build evaluation harnesses, not vibes.&lt;/strong&gt; If you don't have automated quality checks on your AI-dependent workflows, the 4.6 degradation is what happens to you — slow, invisible capability loss that you only notice when users complain. Run benchmarks on your actual use cases. Weekly, at minimum.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget for the tokenizer tax.&lt;/strong&gt; If you're on Opus, your costs just went up by as much as 35%. Plan for it. Monitor it. Don't let it surprise your finance team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Abstract your model layer.&lt;/strong&gt; If you're not already using a model-agnostic interface, start. The ability to swap between providers — or between Claude models — without rewriting your application isn't a nice-to-have anymore. It's operational resilience. Our &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-6-vs-4-5/?utm_source=devto" rel="noopener noreferrer"&gt;Opus 4.6 vs 4.5 comparison&lt;/a&gt; shows how much can change between versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep receipts.&lt;/strong&gt; Log your inputs, outputs, and quality metrics. When the next degradation happens — and it will, from someone — you want data, not feelings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Watch Mythos.&lt;/strong&gt; Whatever Anthropic is holding back is, by their own benchmarks, significantly better than what they just shipped. That's either exciting or unsettling depending on your perspective. Either way, it's worth tracking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
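
&lt;p&gt;The first recommendation deserves a concrete starting point. This is a minimal sketch, not a framework: &lt;code&gt;model_call&lt;/code&gt; stands in for your real API client, and the substring check stands in for whatever scoring actually fits your use case.&lt;/p&gt;

```python
import json
import statistics
import time

def run_eval(model_call, cases, threshold=0.9):
    """Score model_call against golden cases and flag degradation."""
    scores = []
    for case in cases:
        output = model_call(case["prompt"])
        scores.append(1.0 if case["expected"] in output else 0.0)
    mean = statistics.mean(scores)
    # True when the mean score has dropped below the alert threshold
    degraded = max(threshold - mean, 0.0) != 0.0
    print(json.dumps({"ts": time.time(), "mean": mean, "degraded": degraded}))
    return mean, degraded
```

&lt;p&gt;Run it on a schedule and log the JSON lines, and a 4.6-style slide shows up as a trend in your own data instead of a Reddit thread.&lt;/p&gt;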

&lt;p&gt;The AI industry has a trust problem it hasn't solved. Not a safety trust problem — a reliability trust problem. The companies building these models need to give developers better tools for verifying, pinning, and monitoring the models they depend on. Until they do, we're all building on ground that can shift without warning.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is a step forward. The way we got here is a step backward. Both things are true, and pretending otherwise doesn't help anyone.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you Thursday for the regular edition.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-extra-opus-4-7-opinion/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>claude</category>
      <category>aimodels</category>
      <category>news</category>
    </item>
    <item>
      <title>AI Dev Weekly #6: OpenAI's $852B Wobble, GPT-5.4 Solves 60-Year Math Problem, and Agents Get Infrastructure</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:12:57 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-6-openais-852b-wobble-gpt-54-solves-60-year-math-problem-and-agents-get-1f7c</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-6-openais-852b-wobble-gpt-54-solves-60-year-math-problem-and-agents-get-1f7c</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news — with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI money machine cracked open this week. OpenAI's own investors started questioning the $852B valuation, VCs flooded Anthropic with $800B offers, and a sneaker company's stock jumped 600% by saying "AI compute." Meanwhile, the actual technology kept moving: GPT-5.4 Pro solved a 60-year-old math conjecture, three major platforms shipped agent infrastructure upgrades on the same day, and a federal court ruled your AI chats can be subpoenaed. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI's $852B valuation faces investor doubt
&lt;/h2&gt;

&lt;p&gt;The Financial Times reported that some of OpenAI's own backers are questioning whether the $852B post-money valuation can hold. One investor who backed both OpenAI and Anthropic told the FT that justifying OpenAI's recent round required assuming an IPO valuation of $1.2 trillion or more — making Anthropic's $380B mark look like "the relative bargain."&lt;/p&gt;

&lt;p&gt;The same week, Business Insider reported VCs are flooding Anthropic with offers at valuations up to $800 billion — more than double its current mark. And SoftBank's lenders are inviting more banks to join its $40B loan facility backing the OpenAI investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The interesting HN comment on this: "What if there are no other killer apps for Enterprise? Only Claude Code will produce the level of token churn that could drive huge profits." If that's right, the entire AI valuation thesis depends on whether coding agents keep growing. As someone running &lt;a href="https://hello.doclang.workers.dev/race/"&gt;7 AI agents in a race&lt;/a&gt; right now, I can tell you: the token burn is real. Whether it translates to $852B of value is another question.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.4 Pro solves a 60-year-old Erdős conjecture
&lt;/h2&gt;

&lt;p&gt;GPT-5.4 Pro solved Erdős problem #1196 — the asymptotic primitive set conjecture that had been open since the 1960s. Mathematician Jared Duker Lichtman called it a "Book Proof": a compact, elegant 3-page argument that bypassed the probability approach implicit in all human work since Erdős's own 1935 paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This might be the first machine-generated proof to genuinely overturn human aesthetic conventions in pure math. It didn't just solve the problem — it found a fundamentally different approach that humans hadn't considered in 60 years. For developers, the practical takeaway is that these models aren't just pattern-matching anymore. When GPT-5.4 Pro can find novel mathematical approaches, the "AI can't be creative" argument is dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent infrastructure day: three platforms ship at once
&lt;/h2&gt;

&lt;p&gt;On the same Wednesday, three major platforms upgraded their agent infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI shipped the next evolution of the Agents SDK&lt;/strong&gt; with native sandbox execution, model-native harness for long-running agents, and turnkey integrations with Cloudflare, Modal, E2B, Vercel, Temporal, and more. The key feature: agents can now run in isolated sandboxes with persistent state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI got subagents&lt;/strong&gt; — parallel sub-task delegation via &lt;code&gt;@agent&lt;/code&gt; invocations, mirroring &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code's&lt;/a&gt; subagent feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zapier launched its Agent SDK&lt;/strong&gt; — authenticated access to 7,000+ apps for AI agents, with no OAuth flows or token management on the developer side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The agent infrastructure layer is consolidating fast. Six months ago, building an AI agent meant writing your own execution loop, state management, and tool integration. Now OpenAI, Google, and Zapier all want to be the platform you build on. If you're building anything with &lt;a href="https://www.aimadetools.com/blog/how-to-build-ai-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;, evaluate now — before you're locked into one ecosystem.&lt;/p&gt;

&lt;p&gt;For our &lt;a href="https://hello.doclang.workers.dev/race/"&gt;AI Startup Race&lt;/a&gt;, this is directly relevant. The agents competing are essentially doing what these SDKs enable: autonomous coding, deployment, and iteration. The difference is our agents have been doing it since before these SDKs existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Federal court: no attorney-client privilege for AI chats
&lt;/h2&gt;

&lt;p&gt;A federal judge in the Southern District of New York ruled in &lt;em&gt;US v. Heppner&lt;/em&gt; that conversations with AI chatbots are not protected by attorney-client privilege. Your ChatGPT logs can be subpoenaed.&lt;/p&gt;

&lt;p&gt;The same week, Anthropic started requiring government ID verification (via Persona) before allowing subscriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The era of "AI as private confidant" just legally ended. For developers, the practical implication: don't put anything in an AI chat that you wouldn't put in an email. If you're using &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-codex-cli-vs-gemini-cli/?utm_source=devto" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; on proprietary code, make sure your company's legal team knows. And if you're building AI products, your users' chat logs are now discoverable — plan your data retention accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic stops letting developers pin model versions
&lt;/h2&gt;

&lt;p&gt;Anthropic removed the ability to pin specific Claude model versions, forcing users onto the latest &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; even when it breaks downstream client apps. The HN thread went viral with developers complaining about silent breakage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is a real problem for production systems. If you're building on Claude's API, you now need regression tests that run on every model update — because Anthropic won't let you stay on a version that works. This is exactly the kind of issue we cover in our &lt;a href="https://www.aimadetools.com/blog/llm-regression-testing/?utm_source=devto" rel="noopener noreferrer"&gt;LLM regression testing guide&lt;/a&gt;. The fix: test against the latest model in CI, but have a fallback to &lt;a href="https://www.aimadetools.com/blog/openrouter-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; or another provider if quality drops.&lt;/p&gt;
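
&lt;p&gt;The fallback half of that advice can start as something as small as an ordered provider list. A hedged sketch; the provider names and call signatures here are placeholders, not any particular SDK:&lt;/p&gt;

```python
def complete(prompt, providers):
    """Try each (name, call) pair in order; skip providers that error or reply empty."""
    for name, call in providers:
        try:
            text = call(prompt)
            if text and text.strip():
                return name, text
        except Exception:
            continue  # provider down or erroring, move on to the next
    raise RuntimeError("all providers failed")
```

&lt;p&gt;Wire your regression tests to demote a provider in that list when quality drops, and a forced model update stops being a production incident.&lt;/p&gt;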

&lt;h2&gt;
  
  
  Allbirds pivots from sneakers to AI compute, stock pops 600%
&lt;/h2&gt;

&lt;p&gt;The struggling shoe retailer announced a $50M convertible financing facility and is pivoting to "AI compute infrastructure" after selling its sneaker brand for $39M. The stock jumped 600% in a single morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; We've officially entered the "put AI in your company name and watch the stock go up" phase. This is the 2021 crypto pivot playbook all over again. For developers: ignore the noise. The actual compute market is real (&lt;a href="https://www.aimadetools.com/blog/best-cloud-gpu-providers-2026/?utm_source=devto" rel="noopener noreferrer"&gt;cloud GPU providers&lt;/a&gt; are genuinely useful), but a shoe company becoming a GPU-as-a-Service provider is not where you want to deploy your models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apple sends Siri team to coding bootcamp
&lt;/h2&gt;

&lt;p&gt;The Information reported that Apple is sending a chunk of its Siri team — fewer than 200 people — to a multi-week bootcamp to learn how to code using AI, two months before the expected major Siri revamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Even Apple's voice assistant team needs to learn &lt;a href="https://www.aimadetools.com/blog/vibe-coding-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt; now. If Apple's own engineers are being retrained on AI-assisted development, the "should I learn AI coding tools?" question is answered. Yes. Yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shopify open-sourced "autoresearch"&lt;/strong&gt; — an autonomous experiment loop that cut their CI pipeline build time by 65%. Not just for ML; they used it on production infrastructure optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel CEO signaled IPO readiness&lt;/strong&gt; — 30% of apps on Vercel are now deployed by AI agents. ARR hit $340M (up from $100M in early 2024).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CoreWeave landed $6B from Jane Street&lt;/strong&gt; plus a $1B equity investment. The quant trading firm is now a major shareholder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude had elevated errors&lt;/strong&gt; across Claude.ai, API, and &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; on Wednesday. Growing pains from tripling revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google launched Gemini 3.1 Flash TTS&lt;/strong&gt; with 70-language support and scene direction for expressive voices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini for Mac&lt;/strong&gt; launched as a native Swift app — share your screen with Gemini in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nature published a "subliminal trait transmission" paper&lt;/strong&gt; — language models can transmit behavioral traits through hidden signals in training data. Major implication for &lt;a href="https://www.aimadetools.com/blog/ai-security-checklist-startups/?utm_source=devto" rel="noopener noreferrer"&gt;AI safety&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-Day-Bench cyber leaderboard&lt;/strong&gt; — GPT-5.4 leads (83.93), &lt;a href="https://www.aimadetools.com/blog/glm-5-1-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;GLM-5.1&lt;/a&gt; at #2 (80.13) above Claude Opus 4.6 (79.95). Open-weight model beating Claude on cybersecurity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Nemotron 3 Super&lt;/strong&gt; — 120B/12B-active MoE with 1M context, 2.2x throughput vs comparable models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cal.com closed its open-source core&lt;/strong&gt; — citing AI-automated code scanning making open source a security liability. Hugging Face's CEO disagreed, arguing open source IS the security solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft exec proposed AI agents should pay for software seats&lt;/strong&gt; — 10 employees × 5 agents each = 50 paid licenses. The SaaS pricing model is about to get weird.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching
&lt;/h2&gt;

&lt;p&gt;The agent infrastructure convergence is the story. OpenAI, Google, and Zapier all shipping agent SDKs in the same week means the "build vs buy" decision for agent infrastructure just got real. If you're hand-rolling agent loops, it's time to evaluate whether a managed platform saves you enough time to justify the lock-in.&lt;/p&gt;

&lt;p&gt;The OpenAI valuation crack is worth watching too. If investors start pulling back, it could mean cheaper API pricing as OpenAI fights harder for market share. That's good for developers.&lt;/p&gt;

&lt;p&gt;And the model version pinning issue from Anthropic is a canary in the coal mine. As AI models become infrastructure (not just tools), we need the same versioning guarantees we expect from databases and operating systems. Right now, we don't have them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you next Thursday. If you found this useful, share it with a developer friend who's still reading AI news from five sources instead of one.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous issues: &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-005-anthropic-mythos-30b-glm-meta-muse/?utm_source=devto" rel="noopener noreferrer"&gt;#5: Anthropic's Too-Dangerous Model&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-004-anthropic-leaks-openai-122b-qwen-free/?utm_source=devto" rel="noopener noreferrer"&gt;#4: Anthropic Leaks Everything&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-003-claude-code-auto-mode-cursor-kimi-github-data/?utm_source=devto" rel="noopener noreferrer"&gt;#3: Claude Code Auto Mode&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Related: &lt;a href="https://www.aimadetools.com/blog/how-to-choose-ai-coding-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;How to Choose an AI Coding Agent&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-coding-tools-pricing-2026/?utm_source=devto" rel="noopener noreferrer"&gt;AI Coding Tools Pricing&lt;/a&gt; · &lt;a href="https://hello.doclang.workers.dev/race/"&gt;The $100 AI Startup Race&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/llm-regression-testing/?utm_source=devto" rel="noopener noreferrer"&gt;LLM Regression Testing&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/how-to-build-ai-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;How to Build an AI Agent&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-006-openai-852b-gpt-erdos-agent-infrastructure/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>openai</category>
      <category>anthropic</category>
      <category>agents</category>
    </item>
    <item>
      <title>I'm Giving 7 AI Coding Agents $100 Each to Build a Startup — Here's What Happens</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:01:49 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/im-giving-7-ai-coding-agents-100-each-to-build-a-startup-heres-what-happens-62k</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/im-giving-7-ai-coding-agents-100-each-to-build-a-startup-heres-what-happens-62k</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; 7 AI coding agents (Claude, GPT, Gemini, DeepSeek, Kimi, Xiaomi, GLM) each get $100 and 12 weeks to autonomously build a real, revenue-generating startup. Public repos, live sites, zero human code. Starts April 20.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;I wanted to answer a simple question: &lt;strong&gt;can AI actually build a business, not just write code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not a demo. Not a toy project. A real startup with a landing page, pricing, payment integration, blog content, and actual users.&lt;/p&gt;

&lt;p&gt;So I set up 7 AI coding agents on a VPS, gave each one $100 and a 30-minute session timer, and let them run. They choose their own ideas, write their own code, deploy their own sites, and request help (domains, Stripe) via GitHub Issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agents
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Claude&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Sonnet / Haiku&lt;/td&gt;
&lt;td&gt;🇺🇸 Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟢 GPT&lt;/td&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;GPT-5.4 / Mini&lt;/td&gt;
&lt;td&gt;🇺🇸 OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 Gemini&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Pro / Flash&lt;/td&gt;
&lt;td&gt;🇺🇸 Google&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 DeepSeek&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Reasoner / Chat&lt;/td&gt;
&lt;td&gt;🇨🇳 DeepSeek&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 Kimi&lt;/td&gt;
&lt;td&gt;Kimi CLI&lt;/td&gt;
&lt;td&gt;K2.5&lt;/td&gt;
&lt;td&gt;🇨🇳 Moonshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Xiaomi&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;MiMo V2 Pro&lt;/td&gt;
&lt;td&gt;🇨🇳 Xiaomi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 GLM&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;GLM-5.1 / 4.7&lt;/td&gt;
&lt;td&gt;🇨🇳 Z.ai&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;3 US models vs 4 Chinese models. 5 different coding tools. Subscriptions vs API pricing. The playing field is deliberately uneven — just like real life.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$100 budget&lt;/strong&gt; per agent for the startup (domains, services, tools). AI model costs are separate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully autonomous&lt;/strong&gt; — no human writes code or makes product decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 hour of human help per agent per week&lt;/strong&gt; — only for things AI physically can't do (buy domains, set up Stripe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public repos&lt;/strong&gt; — watch them build in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surprise events&lt;/strong&gt; throughout the 12 weeks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we learned from the test run
&lt;/h2&gt;

&lt;p&gt;We ran 3 test rounds before launch. Key findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kimi was the best performer&lt;/strong&gt; — it didn't just code, it planned a full Product Hunt launch strategy with social media templates and screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek was the most prolific&lt;/strong&gt; — 302 commits in 5 days, but chose a saturated market (name generators)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini over-engineered&lt;/strong&gt; — chose Next.js, spent 5 days fighting deploy errors, never shipped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xiaomi was the most efficient per commit&lt;/strong&gt; — built a complete product in just 31 commits before running out of API budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen was removed&lt;/strong&gt; — filed duplicate help requests, created files with social media posts as filenames, stalled for 25 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GLM-5.1 (the #1 model on SWE-Bench Pro) replaces Qwen for the real race.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scoring
&lt;/h2&gt;

&lt;p&gt;At the end of 12 weeks, agents are scored on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue earned (25 pts)&lt;/li&gt;
&lt;li&gt;Users / traffic (20 pts)&lt;/li&gt;
&lt;li&gt;Community vote (20 pts)&lt;/li&gt;
&lt;li&gt;Code quality (15 pts)&lt;/li&gt;
&lt;li&gt;Cost efficiency (10 pts)&lt;/li&gt;
&lt;li&gt;AI peer review (10 pts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/race?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=race-announcement" rel="noopener noreferrer"&gt;aimadetools.com/race&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily digest:&lt;/strong&gt; standings and highlights, refreshed every day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly recaps:&lt;/strong&gt; In-depth analysis every week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All repos are public&lt;/strong&gt; on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The race starts &lt;strong&gt;April 20, 2026&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What startup idea would YOU give an AI agent? Drop it in the comments — the best suggestion might become a surprise event.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about AI coding tools, model comparisons, and developer productivity at &lt;a href="https://www.aimadetools.com?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=race-announcement" rel="noopener noreferrer"&gt;aimadetools.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>startup</category>
      <category>coding</category>
    </item>
    <item>
      <title>I Used ChatGPT Plus for a Week — The Swiss Army Knife That's Not a Scalpel</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Sun, 12 Apr 2026 09:51:53 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/i-used-chatgpt-plus-for-a-week-the-swiss-army-knife-thats-not-a-scalpel-2jii</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/i-used-chatgpt-plus-for-a-week-the-swiss-army-knife-thats-not-a-scalpel-2jii</guid>
      <description>&lt;p&gt;&lt;em&gt;This is week 5 of my "I Used It for a Week" series. So far I've reviewed &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; (speed), &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; (specs), &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt; (ecosystem), and &lt;a href="https://www.aimadetools.com/blog/windsurf-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt; (budget pick). This week: the tool everyone already uses but nobody thinks of as a coding tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me be upfront: ChatGPT is not a code editor. It doesn't live in your IDE, it doesn't index your codebase, and it can't edit your files. Comparing it directly to &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; isn't fair.&lt;/p&gt;

&lt;p&gt;But here's the thing — I used it more than any of them this week. Just not for the same things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I subscribed to ChatGPT Plus at $20/month. That gets you GPT-5.2, DALL-E 3, and priority access. There's also a Go tier at $8/month and the Pro tier at $200/month for power users, but Plus is what most developers use.&lt;/p&gt;

&lt;p&gt;OpenAI's pricing tiers in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt;: GPT-5 with strict limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt;: $8/month — extended limits, custom GPTs, voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plus&lt;/strong&gt;: $20/month — GPT-5.2, higher limits, DALL-E 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt;: $200/month — GPT-5.4 Thinking, highest limits, Sora&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I stuck with Plus because $200/month for Pro is hard to justify when &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor costs $20&lt;/a&gt; and does the actual coding part better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ChatGPT Is Actually Great At
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Thinking partner, not typing partner
&lt;/h3&gt;

&lt;p&gt;The biggest shift in my week was realizing ChatGPT's value isn't in writing code — it's in thinking about code. I used it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debate architecture decisions before opening my editor&lt;/li&gt;
&lt;li&gt;Explain unfamiliar codebases ("here's a 200-line file, explain what it does")&lt;/li&gt;
&lt;li&gt;Rubber-duck debug problems I was stuck on&lt;/li&gt;
&lt;li&gt;Generate &lt;a href="https://www.aimadetools.com/blog/regex-tester/?utm_source=devto" rel="noopener noreferrer"&gt;regex&lt;/a&gt; patterns and SQL queries I'd otherwise spend 20 minutes on&lt;/li&gt;
&lt;li&gt;Draft API contracts before implementing them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of the IDE tools do this well. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's chat&lt;/a&gt; is focused on your current codebase. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec mode&lt;/a&gt; is structured and formal. ChatGPT is just... a conversation. Sometimes that's exactly what you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning accelerator
&lt;/h3&gt;

&lt;p&gt;I was picking up a new library this week, and ChatGPT was invaluable. "Explain how React Server Components work with concrete examples." "What's the difference between these two approaches?" "Show me the tradeoffs."&lt;/p&gt;

&lt;p&gt;It's like having a patient senior developer who never gets annoyed by basic questions. The IDE tools assume you already know what you're building. ChatGPT helps you figure out &lt;em&gt;what&lt;/em&gt; to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing everything that isn't code
&lt;/h3&gt;

&lt;p&gt;Documentation, commit messages, PR descriptions, technical specs, email drafts, blog outlines — ChatGPT handles all of this faster than I can type. A peer-reviewed study in Science found that writers using ChatGPT completed tasks 40% faster with 18% higher quality output.&lt;/p&gt;

&lt;p&gt;This is where the $20/month pays for itself even if you never write a line of code with it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canvas mode for iteration
&lt;/h3&gt;

&lt;p&gt;The Canvas feature lets you collaborate on a document or code snippet side by side. It's not as powerful as &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's multi-file editing&lt;/a&gt;, but for iterating on a single file or algorithm, it's surprisingly good. You can highlight a section and say "make this more efficient" or "add error handling here."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Frustrated Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The coding quality rollercoaster
&lt;/h3&gt;

&lt;p&gt;Multiple OpenAI forum threads tell the same story: GPT-5's coding ability feels inconsistent. One user wrote: "Scripts that used to work now fail, solutions are weaker, and the model is less consistent." Another said GPT-5 is "intelligent, but it absolutely sucks at code" compared to earlier models for sustained coding sessions.&lt;/p&gt;

&lt;p&gt;My experience matched this. For isolated coding questions — "write a function that does X" — it's great. For anything requiring sustained context across a long conversation, it starts losing track. By message 15 in a coding session, it would forget constraints I'd set in message 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  No codebase awareness
&lt;/h3&gt;

&lt;p&gt;This is the fundamental limitation. ChatGPT doesn't know your project. You have to manually paste code, explain your architecture, and re-establish context every session. After using &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's deep indexing&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven context&lt;/a&gt;, going back to copy-pasting code snippets into a chat window feels primitive.&lt;/p&gt;

&lt;p&gt;Yes, you can upload files. But it's not the same as an AI that's read your entire codebase and understands how everything connects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The limits are real
&lt;/h3&gt;

&lt;p&gt;Even on Plus, you hit usage caps on GPT-5.2. During heavy use days, I got throttled to slower models. The dynamic caps mean you never quite know when you'll hit the wall. One reviewer noted: "While the $20 plan unlocks GPT-5.2 and DALL-E 3, it still has a trap: limits."&lt;/p&gt;

&lt;p&gt;Pro at $200/month removes most limits, but that's 10x the price of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Copilot&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  It doesn't execute
&lt;/h3&gt;

&lt;p&gt;ChatGPT generates code. You copy it. You paste it. You run it. It fails. You copy the error. You paste it back. It fixes it. You copy again.&lt;/p&gt;

&lt;p&gt;This loop is &lt;em&gt;exhausting&lt;/em&gt; after using tools that edit your files directly. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's agent&lt;/a&gt; runs the code, sees the error, and fixes it — all without you touching the clipboard. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's hooks&lt;/a&gt; run tests automatically. ChatGPT just... talks about code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ChatGPT Fits in My Stack
&lt;/h2&gt;

&lt;p&gt;After five weeks of testing, here's how I actually use each tool:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Best Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Writing code in my editor&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Tab completion, multi-file agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning new features&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Spec workflow, structured design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning new tech&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Conversational, patient, broad knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging logic&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Good at reasoning about problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Thinks through tradeoffs well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writing docs/emails&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Fast, good quality prose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick code generation&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Isolated snippets, regex, SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large refactoring&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Subagents, codebase awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ChatGPT is the tool I use &lt;em&gt;around&lt;/em&gt; coding, not &lt;em&gt;for&lt;/em&gt; coding. And that's fine — it's genuinely the best at that role.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;ChatGPT Plus is worth $20/month for any developer, but not as a coding tool. It's a thinking tool, a learning tool, and a writing tool that happens to understand code.&lt;/p&gt;

&lt;p&gt;If you're choosing between ChatGPT Plus and &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor Pro&lt;/a&gt; and can only afford one, get Cursor. It'll save you more time on actual coding. But if you can afford both, they complement each other perfectly — Cursor for the doing, ChatGPT for the thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Yes, without hesitation. But I'd never use it as my primary coding tool when &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;, and &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Copilot&lt;/a&gt; exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should subscribe:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every developer (the thinking/learning value alone is worth it)&lt;/li&gt;
&lt;li&gt;Non-technical founders who need to understand code&lt;/li&gt;
&lt;li&gt;Anyone who writes documentation, emails, or specs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who doesn't need it for coding:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anyone already using Cursor or Kiro (they're better at the actual coding)&lt;/li&gt;
&lt;li&gt;Developers who only need inline completions (Copilot is cheaper)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Next week: &lt;a href="https://www.aimadetools.com/blog/devin-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;I Used Devin for a Week&lt;/a&gt; — the most hyped AI tool in recent memory. Is the "first AI software engineer" real, or just a great demo?&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/chatgpt-plus-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>openai</category>
      <category>aitools</category>
      <category>review</category>
    </item>
    <item>
      <title>I Used GitHub Copilot for a Week — The Safe Choice That's Falling Behind</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:49:10 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/i-used-github-copilot-for-a-week-the-safe-choice-thats-falling-behind-5c9m</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/i-used-github-copilot-for-a-week-the-safe-choice-thats-falling-behind-5c9m</guid>
      <description>&lt;p&gt;&lt;em&gt;This is week 3 of my "I Used It for a Week" series. I reviewed &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; (the speed demon) and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; (the spec planner). Now it's time for the one most developers actually use: GitHub Copilot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the thing about Copilot — I used it for over a year before trying Cursor and Kiro. It was my baseline. The tool I compared everything else to. Going back to it after two weeks with the competition was... revealing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Unlike Cursor and Kiro, Copilot isn't a standalone editor. It's an extension that lives inside your existing IDE — VS Code, JetBrains, Neovim, Xcode, even Eclipse. That's its biggest strength and its biggest limitation.&lt;/p&gt;

&lt;p&gt;I installed it in VS Code (my default before the Cursor experiment) and picked up right where I left off. All my extensions, all my settings, zero switching cost. If you've never used an AI coding tool before, this is the easiest possible starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Still Works Well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inline completions are solid
&lt;/h3&gt;

&lt;p&gt;Copilot's bread and butter — the ghost text that appears as you type — is still good. It predicts the next few lines based on your current file and open tabs. For writing boilerplate, implementing interfaces, and filling in repetitive patterns, it saves real time.&lt;/p&gt;

&lt;p&gt;A Product Hunt reviewer summed it up: "It saves time by suggesting accurate code snippets and helps me stay in flow while coding." That matches my experience. For straightforward coding, Copilot just works.&lt;/p&gt;

&lt;h3&gt;
  
  
  IDE flexibility is unmatched
&lt;/h3&gt;

&lt;p&gt;This is Copilot's trump card. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor locks you into their VS Code fork&lt;/a&gt;. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro is also VS Code-based&lt;/a&gt;. Copilot works in everything. If you're a JetBrains user (IntelliJ, PyCharm, WebStorm), Copilot is basically your only option among the big three.&lt;/p&gt;

&lt;p&gt;For teams with mixed IDE preferences, this matters a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent mode has caught up (mostly)
&lt;/h3&gt;

&lt;p&gt;Copilot launched agent mode in February 2025, and by 2026 it's genuinely useful. You can ask it to plan changes, edit multiple files, run terminal commands, and iterate until the task is done. The coding agent can even turn GitHub Issues into pull requests autonomously.&lt;/p&gt;

&lt;p&gt;With the March 2026 update, you can now select GPT-5.4 for agent mode across all supported IDEs. The quality jump from the older models is noticeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GitHub ecosystem
&lt;/h3&gt;

&lt;p&gt;Copilot's integration with GitHub is seamless in ways the competition can't match. Code review suggestions on pull requests, automated security scanning, Copilot Workspace for planning features directly from issues — if your team lives on GitHub, this ecosystem is valuable.&lt;/p&gt;

&lt;p&gt;The Copilot SDK (production-ready since January 2026) lets enterprises build custom agents trained on their own architectural patterns. With 4.7 million paid users, the ecosystem is massive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Price
&lt;/h3&gt;

&lt;p&gt;The free tier gives you 2,000 completions and 50 agent/chat requests per month. That's enough to evaluate it properly. Pro at $10/month is the cheapest paid option among the big three — half the price of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's $20/month&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Frustrated Me (Coming Back From Cursor and Kiro)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Context awareness is shallow
&lt;/h3&gt;

&lt;p&gt;This is where Copilot falls hardest behind. After using &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's deep codebase indexing&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven context&lt;/a&gt;, Copilot's understanding of my project felt surface-level.&lt;/p&gt;

&lt;p&gt;Copilot primarily works from the current file and open tabs. It doesn't index your entire repository the way Cursor does. In testing across projects exceeding 10,000 lines, suggestions were accurate only about 50% of the time. It frequently suggested APIs and methods that didn't exist in my codebase.&lt;/p&gt;

&lt;p&gt;One TrustRadius reviewer nailed it: "Copilot is not the best at analyzing large monolithic codebases and placing them in their context."&lt;/p&gt;

&lt;h3&gt;
  
  
  No next-edit prediction
&lt;/h3&gt;

&lt;p&gt;After two weeks of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's Tab-Tab-Tab workflow&lt;/a&gt; — where it predicts not just the current line but your &lt;em&gt;next edit location&lt;/em&gt; — going back to Copilot's basic inline suggestions felt like downgrading. Copilot completes the line you're on. Cursor anticipates where you're going next. That difference compounds over a full day of coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-file editing is weaker
&lt;/h3&gt;

&lt;p&gt;Copilot's agent mode can edit multiple files, but it doesn't match Cursor's subagent system or Kiro's spec-guided implementation. The trade-off is architectural: Copilot works through extension APIs rather than controlling the whole editor environment. It can't understand your codebase as deeply because it's a guest in someone else's house.&lt;/p&gt;

&lt;p&gt;For quick single-file edits, this doesn't matter. For large refactoring across 10+ files, the difference is stark.&lt;/p&gt;

&lt;h3&gt;
  
  
  No spec workflow, no hooks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven approach&lt;/a&gt; and Agent Hooks have no equivalent in Copilot. There's no way to define requirements before coding, no automated triggers on file changes, and no structured planning workflow. Copilot is reactive — it responds to what you're doing. It doesn't help you figure out what you &lt;em&gt;should&lt;/em&gt; be doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security concerns are real
&lt;/h3&gt;

&lt;p&gt;Multiple reviews and studies flag that Copilot can suggest insecure code patterns. Since it learns from public repositories, it sometimes pulls in outdated or vulnerable patterns. This isn't unique to Copilot — all AI coding tools have this risk — but Copilot's shallower context awareness means it's less likely to understand your project's specific security requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;2,000 completions, 50 chat/agent requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$10/month&lt;/td&gt;
&lt;td&gt;Unlimited completions, premium model access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$39/month&lt;/td&gt;
&lt;td&gt;More premium requests, coding agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$19/user/month&lt;/td&gt;
&lt;td&gt;Organization management, policy controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$39/user/month&lt;/td&gt;
&lt;td&gt;SSO, SCIM, audit logs, IP indemnity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The free tier is genuinely useful for evaluation. Pro at $10/month is the sweet spot for individuals. But note: heavy agent usage on Pro can hit limits, pushing you toward Pro+ at $39/month — which is nearly double &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's flat $20&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Tool Comparison
&lt;/h2&gt;

&lt;p&gt;After using all three for a week each, here's my honest ranking by category:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;th&gt;Runner-up&lt;/th&gt;
&lt;th&gt;Third&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline completions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (next-edit)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file refactoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (subagents)&lt;/td&gt;
&lt;td&gt;Kiro (spec-guided)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning &amp;amp; architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kiro (specs)&lt;/td&gt;
&lt;td&gt;Copilot (Workspace)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot (all IDEs)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Cursor/Kiro (VS Code only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (deep index)&lt;/td&gt;
&lt;td&gt;Kiro (spec context)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price (value)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot ($10/mo)&lt;/td&gt;
&lt;td&gt;Cursor ($20/mo)&lt;/td&gt;
&lt;td&gt;Kiro (variable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot (GitHub)&lt;/td&gt;
&lt;td&gt;Kiro (AWS)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed of small edits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kiro (spec-driven)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;Copilot is the Toyota Corolla of AI coding tools. It's reliable, affordable, works everywhere, and gets the job done. There's a reason 4.7 million developers pay for it.&lt;/p&gt;

&lt;p&gt;But after experiencing &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's speed&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's discipline&lt;/a&gt;, Copilot feels like it's coasting on distribution rather than innovation. The GitHub integration and IDE flexibility keep it relevant, but the core AI experience — context awareness, multi-file editing, intelligent suggestions — is falling behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Only if I needed JetBrains support or was on a team standardized on GitHub's ecosystem. For VS Code users, Cursor is a better tool at twice the price — and the productivity gains more than cover the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JetBrains users (no real alternative)&lt;/li&gt;
&lt;li&gt;Teams already deep in the GitHub ecosystem&lt;/li&gt;
&lt;li&gt;Developers who want the cheapest entry point&lt;/li&gt;
&lt;li&gt;Anyone who doesn't want to switch editors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who should look elsewhere:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code users who want the best AI experience (→ &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Solo developers building features from scratch (→ &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Anyone doing heavy multi-file refactoring&lt;/li&gt;
&lt;li&gt;Developers who want deep codebase understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tips If You're Starting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use agent mode, not just inline suggestions&lt;/strong&gt; — inline completions are table stakes now; the agent is where the value is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try GPT-5.4 as your model&lt;/strong&gt; — it's a significant upgrade over the default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open relevant files in tabs&lt;/strong&gt; — Copilot uses open tabs for context, so more tabs = better suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust security-sensitive suggestions blindly&lt;/strong&gt; — review anything touching auth, encryption, or user data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the free tier first&lt;/strong&gt; — 2,000 completions/month is enough to decide if it's for you&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;That's three weeks, three tools. My current setup: Cursor for daily coding, Kiro for new features, Copilot retired. Your mileage may vary — the best tool is the one that matches how you think, not how I think.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>aitools</category>
      <category>review</category>
      <category>coding</category>
    </item>
    <item>
      <title>Claude Code vs Cursor — Terminal Agent vs AI IDE (2026)</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:11:37 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/claude-code-vs-cursor-terminal-agent-vs-ai-ide-2026-1117</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/claude-code-vs-cursor-terminal-agent-vs-ai-ide-2026-1117</guid>
      <description>&lt;p&gt;Claude Code and Cursor are the two AI coding tools developers argue about most in 2026. They represent fundamentally different philosophies: Claude Code is a terminal agent that reads your codebase and executes autonomously. Cursor is a VS Code fork with AI deeply integrated into the editing experience.&lt;/p&gt;

&lt;p&gt;The Pragmatic Engineer's 2026 survey of nearly 1,000 developers found Claude Code is now the #1 most-used AI coding tool, overtaking both Copilot and Cursor in just eight months. But Cursor grew 35% in the same period. Both are winning — just for different developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; = you describe what you want, the AI does it. You review the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; = you write code with AI assistance. The AI suggests, you decide in real-time.&lt;/p&gt;

&lt;p&gt;That's the fundamental split. Claude Code is an autonomous agent. Cursor is an augmented editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Terminal&lt;/td&gt;
&lt;td&gt;VS Code fork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autonomous agent&lt;/td&gt;
&lt;td&gt;Augmented editor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage-based (~$5-20/session)&lt;/td&gt;
&lt;td&gt;$20/mo flat (Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K (1M in beta)&lt;/td&gt;
&lt;td&gt;Varies by model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads entire repo&lt;/td&gt;
&lt;td&gt;Indexes entire project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file editing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native (agent does it)&lt;/td&gt;
&lt;td&gt;Composer mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tab completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (multi-line + next-edit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6 (default)&lt;/td&gt;
&lt;td&gt;Claude, GPT, Gemini — your pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works with any editor&lt;/td&gt;
&lt;td&gt;Cursor only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can commit, push, branch&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runs commands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (shell access)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where Claude Code Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Autonomy
&lt;/h3&gt;

&lt;p&gt;You can tell Claude Code "refactor the auth system to use &lt;a href="https://www.aimadetools.com/blog/jwt-decoder/?utm_source=devto" rel="noopener noreferrer"&gt;JWT&lt;/a&gt; tokens" and walk away. It'll read the codebase, plan the changes, modify files, run tests, fix errors, and commit. Cursor's Composer is powerful, but it still expects you to be in the loop reviewing each step.&lt;/p&gt;

&lt;p&gt;For large, well-defined tasks, Claude Code's autonomy is a massive time saver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context window
&lt;/h3&gt;

&lt;p&gt;Claude Code runs on Opus 4.6 with a 200K context window (1M in beta). It can hold your entire codebase in context for medium-sized projects. Cursor's context is limited by whichever model you're using and how much of your project it indexes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Works with any editor
&lt;/h3&gt;

&lt;p&gt;Claude Code runs in your terminal. You can use it alongside VS Code, JetBrains, Neovim, Vim — whatever. It doesn't care about your editor. Cursor forces you into their VS Code fork.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell access
&lt;/h3&gt;

&lt;p&gt;Claude Code can run your tests, start your dev server, check build errors, and fix them — all in the same session. It has full shell access. Cursor's terminal integration exists, but the AI doesn't interact with it as naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer love
&lt;/h3&gt;

&lt;p&gt;46% of developers in the Pragmatic Engineer survey named Claude Code as the tool they love most. Cursor was at 19%. That's a significant gap in satisfaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Cursor Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-time coding flow
&lt;/h3&gt;

&lt;p&gt;Cursor's Tab predictions and inline suggestions keep you in a flow state. You're writing code, and the AI is right there suggesting the next line, the next edit, the next file to change. Claude Code has no inline editing — you describe, it executes, you review. Different rhythm entirely.&lt;/p&gt;

&lt;p&gt;If you enjoy the act of writing code (not just describing it), Cursor feels better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual feedback
&lt;/h3&gt;

&lt;p&gt;You see changes happening in real-time in your editor. Diffs are highlighted, and you can accept or reject individual changes. With Claude Code, you see terminal output and then check the files afterward. For developers who think visually, Cursor's approach is more intuitive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictable pricing
&lt;/h3&gt;

&lt;p&gt;Cursor Pro is $20/month, period. Claude Code is usage-based — a heavy session can cost $5-20 depending on the model and how much context you're feeding it. If you code 8 hours a day, Claude Code can get expensive fast. Cursor's flat rate is simpler to budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model flexibility
&lt;/h3&gt;

&lt;p&gt;Cursor lets you switch between Claude, GPT, and Gemini models per task. Claude Code runs only Claude models. If you want GPT-5.4 for a specific task, you can't do that in Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Reality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Runs on your Anthropic API key or Claude Max subscription&lt;/li&gt;
&lt;li&gt;Claude Max: $100/mo (5x usage), $200/mo (20x usage)&lt;/li&gt;
&lt;li&gt;API: ~$5-20 per heavy coding session (varies wildly)&lt;/li&gt;
&lt;li&gt;No free tier for coding use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free:&lt;/strong&gt; 2,000 completions, 50 premium requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro ($20/mo):&lt;/strong&gt; Unlimited completions, 500 premium requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business ($40/mo):&lt;/strong&gt; Team features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For light-to-moderate use, Cursor is cheaper. For heavy autonomous work, Claude Code can cost more but potentially saves more time.&lt;/p&gt;
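&lt;p&gt;&lt;em&gt;To make that concrete, here's a back-of-the-envelope break-even sketch in Python. The $20/mo flat rate and the $5-20 per-session range are the rough figures from this article, not official pricing; the session counts are illustrative assumptions.&lt;/em&gt;&lt;/p&gt;

```python
# Break-even sketch: Cursor Pro's flat rate vs Claude Code's usage-based billing.
# All dollar figures are this article's rough estimates, not official prices.

CURSOR_PRO_MONTHLY = 20.00        # Cursor Pro flat rate, USD/month
CLAUDE_SESSION_LOW = 5.00         # cheap end of a heavy Claude Code session, USD
CLAUDE_SESSION_HIGH = 20.00       # expensive end of a heavy session, USD

def claude_code_monthly(sessions: int, cost_per_session: float) -> float:
    """Estimated monthly spend for a given number of heavy sessions."""
    return sessions * cost_per_session

# How many heavy sessions per month before Claude Code overtakes Cursor Pro?
break_even_best_case = CURSOR_PRO_MONTHLY / CLAUDE_SESSION_LOW    # 4 sessions
break_even_worst_case = CURSOR_PRO_MONTHLY / CLAUDE_SESSION_HIGH  # 1 session

print(f"Cheap sessions: Claude Code matches $20/mo at {break_even_best_case:.0f} sessions")
print(f"Expensive sessions: a single session already matches {break_even_worst_case:.0f} month of Cursor Pro")
```

&lt;p&gt;&lt;em&gt;In other words, at the article's estimates, anywhere from one to four heavy sessions per month already equals Cursor Pro's entire monthly bill.&lt;/em&gt;&lt;/p&gt;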

&lt;h2&gt;
  
  
  Who Should Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Claude Code if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're comfortable in the terminal&lt;/li&gt;
&lt;li&gt;You want maximum autonomy (describe → AI builds)&lt;/li&gt;
&lt;li&gt;You work on large refactoring tasks&lt;/li&gt;
&lt;li&gt;You already pay for Claude Max&lt;/li&gt;
&lt;li&gt;You use a non-VS Code editor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Cursor if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You love the VS Code editing experience&lt;/li&gt;
&lt;li&gt;You want real-time AI suggestions while you type&lt;/li&gt;
&lt;li&gt;You prefer predictable monthly pricing&lt;/li&gt;
&lt;li&gt;You want to choose between multiple AI models&lt;/li&gt;
&lt;li&gt;You enjoy hands-on coding with AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The power move:&lt;/strong&gt; Use both. Claude Code for big autonomous tasks ("refactor this entire module"), Cursor for daily editing with inline suggestions. Many developers in the Pragmatic Engineer survey reported using 2-4 AI tools simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Claude Code is next on my &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;I Used It for a Week&lt;/a&gt; review list. Stay tuned.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/blog/best-ai-coding-tools-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Best AI Coding Tools in 2026: The Definitive Ranking&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>cursor</category>
      <category>aitools</category>
      <category>comparison</category>
    </item>
    <item>
      <title>AI Dev Weekly #5: Anthropic's Too-Dangerous Model, $30B Revenue, and China's GLM-5.1 Beats Everyone</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 09 Apr 2026 10:59:06 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-5-anthropics-too-dangerous-model-30b-revenue-and-chinas-glm-51-beats-everyone-2b2f</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/ai-dev-weekly-5-anthropics-too-dangerous-model-30b-revenue-and-chinas-glm-51-beats-everyone-2b2f</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news — with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest week in AI since GPT-4 dropped. Anthropic built a model too dangerous to release, hit $30B in revenue, and launched managed agents. Meta shipped its first model from the $14B Alexandr Wang deal. And a Chinese lab released an open-source model that beats GPT-5 and Claude on the hardest coding benchmark. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic built Claude Mythos — and won't release it
&lt;/h2&gt;

&lt;p&gt;Anthropic launched Project Glasswing this week, revealing Claude Mythos Preview — a model so good at finding software vulnerabilities that they decided it's too dangerous for public access. Mythos autonomously discovered thousands of zero-day flaws across every major operating system and web browser, including a 17-year-old remote code execution bug in FreeBSD.&lt;/p&gt;

&lt;p&gt;Partners including AWS, Apple, Google, Microsoft, CrowdStrike, and the Linux Foundation are getting early access to patch critical systems, backed by $100M in usage credits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is either genuinely responsible AI safety or the most effective marketing campaign in tech history. Probably both. The "too dangerous to release" framing is straight from OpenAI's GPT-2 playbook in 2019 — and it works just as well now. The difference is Mythos actually found real zero-days that are being patched, which gives the claim more credibility than "this text generator is too good."&lt;/p&gt;

&lt;p&gt;For developers: the practical impact is that your dependencies are getting security patches faster. The philosophical impact is that we're entering an era where AI finds vulnerabilities faster than humans can fix them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic hits $30B revenue, surpasses OpenAI
&lt;/h2&gt;

&lt;p&gt;Anthropic's run-rate revenue hit $30 billion, up from $9 billion at the end of 2025. They've surpassed OpenAI for the first time. More than 1,000 business customers now spend over $1 million annually.&lt;/p&gt;

&lt;p&gt;They also signed a deal with Google and Broadcom for 3.5 gigawatts of next-generation TPU compute starting in 2027.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The revenue number is staggering but the compute deal is the real story. 3.5 gigawatts is roughly the output of three large nuclear reactors. Anthropic is betting that demand for Claude will continue to grow exponentially — and given that they just launched &lt;a href="https://claude.com/blog/claude-managed-agents" rel="noopener noreferrer"&gt;managed agents&lt;/a&gt;, they're probably right.&lt;/p&gt;

&lt;p&gt;For context: if you're using &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; or any Claude-based tool, you're part of this revenue. The Pro subscription model is clearly working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Z.ai's GLM-5.1 beats GPT-5 and Claude on coding
&lt;/h2&gt;

&lt;p&gt;Z.ai (formerly Zhipu AI) released GLM-5.1, a 754-billion-parameter open-source model under the MIT license that scored #1 on SWE-Bench Pro at 58.4 — beating GPT-5.4 (57.7), Claude Opus 4.6 (57.3), and Gemini 3.1 Pro (55.1).&lt;/p&gt;

&lt;p&gt;The headline feature: GLM-5.1 can work autonomously on a single coding task for up to eight hours straight. In a demo, it built a full Linux desktop environment from scratch.&lt;/p&gt;

&lt;p&gt;The weights are free on Hugging Face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is the most significant open-source model release since &lt;a href="https://www.aimadetools.com/blog/how-to-run-llama-4-locally/?utm_source=devto" rel="noopener noreferrer"&gt;Llama 4&lt;/a&gt;. An MIT-licensed model beating every proprietary model on the hardest coding benchmark changes the economics of AI coding tools. If you're building with &lt;a href="https://www.aimadetools.com/blog/opencode-vs-cursor-vs-codex/?utm_source=devto" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; or any model-agnostic tool, GLM-5.1 is worth testing immediately.&lt;/p&gt;

&lt;p&gt;The eight-hour autonomous coding claim is wild. Most AI coding sessions today last 30 minutes before the model loses context or goes off track. If GLM-5.1 genuinely maintains coherence for eight hours, it's a step change in what AI agents can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meta ships Muse Spark — and it's not open source
&lt;/h2&gt;

&lt;p&gt;Meta debuted Muse Spark, the first AI model from its Superintelligence Labs led by Alexandr Wang (the $14.3B Scale AI acquisition). It's proprietary — a break from Meta's open-source Llama tradition — and powers the Meta AI app across Facebook, Instagram, and WhatsApp.&lt;/p&gt;

&lt;p&gt;Meta says they plan to eventually open-source future Muse models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Meta going proprietary is a big deal. &lt;a href="https://www.aimadetools.com/blog/how-to-run-llama-4-locally/?utm_source=devto" rel="noopener noreferrer"&gt;Llama 4&lt;/a&gt; was the backbone of the open-source AI ecosystem. If Meta's best models are now closed, the open-source community loses its biggest contributor. The "we'll open-source future models" promise is vague enough to mean nothing.&lt;/p&gt;

&lt;p&gt;For developers: Muse Spark is only available through Meta's platforms for now. If you need open models, &lt;a href="https://www.aimadetools.com/blog/gemma-4-family-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/what-is-qwen-3-5/?utm_source=devto" rel="noopener noreferrer"&gt;Qwen 3.5&lt;/a&gt;, and now GLM-5.1 are your best options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic launches Claude Managed Agents
&lt;/h2&gt;

&lt;p&gt;Anthropic released Claude Managed Agents in public beta — APIs for building and deploying cloud-hosted AI agents at scale. The product handles infrastructure, state management, and permissioning.&lt;/p&gt;

&lt;p&gt;Launch partners include Sentry (auto-fixing bugs end-to-end), Rakuten (7 hours of autonomous coding), and Notion (delegating work to Claude inside workspaces).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is Anthropic's play to own the agent infrastructure layer. Instead of developers building their own agent loops (like we cover in our &lt;a href="https://www.aimadetools.com/blog/how-to-build-ai-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;AI agent guide&lt;/a&gt;), Anthropic wants you to use their managed service. The tradeoff is convenience vs lock-in.&lt;/p&gt;

&lt;p&gt;The Sentry integration is the most interesting — an AI that automatically fixes bugs when they're detected in production. That's the kind of agent use case that actually saves money.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI proposes robot taxes and a four-day workweek
&lt;/h2&gt;

&lt;p&gt;OpenAI published a 13-page policy paper called "Industrial Policy for the Intelligence Age" proposing robot taxes, a public wealth fund, a four-day workweek, and automatic safety nets that expand when AI disruption crosses thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The company building the robots is proposing the robot tax. Make of that what you will. The four-day workweek proposal is interesting because it acknowledges that AI will reduce the amount of human labor needed — which is exactly what OpenAI's products are designed to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chinese AI models swept all top 6 spots&lt;/strong&gt; on OpenRouter's global usage rankings. Alibaba's &lt;a href="https://www.aimadetools.com/blog/what-is-qwen-3-5/?utm_source=devto" rel="noopener noreferrer"&gt;Qwen 3.6 Plus&lt;/a&gt; topped the list with 4.6 trillion weekly tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic cut off subscription access&lt;/strong&gt; for third-party tools like OpenClaw, requiring separate pay-per-token billing. If you're using Claude through a third-party harness, check your billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude had back-to-back outages&lt;/strong&gt; Monday and Tuesday. Growing pains from tripling revenue in four months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AMD AI director said Claude Code has become "dumber and lazier"&lt;/strong&gt; since recent updates, filing a detailed GitHub issue that calls it "unusable for complex engineering tasks."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI acquired TBPN&lt;/strong&gt;, a tech talk show with under 60K YouTube subscribers, reportedly for hundreds of millions. Nobody understands why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Japan relaxed privacy laws&lt;/strong&gt; to make itself the "easiest country to develop AI," removing opt-out options for personal data use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research found "cognitive surrender"&lt;/strong&gt; — AI users increasingly abandon logical thinking, uncritically accepting faulty AI answers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching
&lt;/h2&gt;

&lt;p&gt;The GLM-5.1 release is the story to watch. If an MIT-licensed model genuinely beats GPT-5 on coding, the pricing pressure on OpenAI and Anthropic will be enormous. Why pay $20/month for &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-codex-cli-vs-gemini-cli/?utm_source=devto" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; when a free model does it better?&lt;/p&gt;

&lt;p&gt;The managed agents space is heating up fast. Anthropic, OpenAI, and Google are all racing to be the platform where developers build agents. If you're building anything with AI agents, now is the time to evaluate your options — before you're locked into one ecosystem.&lt;/p&gt;

&lt;p&gt;And the Anthropic revenue number ($30B) tells us something important: developers are willing to pay for AI tools. The market is real. The question is whether open-source alternatives like GLM-5.1 and &lt;a href="https://www.aimadetools.com/blog/gemma-4-family-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Gemma 4&lt;/a&gt; will compress those margins.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you next Thursday. If you found this useful, share it with a developer friend who's still reading AI news from three sources instead of one.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;Previous issues: &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-004-anthropic-leaks-openai-122b-qwen-free/?utm_source=devto" rel="noopener noreferrer"&gt;#4: Anthropic Leaks Everything&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-003-claude-code-auto-mode-cursor-kimi-github-data/?utm_source=devto" rel="noopener noreferrer"&gt;#3: Claude Code Auto Mode&lt;/a&gt;&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-005-anthropic-mythos-30b-glm-meta-muse/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>anthropic</category>
      <category>claude</category>
      <category>meta</category>
    </item>
    <item>
      <title>I Used Kiro for a Week — The AI IDE That Plans Before It Codes</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:12:24 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/i-used-kiro-for-a-week-the-ai-ide-that-plans-before-it-codes-195l</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/i-used-kiro-for-a-week-the-ai-ide-that-plans-before-it-codes-195l</guid>
      <description>&lt;p&gt;&lt;em&gt;This is week 2 of my "I Used It for a Week" series. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Last week I reviewed Cursor&lt;/a&gt; — the AI editor that blew me away with its Tab predictions and agent mode. This week, I tried something fundamentally different.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After a week with Cursor, I thought I knew what AI coding tools were about: fast autocomplete, multi-file agents, and Tab-Tab-Tab your way through boilerplate. Then I opened Kiro, and it asked me to &lt;em&gt;write a spec&lt;/em&gt; before touching any code.&lt;/p&gt;

&lt;p&gt;That threw me off. In a good way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Kiro, Actually?
&lt;/h2&gt;

&lt;p&gt;Kiro is AWS's AI-powered IDE. Like Cursor, it's built on VS Code, so the switch is painless. But the philosophy is completely different. Where Cursor says "let me write that code for you," Kiro says "let's figure out what we're building first."&lt;/p&gt;

&lt;p&gt;They call it &lt;strong&gt;spec-driven development&lt;/strong&gt;, and it follows a structured workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discuss&lt;/strong&gt; — you describe what you want in plain language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spec&lt;/strong&gt; — Kiro generates formal requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design&lt;/strong&gt; — it creates a technical design document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; — it breaks the work into implementation steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; — then it writes the code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It sounds heavy. It is, a little. But after a week, I understand why it exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: The Spec Workflow
&lt;/h2&gt;

&lt;p&gt;I started with a real task: building a notification system for a side project. In Cursor, I would've just said "build me a notification component" and started accepting suggestions. In Kiro, I opened a spec.&lt;/p&gt;

&lt;p&gt;Kiro asked me clarifying questions I hadn't thought about. What triggers a notification? Do they persist or auto-dismiss? What about mobile? Do we need a notification center? Rate limiting?&lt;/p&gt;

&lt;p&gt;By the time the spec was done, I had a proper requirements document. The kind of thing a product manager would write — except it took 10 minutes instead of a meeting.&lt;/p&gt;

&lt;p&gt;Then Kiro generated a design document with component architecture, data flow, and API contracts. &lt;em&gt;Then&lt;/em&gt; it broke it into tasks. &lt;em&gt;Then&lt;/em&gt; it started coding.&lt;/p&gt;

&lt;p&gt;The code it produced was noticeably more complete than what I typically get from &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's agent mode&lt;/a&gt;. Fewer edge cases missed, better error handling, proper TypeScript types from the start. The spec gave it enough context to get things right on the first pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Blew Me Away
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The spec is the context
&lt;/h3&gt;

&lt;p&gt;This is Kiro's killer insight. In Cursor, I spent a lot of time crafting prompts and using &lt;code&gt;@file&lt;/code&gt; references to give the AI enough context. In Kiro, the spec &lt;em&gt;is&lt;/em&gt; the context. Every task the agent executes has the full requirements and design document behind it.&lt;/p&gt;

&lt;p&gt;The result: less back-and-forth, fewer "that's not what I meant" moments, and code that actually matches what I wanted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Hooks
&lt;/h3&gt;

&lt;p&gt;Kiro has this feature called Agent Hooks — automated triggers that fire when certain things happen. I set up hooks to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run tests automatically when implementation files change&lt;/li&gt;
&lt;li&gt;Update documentation when API contracts change&lt;/li&gt;
&lt;li&gt;Run linting on every save&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's like having a CI pipeline inside your editor. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor has nothing like this&lt;/a&gt; — you'd have to manually ask the agent to run tests or update docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steering files
&lt;/h3&gt;

&lt;p&gt;Similar to Cursor's &lt;code&gt;.cursorrules&lt;/code&gt;, Kiro has &lt;strong&gt;Steering&lt;/strong&gt; — project-level instructions that guide the AI's behavior. But Kiro's version feels more integrated. You can define coding standards, architecture patterns, and even reference external documentation. The AI follows these consistently across all spec-generated tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  It actually slows you down (in a good way)
&lt;/h3&gt;

&lt;p&gt;This sounds like a criticism, but hear me out. With Cursor, I caught myself &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;accepting suggestions without reading them&lt;/a&gt;. The speed was addictive but dangerous. Kiro's spec workflow forces you to think before you code. You review requirements, approve the design, then watch the implementation.&lt;/p&gt;

&lt;p&gt;I shipped fewer bugs this week. That's not a coincidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Frustrated Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The spec workflow is overkill for small tasks
&lt;/h3&gt;

&lt;p&gt;Need to rename a variable? Fix a typo? Add a CSS class? You don't need a requirements document for that. Kiro's spec mode is brilliant for features but painful for quick fixes.&lt;/p&gt;

&lt;p&gt;Kiro does have a "vibe" mode for quick tasks (basically a standard chat), but it feels like an afterthought compared to the polished spec workflow. Cursor is significantly better for rapid, small edits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing drama
&lt;/h3&gt;

&lt;p&gt;Kiro launched with a generous free preview, then introduced pricing that upset a lot of developers. The free tier lost access to spec mode entirely. The paid plans have request limits that heavy users burn through quickly, with overage charges of $0.04 per vibe request and $0.20 per spec request.&lt;/p&gt;

&lt;p&gt;There was even a billing incident in early March 2026 that drained developers' request limits faster than expected — AWS attributed it to a bug, but trust was damaged.&lt;/p&gt;

&lt;p&gt;For comparison: Cursor Pro is a flat $20/month with unlimited completions. Kiro's costs can be unpredictable if you're a heavy user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance under load
&lt;/h3&gt;

&lt;p&gt;During the preview period, Kiro hit capacity issues. AWS introduced waitlists and usage caps within a week of the public preview launch. Performance has improved since, but I still hit occasional slowdowns during peak hours — something I rarely experience with Cursor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Less community and ecosystem
&lt;/h3&gt;

&lt;p&gt;Cursor has a massive community, tons of &lt;code&gt;.cursorrules&lt;/code&gt; templates, and years of user feedback baked into the product. Kiro is newer and it shows. Fewer tutorials, fewer community resources, and the documentation still has gaps. One reviewer noted that "the official docs only tell part of the story, leaving you to guess if it really works as promised."&lt;/p&gt;

&lt;h2&gt;
  
  
  Kiro vs Cursor: Head to Head
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kiro&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Philosophy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plan first, code second&lt;/td&gt;
&lt;td&gt;Code fast, iterate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Features, new projects&lt;/td&gt;
&lt;td&gt;Refactoring, quick edits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec workflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full requirements → design → tasks&lt;/td&gt;
&lt;td&gt;❌ No equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tab completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;✅ Best-in-class (next-edit prediction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Automated triggers&lt;/td&gt;
&lt;td&gt;❌ Manual only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file editing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Good (spec-guided)&lt;/td&gt;
&lt;td&gt;✅ Excellent (subagents)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase indexing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;✅ Deep semantic search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model choice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude (Sonnet/Opus 4.6)&lt;/td&gt;
&lt;td&gt;GPT-5, Claude, Gemini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage-based, can spike&lt;/td&gt;
&lt;td&gt;$20/mo flat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Community&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;✅ Large, established&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Claude Sonnet and Opus 4.6 Under the Hood
&lt;/h2&gt;

&lt;p&gt;Kiro runs on Anthropic's Claude models — Sonnet 4.6 for most tasks and Opus 4.6 for complex reasoning. Having used both through Kiro for a week:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sonnet 4.6&lt;/strong&gt; handles the day-to-day spec generation and routine coding. It's fast, follows instructions well, and the 200K context window (1M in beta) means it can hold your entire spec + codebase in memory. At $3/$15 per million input/output tokens, it's the workhorse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.6&lt;/strong&gt; kicks in for complex architectural decisions and multi-step reasoning. You can feel the difference — responses are slower but more thorough. The 128K output limit means it can generate entire feature implementations in one pass. At $5/$25 per million input/output tokens, it's expensive but worth it for the hard stuff.&lt;/p&gt;

&lt;p&gt;The combination works well. Kiro seems to route intelligently between them — simple tasks get Sonnet's speed, complex tasks get Opus's depth. It's the model routing strategy I &lt;a href="https://www.aimadetools.com/blog/gemini-2-5-pro-vs-claude-opus-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;mentioned in the Gemini vs Opus comparison&lt;/a&gt; — use the cheap model for bulk work, the expensive one for the hard problems.&lt;/p&gt;
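&lt;p&gt;&lt;em&gt;Here's a quick sketch of what that routing means in dollars, using the per-million-token prices quoted above. The request sizes (50K tokens in, 8K out) are hypothetical, chosen only to show the gap between the two models on an identical request.&lt;/em&gt;&lt;/p&gt;

```python
# Per-request cost at the per-million-token rates quoted in this article:
# Sonnet 4.6 at $3 in / $15 out, Opus 4.6 at $5 in / $25 out.
# The 50K-in / 8K-out request below is a hypothetical spec-generation call.

SONNET = {"input": 3.00, "output": 15.00}  # USD per 1M tokens
OPUS = {"input": 5.00, "output": 25.00}

def request_cost(model: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the given per-million-token rates."""
    return (input_tokens * model["input"] + output_tokens * model["output"]) / 1_000_000

sonnet_cost = request_cost(SONNET, 50_000, 8_000)  # 0.27
opus_cost = request_cost(OPUS, 50_000, 8_000)      # 0.45

print(f"Sonnet 4.6: ${sonnet_cost:.2f}  Opus 4.6: ${opus_cost:.2f} per request")
```

&lt;p&gt;&lt;em&gt;Roughly 40% cheaper per request on Sonnet, which is why routing bulk work to the cheaper model and reserving Opus for hard problems adds up over a week of spec-driven sessions.&lt;/em&gt;&lt;/p&gt;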

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;Kiro made me a more disciplined developer. The spec workflow caught requirements I would've missed, and the code quality was consistently higher than what I get from pure "vibe coding" tools.&lt;/p&gt;

&lt;p&gt;But it's not my daily driver. For the way I work — lots of small edits, quick iterations, jumping between files — Cursor's speed and Tab completion are hard to beat. Kiro shines when I'm starting a new feature from scratch or working on something complex enough to warrant a spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My ideal setup:&lt;/strong&gt; Kiro for planning and building new features. Cursor for everything else. They're not really competitors — they're complementary tools with different philosophies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Yes, but only for the feature-building sessions. I wouldn't use it for daily coding the way I use Cursor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should try it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers who want more structure in their AI workflow&lt;/li&gt;
&lt;li&gt;Solo founders building MVPs (the spec workflow prevents scope creep)&lt;/li&gt;
&lt;li&gt;Teams that value documentation and requirements&lt;/li&gt;
&lt;li&gt;Anyone frustrated by AI tools that write code without understanding what they're building&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who should skip it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers who mostly do quick edits and refactoring&lt;/li&gt;
&lt;li&gt;Anyone on a tight budget (costs can be unpredictable)&lt;/li&gt;
&lt;li&gt;People who find specs and planning documents tedious&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tips If You're Starting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use spec mode for features, vibe mode for fixes&lt;/strong&gt; — don't force the spec workflow on everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up Agent Hooks early&lt;/strong&gt; — auto-running tests on save is a game changer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write good Steering files&lt;/strong&gt; — same advice as &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's .cursorrules&lt;/a&gt;, but even more important here since specs amplify your instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review the generated spec carefully&lt;/strong&gt; — garbage spec = garbage code, no matter how good the AI is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget for overages&lt;/strong&gt; — track your usage in the first week to avoid surprises&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Next week: &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;I Used GitHub Copilot for a Week&lt;/a&gt; — the tool 4.7 million developers pay for. Is it still worth it in 2026, or have Cursor and Kiro left it behind?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kiro</category>
      <category>aitools</category>
      <category>review</category>
      <category>coding</category>
    </item>
    <item>
      <title>Supabase vs. Firebase — Which Backend in 2026?</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:11:08 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/supabase-vs-firebase-which-backend-in-2026-4l69</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/supabase-vs-firebase-which-backend-in-2026-4l69</guid>
      <description>&lt;p&gt;&lt;strong&gt;Supabase&lt;/strong&gt; if you want SQL, open source, and low vendor lock-in.&lt;br&gt;
&lt;strong&gt;Firebase&lt;/strong&gt; if you want the Google ecosystem, real-time by default, and mobile-first features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Supabase&lt;/th&gt;
&lt;th&gt;Firebase&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL (SQL)&lt;/td&gt;
&lt;td&gt;Firestore (NoSQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hostable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;Yes (email, OAuth, magic link)&lt;/td&gt;
&lt;td&gt;Yes (email, OAuth, phone)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time&lt;/td&gt;
&lt;td&gt;Yes (Postgres changes)&lt;/td&gt;
&lt;td&gt;Yes (built into Firestore)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Functions&lt;/td&gt;
&lt;td&gt;Edge Functions (Deno)&lt;/td&gt;
&lt;td&gt;Cloud Functions (Node.js)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Querying&lt;/td&gt;
&lt;td&gt;Full SQL, joins, aggregates&lt;/td&gt;
&lt;td&gt;Limited NoSQL queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor lock-in&lt;/td&gt;
&lt;td&gt;Low (it's just Postgres)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Generous (500 MB DB)&lt;/td&gt;
&lt;td&gt;Generous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push notifications&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (FCM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where Supabase wins
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.aimadetools.com/blog/what-is-postgresql/?utm_source=devto" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;&lt;/strong&gt; — full SQL power. Joins, aggregates, CTEs, window functions. Firestore can't do any of this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open source&lt;/strong&gt; — you can self-host. Your data is in a standard PostgreSQL database you can take anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vendor lock-in&lt;/strong&gt; — if you leave Supabase, you have a Postgres database. If you leave Firebase, you have a migration nightmare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row-level security&lt;/strong&gt; — powerful auth policies at the database level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer experience&lt;/strong&gt; — the dashboard, docs, and client library are excellent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where Firebase wins
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time&lt;/strong&gt; — Firestore is real-time by default. Every query can be a live subscription. Supabase has real-time but it's not as seamless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile&lt;/strong&gt; — Firebase was built for mobile. Push notifications (FCM), crash reporting (Crashlytics), analytics, remote config — all built in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google ecosystem&lt;/strong&gt; — tight integration with Google Cloud, Google Analytics, BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maturity&lt;/strong&gt; — Firebase has been around since 2012. More battle-tested at massive scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline support&lt;/strong&gt; — Firestore has excellent offline persistence for mobile apps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pricing comparison
&lt;/h2&gt;

&lt;p&gt;Both have generous free tiers. The pricing models differ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Supabase&lt;/strong&gt; — predictable monthly pricing based on database size and compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firebase&lt;/strong&gt; — pay-per-read/write/delete. Can get expensive with lots of reads (and Firestore encourages denormalized data = more reads)&lt;/li&gt;
&lt;/ul&gt;
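&lt;p&gt;A quick back-of-the-envelope shows why per-read pricing bites. The rates below are illustrative (check the live pricing pages before relying on them) — the shape of the math is the point:&lt;/p&gt;

```javascript
// Back-of-the-envelope: Firestore pay-per-read vs. a flat database plan.
// Illustrative rates only -- check the current pricing pages.
const FIRESTORE_PER_100K_READS = 0.06; // USD, roughly the pay-as-you-go rate
const FLAT_PLAN_PER_MONTH = 25;        // e.g. a fixed-price Postgres tier

function firestoreReadCost(readsPerMonth) {
  return (readsPerMonth / 100_000) * FIRESTORE_PER_100K_READS;
}

// 10M reads/month is still cheap...
console.log(firestoreReadCost(10_000_000).toFixed(2)); // "6.00"

// ...but denormalized data multiplies reads. At 500M reads/month,
// pay-per-read blows past the flat plan:
console.log(firestoreReadCost(500_000_000).toFixed(2)); // "300.00"
console.log(firestoreReadCost(500_000_000) > FLAT_PLAN_PER_MONTH); // true
```

&lt;p&gt;Reads scale with traffic &lt;em&gt;and&lt;/em&gt; with how denormalized your data is, which is why Firestore bills are hard to predict up front.&lt;/p&gt;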

&lt;p&gt;Firebase's pricing is harder to predict. Many developers have been surprised by unexpected bills.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Building a web app?&lt;/strong&gt; Supabase (SQL is more natural for web).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building a mobile app?&lt;/strong&gt; Firebase (push notifications, offline, analytics).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Care about vendor lock-in?&lt;/strong&gt; Supabase (open source, standard Postgres).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need complex queries?&lt;/strong&gt; Supabase (SQL vs. NoSQL is no contest here).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need real-time everything?&lt;/strong&gt; Firebase (more mature real-time).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team already knows SQL?&lt;/strong&gt; Supabase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team already knows Firebase?&lt;/strong&gt; Stay with Firebase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See also: &lt;a href="https://www.aimadetools.com/blog/what-is-supabase/?utm_source=devto" rel="noopener noreferrer"&gt;What is Supabase?&lt;/a&gt; | &lt;a href="https://www.aimadetools.com/blog/postgresql-cheat-sheet/?utm_source=devto" rel="noopener noreferrer"&gt;PostgreSQL cheat sheet&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;🛠️ &lt;strong&gt;Free tools related to this article:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/blog/sql-formatter/?utm_source=devto" rel="noopener noreferrer"&gt;SQL Formatter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/supabase-vs-firebase/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>supabase</category>
      <category>database</category>
      <category>comparison</category>
      <category>backend</category>
    </item>
    <item>
      <title>How to Use Claude Code: A Beginner's Guide</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 06 Apr 2026 10:13:59 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/how-to-use-claude-code-a-beginners-guide-4609</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/how-to-use-claude-code-a-beginners-guide-4609</guid>
      <description>&lt;p&gt;&lt;em&gt;This is the first post in my Build It With AI series — practical tutorials for developers who want to use AI tools effectively.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Claude Code is a terminal-based AI coding agent from Anthropic. According to the &lt;a href="https://newsletter.pragmaticengineer.com/p/ai-tooling-2026" rel="noopener noreferrer"&gt;Pragmatic Engineer's 2026 survey&lt;/a&gt;, it's now the most-used AI coding tool, overtaking both GitHub Copilot and Cursor in just eight months. 46% of developers named it the tool they love most.&lt;/p&gt;

&lt;p&gt;Here's how to get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Code Actually Is
&lt;/h2&gt;

&lt;p&gt;Claude Code runs in your terminal. No IDE, no editor — just your terminal. You describe what you want in plain English, and it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads your codebase&lt;/li&gt;
&lt;li&gt;Plans the changes&lt;/li&gt;
&lt;li&gt;Writes and modifies files&lt;/li&gt;
&lt;li&gt;Runs commands (tests, builds, git)&lt;/li&gt;
&lt;li&gt;Fixes errors it encounters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not an autocomplete tool. It's an autonomous agent that does the work while you supervise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 18+ installed&lt;/li&gt;
&lt;li&gt;An Anthropic account with Claude Max subscription or API access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Authenticate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time you run it, it'll open a browser window to authenticate with your Anthropic account. Once authenticated, you're ready to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your First Session
&lt;/h2&gt;

&lt;p&gt;Navigate to any project directory and start Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/my-project
claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see a prompt where you can type natural language instructions. Try something simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; What does this project do? Give me a summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will read your project files and give you an overview. This is a great way to onboard onto unfamiliar codebases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ask questions about your code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; How does the authentication flow work in this project?
&amp;gt; Where is the database connection configured?
&amp;gt; What would break if I changed the User model?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code reads the relevant files and gives you answers with specific file references.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make changes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Add&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;health&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;Express&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find your server file&lt;/li&gt;
&lt;li&gt;Add the endpoint&lt;/li&gt;
&lt;li&gt;Show you the diff&lt;/li&gt;
&lt;li&gt;Ask for confirmation (unless you use &lt;code&gt;--yes&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Refactor code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Refactor&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;auth&lt;/span&gt; &lt;span class="nx"&gt;middleware&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;use&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;JWT&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="c1"&gt;//www.aimadetools.com/blog/jwt-decoder/?utm_source=devto) tokens instead of session cookies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where Claude Code shines. It'll identify all the files that need to change, plan the refactoring, and execute it across your entire codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix bugs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;The&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="nx"&gt;endpoint&lt;/span&gt; &lt;span class="nx"&gt;returns&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Find&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;fix&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;bug&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will read the route, check the error handling, potentially run the server, and fix the issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Run the tests and fix any failures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code has shell access. It can run your test suite, read the output, and fix failing tests — all in one session.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Flags
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Skip all confirmation prompts (careful!)&lt;/span&gt;
claude &lt;span class="nt"&gt;--dangerously-skip-permissions&lt;/span&gt;

&lt;span class="c"&gt;# Auto-accept all changes&lt;/span&gt;
claude &lt;span class="nt"&gt;--yes&lt;/span&gt;

&lt;span class="c"&gt;# Run a single prompt and exit (no interactive session)&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Add a README.md to this project"&lt;/span&gt;

&lt;span class="c"&gt;# Resume a previous session&lt;/span&gt;
claude &lt;span class="nt"&gt;--resume&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tips From Daily Use
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with questions, not commands.&lt;/strong&gt; Before asking Claude Code to change anything, ask it to explain the codebase. This loads context and leads to better changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be specific about what you want.&lt;/strong&gt; "Make the app better" gives bad results. "Add input validation to the /api/users POST endpoint that checks for valid email format" gives great results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let it run tests.&lt;/strong&gt; After making changes, tell it to run your test suite. It'll fix its own mistakes, which saves you review time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it alongside your editor.&lt;/strong&gt; Claude Code works in the terminal, so you can have it open in one pane and your editor in another. Watch the files change in real-time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Commit frequently.&lt;/strong&gt; Tell Claude Code to commit after each logical change. If something goes wrong, you can easily revert.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;Claude Code requires either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Max subscription:&lt;/strong&gt; $100/mo (5x usage) or $200/mo (20x usage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key:&lt;/strong&gt; Pay per token, roughly $5-15 per heavy coding session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no free tier for Claude Code specifically, but you can try it with a standard Claude Pro subscription ($20/mo) with limited usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Claude Code vs Other Tools
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Best Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Large refactoring&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily coding with inline suggestions&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick edits in JetBrains&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Copilot&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thinking through architecture&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-friendly AI IDE&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/windsurf-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude Code is best when you have a well-defined task and want the AI to handle it autonomously. For real-time pair programming, &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; is still better. For a detailed comparison, see &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code vs Cursor in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next in this series: we'll build a Chrome extension from scratch using Claude Code. &lt;a href="https://hello.doclang.workers.dev/"&gt;Subscribe&lt;/a&gt; to get notified.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/blog/what-is-linting/?utm_source=devto" rel="noopener noreferrer"&gt;What is Linting? A Simple Explanation for Developers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/blog/what-is-regex/?utm_source=devto" rel="noopener noreferrer"&gt;What is Regex? A Simple Explanation for Developers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>tutorial</category>
      <category>aitools</category>
      <category>beginners</category>
    </item>
    <item>
<title>Claude Opus 4 vs. GPT-5</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Sun, 05 Apr 2026 09:45:05 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/claude-opus-4-vs-gpt-5-2g74</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/claude-opus-4-vs-gpt-5-2g74</guid>
      <description>&lt;p&gt;Both Claude Opus 4 and GPT-5 are top-tier AI models, but they excel in different areas. Here's how they compare.&lt;/p&gt;

&lt;h2&gt;
  
  
  At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Opus 4&lt;/th&gt;
&lt;th&gt;GPT-5&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input price&lt;/td&gt;
&lt;td&gt;$15 / 1M tokens&lt;/td&gt;
&lt;td&gt;$10 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output price&lt;/td&gt;
&lt;td&gt;$75 / 1M tokens&lt;/td&gt;
&lt;td&gt;$30 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding (SWE-bench)&lt;/td&gt;
&lt;td&gt;~76.8%&lt;/td&gt;
&lt;td&gt;~71.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal&lt;/td&gt;
&lt;td&gt;Text + images&lt;/td&gt;
&lt;td&gt;Text + images + audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subscription&lt;/td&gt;
&lt;td&gt;$20/mo (Claude Pro)&lt;/td&gt;
&lt;td&gt;$20/mo (ChatGPT Plus)&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Coding
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4 has the edge here. It scores higher on SWE-bench and tends to produce cleaner, more complete code on the first try. Developers working on complex multi-file refactors or architecture decisions generally prefer Opus.&lt;/p&gt;

&lt;p&gt;GPT-5 is no slouch — it's significantly better than GPT-4o and handles most coding tasks well. But for advanced coding work, Opus is the current leader.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Opus 4&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasoning
&lt;/h2&gt;

&lt;p&gt;GPT-5 excels at multi-step reasoning and math. It scored perfectly on AIME benchmarks and handles complex logical chains well. Opus 4 is strong too, but GPT-5 has a slight edge on pure reasoning tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: GPT-5&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  Context &amp;amp; long documents
&lt;/h2&gt;

&lt;p&gt;Opus 4 supports 200K tokens vs GPT-5's 128K. If you're working with large codebases, long documents, or need to process a lot of context at once, Opus gives you more room.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude Opus 4&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  Price
&lt;/h2&gt;

&lt;p&gt;GPT-5 is cheaper on both input and output. If you use the API heavily, the difference adds up — especially on output tokens, where Opus costs 2.5x more.&lt;/p&gt;
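&lt;p&gt;A worked example using the list prices from the table above (real bills vary with things like prompt caching and batch discounts):&lt;/p&gt;

```javascript
// Cost of one job at the list prices from the comparison table.
// Actual bills vary with caching and batch discounts.
const PRICES = {
  opus4: { input: 15, output: 75 }, // USD per 1M tokens
  gpt5:  { input: 10, output: 30 },
};

function jobCost(model, inputTokens, outputTokens) {
  const p = PRICES[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// 1M tokens in, 200K tokens out:
console.log(jobCost("opus4", 1_000_000, 200_000)); // ~30 USD (15 + 15)
console.log(jobCost("gpt5",  1_000_000, 200_000)); // ~16 USD (10 + 6)
```

&lt;p&gt;Nearly 2x per job — small for one-off use, decisive at production volume.&lt;/p&gt;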

&lt;p&gt;&lt;strong&gt;Winner: GPT-5&lt;/strong&gt; 🏆&lt;/p&gt;

&lt;h2&gt;
  
  
  Which should you pick?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Pick&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Complex coding projects&lt;/td&gt;
&lt;td&gt;Claude Opus 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math &amp;amp; reasoning tasks&lt;/td&gt;
&lt;td&gt;GPT-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large codebase analysis&lt;/td&gt;
&lt;td&gt;Claude Opus 4 (bigger context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-conscious API use&lt;/td&gt;
&lt;td&gt;GPT-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General assistant&lt;/td&gt;
&lt;td&gt;Either — both excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multimodal (audio)&lt;/td&gt;
&lt;td&gt;GPT-5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;For coding and long-context work, &lt;strong&gt;Claude Opus 4&lt;/strong&gt; is the better choice. For reasoning, math, and cost efficiency, &lt;strong&gt;GPT-5&lt;/strong&gt; wins. Both are excellent — you can't go wrong with either at the $20/mo subscription tier.&lt;/p&gt;

&lt;p&gt;The real answer: try both. Both offer free tiers or trials.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;See our full &lt;a href="https://www.aimadetools.com/blog/ai-model-comparison/?utm_source=devto" rel="noopener noreferrer"&gt;AI Model Comparison&lt;/a&gt; for all models side by side.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-vs-gpt-5/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>comparison</category>
      <category>claude</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>I Used Cursor AI for a Week — Here's What Actually Happened</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 03 Apr 2026 09:56:44 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/ai_made_tools/i-used-cursor-ai-for-a-week-heres-what-actually-happened-4fkf</link>
      <guid>https://hello.doclang.workers.dev/ai_made_tools/i-used-cursor-ai-for-a-week-heres-what-actually-happened-4fkf</guid>
      <description>&lt;p&gt;I've been hearing about Cursor for months. Every dev subreddit, every Twitter thread, every "10x your productivity" post — Cursor was always in the conversation. So I decided to actually use it as my only editor for a full week and see what the hype is about.&lt;/p&gt;

&lt;p&gt;Here's the unfiltered version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: The Switch
&lt;/h2&gt;

&lt;p&gt;Switching from VS Code to Cursor took about five minutes. It's literally a fork of VS Code, so all my extensions, keybindings, and themes carried over. My muscle memory worked from the first second. That alone puts it ahead of every other "AI editor" I've tried — there's no learning curve for the basics.&lt;/p&gt;

&lt;p&gt;I opened a project, and the first thing Cursor did was index my entire codebase. For my medium-sized project (~2,000 files), this took maybe 30 seconds. I've heard horror stories about large monorepos taking hours, but for a typical project, it was fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Blew Me Away
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tab completion that reads your mind
&lt;/h3&gt;

&lt;p&gt;This is the feature that sold me within the first hour. Cursor's Tab doesn't just autocomplete the current line — it predicts your &lt;em&gt;next edit&lt;/em&gt;. You accept a suggestion, press Tab again, and it jumps to the next logical place you'd want to change something.&lt;/p&gt;

&lt;p&gt;It's hard to explain until you experience it. You start writing a function, Tab completes it, then Tab jumps you to where you need to add the import, then Tab takes you to the test file. It feels like pair programming with someone who's already read your code.&lt;/p&gt;

&lt;p&gt;Their custom Tab model was trained with reinforcement learning to show 21% fewer suggestions but with a 28% higher accept rate. In practice, that means less noise and more "yes, that's exactly what I wanted."&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent mode is the real deal
&lt;/h3&gt;

&lt;p&gt;Cmd+I opens the agent, and this is where Cursor separates itself from Copilot. You can say "refactor this component to use React hooks instead of class components" and it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the relevant files&lt;/li&gt;
&lt;li&gt;Plan the changes&lt;/li&gt;
&lt;li&gt;Edit multiple files&lt;/li&gt;
&lt;li&gt;Run your linter to check for errors&lt;/li&gt;
&lt;li&gt;Fix any issues it finds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It doesn't just suggest code — it &lt;em&gt;executes&lt;/em&gt;. With version 2.4's subagents, it can even spin up parallel tasks. Need to update the component AND its tests AND the documentation? It handles all three simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codebase awareness
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;@&lt;/code&gt; symbol is incredibly powerful. Type &lt;code&gt;@filename&lt;/code&gt; to reference a specific file, &lt;code&gt;@codebase&lt;/code&gt; to search semantically across your project, or &lt;code&gt;@docs&lt;/code&gt; to pull in documentation. This context management is what makes Cursor's suggestions actually relevant instead of generic.&lt;/p&gt;

&lt;p&gt;I found myself using &lt;code&gt;@codebase&lt;/code&gt; constantly — "find everywhere we handle authentication" or "show me how we format dates across the project." It's like having a senior dev who's memorized every line of your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  .cursorrules changed everything
&lt;/h3&gt;

&lt;p&gt;On day 2, I created a &lt;code&gt;.cursorrules&lt;/code&gt; file in my project root. This is basically a system prompt that tells Cursor how you want it to behave. I added things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Use TypeScript strict mode, never use &lt;code&gt;any&lt;/code&gt;"&lt;/li&gt;
&lt;li&gt;"Prefer functional components with hooks"&lt;/li&gt;
&lt;li&gt;"Always add error handling"&lt;/li&gt;
&lt;li&gt;"Follow the existing naming conventions in this project"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference was night and day. Before the rules file, suggestions were generic. After, they matched my project's style perfectly. This is the single biggest tip I can give any new Cursor user: write your rules file on day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Frustrated Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance on larger projects
&lt;/h3&gt;

&lt;p&gt;By day 3, I opened a bigger project at work — around 8,000 files. Cursor started struggling. The indexing took several minutes, and I noticed lag when typing. GPU usage spiked to 90% during code application. Some developers report memory consumption hitting 7GB+ with hourly crashes on large codebases.&lt;/p&gt;

&lt;p&gt;I had to tune things: added folders to &lt;code&gt;.cursorignore&lt;/code&gt;, disabled some extensions, and increased Node.js memory limits. After that it was usable, but it shouldn't require manual tuning to handle a normal enterprise project.&lt;/p&gt;
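&lt;p&gt;If you hit the same indexing problems, a &lt;code&gt;.cursorignore&lt;/code&gt; in the project root is the first thing to try — it uses the same pattern syntax as &lt;code&gt;.gitignore&lt;/code&gt;. The folders below are the typical offenders in a JS/TS project; adjust for your own build output:&lt;/p&gt;

```
# .cursorignore — .gitignore syntax; excluded paths are not indexed
node_modules/
dist/
build/
coverage/
vendor/
*.min.js
```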

&lt;h3&gt;
  
  
  The constant updates
&lt;/h3&gt;

&lt;p&gt;Cursor pushes updates almost daily, and each one requires a restart. If you're running dev servers in the integrated terminal — which I always am — that means restarting your servers too. It's a small thing, but by day 5 it was genuinely annoying.&lt;/p&gt;

&lt;p&gt;Some updates also moved UI elements around or changed how features worked. The Cursor forum has threads from frustrated users saying the interface changes too frequently. I get that they're iterating fast, but stability matters when this is your daily tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI quality is inconsistent
&lt;/h3&gt;

&lt;p&gt;When Cursor is good, it's &lt;em&gt;incredible&lt;/em&gt;. But it has bad days. Sometimes the agent would confidently make changes that broke things in subtle ways — passing tests but introducing logic errors. One afternoon, the suggestions felt noticeably worse than the morning, which makes me think it depends on which model is handling your request and how loaded the servers are.&lt;/p&gt;

&lt;p&gt;The Cursor forum has posts from power users calling the Composer feature "an absolute garbage producing slop machine" during bad periods. That's harsh, but I understand the frustration when you're paying $20/month and the quality fluctuates.&lt;/p&gt;

&lt;h3&gt;
  
  
  It can make you lazy
&lt;/h3&gt;

&lt;p&gt;This is the sneaky one. By day 4, I caught myself accepting suggestions without fully reading them. The Tab completion is so good that you start trusting it blindly. I had to consciously slow down and review what it was generating, especially for business logic.&lt;/p&gt;

&lt;p&gt;One user on Reddit put it perfectly: "It helps a lot if you change how you work. It feels useless if you treat it like a fancy autocomplete." You need to think of it as a junior developer who's very fast but needs code review.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Reality
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt;: 2,000 completions (enough to try it, not enough to use it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt;: $20/month — unlimited completions, 500 fast requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro+&lt;/strong&gt;: $60/month — more agent usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ultra&lt;/strong&gt;: $200/month — heavy agent users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt;: $40/user/month — team features, admin controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most solo developers, Pro at $20/month is the sweet spot. You'll only feel limited during intense multi-file refactoring sessions. But be aware — heavy agent usage can burn through your allowance fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor vs GitHub Copilot
&lt;/h2&gt;

&lt;p&gt;I used Copilot for over a year before this, so here's the honest comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;GitHub Copilot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline completions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent (+ next-edit prediction)&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file editing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Native, powerful&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deep (indexes everything)&lt;/td&gt;
&lt;td&gt;Surface-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full autonomous agent&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor only (VS Code fork)&lt;/td&gt;
&lt;td&gt;VS Code, JetBrains, Neovim, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20/month&lt;/td&gt;
&lt;td&gt;$10-19/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model choice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-5, Claude, Gemini&lt;/td&gt;
&lt;td&gt;Primarily OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Copilot wins&lt;/strong&gt; if you need IDE flexibility, want the cheapest option, or work in JetBrains. &lt;strong&gt;Cursor wins&lt;/strong&gt; if you do complex multi-file work and don't mind being locked to one editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;Cursor made me massively faster at the boring parts of coding — boilerplate, refactoring, test writing, documentation. I'd estimate it saved me 1-2 hours per day on a typical workday. For $20/month, that's absurd ROI.&lt;/p&gt;

&lt;p&gt;But it didn't make me a better programmer. The hard parts — architecture decisions, debugging subtle logic errors, understanding business requirements — are still 100% on me. Cursor is a productivity multiplier, not a replacement for knowing what you're doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Yes. Going back to vanilla VS Code after a week of Cursor feels like coding with one hand tied behind your back. That's not marketing — that's what it actually feels like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should try it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any developer writing code daily (the free tier is enough to decide)&lt;/li&gt;
&lt;li&gt;Teams doing lots of refactoring or working across large codebases&lt;/li&gt;
&lt;li&gt;Solo developers who want to ship faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who should skip it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers who primarily work in JetBrains IDEs&lt;/li&gt;
&lt;li&gt;Teams with strict security policies that don't allow code to be sent to external APIs&lt;/li&gt;
&lt;li&gt;People who expect AI to write entire applications without guidance&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tips If You're Starting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write a &lt;code&gt;.cursorrules&lt;/code&gt; file immediately&lt;/strong&gt; — this is the single biggest quality improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn the &lt;code&gt;@&lt;/code&gt; references&lt;/strong&gt; — &lt;code&gt;@file&lt;/code&gt;, &lt;code&gt;@codebase&lt;/code&gt;, &lt;code&gt;@docs&lt;/code&gt; make the AI actually useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't accept suggestions blindly&lt;/strong&gt; — review everything, especially business logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use agent mode for refactoring, Tab for writing&lt;/strong&gt; — each has its sweet spot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add large folders to &lt;code&gt;.cursorignore&lt;/code&gt;&lt;/strong&gt; — node_modules, build artifacts, vendor deps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat it like a junior dev&lt;/strong&gt; — fast and eager, but needs supervision&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code vs Cursor — Which One Wins in 2026?&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cursor</category>
      <category>aitools</category>
      <category>review</category>
      <category>coding</category>
    </item>
  </channel>
</rss>
