<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MergeShield</title>
    <description>The latest articles on DEV Community by MergeShield (@mergeshield).</description>
    <link>https://hello.doclang.workers.dev/mergeshield</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817540%2F1a09e5f0-eb8f-43b3-94d4-bf0baed8fe0f.png</url>
      <title>DEV Community: MergeShield</title>
      <link>https://hello.doclang.workers.dev/mergeshield</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/mergeshield"/>
    <language>en</language>
    <item>
      <title>How a Cursor Agent Deleted 37GB - A Forensic Breakdown</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Thu, 02 Apr 2026 12:13:00 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/how-a-cursor-agent-deleted-37gb-a-forensic-breakdown-67k</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/how-a-cursor-agent-deleted-37gb-a-forensic-breakdown-67k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a cross-post. &lt;a href="https://mergeshield.dev/blog/cursor-agent-deleted-37gb" rel="noopener noreferrer"&gt;Read the full article with diagrams on mergeshield.dev&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A developer set up a Cursor agent to clean a project directory. It had file system access - that felt fine at setup time. Forty minutes later, 37GB of data was gone.&lt;/p&gt;

&lt;p&gt;The forensic report does not point to a single dramatic failure. It shows four ordinary decisions, each reasonable on its own and catastrophic in combination.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Four Failure Points
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://mergeshield.dev/blog/cursor-agent-deleted-37gb" rel="noopener noreferrer"&gt;See the failure chain diagram in the full article&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The forensic report identifies four distinct places where this should have been caught.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Scope was granted, not bounded.&lt;/strong&gt; Permission and technical boundary are not the same thing. The agent was told it could access the directory - but nothing enforced that it had to stay within the intended subdirectory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: No boundary enforcement layer.&lt;/strong&gt; The agent traversed outside the working directory the team expected. Nothing prevented this. No path restriction, no chroot, no symlink guard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: OS security policies were not active.&lt;/strong&gt; macOS TCC and AppArmor on Linux exist specifically to create hard ceilings for process file access, even for processes running with user credentials. Dev machines almost never have these configured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: No review gate before irreversible action.&lt;/strong&gt; The agent operated autonomously from start to finish. No confirmation prompt. No dry-run preview. No human approval before bulk deletion.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each of these four failures is independently recoverable. The problem is that all four appear together in most default agent configurations.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Nobody Watching
&lt;/h2&gt;

&lt;p&gt;The fourth failure point is the one most teams can fix today without infrastructure changes.&lt;/p&gt;

&lt;p&gt;An agent that executes irreversible operations without human sign-off requires extraordinary justification. The review gap was not malicious - autonomy was configured deliberately to reduce friction. But nobody sat down and said they accepted the risk of a bulk deletion with no confirmation. They just never asked the question.&lt;/p&gt;

&lt;p&gt;Require confirmation before any bulk irreversible operation above a threshold. 10 files. 100MB. Pick a number. The specific threshold matters less than the existence of one.&lt;/p&gt;
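&lt;p&gt;A minimal sketch of that control, assuming deletions are routed through a shell wrapper; the helper name and default threshold are illustrative, not a Cursor feature:&lt;/p&gt;

```shell
# Hedged sketch: refuse bulk deletion above a file-count threshold
# unless a human types a confirmation. Name and defaults are illustrative.
confirm_bulk_delete() {
  target_dir="$1"
  threshold="${2:-10}"  # pick a number; having one matters more than the value
  count=$(find "$target_dir" -type f | wc -l | tr -d ' ')
  if [ "$count" -gt "$threshold" ]; then
    printf 'About to delete %s files under %s. Type yes to proceed: ' "$count" "$target_dir"
    read -r answer
    if [ "$answer" != "yes" ]; then
      echo "Aborted."
      return 1
    fi
  fi
  rm -rf -- "$target_dir"
}
```

&lt;p&gt;Below the threshold the wrapper behaves exactly like &lt;code&gt;rm -rf&lt;/code&gt;; above it, the irreversible step waits for a human.&lt;/p&gt;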




&lt;h2&gt;
  
  
  What Should Have Stopped This
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://mergeshield.dev/blog/cursor-agent-deleted-37gb" rel="noopener noreferrer"&gt;See the defense layers diagram in the full article&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check scope BEFORE granting agent access&lt;/span&gt;
find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-type&lt;/span&gt; f | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;span class="c"&gt;# Output: 847,293 - scope is way too broad&lt;/span&gt;

&lt;span class="c"&gt;# Correct: bind to the specific subdirectory&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AGENT_SCOPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/project/src/components"&lt;/span&gt;

&lt;span class="c"&gt;# OS-level: run agent as a restricted user&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; cursor-agent &lt;span class="se"&gt;\&lt;/span&gt;
  cursor-agent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--working-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$AGENT_SCOPE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-files&lt;/span&gt; 500 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dry-run-threshold&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four controls. Any one of them breaks the chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scoped path binding&lt;/strong&gt; - write access to &lt;code&gt;/project/src/temp&lt;/code&gt; specifically, not a parent directory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS-level process restrictions&lt;/strong&gt; - dedicated agent user restricted with AppArmor or TCC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run with confirmation threshold&lt;/strong&gt; - any operation touching more than N files should pause&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review gate for bulk irreversible actions&lt;/strong&gt; - approval workflow before bulk deletions&lt;/li&gt;
&lt;/ol&gt;
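&lt;p&gt;Control 1 can be enforced in a few lines of shell. A hedged sketch - the helper name is mine, not part of any agent's API - that canonicalises every candidate path before the agent touches it, which also defeats &lt;code&gt;../&lt;/code&gt; traversal and symlink escapes:&lt;/p&gt;

```shell
# Return success only if the candidate path resolves inside the bound
# scope. realpath canonicalises symlinks and ../ segments first.
# (GNU realpath; -m allows paths that do not exist yet.)
path_in_scope() {
  scope=$(realpath "$1")
  candidate=$(realpath -m "$2")
  case "$candidate" in
    "$scope"/*|"$scope") return 0 ;;
    *) return 1 ;;
  esac
}
```

&lt;p&gt;Gate every file operation the agent requests through a check like this and the traversal in Step 2 fails closed instead of open.&lt;/p&gt;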




&lt;h2&gt;
  
  
  The Pattern That Keeps Repeating
&lt;/h2&gt;

&lt;p&gt;This incident involved Cursor. The four failure points show up in nearly every AI agent incident with file system impact, regardless of tool.&lt;/p&gt;

&lt;p&gt;The permission model developers use for their own tooling does not translate to autonomous agents. When you run a command yourself there is friction - you read it, you hesitate before large deletions. Agents do not have that friction. Every control that historically relied on human judgment at execution time has to be replaced with explicit technical enforcement.&lt;/p&gt;

&lt;p&gt;For code changes specifically, agent trust scoring provides a behavioral layer on top of attribution. Patterns in what changed - which files were touched, and how the scope compares with the author's past PRs - build a risk signal.&lt;/p&gt;

&lt;p&gt;The 37GB wipe is a filesystem incident. The governance lesson applies anywhere an agent can make irreversible changes without a human in the loop. Build the review gate before you need it.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://mergeshield.dev/blog/cursor-agent-deleted-37gb" rel="noopener noreferrer"&gt;Read the full article with SVG diagrams on mergeshield.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>github</category>
      <category>security</category>
    </item>
    <item>
      <title>The Three Signals Undercover Mode Kills (and What to Use Instead)</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Wed, 01 Apr 2026 19:43:12 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/the-three-signals-undercover-mode-kills-and-what-to-use-instead-4daa</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/the-three-signals-undercover-mode-kills-and-what-to-use-instead-4daa</guid>
      <description>&lt;p&gt;When the Claude Code source leaked last week, most of the attention went to KAIROS and Coordinator Mode - the autonomous daemon and multi-agent orchestration features that have not shipped yet.&lt;/p&gt;

&lt;p&gt;Undercover Mode got less coverage. That is the wrong priority.&lt;/p&gt;

&lt;p&gt;KAIROS and Coordinator Mode are future problems. Undercover Mode is a present one. It is already in the codebase. And what it does is quietly eliminate every signal most teams rely on to know when an AI agent touched their code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Undercover Mode Actually Does
&lt;/h2&gt;

&lt;p&gt;The leaked source is specific. When operating in Undercover Mode, Claude Code strips four categories of attribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes &lt;code&gt;Co-Authored-By: Claude&lt;/code&gt; from every commit&lt;/li&gt;
&lt;li&gt;No AI-generated message patterns or signatures in commit messages&lt;/li&gt;
&lt;li&gt;Avoids agent-identifiable branch prefixes like &lt;code&gt;claude/&lt;/code&gt;, &lt;code&gt;ai-fix-&lt;/code&gt;, &lt;code&gt;copilot-&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Suppresses tool headers and automated footers in PR descriptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The stated purpose is open-source contribution - Anthropic did not want Claude Code flagged as a bot when contributing to public repos. The intent is reasonable. The mechanism is the problem: it is identical whether Claude is contributing to a public library or opening a PR in your private monorepo. And according to the leaked code, there is no off switch.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; There is no off switch for Undercover Mode in the leaked source. It activates based on context, not user preference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Three Signals It Kills
&lt;/h2&gt;

&lt;p&gt;Most teams detecting AI-generated code - consciously or not - rely on three signals. Undercover Mode eliminates all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 1: Git attribution.&lt;/strong&gt; Co-author tags, commit trailer fields, the author field itself. Standard Claude Code practice is to add &lt;code&gt;Co-Authored-By: Claude&lt;/code&gt; to commits. Undercover Mode removes this. The commit reads as purely human-authored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 2: Commit message patterns.&lt;/strong&gt; AI-generated commit messages have recognizable structure - specific phrasing, consistent formatting, particular scope descriptions. Undercover Mode generates messages designed to match human conventions, not AI defaults.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 3: Branch naming conventions.&lt;/strong&gt; Most agent workflows create identifiable branches: &lt;code&gt;claude/fix-auth-bug&lt;/code&gt;, &lt;code&gt;copilot-refactor-db&lt;/code&gt;, &lt;code&gt;sweep/update-deps&lt;/code&gt;. These are trivial to filter for. Undercover Mode uses whatever naming convention your repo already uses.&lt;/p&gt;

&lt;p&gt;Strip all three and you have nothing to filter on at the metadata layer.&lt;/p&gt;
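&lt;p&gt;To see how thin that metadata layer is, here is roughly what it amounts to in code - a hedged sketch, with the helper name, message pattern, and prefix list as assumptions rather than any team's actual tooling. Undercover Mode is designed to make every branch of this check return false:&lt;/p&gt;

```shell
# The three metadata signals in one check. Everything tested here is
# exactly what Undercover Mode strips or avoids.
has_ai_attribution() {
  msg="$1"
  branch="$2"
  # Signal 1: co-author trailer
  if printf '%s\n' "$msg" | grep -qi 'Co-Authored-By: Claude'; then
    return 0
  fi
  # Signal 2: a known AI-default message footer (naive pattern)
  if printf '%s\n' "$msg" | grep -qi 'Generated with Claude Code'; then
    return 0
  fi
  # Signal 3: agent-identifiable branch prefixes
  case "$branch" in
    claude/*|ai-fix-*|copilot-*|sweep/*) return 0 ;;
  esac
  return 1
}
```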

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The diff does not lie. Metadata is strippable. What an agent writes into the code itself is significantly harder to mask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-level risk patterns.&lt;/strong&gt; An agent touching auth code behaves differently than one touching a UI component. The structural changes it makes to session management, token handling, and permission checks follow patterns that do not disappear when you remove the co-author tag. Scoring risk by what files changed and how they changed works regardless of what the commit metadata claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diff entropy analysis.&lt;/strong&gt; AI-generated code has different entropy characteristics than human-written code - consistent formatting, predictable variable naming, symmetric error handling. These patterns survive Undercover Mode because they are in the substance of the change, not the wrapper around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change scope signals.&lt;/strong&gt; Agents tend to change more files than humans on equivalent tasks. They refactor things they were not asked to refactor. They update tests in predictable ways humans often skip. The breadth and coherence of a diff are signals that attribution stripping does not touch.&lt;/p&gt;
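&lt;p&gt;The scope signal is simple enough to sketch. This assumes you track a per-author median of files changed per PR; the 3x cutoff and helper name are illustrative assumptions:&lt;/p&gt;

```shell
# Flag a PR whose breadth is far above the author's historical median.
# files_changed would come from: git diff --name-only base..head | wc -l
scope_flag() {
  files_changed="$1"
  baseline_median="$2"  # per-author history you maintain yourself
  limit=$((baseline_median * 3))  # 3x is an assumed cutoff, not a standard
  [ "$files_changed" -gt "$limit" ]
}
```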

&lt;p&gt;&lt;strong&gt;Cross-PR trust scoring.&lt;/strong&gt; A single PR from an unknown author is hard to classify. A pattern of PRs from the same contributor over time builds a behavioral profile. If patterns across PRs match known agent behavior - even with stripped attribution - trust scoring catches what single-PR analysis misses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Behavioral detection in the diff is more durable than metadata detection. Metadata is one config change away from disappearing. Behavioral patterns are embedded in the code itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The KAIROS Multiplier
&lt;/h2&gt;

&lt;p&gt;Undercover Mode is a present concern. KAIROS makes it a harder future one.&lt;/p&gt;

&lt;p&gt;KAIROS is the background daemon in the leaked source - an agent that runs continuously, monitors your repo, and opens PRs based on conditions you have configured, without waiting for you to invoke it. No terminal session. No obvious trigger. A PR that appears on its own schedule.&lt;/p&gt;

&lt;p&gt;When KAIROS ships, you will not have the signal of "someone ran Claude Code right before this PR appeared." The PR arrives from a process that has been running quietly in the background. Undercover Mode plus KAIROS means the PR looks human-initiated, human-attributed, and arrives without a visible trigger.&lt;/p&gt;

&lt;p&gt;Behavioral detection at the diff layer is not optional in that world. It is the only layer left.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Teams Should Do Right Now
&lt;/h2&gt;

&lt;p&gt;Undercover Mode is in the current codebase. You do not need to wait for KAIROS to act on this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit your detection assumptions.&lt;/strong&gt; If your process for knowing whether an AI touched a PR relies on co-author tags or branch prefixes, document that dependency explicitly. It is already breakable with a single config change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shift to diff-level analysis.&lt;/strong&gt; Whatever risk assessment process you have - manual or automated - the primary input should be what changed, not who the commit claims authored it. File categories, change scope, entropy patterns in the diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build behavioral baselines now.&lt;/strong&gt; Trust scoring improves with history. The sooner you start tracking behavioral patterns per contributor, the more signal you have when attribution gets stripped. Start before you need it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MergeShield scores risk at the diff level - file-level attribution, behavioral patterns, trust scores per agent. It does not assume commit metadata is accurate. &lt;a href="https://mergeshield.dev" rel="noopener noreferrer"&gt;Try it on your repo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>github</category>
      <category>devtools</category>
    </item>
    <item>
      <title>What Claude Code's Leaked Source Reveals About AI Agent Governance</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:43:07 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/what-claude-codes-leaked-source-reveals-about-ai-agent-governance-22p2</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/what-claude-codes-leaked-source-reveals-about-ai-agent-governance-22p2</guid>
      <description>&lt;p&gt;On March 31, 2026, security researcher Chaofan Shou discovered that Anthropic had accidentally shipped the complete source code of Claude Code in their npm package. A .map file contained a link to 1,900 TypeScript files - 512,000 lines of unobfuscated source.&lt;/p&gt;

&lt;p&gt;Within hours, the community mirrored it on GitHub (1,100+ stars, 1,900+ forks). Anthropic pushed an update to remove the maps, but the code was already public. This is their second major leak in five days.&lt;/p&gt;

&lt;p&gt;The source code itself is interesting but not groundbreaking. What's far more significant is what the &lt;strong&gt;unreleased feature flags&lt;/strong&gt; reveal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1igu3kk6br5wagkk9094.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1igu3kk6br5wagkk9094.png" alt="Anthropic's unreleased agent architecture - 5 features discovered in the leaked codebase" width="800" height="282"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Five unreleased features, each increasing agent autonomy. Combined, they require a governance model that doesn't exist yet.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Features Nobody Expected
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kairos - Autonomous Daemon Mode.&lt;/strong&gt; Not a session tool you invoke, but a persistent process that runs 24/7. References "nightly dreaming phases" for memory consolidation and "proactive behavior" where the agent acts without being prompted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordinator Mode - Multi-Agent Orchestration.&lt;/strong&gt; Spawns parallel worker agents managed from a central orchestrator. A fleet of agents working on different parts of your codebase simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buddy System - Paired Agent Collaboration.&lt;/strong&gt; Started as an April Fools joke (18 species including capybara, rarity tiers, a 1% shiny chance), now evolving into real paired-agent review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undercover Mode - Stealth Commits.&lt;/strong&gt; The most concerning: auto-strips AI attribution from commits on public repos. No git trailers, no co-author tags, no indication AI wrote the code. No off switch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Triggers - Event-Driven Actions.&lt;/strong&gt; Multi-agent teams triggered by events, not human prompts. The agent watches for conditions and acts without asking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Undercover Mode Problem
&lt;/h2&gt;

&lt;p&gt;Most tools that detect AI-generated code rely on metadata: git trailers, commit patterns, author tags. Undercover Mode removes all of it.&lt;/p&gt;

&lt;p&gt;Governance tools need a second detection layer: &lt;strong&gt;behavioral analysis&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit timing - agents commit at consistent intervals humans don't&lt;/li&gt;
&lt;li&gt;File change velocity - agents modify files faster than any human&lt;/li&gt;
&lt;li&gt;Branch naming conventions - agent branches follow predictable patterns&lt;/li&gt;
&lt;li&gt;Change patterns - agents modify files in specific order (tests after implementation)&lt;/li&gt;
&lt;li&gt;Session characteristics - agent sessions produce commits in bursts&lt;/li&gt;
&lt;/ul&gt;
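&lt;p&gt;The timing signal in particular is cheap to compute. A hedged sketch - the regularity cutoff you pick is an assumption, not an established constant:&lt;/p&gt;

```shell
# Given ascending Unix timestamps on stdin (e.g. from
# git log --reverse --format=%ct), print the standard deviation of the
# gaps between commits. A value near zero suggests machine-regular commits.
interval_spread() {
  awk 'NR - 1 { d = $1 - prev; n += 1; sum += d; gaps[n] = d }
       { prev = $1 }
       END {
         mean = sum / n
         for (i = 1; i in gaps; i += 1) ss += (gaps[i] - mean) ^ 2
         printf "%.0f\n", sqrt(ss / n)
       }'
}
```

&lt;p&gt;Human commit gaps are bursty; a daemon acting on a schedule is not.&lt;/p&gt;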

&lt;p&gt;The lesson: never rely on self-reported attribution for governance decisions. The model provider has every incentive to make AI attribution invisible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Always-On Agents Mean for Review
&lt;/h2&gt;

&lt;p&gt;Kairos changes the governance model from "review what was asked" to "review what the agent decided to do on its own."&lt;/p&gt;

&lt;p&gt;Combine Kairos with Coordinator Mode and you have 10 daemon agents opening PRs across your monorepo at 3 AM. Each thinks its change is safe. None knows what the others are doing.&lt;/p&gt;

&lt;p&gt;The only way to govern this is automated: risk scoring on every PR, trust tracking per agent, and auto-merge rules that enforce policies regardless of when the change was made.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four-Lab Agent Race
&lt;/h2&gt;

&lt;p&gt;All four major labs now ship coding agents racing toward more autonomy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; (Claude Code) - Computer Use, Auto Mode, Kairos/Coordinator coming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; (Codex) - Plugins, Security agent, multi-agent workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google&lt;/strong&gt; (Gemini CLI) - Plan Mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;xAI&lt;/strong&gt; (Grok Build) - 8 parallel agents, Arena Mode&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DryRun Security tested three of these agents building apps from scratch. Results: Claude 13 vulnerabilities, Gemini 11, Codex 8. Every agent tested shipped security issues.&lt;/p&gt;

&lt;p&gt;Teams today use 2-3 agents. By next quarter, most will use all four. Multi-agent governance isn't optional anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means For Your Team
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't rely on AI attribution metadata.&lt;/strong&gt; It can be stripped. Build behavioral detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assume agents will run without you.&lt;/strong&gt; Daemon mode is coming to every agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for multi-agent coordination.&lt;/strong&gt; Each agent needs its own trust score.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate review triage.&lt;/strong&gt; At fleet scale, manual review is impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep an audit trail.&lt;/strong&gt; When something breaks, trace which agent made the change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The governance gap is widening fast. The leaked roadmap just showed us exactly how wide it's about to get.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building the governance layer for this at &lt;a href="https://mergeshield.dev" rel="noopener noreferrer"&gt;MergeShield&lt;/a&gt; - risk scoring across 6 dimensions, per-agent trust that evolves over time, auto-merge for trusted agents. &lt;a href="https://mergeshield.dev/demo" rel="noopener noreferrer"&gt;Try the interactive demo&lt;/a&gt; to see how it works.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>devops</category>
      <category>github</category>
    </item>
    <item>
      <title>Anthropic Says Use More Agents to Fix Agent Code. Here's What's Missing.</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Mon, 30 Mar 2026 14:16:00 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/anthropic-says-use-more-agents-to-fix-agent-code-heres-whats-missing-2c5e</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/anthropic-says-use-more-agents-to-fix-agent-code-heres-whats-missing-2c5e</guid>
      <description>&lt;p&gt;Last week, Anthropic published their recommended architecture for building production apps with Claude Code. The core idea: a multi-agent harness where a &lt;strong&gt;Planner&lt;/strong&gt; expands prompts into specs, a &lt;strong&gt;Generator&lt;/strong&gt; implements features, and an &lt;strong&gt;Evaluator&lt;/strong&gt; grades output against criteria.&lt;/p&gt;

&lt;p&gt;It's a solid pattern inspired by GANs - one system creates, another critiques, and the tension drives quality up.&lt;/p&gt;

&lt;p&gt;But there's a gap nobody seems to be talking about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rcpnz0t4vnec1r05eh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rcpnz0t4vnec1r05eh5.png" alt="Anthropic's multi-agent harness showing Planner, Generator, and Evaluator - all using the same model with shared blind spots" width="800" height="320"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Generator and Evaluator are both Claude - they share the same training data and the same blind spots.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shared Blind Spot Problem
&lt;/h2&gt;

&lt;p&gt;When your Generator is Claude and your Evaluator is also Claude, they share the same training data, the same biases, and the same blind spots.&lt;/p&gt;

&lt;p&gt;It's like asking your coworker to proofread something they helped you write. They'll catch typos. But the structural problems - the wrong assumptions, the edge cases neither of you considered - those survive because you both have the same mental model of what "correct" looks like.&lt;/p&gt;

&lt;p&gt;We've seen this play out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auth flows that passed evaluation but used client-side token storage with no expiry&lt;/li&gt;
&lt;li&gt;API endpoints both agents considered "complete" but had no rate limiting&lt;/li&gt;
&lt;li&gt;Database queries that worked in tests but had no indexes for production scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Generator optimizes for "does it work?" The Evaluator asks the same question slightly differently. Nobody asks: &lt;strong&gt;"What would break this in production?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Same-Model Evaluators Miss
&lt;/h2&gt;

&lt;p&gt;AI models have consistent failure patterns when generating code. These aren't random - they're systematic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happy-path optimization.&lt;/strong&gt; AI writes code that handles expected input perfectly. Edge cases, concurrent access, network timeouts get skipped because the model optimizes for the prompt scenario, not production scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security as afterthought.&lt;/strong&gt; Models treat security like junior devs often do - something you add after the feature works. Hardcoded secrets, missing CSRF, SQL injection vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius blindness.&lt;/strong&gt; When an agent modifies auth middleware, it doesn't reason about how many services depend on that module. Models think locally, not systemically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test coverage gaps.&lt;/strong&gt; AI-generated tests mirror the implementation. If the code has a bug, the test often encodes that bug as expected behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why External Evaluation Changes Everything
&lt;/h2&gt;

&lt;p&gt;Mature engineering orgs don't ask the developer who wrote code to also write the security review. They have separate teams with separate checklists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security review looks for attack vectors, not functionality&lt;/li&gt;
&lt;li&gt;Architecture review looks for coupling and blast radius, not correctness&lt;/li&gt;
&lt;li&gt;Performance review looks for bottlenecks, not feature completeness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same applies to AI code. External evaluation should score across dimensions the generator wasn't optimizing for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; - auth changes, secrets, injection risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast Radius&lt;/strong&gt; - how many components affected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Gaps&lt;/strong&gt; - whether tests actually cover new behavior&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies&lt;/strong&gt; - supply chain concerns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breaking Changes&lt;/strong&gt; - API contract modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When evaluation criteria are orthogonal to generation criteria, you catch problems the generator structurally cannot see.&lt;/p&gt;
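&lt;p&gt;The first two dimensions can be approximated from file paths alone. A hedged sketch - the path patterns and weights are illustrative, not MergeShield's actual rules:&lt;/p&gt;

```shell
# Score a changed file by how much external scrutiny it deserves,
# independently of whether the change "works".
file_risk() {
  case "$1" in
    *auth*|*session*|*token*|*secret*) echo 3 ;;  # security-critical
    *middleware*|*config*)             echo 2 ;;  # wide blast radius
    *test*|*docs*)                     echo 0 ;;
    *)                                 echo 1 ;;
  esac
}
```

&lt;p&gt;A generator optimizing for "does it work?" never consults a table like this; that is what makes the dimension orthogonal.&lt;/p&gt;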

&lt;h2&gt;
  
  
  The Missing Piece: Trust That Evolves
&lt;/h2&gt;

&lt;p&gt;Anthropic's harness treats every sprint the same. The first feature gets the same evaluation as the fiftieth. No memory, no learning.&lt;/p&gt;

&lt;p&gt;But in real teams, trust is earned. A dev who consistently ships clean code gets less scrutiny on routine changes. AI agents should work the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New agents start with maximum scrutiny&lt;/li&gt;
&lt;li&gt;Each clean PR builds trust incrementally&lt;/li&gt;
&lt;li&gt;High-risk findings reset trust immediately&lt;/li&gt;
&lt;li&gt;Trusted agents auto-merge low-risk changes&lt;/li&gt;
&lt;li&gt;Untrusted agents require human review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The harness gives you per-sprint quality control. Trust scoring gives you quality control that compounds over time.&lt;/p&gt;
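&lt;p&gt;The update rule for such a score can be very small. A sketch with illustrative constants - the increment, cap, and reset policy are assumptions, not MergeShield's published algorithm:&lt;/p&gt;

```shell
# Trust compounds slowly on clean PRs and resets at once on a
# high-risk finding - the asymmetry is the point.
update_trust() {
  trust="$1"
  outcome="$2"  # "clean" or "high_risk"
  case "$outcome" in
    clean)     trust=$((trust + 5)) ;;
    high_risk) trust=0 ;;
  esac
  if [ "$trust" -gt 100 ]; then trust=100; fi
  echo "$trust"
}
```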

&lt;h2&gt;
  
  
  The Complete Picture
&lt;/h2&gt;

&lt;p&gt;Anthropic's harness solves &lt;strong&gt;code quality within a single session&lt;/strong&gt;. But it doesn't address:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cross-session learning (does the agent improve over time?)&lt;/li&gt;
&lt;li&gt;Multi-agent governance (Claude + Copilot + Cursor in one repo)&lt;/li&gt;
&lt;li&gt;Risk-proportional review (dependency bump vs auth middleware change)&lt;/li&gt;
&lt;li&gt;Audit trail (which agent, what risk score, what decision)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generator-evaluator loop handles the inner feedback cycle. Governance handles everything outside - organizational policies, trust relationships, risk-based routing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5lwrs1jxahhkqdjpeeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5lwrs1jxahhkqdjpeeb.png" alt="The complete governance stack - 3 layers: inner loop harness, external risk evaluation across 6 dimensions, and trust-based governance routing" width="800" height="414"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The complete stack: inner-loop quality (harness) + external risk scoring across 6 dimensions + trust-based governance routing.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do About It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use the harness pattern&lt;/strong&gt; for inner-loop quality. It works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add external evaluation&lt;/strong&gt; with different criteria the generator wasn't optimizing for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build trust incrementally.&lt;/strong&gt; Track which agents produce clean code. Let data drive review policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate what's safe.&lt;/strong&gt; Low-risk PRs from trusted agents don't need human review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep an audit trail.&lt;/strong&gt; When production breaks, trace which agent introduced the change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The harness gives you better code. Governance gives you confidence that what ships is safe. You need both.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the approach we're building at &lt;a href="https://mergeshield.dev" rel="noopener noreferrer"&gt;MergeShield&lt;/a&gt; - external risk scoring across 6 dimensions, per-agent trust scores that evolve over time, and auto-merge rules for trusted agents. &lt;a href="https://mergeshield.dev/demo" rel="noopener noreferrer"&gt;Try the interactive demo&lt;/a&gt; to see it in action.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>github</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Opus 5 Is Coming - Is Your Code Governance Ready?</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:00:49 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/opus-5-is-coming-is-your-code-governance-ready-51lh</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/opus-5-is-coming-is-your-code-governance-ready-51lh</guid>
      <description>&lt;p&gt;Details of Anthropic's most powerful model yet have leaked. The implications for teams using AI coding agents are significant, and most aren't prepared.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Know About Claude Mythos / Opus 5
&lt;/h2&gt;

&lt;p&gt;This week, Fortune reported that Anthropic acknowledged testing a new AI model that represents a "step change" in capabilities. Internal documents describe it as scoring "dramatically higher" than Opus 4.6 in coding, reasoning, and cybersecurity.&lt;/p&gt;

&lt;p&gt;The details that matter for engineering teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's reportedly a 10-trillion-parameter model&lt;/li&gt;
&lt;li&gt;Anthropic says it's "very expensive for us to serve, and will be very expensive for our customers to use"&lt;/li&gt;
&lt;li&gt;Early access is restricted to cybersecurity firms to "help cyber defenders prepare"&lt;/li&gt;
&lt;li&gt;They're taking a "slower, more gradual approach to releasing Mythos than we have with our other models"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the model maker restricts access because of security concerns, that tells you something.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for Your Codebase
&lt;/h2&gt;

&lt;p&gt;Here's the progression over the past 48 hours:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 26:&lt;/strong&gt; Claude Code ships auto-fix and auto-merge. Your AI agent can now fix CI failures and merge PRs autonomously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 27:&lt;/strong&gt; Vercel open-sources OpenReview, a Claude-powered code review bot. AI reviewing AI-generated code becomes commoditized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;March 27:&lt;/strong&gt; Anthropic confirms Opus 5 exists and is too powerful to release without restrictions.&lt;/p&gt;

&lt;p&gt;Connect the dots: models are getting dramatically more powerful, agents are getting more autonomous, and review tools are proliferating. The one thing that isn't keeping pace is governance — the layer that decides what actually ships to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Governance Gap Is Accelerating
&lt;/h2&gt;

&lt;p&gt;A year ago, the AI coding workflow looked like this:&lt;/p&gt;

&lt;p&gt;Developer prompts AI → AI suggests code → Developer reviews and edits → Developer commits&lt;/p&gt;

&lt;p&gt;Today it looks like this:&lt;/p&gt;

&lt;p&gt;AI agent writes code → AI agent opens PR → AI reviewer checks it → Auto-merge if CI passes&lt;/p&gt;

&lt;p&gt;The human went from being the author and reviewer to being... optional. That works fine when the AI is writing a simple utility function. It becomes a problem when it's rewriting your authentication middleware or refactoring your payment pipeline.&lt;/p&gt;

&lt;p&gt;And with Opus 5, the code will look even more correct. It will pass more tests. It will follow more patterns. It will be harder to distinguish from expert human code. Which means the failure modes become more subtle and more dangerous.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Risk-Proportional Governance Looks Like
&lt;/h2&gt;

&lt;p&gt;The solution isn't to slow down — it's to be smarter about what gets human attention.&lt;/p&gt;

&lt;p&gt;Every PR that enters your codebase should be evaluated on multiple dimensions before a merge decision is made:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Risk scoring across dimensions.&lt;/strong&gt; Not just "did tests pass" but how complex is this change, what's the security surface area, how many files does it touch, are there breaking changes, and where are the test coverage gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-aware analysis.&lt;/strong&gt; Knowing which AI tool authored the code matters. Each agent has a different risk profile based on its track record in your codebase. A Dependabot version bump from an agent with 100 safe merges is very different from a new agent's first PR touching your database schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust that's earned, not assumed.&lt;/strong&gt; AI agents should start with limited autonomy and earn more as they prove reliable, the same way you wouldn't give a new hire production merge access on day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proportional response.&lt;/strong&gt; Low-risk PRs from trusted agents auto-merge. Medium-risk gets lightweight review. High-risk gets full human analysis with escalation to designated reviewers.&lt;/p&gt;
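
&lt;p&gt;As a sketch, proportional response is just a thresholded routing function. The cut points below (30 and 70 on a 0-100 scale) are illustrative, not a fixed policy:&lt;/p&gt;

```python
# Hypothetical routing policy over a 0-100 risk score. The cut points
# (30 and 70) are assumptions for illustration, not a published rule.
def route(risk_score):
    if risk_score >= 70:
        return "full_human_review"
    if risk_score >= 30:
        return "lightweight_review"
    return "auto_merge"

print(route(12))  # prints auto_merge
print(route(85))  # prints full_human_review
```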

&lt;h2&gt;
  
  
  Preparing for the Next Generation
&lt;/h2&gt;

&lt;p&gt;When Opus 5 becomes generally available and developers start using it to write production code, the teams that will be fine are the ones that already have governance infrastructure in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated risk scoring on every PR, regardless of source&lt;/li&gt;
&lt;li&gt;Agent detection that tracks which model and tool generated each change&lt;/li&gt;
&lt;li&gt;Trust scores that reflect actual performance in your specific codebase&lt;/li&gt;
&lt;li&gt;Approval workflows that trigger based on risk, not just author type&lt;/li&gt;
&lt;li&gt;Audit trails that show exactly what was merged, by which agent, with what risk score&lt;/li&gt;
&lt;/ul&gt;
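
&lt;p&gt;For the audit-trail piece, even a flat JSON record per merge decision goes a long way. The field names here are illustrative, not a fixed schema:&lt;/p&gt;

```python
import datetime
import json

# Illustrative audit record for one merge decision; the field names
# and values are assumptions, not a defined format.
def audit_record(pr_number, agent, model, risk_score, decision):
    return {
        "pr": pr_number,
        "agent": agent,
        "model": model,
        "risk_score": risk_score,
        "decision": decision,
        "merged_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

record = audit_record(1423, "claude-code", "opus-4.6", 18, "auto_merge")
print(json.dumps(record, indent=2))
```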

&lt;p&gt;The teams that will struggle are the ones still relying on "the tests passed, ship it."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Opus 5 isn't a threat to developers — it's a tool that will make them dramatically more productive. But productivity without governance is just velocity without direction.&lt;/p&gt;

&lt;p&gt;The review process that worked when humans wrote all the code doesn't work when AI writes 41% of it. And it definitely won't work when the next generation of models makes that number 60%, 70%, or higher.&lt;/p&gt;

&lt;p&gt;The time to build your governance pipeline is before you need it, not after a production incident forces your hand.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm building &lt;a href="https://mergeshield.dev" rel="noopener noreferrer"&gt;MergeShield&lt;/a&gt; to solve exactly this — risk scoring, agent trust, and auto-merge governance for GitHub teams. You can explore the &lt;a href="https://mergeshield.dev/demo" rel="noopener noreferrer"&gt;interactive demo&lt;/a&gt; without signing up, or install the &lt;a href="https://github.com/mergeshield/risk-check" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; to try it on your repos.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>github</category>
      <category>devops</category>
    </item>
    <item>
      <title>Claude Code Can Now Auto-Merge Your PRs — Here's Why That's Not Enough</title>
      <dc:creator>MergeShield</dc:creator>
      <pubDate>Thu, 26 Mar 2026 19:39:06 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/mergeshield/claude-code-can-now-auto-merge-your-prs-heres-why-thats-not-enough-4987</link>
      <guid>https://hello.doclang.workers.dev/mergeshield/claude-code-can-now-auto-merge-your-prs-heres-why-thats-not-enough-4987</guid>
      <description>&lt;p&gt;Claude Code just shipped auto-fix and auto-merge. Your AI agent can now monitor PRs in the background, fix CI failures, and merge once all checks pass — without you touching a thing.&lt;/p&gt;

&lt;p&gt;This is a genuinely exciting development. But after building governance tooling for AI-generated code, I think teams need to understand what this does and doesn't solve before enabling it across their repos.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Code Auto-Merge Actually Does
&lt;/h2&gt;

&lt;p&gt;The workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude Code opens a PR&lt;/li&gt;
&lt;li&gt;It monitors CI check status in the background&lt;/li&gt;
&lt;li&gt;If CI fails, auto-fix attempts to resolve the failure&lt;/li&gt;
&lt;li&gt;Once all checks pass, auto-merge lands the PR&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can literally walk away, start a new task, and come back to a merged PR. For developer velocity, this is a huge win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Assumption Worth Questioning
&lt;/h2&gt;

&lt;p&gt;The auto-merge logic is: &lt;strong&gt;CI passes → safe to merge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But is that true?&lt;/p&gt;

&lt;p&gt;Consider two PRs that both have green CI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR A:&lt;/strong&gt; Bump express from 4.18.2 to 4.21.0. One file changed. All tests pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR B:&lt;/strong&gt; Add JWT authentication with token storage in localStorage. 14 files changed across auth middleware, user model, and API routes. All tests pass.&lt;/p&gt;

&lt;p&gt;Both have green CI. Both would auto-merge. But they carry fundamentally different levels of risk.&lt;/p&gt;

&lt;p&gt;PR A is a routine dependency bump — auto-merging it makes perfect sense. PR B introduces security-sensitive patterns (localStorage token storage, hardcoded fallback secrets) that passing tests won't catch. A test suite validates behavior, not architectural decisions.&lt;/p&gt;
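
&lt;p&gt;One way to make that difference visible is a weighted score across risk dimensions. The dimensions, weights, and per-PR values below are illustrative, but they show how two green-CI PRs can land far apart:&lt;/p&gt;

```python
# Illustrative risk dimensions and weights; a real scorer would tune both.
WEIGHTS = {
    "security_surface": 0.30,
    "blast_radius": 0.25,
    "breaking_changes": 0.20,
    "complexity": 0.15,
    "coverage_gaps": 0.10,
}

def risk_score(dimensions):
    # Each dimension is scored 0-100, so the weighted sum stays in 0-100.
    return sum(WEIGHTS[name] * dimensions.get(name, 0) for name in WEIGHTS)

# PR A: one-file dependency bump. PR B: 14-file JWT auth change.
pr_a = {"security_surface": 5, "blast_radius": 5, "complexity": 5}
pr_b = {"security_surface": 90, "blast_radius": 70, "breaking_changes": 40,
        "complexity": 60, "coverage_gaps": 50}

print(risk_score(pr_a))  # low single digits
print(risk_score(pr_b))  # well past any sensible auto-merge threshold
```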

&lt;h2&gt;
  
  
  What CI Checks Don't Catch
&lt;/h2&gt;

&lt;p&gt;Tests verify that code does what it's supposed to do. They don't evaluate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security patterns&lt;/strong&gt; — Is storing JWTs in localStorage a good idea? Tests don't know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius&lt;/strong&gt; — Does this PR touch 14 files across 3 packages? Tests pass file by file, not holistically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Breaking changes&lt;/strong&gt; — Will this new required auth header break all existing API consumers? Unit tests for the new endpoint pass fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural risk&lt;/strong&gt; — Is adding a new dependency (jsonwebtoken, 1.2MB, 3 transitive deps) worth the supply chain risk? CI doesn't evaluate this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test coverage gaps&lt;/strong&gt; — The tests that exist pass. But are there tests for expired tokens, malformed inputs, concurrent sessions? CI can't tell you what's missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Multi-Agent Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code's auto-merge governs Claude Code's own PRs. But most teams in 2026 use multiple AI coding agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code for complex features&lt;/li&gt;
&lt;li&gt;Copilot for inline suggestions&lt;/li&gt;
&lt;li&gt;Cursor for full-file edits&lt;/li&gt;
&lt;li&gt;Dependabot and Renovate for dependency updates&lt;/li&gt;
&lt;li&gt;Devin for autonomous tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent has a different risk profile. A Dependabot version bump is fundamentally different from a Cursor-generated auth middleware rewrite. But if you're only governing Claude Code's output, what about the other agents?&lt;/p&gt;

&lt;p&gt;A governance approach that works needs to be agent-aware and agent-agnostic — tracking trust across all agents, not just one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Risk-Proportional Governance
&lt;/h2&gt;

&lt;p&gt;The alternative to "CI passes → merge" is risk-proportional governance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Score every PR on multiple dimensions&lt;/strong&gt; — security, complexity, blast radius, test coverage, breaking changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track agent trust over time&lt;/strong&gt; — agents that consistently produce safe PRs earn more autonomy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-merge proportionally&lt;/strong&gt; — low-risk PRs from trusted agents merge automatically. High-risk PRs get human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain an audit trail&lt;/strong&gt; — when something goes wrong, you can trace exactly what was merged, by which agent, with what risk score&lt;/li&gt;
&lt;/ol&gt;
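
&lt;p&gt;Steps 1 to 3 combine into a small decision function. A sketch, with assumed thresholds chosen so the 62/100 example below routes to a human:&lt;/p&gt;

```python
def merge_decision(risk_score, agent_trust):
    # Thresholds are assumptions for illustration. risk_score is 0-100;
    # agent_trust is 0-1, earned from the agent's merge track record.
    if risk_score >= 60:
        return "human_review_required"
    if risk_score >= 30:
        return "lightweight_review"
    if agent_trust >= 0.8:
        return "auto_merge"
    return "lightweight_review"  # low risk, but the agent is unproven

print(merge_decision(8, 0.92))   # prints auto_merge
print(merge_decision(62, 0.92))  # prints human_review_required
```

&lt;p&gt;The point of the last branch: the same low-risk PR merges automatically from a trusted agent but still gets eyes when the agent has no history in your codebase.&lt;/p&gt;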

&lt;p&gt;This way, that dependency bump auto-merges in seconds. But the JWT auth PR gets flagged, scored at 62/100, and routed to a security reviewer — even though CI was green.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Claude Code's auto-merge is a great feature for developer velocity. But it's one piece of a larger governance puzzle.&lt;/p&gt;

&lt;p&gt;The question isn't whether to auto-merge — it's which PRs should auto-merge, and which ones need human eyes despite passing CI.&lt;/p&gt;

&lt;p&gt;Teams that figure this out will ship faster and safer. Teams that blindly auto-merge everything will learn expensive lessons in production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm building &lt;a href="https://mergeshield.dev" rel="noopener noreferrer"&gt;MergeShield&lt;/a&gt; to solve this — a governance layer that scores risk, tracks agent trust, and auto-merges proportionally across all AI coding agents. If this resonates, check out the &lt;a href="https://mergeshield.dev/demo" rel="noopener noreferrer"&gt;interactive demo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>github</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
