Speedcraft Lab

Posted on Apr 19 • Originally published at Medium

3 Lines of SQL Wiped Our Entire AWS Database

#devops #tutorial #programming #webdev

Every backend team has this script sitting in their repo right now. Most will not find it until it is too late.

Row count: 0. Not slow. Not degraded. Zero. The compiler had no errors. Neither did the database.

The Deploy That Couldn’t Go Wrong

The Slack notification wasn’t a monitoring alert. Not a PagerDuty page. Just our CTO, typing in all lowercase, which was somehow worse: “hey can someone check the users table? something looks off.”

I unlocked my phone and opened the AWS RDS console. The row count for our primary users table loaded: 0. Not degraded. Not slow. Zero rows. I scrolled up to make sure I was looking at production. I was.

Hours earlier, I had run what our ticket called a “low-risk schema cleanup.”

The Confidence That Made It Possible

The migration was small. Three lines of SQL to clear out deprecated legacy_auth session data we hadn't touched in two years. I had run it in staging. It passed. I reviewed the script twice. We were days from a board demo, and the task felt like housekeeping.

I thought I understood database migrations. I did not.

The WHERE Clause That Wasn’t There

Here is what the migration script was supposed to execute:

DELETE FROM user_sessions WHERE session_type = 'legacy_auth';

Here is what it actually contained:

DELETE FROM user_sessions;

That’s not even the worst part. Our users table had a foreign key into user_sessions declared with ON DELETE CASCADE. Which meant when every session row vanished, the cascade tore through users without a single error message. No constraint violation. No alarm. Just silence and a shrinking row count.
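To make the cascade concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the RDS engine from the incident. The schema is hypothetical (the post does not show the real one): it assumes each users row references a user_sessions row with ON DELETE CASCADE, which is the direction in which a cascade removes rows, from the referenced table into the table holding the foreign key.

```python
import sqlite3


def cascade_demo():
    """Return (users_before, users_after) around an unscoped DELETE."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical schema: every user row points at a session row.
    conn.executescript(
        """
        CREATE TABLE user_sessions (
            id INTEGER PRIMARY KEY,
            session_type TEXT NOT NULL
        );
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            email TEXT NOT NULL,
            session_id INTEGER NOT NULL
                REFERENCES user_sessions(id) ON DELETE CASCADE
        );
        INSERT INTO user_sessions (id, session_type) VALUES
            (1, 'legacy_auth'), (2, 'oauth'), (3, 'oauth');
        INSERT INTO users (id, email, session_id) VALUES
            (1, 'a@example.com', 1),
            (2, 'b@example.com', 2),
            (3, 'c@example.com', 3);
        """
    )
    before = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    # The unscoped statement: no WHERE clause, no error, no warning.
    conn.execute("DELETE FROM user_sessions")
    after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
    return before, after


if __name__ == "__main__":
    print(cascade_demo())
```

The DELETE names only user_sessions, yet the users count drops from 3 to 0. The database treats that as correct behaviour, so nothing is logged as a failure.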

I stared at the script for a long time before I understood what I was reading.

The WHERE clause existed in the version I had reviewed. Somewhere between that review and the final execution, I had opened the wrong file. A near-identical filename: migration_cleanup_v2.sql versus migration_cleanup_v2_FINAL.sql. An older draft. A copy I made two weeks earlier when I first scoped the task, then abandoned in favour of the revised version, then apparently reopened by accident when I was in a hurry to ship.

Our DBA pulled the binary logs. The DELETE had executed cleanly, cascade operations completed without error, and I had already closed my laptop and driven home before anyone noticed. Nobody saw it until hours later.

“How long to restore?” our CTO asked on the call.

Nobody answered immediately.

The Silence After

The Slack channel filled up in about four minutes. Our head of support posted first: customers were reporting login failures across the board. Then our backend lead: “RDS looks healthy, no errors but the tables are empty.” Then the CTO, just: “call?”

We had a snapshot policy on RDS. The most recent snapshot existed. But our retention policy had been silently failing for six weeks due to an IAM permission scoping issue nobody had caught. The last valid, restorable snapshot was eleven days old.
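A silent backup failure like this can be turned into a loud one with a freshness check. The following is a sketch under assumptions, not what we ran: it takes a list of snapshot records shaped like the ones RDS reports (the SnapshotCreateTime and Status keys are assumed to match what your tooling returns), and the 26-hour threshold is a hypothetical value for a daily schedule.

```python
from datetime import datetime, timedelta, timezone


def newest_restorable_age(snapshots, now=None):
    """Age of the newest snapshot marked 'available', or None if there is none."""
    now = now or datetime.now(timezone.utc)
    times = [
        s["SnapshotCreateTime"]
        for s in snapshots
        if s.get("Status") == "available"
    ]
    if not times:
        return None
    return now - max(times)


def assert_snapshot_fresh(snapshots, max_age=timedelta(hours=26), now=None):
    """Raise if no restorable snapshot is newer than max_age (hypothetical default)."""
    age = newest_restorable_age(snapshots, now=now)
    if age is None or age > max_age:
        raise RuntimeError(
            f"no restorable snapshot newer than {max_age}; newest age: {age}"
        )
    return age
```

Run on a schedule and wired to an alert, a check of this shape would have flagged the gap long before a restore was needed.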

What a Missing WHERE Clause Actually Costs

A DELETE without a WHERE clause is not a typo. It is a category of mistake, one that results from treating a migration script as a throwaway file rather than as production code subject to the same scrutiny as application logic.
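Treating migrations as production code can start with a lint step. Below is a deliberately naive sketch of one: the function name is hypothetical, it splits on semicolons and strips "--" comments with regular expressions instead of parsing SQL, so it will misread semicolons or "--" inside string literals. It is meant as a tripwire, not a parser.

```python
import re


def find_unscoped_statements(sql_text):
    """Return DELETE/UPDATE statements that contain no WHERE clause."""
    # Drop "--" line comments so a commented-out WHERE does not count.
    without_comments = re.sub(r"--[^\n]*", "", sql_text)
    unscoped = []
    for raw in without_comments.split(";"):
        # Collapse whitespace so multi-line statements compare cleanly.
        statement = " ".join(raw.split())
        if not re.match(r"(DELETE\s+FROM|UPDATE)\b", statement, re.IGNORECASE):
            continue
        if not re.search(r"\bWHERE\b", statement, re.IGNORECASE):
            unscoped.append(statement)
    return unscoped


if __name__ == "__main__":
    print(find_unscoped_statements("DELETE FROM user_sessions;"))
```

A CI job could fail the build whenever the returned list is non-empty, forcing an explicit override for the rare migration that really does mean to empty a table.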

The analogy that stuck with our team: a migration script is not a sticky note. It is a signed contract. You would not countersign a contract without reading every clause. You should not run a migration without diffing the exact file you are executing against your reviewed version: not the filename, the actual content.
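One way to check content rather than filename is to record a hash of the script at review time and refuse to run anything that does not match it. This is a minimal sketch under that assumption. The function names are hypothetical, and where the reviewed hash is stored (PR description, ticket, deploy config) is left open.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path):
    """Hash the file's bytes, so two look-alike filenames cannot be confused."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def assert_reviewed(path, reviewed_sha256):
    """Raise unless the file on disk matches the hash recorded at review."""
    actual = sha256_of_file(path)
    if actual != reviewed_sha256:
        raise RuntimeError(
            f"{path} does not match the reviewed content (sha256 {actual})"
        )
    return actual
```

With this in the deploy path, migration_cleanup_v2.sql and migration_cleanup_v2_FINAL.sql could have identical names for all it matters: only the file whose bytes were reviewed gets to run.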

The structural fix we adopted:

-- Step 1: Run the SELECT equivalent first and confirm the affected row count
SELECT COUNT(*) FROM user_sessions WHERE session_type = 'legacy_auth';
-- Expected: 4,203. If the number surprises you, stop here.

-- Step 2: Wrap the DELETE in a transaction and verify before committing
BEGIN;
DELETE FROM user_sessions WHERE session_type = 'legacy_auth';
-- Verify: total remaining rows = original count minus expected deletes
SELECT COUNT(*) FROM user_sessions;
-- If the number does not match your expectation, run ROLLBACK; instead
COMMIT; -- Only after the number matches your expectation

The trade-off worth naming: this pattern requires a human in the loop and does not scale cleanly into automated pipelines. For continuous delivery environments, you want row-count assertions baked into your migration framework as hard failure conditions, not as comments in a script that a tired engineer might skip under pressure.
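As a rough illustration of a row-count assertion that fails hard, here is a sketch against Python's built-in sqlite3 module. The guarded_execute helper and RowCountMismatch exception are hypothetical names, not part of any migration framework. A real pipeline would hook the same check into whatever runner it already uses.

```python
import sqlite3


class RowCountMismatch(Exception):
    """Raised when a statement touches a different number of rows than expected."""


def guarded_execute(conn, sql, params=(), *, expected_rows):
    """Run one destructive statement; commit only if the row count matches."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise RowCountMismatch(
                f"expected {expected_rows} rows, statement affected {cur.rowcount}"
            )
        conn.commit()
    except Exception:
        # Any failure, including the mismatch, undoes the statement.
        conn.rollback()
        raise
    return cur.rowcount
```

Usage would look like guarded_execute(conn, "DELETE FROM user_sessions WHERE session_type = ?", ("legacy_auth",), expected_rows=4203). One caveat: in SQLite the reported count covers only rows the statement deletes directly, not rows removed by a cascade, so this guards the statement's scope rather than its downstream effects.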

We recovered from the eleven-day-old snapshot in approximately 11 hours, cross-referencing application logs to reconstruct the most recent writes. We recovered most of the data. Not all of it. The board demo moved by one week.

The Next Migration You Run

Before you push your next migration to production, open the exact file you are about to execute (not a reviewed copy, the actual file), run the equivalent SELECT for every DELETE or UPDATE it contains, and confirm the affected row count matches what you expected before you proceed. If the number looks wrong, even slightly, treat that discrepancy as a blocker and do not move forward until you have a clear explanation. The sentence you should use in your next PR review or post-mortem: “We need a verified row-count assertion before any destructive migration touches production.”

The Console Tab That’s Still Open

I still have the AWS RDS console bookmarked at the path it was on when I first saw that row count.

Sometimes I open it not to check anything but to remember what zero looks like in a field where three years of users used to be. The script was three lines. The WHERE clause was eleven characters. The recovery was 11 hours. The difference between a clean deploy and a disaster was a filename.

Have you hit this? What did you do differently the next time?

If this resonated, you might also find Why Your Migration Strategy Is Only as Strong as Your Snapshot Policy useful. It covers how silent backup failures and database migrations combine to turn recoverable mistakes into all-hands incidents. I’m writing a series on production failure patterns and the concrete process changes that followed each one, so follow along if you don’t want to miss the next one.
