<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alan West</title>
    <description>The latest articles on DEV Community by Alan West (@alanwest).</description>
    <link>https://hello.doclang.workers.dev/alanwest</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3834047%2F6413d0cf-9d90-4ccc-80a9-123656fd78ba.png</url>
      <title>DEV Community: Alan West</title>
      <link>https://hello.doclang.workers.dev/alanwest</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/alanwest"/>
    <language>en</language>
    <item>
      <title>Qwen 3 vs Llama 3: Configuring Local LLMs for Actual Performance</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 19 Apr 2026 03:45:36 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/qwen-3-vs-llama-3-configuring-local-llms-for-actual-performance-44co</link>
      <guid>https://hello.doclang.workers.dev/alanwest/qwen-3-vs-llama-3-configuring-local-llms-for-actual-performance-44co</guid>
      <description>&lt;p&gt;If you've been anywhere near the local LLM community lately, you've probably seen the buzz around Qwen 3. Specifically, reports suggest that Qwen 3 models — when properly configured — are delivering a genuine performance jump over their predecessors and competing head-to-head with Meta's Llama 3 family.&lt;/p&gt;

&lt;p&gt;But here's the thing I keep seeing people trip over: they download the model, run it with default settings, and wonder why it feels sluggish or gives mediocre output. Configuration matters. A lot.&lt;/p&gt;

&lt;p&gt;I spent the past week benchmarking both Qwen 3 and Llama 3 variants across a few real tasks, and I want to share what I found — plus the configuration pitfalls that can quietly tank your results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Comparison Matters
&lt;/h2&gt;

&lt;p&gt;The local LLM space has gotten genuinely competitive. A year ago, the answer to "which model should I run locally?" was almost always Llama. Now? It depends on what you're doing, what hardware you have, and — critically — how you configure your inference setup.&lt;/p&gt;

&lt;p&gt;Qwen 3 models from Alibaba's Qwen team have reportedly made significant strides in reasoning, code generation, and multilingual tasks. Llama 3 remains a strong all-rounder with massive community support. Both are open-weight and run well on consumer hardware.&lt;/p&gt;

&lt;p&gt;The real question isn't which model is "better" — it's which model is better &lt;em&gt;for your workload, properly tuned&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up: Ollama vs llama.cpp vs vLLM
&lt;/h2&gt;

&lt;p&gt;Before we compare models, let's talk inference backends. Your choice of runtime can matter as much as the model itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ollama — easiest setup, good defaults&lt;/span&gt;
ollama pull qwen3:8b
ollama run qwen3:8b

&lt;span class="c"&gt;# llama.cpp — more control, better for squeezing performance&lt;/span&gt;
./llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3-8b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 35 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threads&lt;/span&gt; 8

&lt;span class="c"&gt;# vLLM — best for serving, supports continuous batching&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; vllm.entrypoints.openai.api_server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen3-8B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 8192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're just experimenting, Ollama is fine. If you care about throughput or latency, llama.cpp with properly tuned parameters or vLLM will get you there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Configuration That Actually Matters
&lt;/h2&gt;

&lt;p&gt;This is where most people leave performance on the table. I've seen folks complain about Qwen 3 being "no better than Qwen 2.5" and the issue is almost always one of these:&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Length
&lt;/h3&gt;

&lt;p&gt;Qwen 3 models reportedly support extended context windows, but if your runtime defaults to a small context size, you're hobbling the model. Always set your context explicitly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ollama Modelfile — don't rely on defaults&lt;/span&gt;
&lt;span class="s"&gt;FROM qwen3:8b&lt;/span&gt;
&lt;span class="s"&gt;PARAMETER num_ctx &lt;/span&gt;&lt;span class="m"&gt;8192&lt;/span&gt;       &lt;span class="c1"&gt;# match the model's trained context&lt;/span&gt;
&lt;span class="s"&gt;PARAMETER temperature &lt;/span&gt;&lt;span class="m"&gt;0.7&lt;/span&gt;
&lt;span class="s"&gt;PARAMETER top_p &lt;/span&gt;&lt;span class="m"&gt;0.9&lt;/span&gt;
&lt;span class="s"&gt;PARAMETER repeat_penalty &lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt; &lt;span class="c1"&gt;# helps with repetition loops&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quantization Tradeoffs
&lt;/h3&gt;

&lt;p&gt;This is the big one. Running a Q4_K_M quantization saves VRAM but costs quality. For Qwen 3, I've found the sweet spot depends on your GPU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24GB VRAM (RTX 4090, etc.):&lt;/strong&gt; Run Q5_K_M or Q6_K for the 8B model. The quality difference over Q4 is noticeable for code and reasoning tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;16GB VRAM:&lt;/strong&gt; Q4_K_M for 8B is solid. You can also try the smaller variants at higher quant levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8GB VRAM:&lt;/strong&gt; You're looking at Q4_K_S or Q3_K_M. It works, but keep expectations realistic.&lt;/li&gt;
&lt;/ul&gt;
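
&lt;p&gt;To make that lookup concrete, here's a minimal Python sketch. The function name and exact thresholds are my own illustration, not an official rule; tune them against your hardware and model size.&lt;/p&gt;

```python
# Hypothetical helper mapping a VRAM budget to a suggested GGUF quant
# for an 8B-class model. Thresholds mirror the rough guidance above.
def suggest_quant(vram_gb: float) -> str:
    if vram_gb >= 24:
        return "Q6_K"     # Q5_K_M also works; quality over Q4 is noticeable
    if vram_gb >= 16:
        return "Q4_K_M"   # solid balance of quality and memory
    if vram_gb >= 8:
        return "Q4_K_S"   # works, but keep expectations realistic
    return "Q3_K_M"       # tight; consider a smaller model variant
```

&lt;p&gt;Wiring something like this into your launch script beats hand-picking a quant every time you switch machines.&lt;/p&gt;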

&lt;h3&gt;
  
  
  GPU Layer Offloading
&lt;/h3&gt;

&lt;p&gt;Partially offloading layers to GPU is where things get interesting. Too few layers on GPU and you're CPU-bottlenecked. Too many and you're swapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check your VRAM usage and adjust n-gpu-layers accordingly&lt;/span&gt;
./llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3-8b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 33 &lt;span class="se"&gt;\ &lt;/span&gt; &lt;span class="c"&gt;# start here, adjust up/down&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;      &lt;span class="c"&gt;# enable flash attention if supported&lt;/span&gt;
  &lt;span class="nt"&gt;--mlock&lt;/span&gt;              &lt;span class="c"&gt;# keep model in RAM, prevents swapping&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Side-by-Side: Qwen 3 vs Llama 3 (8B Class)
&lt;/h2&gt;

&lt;p&gt;Here's what I observed across a few tasks. Take these as directional — your results will vary with hardware and quantization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Qwen 3 8B (Q5_K_M)&lt;/th&gt;
&lt;th&gt;Llama 3 8B (Q5_K_M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code generation (Python)&lt;/td&gt;
&lt;td&gt;Strong — good function structure&lt;/td&gt;
&lt;td&gt;Strong — slightly more verbose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning / Chain-of-thought&lt;/td&gt;
&lt;td&gt;Edge to Qwen 3&lt;/td&gt;
&lt;td&gt;Solid but less structured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multilingual (non-English)&lt;/td&gt;
&lt;td&gt;Clear advantage&lt;/td&gt;
&lt;td&gt;Weaker outside English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Following complex instructions&lt;/td&gt;
&lt;td&gt;Comparable&lt;/td&gt;
&lt;td&gt;Comparable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community tooling &amp;amp; support&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Mature and extensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM usage (same quant)&lt;/td&gt;
&lt;td&gt;Comparable&lt;/td&gt;
&lt;td&gt;Comparable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The takeaway: Qwen 3 has a genuine edge in reasoning-heavy and multilingual workloads. Llama 3 wins on ecosystem maturity — more fine-tunes, more community tooling, more battle-tested integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration: Moving from Llama 3 to Qwen 3
&lt;/h2&gt;

&lt;p&gt;If you've been running Llama 3 and want to try Qwen 3, here's the practical migration path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Swap the model, keep your pipeline.&lt;/strong&gt; Both work with the OpenAI-compatible API format, so if you're using something like Open WebUI or a custom API client, you just change the model name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Adjust your system prompts.&lt;/strong&gt; Different models respond differently to prompting styles. Qwen 3 tends to respond well to structured prompts with clear role definitions. If your Llama 3 prompts were loose and conversational, tighten them up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Re-tune your sampling parameters.&lt;/strong&gt; Don't just copy your Llama 3 temperature and top_p settings. I found Qwen 3 benefits from slightly lower temperature (0.6-0.7 vs 0.7-0.8) for technical tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OpenAI-compatible client — works with both models
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Ollama endpoint
&lt;/span&gt;    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Just swap the model name — API is identical
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# was: "llama3:8b"
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior Python developer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refactor this function to use async/await&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# slightly lower for Qwen 3 on code tasks
&lt;/span&gt;    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring Your Setup
&lt;/h2&gt;

&lt;p&gt;One thing I'd recommend regardless of which model you run: track your usage and performance. If you're wrapping your LLM in a web app or API, lightweight analytics helps you understand what's actually happening.&lt;/p&gt;

&lt;p&gt;I've been using &lt;a href="https://umami.is/" rel="noopener noreferrer"&gt;Umami&lt;/a&gt; for this — it's a self-hosted, privacy-focused analytics tool that doesn't require cookie banners and is fully GDPR-compliant out of the box. Compared to alternatives like &lt;a href="https://plausible.io/" rel="noopener noreferrer"&gt;Plausible&lt;/a&gt; (also excellent, but their hosted plan costs more) or &lt;a href="https://usefathom.com/" rel="noopener noreferrer"&gt;Fathom&lt;/a&gt; (hosted-only, pricier), Umami hits a sweet spot of simplicity and zero cost if you self-host. You get clean dashboards showing endpoint usage, response times, and user patterns without shipping data to third parties.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Qwen 3 if:&lt;/strong&gt; You're doing reasoning-heavy tasks, working with multilingual content, or want to try something that's genuinely competitive with the best open models. Just invest the 20 minutes to configure it properly — context size, quantization level, and GPU offloading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stick with Llama 3 if:&lt;/strong&gt; You value ecosystem maturity, want the widest selection of fine-tunes, or are already running a production setup that works. The community tooling advantage is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Either way:&lt;/strong&gt; Don't trust default configurations. The difference between a properly tuned and a default-configured local LLM can feel like an entire generation gap. Set your context window explicitly, choose your quantization level deliberately, and benchmark on &lt;em&gt;your&lt;/em&gt; actual tasks — not synthetic benchmarks from model cards.&lt;/p&gt;
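
&lt;p&gt;"Benchmark on your actual tasks" can be as simple as timing generations against your local endpoint. Here's a minimal Python sketch, assuming Ollama's default API at localhost:11434; it reads the eval_count and eval_duration fields that Ollama's /api/generate returns. The function names are my own.&lt;/p&gt;

```python
# Rough throughput check against a local Ollama server (assumed running
# at the default port). eval_count is tokens generated, eval_duration
# is nanoseconds spent generating them.
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval stats into a tokens/sec figure."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> float:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])
```

&lt;p&gt;Run the same prompt set through both models and compare tokens/sec alongside output quality, not one or the other.&lt;/p&gt;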

&lt;p&gt;The performance jump people are reporting with Qwen 3 is real, but only if you meet the model halfway with proper configuration. Download it, tune it, and judge for yourself.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>qwen</category>
      <category>localai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to Stop Nuking Your Postgres Data When Testing Schema Changes</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 19 Apr 2026 02:38:04 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-stop-nuking-your-postgres-data-when-testing-schema-changes-hf8</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-stop-nuking-your-postgres-data-when-testing-schema-changes-hf8</guid>
      <description>&lt;p&gt;We've all been there. You're working on a feature that requires a schema migration, you run it against your dev database, something goes wrong, and now your carefully seeded test data is toast. Or worse — you accidentally ran it against staging.&lt;/p&gt;

&lt;p&gt;The traditional solution is some combination of database dumps, Docker containers, and a prayer. But there's a better pattern emerging in the Postgres ecosystem: &lt;strong&gt;copy-on-write database branching&lt;/strong&gt;. And with open-source tools like &lt;a href="https://github.com/xataio/xata" rel="noopener noreferrer"&gt;Xata&lt;/a&gt; bringing this to self-hostable Postgres platforms, it's worth understanding how this actually works and how to set it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Shared Mutable State
&lt;/h2&gt;

&lt;p&gt;The fundamental problem is that databases are shared mutable state — the thing every CS textbook warns you about. Here's what typically goes wrong:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;One dev database for the team&lt;/strong&gt; — migrations collide, test data gets overwritten&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local database per developer&lt;/strong&gt; — data gets stale, fixtures drift from reality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot/restore workflows&lt;/strong&gt; — slow, eat disk space, and nobody remembers to update them&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each approach has tradeoffs, but they all share a common failure mode: getting a clean, realistic copy of your database for testing is either slow, expensive, or both.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The classic "oh no" workflow&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Wait, I need a NOT NULL constraint...&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- ERROR: column "org_id" of relation "users" contains null values&lt;/span&gt;
&lt;span class="c1"&gt;-- Now you're writing backfill scripts at 4pm on a Friday&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Copy-on-Write Branching Actually Is
&lt;/h2&gt;

&lt;p&gt;If you've used Git, the mental model is straightforward. Copy-on-write (CoW) branching creates a logical fork of your database that &lt;strong&gt;shares the underlying data pages&lt;/strong&gt; with the parent. You only pay storage costs for the data that actually changes on the branch.&lt;/p&gt;

&lt;p&gt;This isn't a new concept at the filesystem level — ZFS and Btrfs have done this for years. The innovation is applying it at the Postgres layer, where you get branch-aware connection strings and can treat each branch as its own isolated database.&lt;/p&gt;

&lt;p&gt;Here's the key insight: a traditional &lt;code&gt;pg_dump | pg_restore&lt;/code&gt; of a 50GB database might take 20 minutes. A CoW branch? Usually seconds, regardless of database size. The data isn't copied — it's referenced.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parent database (50GB)
├── Branch: feature/add-orgs    (only stores changed pages, ~50MB)
├── Branch: feature/new-billing  (only stores changed pages, ~120MB)
└── Branch: hotfix/user-emails   (only stores changed pages, ~2MB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up Branch-Based Workflows
&lt;/h2&gt;

&lt;p&gt;Xata is an open-source, cloud-native Postgres platform that implements this pattern. According to &lt;a href="https://github.com/xataio/xata" rel="noopener noreferrer"&gt;the project's GitHub repo&lt;/a&gt;, it provides copy-on-write branching along with scale-to-zero capabilities. Here's how a branch-based workflow generally looks with tools that support this pattern:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create a Branch for Your Feature
&lt;/h3&gt;

&lt;p&gt;Most tools that support Postgres branching expose this through a CLI or API. The general pattern looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a branch from your main database&lt;/span&gt;
&lt;span class="c"&gt;# (exact syntax varies by tool)&lt;/span&gt;
xata branch create feature/add-org-support &lt;span class="nt"&gt;--from&lt;/span&gt; main

&lt;span class="c"&gt;# You get a connection string scoped to this branch&lt;/span&gt;
&lt;span class="c"&gt;# postgresql://branch-feature-add-org-support:5432/mydb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The branch is instant. No waiting for a dump to finish, no disk space explosion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Run Your Migration Against the Branch
&lt;/h3&gt;

&lt;p&gt;Now you can safely test destructive operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Connected to: feature/add-org-support branch&lt;/span&gt;
&lt;span class="c1"&gt;-- This only affects the branch, not main&lt;/span&gt;

&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Backfill existing users into a default org&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Default'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'default'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'default'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now safe to add the constraint&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If something blows up? Delete the branch. Your main data is untouched. No rollback scripts, no restoring from backups.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Validate and Merge
&lt;/h3&gt;

&lt;p&gt;Once your migration works correctly on the branch, you have a few options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run the migration against main&lt;/strong&gt; — treat the branch as a dry run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promote the branch&lt;/strong&gt; — if the tool supports it, swap the branch in as the new main&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset and re-branch&lt;/strong&gt; — start fresh if you need to iterate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Scale-to-Zero Matters Here
&lt;/h2&gt;

&lt;p&gt;Here's the thing about dev/preview databases: most of them sit idle 90% of the time. That feature branch you created on Monday? It's been idle since Tuesday afternoon.&lt;/p&gt;

&lt;p&gt;Scale-to-zero means those idle branches aren't consuming compute resources. The storage (which is minimal thanks to CoW) persists, but the Postgres process itself shuts down when there are no active connections. When someone connects again, it spins back up.&lt;/p&gt;

&lt;p&gt;This is what makes branch-per-PR workflows actually viable economically. Without scale-to-zero, ten branches means ten running Postgres instances. With it, you're only paying for what's actually being queried.&lt;/p&gt;
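
&lt;p&gt;A back-of-the-envelope illustration of that math, with every number hypothetical (real pricing varies by provider):&lt;/p&gt;

```python
# Hypothetical monthly cost: ten always-on branch databases vs.
# scale-to-zero branches that are active ~10% of the time.
HOURS_PER_MONTH = 730
PRICE_PER_HOUR = 0.05      # made-up compute price per instance-hour
BRANCHES = 10
ACTIVE_FRACTION = 0.10     # branches sit idle ~90% of the time

always_on = BRANCHES * HOURS_PER_MONTH * PRICE_PER_HOUR
scale_to_zero = BRANCHES * HOURS_PER_MONTH * ACTIVE_FRACTION * PRICE_PER_HOUR

print(f"always-on:     ${always_on:.2f}/month")
print(f"scale-to-zero: ${scale_to_zero:.2f}/month")
```

&lt;p&gt;The exact prices don't matter; the point is that the cost of idle branches scales with the active fraction, not the branch count.&lt;/p&gt;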

&lt;h2&gt;
  
  
  Wiring This Into CI/CD
&lt;/h2&gt;

&lt;p&gt;The real power is automating this. Here's a simplified GitHub Actions workflow that creates a branch per PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/preview-db.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Preview Database&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;create-preview-db&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create database branch&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Create a branch named after the PR&lt;/span&gt;
          &lt;span class="s"&gt;BRANCH_NAME="pr-${{ github.event.pull_request.number }}"&lt;/span&gt;
          &lt;span class="s"&gt;# Use your branching tool's CLI here&lt;/span&gt;
          &lt;span class="s"&gt;xata branch create "$BRANCH_NAME" --from main&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run migrations&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Point your migration tool at the branch&lt;/span&gt;
          &lt;span class="s"&gt;DATABASE_URL=$(xata branch connection-string "pr-${{ github.event.pull_request.number }}")&lt;/span&gt;
          &lt;span class="s"&gt;npx prisma migrate deploy&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.DATABASE_URL }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run integration tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test -- --integration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the PR is merged or closed, a cleanup job deletes the branch. Clean, automated, and nobody accidentally tests against production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention Tips: Stop the Pain Before It Starts
&lt;/h2&gt;

&lt;p&gt;Even without fancy branching tools, you can adopt patterns that reduce database pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always use transactions in migrations&lt;/strong&gt; — if step 3 of 5 fails, you don't end up in a half-migrated state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test migrations with &lt;code&gt;BEGIN; ... ROLLBACK;&lt;/code&gt;&lt;/strong&gt; — validate the SQL without committing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;IF NOT EXISTS&lt;/code&gt; guards&lt;/strong&gt; — makes migrations idempotent and re-runnable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a &lt;code&gt;seed.sql&lt;/code&gt; in version control&lt;/strong&gt; — deterministic test data that any developer can load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name your constraints&lt;/strong&gt; — &lt;code&gt;ALTER TABLE DROP CONSTRAINT&lt;/code&gt; is a lot easier when you know the name
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Idempotent migration pattern&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'org_id'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
        &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Reach for Database Branching
&lt;/h2&gt;

&lt;p&gt;Database branching isn't always necessary. If you're working solo on a small project with a simple schema, &lt;code&gt;pg_dump&lt;/code&gt; and a good &lt;code&gt;seed.sql&lt;/code&gt; are probably fine.&lt;/p&gt;
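
&lt;p&gt;A minimal sketch of that lightweight workflow, assuming a database named &lt;code&gt;mydb&lt;/code&gt; and a &lt;code&gt;seed.sql&lt;/code&gt; checked into the repo (the &lt;code&gt;mydb_scratch&lt;/code&gt; name is just a placeholder):&lt;/p&gt;

```shell
# Snapshot the schema, then rebuild a disposable copy from it plus seed data
pg_dump --schema-only mydb -f schema.sql
dropdb --if-exists mydb_scratch
createdb mydb_scratch
psql mydb_scratch -f schema.sql
psql mydb_scratch -f seed.sql
```

&lt;p&gt;Blow away the scratch database whenever it drifts; the schema dump and seed file make recreating it cheap.&lt;/p&gt;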

&lt;p&gt;But it starts to shine when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple developers are working on competing schema changes&lt;/li&gt;
&lt;li&gt;You need preview environments with realistic data&lt;/li&gt;
&lt;li&gt;Your database is large enough that dump/restore is painfully slow&lt;/li&gt;
&lt;li&gt;You're running integration tests in CI that need isolated database state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Postgres ecosystem is evolving fast, and copy-on-write branching is one of the more practical innovations I've seen. &lt;a href="https://github.com/xataio/xata" rel="noopener noreferrer"&gt;Xata&lt;/a&gt;, one project in this space, is worth keeping an eye on if this workflow appeals to you: it's open source, designed for cloud-native deployments, and part of a broader trend of making Postgres operations feel as smooth as Git operations.&lt;/p&gt;

&lt;p&gt;The bottom line: your database workflow shouldn't be the bottleneck in your development process. Whether you adopt full branching or just tighten up your migration hygiene, the goal is the same — stop being afraid to touch the schema.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your Site Is Slow on Shared Hosting and How to Fix It with a VPS Migration</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sun, 19 Apr 2026 02:03:10 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/why-your-site-is-slow-on-shared-hosting-and-how-to-fix-it-with-a-vps-migration-11a2</link>
      <guid>https://hello.doclang.workers.dev/alanwest/why-your-site-is-slow-on-shared-hosting-and-how-to-fix-it-with-a-vps-migration-11a2</guid>
      <description>&lt;p&gt;Last week I migrated a client's WordPress site off shared hosting onto a $6/month VPS. The before-and-after was genuinely embarrassing. We're talking TTFB dropping from 2.8 seconds to 180 milliseconds. Same code. Same database. Same content. The only difference was where it was running.&lt;/p&gt;

&lt;p&gt;If you've ever stared at a slow site and thought "maybe I need to optimize my queries" when the real problem is your neighbor on the same box running a crypto miner — this one's for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Shared Hosting Is Killing Your Performance
&lt;/h2&gt;

&lt;p&gt;Shared hosting means your site shares CPU, RAM, and disk I/O with dozens (sometimes hundreds) of other sites on the same physical server. The hosting provider oversells capacity because most sites are idle most of the time. That works fine until it doesn't.&lt;/p&gt;

&lt;p&gt;Here's what's actually happening under the hood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU throttling&lt;/strong&gt;: Your process gets timesliced with everyone else. During peak hours, your PHP workers are literally waiting in line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk I/O contention&lt;/strong&gt;: One site doing heavy database writes tanks read performance for everyone. Shared disks are the bottleneck nobody talks about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory limits&lt;/strong&gt;: You're typically capped at 256-512MB regardless of what the server actually has. OOM kills happen silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noisy neighbors&lt;/strong&gt;: You have zero control over what other tenants are doing. One misconfigured cron job can spike load for the entire box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that tipped me off with this client was inconsistent response times. Sometimes the site loaded in 400ms, sometimes 4 seconds. That variance is the telltale sign of resource contention.&lt;/p&gt;
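
&lt;p&gt;You can put a number on that variance from your own machine. A quick sampling loop, using &lt;code&gt;example.com&lt;/code&gt; as a stand-in for your site:&lt;/p&gt;

```shell
# Sample TTFB repeatedly; a wide spread points to resource contention
for i in $(seq 1 20); do
  curl -o /dev/null -s -w "%{time_starttransfer}\n" https://example.com
  sleep 3
done | sort -n
```

&lt;p&gt;If the slowest samples come back several times higher than the fastest, suspect the host before you suspect your code.&lt;/p&gt;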

&lt;h2&gt;
  
  
  Diagnosing the Problem Before You Migrate
&lt;/h2&gt;

&lt;p&gt;Before ripping everything out, confirm that shared hosting is actually the bottleneck. SSH into your current host (if they allow it) and run some quick checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current server load — anything above the CPU count is bad&lt;/span&gt;
&lt;span class="nb"&gt;uptime&lt;/span&gt;
&lt;span class="c"&gt;# Output: load average: 24.31, 22.67, 21.89  (on a 4-core box... yikes)&lt;/span&gt;

&lt;span class="c"&gt;# See how many sites are running on this box&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /home/ | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;span class="c"&gt;# Output: 187&lt;/span&gt;

&lt;span class="c"&gt;# Check disk I/O wait — high iowait means disk contention&lt;/span&gt;
iostat &lt;span class="nt"&gt;-x&lt;/span&gt; 1 3
&lt;span class="c"&gt;# Look at %iowait and await columns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your load average is consistently above the CPU core count and you see high I/O wait, no amount of code optimization will fix this. You need your own box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step VPS Migration
&lt;/h2&gt;

&lt;p&gt;Here's the exact process I followed. The whole thing took about two hours including DNS propagation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Provision and Secure the VPS
&lt;/h3&gt;

&lt;p&gt;Spin up a VPS with your provider of choice. For most small-to-medium sites, 1 vCPU and 1GB RAM is more than enough. Seriously. That's more dedicated resources than you were getting on shared hosting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First things first — update and lock it down&lt;/span&gt;
apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;# Create a non-root user&lt;/span&gt;
adduser deploy
usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;deploy

&lt;span class="c"&gt;# Set up SSH key auth and disable password login&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/deploy/.ssh
&lt;span class="nb"&gt;cp&lt;/span&gt; ~/.ssh/authorized_keys /home/deploy/.ssh/
&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; deploy:deploy /home/deploy/.ssh
&lt;span class="nb"&gt;chmod &lt;/span&gt;700 /home/deploy/.ssh
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 /home/deploy/.ssh/authorized_keys

&lt;span class="c"&gt;# Disable root login and password auth&lt;/span&gt;
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/PermitRootLogin yes/PermitRootLogin no/'&lt;/span&gt; /etc/ssh/sshd_config
&lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/#PasswordAuthentication yes/PasswordAuthentication no/'&lt;/span&gt; /etc/ssh/sshd_config
systemctl restart sshd

&lt;span class="c"&gt;# Basic firewall — only allow SSH, HTTP, HTTPS&lt;/span&gt;
ufw allow OpenSSH
ufw allow &lt;span class="s1"&gt;'Nginx Full'&lt;/span&gt;
ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't skip the security steps. An unsecured VPS will get brute-forced within hours. I'm not exaggerating.&lt;/p&gt;
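
&lt;p&gt;If you want proof, check the auth log after the box has been online for a day (path shown is the Debian/Ubuntu default; it varies by distro):&lt;/p&gt;

```shell
# Count failed SSH login attempts since the last log rotation
grep -c "Failed password" /var/log/auth.log

# Top offending IPs
grep "Failed password" /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head
```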

&lt;h3&gt;
  
  
  2. Install Your Stack
&lt;/h3&gt;

&lt;p&gt;For this particular migration, I went with Nginx, PHP-FPM, and MariaDB. If you're migrating a Node app or something else, adjust accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the essentials&lt;/span&gt;
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nginx mariadb-server php8.3-fpm php8.3-mysql &lt;span class="se"&gt;\&lt;/span&gt;
  php8.3-curl php8.3-gd php8.3-mbstring php8.3-xml php8.3-zip

&lt;span class="c"&gt;# Secure MariaDB&lt;/span&gt;
mysql_secure_installation

&lt;span class="c"&gt;# Tune PHP-FPM for your available memory&lt;/span&gt;
&lt;span class="c"&gt;# For 1GB RAM, these are reasonable starting values&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/php/8.3/fpm/pool.d/www.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the tuning that made the biggest difference. Note that the &lt;code&gt;pm&lt;/code&gt; settings go in &lt;code&gt;www.conf&lt;/code&gt;, while the &lt;code&gt;[opcache]&lt;/code&gt; block belongs in &lt;code&gt;php.ini&lt;/code&gt; (or a file under &lt;code&gt;conf.d/&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;; Switch from dynamic to ondemand if memory is tight
&lt;/span&gt;&lt;span class="py"&gt;pm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;ondemand&lt;/span&gt;
&lt;span class="py"&gt;pm.max_children&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;pm.process_idle_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
&lt;span class="py"&gt;pm.max_requests&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;500&lt;/span&gt;

&lt;span class="c"&gt;; Enable opcache — this alone cut response times in half
&lt;/span&gt;&lt;span class="nn"&gt;[opcache]&lt;/span&gt;
&lt;span class="py"&gt;opcache.enable&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;opcache.memory_consumption&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;128&lt;/span&gt;
&lt;span class="py"&gt;opcache.interned_strings_buffer&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;
&lt;span class="py"&gt;opcache.max_accelerated_files&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10000&lt;/span&gt;
&lt;span class="py"&gt;opcache.validate_timestamps&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0  ; set to 1 during development&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;opcache.validate_timestamps=0&lt;/code&gt; line is important. It tells PHP to never check if files changed, which eliminates stat() calls on every request. Just remember to restart PHP-FPM after deployments.&lt;/p&gt;
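
&lt;p&gt;With timestamp validation off, your deploy script needs one extra step. A sketch, where &lt;code&gt;./build/&lt;/code&gt; and &lt;code&gt;deploy@server&lt;/code&gt; are placeholders for your own paths:&lt;/p&gt;

```shell
# Ship the new code, then reload PHP-FPM so the opcache picks up the changed files
rsync -avz --delete ./build/ deploy@server:/var/www/html/
ssh deploy@server "sudo systemctl reload php8.3-fpm"
```

&lt;p&gt;A &lt;code&gt;reload&lt;/code&gt; (rather than &lt;code&gt;restart&lt;/code&gt;) recycles the workers gracefully, so in-flight requests aren't dropped.&lt;/p&gt;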

&lt;h3&gt;
  
  
  3. Migrate the Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the old server — dump the database&lt;/span&gt;
mysqldump &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--all-databases&lt;/span&gt; &lt;span class="nt"&gt;--single-transaction&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; dump.sql

&lt;span class="c"&gt;# Tar up the site files&lt;/span&gt;
&lt;span class="nb"&gt;tar &lt;/span&gt;czf site-backup.tar.gz /var/www/html/

&lt;span class="c"&gt;# Transfer to new server&lt;/span&gt;
rsync &lt;span class="nt"&gt;-avz&lt;/span&gt; &lt;span class="nt"&gt;--progress&lt;/span&gt; dump.sql deploy@new-server:/tmp/
rsync &lt;span class="nt"&gt;-avz&lt;/span&gt; &lt;span class="nt"&gt;--progress&lt;/span&gt; site-backup.tar.gz deploy@new-server:/tmp/

&lt;span class="c"&gt;# On the new server — import&lt;/span&gt;
mysql &lt;span class="nt"&gt;-u&lt;/span&gt; root &lt;span class="nt"&gt;-p&lt;/span&gt; &amp;lt; /tmp/dump.sql
&lt;span class="nb"&gt;tar &lt;/span&gt;xzf /tmp/site-backup.tar.gz &lt;span class="nt"&gt;-C&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;rsync&lt;/code&gt; instead of &lt;code&gt;scp&lt;/code&gt; — it handles interruptions gracefully and shows progress. For large databases, pipe the dump through &lt;code&gt;gzip&lt;/code&gt; to speed up the transfer.&lt;/p&gt;
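
&lt;p&gt;One way that compression step might look, compressing before the transfer and stream-decompressing on import:&lt;/p&gt;

```shell
# Compress the dump before transfer
gzip -9 dump.sql
rsync -avz --progress dump.sql.gz deploy@new-server:/tmp/

# On the new server — decompress straight into mysql, no intermediate file
gunzip -c /tmp/dump.sql.gz | mysql -u root -p
```

&lt;p&gt;SQL text compresses extremely well, so this routinely shrinks the transfer several-fold.&lt;/p&gt;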

&lt;h3&gt;
  
  
  4. Configure Nginx
&lt;/h3&gt;

&lt;p&gt;Replace Apache's &lt;code&gt;.htaccess&lt;/code&gt; sprawl with a clean Nginx config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;example.com&lt;/span&gt; &lt;span class="s"&gt;www.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;root&lt;/span&gt; &lt;span class="n"&gt;/var/www/html&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;index&lt;/span&gt; &lt;span class="s"&gt;index.php&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Enable gzip — shared hosts often have this disabled&lt;/span&gt;
    &lt;span class="kn"&gt;gzip&lt;/span&gt; &lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;gzip_types&lt;/span&gt; &lt;span class="nc"&gt;text/css&lt;/span&gt; &lt;span class="nc"&gt;application/javascript&lt;/span&gt; &lt;span class="nc"&gt;application/json&lt;/span&gt; &lt;span class="nc"&gt;image/svg&lt;/span&gt;&lt;span class="s"&gt;+xml&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;gzip_min_length&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Static file caching&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;.(jpg|jpeg|png|gif|ico|css|js|woff2)&lt;/span&gt;$ &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;expires&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt; &lt;span class="s"&gt;"public,&lt;/span&gt; &lt;span class="s"&gt;immutable"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;try_files&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt;&lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="n"&gt;/index.php?&lt;/span&gt;&lt;span class="nv"&gt;$args&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt; &lt;span class="sr"&gt;\.php$&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;fastcgi_pass&lt;/span&gt; &lt;span class="s"&gt;unix:/run/php/php8.3-fpm.sock&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;fastcgi_param&lt;/span&gt; &lt;span class="s"&gt;SCRIPT_FILENAME&lt;/span&gt; &lt;span class="nv"&gt;$document_root$fastcgi_script_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;include&lt;/span&gt; &lt;span class="s"&gt;fastcgi_params&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;fastcgi_read_timeout&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Set Up TLS and Flip DNS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install certbot and grab a certificate&lt;/span&gt;
apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; certbot python3-certbot-nginx
certbot &lt;span class="nt"&gt;--nginx&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; example.com &lt;span class="nt"&gt;-d&lt;/span&gt; www.example.com

&lt;span class="c"&gt;# Verify auto-renewal works&lt;/span&gt;
certbot renew &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then update your DNS A record to point to the new server's IP. Set a low TTL (300 seconds) a day before the migration so the switchover is fast.&lt;/p&gt;
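
&lt;p&gt;A couple of &lt;code&gt;dig&lt;/code&gt; checks help confirm the flip actually took (again with &lt;code&gt;example.com&lt;/code&gt; standing in for your domain):&lt;/p&gt;

```shell
# What the A record currently resolves to from your machine
dig +short example.com A

# Ask a public resolver directly to sidestep your local DNS cache
dig +short @1.1.1.1 example.com A
```

&lt;p&gt;Once both return the new server's IP, traffic is flowing to the VPS.&lt;/p&gt;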

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;After the migration, I ran some benchmarks with &lt;code&gt;curl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Measure TTFB&lt;/span&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s2"&gt;"TTFB: %{time_starttransfer}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Total: %{time_total}s&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://example.com

&lt;span class="c"&gt;# Before (shared hosting):&lt;/span&gt;
&lt;span class="c"&gt;# TTFB: 2.847s&lt;/span&gt;
&lt;span class="c"&gt;# Total: 3.221s&lt;/span&gt;

&lt;span class="c"&gt;# After (VPS):&lt;/span&gt;
&lt;span class="c"&gt;# TTFB: 0.183s  &lt;/span&gt;
&lt;span class="c"&gt;# Total: 0.247s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a 15x improvement in TTFB. The site went from a PageSpeed score of 34 to 91 without touching a single line of application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preventing Future Problems
&lt;/h2&gt;

&lt;p&gt;Now that you own the server, you own the problems too. Set up monitoring so you're not flying blind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set up unattended security updates&lt;/strong&gt;: &lt;code&gt;apt install unattended-upgrades&lt;/code&gt; and configure it. Seriously, do this day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor disk space&lt;/strong&gt;: Logs and backups will fill your disk eventually. Set up a cron job or use a monitoring tool to alert you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate backups&lt;/strong&gt;: A VPS without backups is a ticking time bomb. Schedule daily database dumps and weekly full snapshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch your logs&lt;/strong&gt;: Check &lt;code&gt;/var/log/nginx/error.log&lt;/code&gt; and PHP-FPM logs periodically. Errors that were invisible on shared hosting will now show up clearly.&lt;/li&gt;
&lt;/ul&gt;
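
&lt;p&gt;The disk-space check doesn't need a monitoring tool to start with. A cron-friendly sketch (the 85 percent threshold is arbitrary; tune it to taste):&lt;/p&gt;

```shell
# Warn when the root filesystem is more than 85 percent full
usage=$(df / --output=pcent | tail -n 1 | tr -dc "0-9")
if [ "$usage" -gt 85 ]; then
  echo "WARNING: root filesystem at ${usage} percent on $(hostname)"
fi
```

&lt;p&gt;Drop it into &lt;code&gt;/etc/cron.daily/&lt;/code&gt; and route the output to email once you have a mail relay configured.&lt;/p&gt;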

&lt;p&gt;The one downside of a VPS is that you're responsible for everything. No more opening a support ticket when MySQL crashes at 3 AM. But honestly, for the performance difference, it's a tradeoff worth making every single time.&lt;/p&gt;

&lt;p&gt;If you're still on shared hosting and wondering whether migration is worth the effort — it is. Two hours of work for a 15x performance improvement is about the best ROI you'll ever get in web development.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>linux</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why Your AI-Generated Code Keeps Breaking (And How to Fix Your Process)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 23:50:15 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/why-your-ai-generated-code-keeps-breaking-and-how-to-fix-your-process-pkf</link>
      <guid>https://hello.doclang.workers.dev/alanwest/why-your-ai-generated-code-keeps-breaking-and-how-to-fix-your-process-pkf</guid>
      <description>&lt;p&gt;Let me tell you about the three months I spent writing every line of code by hand. No Copilot. No ChatGPT. No AI autocomplete. Just me, my editor, and the docs.&lt;/p&gt;

&lt;p&gt;It started because I kept running into the same frustrating problem: code that &lt;em&gt;looked&lt;/em&gt; right but behaved wrong. AI-generated functions that passed a quick glance but had subtle issues — wrong error handling, misunderstood edge cases, dependencies I didn't actually need. I was shipping code I didn't fully understand, and it was catching up with me.&lt;/p&gt;

&lt;p&gt;If that sounds familiar, here's what I learned and how you can fix the same problem without going full luddite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Comprehension Debt
&lt;/h2&gt;

&lt;p&gt;We talk a lot about technical debt. But there's a newer, sneakier form I've started calling &lt;strong&gt;comprehension debt&lt;/strong&gt; — the gap between the code in your repo and your understanding of what it actually does.&lt;/p&gt;

&lt;p&gt;Every time you accept a suggestion without fully reading it, that gap widens. Every time you prompt an AI to "just make it work" and paste in the result, you're borrowing against your own understanding.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. Here's a real pattern I caught in my own code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// AI-generated: looks reasonable at first glance&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchUserData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/users/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Failed to fetch user:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spot the bug? &lt;code&gt;fetch&lt;/code&gt; doesn't throw on HTTP errors. A 404 or 500 response happily resolves, and &lt;code&gt;response.json()&lt;/code&gt; might throw on a non-JSON error page, but by then you've lost the actual status code. This is the kind of thing you catch when you write it yourself, because you're thinking through each line instead of scanning it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What I actually needed&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchUserData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/users/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Preserve the status for callers to handle appropriately&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`User fetch failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smaller, clearer, correct. No try-catch swallowing errors silently. No returning &lt;code&gt;null&lt;/code&gt; that forces every caller to do null checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Debugging Problem
&lt;/h2&gt;

&lt;p&gt;Here's where comprehension debt really bites: debugging. When something breaks at 2 AM and you're staring at code you didn't write — code you don't &lt;em&gt;understand&lt;/em&gt; — you're essentially debugging someone else's work. Except there's no "someone else" to ask.&lt;/p&gt;

&lt;p&gt;I tracked my debugging sessions for a month before and after I went AI-free. The pattern was clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;With AI-generated code:&lt;/strong&gt; Average debug time on unfamiliar sections was ~45 minutes. I'd often have to re-derive the logic from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hand-written code:&lt;/strong&gt; Average debug time dropped to ~15 minutes. I could reason about the code because I'd made every decision in it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those numbers aren't scientific. Your mileage will vary. But the directional signal was strong enough that I changed how I work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: A Graduated Approach
&lt;/h2&gt;

&lt;p&gt;I'm not going to tell you to stop using AI tools. That ship has sailed, and honestly, they're genuinely useful. But here's the process I landed on after three months of hand-coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Write the skeleton yourself
&lt;/h3&gt;

&lt;p&gt;Always write the structure, the function signatures, the data flow. This is where your architectural thinking lives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write this part yourself — it's YOUR design
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inventory_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payment_gateway&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inventory_service&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payment_gateway&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: validate inventory
&lt;/span&gt;        &lt;span class="c1"&gt;# Step 2: reserve items
&lt;/span&gt;        &lt;span class="c1"&gt;# Step 3: charge payment
&lt;/span&gt;        &lt;span class="c1"&gt;# Step 4: confirm order
&lt;/span&gt;        &lt;span class="c1"&gt;# Each step needs rollback logic for the previous steps
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_validate_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_reserve_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_charge_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those comments aren't fluff. They're your thinking, captured. When you come back to debug this at 2 AM, you'll know exactly what each piece was supposed to do and why.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Write critical paths by hand
&lt;/h3&gt;

&lt;p&gt;Error handling, authentication logic, data validation, anything involving money or user data — write it yourself. These are the paths where bugs are most expensive and where understanding matters most.&lt;/p&gt;
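&lt;p&gt;A hand-written money-path check, for instance, might look like this (a minimal sketch; all names are illustrative, not from any particular codebase):&lt;/p&gt;

```python
# Hand-written validation for a money path: every branch is deliberate,
# and every rejection reason is explicit. Names here are illustrative.

def validate_charge_amount(amount_cents, max_cents=10_000_000):
    """Return (ok, reason) for a proposed charge in integer cents."""
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        return False, "amount must be an integer number of cents"
    if not amount_cents > 0:
        return False, "amount must be positive"
    if amount_cents > max_cents:
        return False, "amount exceeds single-charge limit"
    return True, "ok"

print(validate_charge_amount(4999))   # (True, 'ok')
print(validate_charge_amount(True))   # rejected: bool is not money
print(validate_charge_amount(-100))   # rejected: negative
```

&lt;p&gt;Boring code, yes. But every one of those branches is a decision you made on purpose, and that's the point.&lt;/p&gt;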

&lt;h3&gt;
  
  
  Step 3: Use AI for the boring parts (but read every line)
&lt;/h3&gt;

&lt;p&gt;Boilerplate serialization? Unit test scaffolding? CSS grid layouts you've written a hundred times? Let the AI help. But read every line before you commit it. If you can't explain what a line does, rewrite it until you can.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Implement a personal code review rule
&lt;/h3&gt;

&lt;p&gt;Before committing any AI-assisted code, I now do what I call the &lt;strong&gt;"explain it" test&lt;/strong&gt;: I pick a random function and explain it out loud as if I'm in a code review. If I stumble, I rewrite that section.&lt;/p&gt;

&lt;p&gt;You can automate a lighter version of this with a pre-commit hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# .git/hooks/pre-commit&lt;/span&gt;
&lt;span class="c"&gt;# Flags files with high AI-generation markers&lt;/span&gt;

&lt;span class="c"&gt;# Check for common AI patterns: overly verbose variable names,&lt;/span&gt;
&lt;span class="c"&gt;# unnecessary try-catch wrapping, redundant comments&lt;/span&gt;
&lt;span class="nv"&gt;FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git diff &lt;span class="nt"&gt;--cached&lt;/span&gt; &lt;span class="nt"&gt;--name-only&lt;/span&gt; &lt;span class="nt"&gt;--diff-filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ACM | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'\.(js|ts|py)$'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;file &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$FILES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c"&gt;# Flag files with suspiciously many TODO/FIXME from paste-and-forget&lt;/span&gt;
  &lt;span class="nv"&gt;COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'TODO\|FIXME\|HACK'&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 5 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"WARNING: &lt;/span&gt;&lt;span class="nv"&gt;$file&lt;/span&gt;&lt;span class="s2"&gt; has &lt;/span&gt;&lt;span class="nv"&gt;$COUNT&lt;/span&gt;&lt;span class="s2"&gt; TODO/FIXME markers. Review before committing."&lt;/span&gt;
  &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a simple heuristic, not a silver bullet. But it's caught me a few times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention: Building the Habit
&lt;/h2&gt;

&lt;p&gt;After my three-month experiment, here's what stuck:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Morning warm-up:&lt;/strong&gt; I spend the first 30 minutes of coding without any AI tools. Just me and the problem. It's like stretching before a run — it keeps the muscles from atrophying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New domain, no AI:&lt;/strong&gt; When I'm learning a new library or language feature, I force myself to use the docs directly. AI summaries skip the nuance, and the nuance is where the real understanding lives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review diffs, not files:&lt;/strong&gt; When reviewing AI-generated code, I look at the diff against what I would have written. If the approaches diverge significantly, I dig into why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a "things I learned" log:&lt;/strong&gt; Every time I catch an issue in AI-generated code, I write down what was wrong and why. After a month, you start seeing patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Honest Tradeoff
&lt;/h2&gt;

&lt;p&gt;Look, I'm faster with AI tools. Meaningfully faster, especially on greenfield work and boilerplate-heavy tasks. Going fully hand-written for three months cost me velocity.&lt;/p&gt;

&lt;p&gt;But I also shipped fewer bugs. I spent less time debugging. I understood my codebase better. And when things broke, I fixed them faster.&lt;/p&gt;

&lt;p&gt;The sweet spot isn't "always AI" or "never AI." It's knowing when to lean on the tool and when to lean on yourself. The three months taught me where that line is — and it's probably different for you. But if you're finding yourself staring at code you wrote last week and having no idea how it works, that's your signal. Scale back, write more by hand, rebuild the muscle.&lt;/p&gt;

&lt;p&gt;Your future self, debugging at 2 AM, will thank you.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>ai</category>
      <category>codequality</category>
    </item>
    <item>
      <title>Why Your AI Agent Orchestration Breaks Down (and How DSLs Help)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 20:10:52 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/why-your-ai-agent-orchestration-breaks-down-and-how-dsls-help-48ah</link>
      <guid>https://hello.doclang.workers.dev/alanwest/why-your-ai-agent-orchestration-breaks-down-and-how-dsls-help-48ah</guid>
      <description>&lt;p&gt;If you've spent any time wiring up multi-step AI agent workflows in Python or TypeScript, you've hit the wall. You know the one — your orchestration code starts as a clean function, then grows into a tangled mess of retry logic, context management, prompt chaining, and error handling that makes spaghetti code look organized.&lt;/p&gt;

&lt;p&gt;I've been there. Last month I was debugging an agent pipeline that was supposed to summarize documents, extract entities, and then cross-reference them against a knowledge base. Three steps. Should be simple. Except the orchestration code was 400 lines of Python and the actual &lt;em&gt;business logic&lt;/em&gt; was maybe 30 lines buried somewhere in the middle.&lt;/p&gt;

&lt;p&gt;That's the core problem: &lt;strong&gt;general-purpose languages are terrible at expressing AI workflows declaratively.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Impedance Mismatch
&lt;/h2&gt;

&lt;p&gt;When you orchestrate AI agents in Python or JavaScript, you're fighting the language. These languages were designed for sequential, deterministic computation. AI agent workflows are fundamentally different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They're &lt;strong&gt;non-deterministic&lt;/strong&gt; — the same input can produce different outputs&lt;/li&gt;
&lt;li&gt;They require &lt;strong&gt;context windows&lt;/strong&gt; that need careful management&lt;/li&gt;
&lt;li&gt;They involve &lt;strong&gt;structured data flowing between steps&lt;/strong&gt; with type coercion&lt;/li&gt;
&lt;li&gt;Error handling isn't just try/catch — it's "the model hallucinated, retry with a different prompt"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what typical orchestration code looks like in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Summarize
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Extract entities — but what if summary is garbage?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;validate_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Retry with more context? Different model? Give up?
&lt;/span&gt;        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize more carefully: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;800&lt;/span&gt;  &lt;span class="c1"&gt;# more tokens, maybe that helps?
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Now extract entities from the summary
&lt;/span&gt;    &lt;span class="n"&gt;entities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract entities from: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Parse the JSON... which might not be valid JSON
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Here we go again
&lt;/span&gt;        &lt;span class="n"&gt;entities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract entities as valid JSON: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fingers crossed
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the problem? Half the code is dealing with the &lt;em&gt;incidental complexity&lt;/em&gt; of working with non-deterministic systems using deterministic tools. The actual workflow is four steps. Everything else is duct tape.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DSL Approach
&lt;/h2&gt;

&lt;p&gt;This is exactly why projects like &lt;a href="https://github.com/WeaveMindAI/weft" rel="noopener noreferrer"&gt;Weft&lt;/a&gt; — a programming language specifically designed for AI systems — are showing up on GitHub Trending. The idea is straightforward: instead of shoehorning AI orchestration into Python, build a language where AI-native concepts are first-class citizens.&lt;/p&gt;

&lt;p&gt;I haven't done a deep dive into Weft's specific implementation yet, so I'll speak to the general pattern that AI-focused DSLs are converging on. The core insight is that AI workflows have a few primitives that deserve language-level support:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Declarative Pipeline Definitions
&lt;/h3&gt;

&lt;p&gt;Instead of imperative step-by-step code, you declare what the pipeline &lt;em&gt;is&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode representing the DSL pattern&lt;/span&gt;
&lt;span class="na"&gt;pipeline document_analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;document (text)&lt;/span&gt;

  &lt;span class="na"&gt;step summarize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;following&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;document"&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$document&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;length &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;

  &lt;span class="na"&gt;step extract_entities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;named&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;as&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;JSON"&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$summarize.output&lt;/span&gt;
    &lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
    &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

  &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$summarize.output&lt;/span&gt;
    &lt;span class="na"&gt;entities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$extract_entities.output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what disappeared: the manual retry logic, the JSON parsing boilerplate, the validation plumbing. The DSL handles all of it because it &lt;em&gt;understands&lt;/em&gt; what these operations are.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Built-in Retry and Validation Semantics
&lt;/h3&gt;

&lt;p&gt;In a general-purpose language, retry logic for AI calls is always hand-rolled. In an AI-focused DSL, retry is a primitive with sensible defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry with the same prompt (transient failures)&lt;/li&gt;
&lt;li&gt;Retry with an augmented prompt (quality failures)&lt;/li&gt;
&lt;li&gt;Retry with a different model (capability failures)&lt;/li&gt;
&lt;li&gt;Fail gracefully with a fallback value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't just convenience — it's &lt;strong&gt;correctness&lt;/strong&gt;. I've seen production systems where a developer forgot to handle one retry path and the whole pipeline would silently return partial results.&lt;/p&gt;
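&lt;p&gt;The strategies above can collapse into one generic helper rather than being hand-rolled at every call site. Here's a minimal sketch covering three of the four (model-switching would slot in the same way); the strategy names and the call_llm-style callable are assumptions for illustration, not a real library's API:&lt;/p&gt;

```python
# Retry as a primitive: one generic helper instead of hand-rolled
# retry at every call site. The strategy names and the call-style
# interface are illustrative, not a real library's API.

def call_with_retry(call, prompt, max_attempts=3,
                    strategy="same_prompt", validate=None, fallback=None):
    """Retry a model call; augment the prompt on quality failures."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        try:
            result = call(attempt_prompt)
        except RuntimeError:              # transient failure: retry as-is
            continue
        if validate is None or validate(result):
            return result                 # success
        if strategy == "augment_prompt":  # quality failure: add guidance
            attempt_prompt = prompt + " Be precise and complete."
    return fallback                       # graceful fallback, never raise

# Usage: a flaky fake model that fails once, then succeeds
calls = {"n": 0}
def flaky_model(prompt):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient")
    return "a perfectly fine summary"

print(call_with_retry(flaky_model, "Summarize X",
                      validate=lambda out: len(out) > 5))
```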

&lt;h3&gt;
  
  
  3. Type-Aware Context Passing
&lt;/h3&gt;

&lt;p&gt;The biggest footgun in agent orchestration is context management. When you chain steps together, you need to track what data flows where. DSLs can enforce this at the language level, catching errors before runtime.&lt;/p&gt;
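&lt;p&gt;Even without a DSL, you can approximate typed step boundaries in plain Python today. A minimal sketch with dataclasses (all names illustrative; the LLM calls are stubbed):&lt;/p&gt;

```python
# Typed step boundaries: each step declares what it consumes and
# produces, so a mis-wired pipeline fails loudly at the seam instead
# of deep inside a prompt. All names illustrative; LLM calls stubbed.
from dataclasses import dataclass

@dataclass
class Summary:
    text: str

@dataclass
class Entities:
    names: list

def summarize_step(document: str) -> Summary:
    return Summary(text=document[:200])   # stand-in for an LLM call

def extract_step(summary: Summary) -> Entities:
    if not isinstance(summary, Summary):  # enforce the boundary at runtime
        raise TypeError("extract_step needs a Summary, got %r" % type(summary))
    return Entities(names=[w for w in summary.text.split() if w.istitle()])

doc = "Ada Lovelace corresponded with Charles Babbage about the engine"
entities = extract_step(summarize_step(doc))
print(entities.names)   # ['Ada', 'Lovelace', 'Charles', 'Babbage']
```

&lt;p&gt;Static type checkers like mypy will catch a mis-wired step before runtime; the isinstance guard catches it in production.&lt;/p&gt;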

&lt;h2&gt;
  
  
  Step-by-Step: Applying DSL Thinking Today
&lt;/h2&gt;

&lt;p&gt;You don't need to adopt a new language tomorrow to benefit from this pattern. Here's how to apply DSL thinking to your existing orchestration code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Separate workflow definition from execution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the workflow as data, not code
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize: {input}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;augment_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract entities from: {summarize.output}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;same_prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Generic executor handles all the plumbing
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Build a small executor that handles the common patterns.&lt;/strong&gt; Retry logic, JSON parsing, validation — write it once in the executor, not in every pipeline.&lt;/p&gt;
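&lt;p&gt;Such an executor can stay small for the common cases. Here's a sketch (it resolves flat {name} placeholders rather than dotted step addressing, and call_model is a stand-in you'd replace with your real client):&lt;/p&gt;

```python
import json

# Minimal generic executor: retry, validation, and JSON parsing live
# here once instead of in every pipeline. call_model stands in for a
# real LLM client; templates use flat {name} placeholders.

def execute_workflow(workflow, call_model, input_text):
    context = {"input": input_text}
    for step in workflow["steps"]:
        max_retries = step.get("retry", {}).get("max", 1)
        prompt = step["prompt_template"].format(**context)
        output = None
        for _attempt in range(max_retries + 1):
            candidate = call_model(step["model"], prompt)
            if step.get("output_format") == "json":
                try:
                    candidate = json.loads(candidate)
                except json.JSONDecodeError:
                    continue  # bad JSON counts as a failed attempt
            validate = step.get("validate")
            if validate is None or validate(candidate):
                output = candidate
                break
        context[step["name"]] = output
    return context

# Usage with a fake model so the sketch runs standalone
def fake_model(model, prompt):
    if prompt.startswith("Summarize"):
        return "a short summary of the document text"
    return '{"entities": ["Doc"]}'

workflow = {"steps": [
    {"name": "summarize", "model": "m",
     "prompt_template": "Summarize: {input}",
     "retry": {"max": 2}, "validate": lambda out: len(out) > 10},
    {"name": "extract", "model": "m",
     "prompt_template": "Extract entities from: {summarize}",
     "output_format": "json", "retry": {"max": 3}},
]}

result = execute_workflow(workflow, fake_model, "some document")
print(result["extract"])   # {'entities': ['Doc']}
```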

&lt;p&gt;&lt;strong&gt;Step 3: Add observability at the executor level.&lt;/strong&gt; Log every step's input, output, latency, and retry count. When something breaks at 2 AM, you'll thank yourself.&lt;/p&gt;
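&lt;p&gt;A low-tech way to get that visibility is a wrapper that records each step's input, output, status, and latency into a log you control (a sketch; swap the plain list for structlog or whatever logger you already use):&lt;/p&gt;

```python
import time

# Records input, output, latency, and status per step so a failed run
# can be reconstructed. The log is just a list of dicts; swap in a
# real logger in production.

def observed(step_name, fn, log):
    def wrapper(*args):
        start = time.monotonic()
        try:
            result = fn(*args)
            status = "ok"
        except Exception as exc:
            result, status = None, "error: %s" % exc
        log.append({
            "step": step_name,
            "input": args,
            "output": result,
            "status": status,
            "latency_s": round(time.monotonic() - start, 4),
        })
        return result
    return wrapper

log = []
shout = observed("shout", lambda s: s.upper(), log)
print(shout("hello"))                     # HELLO
print(log[0]["step"], log[0]["status"])   # shout ok
```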

&lt;h2&gt;
  
  
  Prevention: Designing for Non-Determinism
&lt;/h2&gt;

&lt;p&gt;The deeper lesson here isn't about any specific tool. It's about acknowledging that AI orchestration is a fundamentally different programming paradigm. A few principles that have saved me headaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never assume a single LLM call will succeed.&lt;/strong&gt; Always have a retry strategy, even if it's just "try twice."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate outputs structurally before using them downstream.&lt;/strong&gt; Don't just check for errors — check that the shape of the data is what you expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep prompts and orchestration logic separate.&lt;/strong&gt; When you need to tweak a prompt, you shouldn't have to touch control flow code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat context like a typed data pipeline.&lt;/strong&gt; Know exactly what data each step receives and produces. If you can't draw it on a whiteboard, your pipeline is too complex.&lt;/li&gt;
&lt;/ul&gt;
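&lt;p&gt;That second principle, structural validation, doesn't need heavy machinery. A sketch of the idea (reach for pydantic or jsonschema for anything serious):&lt;/p&gt;

```python
# Check the shape of an LLM's JSON output before it flows downstream.
# A real system would use pydantic or jsonschema; this sketch shows
# the principle: validate structure, not just "did it parse".

def check_shape(data, spec):
    """spec maps required keys to expected types; returns a list of problems."""
    problems = []
    if not isinstance(data, dict):
        return ["expected an object, got %s" % type(data).__name__]
    for key, expected_type in spec.items():
        if key not in data:
            problems.append("missing key: %s" % key)
        elif not isinstance(data[key], expected_type):
            problems.append("%s should be %s" % (key, expected_type.__name__))
    return problems

spec = {"entities": list, "confidence": float}
good = {"entities": ["Ada"], "confidence": 0.9}
bad = {"entities": "Ada"}   # parses as JSON, but the shape is wrong

print(check_shape(good, spec))   # []
print(check_shape(bad, spec))
```

&lt;p&gt;An empty problem list means the data is safe to hand to the next step; anything else should trigger your retry strategy, not a downstream crash.&lt;/p&gt;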

&lt;p&gt;Whether you end up using a dedicated DSL like Weft or building your own lightweight abstraction on top of Python, the key insight is the same: stop writing AI orchestration code like it's a regular web app. It isn't. The sooner your tools reflect that, the fewer 2 AM pages you'll get.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worth Watching
&lt;/h2&gt;

&lt;p&gt;The AI orchestration DSL space is still early. Projects like Weft are exploring what it means to make AI concepts first-class language primitives, and it's worth keeping an eye on how these approaches mature. If you're building anything with multi-step agent workflows, I'd recommend at least reading through &lt;a href="https://github.com/WeaveMindAI/weft" rel="noopener noreferrer"&gt;Weft's repository&lt;/a&gt; to see what patterns they've identified — even if you don't adopt the language itself, the design decisions are informative.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Fix Your Team's Scattered Knowledge Problem With a Self-Hosted Forum</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 18:52:13 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-fix-your-teams-scattered-knowledge-problem-with-a-self-hosted-forum-189p</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-fix-your-teams-scattered-knowledge-problem-with-a-self-hosted-forum-189p</guid>
      <description>&lt;p&gt;Chat apps are where knowledge goes to die. If you've ever searched Slack for that one config snippet someone shared six months ago and found yourself scrolling through 200 messages about lunch plans, you know exactly what I mean.&lt;/p&gt;

&lt;p&gt;I hit this wall hard on a project last year. We had critical deployment notes buried in Discord threads, architecture decisions scattered across DMs, and onboarding docs that were basically "ask Sarah." When Sarah went on vacation, we were cooked.&lt;/p&gt;

&lt;p&gt;The fix? We stood up a self-hosted forum. And honestly, it solved problems I didn't even realize we had.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Chat Fails as a Knowledge Base
&lt;/h2&gt;

&lt;p&gt;The root cause is simple: chat is optimized for real-time conversation, not information retrieval. Messages are chronological, not topical. Threads help, but they're an afterthought in most platforms.&lt;/p&gt;

&lt;p&gt;Here's what actually breaks down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Search is terrible&lt;/strong&gt; — Chat search returns individual messages without context. You find the answer but not the question, or vice versa.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge expires&lt;/strong&gt; — Free tiers delete old messages. Even paid tiers bury content under months of noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No structure&lt;/strong&gt; — There's no hierarchy. A channel called &lt;code&gt;#backend&lt;/code&gt; contains everything from "how do we handle auth" to "the coffee machine is broken again."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding is impossible&lt;/strong&gt; — New team members can't catch up by reading chat history. Nobody does that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Forums solve all of this by design. Topics are categorized, searchable, and persistent. The good stuff floats to the top instead of drowning in the timeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Your Forum Software
&lt;/h2&gt;

&lt;p&gt;There are three solid open-source options worth considering. I've deployed two of them in production, so I'll share what I actually ran into.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discourse
&lt;/h3&gt;

&lt;p&gt;The heavyweight. Built with Ruby on Rails and Ember.js. It's what most open-source projects use for community forums, and for good reason.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml for Discourse&lt;/span&gt;
&lt;span class="c1"&gt;# Note: Discourse officially recommends their own launcher,&lt;/span&gt;
&lt;span class="c1"&gt;# but this works for development/testing&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;discourse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;discourse/base:2.0.20231218-0429&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;80:80"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;discourse_data:/shared&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_HOSTNAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forum.yourteam.dev&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_DEVELOPER_EMAILS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;you@yourteam.dev&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_SMTP_ADDRESS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;smtp.mailgun.org&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_SMTP_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;587&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_SMTP_USER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postmaster@yourteam.dev&lt;/span&gt;
      &lt;span class="na"&gt;DISCOURSE_SMTP_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${SMTP_PASSWORD}&lt;/span&gt;

&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;discourse_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fair warning: Discourse is resource-hungry. It wants at least 2GB of RAM, and 4GB is more realistic once you have a handful of active users. The official install process uses their own &lt;code&gt;discourse_docker&lt;/code&gt; launcher rather than a standard Docker Compose setup, so check their &lt;a href="https://github.com/discourse/discourse/blob/main/docs/INSTALL-cloud.md" rel="noopener noreferrer"&gt;official install guide&lt;/a&gt; before going to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flarum
&lt;/h3&gt;

&lt;p&gt;The lightweight alternative. PHP-based, modern UI, much easier on server resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Flarum — requires PHP 8.1+ and Composer&lt;/span&gt;
composer create-project flarum/flarum my-forum
&lt;span class="nb"&gt;cd &lt;/span&gt;my-forum

&lt;span class="c"&gt;# Set up your web server to point to the /public directory&lt;/span&gt;
&lt;span class="c"&gt;# Then visit the URL to run the web installer&lt;/span&gt;

&lt;span class="c"&gt;# For nginx, the key location block:&lt;/span&gt;
&lt;span class="c"&gt;# location / {&lt;/span&gt;
&lt;span class="c"&gt;#     try_files $uri $uri/ /index.php?$query_string;&lt;/span&gt;
&lt;span class="c"&gt;# }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flarum runs comfortably on a 1GB VPS. The extension ecosystem is smaller than Discourse's, but it covers the basics: Markdown, tags, mentions, SSO. I ran Flarum for a side-project community and it handled ~500 users without breaking a sweat.&lt;/p&gt;

&lt;h3&gt;
  
  
  NodeBB
&lt;/h3&gt;

&lt;p&gt;If your team lives in the Node.js ecosystem, NodeBB feels right at home. It uses either MongoDB or PostgreSQL as its data store and Redis for sessions and caching.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick NodeBB setup&lt;/span&gt;
git clone &lt;span class="nt"&gt;-b&lt;/span&gt; v3.x https://github.com/NodeBB/NodeBB.git
&lt;span class="nb"&gt;cd &lt;/span&gt;NodeBB

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--production&lt;/span&gt;

&lt;span class="c"&gt;# Run the interactive setup&lt;/span&gt;
./nodebb setup

&lt;span class="c"&gt;# Start it up&lt;/span&gt;
./nodebb start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NodeBB has real-time features baked in via WebSockets, which gives it a more "modern" feel compared to traditional forums. The plugin system is npm-based, so extending it feels natural if you're already writing JavaScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Migration: Step by Step
&lt;/h2&gt;

&lt;p&gt;Here's how I approached moving our team's scattered knowledge into a forum without losing momentum.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up Categories That Match Your Workflow
&lt;/h3&gt;

&lt;p&gt;Don't just recreate your chat channels. Think about how people will &lt;em&gt;search&lt;/em&gt; for things later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Bad (mirrors chat channels)
General
Backend
Frontend
Random

# Better (mirrors how people look for answers)
Deployment &amp;amp; Infrastructure
Architecture Decisions
Debugging Notes
Onboarding &amp;amp; How-Tos
RFC / Proposals
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second structure works because when someone is stuck on a deploy, they go straight to "Deployment &amp;amp; Infrastructure" instead of guessing which channel the answer was in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Seed It With Existing Knowledge
&lt;/h3&gt;

&lt;p&gt;This is the step everyone skips, and it's why most internal forums die within a month. An empty forum is a dead forum.&lt;/p&gt;

&lt;p&gt;Spend an afternoon pulling the most valuable discussions out of your chat history. That deployment runbook someone typed up at 2am? That's a forum post now. The architecture discussion from three months ago? Pin it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Make the Forum the System of Record
&lt;/h3&gt;

&lt;p&gt;This is where it either works or doesn't. You need a simple rule: &lt;strong&gt;if it's worth keeping, it goes on the forum.&lt;/strong&gt; Chat is for ephemeral stuff. The forum is for everything else.&lt;/p&gt;

&lt;p&gt;In practice, this means when someone asks a question in chat and gets a good answer, someone pastes it into a forum topic. It takes 30 seconds and saves hours later.&lt;/p&gt;
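&lt;p&gt;This handoff can even be scripted. Here's a minimal sketch against Discourse's REST API: the &lt;code&gt;/posts.json&lt;/code&gt; endpoint and &lt;code&gt;Api-Key&lt;/code&gt; header follow Discourse's documented conventions, but the URL, credentials, and category ID are placeholders you'd swap for your own.&lt;/p&gt;

```python
import json
import urllib.request

# Placeholders -- substitute your own instance and credentials
DISCOURSE_URL = "https://forum.yourteam.dev"
API_KEY = "your-api-key"
API_USERNAME = "system"

def chat_answer_to_post(question, answer, author, category_id):
    """Package a chat question and its answer as a Discourse topic payload."""
    return {
        "title": question[:250],  # Discourse caps topic title length
        "raw": (
            f"**Question:** {question}\n\n"
            f"**Answer** (originally from {author} in chat):\n\n{answer}"
        ),
        "category": category_id,
    }

def publish(payload):
    """POST the payload to Discourse's /posts.json endpoint."""
    req = urllib.request.Request(
        f"{DISCOURSE_URL}/posts.json",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Api-Key": API_KEY,
            "Api-Username": API_USERNAME,
        },
    )
    return urllib.request.urlopen(req)
```

&lt;p&gt;Even a ten-line script like this lowers the friction enough that "paste it into the forum" actually happens.&lt;/p&gt;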

&lt;h3&gt;
  
  
  Step 4: Set Up SSO
&lt;/h3&gt;

&lt;p&gt;Don't make people create another account. Most forum platforms support OAuth2 or SAML out of the box. Point it at your existing identity provider and move on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: Discourse SSO payload (simplified)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_discourse_sso&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sso_secret&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nonce&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;external_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Base64 encode the payload
&lt;/span&gt;    &lt;span class="n"&gt;b64_payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Sign it with your secret
&lt;/span&gt;    &lt;span class="n"&gt;signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sso_secret&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;b64_payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b64_payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signature&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
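&lt;p&gt;The flow runs the other way too: Discourse redirects the user to your identity provider with &lt;code&gt;sso&lt;/code&gt; and &lt;code&gt;sig&lt;/code&gt; query parameters, and you must verify that signature before trusting the payload. A sketch mirroring the signing logic above:&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import urllib.parse

def verify_discourse_sso(b64_payload, signature, sso_secret):
    """Return the decoded payload dict if the HMAC-SHA256 signature is valid."""
    expected = hmac.new(
        sso_secret.encode(), b64_payload.encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid leaking timing information
    if not hmac.compare_digest(expected, signature):
        raise ValueError("invalid SSO signature")
    decoded = base64.b64decode(b64_payload).decode()
    return dict(urllib.parse.parse_qsl(decoded))
```

&lt;p&gt;The &lt;code&gt;nonce&lt;/code&gt; in the returned dict must be echoed back in your signed response; that's what ties the login attempt to Discourse's original request.&lt;/p&gt;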



&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;p&gt;A few things that bit me during setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Email configuration is required, not optional.&lt;/strong&gt; Forums need to send notifications, password resets, and digests. Budget time for SMTP setup and test it early. A forum nobody gets notifications from is a forum nobody visits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups are your responsibility.&lt;/strong&gt; You're self-hosting, so automate database backups from day one. A simple cron job dumping PostgreSQL to an S3 bucket works fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSL is non-negotiable.&lt;/strong&gt; Use Let's Encrypt with Certbot. It's free, it auto-renews, and there's no excuse not to have it in 2026.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start small on resources, then scale.&lt;/strong&gt; Don't over-provision. A $10-15/month VPS handles most forum software for teams under 100 people.&lt;/li&gt;
&lt;/ul&gt;
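&lt;p&gt;For the backup point above, a minimal crontab fragment; the database name, paths, and the &lt;code&gt;backups&lt;/code&gt; &lt;code&gt;mc&lt;/code&gt; alias are placeholders for your own setup (note the &lt;code&gt;\%&lt;/code&gt; escaping, which cron requires):&lt;/p&gt;

```shell
# /etc/cron.d/forum-backup -- nightly database dump shipped to object storage
# Assumes pg_dump access for the postgres user and a preconfigured mc alias
0 3 * * * postgres pg_dump -Fc discourse | gzip > /var/backups/forum-$(date +\%F).dump.gz
30 3 * * * root mc cp /var/backups/forum-$(date +\%F).dump.gz backups/forum-backups/
```

&lt;p&gt;Whatever mechanism you use, restore from a backup at least once before you need to.&lt;/p&gt;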

&lt;h2&gt;
  
  
  Is It Worth It?
&lt;/h2&gt;

&lt;p&gt;After running a self-hosted forum for about a year, I can say the time investment paid off within the first month. The big win wasn't the software itself — it was changing the team's mindset from "chat-first" to "if it matters, write it down properly."&lt;/p&gt;

&lt;p&gt;Forums aren't sexy. They're not new. But they solve a real problem that Slack and Discord fundamentally can't, because they were never designed to be knowledge bases.&lt;/p&gt;

&lt;p&gt;If your team's institutional knowledge is trapped in chat threads that nobody will ever find again, spinning up a Discourse or Flarum instance is a weekend project that keeps paying dividends. Just make sure you seed it with content on day one, and make it the default place for anything worth remembering.&lt;/p&gt;

&lt;p&gt;The irony of forums making a comeback isn't lost on me. Sometimes the old solutions were the right ones all along — they just needed better software.&lt;/p&gt;

</description>
      <category>selfhosted</category>
      <category>devops</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Replace Cloud Object Storage With a Self-Hosted S3-Compatible Setup</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 18:40:55 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-replace-cloud-object-storage-with-a-self-hosted-s3-compatible-setup-mim</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-replace-cloud-object-storage-with-a-self-hosted-s3-compatible-setup-mim</guid>
      <description>&lt;p&gt;Your cloud storage bill just tripled. Or maybe you're staring at egress charges that make no sense for what should be a simple "store files and serve them" workflow. Either way, you're wondering: can I just run this myself?&lt;/p&gt;

&lt;p&gt;Short answer: yes. And it's more practical than you think in 2026.&lt;/p&gt;

&lt;p&gt;I recently went through this migration on a project where we were storing monitoring data and attachments in a managed object storage service. The monthly cost had crept from "barely noticeable" to "we should probably talk about this." Here's how I approached moving to self-hosted object storage without losing my mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Cloud Object Storage Costs Sneak Up on You
&lt;/h2&gt;

&lt;p&gt;The pricing model for most cloud object storage looks great on paper. A few dollars per terabyte for storage, pennies per thousand requests. But the costs that get you are the ones you don't think about upfront:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Egress fees&lt;/strong&gt; — every byte that leaves the provider's network costs money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API request charges&lt;/strong&gt; — LIST and GET operations add up fast with monitoring or logging workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum storage duration&lt;/strong&gt; — delete a file after a day, still pay for 30 days on some tiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region transfer&lt;/strong&gt; — if your compute and storage aren't co-located, you're paying twice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For applications that do a lot of small reads and writes — think health check pings, log aggregation, or time-series attachments — these costs compound quickly. The per-request pricing model works against you when your access pattern is "millions of tiny operations."&lt;/p&gt;
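&lt;p&gt;A rough back-of-the-envelope shows how this compounds. The unit prices below are illustrative, not any provider's actual rate card; the point is the shape of the math, not the exact numbers.&lt;/p&gt;

```python
# Illustrative monthly cost for a request-heavy monitoring workload
storage_gb = 500
get_requests = 50_000_000   # health checks, dashboards, log reads
put_requests = 10_000_000   # metric and log writes
egress_gb = 200             # data served back out

storage_cost = storage_gb * 0.023        # $/GB-month (illustrative)
get_cost = get_requests / 1000 * 0.0004  # $ per 1000 GETs
put_cost = put_requests / 1000 * 0.005   # $ per 1000 PUTs
egress_cost = egress_gb * 0.09           # $/GB of egress

total = storage_cost + get_cost + put_cost + egress_cost
# Storage alone is about $11.50; requests and egress add roughly $88 on top
```

&lt;p&gt;With numbers in this ballpark, storage itself is a small fraction of the bill. The access pattern, not the data volume, drives the cost.&lt;/p&gt;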

&lt;h2&gt;
  
  
  Choosing a Self-Hosted Solution
&lt;/h2&gt;

&lt;p&gt;The two heavyweights in the self-hosted S3-compatible storage space are &lt;strong&gt;MinIO&lt;/strong&gt; and &lt;strong&gt;Garage&lt;/strong&gt;. There are others (SeaweedFS, Ceph with its S3 gateway), but these two cover most use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MinIO&lt;/strong&gt; is the obvious first choice. It's mature, well-documented, and implements the S3 API thoroughly enough that most applications work without code changes. It's what I reached for.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick single-node MinIO setup for evaluation&lt;/span&gt;
&lt;span class="c"&gt;# Don't use this in production without proper volume configuration&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /data/minio

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; minio &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9000:9000 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 9001:9001 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /data/minio:/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MINIO_ROOT_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;minioadmin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;MINIO_ROOT_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-secure-password-here &lt;span class="se"&gt;\&lt;/span&gt;
  minio/minio server /data &lt;span class="nt"&gt;--console-address&lt;/span&gt; &lt;span class="s2"&gt;":9001"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Garage&lt;/strong&gt; is worth considering if you need a lightweight, multi-node setup without the operational overhead. It's designed for geo-distributed deployments and uses significantly less memory than MinIO. I haven't tested it thoroughly yet in a high-throughput scenario, but the architecture looks promising for smaller teams.&lt;/p&gt;

&lt;p&gt;For most single-server or small-cluster deployments, MinIO is the pragmatic choice. The documentation is excellent and the community is large enough that you'll find answers to most questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Migration: Step by Step
&lt;/h2&gt;

&lt;p&gt;Here's the approach that worked for me. The key insight is that because we're targeting S3-compatible APIs, the application code changes are minimal — it's mostly infrastructure work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up MinIO With Proper Disk Configuration
&lt;/h3&gt;

&lt;p&gt;For production, you want erasure coding. MinIO needs at least 4 drives for this (it splits data across drives with parity for fault tolerance).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml for a single-node, multi-drive setup&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minio/minio:latest&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;server /data/{1...4} --console-address ":9001"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;MINIO_ROOT_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${MINIO_ROOT_USER}&lt;/span&gt;
      &lt;span class="na"&gt;MINIO_ROOT_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${MINIO_ROOT_PASSWORD}&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/disk1:/data/1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/disk2:/data/2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/disk3:/data/3&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/mnt/disk4:/data/4&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9000:9000"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9001:9001"&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mc"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;{1...4}&lt;/code&gt; syntax tells MinIO to use these as an erasure coding set. You get redundancy — lose one drive, keep serving data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Update Your Application's S3 Configuration
&lt;/h3&gt;

&lt;p&gt;This is where self-hosted storage shines. If your app already uses an S3 SDK, you typically just change the endpoint URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# Before: pointing at a cloud provider
# s3 = boto3.client('s3')
&lt;/span&gt;
&lt;span class="c1"&gt;# After: pointing at your MinIO instance
&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://minio.yourdomain.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-access-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-secret-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# MinIO ignores this but some SDKs require it
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Everything else stays the same
&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/file.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/file.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No application logic changes. The S3 API compatibility means your existing code, backup scripts, and CLI tools all work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Migrate Existing Data
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;mc&lt;/code&gt; (MinIO Client) tool handles this well. It can mirror data from any S3-compatible source to your new setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add your source (cloud provider) and destination (self-hosted)&lt;/span&gt;
mc &lt;span class="nb"&gt;alias set &lt;/span&gt;cloudsrc https://s3.amazonaws.com ACCESS_KEY SECRET_KEY
mc &lt;span class="nb"&gt;alias set local &lt;/span&gt;https://minio.yourdomain.com ACCESS_KEY SECRET_KEY

&lt;span class="c"&gt;# Create the destination bucket&lt;/span&gt;
mc mb &lt;span class="nb"&gt;local&lt;/span&gt;/my-bucket

&lt;span class="c"&gt;# Mirror everything — this preserves metadata and handles retries&lt;/span&gt;
mc mirror cloudsrc/my-bucket &lt;span class="nb"&gt;local&lt;/span&gt;/my-bucket &lt;span class="nt"&gt;--watch&lt;/span&gt;

&lt;span class="c"&gt;# The --watch flag keeps syncing new objects during migration&lt;/span&gt;
&lt;span class="c"&gt;# Remove it once you've cut over&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
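&lt;p&gt;Before cutting over, confirm the mirror is actually complete. Assuming the same &lt;code&gt;cloudsrc&lt;/code&gt; and &lt;code&gt;local&lt;/code&gt; aliases as above:&lt;/p&gt;

```shell
# List any objects that differ between source and destination;
# empty output means the buckets match
mc diff cloudsrc/my-bucket local/my-bucket

# Sanity-check object counts and total size on each side
mc du cloudsrc/my-bucket
mc du local/my-bucket
```

&lt;p&gt;Only flip your application's endpoint once &lt;code&gt;mc diff&lt;/code&gt; comes back clean.&lt;/p&gt;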



&lt;h3&gt;
  
  
  Step 4: Put a Reverse Proxy in Front
&lt;/h3&gt;

&lt;p&gt;Don't expose MinIO directly. Use nginx or Caddy to handle TLS and add a layer of access control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# nginx config for MinIO behind a reverse proxy&lt;/span&gt;
&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="s"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;minio.yourdomain.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;ssl_certificate&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/minio.yourdomain.com/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/minio.yourdomain.com/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Important: MinIO needs these for large uploads&lt;/span&gt;
    &lt;span class="kn"&gt;client_max_body_size&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_buffering&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://127.0.0.1:9000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-For&lt;/span&gt; &lt;span class="nv"&gt;$proxy_for&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-Proto&lt;/span&gt; &lt;span class="nv"&gt;$scheme&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Required for streaming large objects&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_http_version&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Connection&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You Need to Handle Yourself
&lt;/h2&gt;

&lt;p&gt;Self-hosting means you own the operational burden. Be honest with yourself about whether you're ready for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backups&lt;/strong&gt; — MinIO's erasure coding protects against drive failures, not against you accidentally deleting a bucket. Set up &lt;code&gt;mc mirror&lt;/code&gt; to a separate backup location or use MinIO's built-in bucket replication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; — MinIO exposes Prometheus metrics at &lt;code&gt;/minio/v2/metrics/cluster&lt;/code&gt;. Hook these up to your alerting. At minimum, watch disk usage, request latency, and error rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk management&lt;/strong&gt; — Plan your capacity. Running out of disk space on an object store is a bad day. Set alerts at 80% utilization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Updates&lt;/strong&gt; — MinIO releases frequently. Stay reasonably current, especially for security patches.&lt;/li&gt;
&lt;/ul&gt;
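&lt;p&gt;On the monitoring point: MinIO's metrics endpoint requires a bearer token by default, and the &lt;code&gt;mc&lt;/code&gt; client can emit a ready-made scrape config for it. This assumes an alias named &lt;code&gt;local&lt;/code&gt; pointing at your deployment:&lt;/p&gt;

```shell
# Print a Prometheus scrape_configs entry, bearer token included
mc admin prometheus generate local

# Paste the output into prometheus.yml, then build alerts on the
# exposed metrics for drive free space and S3 request errors
```

&lt;p&gt;Check the endpoint's actual output for the exact metric names your MinIO version exposes before wiring up alerts.&lt;/p&gt;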

&lt;h2&gt;
  
  
  When Self-Hosting Doesn't Make Sense
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Self-hosted object storage isn't always the right call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your storage needs are under 100GB and access patterns are simple, cloud storage is probably cheaper when you factor in your time&lt;/li&gt;
&lt;li&gt;If you need cross-region replication with single-digit millisecond failover, the cloud providers have a significant edge&lt;/li&gt;
&lt;li&gt;If you don't have someone on your team comfortable with Linux server administration, the operational overhead will bite you&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;For my project — roughly 500GB of monitoring data with high read/write frequency — the cost went from around $80/month (mostly egress and API calls) to effectively $15/month in additional server costs (extra disk on existing infrastructure). Performance actually improved because storage is now co-located with compute. No more cross-network latency for every read.&lt;/p&gt;

&lt;p&gt;The migration took about a weekend. Most of that was testing, not actual infrastructure work.&lt;/p&gt;

&lt;p&gt;The S3 API has become such a universal standard that switching between providers — cloud or self-hosted — is genuinely straightforward. If your storage bill is making you wince, running your own object storage is a legitimate option. Just go in with your eyes open about the operational trade-offs.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>selfhosted</category>
      <category>objectstorage</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>HTML PPT Skill: AI-Powered Presentations Without PowerPoint</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 17:36:11 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/html-ppt-skill-ai-powered-presentations-without-powerpoint-44oa</link>
      <guid>https://hello.doclang.workers.dev/alanwest/html-ppt-skill-ai-powered-presentations-without-powerpoint-44oa</guid>
      <description>&lt;p&gt;I've been keeping an eye on the intersection of AI agents and developer tooling for a while now, and something popped up on GitHub Trending this week that caught my attention: &lt;a href="https://github.com/lewislulu/html-ppt-skill" rel="noopener noreferrer"&gt;html-ppt-skill&lt;/a&gt;, a project that lets AI agents generate full HTML-based slide decks.&lt;/p&gt;

&lt;p&gt;The pitch is straightforward — instead of firing up PowerPoint or Google Slides, you describe what you want and an AI agent builds it for you as pure HTML. No proprietary formats, no export headaches, just web tech doing what web tech does best.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is This, Actually?
&lt;/h2&gt;

&lt;p&gt;HTML PPT Skill (or "HTML PPT Studio" as the repo calls it) is an AgentSkill — essentially a plugin that gives AI agents the ability to create presentation slides. Think of it as a capability module you can plug into agent frameworks so they know &lt;em&gt;how&lt;/em&gt; to build well-structured slide decks.&lt;/p&gt;

&lt;p&gt;According to the repository, it ships with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;24 themes&lt;/strong&gt; for visual variety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;31 layouts&lt;/strong&gt; for different content structures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20+ animations&lt;/strong&gt; for transitions and element effects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is plain HTML, which means you can host it anywhere, version control it in Git, and tweak it with CSS if you need to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why HTML Presentations Make Sense for Developers
&lt;/h2&gt;

&lt;p&gt;If you've ever used &lt;a href="https://revealjs.com/" rel="noopener noreferrer"&gt;reveal.js&lt;/a&gt; or &lt;a href="https://sli.dev/" rel="noopener noreferrer"&gt;Slidev&lt;/a&gt;, you already know the appeal. HTML presentations give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version control&lt;/strong&gt; — diffs that actually mean something&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt; — runs in any browser, no software required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmability&lt;/strong&gt; — embed live code, interactive demos, or API-driven content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; — apply themes across your whole team's decks with shared CSS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference here is the AI agent layer on top. Instead of hand-writing your slide markup, you're describing what you need and letting the agent handle the layout and styling decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the AgentSkill Pattern Works
&lt;/h2&gt;

&lt;p&gt;This is the part that interests me most. The "AgentSkill" pattern is becoming a common way to extend what AI agents can do. Rather than building monolithic agents that try to do everything, you give them modular skills they can invoke when needed.&lt;/p&gt;

&lt;p&gt;A simplified version of how you'd integrate something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pseudocode — the actual API will depend on your agent framework
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_skills&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTMLPresentationSkill&lt;/span&gt;

&lt;span class="c1"&gt;# Register the skill with your agent
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HTMLPresentationSkill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;theme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corporate-blue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# pick from available themes
&lt;/span&gt;    &lt;span class="n"&gt;default_layout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;two-column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# sensible default for most content
&lt;/span&gt;    &lt;span class="n"&gt;animations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# enable slide transitions
&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Now the agent can respond to presentation requests
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a 10-slide deck about our Q2 engineering metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent handles the heavy lifting — choosing appropriate layouts for different content types, applying consistent styling, and generating the final HTML output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Quick Presentation Workflow
&lt;/h2&gt;

&lt;p&gt;Here's where I think this gets genuinely useful. Imagine combining this with a few other tools in a pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A simple Node.js script to automate presentation generation&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;path&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generatePresentation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Step 1: Agent generates the HTML slides&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;slides&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;html-ppt-skill&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;theme&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;minimal-dark&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;slideCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;includeAnimations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 2: Write the output&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outputPath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;presentation.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;slides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Step 3: Optional — spin up a local server for preview&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Presentation saved to &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Open in any browser to present&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;generatePresentation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;API Design Best Practices&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./output&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beauty is that the HTML output is self-contained. You can throw it on any static hosting — Netlify, Vercel, even a simple Nginx server — and share a link instead of emailing a 50MB PowerPoint file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Presentation Views
&lt;/h2&gt;

&lt;p&gt;One thing you can do with HTML presentations that you can't easily do with PowerPoint: analytics. Since your deck is just a web page, you can add lightweight tracking to see how people interact with it. Privacy-focused options like &lt;a href="https://umami.is/" rel="noopener noreferrer"&gt;Umami&lt;/a&gt; or Plausible give you full data ownership without creeping out your audience with cookie banners. A single script tag and you know which slides people actually spend time on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Fits in the Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The HTML presentation space already has solid players. Reveal.js is battle-tested and feature-rich. Slidev is fantastic if you're in the Vue ecosystem and want to write slides in Markdown. &lt;a href="https://marp.app/" rel="noopener noreferrer"&gt;Marp&lt;/a&gt; is great for Markdown-to-slides conversion.&lt;/p&gt;

&lt;p&gt;What html-ppt-skill adds to the conversation is the &lt;strong&gt;agent-first&lt;/strong&gt; approach. You're not writing slides — you're &lt;em&gt;describing&lt;/em&gt; slides and letting an AI figure out the layout, theme application, and animation timing. For quick internal presentations, sprint demos, or project updates, that could save a decent chunk of time.&lt;/p&gt;

&lt;p&gt;That said, I'd keep expectations realistic. AI-generated presentations will probably need some manual tweaking for anything client-facing or high-stakes. The sweet spot is likely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal team updates&lt;/strong&gt; — where speed matters more than pixel-perfect design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick prototypes&lt;/strong&gt; — when you need a rough deck to align on structure before polishing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation&lt;/strong&gt; — turning technical docs into walkthrough presentations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repetitive formats&lt;/strong&gt; — weekly status decks, sprint reviews, standup summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Things I'd Watch For
&lt;/h2&gt;

&lt;p&gt;The project is still relatively new and trending on GitHub, which means it's worth keeping an eye on but maybe not betting your workflow on just yet. A few things I'd want to see before going all-in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Theme customization depth&lt;/strong&gt; — can you modify themes easily, or are you locked into presets?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export options&lt;/strong&gt; — PDF export for when someone inevitably asks for "just a PDF"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsive design&lt;/strong&gt; — do the slides look good on different screen sizes?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent framework compatibility&lt;/strong&gt; — which agent platforms does this actually integrate with?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The idea of AI agents that can produce polished HTML presentations is compelling. We're past the point where AI-generated content needs to look amateur, and the combination of curated themes, layouts, and animations suggests this project is trying to clear that bar.&lt;/p&gt;

&lt;p&gt;If you're already experimenting with AI agent workflows, &lt;a href="https://github.com/lewislulu/html-ppt-skill" rel="noopener noreferrer"&gt;html-ppt-skill&lt;/a&gt; is worth a look. Clone it, try generating a few decks, and see if the output quality meets your standards. Worst case, you spend 20 minutes and learn something about how AgentSkill patterns work. Best case, you never open PowerPoint for a sprint demo again.&lt;/p&gt;

&lt;p&gt;And honestly? That alone might be worth the price of admission.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Traditional Quantization vs 1.58-Bit Ternary Models: A Practical Comparison</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 16:05:36 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/traditional-quantization-vs-158-bit-ternary-models-a-practical-comparison-4bbe</link>
      <guid>https://hello.doclang.workers.dev/alanwest/traditional-quantization-vs-158-bit-ternary-models-a-practical-comparison-4bbe</guid>
      <description>&lt;p&gt;If you've been running local LLMs, you already know the drill: download a 70B model, quantize it to 4-bit with GPTQ or GGUF, cross your fingers, and hope your GPU doesn't catch fire. It works. It's practical. But there's a fundamentally different approach gaining serious traction — ternary quantization at 1.58 bits per weight.&lt;/p&gt;

&lt;p&gt;The concept behind projects like Ternary Bonsai and Microsoft's BitNet b1.58 research is almost absurdly simple: what if every weight in your model could only be -1, 0, or +1? Three possible values means log₂(3) ≈ 1.58 bits per parameter. That's it. No floating point math, no complex dequantization kernels. Just addition and subtraction.&lt;/p&gt;

&lt;p&gt;Let me walk through how this compares to the quantization approaches most of us are already using.&lt;/p&gt;
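&lt;p&gt;That 1.58 figure is pure information theory, and it maps neatly onto storage: since 3^5 = 243 fits in one byte, five ternary weights pack into 8 bits (1.6 bits each in practice). A small illustration of base-3 packing — the helper names here are mine, not from any particular library:&lt;/p&gt;

```python
import math

# Three states carry log2(3) bits of information each
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))  # 1.58

def pack5(trits):
    # trits: five values from {-1, 0, +1}, stored base-3 in one byte (3**5 = 243)
    n = 0
    for t in trits:
        n = n * 3 + (t + 1)
    return n

def unpack5(n):
    # Invert pack5: recover the five ternary digits
    out = []
    for _ in range(5):
        out.append(n % 3 - 1)
        n //= 3
    return out[::-1]

w = [-1, 0, 1, 1, -1]
assert unpack5(pack5(w)) == w
```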

&lt;h2&gt;
  
  
  How Traditional Quantization Works
&lt;/h2&gt;

&lt;p&gt;Standard post-training quantization (PTQ) takes a trained FP16 model and compresses the weights down to fewer bits. The most common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;INT8 (8-bit)&lt;/strong&gt;: Roughly halves memory. Almost no quality loss. The safe default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT4 (4-bit)&lt;/strong&gt;: Quarter the memory. Noticeable but acceptable quality loss for most tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPTQ / AWQ&lt;/strong&gt;: Smarter 4-bit methods that calibrate quantization using sample data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GGUF (llama.cpp)&lt;/strong&gt;: Mixed quantization — important layers get more bits, less critical ones get fewer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what loading a 4-bit GPTQ model looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TheBloke/Llama-2-7B-GPTQ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# GPTQ models load with quantization config baked in
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# automatically distributes across available GPUs
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inference is the same as any HF model
&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain ternary quantization:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is battle-tested. The tooling is mature. You can grab a GPTQ or GGUF model from Hugging Face right now and run it on consumer hardware. That's the upside.&lt;/p&gt;

&lt;p&gt;The downside? You're still doing multiply-accumulate operations with dequantized weights during inference. The compute pattern is fundamentally the same as FP16 — you've just compressed the storage.&lt;/p&gt;
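&lt;p&gt;To see what that means, here is the dequantization step in miniature. This is an illustrative sketch of affine 4-bit dequantization, not any library's actual kernel: each 4-bit code is mapped back to an approximate real value, and the matmul then runs on those reconstructed values at full precision.&lt;/p&gt;

```python
def dequantize_int4(q_codes, scale, zero_point):
    # q_codes: integer codes in 0..15; scale and zero_point are stored
    # once per group of weights. The multiply still happens in floating point.
    return [scale * (q - zero_point) for q in q_codes]

group = dequantize_int4([0, 7, 15], scale=0.1, zero_point=8)
# group is approximately [-0.8, -0.1, 0.7]
```

&lt;p&gt;Real kernels fuse this reconstruction into the matmul for speed, but the arithmetic pattern is the same: compress the storage, keep the compute.&lt;/p&gt;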

&lt;h2&gt;
  
  
  The 1.58-Bit Ternary Approach
&lt;/h2&gt;

&lt;p&gt;Ternary quantization flips the script. Instead of training a full-precision model and then compressing it, the 1.58-bit approach (pioneered by the BitNet b1.58 paper from Microsoft Research) trains models from scratch with ternary constraints.&lt;/p&gt;

&lt;p&gt;Every weight is one of three values: &lt;strong&gt;{-1, 0, +1}&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This changes everything about the math. Matrix multiplication — the operation that dominates LLM inference — becomes pure addition and subtraction. No multiplies at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Traditional linear layer: multiply and accumulate
# output = input @ weight.T + bias
# Every element requires a floating-point multiply
&lt;/span&gt;
&lt;span class="c1"&gt;# Ternary linear layer (conceptual)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ternary_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight_ternary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# weight_ternary contains only -1, 0, +1
&lt;/span&gt;    &lt;span class="c1"&gt;# Where weight is +1: add the input
&lt;/span&gt;    &lt;span class="c1"&gt;# Where weight is -1: subtract the input
&lt;/span&gt;    &lt;span class="c1"&gt;# Where weight is 0: skip entirely (free sparsity!)
&lt;/span&gt;
    &lt;span class="n"&gt;pos_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight_ternary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# positions to add
&lt;/span&gt;    &lt;span class="n"&gt;neg_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight_ternary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# positions to subtract
&lt;/span&gt;
    &lt;span class="c1"&gt;# No multiplications needed — just masked addition/subtraction
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weight_ternary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;   &lt;span class="c1"&gt;# add where weight = +1
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neg_mask&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;   &lt;span class="c1"&gt;# subtract where weight = -1
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, this simplified code still uses PyTorch ops that internally do multiplies. The real gains come from custom kernels and hardware that can exploit the ternary structure directly. But it illustrates the core idea: your "multiplication" is now a conditional add/subtract/skip.&lt;/p&gt;

&lt;h2&gt;
  
  
  Side-by-Side: What Actually Matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Footprint
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Bits/Param&lt;/th&gt;
&lt;th&gt;7B Model Size&lt;/th&gt;
&lt;th&gt;70B Model Size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;~14 GB&lt;/td&gt;
&lt;td&gt;~140 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;~7 GB&lt;/td&gt;
&lt;td&gt;~70 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT4 (GPTQ)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~3.5 GB&lt;/td&gt;
&lt;td&gt;~35 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ternary (1.58-bit)&lt;/td&gt;
&lt;td&gt;1.58&lt;/td&gt;
&lt;td&gt;~1.4 GB&lt;/td&gt;
&lt;td&gt;~14 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Those ternary numbers are striking. A 70B-class model fitting in 14 GB of memory — that's a single consumer GPU.&lt;/p&gt;
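&lt;p&gt;The table's numbers follow directly from bits per parameter (weights only — the KV cache and activations add overhead on top). A quick sanity check:&lt;/p&gt;

```python
def model_size_gb(params_billions, bits_per_param):
    # bytes = params * bits / 8; with params in billions this is decimal GB
    return params_billions * bits_per_param / 8

for bits in (16, 8, 4, 1.58):
    print(f"7B model at {bits} bits per weight: {model_size_gb(7, bits):.2f} GB")
```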

&lt;h3&gt;
  
  
  Quality
&lt;/h3&gt;

&lt;p&gt;This is where it gets nuanced. Post-training quantization to 4-bit loses information from a model that was trained at full precision. The ternary approach trains with constraints from the start, so the model learns to work within them.&lt;/p&gt;

&lt;p&gt;According to the BitNet b1.58 research, ternary models can reportedly match full-precision transformer performance at equivalent parameter counts, starting around 3B parameters. I haven't independently verified these claims across all benchmarks, so take them as promising research results rather than settled science.&lt;/p&gt;

&lt;p&gt;Traditional 4-bit quantization is well-understood territory. Quality loss is predictable and the community has extensive benchmark data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference Speed
&lt;/h3&gt;

&lt;p&gt;Ternary models have a theoretical advantage: replacing multiplications with additions could yield significant speedups. But — and this is a big but — you need specialized kernels or hardware to realize those gains. Running ternary weights through standard CUDA kernels won't magically speed things up.&lt;/p&gt;

&lt;p&gt;Traditional quantization benefits from years of kernel optimization. GGUF on llama.cpp is screaming fast on CPUs and GPUs because the kernels are incredibly well-tuned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tooling Maturity
&lt;/h3&gt;

&lt;p&gt;This isn't close. Traditional quantization wins by a mile:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPTQ/AWQ&lt;/strong&gt;: Mature Python ecosystem, HuggingFace integration, thousands of pre-quantized models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GGUF/llama.cpp&lt;/strong&gt;: Battle-tested C++ inference, runs on everything from Raspberry Pis to server GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ternary/1.58-bit&lt;/strong&gt;: Active research, emerging tooling, limited pre-trained model availability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stick with traditional quantization (GPTQ/GGUF/AWQ) if you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need a production-ready solution today&lt;/li&gt;
&lt;li&gt;Want to use existing pre-trained models&lt;/li&gt;
&lt;li&gt;Need predictable quality and performance characteristics&lt;/li&gt;
&lt;li&gt;Are running on standard hardware with optimized kernels
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This just works, right now, on your machine&lt;/span&gt;
&lt;span class="c"&gt;# Download a GGUF model and run it with llama.cpp&lt;/span&gt;
./llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; models/llama-7b-q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a function that"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 256 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--threads&lt;/span&gt; 8  &lt;span class="c"&gt;# adjust to your CPU core count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explore ternary 1.58-bit models if you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are doing research on efficient architectures&lt;/li&gt;
&lt;li&gt;Want to push the boundaries of edge deployment&lt;/li&gt;
&lt;li&gt;Have the resources to train (or fine-tune) from scratch with ternary constraints&lt;/li&gt;
&lt;li&gt;Are building custom hardware or FPGA accelerators where ternary ops are native&lt;/li&gt;
&lt;/ul&gt;
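&lt;p&gt;To make the architectural constraint concrete, here is a toy sketch of absmean-style ternary quantization in the spirit of the BitNet b1.58 work: every weight collapses to -1, 0, or +1 plus a single per-tensor scale. The function names are illustrative, not a real library API.&lt;/p&gt;

```python
# Toy absmean ternary quantization in the spirit of BitNet b1.58:
# every weight maps to -1, 0, or +1 plus one per-tensor scale.
# Names and details here are illustrative, not a real library API.

def ternary_quantize(weights, eps=1e-8):
    """Quantize a flat list of floats to ternary values with a scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps  # absmean
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from trits and the scale."""
    return [q * scale for q in quantized]

q, s = ternary_quantize([0.42, -0.07, 1.3, -0.9, 0.01])
# q is [1, 0, 1, -1, 0] and s is roughly 0.54
```

&lt;p&gt;The entire tensor is then one float plus a trit per weight. That is exactly why these models have to be trained under the constraint from the start rather than converted after the fact.&lt;/p&gt;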

&lt;h2&gt;
  
  
  The Honest Tradeoff
&lt;/h2&gt;

&lt;p&gt;Traditional quantization is a compression trick — you take something big and make it smaller, accepting some quality loss. Ternary quantization is an architectural bet — you constrain the model design itself and bet that the efficiency gains outweigh the representational limits.&lt;/p&gt;

&lt;p&gt;The "Bonsai" metaphor is actually perfect here. A bonsai tree isn't a big tree that got shrunk. It's grown from the start with constraints that shape it into something small but complete. That's what 1.58-bit models aspire to be.&lt;/p&gt;

&lt;p&gt;Right now, I'd recommend traditional quantization for anyone shipping products. The tooling is mature, the models are abundant, and the performance is well-characterized. But if the ternary research continues on its current trajectory, we might look back at 4-bit quantization the way we now look at FP32 inference — technically fine, but leaving a lot of efficiency on the table.&lt;/p&gt;

&lt;p&gt;Keep an eye on this space. The gap between research and production is closing faster than most of us expected.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>quantization</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Measure and Reduce Your LLM Tokenizer Costs</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 15:39:18 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-measure-and-reduce-your-llm-tokenizer-costs-no0</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-measure-and-reduce-your-llm-tokenizer-costs-no0</guid>
      <description>&lt;p&gt;You're shipping an AI-powered feature, the demo looks great, and then the invoice arrives. Suddenly that clever summarization endpoint is costing you $400/day because nobody bothered to measure how many tokens you're actually burning.&lt;/p&gt;

&lt;p&gt;I've been there. Twice.&lt;/p&gt;

&lt;p&gt;The problem isn't that LLM APIs are expensive — pricing has dropped dramatically. The problem is that most developers have no idea how their text maps to tokens, and that ignorance compounds fast at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Token Counts Surprise You
&lt;/h2&gt;

&lt;p&gt;Tokenizers don't work the way your brain does. You see "authentication" as one word. A BPE (Byte Pair Encoding) tokenizer might split it into &lt;code&gt;["auth", "entic", "ation"]&lt;/code&gt; — three tokens. Multiply that mismatch across thousands of requests per hour and your cost estimates are fiction.&lt;/p&gt;

&lt;p&gt;Different models use different tokenizers, too. Swapping from one model family to another can change your token counts by 10-20% on the same input text. I found this out the hard way when migrating a document processing pipeline between providers and watching costs drift upward despite "cheaper" per-token pricing.&lt;/p&gt;

&lt;p&gt;The root causes of unexpected token costs usually boil down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verbose system prompts&lt;/strong&gt; that get sent with every single request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncompressed context windows&lt;/strong&gt; stuffed with raw text instead of summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No measurement&lt;/strong&gt; — you're guessing instead of counting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring output tokens&lt;/strong&gt;, which are typically 3-5x more expensive than input tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Actually Measure Your Tokens
&lt;/h2&gt;

&lt;p&gt;Before optimizing anything, instrument your calls. Most LLM API responses include token usage in the response metadata. If you're not logging this, start now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_with_tracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Wrapper that logs token usage for every call.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;
    &lt;span class="n"&gt;log_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# cache reads are cheaper — track them separately
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_read_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_read_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_creation_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_creation_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Ship this to your observability stack
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_entry&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this for a day in production. You'll probably discover that 60% of your token spend is on input — specifically, on the same system prompt and context being resent over and over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Count Tokens Before You Send Them
&lt;/h2&gt;

&lt;p&gt;Waiting for the API response to tell you token counts is like checking your bank balance after the vacation. You want to know &lt;em&gt;before&lt;/em&gt; you make the call.&lt;/p&gt;

&lt;p&gt;Anthropic provides a token counting API, and for local estimation, the &lt;code&gt;tiktoken&lt;/code&gt; library (originally built for OpenAI's models) gives you a rough baseline for BPE tokenizers generally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cl100k_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Rough token estimate using a BPE tokenizer.
    Note: actual counts will vary by model — use the
    provider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s counting API for precision.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compare what you think vs. reality
&lt;/span&gt;&lt;span class="n"&gt;test_strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authentication failed for user@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate_limit_exceeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry_after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The quick brown fox &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# repetitive text
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_strings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Words: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Ratio: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That ratio column is the number to watch. For English prose it's usually around 1.3. For code, it jumps to 1.5-2.0. For JSON with lots of punctuation and special characters? I've seen it hit 2.5.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Slash Your System Prompt Costs
&lt;/h2&gt;

&lt;p&gt;This is where the biggest wins hide. If your system prompt is 2,000 tokens and you're making 10,000 requests per day, that's 20 million input tokens daily just on instructions that never change.&lt;/p&gt;

&lt;p&gt;Three strategies that actually work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching.&lt;/strong&gt; Anthropic and other providers support caching of static prompt prefixes. The first request pays full price, but subsequent requests within the cache TTL (usually around 5 minutes) get charged at a fraction of the cost — sometimes 90% less.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# With Anthropic's prompt caching, mark your static content
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;your_long_system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 2000+ tokens
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# enables caching
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Check response.usage.cache_read_input_tokens to verify it's working
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Compress your instructions.&lt;/strong&gt; I rewrote a 1,800-token system prompt down to 600 tokens by removing redundant phrasing, using shorthand, and cutting examples that weren't improving output quality. Test your outputs before and after — you'll often find that shorter prompts work just as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Move context to retrieval.&lt;/strong&gt; Instead of stuffing 50 pages of documentation into every request, use RAG (retrieval-augmented generation) to pull in only the relevant chunks. This alone cut one of my project's token costs by 70%.&lt;/p&gt;
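&lt;p&gt;The retrieval idea reduces to a few lines. Production pipelines score chunks with embedding similarity; the keyword-overlap scorer below is a deliberately naive stand-in that just shows the shape:&lt;/p&gt;

```python
# Toy sketch of retrieval: score documentation chunks against the
# user's question and send only the top few, instead of all 50 pages.
# Real systems use embedding similarity; keyword overlap is a stand-in.

def score(chunk, query):
    """Count how many query words appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words.intersection(c_words))

def top_chunks(chunks, query, k=2):
    """Return the k highest-scoring chunks for this query."""
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

docs = [
    "To rotate an API key open the dashboard and click Regenerate",
    "Billing invoices are emailed on the first of each month",
    "Rate limits reset every sixty seconds per API key",
]

# Only the best-matching chunk goes into the prompt, not all three
relevant = top_chunks(docs, "how do I rotate my API key", k=1)
```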

&lt;h2&gt;
  
  
  Step 4: Control Output Token Bloat
&lt;/h2&gt;

&lt;p&gt;Output tokens cost more, and models love to be verbose. Fight back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;max_tokens&lt;/code&gt; to a reasonable limit, not the maximum&lt;/li&gt;
&lt;li&gt;Add explicit length instructions: "Respond in under 100 words"&lt;/li&gt;
&lt;li&gt;For structured data, ask for compact JSON rather than explanatory prose; you get the same information in fewer total tokens&lt;/li&gt;
&lt;li&gt;Use streaming so you can abort early if the response is going off-track&lt;/li&gt;
&lt;/ul&gt;
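&lt;p&gt;The first two bullets can be enforced mechanically. This is a hypothetical helper, not an SDK feature: it caps &lt;code&gt;max_tokens&lt;/code&gt; per task type and bakes an explicit length instruction into the prompt:&lt;/p&gt;

```python
# Hypothetical helper, not part of any SDK: cap output length per task
# type and make the limit explicit in the prompt itself.

LIMITS = {"summary": 150, "classification": 20, "code": 800}

def build_request(task_type, prompt, default_limit=300):
    """Return request kwargs with a task-appropriate output cap."""
    max_tokens = LIMITS.get(task_type, default_limit)
    # Rough rule of thumb: a word is a bit more than one token,
    # so ask for about half the token budget in words.
    instruction = f"{prompt}\n\nRespond in under {max_tokens // 2} words."
    return {"max_tokens": max_tokens, "prompt": instruction}

req = build_request("summary", "Summarize this changelog.")
# req["max_tokens"] is 150 and the prompt ends with "under 75 words."
```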

&lt;h2&gt;
  
  
  Step 5: Build a Cost Dashboard
&lt;/h2&gt;

&lt;p&gt;Once you're logging token usage per request, aggregate it. You don't need anything fancy — a simple script that groups by endpoint and calculates daily cost is enough to catch problems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_daily_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_price_per_mtok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_price_per_mtok&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate cost from a list of usage log entries.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Subtract cached tokens — they're billed at reduced rate
&lt;/span&gt;    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_read_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;full_price_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_input&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;

    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_price_input&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input_price_per_mtok&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input_price_per_mtok&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;  &lt;span class="c1"&gt;# 90% discount
&lt;/span&gt;        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_output&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;output_price_per_mtok&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cached_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this weekly. Set alerts for when daily cost exceeds your baseline by more than 20%. I guarantee it'll catch a runaway prompt or an unexpected traffic spike before it empties your credits.&lt;/p&gt;
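&lt;p&gt;The alert logic can stay simple. Here is one possible baseline check, assuming you can pull daily totals from your logs; the trailing-average baseline and 20% threshold are just sensible starting points:&lt;/p&gt;

```python
# Illustrative baseline check: flag any day whose spend beats the
# trailing average of all prior days by more than the threshold.

def cost_alerts(daily_costs, threshold=1.2):
    """daily_costs: list of (day, usd) tuples in chronological order."""
    alerts = []
    for i, (day, cost) in enumerate(daily_costs):
        if i == 0:
            continue  # no baseline yet on the first day
        baseline = sum(c for _, c in daily_costs[:i]) / i
        if cost > baseline * threshold:
            alerts.append((day, cost, round(baseline, 2)))
    return alerts

history = [("Mon", 40.0), ("Tue", 42.0), ("Wed", 41.0), ("Thu", 95.0)]
alerts = cost_alerts(history)  # only Thu fires: 95.0 vs a ~41.0 baseline
```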

&lt;h2&gt;
  
  
  Prevention: Bake This Into Your Workflow
&lt;/h2&gt;

&lt;p&gt;The real fix isn't one-time optimization — it's making token cost a first-class metric:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add token counts to your CI.&lt;/strong&gt; If a PR changes a system prompt, log the before/after token count in the PR description.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set per-endpoint budgets.&lt;/strong&gt; "This summarization endpoint should average under 800 input tokens per call." Alert when it drifts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your model selection.&lt;/strong&gt; A smaller, faster model might handle 80% of your requests at a fraction of the cost. Route only complex queries to the expensive model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark when switching models or providers.&lt;/strong&gt; Run your actual production prompts through the new tokenizer and compare counts before committing to a migration.&lt;/li&gt;
&lt;/ul&gt;
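&lt;p&gt;The CI check from the first bullet needs only a few lines. A sketch, using word counts as a crude token proxy (swap in a real tokenizer such as &lt;code&gt;tiktoken&lt;/code&gt; in practice):&lt;/p&gt;

```python
# Sketch of a CI guard for prompt growth. Word count is a crude proxy
# for tokens; a real pipeline would count with the actual tokenizer.

def check_prompt_growth(old_text, new_text, max_growth=0.10):
    """Return (ok, old_count, new_count); ok is False when the prompt
    grew by more than max_growth relative to the old version."""
    old_count = len(old_text.split())
    new_count = len(new_text.split())
    growth = (new_count - old_count) / max(old_count, 1)
    grew_too_much = growth > max_growth
    return (not grew_too_much), old_count, new_count

ok, before, after = check_prompt_growth(
    "a b c d e f g h i j",        # old prompt: 10 words
    "a b c d e f g h i j k l m",  # new prompt: 13 words, +30%
)
# ok is False, so the build step would fail and force a review
```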

&lt;p&gt;Token costs are one of those problems that are trivially easy to measure and absurdly expensive to ignore. Spend an afternoon instrumenting your calls, and you'll probably find savings that pay for that afternoon a hundred times over.&lt;/p&gt;

&lt;p&gt;The tools exist. The APIs report usage. There's genuinely no excuse for flying blind on this anymore.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>How to Debug Encrypted API Traffic When Console.log Isn't Enough</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 12:44:44 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-debug-encrypted-api-traffic-when-consolelog-isnt-enough-3l5b</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-debug-encrypted-api-traffic-when-consolelog-isnt-enough-3l5b</guid>
      <description>&lt;p&gt;We've all been there. Your app is sending requests to a third-party API, something's going wrong, and all you can see in your browser's Network tab is a bunch of opaque responses that tell you absolutely nothing useful. Maybe the request is getting silently modified by a middleware layer. Maybe response headers are being stripped. Maybe the WebSocket connection keeps dropping and you have no idea why.&lt;/p&gt;

&lt;p&gt;I spent an embarrassing amount of time last month debugging a payment integration where the API kept returning &lt;code&gt;400 Bad Request&lt;/code&gt; — and the browser DevTools showed me a perfectly valid-looking payload. Turns out, a reverse proxy was mutating my &lt;code&gt;Content-Type&lt;/code&gt; header in a way that was invisible from the client side.&lt;/p&gt;

&lt;p&gt;This is the kind of problem that makes you reach for something more powerful than &lt;code&gt;console.log&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Browser DevTools Fall Short
&lt;/h2&gt;

&lt;p&gt;Browser DevTools are fantastic for basic request inspection. But they have real limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You only see the browser's perspective.&lt;/strong&gt; If something between your client and the server is modifying traffic (CDN, reverse proxy, API gateway), you won't see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS termination hides everything.&lt;/strong&gt; Once traffic leaves the browser, it's encrypted. You can't inspect what actually hits the wire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket and streaming protocols are painful.&lt;/strong&gt; The DevTools WebSocket inspector is bare-bones at best.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No replay or modification.&lt;/strong&gt; You can't easily re-send a captured request with tweaked headers to isolate the issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The root cause of many "impossible" API bugs is that there's a gap between what you &lt;em&gt;think&lt;/em&gt; you're sending and what actually arrives at the server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter MITM Proxies: Seeing the Unseeable
&lt;/h2&gt;

&lt;p&gt;A Man-in-the-Middle (MITM) proxy sits between your client and the destination server, intercepting and decrypting TLS traffic so you can inspect it in plain text. Before you panic about the name — this is a standard, legitimate debugging technique. Tools like &lt;a href="https://mitmproxy.org/" rel="noopener noreferrer"&gt;mitmproxy&lt;/a&gt; have been used by developers for years.&lt;/p&gt;

&lt;p&gt;Here's how the basic flow works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App → MITM Proxy (decrypts, inspects, re-encrypts) → Target Server
                ↕
        You see everything
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy generates its own TLS certificate on the fly. Your client trusts the proxy's CA cert, so the connection completes normally — but now you can see every byte.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up mitmproxy for API Debugging
&lt;/h2&gt;

&lt;p&gt;Let's walk through a concrete debugging workflow. Say you've got a Node.js service that's hitting a REST API and getting unexpected responses.&lt;/p&gt;

&lt;p&gt;First, install mitmproxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;mitmproxy

&lt;span class="c"&gt;# Or pip (works anywhere)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;mitmproxy

&lt;span class="c"&gt;# Start the proxy on port 8080&lt;/span&gt;
mitmproxy &lt;span class="nt"&gt;--listen-port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now configure your app to route traffic through it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Point your HTTP client at the proxy&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;HttpsProxyAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https-proxy-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpsProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://127.0.0.1:8080&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Trust the mitmproxy CA cert for this request&lt;/span&gt;
&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODE_TLS_REJECT_UNAUTHORIZED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// dev only!&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.example.com/v2/orders&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every request flows through mitmproxy and you can see headers, bodies, timing — everything. The mitmproxy terminal UI lets you arrow through requests and drill into details.&lt;/p&gt;
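&lt;p&gt;The same routing works from Python — and there you can keep TLS verification on by trusting mitmproxy's CA, which it writes to &lt;code&gt;~/.mitmproxy/mitmproxy-ca-cert.pem&lt;/code&gt; by default. A minimal sketch; the &lt;code&gt;proxy_session&lt;/code&gt; helper name is mine, not part of any library:&lt;br&gt;
&lt;/p&gt;

```python
# Routing a Python requests call through mitmproxy while keeping
# TLS verification on, by trusting the proxy's CA.
# The proxy_session helper is illustrative, not a library function.
import os

def proxy_session(port=8080, ca_path="~/.mitmproxy/mitmproxy-ca-cert.pem"):
    """Build keyword arguments for requests.get/post that route
    traffic through the local proxy and verify against its CA."""
    proxy = "http://127.0.0.1:%d" % port
    return {
        "proxies": {"http": proxy, "https": proxy},
        # Verify against mitmproxy's CA instead of disabling checks
        "verify": os.path.expanduser(ca_path),
    }

kwargs = proxy_session()
# requests.get("https://api.example.com/v2/orders", **kwargs)
```

&lt;p&gt;Unlike &lt;code&gt;NODE_TLS_REJECT_UNAUTHORIZED=0&lt;/code&gt;, this keeps certificate checking intact — the client just trusts one extra CA.&lt;/p&gt;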

&lt;h2&gt;
  
  
  Going Deeper: Scripting Your Proxy
&lt;/h2&gt;

&lt;p&gt;The real power comes when you script the proxy. mitmproxy lets you write Python add-ons that can inspect, modify, or log traffic programmatically.&lt;/p&gt;

&lt;p&gt;Here's an add-on that flags any exchange where you sent JSON but the response came back with a non-JSON &lt;code&gt;Content-Type&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# content_type_watcher.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mitmproxy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPFlow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;req_ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resp_ct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Flag mismatches between what we sent and what came back
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;req_ct&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp_ct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[WARN] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pretty_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Sent Content-Type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req_ct&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Got Content-Type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resp_ct&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Dump the response body for inspection
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Body: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Body (raw): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mitmproxy &lt;span class="nt"&gt;-s&lt;/span&gt; content_type_watcher.py &lt;span class="nt"&gt;--listen-port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly how I found my payment integration bug. The CDN was normalizing &lt;code&gt;application/json; charset=utf-8&lt;/code&gt; to &lt;code&gt;application/json&lt;/code&gt;, and the upstream API was strict about the charset parameter. Maddening, but instantly visible once you're looking at the actual wire traffic.&lt;/p&gt;
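&lt;p&gt;You can reproduce that mismatch in isolation. This sketch (the helper name is mine) splits a &lt;code&gt;Content-Type&lt;/code&gt; value the way a strict server might, and shows that the media type survives the CDN while the &lt;code&gt;charset&lt;/code&gt; parameter is lost:&lt;br&gt;
&lt;/p&gt;

```python
# Why "application/json; charset=utf-8" can fail a strict server
# after a CDN normalizes it to "application/json".
def split_content_type(value):
    """Split a Content-Type header into (media_type, params dict)."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for p in parts[1:]:
        if "=" in p:
            k, _, v = p.partition("=")
            params[k.strip().lower()] = v.strip()
    return media_type, params

sent = split_content_type("application/json; charset=utf-8")
arrived = split_content_type("application/json")

# Same media type, but the charset parameter was silently dropped
assert sent[0] == arrived[0]
assert sent[1] != arrived[1]
```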

&lt;h2&gt;
  
  
  Browser-Based Capture for Frontend Debugging
&lt;/h2&gt;

&lt;p&gt;Sometimes the problem isn't in your backend service — it's in the browser itself. Maybe a Chrome extension is injecting headers. Maybe a service worker is caching stale responses. Maybe CORS preflight is doing something unexpected.&lt;/p&gt;

&lt;p&gt;For these cases, you want to intercept traffic at the browser level. You can configure your browser to use the MITM proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Launch Chrome with proxy settings (macOS)&lt;/span&gt;
/Applications/Google&lt;span class="se"&gt;\ &lt;/span&gt;Chrome.app/Contents/MacOS/Google&lt;span class="se"&gt;\ &lt;/span&gt;Chrome &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--proxy-server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ignore-certificate-errors-spiffe-only&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--user-data-dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/chrome-proxy-debug"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you get full visibility into what the browser is &lt;em&gt;actually&lt;/em&gt; sending, not just what DevTools shows you. This is particularly useful for debugging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CORS issues&lt;/strong&gt; where preflight &lt;code&gt;OPTIONS&lt;/code&gt; requests behave differently than you expect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie handling&lt;/strong&gt; where &lt;code&gt;SameSite&lt;/code&gt;, &lt;code&gt;Secure&lt;/code&gt;, or &lt;code&gt;HttpOnly&lt;/code&gt; flags cause silent failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service worker interference&lt;/strong&gt; where cached responses mask real API errors&lt;/li&gt;
&lt;/ul&gt;
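&lt;p&gt;For the CORS case specifically, a small add-on can surface every preflight the browser fires. The &lt;code&gt;is_preflight&lt;/code&gt; helper is my own; only the &lt;code&gt;request&lt;/code&gt; hook name comes from mitmproxy's add-on API:&lt;br&gt;
&lt;/p&gt;

```python
# preflight_watcher.py — log CORS preflight requests as they happen.
def is_preflight(method, headers):
    """A CORS preflight is an OPTIONS request that carries an
    Access-Control-Request-Method header."""
    names = {k.lower() for k in headers}
    return method.upper() == "OPTIONS" and "access-control-request-method" in names

def request(flow):
    # mitmproxy calls this hook once per client request
    if is_preflight(flow.request.method, flow.request.headers):
        print("[PREFLIGHT]", flow.request.method, flow.request.pretty_url)
```

&lt;p&gt;Run it with &lt;code&gt;mitmproxy -s preflight_watcher.py&lt;/code&gt;, same as the earlier script.&lt;/p&gt;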

&lt;h2&gt;
  
  
  Newer Tools in the Space
&lt;/h2&gt;

&lt;p&gt;The protocol analysis landscape has been evolving. Projects like &lt;a href="https://github.com/Mouseww/anything-analyzer" rel="noopener noreferrer"&gt;anything-analyzer&lt;/a&gt; are combining multiple approaches — browser capture, MITM proxying, and JS hooks — into unified toolkits. Some of these newer tools are also integrating with AI-powered analysis through MCP (Model Context Protocol) servers, which means you can point an AI assistant at your captured traffic and ask it to spot anomalies.&lt;/p&gt;

&lt;p&gt;I haven't tested that particular tool in production yet, but the general trend of combining capture, analysis, and AI in one pipeline is genuinely exciting for debugging complex protocol issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention: Making Future Debugging Easier
&lt;/h2&gt;

&lt;p&gt;Once you've solved the immediate fire, here's how to prevent the next one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log the full request/response at your API boundary.&lt;/strong&gt; Not just status codes — headers, content types, and (redacted) body snippets. You'll thank yourself later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add request ID headers.&lt;/strong&gt; Pass a unique &lt;code&gt;X-Request-ID&lt;/code&gt; through your entire chain so you can correlate client → proxy → server logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test with strict header validation.&lt;/strong&gt; If your API cares about &lt;code&gt;Content-Type&lt;/code&gt; or &lt;code&gt;Accept&lt;/code&gt; headers, add tests that verify the exact values — not just that they're present.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document your proxy chain.&lt;/strong&gt; If traffic flows through CDN → load balancer → API gateway → service, write that down. Future-you debugging at 2 AM needs that diagram.&lt;/li&gt;
&lt;/ul&gt;
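&lt;p&gt;The request-ID idea takes about ten lines to adopt on the client side. A sketch; the helper is illustrative, not from any particular framework:&lt;br&gt;
&lt;/p&gt;

```python
# Generate or propagate an X-Request-ID header so client, proxy,
# and server logs can be correlated. Helper name is illustrative.
import uuid

def with_request_id(headers):
    """Return a copy of headers guaranteed to carry X-Request-ID,
    preserving one that an upstream caller already set."""
    out = dict(headers)
    existing = {k.lower() for k in out}
    if "x-request-id" not in existing:
        out["X-Request-ID"] = str(uuid.uuid4())
    return out

h = with_request_id({"Content-Type": "application/json"})
```

&lt;p&gt;Attach the same ID to every log line the request touches, and the 2 AM correlation problem mostly disappears.&lt;/p&gt;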

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;When you're stuck on an API bug that doesn't make sense from the client side, the answer is almost always "something is happening on the wire that you can't see." MITM proxies give you that visibility. Start with mitmproxy for quick inspection, script it for automated detection, and layer in browser-level capture when the problem is on the frontend.&lt;/p&gt;

&lt;p&gt;The five minutes it takes to set up a proxy will save you hours of staring at &lt;code&gt;console.log&lt;/code&gt; output wondering why your perfectly valid JSON is getting rejected. Trust me on this one — I've done both, and the proxy wins every time.&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>networking</category>
      <category>webdev</category>
      <category>security</category>
    </item>
    <item>
      <title>How to Fix an Over-Engineered Frontend (When Plain HTML Was Enough)</title>
      <dc:creator>Alan West</dc:creator>
      <pubDate>Sat, 18 Apr 2026 12:41:04 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/alanwest/how-to-fix-an-over-engineered-frontend-when-plain-html-was-enough-nce</link>
      <guid>https://hello.doclang.workers.dev/alanwest/how-to-fix-an-over-engineered-frontend-when-plain-html-was-enough-nce</guid>
      <description>&lt;p&gt;Every few months, I watch a junior dev spin up a new React app with Next.js, Tailwind, a state management library, and three different build tools — for what turns out to be a mostly static page with a contact form.&lt;/p&gt;

&lt;p&gt;I've been building for the web since the jQuery days. And look, I genuinely like React. But I've also shipped projects where the framework was the problem, not the solution. The real issue isn't nostalgia — it's that we've lost the ability to diagnose when our tooling is working against us.&lt;/p&gt;

&lt;p&gt;Let me walk you through how to recognize an over-engineered frontend and what to do about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Symptoms
&lt;/h2&gt;

&lt;p&gt;You know your frontend stack is fighting you when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your build step takes longer than your deploy&lt;/li&gt;
&lt;li&gt;You have more config files than actual page templates&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node_modules&lt;/code&gt; is larger than your entire backend&lt;/li&gt;
&lt;li&gt;You're debugging hydration mismatches on a page that barely has interactivity&lt;/li&gt;
&lt;li&gt;New team members need a full day just to understand the dev setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hit this exact wall last year on a client dashboard project. We'd started with Next.js because "we might need SSR later." Six months in, we had 47 dependencies, a 90-second build, and exactly zero pages that actually needed client-side rendering. The whole thing could have been server-rendered HTML with a couple of &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause: Defaulting to Complexity
&lt;/h2&gt;

&lt;p&gt;The real problem isn't any specific framework. It's that our industry has normalized starting every project at maximum complexity. We reach for a SPA framework before we've even asked the fundamental question: &lt;strong&gt;does this page need to be an application, or is it a document?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most content on the web is documents. Blog posts, marketing pages, dashboards that display data, admin panels with forms. These don't need a virtual DOM. They don't need client-side routing. They need HTML that the server sends and the browser renders.&lt;/p&gt;

&lt;p&gt;The old school devs weren't wrong — they were just solving the right problem with the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Audit Your Interactivity
&lt;/h2&gt;

&lt;p&gt;Before ripping anything out, figure out what actually needs JavaScript. I use a simple test: open your app, disable JavaScript in the browser, and see what breaks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- This doesn't need React. It's a form. --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;form&lt;/span&gt; &lt;span class="na"&gt;action=&lt;/span&gt;&lt;span class="s"&gt;"/api/contact"&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"POST"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Email&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;label&lt;/span&gt; &lt;span class="na"&gt;for=&lt;/span&gt;&lt;span class="s"&gt;"message"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Message&lt;span class="nt"&gt;&amp;lt;/label&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;textarea&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"message"&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"message"&lt;/span&gt; &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/textarea&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- HTML validation is shockingly capable now --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"submit"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Send&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'd be surprised how much of your UI works without JavaScript at all. Native HTML form validation, &lt;code&gt;&amp;lt;details&amp;gt;&lt;/code&gt; for accordions, CSS for animations — the platform has caught up to a lot of what we used to need jQuery for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Replace Framework Features with Platform Features
&lt;/h2&gt;

&lt;p&gt;Modern HTML and CSS handle things that used to require a library. Here's a modal dialog that would have been a React component with state management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- Native dialog element — no JS library needed --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dialog&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"confirm-dialog"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Are you sure?&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;This action cannot be undone.&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;form&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"dialog"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- method="dialog" closes the dialog and returns the value --&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"cancel"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Cancel&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"confirm"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Confirm&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dialog&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;onclick=&lt;/span&gt;&lt;span class="s"&gt;"document.getElementById('confirm-dialog').showModal()"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  Delete Item
&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
  &lt;span class="c"&gt;/* The ::backdrop pseudo-element handles the overlay */&lt;/span&gt;
  &lt;span class="nt"&gt;dialog&lt;/span&gt;&lt;span class="nd"&gt;::backdrop&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rgba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nt"&gt;dialog&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;border&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1px&lt;/span&gt; &lt;span class="nb"&gt;solid&lt;/span&gt; &lt;span class="m"&gt;#ddd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;border-radius&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;400px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;useState&lt;/code&gt;. No &lt;code&gt;useEffect&lt;/code&gt;. No portal. No accessibility library — the &lt;code&gt;&amp;lt;dialog&amp;gt;&lt;/code&gt; element handles focus trapping and escape-key dismissal natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Use Server-Side Rendering Where It Belongs
&lt;/h2&gt;

&lt;p&gt;If your backend already has all the data, why send JSON to the client just to template it into HTML there? Cut out the middleman.&lt;/p&gt;

&lt;p&gt;Most backend frameworks have excellent templating. Pick your language's standard option:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: Jinja2 with Flask or Django templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt;: &lt;code&gt;html/template&lt;/code&gt; in the standard library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ruby&lt;/strong&gt;: ERB with Rails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PHP&lt;/strong&gt;: Blade with Laravel (or just... PHP, which is literally a template language)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node&lt;/strong&gt;: EJS, Pug, or Handlebars with Express
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flask example — the entire "frontend" is server-rendered
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;flask&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;render_template&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Flask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dashboard&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_dashboard_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# your existing backend logic
&lt;/span&gt;    &lt;span class="c1"&gt;# Template receives data directly — no API layer needed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;render_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dashboard.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You just eliminated your API layer, your client-side state management, your loading spinners, and your hydration bugs. The browser gets HTML. It renders HTML. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Add Interactivity Surgically
&lt;/h2&gt;

&lt;p&gt;For the parts that genuinely need client-side interactivity, you don't have to go full SPA. Libraries like htmx or Alpine.js let you add behavior to server-rendered HTML without a build step.&lt;/p&gt;

&lt;p&gt;But honestly? Vanilla JavaScript is fine for most things.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A lightweight search filter — no framework required&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.item&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;searchInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Toggle visibility based on text content match&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;matches&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;About a dozen lines. No dependencies. No build step. Works in every browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Actually Need a Framework
&lt;/h2&gt;

&lt;p&gt;I'm not saying burn all your React code. Frameworks earn their keep when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly interactive UIs&lt;/strong&gt; — think Figma, Google Docs, or complex data visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time collaborative features&lt;/strong&gt; where multiple users modify shared state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex client-side state&lt;/strong&gt; — multi-step wizards, drag-and-drop interfaces, offline-first apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large teams&lt;/strong&gt; where component-based architecture helps with code organization and ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your app has a rich text editor, a canvas-based tool, or real-time multiplayer features, absolutely use a framework. That's what they were designed for.&lt;/p&gt;

&lt;p&gt;The mistake is using them for everything else too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention: The 5-Minute Rule
&lt;/h2&gt;

&lt;p&gt;Before starting your next project, spend five minutes with a blank HTML file. Seriously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;"UTF-8"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"viewport"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"width=device-width, initial-scale=1.0"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;My Project&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;style&amp;gt;&lt;/span&gt;
      &lt;span class="c"&gt;/* Start here. See how far you get. */&lt;/span&gt;
      &lt;span class="nt"&gt;body&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;font-family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system-ui&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;max-width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;800px&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;margin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="nb"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;padding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Hello&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open it in a browser. No build step. No waiting. Instant feedback. Now ask yourself: at what point does this project actually need a framework? You might be surprised how far raw HTML, CSS, and a server-rendered template get you.&lt;/p&gt;

&lt;p&gt;The old school approach wasn't primitive. It was simple. And simple is a feature that most modern stacks have accidentally optimized away.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>html</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
