DEV Community: Charles

How to Extract Structured Data from Any Website Using AI Extraction

Charles — Tue, 19 May 2026 03:27:20 +0000

Traditional web scraping means writing selectors. One CSS class change and everything breaks.

AI extraction changes this.

The Old Way

// Fragile: depends on HTML structure
const title = document.querySelector(".product-title h1 span").innerText;
const price = document.querySelector(".price-amount .current").innerText;

The New Way

// Robust: describe what you want
const result = await client.scrape({
  url: "https://example.com/product",
  extraction: { mode: "llm", schema: { title: "Product name", price: "Current price in USD", rating: "Average rating out of 5" } }
});

Benefits

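Selector-free: No CSS selectors to maintain
Structure-proof: Keeps working through most redesigns, as long as the data is still on the page
Flexible: Change what you extract by editing the schema, not the code
Context-aware: The LLM reads the page for meaning, so it can tell the current price from a crossed-out one (still spot-check the output)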
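
Extracted values are still model output, so it helps to validate them before they reach your dataset. Here is a minimal sketch, assuming the result carries the three fields from the schema above as strings. normalizeProduct is a hypothetical helper, not part of any XCrawl SDK.

```javascript
// Hypothetical helper: checks and normalizes fields returned by an
// LLM extraction. The input shape mirrors the schema above and is an
// assumption, not a documented response format.
function normalizeProduct(raw) {
  const errors = [];

  const title = typeof raw.title === "string" ? raw.title.trim() : "";
  if (title === "") errors.push("title is missing");

  // Strip currency symbols and thousands separators: "$1,299.00" -> 1299
  const price = Number(String(raw.price ?? "").replace(/[^0-9.]/g, ""));
  if (!Number.isFinite(price) || price <= 0) errors.push("price is not a positive number");

  // parseFloat tolerates trailing text such as "4.5 out of 5"
  const rating = Number.parseFloat(String(raw.rating ?? ""));
  if (!Number.isFinite(rating) || rating < 0 || rating > 5) errors.push("rating is outside 0-5");

  return { ok: errors.length === 0, errors, value: { title, price, rating } };
}
```

Rows where ok is false can go to a review queue instead of silently polluting the data.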

Try AI extraction with XCrawl API

Web Scraping vs APIs: When to Use Which (And Why)

Charles — Tue, 19 May 2026 03:26:48 +0000

Every developer faces this choice. Here's my framework.

Use an API when:

The site offers a public API with good documentation
You need structured data (JSON, not HTML)
Rate limits are reasonable (>100 req/hour)
You don't need real-time data

Use Web Scraping when:

The site has no public API
The API is rate-limited or costly
You need data not exposed through the API
The site's data is rendered client-side (SPA)
You need historical/diff data over time

The Hybrid Approach

Many projects need both:

Use the API when possible (faster, more reliable)
Fall back to scraping when the API doesn't have what you need
Use scraping tools that look like APIs

Real Example: E-Commerce Price Monitoring

API approach: Amazon's Product Advertising API — limited data, requires approval, request-based pricing.

Scraping approach: Directly scrape product pages — get every data point, no approval needed, pay per page.

Best approach: A scraping API that abstracts the complexity while giving you API-like simplicity.

XCrawl gives you API-like simplicity with web scraping power: dash.xcrawl.com

5 Web Scraping Mistakes That Cost You Time and Money

Charles — Tue, 19 May 2026 03:26:48 +0000

After building hundreds of scrapers, these are the most expensive mistakes I see developers make.

Mistake 1: Building Your Own Proxy Infrastructure

You think: "I'll buy some proxies and rotate them myself."
Reality: You spend 2 weeks building, 2 hours/week maintaining, and $200/month on proxy services.

Cost: $200/month + roughly 9 hours/month of maintenance, on top of the 2-week build
Better: Use a scraping API ($49-99/month, zero maintenance)

Mistake 2: No Error Handling

Your scraper works on 80% of pages. The other 20% fail silently. You don't notice until your dataset has holes.

Fix: Always wrap in try/catch. Log every failure. Alert on >10% error rate.
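
A minimal sketch of that fix, assuming you collect outcomes with Promise.allSettled. The 10% threshold comes from the rule above; summarizeRun is an illustrative name.

```javascript
// Summarize a batch of Promise.allSettled outcomes and decide whether to alert.
// `urls[i]` is the URL that produced `settled[i]`.
function summarizeRun(urls, settled, threshold = 0.1) {
  const failures = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "rejected") {
      failures.push({ url: urls[i], reason: String(outcome.reason?.message ?? outcome.reason) });
    }
  });
  const errorRate = settled.length === 0 ? 0 : failures.length / settled.length;
  // Alert only when the error rate is strictly above the threshold
  return { total: settled.length, failures, errorRate, shouldAlert: errorRate > threshold };
}
```

Log the failures array on every run and send a notification whenever shouldAlert is true.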

Mistake 3: Ignoring Robots.txt

Keep scraping paths a site has disallowed and the operators will notice. They tighten their CDN rules, and your IP can end up blocked for good.

Fix: Check robots.txt first. Respect crawl-delay directives.
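
A simplified sketch of that check. It only reads the "User-agent: *" group and only handles Disallow prefixes and Crawl-delay; Allow rules, wildcards, and agent-specific groups are left out on purpose, so treat it as a starting point rather than a compliant parser. Note that Crawl-delay is a common extension, not part of the core robots.txt standard.

```javascript
// Simplified robots.txt reader (see the caveats above).
function parseRobots(text) {
  const rules = { disallow: [], crawlDelay: null };
  let inStarGroup = false;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group
      inStarGroup = (lastWasAgent && inStarGroup) || value === "*";
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (!inStarGroup || value === "") continue;
    if (field === "disallow") {
      rules.disallow.push(value);
    } else if (field === "crawl-delay") {
      const seconds = Number(value);
      if (Number.isFinite(seconds)) rules.crawlDelay = seconds;
    }
  }
  return rules;
}

// True when no Disallow prefix matches the path.
function isAllowed(rules, path) {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

Fetch /robots.txt once per host, cache the parsed rules, and use crawlDelay (in seconds) as the minimum gap between requests when it is set.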

Mistake 4: Writing One Big Script

A 500-line scraper with no functions. Good luck debugging when it breaks.

Fix: Modular design. Separate the pieces: fetcher, parser, storage, notification.

Mistake 5: No Rate Limiting

You send 100 requests/second. The site blocks you after 10 seconds.

Fix: Add delays. 1-3 seconds between requests. Use exponential backoff on 429s.
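
One way to compute the backoff delay, as a sketch. The 1 s base, the 60 s cap, and the half-fixed, half-random split are illustrative choices, not values any particular site requires.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// Half of the delay is fixed and half is random, so parallel workers
// do not all retry at the same instant.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}
```

Call it with the number of 429s you have seen in a row for that host, and reset the counter after the first success.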

Avoid these mistakes: XCrawl API

How to Scrape 1000 Pages Per Day Without Getting Banned

Charles — Tue, 19 May 2026 02:53:13 +0000

Scaling from 10 pages to 1000 pages per day is where most scrapers fail. Here's how to do it right.

The Golden Rule

Look like a human, not a bot.

Bots are detected by patterns, not volume. Spread across real visitors, 1000 page views a day would:

Click on things
Scroll at varied speeds
Spend random time on each page
Come from different IPs
Use different user agents

Proxy Pool

You need at least 10-20 IPs for 1000 pages/day. A DIY pool costs $50-200/month; scraping APIs include one.
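
If you do run your own pool, rotation can be as simple as round-robin. A minimal sketch; the helper names are made up for illustration, and the proxy URLs you pass in come from your provider.

```javascript
// Round-robin rotation over a fixed proxy pool.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error("proxy pool is empty");
  let next = 0;
  return function nextProxy() {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;
    return proxy;
  };
}

// Daily load per IP: 1000 pages over 20 IPs is 50 requests per IP per day.
function requestsPerIp(pagesPerDay, poolSize) {
  return Math.ceil(pagesPerDay / poolSize);
}
```

Round-robin keeps the load even across IPs. It does nothing about a proxy that is already blocked, so pair it with the error handling below.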

Request Patterns

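// Runs inside an async function. pages is your URL list, scrape() your own fetch function.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Bad: Mechanical timing
for (const page of pages) { await scrape(page); await sleep(2000); }

// Good: Human-like timing (1.5 s plus a random 1-3 s)
for (const page of pages) { await scrape(page); await sleep(1500 + 1000 + Math.random() * 2000); }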

Concurrency

Run 3-5 parallel requests. More than that tends to trigger rate limiting.

Error Handling

429: Back off 30-60s
403: Rotate IP
503: Try later
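
The same table as code, in a small sketch. The action names and the 5-minute wait for 503 are illustrative; the list above only says "try later".

```javascript
// Maps an HTTP status code to the reaction listed above.
function decideAction(status, random = Math.random) {
  switch (status) {
    case 429: // Too Many Requests: back off 30-60 s
      return { action: "wait", delayMs: 30000 + Math.floor(random() * 30000) };
    case 403: // Forbidden: this IP is probably flagged
      return { action: "rotate-ip", delayMs: 0 };
    case 503: // Service Unavailable: put the URL back in the queue
      return { action: "requeue", delayMs: 5 * 60 * 1000 };
    default:
      return { action: status >= 200 && status < 300 ? "ok" : "fail", delayMs: 0 };
  }
}
```

Keeping this decision in one function means the fetch loop stays short and the policy is easy to change.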

Sample Pipeline

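const { XcrawlScraper } = require('xcrawl-scraper');
const client = new XcrawlScraper({ apiKey: "YOUR_KEY" });
const urls = [/* ... 1000 URLs */];

async function main() {
  for (let i = 0; i < urls.length; i += 5) {
    const batch = urls.slice(i, i + 5); // 5 requests in parallel
    const results = await Promise.allSettled(
      batch.map(url => client.scrape({ url, js_render: true }))
    );
    // Log failures instead of dropping them silently
    results.forEach((r, j) => { if (r.status === "rejected") console.error(batch[j], r.reason); });
    // Pause 3-8 s between batches
    await new Promise(r => setTimeout(r, 3000 + Math.random() * 5000));
  }
}

main();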

Scale your scraping with XCrawl API

The Complete Guide to Web Scraping E-Commerce Sites in 2026

Charles — Tue, 19 May 2026 02:52:26 +0000

E-commerce scraping is the most common — and most difficult — scraping task. Here's the complete playbook.

Why E-Commerce is Hard

Anti-bot protection: Amazon, Walmart, Target all use aggressive bot detection
Dynamic content: Product data is often loaded by JavaScript after the initial HTML arrives
Rate limits: Aggressive throttling after N requests
Session tracking: Behavioral analysis tracks mouse movements and scroll patterns

Step-by-Step Strategy

Step 1: Choose Your Approach

Approach	Best For	Difficulty
API	Simple sites, small scale	Easy
Headless Browser	JS-rendered, moderate scale	Medium
Scraping API	Any site, any scale	Easy (just configure)

Step 2: Handle Product Pages

Key data to extract:

Title, price, availability
Reviews and ratings
Specifications
Images (URLs)
SKU/ASIN

Step 3: Handle Pagination

Most e-commerce sites paginate. Solutions:

URL parameter cycling (?page=1, ?page=2)
"Show More" button clicking (requires headless browser)
Infinite scroll (requires headless browser)
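
URL parameter cycling is the only one of the three that needs no browser. A minimal sketch using the standard URL API; the parameter name "page" follows the example above, and real sites may use something else (offset, p, start).

```javascript
// Builds ?page=1..N URLs while preserving any other query parameters.
function pageUrls(baseUrl, lastPage, param = "page") {
  const urls = [];
  for (let page = 1; page <= lastPage; page += 1) {
    const url = new URL(baseUrl);
    url.searchParams.set(param, String(page));
    urls.push(url.toString());
  }
  return urls;
}
```

If you don't know the last page up front, generate pages one at a time and stop at the first page that returns no products.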

Step 4: Handle Variants

Products come in colors, sizes, models. Each variant has a different SKU and often a different URL.

Step 5: Scale

Use concurrent requests (5-10 parallel), rotate proxies, add random delays.

Quick Start with XCrawl

const { XcrawlScraper } = require('xcrawl-scraper');
const client = new XcrawlScraper({ apiKey: 'YOUR_KEY' });

// CommonJS has no top-level await, so run inside an async function
async function main() {
  const product = await client.scrape({
    url: 'https://amazon.com/dp/EXAMPLE',
    js_render: true,
    proxy: { country: 'US' },
    extraction: {
      mode: 'llm',
      schema: { title: 'string', price: 'string', rating: 'number' }
    }
  });
  console.log(product);
}

main().catch(console.error);

Scrape e-commerce sites reliably: XCrawl API

Why Your Production Web Scraper Keeps Breaking (And How to Fix It)

Charles — Tue, 19 May 2026 02:52:15 +0000

You built a scraper. It worked for a week. Then it broke. You fixed it. It broke again.

This is the lifecycle of every DIY web scraper in production.

The Top 5 Failure Modes

1. HTML Structure Changes

A dev on the target site changes a class name. Your .product-price selector breaks.

Fix: Use semantic selectors (data attributes, text content) instead of CSS classes.

2. IP Blocks

Your scraper sends too many requests from one IP. The CDN blocks you.

Fix: Proxy rotation. Every request from a different IP.

3. Rate Limiting

You hit 429 Too Many Requests. Backoff logic is mandatory.

Fix: Implement exponential backoff, and keep a baseline gap of 1-5s between requests.
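
A minimal retry loop with exponential waits, as a sketch. withRetry is an illustrative name. It assumes fn throws on a retryable failure such as a 429, and the retry count and base delay are arbitrary defaults.

```javascript
// Retries an async operation, doubling the wait after each failure.
// `sleep` is injectable so the loop can be tested without real waiting.
async function withRetry(fn, { retries = 4, baseMs = 1000, sleep = (ms) => new Promise((r) => setTimeout(r, ms)) } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      // Waits 1 s, 2 s, 4 s, 8 s with the defaults; no wait after the final attempt
      if (attempt < retries) await sleep(baseMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

In practice you would only retry on 429 and 5xx responses and rethrow everything else right away.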

4. JavaScript Rendered Content

The site switched from SSR to CSR. Suddenly requests.get() returns an empty shell.

Fix: Use js_render: true in your scraping API (like XCrawl).

5. CAPTCHA Walls

After N requests, Google reCAPTCHA appears. Game over for simple scrapers.

Fix: CAPTCHA solving services or — better — use an API that handles this.

The Reliable Stack

JS rendering — Always-on headless browser
Proxy rotation — Residential IP pool
Retry logic — Automatic retry on failure
Alert monitoring — Know when things break

Building all this yourself? Expect 2-4 hours/week of maintenance.

Using a scraping API? Set it and forget it.

Try a production-ready scraping API: XCrawl

Web Scraping 101: What Every Developer Should Know Before Writing Their First Scraper

Charles — Tue, 19 May 2026 02:51:33 +0000

Before you write your first scraper, here's what you need to know.

The Three Hard Problems

1. JavaScript Rendering

Many modern websites are SPAs. A plain curl or requests call returns the HTML shell, not the rendered content.

Solution: Use a headless browser or an API that handles JS rendering automatically.

2. Anti-Bot Protection

Cloudflare, DataDome, PerimeterX — these actively block scrapers. You need:

Residential proxy rotation
Browser fingerprint spoofing
CAPTCHA solving

3. Rate Limiting

Scrape too fast? You get blocked. Too slow? Takes forever.

Tools Compared

Tool	JS Rendering	Proxies	Cost	Learning Curve
Puppeteer	✅ Built-in	❌ Manual	Free	Medium
Playwright	✅ Built-in	❌ Manual	Free	Medium
Scrapy	❌ (needs Splash or scrapy-playwright)	❌ Manual	Free	High
XCrawl API	✅ Auto	✅ Auto	$$	Low

My Advice

Start with a simple API. If a page gives you the HTML, use cheerio. If it blocks you, upgrade to an API that handles the hard parts. Don't build your own proxy infrastructure — it's not worth the time.

Built with XCrawl API

The Hidden Cost of DIY Web Scrapers: Why Your Time is Better Spent on APIs

Charles — Tue, 19 May 2026 02:26:31 +0000

Every developer has built one. The "simple" scraper that started as 20 lines of Python... and turned into a maintenance nightmare.

Development Time: 4-8 Hours

Writing selectors, handling pagination, dealing with auth — a "quick" scraper takes a full day.

Maintenance: 2-4 Hours/Week

Websites change their HTML. Your scraper breaks.

Infrastructure: $20-100/month

Proxies, headless browsers, server costs add up fast.

Hidden Costs

CAPTCHA solving: $0.50-2 per 1000 solves
IP blocks = lost data
Debugging non-deterministic failures

The Math

	DIY	API
Setup	4-8 hrs	5 min
Weekly maint.	2-4 hrs	0
Monthly cost	$50-200	$8-49
Reliability	60-80%	99%+

Your time as a developer is valuable. Use an API and focus on what matters.
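
To put a dollar figure on the table, here is the arithmetic as a sketch. The $60/hour rate is an assumption for illustration; the hours and monthly costs are the low ends of the ranges above.

```javascript
// Monthly cost = maintenance hours * hourly rate + fixed monthly spend.
const WEEKS_PER_MONTH = 52 / 12;

function monthlyCost({ maintHoursPerWeek, hourlyRate, fixedMonthly }) {
  return Math.round(maintHoursPerWeek * WEEKS_PER_MONTH * hourlyRate + fixedMonthly);
}

// Low-end DIY: 2 hrs/week of maintenance plus $50/month of infrastructure
const diy = monthlyCost({ maintHoursPerWeek: 2, hourlyRate: 60, fixedMonthly: 50 });
// Low-end API: no maintenance, $8/month subscription
const api = monthlyCost({ maintHoursPerWeek: 0, hourlyRate: 60, fixedMonthly: 8 });

console.log({ diy, api });
```

With these inputs the DIY route comes to about $570 a month against $8 for the API. Plug in your own rate to see where the break-even sits.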

XCrawl API

From API Key to Production: Setting Up a Web Scraping Pipeline in 10 Minutes

Charles — Tue, 19 May 2026 02:25:28 +0000

You've got a scraping API key. Now what?

Here's the fastest path from zero to a running scraping pipeline.

Step 1: Get Your API Key

Step 2: Install the SDK

npm install xcrawl-scraper

Step 3: Write Your First Pipeline

const { XcrawlScraper } = require('xcrawl-scraper');

const scraper = new XcrawlScraper({ apiKey: process.env.XCRAWL_API_KEY });

async function monitorPrices() {
  const products = ['https://amazon.com/dp/B0EXAMPLE1', 'https://amazon.com/dp/B0EXAMPLE2'];

  const results = await Promise.all(products.map(url => 
    scraper.scrape({ url, js_render: true, proxy: { country: 'US' } })
  ));

  console.log(results);
}

// Run the pipeline and surface any failure
monitorPrices().catch(console.error);

Step 4: Schedule It

# Run every 6 hours via cron (absolute log path, stderr included)
0 */6 * * * node /path/to/monitor.js >> /path/to/prices.log 2>&1

Step 5: Scale

Add retry logic, error handling, and CSV export. The API handles rate limits automatically.
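
For the CSV export, a minimal serializer is enough to start with. This sketch follows the usual CSV quoting rules (RFC 4180); the row shape with url, title, and price fields is an assumption, since the actual response format is not shown here.

```javascript
// Fields containing a comma, quote, or line break are wrapped in quotes,
// and embedded quotes are doubled.
function toCsv(rows, columns) {
  const escape = (value) => {
    const text = value == null ? "" : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [columns.map(escape).join(",")];
  for (const row of rows) {
    lines.push(columns.map((col) => escape(row[col])).join(","));
  }
  return lines.join("\r\n") + "\r\n";
}
```

Write the returned string with fs.writeFileSync, or append to an existing file and skip the header line on later runs.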

Get started: dash.xcrawl.com

Web Scraping for Data Science: How to Build Datasets Without Writing Spaghetti Code

Charles — Tue, 19 May 2026 02:25:00 +0000

Every data scientist hits this wall: you find an amazing dataset source on the web, but it's behind paginated pages, dynamic JavaScript, or — worst of all — a CAPTCHA wall.

The Problem

Traditional scraping for data science looks like this:

import requests
from bs4 import BeautifulSoup
import time

response = requests.get('https://example.com/data')
soup = BeautifulSoup(response.text, 'html.parser')
# ... 50 lines of fragile selector logic ...
# ... oh wait, the page uses JS rendering ...
# ... and now I'm blocked ...

The API-First Approach

Modern web scraping APIs abstract away the infrastructure headaches:

curl -X POST https://api.xcrawl.com/v1/scrape \
  -H "x-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/data", "js_render": true}'

Building a Real Dataset: 1000 GitHub Repos

Here's how to build datasets using XCrawl:

Search

const { XcrawlScraper } = require('xcrawl-scraper');
const client = new XcrawlScraper({ apiKey: 'YOUR_KEY' });
const results = await client.search({ query: 'site:github.com topics', count: 100 });

Scrape & Extract

const data = await client.scrape({
  url: results[0].url,
  js_render: true,
  extraction: { mode: 'llm', schema: { repo_name: 'string', stars: 'number' } }
});

Export

xcrawl search "site:github.com" --count 1000 --output repos.csv

Why This Matters

Approach	Time	Lines of Code	Maintenance
DIY Scraper	2-4 hours	100-200	High (breaks weekly)
API-First	5-10 minutes	10-20	None

Get started with XCrawl API at dash.xcrawl.com

Introducing xcrawl-cli: A Command-Line Web Scraper in Your Terminal

Charles — Tue, 19 May 2026 02:24:14 +0000

In my previous posts, I covered the xcrawl-scraper npm package and the XCrawl API. Today, I want to show you xcrawl-cli — the command-line interface that puts web scraping power directly in your terminal.

What is xcrawl-cli?

xcrawl-cli is a Node.js CLI tool that wraps the XCrawl API into simple terminal commands. No code required — just pipe URLs and get structured data.

Installation

npm install -g xcrawl-cli

Quick Start

Scrape a single page

xcrawl scrape https://example.com --format markdown

Search the web

xcrawl search "latest AI news" --count 10

Output to file

xcrawl search "Python tutorials" --output results.json

Features

Zero config — Just install and run
Multiple output formats — JSON, CSV, Markdown
Smart retry — Automatic retry with JS rendering when pages block you
Concurrent scraping — Up to 5 parallel requests
Proxy rotation — Residential proxies included

Real-World Example: Monitor HN Front Page

xcrawl search "site:news.ycombinator.com" --count 20 --output hn.json

Why Terminal?

Not every scraping task needs a full script. Sometimes you just want to:

Quickly grab page content for debugging
Test a search query before writing code
Schedule a crawl via cron

xcrawl-cli is for those moments.

Built on XCrawl API, which handles JS rendering, CAPTCHAs, and IP blocks automatically.

XCrawl vs Puppeteer vs Playwright: Which Web Scraping Tool Saves You More Time in 2026?

Charles — Tue, 19 May 2026 01:18:27 +0000

The Web Scraping Toolkit Spectrum

Let's be real: there are dozens of ways to scrape the web, from raw curl to full-blown browser automation frameworks. But when it comes to JavaScript-rendered pages, most developers reach for one of three tools: Puppeteer, Playwright, or XCrawl.

Here's a no-BS comparison.

1. Puppeteer (Google)

Best for: Chrome-first browser testing and scraping

const puppeteer = require('puppeteer');

// require() means CommonJS, which has no top-level await
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const text = await page.evaluate(() => document.body.innerText);
  console.log(text);
  await browser.close();
})();

Pros: Mature ecosystem, lots of examples
Cons:

Chrome and Firefox only (no WebKit)
No built-in proxy rotation
No CAPTCHA solving
You manage the browser lifecycle yourself
Memory-heavy per instance

2. Playwright (Microsoft)

Best for: Cross-browser testing and scraping

import { chromium } from 'playwright'
const browser = await chromium.launch()
const page = await browser.newPage()
await page.goto('https://example.com')
const text = await page.innerText('body')
await browser.close()

Pros: Multi-browser, modern API, auto-wait
Cons:

Still no built-in proxy management
No CAPTCHA handling
Same memory concerns as Puppeteer
You need a proxy service on top

3. XCrawl (Proxy API + SDK)

Best for: Production scraping without infrastructure overhead

import { XCrawl } from 'xcrawl-scraper'

const client = new XCrawl({ apiKey: 'your-key' })

const result = await client.scrapeMarkdown({
  url: 'https://example.com',
  proxyLocation: 'us',
  extractJson: true
})

Pros:

Zero infrastructure - No browser process to manage
Built-in proxy rotation - Residential + datacenter IPs
CAPTCHA bypass - Automatic
AI Extraction - extractJson() extracts structured data
Sticky sessions - Keep the same IP for multi-page crawls
SDK + CLI - Works in Node.js and command line

Cons:

Paid beyond the free tier
Depends on external API (not self-hosted)

Quick Comparison Table

Feature	Puppeteer	Playwright	XCrawl
Browser Management	Manual	Manual	Auto (cloud)
Proxy Rotation	DIY	DIY	Built-in
CAPTCHA Solving	No	No	Yes
AI Extraction	No	No	Yes
Memory Usage	High	High	Minimal (runs in the cloud)
Price	Free	Free	Free tier + paid
Multi-browser	Chrome, Firefox	Chromium, Firefox, WebKit	N/A (cloud)

When to Use What

Local testing / one-off scripts: Puppeteer or Playwright (free, local)
Production scraping at scale: XCrawl (no infra, proxy rotation built-in)
Cross-browser testing: Playwright (it's literally made for this)
Need structured data extraction: XCrawl (AI Extraction saves weeks of parsing)

The Bottom Line

If you're building a serious data pipeline that needs to run 24/7 at scale, you'll spend more time managing Puppeteer/Playwright infrastructure than actually writing logic. XCrawl removes that overhead entirely.

Try it: dash.xcrawl.com (free tier - 1000 credits)
SDK: github.com/yanxvdong123/xcrawl-scraper
npm: npm install xcrawl-scraper