NO CONTEXT WITHOUT CONSENT
⠀⠀⠀⠀⢀⣶⡖⢦⣤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠙⠋⠹⣦⠘⠉⠳⢤⣤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠙⠛⠓⠶⣶⣶⣿⣛⣀⣰⣆⣀⣀⠉⠛⢶⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠈⠉⠉⠉⠉⠁⠉⠉⠙⢷⣦⡀⠈⠻⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡤⠶⢶⣶⣾⣿⡀⠀⢸⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢠⡞⠋⠀⣠⣶⣖⣻⣿⠟⠀⠀⣸⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢻⣄⠀⠀⠈⠙⠛⠉⠁⠀⣠⣾⡟⠉⢷⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢀⣠⠿⠿⣿⣿⣿⡿⠿⢿⣿⣯⣿⡇⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⣰⠟⠁⠀⣀⣤⣤⣤⣤⣤⣤⣼⣿⣿⠇⠀⢈⣟⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⢰⡏⠀⠀⣾⣏⠁⠀⠀⠀⢀⣀⣴⡿⠋⠀⠀⣼⠟⣦⡀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠈⢷⣄⡀⠈⠙⠛⠛⠛⠛⠋⠉⠁⢀⣠⣴⣿⡟⠀⠸⣇⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⣈⣽⣿⣶⣶⣶⣶⣶⣶⣾⣿⣯⣍⠀⠘⡗⠀⠀⣿⠀⠀⠀⠀⠀⠀
⠀⠀⠀⣰⠞⠉⠉⠀⢀⣀⣀⣀⣀⡀⠀⠀⠹⣿⣷⣼⡇⠀⢀⡿⠀⠀⠀⠀⢠⢦
⠀⠀⣾⠋⠀⠀⢰⡿⠛⠉⠉⠉⠉⠙⠿⠷⣦⣼⣿⠏⠀⢀⣾⡃⠀⠀⠀⣀⡿⣹
⠀⠀⢿⡀⠀⠀⠘⢿⣦⣄⣀⣤⣤⣤⣤⡶⠿⠋⠁⠀⣠⣾⡏⠛⠓⠒⠛⣩⡴⠃
⠀⠀⠈⠻⣦⣀⡀⠀⠀⠈⠉⠉⠀⠀⠀⠀⠀⣀⣴⡾⠛⠛⠿⠶⠶⠶⠞⠋⠀⠀
⠀⠀⠀⠀⠈⠛⠿⠿⠿⠿⣿⣶⡶⠶⠶⠿⠟⠛⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
NO TOKEN TAX WITHOUT SEMANTIC COMPRESSION
I built a free software web crawler that chains services together with automatic fallback. The goal: price any recipe, anywhere.
FireCrawl misses pages. ScrapingBee times out. Browserless gets blocked. Every crawling service has gaps, so the architecture starts with zero API costs and escalates only when necessary.
Most crawlers force you to pick one service and hope for the best. This one tries the cheap option first, then escalates.
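The idea in shell terms. A sketch, not ar-crawl's actual Racket internals; render_locally and the SCRAPE_API_* variables are hypothetical placeholders:
# Cheapest option first, escalate only when it fails.
fetch_page() {
  url="$1"; out="$2"
  # 1. Plain HTTP fetch: zero cost, good enough when the HTML arrives pre-rendered.
  if curl -fsSL --max-time 15 "$url" -o "$out" && grep -qi '<body' "$out"; then
    return 0
  fi
  # 2. Local headless browser: still free, just slower.
  #    render_locally stands in for a local Playwright render step.
  if render_locally "$url" > "$out" 2>/dev/null; then
    return 0
  fi
  # 3. Paid scraping API: the only step that costs money.
  #    SCRAPE_API_URL and SCRAPE_API_KEY are hypothetical placeholders.
  curl -fsSL "${SCRAPE_API_URL}?api_key=${SCRAPE_API_KEY}&url=${url}" -o "$out"
}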
Built in Racket and open source on Codeberg. Install it on macOS or Linux:
curl -fsSL https://files.anuna.io/ar-crawl/latest/install.sh | bash
MCP (Model Context Protocol) tools let agents navigate pages directly. They work, as long as you're happy to burn money.
Every navigate → wait → snapshot → click cycle bloats the context window. Accessibility snapshots alone can exceed 10k tokens per page. Crawl 100 pages and you've consumed millions of tokens on browser state before any reasoning begins.
The alternative that I've built in ar-crawl separates crawling from comprehension:
probe → crawl → sample → extract
Give the LLM agent only what it needs, instead of making it parse irrelevant state.
Woolworths and Coles control over 65% of Australian grocery retail. Neither offers a public API for pricing. Their sites are heavily JavaScript-dependent, block common scraping tools, and change structure frequently. Price transparency is not in a duopoly's interest. The ACCC has recommended mandatory live price APIs, but until that happens, making this data accessible requires tooling that can adapt. Here's how to get this price data with ar-crawl:
Probe the page. Does it need JS rendering?
ar-crawl probe https://woolworths.com.au/shop/browse/fruit-veg
=== Page Load Metrics ===
Timing:
DOM Content Loaded: 895 ms
Page Load Complete: 2968 ms
Network Idle: 6299 ms
JS Execution (est): 2073 ms
=== Content Analysis ===
Content Type: Dynamic/SPA
Recommendation: Use -s playwright for full content
=== Recommended Scraping Parameters ===
--pw-delay 8500
--pw-scroll-delay 2500
--timeout 30000
The tool detects a dynamic SPA and recommends Playwright.
Crawl the category. The crawl-site command starts from a seed URL, discovers links, and follows them within domain boundaries. It maintains a queue with deduplication, respects the URL pattern filter, and stops at the page limit. Content is persisted to JSON or SQLite, so the agent can re-query without re-crawling:
ar-crawl crawl-site https://woolworths.com.au/shop/browse/fruit-veg \
-s playwright --verbose --output woolies.json \
--max-pages 50 --url-pattern ".*fruit-veg.*"
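If the crawl fires before the SPA finishes rendering, fold in the parameters probe recommended above. The flags come straight from its output, so presumably crawl-site accepts them; verify against the tool's help:
ar-crawl crawl-site https://woolworths.com.au/shop/browse/fruit-veg \
    -s playwright --verbose --output woolies.json \
    --max-pages 50 --url-pattern ".*fruit-veg.*" \
    --pw-delay 8500 --pw-scroll-delay 2500 --timeout 30000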
Sample one page from the crawl results. The agent inspects this HTML to identify XPath selectors for the content we want to extract:
ar-crawl sample woolies.json --length 10000
The agent reads the sample, identifies the product container and field patterns, then constructs the extraction command.
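For illustration, suppose the sample reveals tiles shaped like this; a hypothetical fragment, the live markup will differ:
<div class="product-tile">
  <h3>Bananas</h3>
  <span class="price">$2.90 per kg</span>
</div>
The parent XPath targets the repeating container; each field XPath resolves relative to one tile.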
Extract structured data to SQLite using XPath selectors for the product tile container, name, and price:
ar-crawl extract woolies.json \
--output prices.db --format sqlite \
--parent "//div[@class='product-tile']" \
--fields '{"name": ".//h3", "price": ".//span[@class=\"price\"]"}'
Query the resulting database:
sqlite3 prices.db "SELECT name, price FROM extracted_items LIMIT 5"
Bananas | $2.90 per kg
Carrots 1kg | $2.00
Broccoli | $2.90 each
Royal Gala Apples | $4.90 per kg
Brown Onions 1kg | $1.90
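And since the stated goal is pricing recipes, a shopping list becomes a query. A rough sketch against the columns shown above; prices are stored as display text, so real arithmetic needs a normalisation pass first:
sqlite3 prices.db "SELECT name, price FROM extracted_items
  WHERE name LIKE '%banana%' OR name LIKE '%onion%' OR name LIKE '%carrot%'"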
An agent runs four shell commands. Gets a SQLite database. Queries it directly. One HTTP call per page internally; millions of tokens saved externally.
The latest addition to ar-crawl is interactive session recording. The session command starts a Playwright browser that an LLM agent controls via JSON commands. Every action is recorded in Chrome DevTools Recorder format.
ar-crawl session
{"sessionId":"abc-123","status":"ready"}
The agent sends JSON actions and receives structured responses:
{"type": "goto", "url": "https://example.com"}
{"success":true,"url":"https://example.com/","title":"Example"}
{"type": "click", "selector": "#login"}
{"success":true,"url":"https://example.com/login","title":"Login"}
State queries filter what the agent sees. MCP Playwright returns ~20k tokens per accessibility snapshot. A single wait_for command consumes more context than some local models can hold. Five to ten navigation steps can exhaust Claude Desktop's token budget. With ar-crawl, the agent requests only what it needs, a 1000x reduction:
state # ~20 tokens: URL + title only
state --actions # Clickable elements, filtered
state --forms # Form inputs only
state --fields '{"price": "//span[@class=\"price\"]"}' # Just the extracted values
This matters beyond cost. Models reason better with focused context. Bury the relevant signal in 20k tokens of accessibility-tree noise and the model wastes capacity parsing structure instead of solving problems. Minimal state keeps the model's attention on the task.
When done, commit saves the recording:
commit session.json
{"status":"committed","file":"session.json"}
The recording is standard Chrome DevTools Recorder JSON. Import it directly into Chrome (F12 → Recorder → Import) to step through what the agent did. Or replay it programmatically:
ar-crawl replay session.json -o result.json
Let the LLM agent play-act as one of your users, exploring a web application to test whether a goal can be completed. Give it ar-crawl, the URL, and a high-level objective. The agent explores the interface, attempting to achieve the goal while thinking out loud.
The customStep action type lets the agent annotate its reasoning:
{"type": "customStep", "name": "thought", "parameters": {"note": "Looking for the checkout button. The cart shows 3 items."}}
{"type": "click", "selector": "button.checkout"}
{"type": "customStep", "name": "thought", "parameters": {"note": "Checkout loaded but I don't see a guest checkout option. Trying to find alternatives."}}
state --forms
{"type": "customStep", "name": "thought", "parameters": {"note": "There's a 'Continue as guest' link below the login form. Clicking it."}}
{"type": "click", "selector": "a.guest-checkout"}
The resulting recording contains both the actions and the agent's reasoning. A human reviewer opens it in Chrome DevTools and sees exactly what the agent tried and why. A QA engineer spots that the guest checkout link has poor visibility. A product manager sees that the checkout flow confused even an AI.
Another LLM agent can replay the same session to verify fixes:
ar-crawl replay checkout-test.json -o after-fix.json
The replay output includes per-step success/failure and timing. Pipe it through jq to validate:
jq -e '.recording.stepResults | all(.success)' after-fix.json && echo "All steps passed"
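The inverse is just as useful when a replay breaks. Assuming the same stepResults shape, this lists the failing steps:
jq '.recording.stepResults | map(select(.success | not))' after-fix.json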
This turns exploratory testing into reproducible artifacts. The agent's session becomes documentation: what it tried, what it thought, what worked. Replay converts that documentation into regression tests.
Because the recordings use Chrome DevTools Recorder format, they integrate with standard E2E test frameworks. Chrome exports natively to Puppeteer. For Playwright and Cypress, CLI tools convert the same JSON: npx @cypress/chrome-recorder or npx playwright-chrome-recorder. An agent's exploratory session becomes a committed test in your CI pipeline. No proprietary formats, just the same JSON that Chrome produces when a human records a test manually.
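A sketch of wiring that into CI; the converter's flags are assumed from its docs, so check --help before relying on them:
npx @cypress/chrome-recorder checkout-test.json --output cypress/e2e
npx cypress run --spec "cypress/e2e/*.cy.js"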
The separation persists: crawling (or in this case, interactive exploration) happens in cheap, deterministic tooling. The model contributes judgment: deciding what to click, recognising when something looks wrong, articulating why. Recording captures both. Replay executes without the model. Intelligence and automation, cleanly separated.
Use the right tool for the job: deterministic tools for mechanical work, intelligence for decisions that require it.
This post was written with assistance from Claude (Anthropic).