Can Claude Read Your Website

(johnbrennan.xyz)

2 points | by johnb95 6 hours ago

1 comment

  • johnb95 6 hours ago

    TL;DR We conducted a live experiment asking Claude Opus 4.6 to discover and read content across three websites built as React single-page applications with Express backends. At the start of the session, all three sites were effectively invisible — Claude received empty HTML shells with no article content, no navigation, and no discoverable paths to any content. Over several hours of iterative testing, debugging, and deployment, we identified which artifacts make a site legible to AI agents and which failures leave it dark.

    The single most impactful change was a plain-text sitemap (`sitemap.txt`) — one file, one URL per line — which transformed a completely opaque site into one Claude could navigate autonomously. The experiment also revealed that server-side HTML injection, structured Markdown endpoints, `llms.txt` directories, homepage discovery links, and correct MIME types each play distinct and complementary roles in AI legibility.

    A final test of the Unified TOON Meta-Index (`utmi.toon`) demonstrated that consolidating crawl rules, site index, AI summaries, and API tool registration into a single token-optimized file is viable and immediately useful to an AI agent — provided the file is served with a text MIME type rather than the default binary content type that web servers assign to unknown file extensions.
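    To illustrate how little the sitemap artifact requires: `sitemap.txt` is just one absolute URL per line, with no XML or schema. A minimal sketch of generating one from a route list — the origin and routes below are hypothetical placeholders, not the actual sites from the experiment:

```javascript
// Sketch: build a plain-text sitemap (one absolute URL per line)
// from a SPA's route list. Origin and routes are assumed examples.
const ORIGIN = "https://example.com"; // hypothetical site origin
const routes = ["/", "/about", "/posts/hello-world"]; // hypothetical SPA routes

function buildSitemapTxt(origin, paths) {
  // sitemap.txt is just newline-separated URLs; nothing else is needed
  return paths.map((p) => new URL(p, origin).href).join("\n") + "\n";
}

console.log(buildSitemapTxt(ORIGIN, routes));
```

    Served at `/sitemap.txt` with a `text/plain` content type, this single file gives a fetch-only agent a complete map of the site without executing any JavaScript.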

    Key Takeaways

    - React single-page applications are invisible to AI agents by default. Claude's fetch tools do not execute JavaScript, so any content rendered client-side does not exist from the agent's perspective.

    - A plain-text sitemap (`sitemap.txt`) was the single most impactful artifact. Once provided, Claude could autonomously discover and read every piece of content on a site.

    - Server-side HTML injection works, but edge caching can mask it entirely. A working injection pipeline appeared broken for over an hour because stale cached responses were being served.

    - Markdown endpoints (`.md`) are the ideal content format for AI agents. Structured front matter, clean hierarchy, and explicit metadata allow an LLM to parse, cite, and reason about content with zero friction.

    - Homepage discovery is the critical gap. If the homepage returns nothing navigable, an AI agent has no starting point, even if every other endpoint works perfectly.

    - MIME types for novel file formats must be explicitly configured. A `.toon` file served as `application/octet-stream` is unreadable binary to an AI agent, regardless of how well-designed the format is.

    - The UTMI format (`utmi.toon`) consolidates robots.txt, sitemaps, `llms.txt`, metadata, and API tool registration into a single file that Claude could parse immediately once the MIME type was corrected, demonstrating that unified site manifests are viable and useful for AI agents.