# Build-time markdown mirrors for agent readability: how they compare to Cloudflare's approach - agentmarkup

> Build-time markdown generation for AI readability, including where it helps, where it is optional, and how it compares to Cloudflare runtime extraction.

Source: https://agentmarkup.dev/blog/markdown-mirrors/

By [Sebastian Cochinescu](/authors/sebastian-cochinescu/) · March 20, 2026 · 7 min read

When an AI agent visits your website, it gets HTML. On some sites that is fine. On JS-heavy or layout-heavy pages, the content is buried in noise. Build-time markdown mirrors can give agents a cleaner fetch target without changing the canonical HTML page.

## Not every site needs a markdown mirror

If your pages already ship substantial, well-structured HTML, the raw page may already be a good enough fetch target for agents. Markdown mirrors are most useful when the raw HTML is thin, heavily templated, or dominated by layout chrome.

That is the more honest framing for this feature: markdown mirrors are an optional machine-facing artifact for the pages that benefit from them, not a blanket rule that every site should publish a public `.md` companion for every page.

## The problem: some HTML is a bad fetch target

Many agents can extract useful text from HTML, but the quality of the result still depends on what your raw response looks like. A typical web page can be heavy with navigation, cookie banners, analytics tags, scripts, and layout wrappers that have nothing to do with the main body content.

When the raw HTML is mostly shell and very little body content, fetch-based agents either miss the important text or have to guess too much. That is the case markdown mirrors try to fix.

## What are markdown mirrors?

A markdown mirror is a `.md` file that contains the same content as your HTML page, but stripped of layout, navigation, and scripts. Just the content, in clean markdown format.

For example, `/blog/my-post/index.html` gets a companion file at `/blog/my-post.md`. An AI agent can fetch the markdown version directly instead of parsing the HTML.
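The path mapping in that example can be sketched as a small helper. This is an illustrative assumption, not agentmarkup's actual code: `mirrorPathFor` is a hypothetical name, and only the `/dir/index.html` to `/dir.md` rule comes from the example above; the root-page and flat-file rules are guesses.

```typescript
// Hypothetical helper: maps a built HTML path to its markdown mirror path.
// Only the `/dir/index.html` -> `/dir.md` rule is taken from the article;
// the other two branches are assumptions made for illustration.
const INDEX_SUFFIX = "/index.html";
const HTML_SUFFIX = ".html";

function mirrorPathFor(htmlPath: string): string {
  // Assumption: the site root has no directory name to reuse, so use /index.md.
  if (htmlPath === INDEX_SUFFIX) return "/index.md";
  // `/blog/my-post/index.html` -> `/blog/my-post.md`
  if (htmlPath.endsWith(INDEX_SUFFIX)) {
    return htmlPath.slice(0, htmlPath.length - INDEX_SUFFIX.length) + ".md";
  }
  // Assumption: `/about.html` -> `/about.md`
  if (htmlPath.endsWith(HTML_SUFFIX)) {
    return htmlPath.slice(0, htmlPath.length - HTML_SUFFIX.length) + ".md";
  }
  // Anything else is left untouched.
  return htmlPath;
}
```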

When you enable the feature, your pages also get a `<link rel="alternate" type="text/markdown">` tag in the HTML head, so crawlers can discover the markdown version automatically.
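From the agent's side, discovery could look roughly like the sketch below. The function name `findMarkdownAlternate` is hypothetical, and the regex-based parsing is a simplification; a real crawler would more likely use an HTML parser.

```typescript
// Sketch of agent-side discovery: find the markdown alternate advertised in
// the HTML head. Regex parsing is a simplification for illustration only.
function findMarkdownAlternate(html: string, pageUrl: string): string | null {
  const linkTags = html.match(/<link\b[^>]*>/gi) ?? [];
  for (const tag of linkTags) {
    // Read a quoted attribute value from the tag, or null if it is absent.
    const attr = (name: string): string | null => {
      const m = tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, "i"));
      return m?.[1] ?? null;
    };
    const isAlternate = attr("rel")?.toLowerCase() === "alternate";
    const isMarkdown = attr("type")?.toLowerCase() === "text/markdown";
    if (isAlternate && isMarkdown) {
      const href = attr("href");
      // Resolve relative hrefs against the page URL.
      return href ? new URL(href, pageUrl).toString() : null;
    }
  }
  return null;
}
```

An agent that gets `null` back simply falls back to reading the HTML page.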

## How agentmarkup generates markdown mirrors

Enable the feature in your config and it runs at build time on every HTML page in your output:

```
// shared agentmarkup config
const agentmarkupConfig = {
  site: 'https://example.com',
  name: 'My Site',
  markdownPages: {
    enabled: true,
  },
}
```

The converter:

- Extracts the page title, meta description, and canonical URL from the HTML head
- Finds the main content area (`<main>`, `<article>`, or `<body>`)
- Strips navigation, headers, footers, sidebars, scripts, styles, SVGs, and forms
- Converts headings, lists, links, bold, italic, code, and blockquotes to markdown syntax
- Preserves code blocks intact
- Normalizes whitespace and deduplicates the page title
- Injects a `<link rel="alternate">` tag into the HTML for discovery

The result is a clean markdown file that an agent can read without wading through layout chrome.
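To make the steps listed above concrete, here is a deliberately simplified, regex-based sketch. It is not agentmarkup's implementation: `htmlToMarkdownSketch` is a hypothetical name, it skips head metadata, code blocks, and blockquotes, and a production converter would use a real HTML parser.

```typescript
// Simplified sketch of the conversion steps. Illustration only, under the
// assumption that the input is reasonably well-formed HTML.
function htmlToMarkdownSketch(html: string): string {
  // 1. Pick the main content area: <main>, then <article>, then <body>.
  let content = html;
  for (const tag of ["main", "article", "body"]) {
    const m = html.match(new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, "i"));
    const inner = m?.[1];
    if (inner !== undefined) {
      content = inner;
      break;
    }
  }

  // 2. Strip layout chrome and non-content elements.
  for (const tag of ["nav", "header", "footer", "aside", "script", "style", "svg", "form"]) {
    content = content.replace(new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi"), "");
  }

  // 3. Convert a handful of elements to markdown syntax.
  content = content
    .replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_m, level: string, text: string) =>
      `\n\n${"#".repeat(Number(level))} ${text.trim()}\n\n`)
    .replace(/<(strong|b)\b[^>]*>([\s\S]*?)<\/\1>/gi, "**$2**")
    .replace(/<(em|i)\b[^>]*>([\s\S]*?)<\/\1>/gi, "*$2*")
    .replace(/<a\b[^>]*href=["']([^"']*)["'][^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, "\n- $1")
    .replace(/<\/p>/gi, "\n\n");

  // 4. Drop any remaining tags and normalize whitespace.
  return content
    .replace(/<[^>]+>/g, "")
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n */g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Running the conversion against the final built HTML, rather than against source templates, is what keeps the mirror in step with what browsers actually receive.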

## Cloudflare's approach: runtime readability extraction

Cloudflare offers a readability extraction feature that strips HTML to readable content at request time. It is based on Mozilla's Readability library and runs on Cloudflare's edge network.

The key difference is runtime versus build time. Cloudflare processes pages on every request. You do not control the exact output. The extraction algorithm decides what is content and what is noise using heuristics.

## Build-time vs runtime: why it matters

| | agentmarkup (build-time) | Cloudflare (runtime) |
|---|---|---|
| When it runs | Once, during build | Every request |
| Output control | You see the `.md` files in your build output | Opaque, algorithm decides |
| Consistency | Deterministic, same output every build | May vary with algorithm updates |
| Performance cost | Zero runtime cost | Added latency per request |
| Works with SPAs | Yes, uses noscript fallback or pre-rendered HTML | Depends on SSR availability |
| Discovery | Link tag in HTML head + static `.md` URL | Special URL parameter or header |
| Vendor lock-in | None, output is static files | Requires Cloudflare |
| Customization | Choose which pages, preserve existing `.md` files | All or nothing |

## Why build-time can be a good fit for your own content

Cloudflare's runtime extraction makes sense for consuming other people's content, like a reader mode. For your own website, build-time generation can be a better fit because:

- **You control the output.** If the markdown is wrong, you can debug it. You see the actual `.md` files in your build directory.
- **It works with client-rendered apps.** agentmarkup checks for noscript fallback content in SPAs and uses it when the rendered body is thin. Runtime extractors often get empty content from JavaScript-rendered pages.
- **No vendor dependency.** The markdown files are static. Deploy them anywhere. They work on Cloudflare Pages, Netlify, Vercel, S3, or any static host.
- **Integrated with the rest of the stack.** Markdown mirrors work alongside llms.txt, JSON-LD, and robots.txt. One config, one build, everything consistent.
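The noscript fallback mentioned in the list can be pictured with a sketch like the following. The 200-character threshold and the names `visibleText` and `pickContentSource` are assumptions chosen for illustration; the article does not say how agentmarkup decides that a body is thin.

```typescript
// Hypothetical threshold: a body with fewer than 200 visible characters
// counts as "thin". The real heuristic is not documented in this article.
const THIN_BODY_CHARS = 200;

// Approximate the text a fetch-based agent would see without running JS.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Decide whether to convert the rendered body or the <noscript> fallback.
function pickContentSource(html: string): "body" | "noscript" {
  const body = html.match(/<body\b[^>]*>([\s\S]*?)<\/body>/i)?.[1] ?? html;
  const noscript = body.match(/<noscript\b[^>]*>([\s\S]*?)<\/noscript>/i)?.[1] ?? "";
  const bodyIsThin = visibleText(body).length < THIN_BODY_CHARS;
  return bodyIsThin && visibleText(noscript).length > 0 ? "noscript" : "body";
}
```

An SPA shell that contains only a mount point plus a `<noscript>` block would take the `"noscript"` branch, while a pre-rendered page with real body text would not.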

## How agentmarkup reduces the downside

Public markdown mirrors do create tradeoffs. The main risks are duplicate fetches, indexing ambiguity, and output drift if the markdown becomes a second source of truth.

agentmarkup tries to keep those risks contained by generating the mirrors from final built HTML, preserving HTML as the canonical page, and emitting a canonical header on each `.md` file that points back to the HTML route. If your raw HTML is already substantial, you can also keep `llms.txt` pointing at HTML by setting `llmsTxt.preferMarkdownMirrors` to `false`.
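As a sketch, that opt-out might sit next to the `markdownPages` block like this. The nesting is inferred from the dotted option name `llmsTxt.preferMarkdownMirrors`, so treat the exact shape as an assumption and confirm it against the llms.txt guide.

```typescript
// Assumed config shape: mirrors are still generated, but llms.txt keeps
// linking to the HTML pages instead of the generated .md URLs.
const agentmarkupConfig = {
  site: "https://example.com",
  name: "My Site",
  markdownPages: {
    enabled: true,
  },
  llmsTxt: {
    preferMarkdownMirrors: false,
  },
};
```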

## What the output looks like

For a blog post with a title, description, headings, and paragraphs, the generated markdown looks like:

```
# Why llms.txt matters

> LLMs answer questions by synthesizing web content. llms.txt gives them a structured overview.

Source: https://example.com/blog/why-llms-txt-matters/

## The shift from search engines to AI answers

For two decades, the path to online visibility was clear: optimize for Google...

## What is llms.txt?

llms.txt is a proposed standard that gives LLMs a structured overview of your website...
```

Clean, readable, no HTML artifacts. An AI agent reading this file understands the page quickly.

## Getting started

Add `markdownPages: { enabled: true }` to your agentmarkup config when your raw HTML needs a cleaner machine-facing fetch path. On the next build, every HTML page in your output gets a companion `.md` file. When markdown mirrors are enabled, same-site page entries in `llms.txt` also default to the generated markdown URLs so cold agents discover the cleaner fetch path first. Check the [llms.txt guide](/docs/llms-txt/) for the opt-out if you want HTML-first links instead.

If your site already serves rich raw HTML, you do not need to treat markdown mirrors as mandatory. They are a tactical option, not the whole product.

```
pnpm add -D @agentmarkup/next # or @agentmarkup/vite or @agentmarkup/astro
```

## Verify the protective headers in production

agentmarkup generates two sets of headers for markdown mirrors in the `_headers` file. Both are important for keeping search engines and agents on the right page.

**Canonical Link headers** tell search engines that the `.md` file is a mirror of the HTML page, not a separate indexable URL. Each mirror gets its own entry:

```
# from the generated _headers file
/blog/my-post.md
  Link: <https://example.com/blog/my-post>; rel="canonical"
```

**Content-Signal headers** tell agents whether they are allowed to use the content for training, search, and input. agentmarkup generates a wildcard rule that covers all paths including `.md` files:

```
/*
  Content-Signal: ai-train=yes, search=yes, ai-input=yes
```

These headers only work if your hosting platform actually serves them. Cloudflare Pages and Netlify read `_headers` files natively, while Vercel normally takes header rules from `vercel.json`, so the behavior can vary by host. After deploying, verify that the headers are present on a live `.md` URL:

```
curl -I https://yoursite.com/blog/my-post.md

# look for these in the response:
# Link: <https://yoursite.com/blog/my-post>; rel="canonical"
# Content-Signal: ai-train=yes, search=yes, ai-input=yes
```

If the `Link` header is missing, your host may not be applying path-specific `_headers` rules to `.md` files. Check your platform documentation or add equivalent headers through server configuration.
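If you would rather script this check, for example as a post-deploy step, a minimal sketch could look like the following. `checkMirrorHeaders` and `HeaderBag` are hypothetical names, and splitting the `Link` header on commas is a simplification that assumes the URLs themselves contain no commas.

```typescript
// Anything with a case-insensitive get(), such as the `headers` object on a
// fetch() response, e.g. (await fetch(mirrorUrl, { method: "HEAD" })).headers
interface HeaderBag {
  get(name: string): string | null;
}

interface MirrorHeaderReport {
  canonicalOk: boolean;
  contentSignal: Record<string, string> | null;
}

function checkMirrorHeaders(headers: HeaderBag, expectedCanonical: string): MirrorHeaderReport {
  // A Link header may carry several comma-separated values; find rel="canonical".
  const link = headers.get("link") ?? "";
  const canonicalOk = link.split(",").some((part) => {
    const m = part.match(/<([^>]*)>\s*;\s*rel="?canonical"?/i);
    return m?.[1] === expectedCanonical;
  });

  // Parse `ai-train=yes, search=yes, ai-input=yes` into key/value pairs.
  const signal = headers.get("content-signal");
  let contentSignal: Record<string, string> | null = null;
  if (signal !== null) {
    contentSignal = {};
    for (const pair of signal.split(",")) {
      const [key = "", value = ""] = pair.split("=");
      if (key.trim()) contentSignal[key.trim()] = value.trim();
    }
  }
  return { canonicalOk, contentSignal };
}
```

Failing the deploy when `canonicalOk` is `false` catches the case where a host silently ignores the path-specific rule.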

## Make your website machine-readable

agentmarkup is an open-source build-time toolkit for Vite, Astro, and Next.js that generates llms.txt, injects JSON-LD structured data, creates optional markdown mirrors from final HTML when raw pages need a cleaner agent-facing fetch path, manages AI crawler robots.txt rules, patches optional Content-Signal and canonical mirror headers, and validates everything at build time. Zero runtime cost.

```
pnpm add -D @agentmarkup/vite # or @agentmarkup/astro or @agentmarkup/next
```

Written by

[Sebastian Cochinescu](/authors/sebastian-cochinescu/) · Developer of agentmarkup

Builder of developer tools for machine-readable websites. Developer of agentmarkup. Founder of Anima Felix.

