Content extraction API

Any URL in. Clean Markdown out.

Distil fetches a page, renders the JavaScript, strips nav, ads and chrome, and hands back the main content as tidy Markdown your model can actually read. One POST — structured, deterministic, ready for retrieval.

Get an API key · Read the docs

No credit card · 1,000 free pages · p50 1.4s per fetch

request.sh

$ curl https://api.distil.dev/v1/extract \
  -H "Authorization: Bearer $DISTIL_KEY" \
  -H "Content-Type: application/json" \
  -X POST -d '{
    "url": "https://example.com/blog/rag-tips",
    "format": "markdown",
    "main_only": true
  }'

response.json · 200 OK

{
  "title": "Seven RAG tips",
  "word_count": 2184,
  "markdown": "# Seven RAG tips..."
}
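If you would rather not shell out to curl, the same call can be sketched in TypeScript. This is a minimal sketch, not the SDK: the endpoint, headers, body and response fields are copied from the example above, while `buildExtractRequest`, `extractPage` and the `FetchLike` type are hypothetical names, and treating every non-2xx status as a failure is an assumption because the error format is not shown here.

```typescript
// Response fields as they appear in the example response above.
interface ExtractResponse {
  title: string;
  word_count: number;
  markdown: string;
}

interface ExtractRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Minimal structural type so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init: ExtractRequestInit,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const EXTRACT_URL = "https://api.distil.dev/v1/extract";

// Build the same request the curl example sends.
function buildExtractRequest(pageUrl: string, apiKey: string): ExtractRequestInit {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url: pageUrl, format: "markdown", main_only: true }),
  };
}

// Send the request and return the parsed body.
// Assumption: any non-2xx status is a failure; the error body is not inspected.
async function extractPage(
  fetchImpl: FetchLike,
  pageUrl: string,
  apiKey: string,
): Promise<ExtractResponse> {
  const res = await fetchImpl(EXTRACT_URL, buildExtractRequest(pageUrl, apiKey));
  if (!res.ok) throw new Error(`extract failed with HTTP ${res.status}`);
  return (await res.json()) as ExtractResponse;
}
```

In a runtime with a global `fetch`, a call would look like `extractPage(fetch, "https://example.com/blog/rag-tips", key)`. Passing the fetch function in keeps the sketch easy to stub in tests.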

Install $ npm i @distil/sdk

Powering retrieval at

Meridian AI
Halcyon Labs
Northwind
Corvus
Lumen Stack

Built for pipelines

The boilerplate problem, solved at the API.

You don’t want raw HTML in your context window. Distil returns the part that matters — and tells you exactly what it kept.

Typed SDK

Five lines to a clean document.

Point the client at a URL and pass structured Markdown straight into your splitter. Retries, render-wait and rate limits are handled for you.

ingest.ts

import { Distil } from "@distil/sdk";

const distil = new Distil(process.env.DISTIL_KEY);

const doc = await distil.extract({
  url: "https://example.com/docs",
  format: "markdown",
  render: "auto",     // wait for JS only if needed
});

// `index` is your own vector store client, created elsewhere
index.add(doc.markdown);  // straight into your vector store
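Because headings are preserved in the output, a splitter can chunk on them. The following is a minimal sketch of one way to do that, not part of the SDK: `splitByHeading` and the `Chunk` shape are hypothetical, it only recognises ATX-style headings (`#` through `######`), and it does not special-case lines starting with `#` inside fenced code blocks.

```typescript
interface Chunk {
  heading: string; // heading text without the leading '#' marks; "" for text before the first heading
  body: string;    // everything up to the next heading, trimmed
}

// Split a Markdown string into one chunk per heading.
function splitByHeading(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let lines: string[] = [];

  // Emit the section collected so far, skipping a fully empty one.
  const flush = (): void => {
    const body = lines.join("\n").trim();
    if (heading !== "" || body !== "") chunks.push({ heading, body });
    lines = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      heading = (match[2] ?? "").trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk keeps its heading next to its body, so you can prepend the heading to the text you embed if that suits your retrieval setup.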

Main content, not the menu.

A readability model isolates the article body and drops nav, cookie banners, footers and related-post rails — so your embeddings learn signal, not chrome.

Headings, lists & tables preserved
Links and image alt-text kept inline
Scripts, styles & tracking removed

Deterministic & observable.

Same page content, same Markdown. Every call returns a content hash, a token count and the rules that fired, so a re-index is a diff, not a guessing game.

content_hash on every response
Token count for your budget
Webhooks for batch crawls
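To make "a re-index is a diff" concrete, here is a minimal sketch that compares stored hashes against fresh ones. The `content_hash` name comes from the list above. Everything else is an assumption for illustration: that it arrives as a string field on each result, the `url` field, and the hypothetical `planReindex` helper.

```typescript
// Assumed shape: only the fields this sketch needs.
interface ExtractedDoc {
  url: string;
  content_hash: string;
}

interface ReindexPlan {
  changed: string[];   // URLs whose hash differs or that were never indexed
  unchanged: string[]; // URLs that can be skipped
}

// Decide which URLs need re-embedding by comparing content hashes.
// `stored` maps URL -> the hash recorded at the last index run.
function planReindex(stored: Map<string, string>, fresh: ExtractedDoc[]): ReindexPlan {
  const plan: ReindexPlan = { changed: [], unchanged: [] };
  for (const doc of fresh) {
    if (stored.get(doc.url) === doc.content_hash) {
      plan.unchanged.push(doc.url);
    } else {
      plan.changed.push(doc.url);
    }
  }
  return plan;
}
```

Only the URLs in `changed` need to be split and embedded again; the rest keep their existing vectors.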

Quickstart

Live in three lines.

Install the SDK, point extract at a URL, and read clean Markdown back. No headless browser to babysit, no HTML to scrub — the happy path is the whole path.

Full SDK reference

quickstart.ts

import { extract } from "@distil/sdk";

const doc = await extract("https://example.com/blog/rag-tips");

console.log(doc.markdown);  // → "# Seven RAG tips\n\n## 1. Chunk on…"

Pricing

Pay for pages, not seats.

Metered per successful extraction. Failed fetches are never billed.

Free

1,000 pages / month

Markdown & JSON output
Auto JS rendering
Community support

Start building

Most teams

Scale

$0.0008

per page · volume tiers

Everything in Free
Batch crawl & webhooks
99.9% uptime SLA
Priority render queue

Get API key

Enterprise

Custom

dedicated & on-prem

Private rendering region
SSO, audit log, DPA
Solutions engineer

Talk to us
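For a rough budget on the Scale plan, the listed rate can be turned into a quick estimate. This is a sketch under stated assumptions: it applies the flat $0.0008 per page shown above, ignores volume tiers and any free allowance (neither is specified on this page), and counts only successful extractions, since failed fetches are not billed. `estimateScaleCostUsd` is a hypothetical helper.

```typescript
// $0.0008 per page, held as whole micro-dollars to avoid floating-point drift.
const SCALE_MICRO_USD_PER_PAGE = 800;

// Estimate the monthly cost in USD for a number of successful extractions.
// Assumption: flat rate, no volume tiers, no free allowance deducted.
function estimateScaleCostUsd(successfulPages: number): number {
  if (!Number.isInteger(successfulPages) || successfulPages < 0) {
    throw new RangeError("successfulPages must be a non-negative integer");
  }
  return (successfulPages * SCALE_MICRO_USD_PER_PAGE) / 1e6;
}

console.log(estimateScaleCostUsd(250000)); // 200
```

At this flat rate, 250,000 successful pages come to $200 and one million pages to $800; real invoices may be lower once volume tiers apply.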