How to Convert Any Website to Markdown with an API
Meta description: Learn how to use a website to markdown API to convert web pages into LLM-ready text. Step-by-step guide with Python, TypeScript, and curl examples.
AI agents and RAG pipelines require clean, structured data to function. Raw HTML is filled with noise—navigation bars, footer links, scripts, and CSS—that wastes LLM tokens and confuses the model. By using a website-to-markdown API, you can strip away the boilerplate and deliver only the core content in a format that LLMs understand natively.
In this guide, we will show you how to use Ilmenite to convert any URL into clean markdown. We will cover basic implementation using curl, Python, and TypeScript, and explain how to handle complex JavaScript-heavy sites.
Why Markdown is Better Than HTML for LLMs
Before implementing the API, it is important to understand why you should avoid feeding raw HTML into a Large Language Model (LLM).
Token Efficiency
LLMs have finite context windows. A typical web page might have 50KB of HTML but only 5KB of actual content. The rest is metadata, scripts, and styling. Converting a page to markdown reduces the token count significantly, allowing you to fit more information into a single prompt and reducing your API costs.
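The savings are easy to see with a back-of-the-envelope calculation. The sketch below uses the common (and rough) heuristic of about four characters per token; the page contents are synthetic stand-ins, not real measurements:

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate using the common ~4 characters-per-token heuristic."""
    return len(text) // chars_per_token

# A synthetic page: mostly repeated boilerplate markup, a little real content.
html_page = "<html>" + "<div class='nav'>...</div>" * 400 + "<p>actual content</p></html>"
markdown_page = "actual content"

savings = 1 - estimate_tokens(markdown_page) / estimate_tokens(html_page)
print(f"Estimated token savings: {savings:.0%}")
```

On real pages the ratio varies, but stripping markup before prompting reliably frees up a large share of the context window.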
Noise Reduction
HTML contains "noise" that can lead to hallucinations. For example, a sidebar containing "Related Articles" might be interpreted by an LLM as part of the main body text. A dedicated website-to-markdown API removes these elements, ensuring the model focuses only on the primary content.
Structural Preservation
Unlike plain text, markdown preserves the semantic structure of a page. It keeps headings (#), lists (-), and links ([text](url)). This allows the LLM to understand the hierarchy of the information, which is critical for tasks like summarization or data extraction.
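To make the mapping concrete, here is a deliberately tiny toy converter covering just three tags. It is purely illustrative—a real website-to-markdown service handles the full HTML spec, malformed markup, and nested structures:

```python
import re

def toy_html_to_markdown(html):
    """Toy converter for three tags, purely to illustrate the HTML-to-markdown
    mapping; a production converter handles far more than this."""
    html = re.sub(r"<h1>(.*?)</h1>", r"# \1", html)          # headings
    html = re.sub(r"<li>(.*?)</li>", r"- \1", html)          # list items
    html = re.sub(r'<a href="(.*?)">(.*?)</a>', r"[\2](\1)", html)  # links
    return re.sub(r"</?ul>", "", html)                        # drop list wrappers

for fragment in ['<h1>Install</h1>', '<li>Run the server</li>', '<a href="/docs">Docs</a>']:
    print(toy_html_to_markdown(fragment))
```

Each output line keeps the original's semantic role—heading, list item, link—in a form an LLM can parse without a DOM.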
Prerequisites
To follow this tutorial, you will need:
- An Ilmenite API key. You can sign up for a free account to get started.
- A terminal with curl installed.
- Python 3.8+ or Node.js 16+ installed on your machine.
- A target URL you wish to convert to markdown.
Implementing the Website to Markdown API
The core of this process is the /v1/scrape endpoint. Unlike traditional scrapers that require you to write complex CSS selectors, this endpoint handles the cleaning and conversion automatically.
Step 1: Basic Request with curl
The fastest way to test the API is via curl. This request sends a URL to the server, which then renders the page and returns the markdown.
```bash
curl -X POST https://api.ilmenite.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://example.com",
    "format": "markdown"
  }'
```
The response will be a JSON object containing the cleaned markdown text, the page title, and metadata.
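The exact response schema may differ; based on the fields the code samples in this guide read (markdown and title), the shape is roughly:

```json
{
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "title": "Example Domain",
  "metadata": {
    "url": "https://example.com"
  }
}
```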
Step 2: Implementation in Python
For AI agent builders using LangChain or LlamaIndex, Python is the standard. We use the requests library to communicate with the API.
```python
import requests

def convert_to_markdown(url, api_key):
    endpoint = "https://api.ilmenite.dev/v1/scrape"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        "url": url,
        "format": "markdown"
    }
    response = requests.post(endpoint, json=data, headers=headers)
    if response.status_code == 200:
        return response.json().get("markdown")
    else:
        return f"Error: {response.status_code} - {response.text}"

# Usage
API_KEY = "your_api_key_here"
TARGET_URL = "https://docs.python.org/3/"

markdown_content = convert_to_markdown(TARGET_URL, API_KEY)
print(markdown_content)
```
Step 3: Implementation in TypeScript
If you are building a web-based AI tool or a Node.js backend, TypeScript is a natural fit.
```typescript
import axios from 'axios';

async function convertToMarkdown(url: string, apiKey: string) {
  const endpoint = 'https://api.ilmenite.dev/v1/scrape';
  try {
    const response = await axios.post(endpoint, {
      url: url,
      format: 'markdown',
    }, {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apiKey}`,
      },
    });
    return response.data.markdown;
  } catch (error) {
    console.error('Error converting website to markdown:', error);
    throw error;
  }
}

// Usage
const API_KEY = 'your_api_key_here';
const TARGET_URL = 'https://typescriptlang.org/';

convertToMarkdown(TARGET_URL, API_KEY).then(console.log);
```
Step 4: Handling JavaScript-Heavy Websites
Many modern websites use React, Vue, or Next.js. A standard HTTP request to these sites often returns an empty shell because the content is rendered in the browser via JavaScript.
Ilmenite solves this by using a browser engine built in Rust. While our native engine handles most sites, some complex Single Page Applications (SPAs) require full Chrome rendering. You can trigger this by adding the render_js parameter to your request.
Updated curl request for JS rendering:
```bash
curl -X POST https://api.ilmenite.dev/v1/scrape \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "url": "https://complex-react-site.com",
    "format": "markdown",
    "render_js": true
  }'
```
Note that rendering JavaScript is more resource-intensive and costs 3 credits per request, compared to 1 credit for standard scraping. You can find more details on credit costs in our pricing page.
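Since a rendered request costs three times as much as a static one, a practical pattern is to try the cheap path first and retry with render_js only when the result looks empty. Here is a minimal sketch; fetch is a stand-in for whatever function wraps the HTTP call (like the convert_to_markdown helper above), and the 200-character threshold is an arbitrary assumption you should tune:

```python
def scrape_with_fallback(url, fetch, min_chars=200):
    """Try the 1-credit static scrape first; fall back to the 3-credit
    render_js path only when the markdown comes back suspiciously short."""
    result = fetch(url, render_js=False)
    markdown = (result or {}).get("markdown") or ""
    if len(markdown) >= min_chars:
        return result
    return fetch(url, render_js=True)
```

For sites you already know are SPAs, skip the first attempt and set render_js directly to avoid wasting a credit.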
Performance and Architecture
When choosing a website-to-markdown API, performance matters—especially for autonomous agents that need to browse the web in real time.
Ilmenite is built in pure Rust, which eliminates the overhead associated with Node.js or Python-based wrappers. Our browser engine starts in 0.19ms and uses only 2MB of RAM per session. This is 100x lighter than Chrome-based alternatives.
For developers who require strict data residency or air-gapped environments, Ilmenite can be self-hosted as a single binary or a 12MB Docker image. This ensures your data never leaves your infrastructure while maintaining sub-millisecond startup times.
Full Working Example: RAG-Ready Scraper
Below is a complete Python script that takes a list of URLs and prepares them for a vector database by converting them to markdown.
```python
import requests
import json

class WebToMarkdownConverter:
    def __init__(self, api_key):
        self.api_key = api_key
        self.endpoint = "https://api.ilmenite.dev/v1/scrape"

    def scrape(self, url):
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
        payload = {
            "url": url,
            "format": "markdown",
            "render_js": True  # Ensure we get content from SPAs
        }
        try:
            response = requests.post(self.endpoint, json=payload, headers=headers)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Failed to scrape {url}: {e}")
            return None

def main():
    API_KEY = "your_api_key_here"
    urls = [
        "https://ilmenite.dev",
        "https://rust-lang.org",
        "https://openai.com/blog"
    ]

    converter = WebToMarkdownConverter(API_KEY)
    processed_data = []

    for url in urls:
        print(f"Processing {url}...")
        result = converter.scrape(url)
        if result:
            processed_data.append({
                "url": url,
                "title": result.get("title"),
                "content": result.get("markdown")
            })

    # Save for RAG pipeline indexing
    with open("web_data.json", "w") as f:
        json.dump(processed_data, f, indent=2)

    print("Successfully converted all pages to markdown.")

if __name__ == "__main__":
    main()
```
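Before indexing, most RAG pipelines split each page into chunks that fit an embedding model's input limit. A simple sketch that splits the scraped markdown at heading boundaries (the 1000-character limit is an arbitrary default; tune it to your embedding model):

```python
def chunk_markdown(text, max_chars=1000):
    """Split markdown into heading-delimited sections, then hard-split
    any section that still exceeds max_chars."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        if section:
            chunks.append(section)
    return chunks
```

Because the input is markdown rather than HTML, heading lines are a reliable split point, so each chunk tends to stay on a single topic—exactly what you want for retrieval quality.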
Next Steps
Now that you can convert websites to markdown, you can expand your AI agent's capabilities:
- Crawl Entire Sites: Use the /v1/crawl endpoint to index an entire documentation site rather than single pages. See the crawl documentation.
- Structured Extraction: If you need specific data (like product prices) instead of a full page, use the /v1/extract endpoint to get structured JSON.
- Integrate with Claude: Use our MCP (Model Context Protocol) server to give Claude native access to the web without writing custom glue code.
Ready to start building? You can start free or test the API in our playground.