Extract Structured Data from Any Website with a JSON Schema
If you are building an AI agent or a data pipeline, you don't need a wall of markdown; you need structured data. Using an extract data from website api allows you to turn unstructured HTML into a clean JSON object based on a schema you define. Instead of writing custom regex or fragile CSS selectors for every site, you define what you want, and Ilmenite handles the extraction.
What we're building
In this tutorial, we will implement a structured data extractor using the Ilmenite /v1/extract endpoint. We will build a system that can take any URL—whether it is an e-commerce product page, a job listing, or a news article—and return a strictly typed JSON object. You will learn how to define JSON schemas that the API uses to identify and isolate specific data points from the page content.
Prerequisites
Before you begin, ensure you have the following:
- An Ilmenite API key (you can sign up for a free account).
- Python 3.8+ installed on your machine.
- The requests library installed (pip install requests).
- A basic understanding of JSON schema structures.
How to use the extract data from website api
The /v1/extract endpoint differs from the standard scrape endpoint. While /v1/scrape returns the entire page as markdown, /v1/extract uses a combination of our Rust-based browser engine and a language model to find specific fields.
Step 1: Define your JSON schema
The schema is the most important part of the request. It tells the API exactly what fields to look for and what data type they should be. If you are extracting a product page, you don't want the whole description; you want the price, the currency, and the availability.
For a product page, your schema should look like this:
{
  "product_name": "string",
  "price": "number",
  "currency": "string",
  "in_stock": "boolean",
  "specifications": "array"
}
Step 2: Make the API request
You send a POST request to https://api.ilmenite.dev/v1/extract. The request body must include the url and the schema.
Here is a basic example using curl:
curl -X POST https://api.ilmenite.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-ecommerce.com/product/123",
    "schema": {
      "product_name": "string",
      "price": "number",
      "currency": "string"
    }
  }'
Each request to the extract endpoint costs 5 credits. If the page requires heavy JavaScript rendering (React or Next.js), an additional 3 credits will be applied for the Chrome render. You can view the full breakdown in the pricing section.
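Based on those numbers, you can budget a batch job before sending it. The helper below is a quick sketch using only the per-request costs stated above (5 credits per extract, plus 3 when a Chrome render is needed); actual costs may differ if pricing changes.

```python
def estimate_extract_credits(num_pages: int, needs_js_render: bool = False) -> int:
    """Estimate the credit cost of a batch of /v1/extract calls.

    Uses the figures above: 5 credits per request, plus 3 more
    per page when a Chrome render is required.
    """
    per_page = 5 + (3 if needs_js_render else 0)
    return num_pages * per_page
```

For example, extracting 200 React-heavy product pages would cost an estimated 200 × 8 = 1,600 credits.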
Step 3: Implement use cases for different page types
Different websites require different schemas. To make your agent flexible, you should maintain a library of schemas based on the page type.
Case A: Job Listings
When scraping a job board, you need to isolate the role, the company, and the salary range.
Schema:
{
  "job_title": "string",
  "company_name": "string",
  "salary_min": "number",
  "salary_max": "number",
  "remote": "boolean",
  "required_skills": "array"
}
Case B: News Articles
For a RAG pipeline or a news aggregator, you need the core metadata and the primary claim of the article.
Schema:
{
  "headline": "string",
  "author": "string",
  "publish_date": "string",
  "summary": "string",
  "main_entities": "array"
}
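One way to maintain that library of schemas is a simple registry keyed by page type. This is a sketch of one possible convention on the client side, not an API feature; the schemas mirror the examples in this section.

```python
# Registry of extraction schemas keyed by page type.
SCHEMAS = {
    "product": {
        "product_name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
        "specifications": "array",
    },
    "job": {
        "job_title": "string",
        "company_name": "string",
        "salary_min": "number",
        "salary_max": "number",
        "remote": "boolean",
        "required_skills": "array",
    },
    "article": {
        "headline": "string",
        "author": "string",
        "publish_date": "string",
        "summary": "string",
        "main_entities": "array",
    },
}

def schema_for(page_type: str) -> dict:
    """Look up the extraction schema for a page type, failing loudly
    on unknown types so malformed requests never reach the API."""
    if page_type not in SCHEMAS:
        raise ValueError(f"No schema registered for page type: {page_type!r}")
    return SCHEMAS[page_type]
```

Your agent can then classify a URL first and call `schema_for("job")` or `schema_for("article")` before building the request payload.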
Optimizing your extract data from website api requests
To get the most reliable results, you should follow these technical guidelines when designing your schemas and requests.
Be explicit with types
While the API is flexible, using explicit types like number or boolean instead of generic string fields helps the extraction engine validate the data. If a price is listed as "$49.99", specifying the type as number tells the API to strip the currency symbol and return 49.99.
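As a rough client-side illustration of that behavior (not the API's actual implementation), stripping a currency symbol before parsing looks like this:

```python
import re

def coerce_number(raw: str) -> float:
    """Approximate what declaring a field as "number" implies: drop
    currency symbols and thousands separators, keep digits, the sign,
    and the decimal point."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return float(cleaned)
```

So a scraped value like "$49.99" comes back as the float 49.99, ready for numeric comparisons in your pipeline.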
Use arrays for lists
If you are extracting a list of features or skills, always use the array type. This prevents the API from returning a single long string of comma-separated values, which is difficult to parse in a production database.
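To catch type drift in production, you can check the returned JSON against the schema you sent. The validator below is a minimal sketch assuming only the four type names used in this tutorial (string, number, boolean, array):

```python
# Map the tutorial's schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
}

def validate_extraction(result: dict, schema: dict) -> list:
    """Return the names of fields that are missing, null, or whose
    value does not match the declared schema type."""
    bad_fields = []
    for field, declared in schema.items():
        value = result.get(field)
        expected = TYPE_MAP.get(declared)
        if value is None or (expected is not None and not isinstance(value, expected)):
            bad_fields.append(field)
    return bad_fields
```

A field that comes back as "Rust, Python, Go" when you declared an array, or as null when you expected a string, will show up in the returned list so you can retry or flag the page.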
Combine with JavaScript rendering
Many modern sites are Single Page Applications (SPAs). If the API returns null for fields that you can see in your browser, the page likely needs JavaScript rendering. Our engine handles this automatically, but you can force it explicitly; see the extract endpoint documentation for the available rendering flags.
Full working code example
Below is a complete Python implementation. This script includes a function that can handle different schemas based on the type of page you are scraping.
import requests
import json

def extract_web_data(url, schema, api_key):
    endpoint = "https://api.ilmenite.dev/v1/extract"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "schema": schema
    }
    try:
        response = requests.post(endpoint, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error extracting data: {e}")
        return None

# Your Ilmenite API Key
API_KEY = "your_api_key_here"

# Example 1: Product Page
product_url = "https://example-store.com/item/rust-book"
product_schema = {
    "name": "string",
    "price": "number",
    "currency": "string",
    "availability": "string"
}

# Example 2: Job Listing
job_url = "https://example-jobs.com/listing/software-engineer"
job_schema = {
    "role": "string",
    "company": "string",
    "salary_range": "string",
    "location": "string"
}

# Execute extractions
print("Extracting product data...")
product_data = extract_web_data(product_url, product_schema, API_KEY)
print(json.dumps(product_data, indent=2))

print("\nExtracting job data...")
job_data = extract_web_data(job_url, job_schema, API_KEY)
print(json.dumps(job_data, indent=2))
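One small hardening step: rather than hardcoding API_KEY in the script above, read it from the environment. The variable name ILMENITE_API_KEY is our own convention here, not something the API mandates.

```python
import os

def load_api_key(env_var: str = "ILMENITE_API_KEY") -> str:
    """Fetch the API key from the environment, failing early with a
    clear message if it has not been set."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

You would then replace the `API_KEY = "your_api_key_here"` line with `API_KEY = load_api_key()`, keeping the secret out of version control.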
Next steps
Now that you can extract structured data, you can integrate this into a larger AI workflow. Most developers use this as the first step in a RAG (Retrieval-Augmented Generation) pipeline, where the structured JSON is stored in a vector database like Pinecone or Weaviate.
To further optimize your web data collection, explore these resources:
- Check the documentation to learn about the /v1/crawl endpoint for indexing entire domains.
- Try the playground to test your JSON schemas against live URLs without writing code.
- Explore the MCP integration to give Claude native access to these extraction capabilities.