Extract Structured Data from Any Website with a JSON Schema
If you are building an AI agent or a data pipeline, you don't need a wall of markdown; you need structured data. Using an extract data from website api allows you to turn unstructured HTML into a clean JSON object based on a schema you define. Instead of writing custom regex or fragile CSS selectors for every site, you define what you want, and Ilmenite handles the extraction.
What we're building
In this tutorial, we will implement a structured data extractor using the Ilmenite /v1/extract endpoint. We will build a system that can take any URL—whether it is an e-commerce product page, a job listing, or a news article—and return a strictly typed JSON object. You will learn how to define JSON schemas that the API uses to identify and isolate specific data points from the page content.
Prerequisites
Before you begin, ensure you have the following:
- An Ilmenite API key (you can sign up for a free account).
- Python 3.8+ installed on your machine.
- The requests library installed (pip install requests).
- A basic understanding of JSON schema structures.
How to use the extract data from website api
The /v1/extract endpoint differs from the standard scrape endpoint. While /v1/scrape returns the entire page as markdown, /v1/extract uses a combination of our Rust-based browser engine and a language model to find specific fields.
Step 1: Define your JSON schema
The schema is the most important part of the request. It tells the API exactly what fields to look for and what data type they should be. If you are extracting a product page, you don't want the whole description; you want the price, the currency, and the availability.
For a product page, your schema should look like this:
{
  "product_name": "string",
  "price": "number",
  "currency": "string",
  "in_stock": "boolean",
  "specifications": "array"
}
Step 2: Make the API request
You send a POST request to https://api.ilmenite.dev/v1/extract. The request body must include the url and the schema.
Here is a basic example using curl:
curl -X POST https://api.ilmenite.dev/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-ecommerce.com/product/123",
    "schema": {
      "product_name": "string",
      "price": "number",
      "currency": "string"
    }
  }'
Each request to the extract endpoint costs 5 credits. If the page requires heavy JavaScript rendering (React or Next.js), an additional 3 credits will be applied for the Chrome render. You can view the full breakdown in the pricing section.
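Based on those numbers, you can budget a batch job before sending it. The helper below is a quick sketch using only the per-request costs stated above (5 credits per extract, plus 3 when a Chrome render is needed); actual costs may differ if pricing changes.

```python
def estimate_extract_credits(num_pages: int, needs_js_render: bool = False) -> int:
    """Estimate the credit cost of a batch of /v1/extract calls.

    Uses the figures above: 5 credits per request, plus 3 more
    per page when a Chrome render is required.
    """
    per_page = 5 + (3 if needs_js_render else 0)
    return num_pages * per_page
```

For example, extracting 200 React-heavy product pages would cost an estimated 200 × 8 = 1,600 credits.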
Step 3: Implement use cases for different page types
Different websites require different schemas. To make your agent flexible, you should maintain a library of schemas based on the page type.
Case A: Job Listings
When scraping a job board, you need to isolate the role, the company, and the salary range.
Schema:
{
  "job_title": "string",
  "company_name": "string",
  "salary_min": "number",
  "salary_max": "number",
  "remote": "boolean",
  "required_skills": "array"
}
Case B: News Articles
For a RAG pipeline or a news aggregator, you need the core metadata and the primary claim of the article.
Schema:
{
  "headline": "string",
  "author": "string",
  "publish_date": "string",
  "summary": "string",
  "main_entities": "array"
}
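One way to maintain that library of schemas is a simple registry keyed by page type. This is a sketch of one possible convention on the client side, not an API feature; the schemas mirror the examples in this section.

```python
# Registry of extraction schemas keyed by page type.
SCHEMAS = {
    "product": {
        "product_name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
        "specifications": "array",
    },
    "job": {
        "job_title": "string",
        "company_name": "string",
        "salary_min": "number",
        "salary_max": "number",
        "remote": "boolean",
        "required_skills": "array",
    },
    "article": {
        "headline": "string",
        "author": "string",
        "publish_date": "string",
        "summary": "string",
        "main_entities": "array",
    },
}

def schema_for(page_type: str) -> dict:
    """Look up the extraction schema for a page type, failing loudly
    on unknown types so malformed requests never reach the API."""
    if page_type not in SCHEMAS:
        raise ValueError(f"No schema registered for page type: {page_type!r}")
    return SCHEMAS[page_type]
```

Your agent can then classify a URL first and call `schema_for("job")` or `schema_for("article")` before building the request payload.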
Optimizing your extract data from website api requests
To get the most reliable results, you should follow these technical guidelines when designing your schemas and requests.
Be explicit with types
While the API is flexible, using explicit types like number or boolean instead of generic string fields helps the extraction engine validate the data. If a price is listed as "$49.99", specifying the type as number tells the API to strip the currency symbol and return 49.99.
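As a rough client-side illustration of that behavior (not the API's actual implementation), stripping a currency symbol before parsing looks like this:

```python
import re

def coerce_number(raw: str) -> float:
    """Approximate what declaring a field as "number" implies: drop
    currency symbols and thousands separators, keep digits, the sign,
    and the decimal point."""
    cleaned = re.sub(r"[^0-9.\-]", "", raw)
    return float(cleaned)
```

So a scraped value like "$49.99" comes back as the float 49.99, ready for numeric comparisons in your pipeline.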
Use arrays for lists
If you are extracting a list of features or skills, always use the array type. This prevents the API from returning a single long string of comma-separated values, which is difficult to parse in a production database.
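To catch type drift in production, you can check the returned JSON against the schema you sent. The validator below is a minimal sketch assuming only the four type names used in this tutorial (string, number, boolean, array):

```python
# Map the tutorial's schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
}

def validate_extraction(result: dict, schema: dict) -> list:
    """Return the names of fields that are missing, null, or whose
    value does not match the declared schema type."""
    bad_fields = []
    for field, declared in schema.items():
        value = result.get(field)
        expected = TYPE_MAP.get(declared)
        if value is None or (expected is not None and not isinstance(value, expected)):
            bad_fields.append(field)
    return bad_fields
```

A field that comes back as "Rust, Python, Go" when you declared an array, or as null when you expected a string, will show up in the returned list so you can retry or flag the page.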
Combine with JavaScript rendering
Many modern sites are Single Page Applications (SPAs). If the API returns null for fields that you can see in your browser, the page likely needs JavaScript rendering. Our engine handles this automatically, but you can force it explicitly; see the extract endpoint documentation for the available rendering flags.
Full working code example
Below is a complete Python implementation. This script includes a function that can handle different schemas based on the type of page you are scraping.
import requests
import json

def extract_web_data(url, schema, api_key):
    endpoint = "https://api.ilmenite.dev/v1/extract"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "url": url,
        "schema": schema
    }
    try:
        response = requests.post(endpoint, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error extracting data: {e}")
        return None

# Your Ilmenite API Key
API_KEY = "your_api_key_here"

# Example 1: Product Page
product_url = "https://example-store.com/item/rust-book"
product_schema = {
    "name": "string",
    "price": "number",
    "currency": "string",
    "availability": "string"
}

# Example 2: Job Listing
job_url = "https://example-jobs.com/listing/software-engineer"
job_schema = {
    "role": "string",
    "company": "string",
    "salary_range": "string",
    "location": "string"
}

# Execute extractions
print("Extracting product data...")
product_data = extract_web_data(product_url, product_schema, API_KEY)
print(json.dumps(product_data, indent=2))

print("\nExtracting job data...")
job_data = extract_web_data(job_url, job_schema, API_KEY)
print(json.dumps(job_data, indent=2))
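One small hardening step: rather than hardcoding API_KEY in the script above, read it from the environment. The variable name ILMENITE_API_KEY is our own convention here, not something the API mandates.

```python
import os

def load_api_key(env_var: str = "ILMENITE_API_KEY") -> str:
    """Fetch the API key from the environment, failing early with a
    clear message if it has not been set."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

You would then replace the `API_KEY = "your_api_key_here"` line with `API_KEY = load_api_key()`, keeping the secret out of version control.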
Next steps
Now that you can extract structured data, you can integrate this into a larger AI workflow. Most developers use this as the first step in a RAG (Retrieval-Augmented Generation) pipeline, where the structured JSON is stored in a vector database like Pinecone or Weaviate.
To further optimize your web data collection, explore these resources:
- Check the documentation to learn about the /v1/crawl endpoint for indexing entire domains.
- Try the playground to test your JSON schemas against live URLs without writing code.
- Explore the MCP integration to give Claude native access to these extraction capabilities.