| 1 | +# Structured Data Extraction Feature |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +BOSS Ghost MCP now includes structured data extraction inspired by FireCrawl's Extract API. Describe the fields you want as a schema, then extract them from a web page in one of three modes: DOM (fast, pattern-based), LLM (model-based, for unstructured content), or Hybrid (DOM first, with LLM fallback). |
| 6 | + |
| 7 | +## Quick Example |
| 8 | + |
| 9 | +```typescript |
| 10 | +// Extract product data from an e-commerce page |
| 11 | +await structuredExtract.handler({ |
| 12 | + params: { |
| 13 | + schema: { |
| 14 | + productName: 'string', |
| 15 | + price: 'number', |
| 16 | + description: 'string', |
| 17 | + inStock: 'boolean', |
| 18 | + images: 'string[]' |
| 19 | + }, |
| 20 | + extractionMode: 'hybrid', // Try DOM first, fall back to LLM |
| 21 | + selector: '.product-details' // Optional: limit scope |
| 22 | + } |
| 23 | +}, response, context); |
| 24 | +``` |
| 25 | + |
| 26 | +## Features |
| 27 | + |
| 28 | +### Three Extraction Modes |
| 29 | + |
| 30 | +1. **DOM Mode** (Default - No API Keys Required) |
| 31 | + - Fast, pattern-based extraction |
| 32 | + - Uses CSS selectors and common HTML patterns |
| 33 | + - Free, deterministic, reliable for structured pages |
| 34 | + ```typescript |
| 35 | + extractionMode: 'dom' |
| 36 | + ``` |
| 37 | + |
| 38 | +2. **LLM Mode** (Requires API Keys) |
| 39 | + - Intelligent AI-powered extraction |
| 40 | + - Cascading providers: OpenAI GPT-4o-mini → Claude 3.5 Haiku |
| 41 | + - Handles complex, unstructured, or ambiguous content |
| 42 | + ```typescript |
| 43 | + extractionMode: 'llm', |
| 44 | + llmInstructions: 'Extract product details focusing on pricing and availability' |
| 45 | + ``` |
| 46 | + |
| 47 | +3. **Hybrid Mode** (Recommended) |
| 48 | + - Combines DOM speed with LLM coverage |
| 49 | + - Tries DOM first (fast, free) |
| 50 | + - Falls back to LLM only if DOM fails |
| 51 | + ```typescript |
| 52 | + extractionMode: 'hybrid' |
| 53 | + ``` |
| 54 | + |
| 55 | +### Cascading LLM Providers |
| 56 | + |
| 57 | +The LLM extractor tries providers in order and falls back to the next one when a call fails: |
| 58 | + |
| 59 | +``` |
| 60 | +1st Attempt: OpenAI GPT-4o-mini |
| 61 | + ├─ Fast (< 1 second) |
| 62 | + ├─ Cheap (~$0.0002 per extraction) |
| 63 | + └─ Deterministic (JSON mode) |
| 64 | + ↓ [If fails] |
| 65 | +2nd Attempt: Claude 3.5 Haiku |
| 66 | + ├─ Reliable fallback |
| 67 | + ├─ Still fast |
| 68 | + └─ Slightly more expensive (~$0.0003) |
| 69 | +``` |
| 70 | + |
| 71 | +**Reliability**: With both keys configured, an outage or rate limit at one provider does not stop LLM extraction, because the other provider takes over. |
| 72 | + |
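The fallback order described above (DOM before LLM in hybrid mode, OpenAI before Claude inside the LLM extractor) follows one pattern: try each option in turn and return the first result that succeeds. The following is a minimal sketch of that pattern, not the project's implementation; `Extracted`, `NamedExtractor`, and `extractWithFallback` are hypothetical names introduced for illustration.

```typescript
// Hypothetical types for illustration; not the project's API.
type Extracted = Record<string, unknown>;

interface NamedExtractor {
  name: string;
  run: (html: string) => Promise<Extracted>;
}

// Try each extractor in order and return the first successful result,
// together with the name of the extractor that produced it.
async function extractWithFallback(
  html: string,
  extractors: NamedExtractor[],
): Promise<{ source: string; data: Extracted }> {
  const errors: string[] = [];
  for (const { name, run } of extractors) {
    try {
      return { source: name, data: await run(html) };
    } catch (err) {
      // Record why this extractor failed, then move on to the next one.
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All extractors failed (${errors.join('; ')})`);
}
```

Passing `[domExtractor, llmExtractor]` would model hybrid mode, and `[openaiExtractor, claudeExtractor]` would model the provider cascade. Collecting the individual error messages makes it easier to see why every option failed.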
| 73 | +## Setup |
| 74 | + |
| 75 | +### 1. Install Dependencies |
| 76 | + |
| 77 | +Already included in `package.json`: |
| 78 | +- `zod` - Schema validation |
| 79 | +- `openai` - OpenAI GPT-4o-mini (primary) |
| 80 | +- `@anthropic-ai/sdk` - Claude Haiku (fallback) |
| 81 | + |
| 82 | +### 2. Configure API Keys |
| 83 | + |
| 84 | +**Option A: Environment Variables** |
| 85 | +```bash |
| 86 | +export OPENAI_API_KEY="sk-proj-your-key-here" |
| 87 | +export ANTHROPIC_API_KEY="sk-ant-your-key-here" |
| 88 | +``` |
| 89 | + |
| 90 | +**Option B: .env File (Recommended)** |
| 91 | +```bash |
| 92 | +cp .env.example .env |
| 93 | +# Edit .env and add your keys |
| 94 | +``` |
| 95 | + |
| 96 | +**Get API Keys:** |
| 97 | +- OpenAI: https://platform.openai.com/api-keys |
| 98 | +- Anthropic: https://console.anthropic.com/settings/keys |
| 99 | + |
| 100 | +**Minimum Requirement**: At least one key (OpenAI or Anthropic) for LLM extraction; DOM mode needs no keys |
| 101 | +**Recommended**: Both keys for maximum reliability |
| 102 | + |
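To make the key requirement concrete, here is a small sketch that reports which providers are configured. It takes the environment as a plain object so it can be checked without real keys; `configuredProviders` and `llmAvailable` are hypothetical helpers, and only the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` come from this guide.

```typescript
type Env = Record<string, string | undefined>;

// List the LLM providers that have a non-empty API key, in cascade order.
function configuredProviders(env: Env): string[] {
  const providers: string[] = [];
  if (env['OPENAI_API_KEY']) providers.push('openai');
  if (env['ANTHROPIC_API_KEY']) providers.push('anthropic');
  return providers;
}

// LLM extraction needs at least one provider; DOM mode needs none.
function llmAvailable(env: Env): boolean {
  return configuredProviders(env).length > 0;
}
```

In Node.js you would call `configuredProviders(process.env)`, assuming the `.env` file has already been loaded into the process environment.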
| 103 | +### 3. Verify Setup |
| 104 | + |
| 105 | +```bash |
| 106 | +npm test -- tests/utils/extraction/ |
| 107 | +``` |
| 108 | + |
| 109 | +## Usage Examples |
| 110 | + |
| 111 | +### E-Commerce Product Extraction |
| 112 | + |
| 113 | +```typescript |
| 114 | +await structuredExtract.handler({ |
| 115 | + params: { |
| 116 | + schema: { |
| 117 | + name: 'string', |
| 118 | + price: 'number', |
| 119 | + description: 'string', |
| 120 | + inStock: 'boolean', |
| 121 | + images: 'string[]', |
| 122 | + rating: 'number?', // Optional field |
| 123 | + reviews: 'number?' |
| 124 | + }, |
| 125 | + extractionMode: 'hybrid', |
| 126 | + selector: '.product-main' |
| 127 | + } |
| 128 | +}, response, context); |
| 129 | +``` |
| 130 | + |
| 131 | +### Job Listings |
| 132 | + |
| 133 | +```typescript |
| 134 | +await structuredExtract.handler({ |
| 135 | + params: { |
| 136 | + schema: { |
| 137 | + title: 'string', |
| 138 | + company: 'string', |
| 139 | + location: 'string', |
| 140 | + salary: 'string', |
| 141 | + remote: 'boolean', |
| 142 | + skills: 'string[]' |
| 143 | + }, |
| 144 | + extractionMode: 'dom', |
| 145 | + selector: '.job-posting' |
| 146 | + } |
| 147 | +}, response, context); |
| 148 | +``` |
| 149 | + |
| 150 | +### Blog Post Metadata (LLM Mode) |
| 151 | + |
| 152 | +```typescript |
| 153 | +await structuredExtract.handler({ |
| 154 | + params: { |
| 155 | + schema: { |
| 156 | + title: 'string', |
| 157 | + author: 'string', |
| 158 | + publishDate: 'string', |
| 159 | + tags: 'string[]', |
| 160 | + readingTime: 'number' |
| 161 | + }, |
| 162 | + extractionMode: 'llm', |
| 163 | + llmInstructions: 'Extract blog metadata. Calculate reading time based on word count (250 words per minute).' |
| 164 | + } |
| 165 | +}, response, context); |
| 166 | +``` |
| 167 | + |
| 168 | +## Supported Schema Types |
| 169 | + |
| 170 | +| Type | Example | Description | |
| 171 | +|------|---------|-------------| |
| 172 | +| `string` | `'Hello'` | Text content | |
| 173 | +| `number` | `42` | Numeric values | |
| 174 | +| `boolean` | `true` | True/false values | |
| 175 | +| `string[]` | `['a', 'b']` | Array of strings | |
| 176 | +| `number[]` | `[1, 2, 3]` | Array of numbers | |
| 177 | +| `boolean[]` | `[true, false]` | Array of booleans | |
| 178 | +| `date` | `'2025-12-28'` | ISO date string | |
| 179 | +| `string?` | `'Hello'` or absent | Optional field (append `?`, e.g. `number?`) |
| 180 | + |
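The type strings in the table can be read as a small grammar: a base type, an optional `[]` suffix for arrays, and an optional `?` suffix for optional fields. The sketch below checks a result object against such a schema without any dependencies. It is an illustration only; the project validates with Zod, and `validate` and `matchesBase` are hypothetical names.

```typescript
type Schema = Record<string, string>; // e.g. { price: 'number', rating: 'number?' }

// Check a single value against a base type name from the table above.
function matchesBase(base: string, value: unknown): boolean {
  switch (base) {
    case 'string': return typeof value === 'string';
    case 'number': return typeof value === 'number' && isFinite(value);
    case 'boolean': return typeof value === 'boolean';
    case 'date': return typeof value === 'string' && !isNaN(Date.parse(value));
    default: return false; // unknown type names never match
  }
}

// Return a list of problems; an empty list means the data fits the schema.
function validate(schema: Schema, data: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(schema)) {
    const rawType = schema[field] as string;
    const optional = rawType.slice(-1) === '?';
    const type = optional ? rawType.slice(0, -1) : rawType;
    const value = data[field];
    if (value === undefined || value === null) {
      if (!optional) problems.push(`${field}: missing required field`);
      continue;
    }
    const isArray = type.slice(-2) === '[]';
    const base = isArray ? type.slice(0, -2) : type;
    const ok = isArray
      ? Array.isArray(value) && value.every((item) => matchesBase(base, item))
      : matchesBase(base, value);
    if (!ok) problems.push(`${field}: expected ${type}`);
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a caller report every mismatched field at once.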
| 181 | +## DOM Extraction Patterns |
| 182 | + |
| 183 | +The DOM extractor automatically looks for common patterns: |
| 184 | + |
| 185 | +### By Field Name |
| 186 | +```html |
| 187 | +<!-- Automatically detected for field "title" --> |
| 188 | +<h1>...</h1> |
| 189 | +<h2>...</h2> |
| 190 | +<title>...</title> |
| 191 | +<div id="title">...</div> |
| 192 | +<div class="title">...</div> |
| 193 | +<div data-field="title">...</div> |
| 194 | +``` |
| 195 | + |
| 196 | +### By Semantic HTML |
| 197 | +```html |
| 198 | +<!-- Using schema.org microdata --> |
| 199 | +<span itemprop="price">$99.99</span> |
| 200 | +<div itemprop="description">...</div> |
| 201 | +<img itemprop="image" src="..." /> |
| 202 | +``` |
| 203 | + |
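One way to picture the field-name and microdata lookups is as a list of candidate CSS selectors generated from the field name. This is a sketch under assumptions: the helper name `candidateSelectors` and the order in which selectors are tried are illustrative and may differ from what the DOM extractor actually does.

```typescript
// Build CSS selectors that might locate a field, most specific first.
// Assumes `field` is a simple identifier such as "title" or "price".
function candidateSelectors(field: string): string[] {
  const selectors = [
    `[data-field="${field}"]`, // explicit data attribute
    `[itemprop="${field}"]`,   // schema.org microdata
    `#${field}`,               // element id
    `.${field}`,               // class name
  ];
  if (field === 'title') {
    // Headings and the document title are natural fallbacks for "title".
    selectors.push('h1', 'h2', 'title');
  }
  return selectors;
}
```

In a browser, each selector could be passed to `document.querySelector` until one returns an element.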
| 204 | +### By Common Patterns |
| 205 | +```html |
| 206 | +<!-- Email --> |
| 207 | +<a href="mailto:user@example.com">Contact</a> |
| 208 | + |
| 209 | +<!-- Price --> |
| 210 | +<span class="price">$99.99</span> |
| 211 | +<div itemprop="price">1,234.56</div> |
| 212 | + |
| 213 | +<!-- Images --> |
| 214 | +<img src="product.jpg" /> |
| 215 | +``` |
| 216 | + |
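Price text such as `$99.99` or `1,234.56` has to be converted to a number before it can satisfy a `number` field. Here is a minimal sketch of that conversion, assuming US-style formatting (comma as thousands separator, dot as decimal point); `parsePrice` is a hypothetical helper, not a function from the codebase.

```typescript
// Convert price text to a number, or return null when no number is present.
// Assumes "," separates thousands and "." marks the decimal point.
function parsePrice(text: string): number | null {
  // Remove thousands separators, then take the first decimal number found.
  const match = text.replace(/,/g, '').match(/-?\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}
```

Pages that use European formatting (`1.234,56`) would need a locale-aware variant.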
| 217 | +## Cost Management |
| 218 | + |
| 219 | +### Expected Costs |
| 220 | + |
| 221 | +| Mode | Cost per Extraction | When to Use | |
| 222 | +|------|---------------------|-------------| |
| 223 | +| DOM | **Free** | Structured, semantic HTML | |
| 224 | +| LLM (OpenAI) | ~$0.0002 | Unstructured content | |
| 225 | +| LLM (Claude) | ~$0.0003 | Fallback only | |
| 226 | +| Hybrid | $0 when DOM succeeds; ~$0.0002-$0.0003 on LLM fallback | General purpose (recommended) |
| 227 | + |
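As a rough worked example of the table, hybrid cost scales with how often DOM extraction fails. The sketch below uses the ~$0.0002 per-call figure from the table as a default; the function name and the failure rate in the usage note are assumptions for illustration, not measurements.

```typescript
// Estimate LLM spend for hybrid mode: only pages where DOM fails cost money.
function expectedHybridCost(
  pages: number,
  domFailureRate: number,   // fraction of pages that fall back to the LLM, 0..1
  llmCostPerCall = 0.0002,  // approximate USD per extraction, from the table above
): number {
  if (domFailureRate < 0 || domFailureRate > 1) {
    throw new RangeError('domFailureRate must be between 0 and 1');
  }
  return pages * domFailureRate * llmCostPerCall;
}
```

For instance, `expectedHybridCost(10000, 0.2)` comes to about $0.40: 10,000 pages with a hypothetical 20% fallback rate means 2,000 LLM calls.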
| 228 | +### Tips to Minimize Costs |
| 229 | + |
| 230 | +1. **Start with DOM mode** - often sufficient for well-structured pages |
| 231 | +2. **Use `selector` parameter** - reduces HTML sent to LLM |
| 232 | +3. **Use hybrid mode** - only pays for LLM when DOM fails |
| 233 | +4. **Cache results** - don't re-extract the same page |
| 234 | +5. **Monitor usage** - set up billing alerts |
| 235 | + |
| 236 | +## Troubleshooting |
| 237 | + |
| 238 | +### "At least one API key required" |
| 239 | +- **Solution**: Set `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` |
| 240 | +- **Note**: DOM mode works without keys |
| 241 | + |
| 242 | +### DOM Extraction Fails |
| 243 | +- **Issue**: Required fields not found |
| 244 | +- **Solutions**: |
| 245 | + 1. Use more specific `selector` to limit scope |
| 246 | + 2. Add data attributes to your HTML (`data-field="fieldname"`) |
| 247 | + 3. Try LLM or hybrid mode |
| 248 | + 4. Check field names match HTML patterns |
| 249 | + |
| 250 | +### LLM Extraction Fails |
| 251 | +- **Issue**: Invalid API key or rate limit |
| 252 | +- **Solutions**: |
| 253 | + 1. Verify API key is correct |
| 254 | + 2. Check API key has sufficient credits |
| 255 | + 3. Wait a few minutes if rate limited |
| 256 | + 4. Try the other provider (OpenAI ↔ Anthropic) |
| 257 | + |
| 258 | +### Validation Errors |
| 259 | +- **Issue**: Extracted data doesn't match schema |
| 260 | +- **Solutions**: |
| 261 | + 1. Make fields optional with `?` suffix |
| 262 | + 2. Adjust schema to match actual data |
| 263 | + 3. Use LLM mode with custom instructions |
| 264 | + 4. Check the page actually contains the data |
| 265 | + |
| 266 | +## Performance |
| 267 | + |
| 268 | +| Mode | Speed | Accuracy | Cost | |
| 269 | +|------|-------|----------|------| |
| 270 | +| DOM | ⚡ <100ms | High (structured pages) | Free | |
| 271 | +| LLM | 🚀 1-2s | Very High | ~$0.0002 (~$0.0003 on Claude fallback) |
| 272 | +| Hybrid | ⚡ <100ms - 2s | Very High | $0 - $0.0003 |
| 273 | + |
| 274 | +## Testing |
| 275 | + |
| 276 | +### Run All Extraction Tests |
| 277 | +```bash |
| 278 | +npm test -- tests/utils/extraction/ |
| 279 | +npm test -- tests/tools/extraction.test.ts |
| 280 | +``` |
| 281 | + |
| 282 | +### Test Specific Features |
| 283 | +```bash |
| 284 | +# DOM extraction only (no API keys needed) |
| 285 | +npm test -- tests/utils/extraction/dom-extractor.test.ts |
| 286 | + |
| 287 | +# LLM extraction (requires API keys) |
| 288 | +npm test -- tests/utils/extraction/llm-extractor.test.ts |
| 289 | + |
| 290 | +# Full integration tests |
| 291 | +npm test -- tests/tools/extraction.test.ts |
| 292 | +``` |
| 293 | + |
| 294 | +## Architecture |
| 295 | + |
| 296 | +``` |
| 297 | +structured_extract (MCP Tool) |
| 298 | + ├─ DOM Extractor |
| 299 | + │ ├─ CSS Selector patterns |
| 300 | + │ ├─ Semantic HTML detection |
| 301 | + │ └─ Zod validation |
| 302 | + │ |
| 303 | + └─ LLM Extractor |
| 304 | + ├─ OpenAI GPT-4o-mini (primary) |
| 305 | + │ ├─ JSON mode |
| 306 | + │ ├─ Temperature 0 |
| 307 | + │ └─ ~1s response |
| 308 | + │ |
| 309 | + └─ Claude 3.5 Haiku (fallback) |
| 310 | + ├─ JSON extraction |
| 311 | + └─ ~1.5s response |
| 312 | +``` |
| 313 | + |
| 314 | +## Contributing |
| 315 | + |
| 316 | +See [FIRECRAWL_BOSS_GHOST_INTEGRATION.md](./FIRECRAWL_BOSS_GHOST_INTEGRATION.md) for implementation details and roadmap. |
| 317 | + |
| 318 | +--- |
| 319 | + |
| 320 | +**Questions?** Check the [Setup Guide](./SETUP_GUIDE.md) or open an issue! |