| 1 | +--- |
| 2 | +name: edgeparse |
| 3 | +description: Extract structured content from any PDF for AI agents, RAG pipelines, and Copilot Skills. Use this skill whenever the user wants to read, analyze, or reason about a PDF document; needs to feed document content to an LLM; mentions PDF extraction, parsing, or conversion; wants tables, headings, or bounding boxes from a PDF; is building a RAG pipeline; or asks an agent to process a document. Install with: pip install edgeparse |
| 4 | +license: Apache-2.0 |
| 5 | +metadata: |
| 6 | + authors: "EdgeParse Contributors" |
| 7 | + version: "0.1.0" |
| 8 | + package: "edgeparse" |
| 9 | + install_python: "pip install edgeparse" |
| 10 | + install_node: "npm install edgeparse" |
| 11 | + source: "raphaelmansuy/edgeparse" |
| 12 | +--- |
| 13 | + |
| 14 | +# EdgeParse Skill |
| 15 | + |
| 16 | +Enables AI agents to extract clean, structured content from any PDF — headings, tables, paragraphs, lists, bounding boxes — deterministically, without ML dependencies or GPU requirements. |
| 17 | + |
| 18 | +**Install:** `pip install edgeparse` · **Node.js:** `npm install edgeparse` |
| 19 | +**Speed:** ~0.023 s/doc (Apple M4 Max, 200-doc benchmark) |
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## When to reach for this skill |
| 24 | + |
| 25 | +Activate when the workflow involves: |
| 26 | +- Reading or analyzing a PDF document on behalf of a user |
| 27 | +- Building a RAG pipeline that ingests PDFs |
| 28 | +- Feeding PDF content to an LLM for summarization, Q&A, or synthesis |
| 29 | +- Extracting tables from financial reports, research papers, or invoices |
| 30 | +- Processing a batch of documents for indexing or search |
| 31 | +- Implementing an agent tool that must "open" a PDF and return its contents |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## Quick start |
| 36 | + |
| 37 | +```python |
| 38 | +import edgeparse |
| 39 | +import json |
| 40 | + |
| 41 | +# Convert any PDF to Markdown (best for LLM context windows) |
| 42 | +text = edgeparse.convert("report.pdf", format="markdown") |
| 43 | + |
| 44 | +# Convert to JSON with bounding boxes and full structure |
| 45 | +doc = json.loads(edgeparse.convert("report.pdf", format="json")) |
| 46 | + |
| 47 | +# Plain text (fast, minimal) |
| 48 | +plain = edgeparse.convert("report.pdf", format="text") |
| 49 | +``` |
| 50 | + |
| 51 | +The `format` parameter controls output: |
| 52 | +| Value | Best for | |
| 53 | +|-------|----------| |
| 54 | +| `"markdown"` | LLM context — headings, tables, lists in Markdown | |
| 55 | +| `"json"` | Bounding boxes, citations, structured element metadata | |
| 56 | +| `"html"` | Web rendering, semantic HTML5 | |
| 57 | +| `"text"` | Simple full-text search, minimal output | |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## Core API |
| 62 | + |
| 63 | +### `edgeparse.convert()` |
| 64 | + |
| 65 | +```python |
| 66 | +result: str = edgeparse.convert( |
| 67 | + input_path, # str or Path — required |
| 68 | + format="markdown", # output format (see table above) |
| 69 | + pages=None, # e.g. "1-5" or "1,3,7-10" — specific pages only |
| 70 | + password=None, # for password-protected PDFs |
| 71 | + reading_order="xycut", # "xycut" (spatial sort, default) or "off" |
| 72 | + table_method="default", # "default" (ruling-line) or "cluster" (borderless) |
| 73 | + image_output="off", # "off", "embedded" (base64), "external" (files) |
| 74 | +) |
| 75 | +``` |
| 76 | + |
| 77 | +Returns the extracted content as a **string**. Raises `FileNotFoundError` for missing files and `ValueError` for an invalid format, a corrupt PDF, a wrong password, or a bad page range. |
| 78 | + |
| 79 | +### `edgeparse.convert_file()` |
| 80 | + |
| 81 | +```python |
| 82 | +out_path: str = edgeparse.convert_file( |
| 83 | + input_path, |
| 84 | + output_dir="output", # write output file to this directory |
| 85 | + format="markdown", |
| 86 | + pages=None, |
| 87 | + password=None, |
| 88 | +) |
| 89 | +``` |
| 90 | + |
| 91 | +Writes the converted output to a file in `output_dir` and returns that file's path as a string. |
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## Common patterns |
| 96 | + |
| 97 | +### Feed a PDF to an LLM |
| 98 | + |
| 99 | +```python |
| 100 | +import edgeparse |
| 101 | +import anthropic |
| 102 | + |
| 103 | +doc = edgeparse.convert("report.pdf", format="markdown") |
| 104 | + |
| 105 | +client = anthropic.Anthropic() |
| 106 | +response = client.messages.create( |
| 107 | + model="claude-opus-4-5", |
| 108 | + max_tokens=4096, |
| 109 | + messages=[{ |
| 110 | + "role": "user", |
| 111 | + "content": f"Analyze this document and summarize the key findings:\n\n{doc}" |
| 112 | + }] |
| 113 | +) |
| 114 | +print(response.content[0].text) |
| 115 | +``` |
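A long document may not fit in a single prompt. One possible way to handle that, sketched below, is to split the Markdown string at heading lines and pack the sections into pieces under a character budget. `split_markdown_by_heading` and `max_chars` are hypothetical names, not part of edgeparse, and the sketch assumes headings appear as lines starting with `#`. A `#` comment inside a fenced code block would also be treated as a heading.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 8000) -> list[str]:
    """Pack heading-delimited sections into pieces of at most max_chars.

    A single section longer than max_chars is kept whole rather than cut mid-text.
    """
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]

    pieces: list[str] = []
    current = ""
    for section in sections:
        # Start a new piece when adding this section would exceed the budget.
        if current and len(current) + len(section) > max_chars:
            pieces.append(current)
            current = ""
        current += section
    if current:
        pieces.append(current)
    return pieces


# Stand-in for the string returned by edgeparse.convert(..., format="markdown").
sample = "# A\nalpha\n\n## B\nbeta\n\n# C\ngamma\n"
pieces = split_markdown_by_heading(sample, max_chars=20)
```

Each piece can then be sent in its own request, or summarized separately and merged.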
| 116 | + |
| 117 | +### RAG pipeline — chunk with metadata |
| 118 | + |
| 119 | +```python |
| 120 | +import edgeparse, json |
| 121 | + |
| 122 | +raw = edgeparse.convert("paper.pdf", format="json") |
| 123 | +doc = json.loads(raw) |
| 124 | + |
| 125 | +chunks = [] |
| 126 | +for el in doc["elements"]: |
| 127 | +    if el["type"] in ("paragraph", "heading", "table"): |
| 128 | +        chunks.append({ |
| 129 | +            "text": el["text"], |
| 130 | +            "metadata": { |
| 131 | +                "page": el["page_number"], |
| 132 | +                "type": el["type"], |
| 133 | +                "bbox": el["bounding_box"],  # for citation highlights |
| 134 | +                "order": el.get("reading_order"),  # None if the key is absent (as on the sample table element) |
| 135 | +            } |
| 136 | +        }) |
| 137 | + |
| 138 | +# Now embed each chunk["text"] and store the matching chunk["metadata"] in your vector store |
| 139 | +``` |
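Retrieval often works better when each chunk records which section it came from. The sketch below attaches the most recent heading text to every paragraph and table chunk. It assumes the `elements` list is already in reading order and uses only the field names shown in this document. `chunks_with_heading_context` and the `section` key are hypothetical, not part of edgeparse.

```python
def chunks_with_heading_context(elements: list[dict]) -> list[dict]:
    """Attach the most recent heading text to each paragraph or table chunk."""
    chunks = []
    current_heading = None
    for el in elements:
        if el.get("type") == "heading":
            # Remember the heading; it labels the chunks that follow it.
            current_heading = el.get("text")
            continue
        if el.get("type") in ("paragraph", "table"):
            chunks.append({
                "text": el.get("text", ""),
                "metadata": {
                    "page": el.get("page_number"),
                    "type": el.get("type"),
                    "section": current_heading,
                },
            })
    return chunks


# Sample elements shaped like the JSON output described in this document.
elements = [
    {"type": "heading", "level": 1, "text": "Introduction", "page_number": 1},
    {"type": "paragraph", "text": "This is body text...", "page_number": 1},
    {"type": "table", "text": "| Col A | Col B |", "page_number": 2},
]
chunks = chunks_with_heading_context(elements)
```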
| 140 | + |
| 141 | +### Batch processing |
| 142 | + |
| 143 | +```python |
| 144 | +import edgeparse |
| 145 | +from pathlib import Path |
| 146 | + |
| 147 | +results = {} |
| 148 | +for pdf in Path("documents/").glob("*.pdf"): |
| 149 | +    try: |
| 150 | +        results[pdf.name] = edgeparse.convert(str(pdf), format="markdown") |
| 151 | +    except Exception as e: |
| 152 | +        results[pdf.name] = f"ERROR: {e}" |
| 153 | +``` |
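For larger batches, it can help to keep failures apart from converted text and to run conversions concurrently. The sketch below is one way to do that with the standard library. `convert_many` and `fake_convert` are hypothetical names. The converter is passed in as a callable so the example runs without PDFs on disk. Whether threads give a real speedup depends on how edgeparse handles the GIL, which this document does not state.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def convert_many(
    paths: list[Path],
    convert_one: Callable[[str], str],
    max_workers: int = 4,
) -> tuple[dict[str, str], dict[str, str]]:
    """Convert each path; return (results, errors), both keyed by file name."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}

    def attempt(path: Path):
        # Catch only the exceptions this document lists for convert().
        try:
            return path.name, convert_one(str(path)), None
        except (FileNotFoundError, ValueError) as exc:
            return path.name, None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, text, error in pool.map(attempt, paths):
            if error is None:
                results[name] = text
            else:
                errors[name] = error
    return results, errors


# Stand-in converter for demonstration only.
def fake_convert(path: str) -> str:
    if path.endswith("bad.pdf"):
        raise ValueError("corrupt PDF")
    return f"# {Path(path).stem}"


results, errors = convert_many([Path("a.pdf"), Path("bad.pdf")], fake_convert)
```

With the real library, `convert_one` could be `lambda p: edgeparse.convert(p, format="markdown")`.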
| 154 | + |
| 155 | +### Extract specific pages only |
| 156 | + |
| 157 | +```python |
| 158 | +# Pages 1–5 |
| 159 | +text = edgeparse.convert("report.pdf", format="markdown", pages="1-5") |
| 160 | + |
| 161 | +# Non-contiguous pages |
| 162 | +text = edgeparse.convert("report.pdf", format="markdown", pages="1,3,7-10") |
| 163 | +``` |
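When the page spec comes from user input, checking it before calling `convert()` can give a clearer error message. The helper below is a hypothetical sketch, not part of edgeparse. It assumes pages are 1-based, as the examples above suggest, and it may not match the library's own parsing rules.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1,3,7-10" into a sorted list of unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        # "7-10" -> ("7", "-", "10"); "3" -> ("3", "", "")
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)  # int() raises ValueError on non-numeric text
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


pages = expand_page_spec("1,3,7-10")
```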
| 164 | + |
| 165 | +### Borderless table extraction |
| 166 | + |
| 167 | +Many financial reports and invoices use tables without ruling lines. |
| 168 | +Use `table_method="cluster"` to handle them: |
| 169 | + |
| 170 | +```python |
| 171 | +text = edgeparse.convert( |
| 172 | + "earnings.pdf", |
| 173 | + format="markdown", |
| 174 | + table_method="cluster" # spatial clustering for borderless tables |
| 175 | +) |
| 176 | +``` |
| 177 | + |
| 178 | +### Password-protected PDF |
| 179 | + |
| 180 | +```python |
| 181 | +text = edgeparse.convert("secure.pdf", format="markdown", password="mypassword") |
| 182 | +``` |
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +## Node.js usage |
| 187 | + |
| 188 | +```js |
| 189 | +import { convert } from 'edgeparse'; |
| 190 | + |
| 191 | +const markdown = convert('report.pdf', { format: 'markdown' }); |
| 192 | +const json = convert('report.pdf', { format: 'json' }); |
| 193 | + |
| 194 | +// With options |
| 195 | +const result = convert('report.pdf', { |
| 196 | + format: 'markdown', |
| 197 | + pages: '1-5', |
| 198 | + readingOrder: 'xycut', |
| 199 | + tableMethod: 'cluster', |
| 200 | +}); |
| 201 | +``` |
| 202 | + |
| 203 | +--- |
| 204 | + |
| 205 | +## JSON output schema |
| 206 | + |
| 207 | +When `format="json"`, the output is a JSON string with shape: |
| 208 | + |
| 209 | +```json |
| 210 | +{ |
| 211 | + "page_count": 10, |
| 212 | + "title": "Document Title", |
| 213 | + "elements": [ |
| 214 | + { |
| 215 | + "type": "heading", |
| 216 | + "level": 1, |
| 217 | + "text": "Introduction", |
| 218 | + "page_number": 1, |
| 219 | + "reading_order": 0, |
| 220 | + "bounding_box": { "x0": 72, "y0": 144, "x1": 540, "y1": 180 } |
| 221 | + }, |
| 222 | + { |
| 223 | + "type": "table", |
| 224 | + "text": "| Col A | Col B |\n|-------|-------|\n| val1 | val2 |", |
| 225 | + "page_number": 2, |
| 226 | + "bounding_box": { "x0": 72, "y0": 200, "x1": 540, "y1": 350 } |
| 227 | + }, |
| 228 | + { |
| 229 | + "type": "paragraph", |
| 230 | + "text": "This is body text...", |
| 231 | + "page_number": 1, |
| 232 | + "reading_order": 2, |
| 233 | + "bounding_box": { "x0": 72, "y0": 190, "x1": 540, "y1": 220 } |
| 234 | + } |
| 235 | + ] |
| 236 | +} |
| 237 | +``` |
| 238 | + |
| 239 | +Element `type` values: `heading`, `paragraph`, `table`, `list`, `list_item`, `figure`, `caption`, `header`, `footer`. |
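As an illustration of working with these element types, the sketch below rebuilds per-page body text from a parsed document. It drops `header` and `footer` elements and sorts the rest by `reading_order` where that key is present. `body_text_by_page` is a hypothetical helper, and the sample dict only mirrors the shape shown above.

```python
from collections import defaultdict


def body_text_by_page(doc: dict, skip_types=("header", "footer")) -> dict[int, str]:
    """Join element text per page, ordered by reading_order where present."""
    by_page: dict[int, list[dict]] = defaultdict(list)
    for el in doc.get("elements", []):
        if el.get("type") not in skip_types:
            by_page[el["page_number"]].append(el)

    pages = {}
    for page, els in sorted(by_page.items()):
        # Elements without reading_order sort after those that have one.
        els.sort(key=lambda el: el.get("reading_order", float("inf")))
        pages[page] = "\n\n".join(el.get("text", "") for el in els)
    return pages


doc = {
    "page_count": 2,
    "elements": [
        {"type": "paragraph", "text": "This is body text...", "page_number": 1, "reading_order": 2},
        {"type": "heading", "level": 1, "text": "Introduction", "page_number": 1, "reading_order": 0},
        {"type": "footer", "text": "Page 1", "page_number": 1, "reading_order": 3},
        {"type": "table", "text": "| Col A | Col B |", "page_number": 2},
    ],
}
pages = body_text_by_page(doc)
```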
| 240 | + |
| 241 | +--- |
| 242 | + |
| 243 | +## Error handling |
| 244 | + |
| 245 | +```python |
| 246 | +import edgeparse |
| 247 | + |
| 248 | +try: |
| 249 | +    text = edgeparse.convert("report.pdf", format="markdown") |
| 250 | +except FileNotFoundError: |
| 251 | +    # PDF file not found; check the path |
| 252 | +    pass |
| 253 | +except ValueError as e: |
| 254 | +    # Invalid format, corrupt PDF, wrong password, or bad page range |
| 255 | +    print(f"Extraction failed: {e}") |
| 256 | +``` |
| 257 | + |
| 258 | +--- |
| 259 | + |
| 260 | +## For more detail |
| 261 | + |
| 262 | +Read these reference files when the SKILL.md body isn't enough: |
| 263 | +- `references/api.md` — complete Python + Node.js API with all parameters and types |
| 264 | +- `references/patterns.md` — LangChain, LlamaIndex, MCP tool, CrewAI, and async batch patterns |