Commit a0e732f

feat(edgeparse): add EdgeParse plugin to marketplace (#152)
Add EdgeParse as a built-in plugin providing Rust-native PDF extraction that converts PDFs to Markdown, JSON, HTML, or plain text. The commit includes a skill for intelligent activation, registers the plugin in the marketplace, and updates the MarkItDown skill to recommend EdgeParse for PDF use cases.
1 parent b83144b commit a0e732f

9 files changed

Lines changed: 848 additions & 1 deletion

.claude-plugin/marketplace.json

Lines changed: 8 additions & 0 deletions
@@ -568,6 +568,14 @@
568568
"ref": "main"
569569
},
570570
"homepage": "https://github.com/pleaseai/gemini-plugin-cc/tree/main/plugins/gemini"
571+
},
572+
{
573+
"name": "edgeparse",
574+
"description": "Extract structured content from any PDF — headings, tables, paragraphs, lists, bounding boxes — deterministically, without ML dependencies or GPU requirements",
575+
"category": "document",
576+
"keywords": ["pdf", "extraction", "parsing", "markdown", "rag"],
577+
"tags": ["skill", "document"],
578+
"source": "./plugins/edgeparse"
571579
}
572580
]
573581
}

README.md

Lines changed: 5 additions & 0 deletions
@@ -282,6 +282,11 @@ ASK (Agent Skills Kit) — AI agent skills for managing library documentation re
282282

283283
**Install:** `/plugin install ask@pleaseai` | **Source:** [pleaseai/ask](https://github.com/pleaseai/ask)
284284

285+
#### EdgeParse
286+
Extract structured content from any PDF — headings, tables, paragraphs, lists, bounding boxes — deterministically, without ML dependencies or GPU requirements.
287+
288+
**Install:** `/plugin install edgeparse@pleaseai` | **Source:** [plugins/edgeparse](https://github.com/pleaseai/claude-code-plugins/tree/main/plugins/edgeparse)
289+
285290
## Quick Start
286291

287292
The fastest way to get started — install the marketplace and let the plugin recommender auto-detect what you need:
Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
---
name: edgeparse
description: "Extract structured content from any PDF for AI agents, RAG pipelines, and Copilot Skills. Use this skill whenever the user wants to read, analyze, or reason about a PDF document; needs to feed document content to an LLM; mentions PDF extraction, parsing, or conversion; wants tables, headings, or bounding boxes from a PDF; is building a RAG pipeline; or asks an agent to process a document. Install with: pip install edgeparse"
license: Apache-2.0
metadata:
  authors: "EdgeParse Contributors"
  version: "0.1.0"
  package: "edgeparse"
  install_python: "pip install edgeparse"
  install_node: "npm install edgeparse"
  source: "raphaelmansuy/edgeparse"
---

# EdgeParse Skill

Enables AI agents to extract clean, structured content from any PDF (headings, tables, paragraphs, lists, bounding boxes) deterministically, without ML dependencies or GPU requirements.

**Install:** `pip install edgeparse` · **Node.js:** `npm install edgeparse`
**Speed:** ~0.023 s/doc (Apple M4 Max, 200-doc benchmark)

---

## When to reach for this skill

Activate when the workflow involves:
- Reading or analyzing a PDF document on behalf of a user
- Building a RAG pipeline that ingests PDFs
- Feeding PDF content to an LLM for summarization, Q&A, or synthesis
- Extracting tables from financial reports, research papers, or invoices
- Processing a batch of documents for indexing or search
- An agent tool that must "open" a PDF and return its contents

---

## Quick start

```python
import json

import edgeparse

# Convert any PDF to Markdown (best for LLM context windows)
text = edgeparse.convert("report.pdf", format="markdown")

# Convert to JSON with bounding boxes and full structure
doc = json.loads(edgeparse.convert("report.pdf", format="json"))

# Plain text (fast, minimal)
plain = edgeparse.convert("report.pdf", format="text")
```

The `format` parameter controls output:
| Value | Best for |
|-------|----------|
| `"markdown"` | LLM context: headings, tables, lists in Markdown |
| `"json"` | Bounding boxes, citations, structured element metadata |
| `"html"` | Web rendering, semantic HTML5 |
| `"text"` | Simple full-text search, minimal output |

---

## Core API

### `edgeparse.convert()`

```python
result: str = edgeparse.convert(
    input_path,              # str or Path (required)
    format="markdown",       # output format (see table above)
    pages=None,              # e.g. "1-5" or "1,3,7-10" to extract specific pages only
    password=None,           # for password-protected PDFs
    reading_order="xycut",   # "xycut" (spatial sort, default) or "off"
    table_method="default",  # "default" (ruling-line) or "cluster" (borderless)
    image_output="off",      # "off", "embedded" (base64), "external" (files)
)
```

Returns the extracted content as a **string**. Raises `FileNotFoundError` for missing files and `ValueError` for corrupt PDFs, wrong passwords, or invalid options such as an unknown format or a bad page range.

### `edgeparse.convert_file()`

```python
out_path: str = edgeparse.convert_file(
    input_path,
    output_dir="output",  # write output file to this directory
    format="markdown",
    pages=None,
    password=None,
)
```

Writes the output file and returns its path.

---

## Common patterns

### Feed a PDF to an LLM

```python
import edgeparse
import anthropic

doc = edgeparse.convert("report.pdf", format="markdown")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Analyze this document and summarize the key findings:\n\n{doc}"
    }]
)
print(response.content[0].text)
```
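
Long PDFs can exceed the model's context window. The sketch below shows one way to trim the Markdown before building the prompt: it keeps whole heading-delimited sections until a character budget is spent. `trim_markdown` and `max_chars` are hypothetical names, not part of EdgeParse, and characters are only a rough proxy for tokens.

```python
def trim_markdown(markdown: str, max_chars: int) -> str:
    """Keep whole heading-delimited sections until the character budget is used up."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Treat any line starting with "#" as a heading that opens a new section.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    kept: list[str] = []
    used = 0
    for section in sections:
        cost = len(section) + 1  # +1 for the newline that joins sections
        if used + cost > max_chars:
            break
        kept.append(section)
        used += cost
    return "\n".join(kept)


sample = "# Intro\nShort text.\n# Methods\nA much longer section body.\n# Results\nNumbers."
print(trim_markdown(sample, max_chars=30))
```

With a 30-character budget only the first section fits, so the later sections are dropped whole rather than cut mid-sentence. The heading test is deliberately naive: a `#` comment inside a fenced code block would also start a new section.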

### RAG pipeline: chunk with metadata

```python
import json

import edgeparse

raw = edgeparse.convert("paper.pdf", format="json")
doc = json.loads(raw)

chunks = []
for el in doc["elements"]:
    if el["type"] in ("paragraph", "heading", "table"):
        chunks.append({
            "text": el["text"],
            "metadata": {
                "page": el["page_number"],
                "type": el["type"],
                "bbox": el["bounding_box"],  # for citation highlights
                # The table example in the JSON schema has no "reading_order",
                # so use .get() to avoid a KeyError.
                "order": el.get("reading_order"),
            }
        })

# Now embed each chunk's "text" and store its "metadata" alongside it in your vector store
```
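
Retrieval often works better when each chunk records the section it came from. The sketch below runs on a list shaped like `doc["elements"]` and tags each paragraph or table with the nearest preceding heading. `attach_headings` is a hypothetical helper, and sorting by `(page_number, reading_order)` is an assumption about how the two fields relate, not documented EdgeParse behavior.

```python
def attach_headings(elements: list[dict]) -> list[dict]:
    """Sort elements into reading order and tag each paragraph or table with its section heading."""
    ordered = sorted(
        elements,
        # Elements without "reading_order" sort last within their page.
        key=lambda el: (el["page_number"], el.get("reading_order", float("inf"))),
    )
    chunks = []
    heading = None
    for el in ordered:
        if el["type"] == "heading":
            heading = el["text"]
        elif el["type"] in ("paragraph", "table"):
            chunks.append({"text": el["text"], "section": heading, "page": el["page_number"]})
    return chunks


# Sample data in the shape of the JSON output; values are illustrative.
elements = [
    {"type": "paragraph", "text": "Body text.", "page_number": 1, "reading_order": 2},
    {"type": "heading", "text": "Introduction", "page_number": 1, "reading_order": 0},
    {"type": "table", "text": "| A | B |", "page_number": 2},
]
for chunk in attach_headings(elements):
    print(chunk)
```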

### Batch processing

```python
import edgeparse
from pathlib import Path

results = {}
for pdf in Path("documents/").glob("*.pdf"):
    try:
        results[pdf.name] = edgeparse.convert(str(pdf), format="markdown")
    except Exception as e:
        # Record the failure and keep processing the remaining files
        results[pdf.name] = f"ERROR: {e}"
```
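
Storing error strings in the same dictionary as converted text makes failures easy to mistake for content downstream. A possible variation keeps the two apart. In this sketch `convert_all` and `fake_convert` are hypothetical; the converter is passed in as a callable so the example runs without any PDFs, and in real use it would wrap `edgeparse.convert`.

```python
from pathlib import Path
from typing import Callable


def convert_all(
    paths: list[Path],
    convert: Callable[[str], str],
) -> tuple[dict[str, str], dict[str, str]]:
    """Run `convert` on each path; return (outputs, errors), both keyed by file name."""
    outputs: dict[str, str] = {}
    errors: dict[str, str] = {}
    for path in paths:
        try:
            outputs[path.name] = convert(str(path))
        except (FileNotFoundError, ValueError) as exc:
            # Only the two exception types named in this document are caught here.
            errors[path.name] = f"{type(exc).__name__}: {exc}"
    return outputs, errors


def fake_convert(path: str) -> str:
    """Stand-in converter for the demo; replace with a call to edgeparse.convert."""
    if path.endswith("bad.pdf"):
        raise ValueError("corrupt PDF")
    return f"# Contents of {path}"


outputs, errors = convert_all([Path("a.pdf"), Path("bad.pdf")], fake_convert)
print(outputs)
print(errors)
```

In real use the callable could be `lambda p: edgeparse.convert(p, format="markdown")`.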

### Extract specific pages only

```python
# Pages 1-5
text = edgeparse.convert("report.pdf", format="markdown", pages="1-5")

# Non-contiguous pages
text = edgeparse.convert("report.pdf", format="markdown", pages="1,3,7-10")
```
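
The `pages` strings above mix single pages with ranges. To show what such a spec selects, here is an illustrative expander. It is not EdgeParse's own parser: `expand_pages` is a hypothetical helper, and it assumes ranges are inclusive and pages are 1-based.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a spec such as "1,3,7-10" into a sorted list of 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"bad page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)  # int() raises ValueError on empty or non-numeric parts
            if page < 1:
                raise ValueError(f"bad page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(expand_pages("1,3,7-10"))  # [1, 3, 7, 8, 9, 10]
```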

### Borderless table extraction

Many financial reports and invoices use tables without ruling lines.
Use `table_method="cluster"` to handle them:

```python
text = edgeparse.convert(
    "earnings.pdf",
    format="markdown",
    table_method="cluster"  # spatial clustering for borderless tables
)
```
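
Extracted tables arrive as Markdown pipe tables, as in the `text` field of the table element in the JSON output schema. When downstream code needs rows instead of text, a small parser can turn one into dictionaries. `parse_markdown_table` is a hypothetical helper that assumes a header row, one separator row, and no escaped `|` characters inside cells.

```python
def parse_markdown_table(table_md: str) -> list[dict[str, str]]:
    """Parse a simple Markdown pipe table into one dict per body row, keyed by header."""
    rows = []
    for line in table_md.strip().splitlines():
        # Drop the outer pipes, then split the remaining text into trimmed cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]


table_md = "| Col A | Col B |\n|-------|-------|\n| val1  | val2  |"
print(parse_markdown_table(table_md))  # [{'Col A': 'val1', 'Col B': 'val2'}]
```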

### Password-protected PDF

```python
text = edgeparse.convert("secure.pdf", format="markdown", password="mypassword")
```

---

## Node.js usage

```js
import { convert } from 'edgeparse';

const markdown = convert('report.pdf', { format: 'markdown' });
const json = convert('report.pdf', { format: 'json' });

// With options
const result = convert('report.pdf', {
  format: 'markdown',
  pages: '1-5',
  readingOrder: 'xycut',
  tableMethod: 'cluster',
});
```

---

## JSON output schema

When `format="json"`, the output is a JSON string with this shape:

```json
{
  "page_count": 10,
  "title": "Document Title",
  "elements": [
    {
      "type": "heading",
      "level": 1,
      "text": "Introduction",
      "page_number": 1,
      "reading_order": 0,
      "bounding_box": { "x0": 72, "y0": 144, "x1": 540, "y1": 180 }
    },
    {
      "type": "table",
      "text": "| Col A | Col B |\n|-------|-------|\n| val1 | val2 |",
      "page_number": 2,
      "bounding_box": { "x0": 72, "y0": 200, "x1": 540, "y1": 350 }
    },
    {
      "type": "paragraph",
      "text": "This is body text...",
      "page_number": 1,
      "reading_order": 2,
      "bounding_box": { "x0": 72, "y0": 190, "x1": 540, "y1": 220 }
    }
  ]
}
```

Element `type` values: `heading`, `paragraph`, `table`, `list`, `list_item`, `figure`, `caption`, `header`, `footer`.
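
For code that consumes this output, it can help to load the elements into typed records rather than passing raw dictionaries around. The following is a minimal sketch, not an EdgeParse API: the `Element` dataclass and `load_elements` are hypothetical, and treating `level`, `reading_order`, and `text` as optional is an assumption based only on the example above.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    type: str
    text: str
    page_number: int
    bounding_box: dict
    reading_order: Optional[int] = None  # missing on the table example above
    level: Optional[int] = None          # only shown on headings


def load_elements(raw: str) -> list[Element]:
    """Parse the JSON string and build one Element per entry in "elements"."""
    doc = json.loads(raw)
    return [
        Element(
            type=el["type"],
            text=el.get("text", ""),  # assumption: some element types may carry no text
            page_number=el["page_number"],
            bounding_box=el["bounding_box"],
            reading_order=el.get("reading_order"),
            level=el.get("level"),
        )
        for el in doc["elements"]
    ]


# Sample input mirroring the schema above; values are illustrative.
raw = json.dumps({
    "page_count": 2,
    "title": "Document Title",
    "elements": [
        {"type": "heading", "level": 1, "text": "Introduction", "page_number": 1,
         "reading_order": 0, "bounding_box": {"x0": 72, "y0": 144, "x1": 540, "y1": 180}},
        {"type": "table", "text": "| Col A | Col B |", "page_number": 2,
         "bounding_box": {"x0": 72, "y0": 200, "x1": 540, "y1": 350}},
    ],
})
elements = load_elements(raw)
print(Counter(el.type for el in elements))
```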

---

## Error handling

```python
import edgeparse

try:
    text = edgeparse.convert("report.pdf", format="markdown")
except FileNotFoundError:
    # PDF file not found: check the path
    pass
except ValueError as e:
    # Invalid format, corrupt PDF, wrong password, or bad page range
    print(f"Extraction failed: {e}")
```

---

## For more detail

Read these reference files when the SKILL.md body isn't enough:
- `references/api.md`: complete Python + Node.js API with all parameters and types
- `references/patterns.md`: LangChain, LlamaIndex, MCP tool, CrewAI, and async batch patterns
