# article-extraction

Here are 33 public repositories matching this topic...

ieg-dhr / NLP-Course4Humanities_2024

This repository is part of an NLP course for humanities and cultural studies that applies NLP methods to historical newspapers. Tasks covered: tokenization, lemmatization, TF-IDF, part-of-speech tagging, semantic search with transformers, article extraction and OCR post-correction with LLMs, NER, and text classification.

nlp webpage text-classification teaching ner semantic-search nlp-machine-learning university-course historical-newspapers transformers-models llms article-extraction

Updated Jun 5, 2025
Jupyter Notebook

dstark5 / gnews-scraper

GNewsScraper is a TypeScript package that scrapes article data from Google News based on a keyword or phrase. It returns the results as an array of JSON objects, so the scraped information is easy to access and use.

typescript web-scraping json-parsing web-crawling google-news data-scraping google-news-scraper web-data-extraction web-automation keyword-search gnews news-scraping gnews-api article-extraction gnews-scraper

Updated Aug 19, 2023
TypeScript

nasplycc / wechat-mp-reader

OpenClaw Skill: reads WeChat Official Account articles, identifies the official account, and fetches its article list

python skill wechat wechat-official-account playwright article-extraction openclaw

Updated Apr 2, 2026
Python

levindixon / WebMD

📋 WebMD is a Chrome extension that transforms web pages into Markdown documents with surgical precision.

javascript chrome-extension markdown gfm github-flavored-markdown html-to-markdown web-scraping readability browser-extension markdown-converter content-extraction web-tools turndown manifest-v3 article-extraction

Updated Jul 3, 2025
JavaScript

Yasser03 / pipescraper

A pipe-based news article scraping and metadata extraction library for Python

python crawler data-science scraper spider data-collection news-scraper osint-python llms article-extraction trafilatura newspaper4k

Updated Mar 20, 2026
Python

riainzhang / html-2-markdown

A simple HTML-to-Markdown converter with article extraction, selector filtering, and batch conversion.

python html markdown cli converter html-to-markdown batch-conversion article-extraction

Updated Jun 7, 2026
Python

UtrechtUniversity / dataQuest

A configurable pipeline for extracting and filtering articles from large corpora, tailored for the Delpher Kranten corpus, with support for keyword filtering and TF-IDF-based relevance scoring.

information-retrieval corpus-processing article-extraction keyword-filtering delpher-kranten

Updated Apr 18, 2025
Python

stn1slv / md-fetch

Python library that extracts article content from Medium and dev.to and returns it as clean, well-structured Markdown.

python markdown medium scraping devto article-extraction

Updated Jun 2, 2026
Python

alxytaylor41 / zendesk-help-center

Zendesk article extraction toolkit

python zendesk scraper help html-parsing web-crawling center article-extraction knowledge-base-scraper customer-support-data

Updated Dec 5, 2025

jeffgreendesign / mdyoink

Chrome extension that yoinks webpages into clean markdown. Supports article extraction, full-page capture, YouTube transcripts, and visual element picking.

javascript chrome-extension markdown browser-extension youtube-transcripts article-extraction web-clipping

Updated Jun 8, 2026
JavaScript

yrstm / mantis

Capture exactly what the user sees and turn any page into structured JSON or clean Markdown. Built for read-later and bookmarking apps, and for AI agents that need token-efficient input. Readability-style, zero dependencies, single file.

markdown scraper bookmarklet html-to-markdown readability ai-agents read-later llm article-extraction

Updated Jun 10, 2026
JavaScript

0x4D44 / readex

HTML main-content extraction for Rust — ports of Mozilla Readability, Trafilatura, and htmldate.

html rust text-extraction web-scraping html-parser readability content-extraction metadata-extraction boilerplate-removal article-extraction trafilatura htmldate

Updated May 26, 2026
HTML

pankaj28843 / article-extractor

Pure-Python article extraction library and HTTP API - Extract clean content from web pages as Markdown or HTML

python api docker markdown html-to-markdown web-scraping readability content-extraction fastapi llm article-extraction

Updated Jun 8, 2026
Python

voidkingultramaster / ksl-scraper

KSL news article scraper

python data-mining scraper requests beautifulsoup media-monitoring ksl news-scraping article-extraction

Updated Dec 12, 2025

moamen1358 / WannaScrape

Production web scraper with Playwright, bot-detection plugins, fingerprint rotation, and CAPTCHA solving. CLI + FastAPI.

python captcha web-scraping browser-automation fastapi anti-detection playwright article-extraction

Updated May 9, 2026
Python

miclle / readability.go

Go implementation of Mozilla Readability with fixture-level compatibility

go html-parser readability mozilla-readability article-extraction

Updated May 1, 2026
Go

thedavidmurray / claude-article-extractor

Claude Code skill for article extraction with 4-backend fallback chain

web-scraping developer-tools article-extraction claude-code

Updated May 3, 2026

pulsedev2gwencd / us-magazine-scraper

US Magazine news extractor

nodejs javascript scraper web-scraping us magazine media-monitoring news-scraping article-extraction

Updated Dec 12, 2025

phantommanzonek / the-daily-beast-scraper

Daily Beast news scraper

python scraper beast web-scraping daily data-extraction the media-monitoring news-scraping article-extraction

Updated Dec 13, 2025

calebdshenk-a11y / article-word-counter

Chrome/Edge extension that estimates article word count, reading time, and lets you double-click any word to update the toolbar badge with your reading progress.

javascript reading browser-extension word-count article-extraction

Updated Apr 30, 2026
JavaScript
