This repository contains a sequence of Python scripts that query academic literature databases (PubMed, Scopus) and a patent database (EPO) for records on antimicrobial peptides, download the corresponding PDFs, and extract structured data from them with a large language model through the Google Gemini API.
Ensure the following Python packages are installed:
- requests
- pandas
- biopython (Bio)
- python-dateutil
- elsapy
- python-dotenv
- tqdm
- epo-ops-client
- google-genai
Create a .env file in the root directory containing:
- X-ELS-INST: Scopus Institutional Token.
- X-ELS-APIKEY: Scopus API Key.
- GEMINI_KEY: Google Gemini API Key.
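
A missing key tends to surface only when a later step calls the corresponding API, so it can help to check the three values up front. The following is a minimal sketch, not the repository's own code: `missing_keys` and `check_env` are hypothetical helpers, and it assumes the `.env` file is loaded with `load_dotenv()` from python-dotenv (the package is listed above, but the loading code is not shown in this README).

```python
import os

# Key names taken from the .env description above.
REQUIRED_KEYS = ("X-ELS-INST", "X-ELS-APIKEY", "GEMINI_KEY")


def missing_keys(env=None):
    """Return the required keys that are absent or empty in the mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def check_env():
    """Load .env if python-dotenv is available, then fail fast on gaps."""
    try:
        # Assumption: the scripts load .env from the working directory.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # fall back to variables already present in the environment
    missing = missing_keys()
    if missing:
        raise RuntimeError("Missing .env entries: " + ", ".join(missing))
```

Calling `check_env()` at the start of a run would report every missing entry at once instead of failing one API at a time.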
The scripts expect the following directory structure to exist:
- data/queries/: Text files containing search queries (pubmed.txt, scopus.txt, epo.txt).
- data/prompts/step_04/: LLM prompt (01_extraction_prompt.txt) and response schema (01_extraction_prompt_response_format.json).
- data/literature_data/: Output directory for literature CSV metadata.
- data/literature_data/pubmed/: Output directory for PubMed PDFs.
- data/literature_data/scopus/: Output directory for Scopus PDFs.
- data/patent_data/epo/: Output directory for EPO JSON and PDF data.
- data/llm_data_extraction/literature_data/: Output directory for parsed LLM JSON results.
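
Because the scripts expect these directories to exist already, one option is to create them before the first run. The sketch below uses only the standard library; `ensure_dirs` is a hypothetical helper, not part of the repository, and it creates directories only. The query and prompt files still have to be supplied by hand.

```python
from pathlib import Path

# Directory names copied from the layout described above.
EXPECTED_DIRS = (
    "data/queries",
    "data/prompts/step_04",
    "data/literature_data",
    "data/literature_data/pubmed",
    "data/literature_data/scopus",
    "data/patent_data/epo",
    "data/llm_data_extraction/literature_data",
)


def ensure_dirs(root="."):
    """Create any missing directories under root; return those created."""
    created = []
    for rel in EXPECTED_DIRS:
        path = Path(root) / rel
        if not path.is_dir():
            path.mkdir(parents=True, exist_ok=True)
            created.append(rel)
    return created
```

Running `ensure_dirs()` from the repository root is safe to repeat: a second call finds everything in place and returns an empty list.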
step_01_download_pubmed_papers_metadata.py: Queries PubMed from 1900 to present. Iterates month-by-month to bypass the 10,000 record ESEARCH limit. Outputs pubmed_metadata.csv.
step_02_download_scopus_papers_metadata.py: Queries Scopus using elsapy. Iterates by month to circumvent API limits. Outputs scopus_metadata.csv.
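
The month-by-month iteration used by both metadata scripts amounts to splitting one long date range into windows small enough that each query stays under the record cap. A minimal sketch of that windowing follows, using only the standard library; the scripts themselves may well use python-dateutil for this, and `month_windows` is a hypothetical name.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end month by month."""
    current = date(start.year, start.month, 1)
    while current <= end:
        # First day of the following month, rolling over the year in December.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clip the window to the requested range at both ends.
        last_day = min(next_month - timedelta(days=1), end)
        yield max(current, start), last_day
        current = next_month
```

Each pair can then serve as the date range of one query (for ESearch, the `mindate` and `maxdate` parameters). If a single month ever exceeds the cap, the same idea applies with narrower windows.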
step_03_download_and_extract_papers_data.py: Reads metadata CSVs, generates unique internal identifiers (UUIDs/Hashes), and attempts to download the corresponding PDFs using external utility scripts via DOI. Maintains state in [source]_downloaded.csv to prevent duplicate downloads.
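
The identifier and state-file mechanism can be illustrated as follows. This is a sketch under stated assumptions rather than the script's actual scheme: it assumes the identifier is a SHA-1 hash of the normalized DOI and that the state CSV has `id` and `doi` columns. Neither detail is confirmed by this README, and all three function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def internal_id(doi):
    """Derive a stable identifier from a DOI (assumed scheme: SHA-1 hex)."""
    return hashlib.sha1(doi.strip().lower().encode("utf-8")).hexdigest()


def load_done(state_csv):
    """Return the set of identifiers already recorded in the state file."""
    path = Path(state_csv)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["id"] for row in csv.DictReader(handle)}


def mark_done(state_csv, paper_id, doi):
    """Append one record, writing the header row if the file is new."""
    path = Path(state_csv)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["id", "doi"])  # hypothetical column names
        writer.writerow([paper_id, doi])
```

A download loop would call `load_done` once, skip any paper whose identifier is in the returned set, and call `mark_done` after each successful download, so an interrupted run can resume without fetching the same PDF twice.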
step_05_download_epo_data.py: Queries the European Patent Office (EPO) Open Patent Services (OPS). Retrieves bibliographic data and claims as JSON. Attempts to retrieve full-text PDFs directly from EPO OPS, falling back to FreePatentsOnline or Google Patents via cURL/wget.
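
The fallback order described here (EPO OPS first, then the alternative sources) follows a simple "first source that works" pattern. The sketch below shows only that control flow: `first_successful` is a hypothetical helper, and the fetch callables stand in for whatever the script actually invokes (the OPS client, cURL, or wget). The one concrete check is that a payload begins with the `%PDF` signature, which guards against saving an HTML error page as a PDF.

```python
def first_successful(fetchers, patent_number):
    """Try each (name, fetch) pair in order.

    Returns (name, data) for the first source that yields PDF bytes,
    or (None, None) if every source fails.
    """
    for name, fetch in fetchers:
        try:
            data = fetch(patent_number)
        except Exception:
            continue  # treat any failure as "try the next source"
        # Accept only payloads that start with the PDF file signature.
        if data and data.startswith(b"%PDF"):
            return name, data
    return None, None
```

Returning the source name alongside the bytes makes it possible to log which fallback supplied each document.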
step_10_extract_peptide_data_using_llm.py: Iterates through the downloaded PDFs. Uploads each document to Google Gemini (gemini-2.5-pro) alongside an extraction prompt and JSON schema constraint. Outputs structured data to JSON files mapped to the internal identifiers.
