Scientific Paper Harvester MCP Server
A comprehensive Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from 6 major academic sources: arXiv, OpenAlex, PMC (PubMed Central), Europe PMC, bioRxiv/medRxiv, and CORE.
🚀 Features
Comprehensive Source Coverage
- arXiv: Computer science, physics, mathematics preprints and papers
- OpenAlex: Open catalog of scholarly papers with citation data
- PMC: PubMed Central biomedical and life science literature
- Europe PMC: European life science literature database
- bioRxiv/medRxiv: Biology and medical preprint servers
- CORE: World's largest collection of open access research papers
Advanced Capabilities
- Paper Fetching: Get latest papers from any source by category/concept
- Paper Search: Search papers by title, abstract, author, or full text across 4 sources (arXiv, OpenAlex, Europe PMC, and CORE)
- Full-Text Extraction: Extract complete text content with intelligent fallback strategies
- Citation Analysis: Find top cited papers from OpenAlex since a specific date
- Paper Lookup: Retrieve full metadata for specific papers by ID
- Category Discovery: Browse available categories from all sources
- Smart Rate Limiting: Respectful API usage with per-source rate limiting
- DOI Resolution: Advanced DOI resolver with Unpaywall → Crossref → Semantic Scholar fallback
- Dual Interface: Both MCP protocol and CLI access
- TypeScript: Full type safety with ESM modules
📊 Coverage Statistics
- Total Sources: 6 academic databases
- Category Coverage: 100+ categories across all disciplines
- Paper Access: 200M+ papers with intelligent text extraction
- Text Extraction Success: >90% for supported paper types
- Response Time: <15 seconds average for paper fetching
🛠 Installation
npm install
npm run build
📋 MCP Client Configuration
To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:
For the published package (available on npm):
Option 1: Using npx (recommended for AI tools like Claude)
{
"mcpServers": {
"scientific-papers": {
"command": "npx",
"args": [
"-y",
"@futurelab-studio/latest-science-mcp@latest"
]
}
}
}
Option 2: Global installation
npm install -g @futurelab-studio/latest-science-mcp
Then configure:
{
"mcpServers": {
"scientific-papers": {
"command": "latest-science-mcp"
}
}
}
📖 Usage
CLI Interface
List Categories
# List arXiv categories
node dist/cli.js list-categories --source=arxiv
# List OpenAlex concepts
node dist/cli.js list-categories --source=openalex
# List PMC biomedical categories
node dist/cli.js list-categories --source=pmc
# List Europe PMC life science categories
node dist/cli.js list-categories --source=europepmc
# List bioRxiv/medRxiv categories (includes both servers)
node dist/cli.js list-categories --source=biorxiv
# List CORE academic categories
node dist/cli.js list-categories --source=core
Fetch Latest Papers
# Get latest AI papers from arXiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10
# Get latest biology papers from bioRxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=5
# Get latest immunology papers from PMC
node dist/cli.js fetch-latest --source=pmc --category=immunology --count=3
# Get latest papers from CORE by subject
node dist/cli.js fetch-latest --source=core --category=computer_science --count=5
# Search by concept name (OpenAlex)
node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3
Fetch Top Cited Papers
# Get top 20 cited papers in machine learning since 2024
node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20
# Get top cited papers by concept ID
node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10
Search Papers
# Search by keywords across all fields
node dist/cli.js search-papers --source=arxiv --query="machine learning" --count=10
# Search by paper title
node dist/cli.js search-papers --source=openalex --query="neural networks" --field=title --count=5
# Search by author name
node dist/cli.js search-papers --source=europepmc --query="John Smith" --field=author --count=10
# Search full-text content sorted by citations
node dist/cli.js search-papers --source=core --query="climate change" --field=fulltext --sortBy=citations --count=20
Fetch Specific Paper Content
# Get arXiv paper by ID
node dist/cli.js fetch-content --source=arxiv --id=2401.12345
# Get bioRxiv paper by DOI
node dist/cli.js fetch-content --source=biorxiv --id="10.1101/2021.01.01.425001"
# Get PMC paper by ID
node dist/cli.js fetch-content --source=pmc --id=PMC8245678
# Get CORE paper by ID
node dist/cli.js fetch-content --source=core --id=12345678
# Show text content with preview
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500
🔧 Available Tools
list_categories
Lists available categories/concepts from any data source.
Parameters:
- source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
Returns:
- Array of category objects with id, name, and optional description
Example:
{
"name": "list_categories",
"arguments": {
"source": "biorxiv"
}
}
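For context, MCP clients deliver a tool call like the one above as a JSON-RPC 2.0 request with the method tools/call. The TypeScript sketch below wraps the example in that envelope. The names ToolCall and toToolCallRequest are illustrative assumptions and are not exports of this package; most users never build this request by hand because the MCP client does it for them.

```typescript
// Illustrative only: ToolCall and toToolCallRequest are hypothetical names,
// not part of this package's API.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

let nextRequestId = 1;

// Wrap a tool call in the JSON-RPC 2.0 envelope used by MCP "tools/call".
function toToolCallRequest(call: ToolCall) {
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: call,
  };
}

const listBiorxivCategories = toToolCallRequest({
  name: "list_categories",
  arguments: { source: "biorxiv" },
});
```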
fetch_latest
Fetches the latest papers from any source for a given category with metadata only (no text extraction).
Parameters:
- source: "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core"
- category: Category ID or concept name (varies by source)
- count: Number of papers to fetch (default: 50, max: 200)
Category Examples by Source:
- arXiv: "cs.AI", "physics.gen-ph", "math.CO"
- OpenAlex: "artificial intelligence", "machine learning", "C41008148"
- PMC: "immunology", "genetics", "neuroscience"
- Europe PMC: "biology", "medicine", "cancer"
- bioRxiv/medRxiv: "biorxiv:neuroscience", "medrxiv:psychiatry"
- CORE: "computer_science", "mathematics", "physics"
Returns:
- Array of paper objects with metadata (id, title, authors, date, pdf_url)
- Text field: Empty string (text: ""); use fetch_content for full text
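A caller may want to normalize count before sending a fetch_latest request. The sketch below applies the documented default (50) and maximum (200). Clamping oversized values to the maximum is an assumption; the server may reject them instead. The helper names normalizeCount and buildFetchLatestCall are hypothetical.

```typescript
// Hypothetical client-side helpers; names are illustrative, not from this package.
type Source = "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core";

const DEFAULT_COUNT = 50; // documented default
const MAX_COUNT = 200; // documented maximum

// Apply the default and clamp to the maximum (assumed behavior, see lead-in).
function normalizeCount(count?: number): number {
  if (count === undefined) return DEFAULT_COUNT;
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError(`count must be a positive integer, got ${count}`);
  }
  return Math.min(count, MAX_COUNT);
}

// Build the arguments object in the same shape as the list_categories example.
function buildFetchLatestCall(source: Source, category: string, count?: number) {
  return {
    name: "fetch_latest",
    arguments: { source, category, count: normalizeCount(count) },
  };
}

const latestAiPapers = buildFetchLatestCall("arxiv", "cs.AI", 10);
```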
fetch_top_cited
Fetches the top cited papers from OpenAlex for a given concept since a specific date.
Parameters:
- concept: Concept name or OpenAlex concept ID
- since: Start date in YYYY-MM-DD format
- count: Number of papers to fetch (default: 50, max: 200)
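Because since must be in YYYY-MM-DD format, it can help to check the value before calling the tool. This is a minimal client-side sketch; isValidSinceDate is a hypothetical helper, and the server's own validation may differ.

```typescript
// Hypothetical helper: checks the YYYY-MM-DD shape and rejects impossible dates.
function isValidSinceDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) return false;
  // A date like 2024-02-30 may roll over to March; the round trip catches that.
  return parsed.toISOString().slice(0, 10) === value;
}
```

For example, "2024-01-01" passes, while "01-01-2024" and "2024-02-30" do not.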
search_papers
Searches for papers across multiple academic sources with field-specific search and sorting options.
Parameters:
- source: "arxiv" | "openalex" | "europepmc" | "core"
- query: Search query string (max 1500 characters)
- field: "all" | "title" | "abstract" | "author" | "fulltext" (default: "all")
- count: Number of results to return (default: 50, max: 200)
- sortBy: "relevance" | "date" | "citations" (default: "relevance")
Search Capabilities by Source:
- arXiv: Title, abstract, author, and general search with Boolean operators
- OpenAlex: Advanced search with relevance scoring and citation sorting
- Europe PMC: Biomedical literature with MeSH terms and full-text search
- CORE: Global academic papers with advanced query language
Example Queries:
- Keywords: "machine learning", "climate change"
- Phrases: "artificial intelligence" (use quotes for exact phrases)
- Boolean: "deep learning AND neural networks" (arXiv supports this)
- Authors: "John Smith", "Smith J"
Returns:
- Array of paper objects with metadata (id, title, authors, date, pdf_url)
- Text field: Empty string (text: ""); use fetch_content for full text
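The search_papers parameters and their documented defaults can be expressed as TypeScript types. The sketch below fills in the defaults and enforces the 1500-character query limit on the client side. The names SearchArgs and withSearchDefaults are assumptions for illustration, and clamping count to 200 is assumed rather than confirmed server behavior.

```typescript
// Hypothetical types and helper mirroring the documented search_papers parameters.
type SearchSource = "arxiv" | "openalex" | "europepmc" | "core";
type SearchField = "all" | "title" | "abstract" | "author" | "fulltext";
type SortBy = "relevance" | "date" | "citations";

interface SearchArgs {
  source: SearchSource;
  query: string;
  field?: SearchField;
  count?: number;
  sortBy?: SortBy;
}

const MAX_QUERY_LENGTH = 1500; // documented limit

// Fill in documented defaults and reject empty or oversized queries.
function withSearchDefaults(args: SearchArgs): Required<SearchArgs> {
  const query = args.query.trim();
  if (query.length === 0 || query.length > MAX_QUERY_LENGTH) {
    throw new RangeError(`query must be 1 to ${MAX_QUERY_LENGTH} characters`);
  }
  return {
    source: args.source,
    query,
    field: args.field ?? "all",
    count: Math.min(args.count ?? 50, 200),
    sortBy: args.sortBy ?? "relevance",
  };
}
```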
fetch_content
Fetches full metadata and the complete extracted text for a specific paper by ID.
Parameters:
- source: Any of the 6 supported sources
- id: Paper ID (format varies by source)
ID Formats by Source:
- arXiv: "2401.12345", "cs/0601001", "1234.5678v2"
- OpenAlex: "W2741809807" or numeric 2741809807
- PMC: "PMC8245678" or "12345678"
- Europe PMC: "PMC8245678", "12345678", or DOI
- bioRxiv/medRxiv: "10.1101/2021.01.01.425001" or "2021.01.01.425001"
- CORE: Numeric ID like "12345678"
📄 Paper Metadata Format
The paper-returning tools (fetch_latest, fetch_top_cited, search_papers, and fetch_content) return paper objects with the following structure:
{
id: string; // Paper ID
title: string; // Paper title
authors: string[]; // List of author names
date: string; // Publication date (ISO format)
pdf_url?: string; // PDF URL (if available)
text: string; // Extracted full text (empty string from metadata-only tools such as fetch_latest and search_papers)
textTruncated?: boolean; // Warning: text was truncated due to size limits
textExtractionFailed?: boolean; // Warning: text extraction failed
}
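A consumer of these objects usually needs to distinguish between full text, truncated text, failed extraction, and metadata-only results. The sketch below shows one way to do that using only the documented fields. The Paper interface repeats the structure above, and describeText is a hypothetical helper.

```typescript
// Paper mirrors the documented structure; describeText is an illustrative helper.
interface Paper {
  id: string;
  title: string;
  authors: string[];
  date: string;
  pdf_url?: string;
  text: string;
  textTruncated?: boolean;
  textExtractionFailed?: boolean;
}

// Summarize what kind of text, if any, a paper object carries.
function describeText(paper: Paper): string {
  if (paper.textExtractionFailed) {
    return "extraction failed; only metadata is available";
  }
  if (paper.text.length === 0) {
    return "no text returned; call fetch_content for full text";
  }
  if (paper.textTruncated) {
    return `truncated text (${paper.text.length} characters)`;
  }
  return `full text (${paper.text.length} characters)`;
}
```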
🧠 Advanced Text Extraction
Multi-Source Strategy
Each source has specialized text extraction approaches:
- arXiv: HTML from arxiv.org/html with ar5iv.labs.arxiv.org fallback
- OpenAlex: HTML sources with DOI resolver fallback chain
- PMC: E-utilities API with XML/HTML extraction
- Europe PMC: REST API with multiple URL strategies
- bioRxiv/medRxiv: Direct HTML extraction with abstract fallback
- CORE: PDF/HTML with source URL fallback
DOI Resolution Chain
The DOI resolver tries each provider in order, falling back to the next when the previous one does not yield a result:
- Unpaywall → Free full-text sources
- Crossref → Publisher metadata and links
- Semantic Scholar Academic Graph → Alternative access
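The general shape of such a fallback chain can be sketched in TypeScript as follows. This is not the server's actual resolver: DoiResolver and resolveWithFallback are hypothetical names, and each provider is represented by a caller-supplied function rather than a real API call.

```typescript
// Hypothetical sketch of an ordered fallback chain; not the server's implementation.
interface DoiResolver {
  name: string;
  // Returns a full-text URL, or null when the provider has nothing for this DOI.
  resolve: (doi: string) => Promise<string | null>;
}

// Try each resolver in order and return the first hit, or null if all miss.
async function resolveWithFallback(
  doi: string,
  resolvers: DoiResolver[],
): Promise<{ provider: string; url: string } | null> {
  for (const resolver of resolvers) {
    try {
      const url = await resolver.resolve(doi);
      if (url) return { provider: resolver.name, url };
    } catch (error) {
      // A failing provider should not abort the chain; move on to the next one.
    }
  }
  return null;
}
```

Passing the resolvers as an ordered array keeps the priority explicit: Unpaywall first, then Crossref, then Semantic Scholar.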
Performance & Reliability
- Text Extraction Success: >90% for HTML-available papers
- Graceful Degradation: Always returns metadata even if text extraction fails
- Size Management: 6MB text limit with intelligent truncation
- Caching: 24-hour LRU cache for DOI resolution
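To illustrate the caching bullet, here is a minimal LRU cache with a time-to-live, built on the insertion order of a Map. It is a sketch under assumptions, not the server's cache: the class name TtlLruCache, the capacity of 1000 entries, and the injectable clock are all illustrative.

```typescript
// Hypothetical LRU cache with per-entry expiry; not the server's implementation.
class TtlLruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in a Map is the least recently used one.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// 24-hour TTL as documented; the capacity of 1000 is an arbitrary assumption.
const doiCache = new TtlLruCache<string>(1000, 24 * 60 * 60 * 1000);
```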
🔄 Rate Limiting
Respectful API usage with per-source rate limiting:
- arXiv: 5 requests per minute
- OpenAlex: 10 requests per minute
- PMC: 3 requests per second
- Europe PMC: 10 requests per minute
- bioRxiv/medRxiv: 5 requests per minute
- CORE: 10 requests per minute (public), higher with API key
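Limits like these are commonly enforced with a token bucket. The following is a minimal sketch under assumptions, not the server's code: the TokenBucket class, its constructor parameters, and the choice to make the burst capacity equal to the per-interval limit are all illustrative.

```typescript
// Hypothetical token bucket; the burst capacity equals the per-interval limit.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private capacity: number;
  private intervalMs: number;
  private now: () => number;

  constructor(capacity: number, intervalMs: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.intervalMs = intervalMs;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefill = now();
  }

  // Returns true and consumes a token if a request may proceed right now.
  tryAcquire(): boolean {
    const current = this.now();
    const refill = ((current - this.lastRefill) * this.capacity) / this.intervalMs;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Buckets matching two of the documented limits.
const arxivBucket = new TokenBucket(5, 60000); // 5 requests per minute
const pmcBucket = new TokenBucket(3, 1000); // 3 requests per second
```

A caller would check tryAcquire() before each request and wait or queue when it returns false.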
CORE API Configuration
For enhanced CORE access, set environment variable:
export CORE_API_KEY="your-api-key"
🧪 Testing
Run Test Suite
# Run all tests
npm test
# Run integration tests
npm run test -- tests/integration
# Run end-to-end workflow tests
npm run test -- tests/e2e
# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts
Test Coverage
- Integration Tests: All 6 sources tested end-to-end
- Performance Tests: Response time and throughput benchmarks
- Workflow Tests: Real research scenarios across multiple sources
- Unit Tests: Core components and edge cases
🏗 Architecture
Modular Driver System
- Clean separation between sources
- Consistent interface across all drivers
- Specialized text extraction per source
Advanced Features
- DOI Resolution: Multi-provider fallback chain
- Rate Limiting: Token bucket algorithm per source
- Text Processing: HTML cleaning and normalization
- Error Handling: Structured responses with actionable suggestions
- Caching: Intelligent caching for DOI resolution
Technology Stack
- TypeScript + ESM: Modern JavaScript with full type safety
- Modular Design: Clean separation of concerns
- Graceful Degradation: Always functional even with partial failures
- Response Size Management: Automatic truncation and warnings
📊 Source Comparison
Source | Papers | Disciplines | Full-Text | Citation Data | Preprints | Search |
---|---|---|---|---|---|---|
arXiv | 2.3M+ | STEM | HTML ✓ | Limited | ✓ | ✓✓✓ |
OpenAlex | 200M+ | All | Variable | ✓✓✓ | ✓ | ✓✓✓ |
PMC | 7M+ | Biomedical | XML/HTML ✓ | Limited | ✗ | Limited |
Europe PMC | 40M+ | Life Sciences | HTML ✓ | Limited | ✓ | ✓✓✓ |
bioRxiv/medRxiv | 500K+ | Bio/Medical | HTML ✓ | Limited | ✓✓✓ | Limited |
CORE | 200M+ | All | PDF/HTML ✓ | Limited | ✓ | ✓✓✓ |
🔧 Development
Build
npm run build
Test Individual Sources
# Test specific sources
node dist/cli.js list-categories --source=arxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=3
node dist/cli.js fetch-content --source=core --id=12345678
# Test search functionality
node dist/cli.js search-papers --source=arxiv --query="artificial intelligence" --count=5
node dist/cli.js search-papers --source=openalex --query="quantum computing" --field=title --count=3
Performance Testing
# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts
# Run tests with the verbose reporter
npm run test -- --reporter=verbose
🚨 Error Handling
Comprehensive error handling for all sources:
- Invalid paper IDs with format suggestions
- Rate limiting with retry-after information
- API timeouts and server errors
- Missing authentication (CORE API key)
- Network connectivity issues
- Text extraction failures with fallback strategies
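As an illustration of the first item, an invalid-ID check with a format suggestion could look like the sketch below. The patterns are assumptions derived only from the example IDs in this README, so real identifiers may be rejected by them. ID_RULES and checkPaperId are hypothetical names.

```typescript
// Hypothetical ID checks; patterns cover only the example formats in this README.
type Source = "arxiv" | "openalex" | "pmc" | "europepmc" | "biorxiv" | "core";

const ID_RULES: Record<Source, { pattern: RegExp; examples: string[] }> = {
  arxiv: {
    pattern: /^(\d{4}\.\d{4,5}|[a-z-]+(\.[A-Z]{2})?\/\d{7})(v\d+)?$/,
    examples: ["2401.12345", "cs/0601001", "1234.5678v2"],
  },
  openalex: { pattern: /^W?\d+$/, examples: ["W2741809807", "2741809807"] },
  pmc: { pattern: /^(PMC)?\d+$/, examples: ["PMC8245678", "12345678"] },
  europepmc: {
    pattern: /^((PMC)?\d+|10\.\d{4,9}\/\S+)$/,
    examples: ["PMC8245678", "12345678", "a DOI"],
  },
  biorxiv: {
    pattern: /^(10\.1101\/)?\d{4}\.\d{2}\.\d{2}\.\d{6}$/,
    examples: ["10.1101/2021.01.01.425001", "2021.01.01.425001"],
  },
  core: { pattern: /^\d+$/, examples: ["12345678"] },
};

type IdCheck = { ok: true } | { ok: false; suggestion: string };

// Validate an ID and, on failure, suggest the expected formats for that source.
function checkPaperId(source: Source, id: string): IdCheck {
  const rule = ID_RULES[source];
  if (rule.pattern.test(id)) return { ok: true };
  return {
    ok: false,
    suggestion: `Expected a ${source} ID such as ${rule.examples.join(", ")}`,
  };
}
```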
🔍 Troubleshooting
Common Issues
- Rate limiting: Automatic retry with exponential backoff
- Missing papers: Try alternative sources for the same content
- Text extraction failures: Fallback to abstract or metadata
- CORE API limits: Set the CORE_API_KEY environment variable
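For callers that add their own retry loop around rate-limited requests, exponential backoff can be computed as in the sketch below. The base delay of 1 second, the cap of 30 seconds, and the retryDelayMs name are assumptions; the server's actual retry parameters are not documented here.

```typescript
// Hypothetical backoff helper; base delay and cap are assumed values.
function retryDelayMs(
  attempt: number, // 0 for the first retry
  retryAfterSeconds?: number, // retry-after hint from the error, if present
  baseMs = 1000,
  maxMs = 30000,
): number {
  // Prefer an explicit retry-after hint over the computed delay.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  // Double the delay on each attempt, up to the cap.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

With these defaults, successive retries wait 1, 2, 4, and 8 seconds, and the delay never exceeds 30 seconds.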
Performance Optimization
- Use appropriate count parameters (smaller for faster responses)
- Cache results when possible
- Use fetch_latest for discovery, fetch_content for detailed reading
📝 License
MIT
Ready to explore the world's scientific knowledge? Start with any of the 6 sources and discover papers across all academic disciplines! 🔬📚