# Crawl4AI MCP Server
[![smithery badge](https://smithery.ai/badge/@weidwonder/crawl4ai-mcp-server)](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server)
This is an intelligent information retrieval server based on MCP (Model Context Protocol) that provides powerful search capabilities and LLM-optimized webpage content understanding for AI assistant systems. Through multi-engine search and intelligent content extraction, it helps AI systems efficiently acquire and understand internet information, converting webpage content into formats best suited for LLM processing.
## Features
- 🔍 Powerful multi-engine search capability, supporting DuckDuckGo and Google
- 📚 LLM-optimized webpage content extraction, intelligently filtering non-core content
- 🎯 Focus on information value, automatically identifying and preserving key content
- 📝 Multiple output formats, supporting citation tracing
- 🚀 High-performance asynchronous design based on FastMCP
## Installation
### Method 1: Most Common Installation Scenario
1. Ensure your system meets the following requirements:
- Python >= 3.9
- Recommended to use a dedicated virtual environment
2. Clone the repository:
```bash
git clone https://github.com/weidwonder/crawl4ai-mcp-server.git
cd crawl4ai-mcp-server
```
3. Create and activate virtual environment:
```bash
python -m venv crawl4ai_env
source crawl4ai_env/bin/activate # Linux/Mac
# or
.\crawl4ai_env\Scripts\activate # Windows
```
4. Install dependencies:
```bash
pip install -r requirements.txt
```
5. Install playwright browsers:
```bash
playwright install
```
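Once installed, you can register the server manually with an MCP client such as Claude Desktop by adding an entry to `claude_desktop_config.json` (Method 2 below automates this). A minimal sketch; the paths are placeholders, and using `src/index.py` as the entry point is inferred from the project structure shown later in this README:
```json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/path/to/crawl4ai-mcp-server/crawl4ai_env/bin/python",
      "args": ["/path/to/crawl4ai-mcp-server/src/index.py"]
    }
  }
}
```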
### Method 2: Install to Claude Desktop Client via Smithery
Install the Crawl4AI MCP server and automatically configure it for the Claude Desktop client via [Smithery](https://smithery.ai/server/@weidwonder/crawl4ai-mcp-server):
```bash
npx -y @smithery/cli install @weidwonder/crawl4ai-mcp-server --client claude
```
## Usage
The server provides the following tools:
### search
Powerful web search tool supporting multiple search engines:
- DuckDuckGo search (default): no API key required; comprehensive handling of the AbstractText, Results, and RelatedTopics response fields (see the sketch at the end of this section)
- Google search: Requires API key configuration, provides precise search results
- Supports using multiple engines simultaneously for more comprehensive results
Parameters:
- `query`: Search query string
- `num_results`: Number of results to return (default: 10)
- `engine`: Search engine selection
  - `"duckduckgo"`: DuckDuckGo search (default)
  - `"google"`: Google search (requires API key)
  - `"all"`: Use all available search engines simultaneously
Example:
```python
# DuckDuckGo search (default)
{
"query": "python programming",
"num_results": 5
}
# Use all available engines
{
"query": "python programming",
"num_results": 5,
"engine": "all"
}
```
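For context, `AbstractText`, `Results`, and `RelatedTopics` are fields of DuckDuckGo's Instant Answer API response. A minimal sketch of how such a response can be flattened into search results, assuming `httpx` as the HTTP client; this illustrates the fields, not necessarily how `search.py` implements it:
```python
import httpx

async def duckduckgo_instant_answers(query: str, num_results: int = 10) -> list[dict]:
    """Query DuckDuckGo's Instant Answer API and flatten the response."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.duckduckgo.com/",
            params={"q": query, "format": "json", "no_html": 1},
        )
        data = resp.json()

    results = []
    # AbstractText, when present, is the most direct answer.
    if data.get("AbstractText"):
        results.append({
            "title": data.get("Heading", ""),
            "snippet": data["AbstractText"],
            "url": data.get("AbstractURL", ""),
        })
    # Results holds direct external links.
    for item in data.get("Results", []):
        results.append({"title": item.get("Text", ""), "url": item.get("FirstURL", "")})
    # RelatedTopics entries are either results or categories nesting a "Topics" list.
    for topic in data.get("RelatedTopics", []):
        for item in topic.get("Topics", [topic]):
            if "FirstURL" in item:
                results.append({"title": item.get("Text", ""), "url": item["FirstURL"]})
    return results[:num_results]
```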
### read_url
LLM-optimized webpage content understanding tool, providing intelligent content extraction and format conversion.
Parameters:
- `url`: URL of the webpage to read
- `format`: Output format (default: `markdown_with_citations`)
  - `markdown_with_citations`: Markdown with inline citations (default), maintaining information traceability
  - `fit_markdown`: LLM-optimized, streamlined content with redundant information removed
  - `raw_markdown`: Basic HTML→Markdown conversion
  - `references_markdown`: Separate citations/references section
  - `fit_html`: The filtered HTML from which `fit_markdown` is generated
  - `markdown`: Standard Markdown output
Example:
```python
{
"url": "https://example.com",
"format": "markdown_with_citations"
}
```
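Both tools can be invoked from any MCP client. A minimal sketch using the official `mcp` Python SDK over stdio, assuming the server is launched with `python src/index.py` (entry point per the project structure below):
```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server as a subprocess and talk to it over stdio.
    server = StdioServerParameters(command="python", args=["src/index.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Call the search tool.
            search_result = await session.call_tool(
                "search", {"query": "python programming", "num_results": 5}
            )
            print(search_result.content)

            # Call the read_url tool.
            page = await session.call_tool(
                "read_url",
                {"url": "https://example.com", "format": "markdown_with_citations"},
            )
            print(page.content)

asyncio.run(main())
```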
## LLM Content Optimization
The server employs a series of content optimization strategies tailored for LLMs:
- Intelligent Content Recognition: automatically identifies and preserves the article body and key information paragraphs
- Noise Filtering: automatically filters out navigation bars, advertisements, footers, and other content that does not aid understanding
- Information Integrity: preserves URL references, supporting information traceability
- Length Optimization: applies a minimum word-count threshold (10) to filter out low-value segments
- Format Optimization: outputs `markdown_with_citations` by default, convenient for LLM understanding and citation
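A minimal sketch of how these strategies map onto crawl4ai's API; apart from the word-count threshold of 10 noted above, the options shown are assumptions, not the server's actual configuration:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def read_page(url: str) -> str:
    # Prune boilerplate (nav bars, ads, footers) and drop segments
    # shorter than 10 words.
    content_filter = PruningContentFilter(min_word_threshold=10)
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=content_filter)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
    # The markdown result exposes the formats offered by read_url:
    # raw_markdown, markdown_with_citations, references_markdown,
    # fit_markdown, and fit_html. (On older crawl4ai versions the
    # structured result lives on result.markdown_v2 instead.)
    return result.markdown.markdown_with_citations

print(asyncio.run(read_page("https://example.com")))
```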
## Development Notes
Project structure:
```
crawl4ai_mcp_server/
├── src/
│ ├── index.py # Server main implementation
│ └── search.py # Search functionality implementation
├── config_demo.json # Configuration file example
├── pyproject.toml # Project configuration
├── requirements.txt # Dependency list
└── README.md # Project documentation
```
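Since the server is built on FastMCP (see Features), `src/index.py` presumably registers its tools with the standard decorator pattern. A minimal sketch with stubbed bodies, not the actual implementation:
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crawl4ai-mcp-server")

@mcp.tool()
async def search(query: str, num_results: int = 10, engine: str = "duckduckgo") -> str:
    """Search the web; the real server dispatches to the engines in src/search.py."""
    return f"(stub) {num_results} {engine} results for {query!r}"

@mcp.tool()
async def read_url(url: str, format: str = "markdown_with_citations") -> str:
    """Fetch a page with crawl4ai; the real server returns the requested format."""
    return f"(stub) {format} content of {url}"

if __name__ == "__main__":
    mcp.run()  # FastMCP serves over stdio by default
```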
## Configuration
1. Copy the configuration example file:
```bash
cp config_demo.json config.json
```
2. To use Google search, configure API keys in config.json:
```json
{
"google": {
"api_key": "your-google-api-key",
"cse_id": "your-google-cse-id"
}
}
```
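Here `api_key` is a Google API key and `cse_id` is a Programmable Search Engine ID. A minimal sketch of how they are typically used against the Custom Search JSON API, assuming `httpx` and a `config.json` in the working directory; not necessarily how `search.py` does it:
```python
import json

import httpx

def google_search(query: str, num_results: int = 10) -> list[dict]:
    """Query the Google Custom Search JSON API using keys from config.json."""
    with open("config.json") as f:
        cfg = json.load(f)["google"]
    resp = httpx.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": cfg["api_key"],
            "cx": cfg["cse_id"],
            "q": query,
            "num": min(num_results, 10),  # the API caps num at 10 per request
        },
    )
    resp.raise_for_status()
    return [
        {"title": item["title"], "url": item["link"], "snippet": item.get("snippet", "")}
        for item in resp.json().get("items", [])
    ]
```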
## Changelog
- 2025.02.08: Added search functionality, supporting DuckDuckGo (default) and Google search
- 2025.02.07: Refactored project structure, implemented using FastMCP, optimized dependency management
- 2025.02.07: Optimized content filtering configuration, improved token efficiency while maintaining URL integrity
## License
MIT License
## Contributing
Issues and Pull Requests are welcome!
## Author
- Owner: weidwonder
- Coder: Claude 3.5 Sonnet
- 100% of the code was written by Claude. Cost: $9 ($2 for writing code, $7 for debugging 😭)
- Total time: 3 hours (0.5 hours writing code, 0.5 hours preparing the environment, 2 hours debugging 😭)
## Acknowledgments
Thanks to all developers who contributed to the project!
Special thanks to:
- The [Crawl4ai](https://github.com/crawl4ai/crawl4ai) project, for providing excellent webpage content extraction technology