# Crawl-MCP: Unofficial MCP Server for crawl4ai
> **⚠️ Important**: This is an **unofficial** MCP server implementation for the excellent [crawl4ai](https://github.com/unclecode/crawl4ai) library.
> **Not affiliated** with the original crawl4ai project.
A Model Context Protocol (MCP) server that wraps the crawl4ai library and adds AI-assisted processing. It extracts and analyzes content from many kinds of sources: web pages, PDFs, Office documents, YouTube videos, and more. Built-in summarization reduces token usage while preserving key information.
## 🌟 Key Features
- **🔍 Google Search Integration**: 7 optimized search genres using official Google search operators
- **🔍 Advanced Web Crawling**: JavaScript support, deep site mapping, entity extraction
- **🌐 Universal Content Extraction**: Web pages, PDFs, Word docs, Excel, PowerPoint, ZIP archives
- **🤖 AI-Powered Summarization**: Smart token reduction (up to 88.5%) while preserving essential information
- **🎬 YouTube Integration**: Extract video transcripts and summaries without API keys
- **⚡ Production Ready**: 13 specialized tools with comprehensive error handling
## 🚀 Quick Start
### Prerequisites (Required First)
- Python 3.11 or later (FastMCP requires Python 3.11+)
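
A quick way to confirm the version requirement above is to ask the interpreter directly. This is a minimal sketch: the helper name `meets_minimum` is hypothetical, and it only checks the interpreter it runs under, which may differ from the one `uvx` selects.

```python
import sys

# Minimum version taken from the prerequisite above.
MINIMUM = (3, 11)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old for FastMCP"
    print(f"Python {current}: {status}")
```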
**Install system dependencies for Playwright:**
**Ubuntu 24.04 LTS (Manual Required):**
```bash
# Manual setup required due to t64 library transition
sudo apt update && sudo apt install -y \
libnss3 libatk-bridge2.0-0 libxss1 libasound2t64 \
libgbm1 libgtk-3-0t64 libxshmfence-dev libxrandr2 \
libxcomposite1 libxcursor1 libxdamage1 libxi6 \
fonts-noto-color-emoji fonts-unifont python3-venv python3-pip
python3 -m venv venv && source venv/bin/activate
pip install playwright==1.55.0 && playwright install chromium
# sudo resets PATH, so pass the venv's playwright binary explicitly
sudo "$(command -v playwright)" install-deps
```
**Other Linux/macOS:**
```bash
sudo bash scripts/prepare_for_uvx_playwright.sh
```
**Windows (as Administrator):**
```powershell
scripts/prepare_for_uvx_playwright.ps1
```
### Installation
**UVX (Recommended - Easiest):**
```bash
# After system preparation above - that's it!
uvx --from git+https://github.com/walksoda/crawl-mcp crawl-mcp
```
**Docker (Production-Ready):**
```bash
# Clone the repository
git clone https://github.com/walksoda/crawl-mcp
cd crawl-mcp
# Build and run with Docker Compose (STDIO mode)
docker-compose up --build
# Or build and run HTTP mode on port 8000
docker-compose --profile http up --build crawl4ai-mcp-http
# Or build manually
docker build -t crawl4ai-mcp .
docker run -it crawl4ai-mcp
```
**Docker Features:**
- 🔧 **Multi-Browser Support**: Chromium, Firefox, Webkit headless browsers
- 🐧 **Google Chrome**: Additional Chrome Stable for compatibility
- ⚡ **Optimized Performance**: Pre-configured browser flags for Docker
- 🔒 **Security**: Non-root user execution
- 📦 **Complete Dependencies**: All required libraries included
### Claude Desktop Setup
**UVX Installation:**
Add to your `claude_desktop_config.json`:
```json
{
"mcpServers": {
"crawl-mcp": {
"transport": "stdio",
"command": "uvx",
"args": [
"--from",
"git+https://github.com/walksoda/crawl-mcp",
"crawl-mcp"
],
"env": {
"CRAWL4AI_LANG": "en"
}
}
}
}
```
**Docker HTTP Mode:**
```json
{
"mcpServers": {
"crawl-mcp": {
"transport": "http",
"baseUrl": "http://localhost:8000"
}
}
}
```
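
Before pointing a client at HTTP mode, it can help to confirm that something is listening on the configured port. The sketch below uses only the standard library; the helper name `is_port_open` is hypothetical, and it checks TCP reachability only, not that the listener actually speaks MCP.

```python
import socket


def is_port_open(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"localhost:8000 is {state}")
```

If this reports "not reachable", check that the container started with the `http` profile and that port 8000 is published to the host.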
**For Japanese interface:**
```json
"env": {
"CRAWL4AI_LANG": "ja"
}
```
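
The `env` fragment above belongs inside the `crawl-mcp` server entry. As a sketch, the full UVX entry can be generated for either language. The function name `build_config` is hypothetical, and restricting `CRAWL4AI_LANG` to `en` and `ja` is an assumption based on the two values shown in this README.

```python
import json

REPO = "git+https://github.com/walksoda/crawl-mcp"

# Assumption: only the two values shown in this README are accepted.
KNOWN_LANGS = ("en", "ja")


def build_config(lang="en"):
    """Build the `mcpServers` block for the UVX setup with the given CRAWL4AI_LANG."""
    if lang not in KNOWN_LANGS:
        raise ValueError(f"unsupported language {lang!r}; expected one of {KNOWN_LANGS}")
    return {
        "mcpServers": {
            "crawl-mcp": {
                "transport": "stdio",
                "command": "uvx",
                "args": ["--from", REPO, "crawl-mcp"],
                "env": {"CRAWL4AI_LANG": lang},
            }
        }
    }


if __name__ == "__main__":
    # Print a Japanese-interface config ready to paste into claude_desktop_config.json.
    print(json.dumps(build_config("ja"), indent=2))
```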
## 📖 Documentation
| Topic | Description |
|-------|-------------|
| **[Installation Guide](docs/INSTALLATION.md)** | Complete installation instructions for all platforms |
| **[API Reference](docs/API_REFERENCE.md)** | Full tool documentation and usage examples |
| **[Configuration Examples](docs/CONFIGURATION_EXAMPLES.md)** | Platform-specific setup configurations |
| **[HTTP Integration](docs/HTTP_INTEGRATION.md)** | HTTP API access and integration methods |
| **[Advanced Usage](docs/ADVANCED_USAGE.md)** | Power user techniques and workflows |
| **[Development Guide](docs/DEVELOPMENT.md)** | Contributing and development setup |
### Language-Specific Documentation
- **English**: [docs/](docs/) directory
- **日本語**: [docs/ja/](docs/ja/) directory
## 🛠️ Tool Overview
### Web Crawling
- `crawl_url` - Single page crawling with JavaScript support
- `deep_crawl_site` - Multi-page site mapping and exploration
- `crawl_url_with_fallback` - Robust crawling with retry strategies
- `batch_crawl` - Process multiple URLs (max 5)
- `multi_url_crawl` - Advanced multi-URL configuration
### Search Integration
- `search_google` - Genre-filtered Google search
- `search_and_crawl` - Combined search and content extraction
- `batch_search_google` - Multiple search queries (max 3)
### Data Extraction
- `extract_structured_data` - CSS/XPath/LLM-based structured extraction
### Media Processing
- `process_file` - PDF, Office, ZIP to markdown conversion
- `extract_youtube_transcript` - Video transcript extraction
- `batch_extract_youtube_transcripts` - Multiple videos (max 3)
- `get_youtube_video_info` - Video metadata retrieval
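
Several tools above cap the number of items per call (5 URLs for `batch_crawl`, 3 for `batch_search_google` and `batch_extract_youtube_transcripts`). A client with a longer list can split it into groups first. The helper `chunked` below is a hypothetical client-side sketch, not part of this server.

```python
from typing import Iterator, List, Sequence

# Per-call limits as listed in the tool overview above.
BATCH_CRAWL_MAX = 5
BATCH_SEARCH_MAX = 3


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    urls = [f"https://example.com/page/{n}" for n in range(12)]
    for batch in chunked(urls, BATCH_CRAWL_MAX):
        # Each batch would be sent as one `batch_crawl` call.
        print(len(batch), batch[0])
```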
## 🎯 Common Use Cases
**Content Research:**
```text
search_and_crawl → extract_structured_data → analysis
```
**Documentation Mining:**
```text
deep_crawl_site → batch processing → extraction
```
**Media Analysis:**
```text
extract_youtube_transcript → summarization workflow
```
**Site Mapping:**
```text
batch_crawl → multi_url_crawl → comprehensive data
```
## 🚨 Quick Troubleshooting
**Installation Issues:**
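1. Re-run the setup scripts with the required privileges (`sudo` on Linux/macOS, Administrator on Windows)
2. Try the development installation method described in the [Installation Guide](docs/INSTALLATION.md)
3. Check that the browser dependencies are installed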
**Performance Issues:**
- Use `wait_for_js: true` for JavaScript-heavy sites
- Increase timeout for slow-loading pages
- Use `extract_structured_data` for targeted extraction
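
For clients that talk to the server directly, the first two tips above translate into arguments on an MCP `tools/call` request. The sketch below builds such a request. `wait_for_js` comes from this README, while the `url` and `timeout` argument names and the value 60 are assumptions, so check the [API Reference](docs/API_REFERENCE.md) for the actual parameters.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "crawl_url",
    {
        "url": "https://example.com",  # assumed argument name
        "wait_for_js": True,           # named in the troubleshooting tips
        "timeout": 60,                 # assumed argument name and unit (seconds)
    },
)

if __name__ == "__main__":
    print(json.dumps(request, indent=2))
```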
**Configuration Issues:**
- Check JSON syntax in `claude_desktop_config.json`
- Verify file paths are absolute
- Restart Claude Desktop after configuration changes
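
The JSON syntax check can be automated with the standard library. The sketch below parses the config text and reports where parsing failed, or whether the `crawl-mcp` entry is missing. The function name `check_config` and the specific checks are illustrative assumptions, not rules enforced by Claude Desktop.

```python
import json


def check_config(text, server_name="crawl-mcp"):
    """Return a list of problems found in the config text (empty if none)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]

    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    servers = config.get("mcpServers")
    if not isinstance(servers, dict):
        return ['missing "mcpServers" object']
    entry = servers.get(server_name)
    if not isinstance(entry, dict):
        return [f'no "{server_name}" entry under "mcpServers"']

    problems = []
    # The examples in this README use either a command (stdio) or a baseUrl (HTTP).
    if "command" not in entry and "baseUrl" not in entry:
        problems.append(f'"{server_name}" has neither "command" nor "baseUrl"')
    return problems
```

To use it, read the file with `pathlib.Path(path).read_text(encoding="utf-8")` and pass the result to `check_config`; an empty list means none of these checks failed.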
## 🏗️ Project Structure
- **Original Library**: [crawl4ai](https://github.com/unclecode/crawl4ai) by unclecode
- **MCP Wrapper**: This repository (walksoda)
- **Implementation**: Unofficial third-party integration
## 📄 License
This project is an unofficial wrapper around the crawl4ai library. Please refer to the original [crawl4ai license](https://github.com/unclecode/crawl4ai) for the underlying functionality.
## 🤝 Contributing
See our [Development Guide](docs/DEVELOPMENT.md) for contribution guidelines and development setup instructions.
## 🔗 Related Projects
- [crawl4ai](https://github.com/unclecode/crawl4ai) - The underlying web crawling library
- [Model Context Protocol](https://modelcontextprotocol.io/) - The standard this server implements
- [Claude Desktop](https://docs.anthropic.com/claude/docs/claude-desktop) - Primary client for MCP servers