Content
# webmcp
`webmcp` is an MCP server for web search and content extraction. LLM agents can use it to:
- search the web with DuckDuckGo (default) or SearXNG (optional)
- fetch and clean page content from one or more URLs
- send cleaned content to a local LLM for structured extraction
## Features
- `search_web(query, limit=10)` returns web results (title, URL, description)
- `extract(urls, prompt=None, schema=None, use_browser=True)` extracts data from pages
- browser-based fetching with Playwright for JavaScript-heavy sites
- lightweight HTTP fetching mode for faster/simple pages
- persistent tool-call logging to `tool_calls.log.json`
- configurable search provider: DDG by default, optional SearXNG
## Critical Requirement
For the main researcher llama.cpp server, include `--webui-mcp-proxy` in launch parameters. Without this flag, this workflow will not function correctly.
## Prompting And Tested Setup
For best results, use `research_prompt.txt` as your system prompt. This prompt is a core part of the intended workflow and quality; it is effectively half of how this repository is meant to function.
Tested setup:
- Main researcher LLM: `Qwen3.5:27b-Q3_K_M.gguf` via llama.cpp on an RTX 4090, context length 200,000, about 40 tok/s.
- Extract tool LLM: `Qwen3.5:9b-Q4_K_M.gguf` via llama.cpp on a GTX 1080 Ti, context length 32,768, about 40 tok/s.
- This workflow has been tested with the llama.cpp WebUI specifically, and has not been validated with other MCP clients yet.
## Requirements
- Python 3.10+
- A local OpenAI-compatible LLM endpoint (for example, llama.cpp, LM Studio, vLLM, ollama, etc)
## Configuration
The app reads LLM settings from environment variables and supports a local `.env` file.
1. Copy `.env.example` to `.env`
2. Set values:
```env
LLM_URL=http://localhost:1234
LLM_MODEL=your-model-name
SEARCH_PROVIDER=ddg
# Optional when SEARCH_PROVIDER=searxng
SEARXNG_URL=http://localhost:8080
```
`LLM_URL` and `LLM_MODEL` are required at startup.
`SEARCH_PROVIDER` defaults to `ddg`. Set it to `searxng` to replace DDG, and provide `SEARXNG_URL`.
## Search Providers
`search_web` supports two providers:
- `ddg` (default): uses DuckDuckGo via `ddgs`
- `searxng`: uses your SearXNG instance
SearXNG notes:
- Set `SEARCH_PROVIDER=searxng`
- Set `SEARXNG_URL` to your instance base URL (for example, `http://192.168.0.55:8888`)
- `webmcp` calls `<SEARXNG_URL>/search` with `format=json`
## Install
Install dependencies from the pinned requirements file:
```bash
pip install -r requirements.txt
python -m playwright install chromium
```
## Run
```bash
python app.py
```
Server starts on:
- `http://0.0.0.0:8642`
## MCP Usage Notes
- `extract(..., use_browser=True)` is best for dynamic pages that require JS rendering.
- `extract(..., use_browser=False)` is faster for static pages.
- If extraction quality is poor, the LLM should provide a more specific `prompt` and/or a stricter `schema`.
## TODO
- Revisit JS page rendering and extraction strategy. Right now, roughly 25-30% of pages return little or no usable content even when fetched successfully.
- Improve anti-bot handling for page fetches. Many targets still return 400-range errors, so investigate stronger browser mimicry (Playwright/Chromium behavior, headers, fingerprinting, and potentially user-agent/profile rotation).
## License
MIT. See `LICENSE`.
Connection Info
You Might Also Like
Filesystem
Node.js MCP Server for filesystem operations with dynamic access control.
Fetch
Retrieve and process content from web pages by converting HTML into markdown format.
context7-mcp
Context7 MCP Server provides natural language access to documentation for...
Context 7
Context7 MCP provides up-to-date code documentation for any prompt.
mempalace
The highest-scoring AI memory system ever benchmarked. And it's free.
chrome-devtools-mcp
Chrome DevTools for coding agents