# PullMD
Self-hosted URL-to-Markdown service for humans and AI agents.
<p align="center">
<img src="assets/screenshot.png" alt="PullMD web interface" width="800">
</p>
PullMD takes any web URL and returns clean, readable Markdown — no
navigation, no ads, no boilerplate. It auto-detects Reddit threads
(with full comment trees), uses Cloudflare's native Markdown when
available, runs Mozilla Readability + Trafilatura on static HTML,
and as a last resort renders JavaScript-heavy pages via headless
Chromium (Playwright sidecar) before extracting.
It ships as:
- a **PWA frontend** with dark/paper themes, history, archive, share links
- a **REST API** at `GET /api?url=…`
- an **MCP server** at `POST /mcp` (Streamable-HTTP transport, stateless)
- a **Claude Code skill** as a downloadable zip
Every conversion gets an 8-hex **share id** that doubles as a stable
live endpoint: `GET /s/:id` returns the cached Markdown and
re-fetches from the source if it is older than one hour. Use the
share id as a fixed URL that always returns fresh content — useful
for subreddit feeds and similar sources.
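Share ids are eight hex characters. For clients that store ids and rebuild `/s/:id` URLs later, a quick sanity check avoids malformed requests — a minimal sketch (the helper name and the lowercase-hex assumption are ours, not part of PullMD):

```python
import re

# 8 lowercase hex characters, e.g. "3fa9c01b" (alphabet is an assumption)
SHARE_ID = re.compile(r"^[0-9a-f]{8}$")

def share_url(base: str, share_id: str) -> str:
    """Build a /s/:id live-endpoint URL, rejecting malformed ids."""
    if not SHARE_ID.match(share_id):
        raise ValueError(f"not an 8-hex share id: {share_id!r}")
    return f"{base.rstrip('/')}/s/{share_id}"

print(share_url("https://pullmd.example.com", "3fa9c01b"))
# → https://pullmd.example.com/s/3fa9c01b
```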
---
## Quick start
Pre-built multi-arch images (`linux/amd64`, `linux/arm64`) live on Docker
Hub. Drop the compose file somewhere and run:
```bash
mkdir pullmd && cd pullmd
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.yml
docker compose up -d
# → http://localhost:3000
```
That's it. No `.env` needed: every variable has a sensible default
and PullMD listens on port `3000`. Add a `.env` next to the compose
file to override anything (see [Configuration](#configuration)).
### `docker-compose.yml` (zero-config)
```yaml
services:
  pullmd:
    image: aeternalabshq/pullmd:latest
    container_name: pullmd
    restart: unless-stopped
    ports:
      - "${PORT:-3000}:3000"
    environment:
      - PUBLIC_URL=${PUBLIC_URL:-http://localhost:${PORT:-3000}}
      - TRAFILATURA_URL=http://trafilatura:8001/extract
      - PLAYWRIGHT_URL=http://playwright:8002/render
      - REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID:-}
      - REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET:-}
      - REDDIT_USER_AGENT=${REDDIT_USER_AGENT:-}
    volumes:
      - ./data:/app/data
    networks:
      - pullmd-internal
    depends_on:
      - trafilatura
      - playwright

  trafilatura:
    image: aeternalabshq/pullmd-trafilatura:latest
    container_name: pullmd-trafilatura
    restart: unless-stopped
    networks:
      - pullmd-internal

  playwright:
    image: aeternalabshq/pullmd-playwright:latest
    container_name: pullmd-playwright
    restart: unless-stopped
    networks:
      - pullmd-internal

networks:
  pullmd-internal:
    driver: bridge
```
> **Note:** the Playwright sidecar adds **~3.7 GB** to your image cache
> (Chromium + Firefox + WebKit binaries from the official Playwright
> base image). It's optional: leave `PLAYWRIGHT_URL` unset and remove
> the `playwright` service block, and PullMD silently degrades to
> static extraction, recording the fallback in the metadata.
> **Mirror on GHCR:** `ghcr.io/aeternalabshq/{pullmd,pullmd-trafilatura,pullmd-playwright}`.
> Replace the `image:` lines if you prefer GitHub's registry.
### Behind Traefik
For deployments behind Traefik with TLS, use `docker-compose.traefik.yml`
instead. Same images, but with Traefik labels and the `proxy` external
network. Set `HOST_DOMAIN` in `.env`:
```bash
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.traefik.yml
echo "HOST_DOMAIN=pullmd.example.com" > .env
docker compose -f docker-compose.traefik.yml up -d
```
### Local development (no Docker)
```bash
git clone https://github.com/AeternaLabsHQ/pullmd.git
cd pullmd
npm install
npm start # http://localhost:3000
npm test # node --test
```
---
## Configuration
All variables go in `.env` (copy from `.env.example`):
| Variable | Required | Purpose |
| ---------------------- | -------- | ---------------------------------------------------------------------------------------------------- |
| `HOST_DOMAIN`           | Traefik only | Public hostname without scheme. Used for Traefik routing and as the fallback for `PUBLIC_URL`. |
| `PUBLIC_URL` | no | Full public origin embedded in `/help` and the skill zip. Defaults to `https://${HOST_DOMAIN}`. |
| `TRAFILATURA_URL` | no | URL of the Trafilatura sidecar's `/extract` endpoint. Unset → skip Trafilatura, Readability only. |
| `PLAYWRIGHT_URL` | no | URL of the Playwright sidecar's `/render` endpoint. Unset → skip Playwright fallback for JS pages. |
| `REDDIT_CLIENT_ID` | no | OAuth credentials for Reddit. Without them, PullMD uses the public JSON API (lower rate limit). |
| `REDDIT_CLIENT_SECRET` | no | |
| `REDDIT_USER_AGENT` | no | Reddit requires a unique UA. Default: `PullMD/1.0 (URL-to-Markdown service)`. |
`PUBLIC_URL` matters for self-hosting: the help page and downloadable
skill embed it as the canonical endpoint. Set it correctly and your
users get a copy-paste setup that points at *your* instance.
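A minimal `.env` for a public instance might look like this (all values are placeholders; set only what you need to override):

```env
HOST_DOMAIN=pullmd.example.com
PUBLIC_URL=https://pullmd.example.com
REDDIT_CLIENT_ID=your-client-id
REDDIT_CLIENT_SECRET=your-client-secret
REDDIT_USER_AGENT=PullMD/1.0 (contact: you@example.com)
```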
---
## AI-agent integration
Three install paths. Once your instance is running, `${PULLMD_URL}/help`
shows the same boxes with your URL pre-filled. Replace `${PULLMD_URL}`
below with your hostname (e.g. `https://pullmd.example.com`).
### 1. Universal prompt
Drop into any chat agent (ChatGPT, Claude, Gemini, …):
```
When you need to read a web page, fetch via PullMD instead of raw HTML:
GET ${PULLMD_URL}/api?url=<URL>
Returns clean Markdown (text/markdown). Optional query params:
comments=false skip Reddit comments
comment_depth=N comment nesting depth (default 3)
frontmatter=true prepend YAML metadata block
format=text strip Markdown, return plain text
nocache=true bypass the 1h cache and refetch
render=force|skip override the auto Playwright fallback
lang=de|en language for the comments section header
Response headers worth checking:
X-Source reddit | cloudflare | readability | playwright
X-Quality 0.0-1.0 extraction confidence
X-Share-Id 8-hex permalink, openable as /s/<id>
Reddit URLs are auto-detected (incl. redd.it short links and /s/ shares).
Use this whenever you would otherwise fetch raw HTML — the markdown is
much cleaner and saves significant context window space.
```
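For scripted clients, the same endpoint is easy to call programmatically. A small Python sketch that builds the request URL from the parameters listed in the prompt above (the base URL is a placeholder for your instance):

```python
from urllib.parse import urlencode

def api_url(base: str, url: str, **params) -> str:
    """Build a PullMD /api request URL; extra kwargs become query params."""
    query = urlencode({"url": url, **params})
    return f"{base.rstrip('/')}/api?{query}"

print(api_url("https://pullmd.example.com",
              "https://old.reddit.com/r/selfhosted",
              frontmatter="true", comment_depth=5))
# → https://pullmd.example.com/api?url=https%3A%2F%2Fold.reddit.com%2Fr%2Fselfhosted&frontmatter=true&comment_depth=5
```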
### 2. Claude Code skill
`web-reader.zip` is auto-built with your URL embedded:
```bash
curl -O ${PULLMD_URL}/web-reader.zip
mkdir -p ~/.claude/skills
unzip web-reader.zip -d ~/.claude/skills/
# Restart Claude Code; the skill activates on web-reading requests.
```
### 3. MCP server
Remote MCP server at `${PULLMD_URL}/mcp` (Streamable-HTTP transport, stateless).
Three tools: `read_url`, `get_share`, `list_recent`. Server-side updates reach
every client automatically — no local install needed.
**Claude Code (CLI):**
```bash
claude mcp add --transport http pullmd ${PULLMD_URL}/mcp
```
**Claude Desktop / Cursor / other MCP hosts — JSON config:**
```json
{
  "mcpServers": {
    "pullmd": {
      "type": "http",
      "url": "${PULLMD_URL}/mcp"
    }
  }
}
```
Once registered, the three tools surface natively in the agent — no prompt
instructions needed; the model discovers them via their schema descriptions.
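Under the hood, each tool call is a JSON-RPC 2.0 request POSTed to `/mcp` over the Streamable-HTTP transport. A sketch of the payload shape (the `url` argument name for `read_url` is an assumption based on the tool's purpose, not a documented schema):

```python
import json

# Standard MCP tools/call envelope; the arguments object is tool-specific.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_url",
        "arguments": {"url": "https://example.com/article"},  # assumed argument name
    },
}
body = json.dumps(payload)
print(body)
```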
---
## API
| Endpoint | Returns |
| ---------------------- | -------------------------------------------------------------------------------- |
| `GET /api?url=…` | Markdown (or JSON / plain text via `format=`). |
| `GET /api/stream?url=…`| Server-Sent Events stream of extraction-stage status, ending in a `result` event. Used by the PWA. |
| `GET /s/:id` | Cached Markdown by share id; refreshes from source if > 1 h old. |
| `GET /api/history` | Recent conversions (JSON). |
| `GET /api/archive` | Paginated full archive. |
| `GET /api/storage` | Cache size / hit-rate stats. |
| `GET /api/stats` | Extraction telemetry (sources, quality, latency). |
| `POST /mcp` | Streamable-HTTP MCP endpoint (3 tools: `read_url`, `get_share`, `list_recent`). |
| `GET /web-reader.zip` | Claude Code skill bundle, with this instance's URL baked in. |
| `GET /help` | Bilingual user/agent setup guide. |
### `/api` parameters
| Param | Default | Notes |
| --------------- | ------- | ---------------------------------------------------------------------------------- |
| `url` | — | Required. |
| `comments` | `true` | Include Reddit comments. Ignored for non-Reddit URLs. |
| `comment_depth` | `3` | Max nesting depth (1–10). |
| `comment_limit` | `15` | Max top-level comments. |
| `frontmatter` | `false` | Prepend YAML metadata. |
| `format` | `md` | `text` strips Markdown; `json` returns structured response. |
| `nocache` | `false` | Bypass the 1-hour cache. |
| `render` | auto | `force` → always render via Playwright. `skip` → never render. Bypasses cache. |
| `lang` | `de` | Comments-section header language (`de` or `en`). |
### Response headers
- `X-Source` — `reddit` · `cloudflare` · `readability` · `readability-fallback` · `trafilatura` · `playwright`
- `X-Quality` — `0.0`–`1.0` extraction confidence
- `X-Share-Id` — the 8-hex permalink id
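Clients can use these headers to decide whether a result is good enough to keep. A sketch of one possible retry policy that reuses the server's own 0.5 quality threshold (reusing that threshold client-side is our assumption; tune it for your use case):

```python
def should_retry_with_render(headers: dict) -> bool:
    """Retry with render=force when static extraction looked weak."""
    source = headers.get("X-Source", "")
    quality = float(headers.get("X-Quality", "1.0"))
    # A Playwright result is already the last resort; don't loop.
    if source == "playwright":
        return False
    return quality < 0.5 or source == "readability-fallback"

print(should_retry_with_render({"X-Source": "readability", "X-Quality": "0.3"}))  # → True
```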
---
## Cache & TTLs
- **`/api?url=…`** re-fetches from source if the cache row is older than **1 hour**.
- **`/s/:id`** does the same on-demand refresh, so share links double as live endpoints.
- Cache rows are pruned **90 days** after the last write. `/s/:id` hits keep the row alive (since they trigger refresh + write); read-only access does not extend the TTL.
- If the source is unreachable on refresh, the last good snapshot is served — share links keep working even when the original URL dies.
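The rules above can be sketched as a pure function over a cache row's timestamps (a Python paraphrase for illustration; field names are ours, not PullMD's actual schema):

```python
from datetime import datetime, timedelta

REFRESH_AFTER = timedelta(hours=1)
PRUNE_AFTER = timedelta(days=90)

def cache_action(now, last_fetched, last_written):
    """Decide what to do with a cache row on access."""
    if now - last_written > PRUNE_AFTER:
        return "prune"    # row expired 90 days after the last write
    if now - last_fetched > REFRESH_AFTER:
        return "refresh"  # try the source; serve the last snapshot on failure
    return "serve"        # fresh enough, serve as-is

now = datetime(2026, 1, 1, 12, 0)
print(cache_action(now, now - timedelta(minutes=90), now - timedelta(days=2)))  # → refresh
```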
---
## Architecture
- `server.js` — Express app factory (`createApp`) with dependency injection for tests. Exposes `/api` and `/api/stream` (SSE).
- `lib/reddit.js` — Reddit URL normalization, redirect resolution, post + comment extraction.
- `lib/web.js` — Orchestrator: Cloudflare-Markdown short-circuit, then static Readability + Trafilatura with `pickBest`, then optional Playwright re-render + re-extract on body-soup / low-quality output.
- `lib/render-decision.js` — Predicate that decides when to fall back to Playwright (readability-fallback with thin content, a body-soup signature, or quality < 0.5; plus `force` / `skip` overrides).
- `lib/playwright-client.js` — HTTP client for the Playwright sidecar with `AbortSignal` propagation for SSE-disconnect cancellation.
- `lib/scoring.js` — Quality scoring used to pick between extractors and as a render-trigger heuristic.
- `lib/cache.js` — SQLite cache (`better-sqlite3`) with 90-day TTL and 8-hex share ids.
- `lib/mcp.js` — Stateless MCP server registering the three tools.
- `lib/distrib.js` — Public-URL substitution in `/help` and `/web-reader.zip`.
- `trafilatura-sidecar/` — Python sidecar (FastAPI) wrapping Trafilatura.
- `playwright-sidecar/` — Python sidecar (FastAPI + Playwright + Chromium) for JS-rendered pages.
- `public/` — PWA frontend (vanilla JS, dark/paper themes, service worker, EventSource client for `/api/stream`).
- `skill/web-reader/` — Claude Code skill source (templated with `__PULLMD_URL__`).
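The render decision described above can be summarized as a predicate like the following (a Python paraphrase of the logic in `lib/render-decision.js` for illustration; parameter names are ours):

```python
def needs_render(source, quality, thin, body_soup, override=None):
    """Decide whether to fall back to a headless-Chromium render."""
    if override == "force":   # render=force query param
        return True
    if override == "skip":    # render=skip query param
        return False
    if source == "readability-fallback" and thin:
        return True
    return body_soup or quality < 0.5

print(needs_render("readability", 0.8, False, True))  # → True (body-soup signature)
```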
---
## License
[GNU AGPL v3](LICENSE) — Copyright © 2026 Aeterna Labs.
PullMD is free software: you can redistribute it and modify it under the
terms of the GNU Affero General Public License as published by the Free
Software Foundation, version 3 or later. If you run a modified version
as a network service, you must make your modifications available to its
users.