# Tool List
## Project Introduction
A tool for batch-exporting Zhihu collections (public and private) to Markdown files. It supports custom output paths, cross-platform use, image download, and Obsidian-compatible output. It can also generate article summaries in one click through a large model API.
**An MCP Server is also provided**, so AI Agents (such as Claude Code) can call it directly and give large models the ability to save Zhihu collections.
## Main Features
- 📚 **Batch Processing**: Supports processing multiple collections simultaneously
- 📂 **Custom Output**: Lets users specify the output directory
- 🖥️ **Cross-Platform Compatibility**: Supports Windows, Linux, macOS and other systems
- ✨ **Intelligent Deduplication**: Automatically checks for existing files to avoid duplicate downloads
- 🖼️ **Image Download**: Automatically downloads and saves images in articles
- 📝 **Obsidian Compatible**: Output Markdown format is fully compatible with Obsidian
- 📈 **Real-Time Logs**: Supports real-time log refresh, immediately displaying processing progress
- 🔍 **Automatic Collection Acquisition**: Fetches the user's collection list automatically through a standalone script
- 🛠️ **Robust Error Handling**: A failure on a single article does not interrupt the overall run
- 🔧 **Enhanced Debugging Support**: Automatically generates debugging files to make problems easier to track down
- 🤖 **MCP Support**: Provides MCP Server, which can be called by AI Agents
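
The deduplication idea in the list above can be pictured with a short sketch. It assumes an article counts as already exported when a Markdown file with the same title exists in the collection folder; the helper name `already_exported` is hypothetical, and the project's real check may differ.

```python
from pathlib import Path


def already_exported(collection_dir, title):
    """Return True if '<title>.md' already exists in the collection folder.

    Hypothetical helper: assumes one Markdown file per article, named
    after the article title.
    """
    return (Path(collection_dir) / f"{title}.md").exists()
```

A caller would skip the download whenever `already_exported(...)` returns `True`.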
## Two Operation Modes
### Method 1: Command Line Operation (Traditional Method)
Suitable for directly running export tasks in the terminal.
#### Installation
```bash
pip install -r requirements.txt
```
#### Configuration File Settings
First, create the main configuration file:
```bash
# Copy configuration example file
cp config_examples.json config.json
```
Edit the `config.json` file to configure your collection information:
```json
{
  "zhihuUrls": [
    {
      "name": "Collection Name",
      "url": "https://www.zhihu.com/collection/123456789"
    },
    {
      "name": "Another Collection",
      "url": "https://www.zhihu.com/collection/987654321"
    }
  ],
  "outputPath": "/path/to/your/output/directory",
  "os": "linux"
}
```
The `zhihuUrls` values can be found as shown in this screenshot: <img src="./readme/image.png">
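
To sanity-check a `config.json` before running the exporter, a small loader like the one below can help. This is a minimal sketch, not the project's own loading code: the function name `load_config` is hypothetical, and it only assumes the three keys shown above, treating a blank `outputPath` or `os` as "use the default".

```python
import json
from pathlib import Path


def load_config(path="config.json"):
    """Read config.json and return (collections, output_path, os_name).

    Hypothetical helper; the key names come from the example above.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    collections = []
    for entry in data.get("zhihuUrls", []):
        name, url = entry.get("name"), entry.get("url")
        if not name or not url:
            raise ValueError(f"each zhihuUrls entry needs a name and a url: {entry!r}")
        collections.append((name, url))
    # Blank or missing values mean "default path" / "auto-detect".
    output_path = data.get("outputPath") or None
    os_name = data.get("os") or None
    return collections, output_path, os_name
```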
#### Automatically Acquire Collections (Optional)
If you want to automatically acquire all collections, you can run:
```bash
python fetch_collections.py
```
This will automatically update the collection list in the `config.json` file.
#### Run the Main Program
```bash
python main.py
```
### Method 2: MCP Server Operation (AI Agent Method)
Suitable for integration with Claude Code or other AI tools that call the exporter directly.
#### Install MCP Dependencies
```bash
# Install MCP package
pip install mcp
```
#### Start MCP Server
```bash
python mcp_server.py
```
The server communicates over stdio and can be used from Claude Code or other MCP clients.
#### Available Tools
| Tool Name | Description |
|--------|------|
| `list_collections` | List all Zhihu collections in the configuration file |
| `export_collection` | Export the specified Zhihu collection to a Markdown file |
| `get_collection_info` | Get basic information (number of articles, etc.) of the specified collection |
| `search_collections` | Search for collections containing keywords in the configuration file |
##### list_collections
```python
# Return formatted collection list
```
##### export_collection
Parameters:
- `collection_url` (required): Collection URL
- `collection_name` (optional): Collection name
- `output_dir` (optional): Output directory
```python
# Example
export_collection(
    collection_url="https://www.zhihu.com/collection/123456789",
    collection_name="Python Study",
    output_dir="my_downloads"
)
```
##### get_collection_info
Parameters:
- `collection_url` (required): Collection URL
##### search_collections
Parameters:
- `keyword` (required): Search keyword
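
As an illustration of what a keyword search over the configured collections could look like, here is a minimal sketch. It assumes a case-insensitive match against the collection name only; the helper name `filter_collections` is hypothetical, and the actual matching rules of `search_collections` are not documented here.

```python
def filter_collections(collections, keyword):
    """Return the entries whose name contains the keyword.

    `collections` is a list of {"name": ..., "url": ...} dicts, as in
    config.json. The matching rule is an assumption.
    """
    needle = keyword.casefold()
    return [c for c in collections if needle in c["name"].casefold()]
```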
#### Use in Claude Code
Add the server to the `mcpServers` section of your Claude Code MCP configuration (for example, a project-level `.mcp.json` file or your user configuration file):
```json
{
  "mcpServers": {
    "zhihu-collections": {
      "command": "python",
      "args": ["mcp_server.py"],
      "env": {},
      "cwd": "Your project path"
    }
  }
}
```
You can then invoke the tools in natural language:
- "List my Zhihu collections"
- "Export collection https://www.zhihu.com/collection/123456789"
- "Search for collections containing Python"
- "Get the number of articles in collection https://www.zhihu.com/collection/123456789"
## Configuration Options Explanation
### Basic Configuration
- **zhihuUrls**: Collection list, each collection contains name and URL
- **outputPath**: Custom output path (optional, leave blank for default path)
- **os**: Operating system type (optional, leave blank for automatic detection)
### Supported Operating Systems and Path Formats
- **Windows**: `"D:\\Documents\\ZhihuExports"` or `"D:/Documents/ZhihuExports"`
- **Linux/Unix**: `"/usr/local/share/zhihu-exports"`
- **macOS**: `"~/Documents/ZhihuExports"`
- **Cygwin**: `"/cygdrive/d/Documents/ZhihuExports"`
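
The path rules above (a blank `outputPath` falls back to the default, and a path such as `~/Documents/ZhihuExports` is accepted) can be sketched as follows. This is an illustrative helper under stated assumptions, not the project's own path handling: the name `resolve_output_path` is hypothetical, and the default folder `downloads` is taken from the "Default Output Location" section of this README.

```python
import os.path

DEFAULT_OUTPUT_DIR = "downloads"  # default folder named in this README


def resolve_output_path(output_path, project_dir="."):
    """Return the directory exports should go to (hypothetical helper)."""
    if not output_path:
        # Blank or missing outputPath: fall back to the default folder.
        return os.path.join(project_dir, DEFAULT_OUTPUT_DIR)
    # Expand a leading "~" (as in the macOS example) and tidy separators.
    return os.path.normpath(os.path.expanduser(output_path))
```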
### Private Collection Access
For private collections, create a `cookies.json` file:
```json
[
{"name": "cookie_name", "value": "cookie_value"},
{"name": "another_cookie", "value": "another_value"}
]
```
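
For reference, a `cookies.json` file in this shape can be turned into a standard HTTP `Cookie` header with a few lines of Python. This is a sketch only: the helper name `load_cookie_header` is hypothetical, and how the exporter itself consumes the file is not described here.

```python
import json
from pathlib import Path


def load_cookie_header(path="cookies.json"):
    """Build a 'name=value; name2=value2' Cookie header from cookies.json."""
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

The resulting string could be sent as the `Cookie` request header by any HTTP client.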
## Output Results
### File Structure
```
Output Directory/
├── Collection Name 1/
│   ├── Article 1.md
│   ├── Article 2.md
│   └── assets/
│       ├── image1.jpg
│       └── image2.png
├── Collection Name 2/
├── logs/
│   ├── debug_20240101_120000.log
│   └── 20240101_120000.json
└── debug/
    ├── debug_answer_123456.html
    └── debug_post_789012.html
```
### Default Output Location
- **Default Path**: `downloads/` folder in the project directory
- **Custom Path**: Specify `outputPath` in `config.json`
- **Image Storage**: `assets/` folder in each collection directory
- **Log Files**: `logs/` folder in the output directory
- **Debugging Files**: `debug/` folder in the output directory (stores the HTML of pages that could not be parsed)
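
The locations listed above can be summarized in a small sketch that maps an output directory and a collection name to the folders involved. The helper name `export_layout` is hypothetical, and it assumes the collection folder is named exactly after the collection.

```python
from pathlib import Path


def export_layout(output_dir, collection_name):
    """Return the folders used for one collection (hypothetical helper)."""
    root = Path(output_dir)
    collection = root / collection_name
    return {
        "articles": collection,           # Markdown files
        "assets": collection / "assets",  # downloaded images
        "logs": root / "logs",            # log files
        "debug": root / "debug",          # HTML of pages that failed to parse
    }
```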
## Troubleshooting
### Common Issues
1. **Content Download Failure**
   - Check the `downloads/debug/` directory for saved HTML files and inspect the page structure
   - Check your network connection and confirm your cookies are still valid
   - Confirm that the article URL is accessible
2. **Empty Log File**
   - The real-time log refresh issue has been fixed
   - If the problem persists, check write permissions on the output directory
3. **TypeError: cannot unpack non-iterable NoneType object**
   - Fixed in v2.1
   - If you still see it, update to the latest version
4. **Column article returns "The article link was 404" but opens normally in the browser**
   - v2.1 added intelligent content detection and precise error analysis
   - Check the HTML files in the debug directory for the specific cause
   - You may need to update your cookies, or the page structure may have changed
### Debugging Support
- **Log Files**: `downloads/logs/debug_*.log` contains detailed processing information
- **Debugging HTML**: `downloads/debug/debug_*.html` saves unresolvable pages
- **Test Scripts**: `test/` directory contains various functional test scripts
# Bug Feedback
If you run into a problem, please report it in an issue. To make it easier to reproduce and fix, be sure to include the error message and any related URLs.
# Suggestions
Suggestions are welcome; please open an issue to start a discussion.
## Update Log
### v2.3 Anti-Crawling Mechanism Bypass
- 🔄 **API Fallback**: When an HTML request returns 403, the Zhihu API (`/api/v4/answers/{id}?include=content`) is called automatically to fetch the content
- 📦 **Streaming Download**: Column articles are downloaded as a stream, which avoids network interruptions on large articles (e.g., long articles with many images)
- 🔁 **Retry Mechanism**: Column article downloads are retried up to 3 times to improve the success rate
- 🔐 **Enhanced Headers**: Request headers were updated to mimic a modern browser and reduce the chance of being blocked by anti-crawling checks
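
The fallback and retry behaviour described above can be sketched generically. The code below is an illustration under stated assumptions, not the project's implementation: `Forbidden`, `fetch_with_retries`, and `fetch_with_api_fallback` are hypothetical names, the fetch functions are passed in by the caller, and "3 retries" is read as one initial attempt plus up to three more.

```python
import time


class Forbidden(Exception):
    """Hypothetical marker for an HTTP 403 response."""


def fetch_with_retries(fetch, retries=3, delay=1.0):
    """Call fetch(); on failure, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(delay)


def fetch_with_api_fallback(fetch_html, fetch_api):
    """Try the HTML page first; switch to the API only on a 403."""
    try:
        return fetch_html()
    except Forbidden:
        return fetch_api()
```

Keeping the two helpers separate mirrors the changelog, which attributes retries to column articles and the API fallback to requests that return 403.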
### v2.2 MCP Support
- 🤖 **MCP Server**: Added `mcp_server.py`, supporting direct calls by AI Agents
- 📋 **4 Tools**: Provide list_collections / export_collection / get_collection_info / search_collections tools
- 🔌 **Standard Protocol**: Follow MCP protocol, supporting mainstream AI tool integration such as Claude Code
### v2.1 Enhanced Features
- 🚀 **Real-Time Log System**: Support real-time log refresh and immediate display of processing progress
- 🛠️ **Robust Error Handling**: Fixed TypeError and other critical errors; a single article failure no longer affects overall processing
- 🔍 **Enhanced HTML Parsing**: Support multiple Zhihu page structures, improve content acquisition success rate
- 🔧 **Debugging File Generation**: Automatically save unresolvable page HTML to `debug/` directory for analysis
- 🔄 **Automatic Collection Acquisition**: Added `fetch_collections.py` independent script to automatically acquire collection lists
- 📊 **Detailed Error Logs**: Records specific error information and stack traces to make problems easier to locate
- 🎯 **Intelligent Content Detection**: When standard CSS selectors fail, automatically enable intelligent algorithms to detect article content
- 🔍 **Precise Error Analysis**: Distinguish different error types such as 404, login requirements, permission issues, etc.
### v2.0 New Features
- ✨ Batch processing of multiple collections
- ⚙️ Configuration file system (config.json)
- 📂 Custom output path support
- 🖥️ Cross-platform path processing
- ✨ Intelligent file deduplication function
- 📈 Detailed processing log system
- 🔄 Backward compatibility with old version configuration files
## Todo
- Optimize crawling speed
- Add more export format support
- GUI development
- Third-party large model API support