# Chrome Ingest
A Rust-based tool for extracting and indexing Chrome browsing history with intelligent content extraction and MCP (Model Context Protocol) server integration.
## Features
- **Chrome History Extraction**: Safely extracts browsing history from Chrome's SQLite database
- **Intelligent Content Extraction**: Multi-tier extraction pipeline:
  - **Readability**: Fast, local content extraction using Mozilla's readability algorithm
  - **Headless Browser**: Playwright-powered rendering for JavaScript-heavy sites
  - **SaaS Fallback**: External service integration (planned)
- **Markdown Conversion**: Converts extracted HTML to clean, LLM-friendly markdown
- **Full-Text Search**: SQLite FTS5-powered search with snippet highlighting
- **MCP Server**: Model Context Protocol server for LLM integration
- **Incremental Sync**: Only processes new/changed content
- **Content Deduplication**: SHA-256 based deduplication
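
The deduplication step can be sketched as a content-hash lookup before indexing. The tool hashes with SHA-256 (presumably via the `sha2` crate); to keep this sketch dependency-free it substitutes std's `DefaultHasher`, so the hash function here is a stand-in, not the tool's actual one:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the SHA-256 content hash: the real tool uses SHA-256,
/// but std's DefaultHasher keeps this sketch free of external crates.
fn content_key(markdown: &str) -> u64 {
    let mut h = DefaultHasher::new();
    markdown.hash(&mut h);
    h.finish()
}

/// Index a page only if its extracted content has not been seen before.
/// Returns true when the content is new, false when it is a duplicate.
fn dedup_insert(seen: &mut HashMap<u64, String>, url: &str, markdown: &str) -> bool {
    let key = content_key(markdown);
    if seen.contains_key(&key) {
        return false; // identical content already indexed under another URL
    }
    seen.insert(key, url.to_string());
    true
}

fn main() {
    let mut seen = HashMap::new();
    dedup_insert(&mut seen, "https://a.example", "# Same page");
    // A second URL serving byte-identical content is skipped.
    dedup_insert(&mut seen, "https://b.example", "# Same page");
    println!("unique pages indexed: {}", seen.len());
}
```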
## Installation
### Prerequisites
- Rust 1.70+
- Chrome/Chromium browser
- Playwright (for headless extraction)
### Build from Source
```bash
git clone https://github.com/yourusername/chrome_ingest.git
cd chrome_ingest
cargo build --release
```
### Install Playwright (Optional)
For headless browser extraction:
```bash
# Install Playwright dependencies
npx playwright install chromium
```
## Quick Start
### 1. One-shot Sync
Extract and index your Chrome history:
```bash
./target/release/chrome_ingest sync --since "2024-01-01T00:00:00Z"
```
### 2. Start MCP Server
For LLM integration via Model Context Protocol:
```bash
./target/release/chrome_ingest mcp
```
### 3. Run as Daemon
Continuous monitoring and extraction:
```bash
./target/release/chrome_ingest daemon --interval 300 # 5-minute intervals
```
## Usage
### CLI Commands
#### Sync Command
```bash
chrome_ingest sync [OPTIONS]

Options:
  --since <SINCE>                  Force sync start point (RFC3339 format)
  --history-path <HISTORY_PATH>    Path to Chrome History database
  --db <DB>                        Path to local SQLite database [default: ./data/ingest.db]
  --parallel <PARALLEL>            Number of parallel fetches [default: 3]
  --extractor-strategy <STRATEGY>  Extractor strategy: auto, readability, headless, saas [default: auto]
```
#### MCP Server
```bash
chrome_ingest mcp [OPTIONS]

Options:
  --db <DB>  Path to local SQLite database [default: ./data/ingest.db]
```
#### Daemon Mode
```bash
chrome_ingest daemon [OPTIONS]

Options:
  --interval <INTERVAL>  Sync interval in seconds [default: 3600]
  --db <DB>              Path to local SQLite database [default: ./data/ingest.db]
```
### MCP Tools
The MCP server provides the following tools for LLM integration:
#### `search_pages`
Search through extracted content with optional date filtering:
```json
{
  "query": "rust programming",
  "limit": 10,
  "start_date": "2024-01-01",
  "end_date": "2024-12-31"
}
```
#### `get_page`
Retrieve full content of a specific page:
```json
{
  "page_id": 123
}
```
#### `get_pages_by_date`
Get page titles and URLs by visit date:
```json
{
  "start_date": "2024-06-14",
  "end_date": "2024-06-14",
  "limit": 50
}
```
#### `get_stats`
Get database statistics and recent activity:
```json
{}
```
### Configuration with Claude Desktop
To use with Claude Desktop, add to your MCP configuration:
```json
{
  "mcpServers": {
    "chrome-ingest": {
      "command": "/path/to/chrome_ingest",
      "args": ["mcp", "--db", "/absolute/path/to/data/ingest.db"]
    }
  }
}
```
## Architecture
### Database Schema
- **pages**: Unique URLs with extracted content (markdown text, titles, metadata)
- **visits**: Chrome visit history linked to pages
- **pages_fts**: Full-text search index using SQLite FTS5
- **ingest_state**: Sync checkpoint tracking
### Content Extraction Pipeline
1. **Chrome History Parsing**: Extract URLs and visit timestamps
2. **Content Fetching**: Multi-strategy content extraction
3. **Markdown Conversion**: Clean HTML → Markdown using html2md
4. **Deduplication**: SHA-256 content hashing
5. **Indexing**: FTS5 full-text search index
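
One detail of step 1 worth noting: Chrome's History database stores visit times as microseconds since 1601-01-01 UTC (the Windows FILETIME epoch), not Unix time. Converting amounts to subtracting the fixed 11,644,473,600-second offset between the two epochs, as in this minimal sketch:

```rust
/// Seconds between 1601-01-01 (Chrome's epoch) and 1970-01-01 (Unix epoch).
const WINDOWS_TO_UNIX_EPOCH_SECS: i64 = 11_644_473_600;

/// Convert a Chrome `visit_time` (microseconds since 1601) to Unix seconds.
fn chrome_time_to_unix_secs(chrome_us: i64) -> i64 {
    chrome_us / 1_000_000 - WINDOWS_TO_UNIX_EPOCH_SECS
}

fn main() {
    // 2024-01-01T00:00:00Z is 1_704_067_200 in Unix seconds.
    let chrome_us = (1_704_067_200 + WINDOWS_TO_UNIX_EPOCH_SECS) * 1_000_000;
    println!("{}", chrome_time_to_unix_secs(chrome_us)); // prints 1704067200
}
```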
### Extractor Strategies
- **Auto**: Tries readability first, falls back to headless if content is insufficient
- **Readability**: Fast, local extraction using Mozilla's algorithm
- **Headless**: Playwright-powered browser rendering for SPA/JavaScript sites
- **SaaS**: External service integration (planned)
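
The `auto` fallback can be sketched as a simple length check on the readability output. The threshold, type, and function names below are illustrative assumptions, not the tool's actual API:

```rust
/// Which extractor produced (or should produce) the final content.
#[derive(Debug, PartialEq)]
enum Strategy {
    Readability,
    Headless,
}

/// Assumed cutoff below which readability output counts as "insufficient".
const MIN_CONTENT_CHARS: usize = 200;

/// Hypothetical `auto` decision: keep the fast readability result if it
/// yielded enough text, otherwise escalate to headless rendering.
fn pick_strategy(readability_output: &str) -> Strategy {
    if readability_output.trim().len() >= MIN_CONTENT_CHARS {
        Strategy::Readability
    } else {
        Strategy::Headless
    }
}

fn main() {
    let article = "lorem ipsum ".repeat(50); // plenty of extracted text
    println!("{:?}", pick_strategy(&article));      // Readability
    println!("{:?}", pick_strategy("<div></div>")); // Headless
}
```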
## Development
### Running Tests
```bash
cargo test
```
### Database Migrations
Migrations are automatically applied on startup. Manual migration:
```bash
sqlx migrate run --database-url sqlite:./data/ingest.db
```
### Environment Variables
- `CHROME_HISTORY_PATH`: Override Chrome history database path
- `DATABASE_URL`: Override local database path
- `RUST_LOG`: Set logging level (debug, info, warn, error)
## Troubleshooting
### Common Issues
1. **Chrome database locked**: Chrome must be closed during sync
2. **Playwright not found**: Install Playwright dependencies
3. **Permission denied**: Ensure read access to Chrome profile directory
### Debug Mode
Enable debug logging:
```bash
RUST_LOG=debug ./target/release/chrome_ingest sync
```
## Performance
- **Incremental sync**: Only processes new visits since last sync
- **Parallel extraction**: Configurable concurrent fetchers
- **Content caching**: Avoids re-extraction of unchanged content
- **Efficient indexing**: SQLite FTS5 with optimized queries
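
Incremental sync boils down to filtering visits against a stored checkpoint and then advancing that checkpoint. A minimal sketch, with illustrative field names rather than the tool's actual schema:

```rust
/// A single row from Chrome's `visits` table (fields are illustrative).
struct Visit {
    url: String,
    visit_time_us: i64, // Chrome-epoch microseconds
}

/// Return only visits newer than the checkpoint, plus the new checkpoint
/// (the newest visit time seen, or the old checkpoint if nothing is new).
fn new_visits_since(visits: &[Visit], checkpoint_us: i64) -> (Vec<&Visit>, i64) {
    let fresh: Vec<&Visit> = visits
        .iter()
        .filter(|v| v.visit_time_us > checkpoint_us)
        .collect();
    let next_checkpoint = fresh
        .iter()
        .map(|v| v.visit_time_us)
        .max()
        .unwrap_or(checkpoint_us);
    (fresh, next_checkpoint)
}

fn main() {
    let visits = vec![
        Visit { url: "https://old.example".into(), visit_time_us: 10 },
        Visit { url: "https://new.example".into(), visit_time_us: 20 },
    ];
    let (fresh, checkpoint) = new_visits_since(&visits, 10);
    println!("{} new visit(s), checkpoint now {}", fresh.len(), checkpoint);
}
```

In the real tool the checkpoint would live in the `ingest_state` table so a crash mid-sync never loses or double-processes visits.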
## Limitations
- Chrome must be closed during history extraction
- Headless extraction requires Playwright setup
- Initial sync of a large history can take a long time
- Some dynamic content may not extract properly
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under either of
- Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Acknowledgments
- [Mozilla Readability](https://github.com/mozilla/readability) for content extraction
- [Playwright](https://playwright.dev/) for headless browser automation
- [SQLite FTS5](https://www.sqlite.org/fts5.html) for full-text search
- [Model Context Protocol](https://modelcontextprotocol.io/) for LLM integration