# Chrome Ingest
A Rust-based tool for extracting and indexing Chrome browsing history with intelligent content extraction and MCP (Model Context Protocol) server integration.
## Features
- **Chrome History Extraction**: Safely extracts browsing history from Chrome's SQLite database
- **Intelligent Content Extraction**: Multi-tier extraction pipeline:
  - **Readability**: Fast, local content extraction using Mozilla's readability algorithm
  - **Headless Browser**: Playwright-powered rendering for JavaScript-heavy sites
  - **SaaS Fallback**: External service integration (planned)
- **Markdown Conversion**: Converts extracted HTML to clean, LLM-friendly markdown
- **Full-Text Search**: SQLite FTS5-powered search with snippet highlighting
- **MCP Server**: Model Context Protocol server for LLM integration
- **Incremental Sync**: Only processes new/changed content
- **Content Deduplication**: SHA-256 based deduplication
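
The deduplication step can be sketched as a content-hash lookup before indexing. The tool hashes with SHA-256 (presumably via the `sha2` crate); to keep this sketch dependency-free it substitutes std's `DefaultHasher`, so the hash function here is a stand-in, not the tool's actual one:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for the SHA-256 content hash: the real tool uses SHA-256,
/// but std's DefaultHasher keeps this sketch free of external crates.
fn content_key(markdown: &str) -> u64 {
    let mut h = DefaultHasher::new();
    markdown.hash(&mut h);
    h.finish()
}

/// Index a page only if its extracted content has not been seen before.
/// Returns true when the content is new, false when it is a duplicate.
fn dedup_insert(seen: &mut HashMap<u64, String>, url: &str, markdown: &str) -> bool {
    let key = content_key(markdown);
    if seen.contains_key(&key) {
        return false; // identical content already indexed under another URL
    }
    seen.insert(key, url.to_string());
    true
}

fn main() {
    let mut seen = HashMap::new();
    dedup_insert(&mut seen, "https://a.example", "# Same page");
    // A second URL serving byte-identical content is skipped.
    dedup_insert(&mut seen, "https://b.example", "# Same page");
    println!("unique pages indexed: {}", seen.len());
}
```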
## Installation
### Prerequisites
- Rust 1.70+
- Chrome/Chromium browser
- Playwright (for headless extraction)
### Build from Source
```bash
git clone https://github.com/yourusername/chrome_ingest.git
cd chrome_ingest
cargo build --release
```
### Install Playwright (Optional)
For headless browser extraction:
```bash
# Install Playwright dependencies
npx playwright install chromium
```
## Quick Start
### 1. One-shot Sync
Extract and index your Chrome history:
```bash
./target/release/chrome_ingest sync --since "2024-01-01T00:00:00Z"
```
### 2. Start MCP Server
For LLM integration via Model Context Protocol:
```bash
./target/release/chrome_ingest mcp
```
### 3. Run as Daemon
Continuous monitoring and extraction:
```bash
./target/release/chrome_ingest daemon --interval 300 # 5-minute intervals
```
## Usage
### CLI Commands
#### Sync Command
```bash
chrome_ingest sync [OPTIONS]

Options:
  --since <SINCE>                  Force sync start point (RFC3339 format)
  --history-path <HISTORY_PATH>    Path to Chrome History database
  --db <DB>                        Path to local SQLite database [default: ./data/ingest.db]
  --parallel <PARALLEL>            Number of parallel fetches [default: 3]
  --extractor-strategy <STRATEGY>  Extractor strategy: auto, readability, headless, saas [default: auto]
```
#### MCP Server
```bash
chrome_ingest mcp [OPTIONS]

Options:
  --db <DB>  Path to local SQLite database [default: ./data/ingest.db]
```
#### Daemon Mode
```bash
chrome_ingest daemon [OPTIONS]

Options:
  --interval <INTERVAL>  Sync interval in seconds [default: 3600]
  --db <DB>              Path to local SQLite database [default: ./data/ingest.db]
```
### MCP Tools
The MCP server provides the following tools for LLM integration:
#### `search_pages`
Search through extracted content with optional date filtering:
```json
{
  "query": "rust programming",
  "limit": 10,
  "start_date": "2024-01-01",
  "end_date": "2024-12-31"
}
```
#### `get_page`
Retrieve full content of a specific page:
```json
{
  "page_id": 123
}
```
#### `get_pages_by_date`
Get page titles and URLs by visit date:
```json
{
  "start_date": "2024-06-14",
  "end_date": "2024-06-14",
  "limit": 50
}
```
#### `get_stats`
Get database statistics and recent activity:
```json
{}
```
### Configuration with Claude Desktop
To use with Claude Desktop, add to your MCP configuration:
```json
{
  "mcpServers": {
    "chrome-ingest": {
      "command": "/path/to/chrome_ingest",
      "args": ["mcp", "--db", "/absolute/path/to/data/ingest.db"]
    }
  }
}
```
## Architecture
### Database Schema
- **pages**: Unique URLs with extracted content (markdown text, titles, metadata)
- **visits**: Chrome visit history linked to pages
- **pages_fts**: Full-text search index using SQLite FTS5
- **ingest_state**: Sync checkpoint tracking
### Content Extraction Pipeline
1. **Chrome History Parsing**: Extract URLs and visit timestamps
2. **Content Fetching**: Multi-strategy content extraction
3. **Markdown Conversion**: Clean HTML → Markdown using html2md
4. **Deduplication**: SHA-256 content hashing
5. **Indexing**: FTS5 full-text search index
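
One detail of step 1 worth noting: Chrome's History database stores visit times as microseconds since 1601-01-01 UTC (the Windows FILETIME epoch), not Unix time. Converting amounts to subtracting the fixed 11,644,473,600-second offset between the two epochs, as in this minimal sketch:

```rust
/// Seconds between 1601-01-01 (Chrome's epoch) and 1970-01-01 (Unix epoch).
const WINDOWS_TO_UNIX_EPOCH_SECS: i64 = 11_644_473_600;

/// Convert a Chrome `visit_time` (microseconds since 1601) to Unix seconds.
fn chrome_time_to_unix_secs(chrome_us: i64) -> i64 {
    chrome_us / 1_000_000 - WINDOWS_TO_UNIX_EPOCH_SECS
}

fn main() {
    // 2024-01-01T00:00:00Z is 1_704_067_200 in Unix seconds.
    let chrome_us = (1_704_067_200 + WINDOWS_TO_UNIX_EPOCH_SECS) * 1_000_000;
    println!("{}", chrome_time_to_unix_secs(chrome_us)); // prints 1704067200
}
```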
### Extractor Strategies
- **Auto**: Tries readability first, falls back to headless if content is insufficient
- **Readability**: Fast, local extraction using Mozilla's algorithm
- **Headless**: Playwright-powered browser rendering for SPA/JavaScript sites
- **SaaS**: External service integration (planned)
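
The `auto` fallback can be sketched as a simple length check on the readability output. The threshold, type, and function names below are illustrative assumptions, not the tool's actual API:

```rust
/// Which extractor produced (or should produce) the final content.
#[derive(Debug, PartialEq)]
enum Strategy {
    Readability,
    Headless,
}

/// Assumed cutoff below which readability output counts as "insufficient".
const MIN_CONTENT_CHARS: usize = 200;

/// Hypothetical `auto` decision: keep the fast readability result if it
/// yielded enough text, otherwise escalate to headless rendering.
fn pick_strategy(readability_output: &str) -> Strategy {
    if readability_output.trim().len() >= MIN_CONTENT_CHARS {
        Strategy::Readability
    } else {
        Strategy::Headless
    }
}

fn main() {
    let article = "lorem ipsum ".repeat(50); // plenty of extracted text
    println!("{:?}", pick_strategy(&article));      // Readability
    println!("{:?}", pick_strategy("<div></div>")); // Headless
}
```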
## Development
### Running Tests
```bash
cargo test
```
### Database Migrations
Migrations are automatically applied on startup. Manual migration:
```bash
sqlx migrate run --database-url sqlite:./data/ingest.db
```
### Environment Variables
- `CHROME_HISTORY_PATH`: Override Chrome history database path
- `DATABASE_URL`: Override local database path
- `RUST_LOG`: Set logging level (debug, info, warn, error)
## Troubleshooting
### Common Issues
1. **Chrome database locked**: Chrome must be closed during sync
2. **Playwright not found**: Install Playwright dependencies
3. **Permission denied**: Ensure read access to Chrome profile directory
### Debug Mode
Enable debug logging:
```bash
RUST_LOG=debug ./target/release/chrome_ingest sync
```
## Performance
- **Incremental sync**: Only processes new visits since last sync
- **Parallel extraction**: Configurable concurrent fetchers
- **Content caching**: Avoids re-extraction of unchanged content
- **Efficient indexing**: SQLite FTS5 with optimized queries
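
Incremental sync boils down to filtering visits against a stored checkpoint and then advancing that checkpoint. A minimal sketch, with illustrative field names rather than the tool's actual schema:

```rust
/// A single row from Chrome's `visits` table (fields are illustrative).
struct Visit {
    url: String,
    visit_time_us: i64, // Chrome-epoch microseconds
}

/// Return only visits newer than the checkpoint, plus the new checkpoint
/// (the newest visit time seen, or the old checkpoint if nothing is new).
fn new_visits_since(visits: &[Visit], checkpoint_us: i64) -> (Vec<&Visit>, i64) {
    let fresh: Vec<&Visit> = visits
        .iter()
        .filter(|v| v.visit_time_us > checkpoint_us)
        .collect();
    let next_checkpoint = fresh
        .iter()
        .map(|v| v.visit_time_us)
        .max()
        .unwrap_or(checkpoint_us);
    (fresh, next_checkpoint)
}

fn main() {
    let visits = vec![
        Visit { url: "https://old.example".into(), visit_time_us: 10 },
        Visit { url: "https://new.example".into(), visit_time_us: 20 },
    ];
    let (fresh, checkpoint) = new_visits_since(&visits, 10);
    println!("{} new visit(s), checkpoint now {}", fresh.len(), checkpoint);
}
```

In the real tool the checkpoint would live in the `ingest_state` table so a crash mid-sync never loses or double-processes visits.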
## Limitations
- Chrome must be closed during history extraction
- Headless extraction requires Playwright setup
- Initial sync of a large history can take a long time
- Some dynamic content may not extract properly
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under either of
- Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Acknowledgments
- [Mozilla Readability](https://github.com/mozilla/readability) for content extraction
- [Playwright](https://playwright.dev/) for headless browser automation
- [SQLite FTS5](https://www.sqlite.org/fts5.html) for full-text search
- [Model Context Protocol](https://modelcontextprotocol.io/) for LLM integration