# Dual-Layer MemCube Agent Backend
A robust, modular AI backend implementing the **Dual-Layer Memory Architecture**. It optimizes for both low-latency conversation (Hot Memory) and massive historical recall (Cold Memory/Vector DB).
Built with **FastAPI**, **ChromaDB**, and designed to be fully customizable—easily switch between **OpenAI**, **Ollama**, **vLLM**, or any OpenAI-compatible API.
## 🚀 Key Features
* **Dual-Layer Memory System**:
  * **L1 (Hot Memory)**: In-memory LRU cache for recent context. Includes an "importance" lock to prevent critical information from being evicted.
  * **L2 (Cold Memory)**: Persistent vector database (ChromaDB) for long-term storage of overflowed memories.
* **Waterfall Retrieval**: Intelligent context lookup that queries L1 first and falls back to the vector DB only when necessary, reducing latency and cost.
* **Customizable AI Provider**: Seamless support for:
  * **OpenAI** (GPT-3.5/4)
  * **Local LLMs** (Ollama, vLLM, LocalAI)
  * **Mock Mode** (zero-cost testing)
* **Production Ready**: Built on FastAPI with asynchronous request handling, modular architecture, and structured logging.
## 📊 Benchmark Results
We compared MemCube against a traditional chat-history system (keeping the last 30 messages in context). Both used the same AI API for a fair comparison.
### Performance Summary
| Metric | MemCube | Traditional | Advantage |
|--------|---------|-------------|-----------|
| **Avg Latency** | 7,758 ms | 12,272 ms | **37% faster** ⚡ |
| **Min Latency** | 5,368 ms | 6,141 ms | 13% faster |
| **Max Latency** | 16,218 ms | 19,746 ms | 18% faster |
| **Recall Rate** | 100% | 100% | Equal ✓ |
| **API Calls** | 30 | 30 | Equal |
### Visual Comparison
*(Comparison charts can be generated locally with `generate_charts.py`; see "Running the Benchmark" below.)*
### Why MemCube is Faster
1. **Deferred Embedding** - the memory is saved immediately; its embedding is generated in the background
2. **Smart Retrieval** - if enough recent context exists in L1, the expensive vector search is skipped
3. **No Query Embedding** - for recent conversations, no query embedding needs to be generated
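The deferred-embedding idea in point 1 can be sketched with a background worker: the write path stores the raw text and returns at once, while a separate thread fills in the embedding later. This is an illustrative sketch, not the project's actual implementation; `DeferredEmbedder` and `fake_embed` are made-up names.

```python
import queue
import threading

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real (slow) embedding call, e.g. an API request.
    return [float(len(text))]

class DeferredEmbedder:
    """Illustrative sketch: store text immediately, embed in a background worker."""

    def __init__(self):
        self.memories = {}          # id -> {"content": ..., "embedding": ...}
        self.pending = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def save(self, mem_id: str, content: str) -> None:
        # Returns as soon as the text is stored; no embedding on the hot path.
        self.memories[mem_id] = {"content": content, "embedding": None}
        self.pending.put(mem_id)

    def _run(self):
        while True:
            mem_id = self.pending.get()
            if mem_id is None:
                break               # shutdown sentinel
            mem = self.memories[mem_id]
            mem["embedding"] = fake_embed(mem["content"])
            self.pending.task_done()

    def close(self):
        self.pending.put(None)
        self.worker.join()
```

The caller's latency is bounded by a dictionary insert and a queue put, which is why the write path stays fast even when the embedding backend is slow.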
### Additional Advantages Over Traditional Systems
| Feature | MemCube | Traditional |
|---------|---------|-------------|
| **Scalability** | Handles 1000+ conversations | Context window limited |
| **Cross-Session Memory** | ✅ Persistent | ❌ Session-only |
| **Old Memory Retrieval** | ✅ Vector search | ❌ Lost after window |
| **Critical Info Protection** | ✅ Importance lock | ❌ FIFO eviction |
### Running the Benchmark
```bash
# Start the backend
python -m app.main
# In another terminal, run benchmark
python benchmark.py
# Generate charts (optional)
python generate_charts.py
```
## 🛠️ Installation
1. **Clone the repository** (if applicable) or navigate to the project folder:
```bash
cd memcube_backend
```
2. **Install Dependencies**:
Using a virtual environment (venv/conda) is recommended.
```bash
pip install -r requirements.txt
```
## ⚙️ Configuration
Copy the example configuration file:
```bash
cp .env.example .env
```
### Option A: Using OpenAI
Edit `.env` to use official OpenAI API:
```ini
AI_PROVIDER_TYPE=openai
AI_API_KEY=sk-proj-...
AI_CHAT_MODEL=gpt-3.5-turbo
AI_EMBEDDING_MODEL=text-embedding-3-small
```
### Option B: Using Local LLM (Ollama)
Edit `.env` to point to your local instance. **No API key required.**
```ini
AI_PROVIDER_TYPE=custom
AI_BASE_URL=http://localhost:11434/v1
AI_API_KEY=ollama
AI_CHAT_MODEL=llama3
AI_EMBEDDING_MODEL=nomic-embed-text
```
### Option C: Mock Mode (Testing)
Perfect for testing logic without running an LLM.
```ini
AI_PROVIDER_TYPE=mock
```
## 🏃‍♂️ Usage
Start the server:
```bash
python -m app.main
```
The server will start at `http://localhost:8000`.
### API Endpoints
Interactive documentation is available at **[http://localhost:8000/docs](http://localhost:8000/docs)**.
#### 1. Chat with Memory
`POST /api/v1/chat`
```json
{
  "message": "My secret code is 42",
  "importance": "high"
}
```
*The agent will respond using context from L1 or L2 memory. High-importance messages are protected from being forgotten (evicted from L1).*
#### 2. Add Memory Manually
`POST /api/v1/memory`
```json
{
  "content": "The project deadline is next Friday.",
  "importance": "normal",
  "source": "slack_integration"
}
```
## 🔌 MCP Server Support (Claude Desktop Integration)
MemCube implements the **Model Context Protocol (MCP)**, allowing it to be used as a native memory tool by intelligent agents like **Claude Desktop** or **Cursor**.
### Features
* **Persistent Memory**: Claude can save important facts about you that persist across sessions.
* **Semantic Search**: Claude can retrieve relevant past context based on your current query.
### Setup for Claude Desktop
1. Ensure the local backend is running (`python -m app.main`)
2. Edit your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json` on Mac, or `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
```json
{
  "mcpServers": {
    "memcube": {
      "command": "python",
      "args": ["/absolute/path/to/memcube_backend/mcp_server.py"]
    }
  }
}
```
3. Restart Claude Desktop. You will see a 🛠️ icon indicating that the MemCube tools (`save_memory`, `retrieve_memory`) are available.
## 📂 Project Structure
```text
memcube_backend/
├── app/
│   ├── api/              # API Routes
│   ├── core/             # Config & Settings
│   ├── llm/              # AI Provider Factory (OpenAI/Custom/Mock)
│   ├── models/           # Pydantic Data Models
│   ├── services/         # Business Logic
│   │   ├── manager.py    # Memory Manager (Orchestrator)
│   │   ├── memory_l1.py  # Hot Memory Logic
│   │   └── memory_l2.py  # Cold Memory Logic
│   └── main.py           # App Entry Point
├── .env                  # Environment Variables
└── requirements.txt      # Dependencies
```
## 🧠 Architecture Details
### The "Spillover" Mechanism
When L1 (Hot Memory) reaches capacity, it evicts the **Least Recently Used (LRU)** item to L2 (Cold Memory).
* **Exception**: If an item is marked `importance="high"`, it is moved to the front of the cache and **not evicted**, ensuring critical instructions (like "My name is Alice" or "Act as a python expert") are always fast to access.
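The lock-aware eviction described above can be sketched with an `OrderedDict`-based LRU cache. This is a minimal illustration, not the actual `memory_l1.py` API; `HotMemory` and the `spill` callback are made-up names.

```python
from collections import OrderedDict

class HotMemory:
    """Illustrative LRU cache with an importance lock (not the real memory_l1 API)."""

    def __init__(self, capacity: int, spill):
        self.capacity = capacity    # max items held in L1
        self.spill = spill          # callback persisting an evicted item to L2
        self.items = OrderedDict()  # key -> (content, importance)

    def add(self, key, content, importance="normal"):
        self.items[key] = (content, importance)
        self.items.move_to_end(key)              # mark as most recently used
        while len(self.items) > self.capacity:
            if not self._evict_one():
                break  # everything is locked; tolerate temporary overflow

    def _evict_one(self) -> bool:
        # Scan from least to most recently used, skipping locked items.
        for key, (content, importance) in self.items.items():
            if importance != "high":
                del self.items[key]
                self.spill(key, content)         # spill over to cold storage
                return True
        return False
```

The key property is that `importance="high"` entries are simply never candidates for eviction, so eviction pressure always falls on the least recently used *normal* item.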
### The "Waterfall" Retrieval
When retrieving context for a user query:
1. **Search L1**: Computes cosine similarity with all hot memories.
2. **Early Exit**: If a match is found with similarity > `0.85`, it returns immediately.
3. **Fallback**: If no good match is found in L1, it searches the Vector Database (L2).
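The three steps above can be sketched as follows. The 0.85 threshold comes from the description; the function names and the `(content, embedding)` shape of hot memories are illustrative assumptions, and the L2 lookup is passed in as a callback.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # early-exit cutoff from step 2

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def waterfall_retrieve(query_vec, hot_memories, search_l2):
    """hot_memories: list of (content, embedding); search_l2: vector-DB fallback."""
    best, best_score = None, 0.0
    for content, emb in hot_memories:          # step 1: scan L1
        score = cosine(query_vec, emb)
        if score > best_score:
            best, best_score = content, score
    if best_score > SIMILARITY_THRESHOLD:      # step 2: early exit on a strong match
        return best
    return search_l2(query_vec)                # step 3: fall back to L2
```

Because L1 is small and already in memory, step 1 is cheap; the expensive vector-DB query in step 3 only runs when no hot memory clears the threshold.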