Content
# Auto-claude-code-research-in-sleep
[README](README_CN.md) | English

> 🌙 **Let Claude Code do research while you sleep.** Wake up to find your paper scored, weaknesses identified, experiments run, and narrative rewritten — autonomously.
Custom [Claude Code](https://docs.anthropic.com/en/docs/claude-code) skills for autonomous ML research workflows. These skills orchestrate **cross-model collaboration** — Claude Code drives the research while an external LLM (via [Codex MCP](https://github.com/openai/codex)) acts as a critical reviewer.
## 📈 Score Progression (Real Run)
A real overnight 4-round run on an ML research project, from borderline reject to submission-ready:
| Round | Score | What Happened |
|-------|-------|---------------|
| Initial | 5.0/10 | Borderline reject |
| Round 1 | 6.5/10 | Added standard metrics, discovered metric decoupling |
| Round 2 | 6.8/10 | Key claim failed to reproduce, pivoted narrative |
| Round 3 | 7.0/10 | Large seed study killed main improvement claim |
| Round 4 | **7.5/10** ✅ | Diagnostic evidence solidified, **submission ready** |
The loop autonomously ran **20+ GPU experiments**, rewrote the paper's narrative framing, and killed claims that didn't hold up — all without human intervention.
---
## 🔄 Workflows
These skills are designed to be composed into two main research workflows:
### Workflow 1: Auto Research Loop 🔁 (sleep & wake up to results)
> **"Review my paper, fix what's wrong, repeat until it's good."**
```
┌─────────────────────────────────────────────────────────────┐
│ Auto Review Loop │
│ │
│ /research-review /auto-review-loop │
│ (single deep review) (autonomous loop) │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ External │──▶│ Implement│──▶│ Monitor │──▶ repeat │
│ │ LLM │ │ fixes │ │ results │ until │
│ │ reviews │ │ & run │ │ │ score ≥ 6 │
│ └──────────┘ │ experiments│ └──────────┘ │
│ └──────────┘ │
│ │
│ Supporting skills: │
│ /analyze-results — interpret experiment outputs │
│ /monitor-experiment — check progress, collect results │
└─────────────────────────────────────────────────────────────┘
```
**Skills involved:** `auto-review-loop` + `research-review` + `analyze-results` + `monitor-experiment`
**🛡️ Key safety features:**
- 🔒 **MAX_ROUNDS = 4** — prevents infinite loops; stops early if score threshold is met
- ⏱️ **> 4 GPU-hour experiments skipped** — won't launch massive jobs; flags them for manual follow-up
- 🧠 **Prefer reframing over new experiments** — when both can address a weakness, chooses the cheaper path
- 🪞 **No hiding weaknesses** — explicit rule: "Do NOT hide weaknesses to game a positive score"
- 🔧 **Fix before re-review** — must actually implement fixes before resubmitting; no empty promises
📝 **Blog post:** [开源 | 睡觉 Claude 自动跑实验改文](http://xhslink.com/o/5cBMTDigNXz)
### Workflow 2: Literature & Idea Discovery 🔍
> **"What's the state of the art? Where are the gaps?"**
```
┌─────────────────────────────────────────────────────────────┐
│ Idea Discovery │
│ │
│ /research-lit /research-review │
│ (find & analyze papers) (get critical feedback) │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Search │ │ External │ │
│ │ arXiv, │──────────────▶│ LLM │ │
│ │ Scholar │ "here's the │ evaluates│ │
│ │ for gaps │ landscape" │ your idea│ │
│ └──────────┘ └──────────┘ │
│ │
│ Typical flow: │
│ 1. /research-lit "discrete diffusion models" │
│ 2. Read the landscape summary, spot a gap │
│ 3. /research-review "my idea to fix X using Y" │
│ 4. Iterate on the idea with critical feedback │
└─────────────────────────────────────────────────────────────┘
```
**Skills involved:** `research-lit` + `research-review`
📝 **Blog post:** [Claude Code 两月 NeurIPS 指北](http://xhslink.com/o/7IvAJQ41IBA)
---
## 🧰 All Skills
| Skill | Description | Needs Codex MCP? |
|-------|-------------|-----------------|
| 🔬 [`research-review`](skills/research-review/SKILL.md) | Single-round deep review from external LLM (xhigh reasoning) | Yes |
| 🔁 [`auto-review-loop`](skills/auto-review-loop/SKILL.md) | Autonomous multi-round review→fix→re-review loop (max 4 rounds) | Yes |
| 📚 [`research-lit`](skills/research-lit/SKILL.md) | Search papers, analyze related work, find research gaps | No |
| 📊 [`analyze-results`](skills/analyze-results/SKILL.md) | Analyze experiment results, compute statistics, generate insights | No |
| 👀 [`monitor-experiment`](skills/monitor-experiment/SKILL.md) | Monitor running experiments, check progress, collect results | No |
---
## ⚙️ Setup
### Prerequisites
1. [Claude Code](https://docs.anthropic.com/en/docs/claude-code) installed
2. (For review skills) [Codex CLI](https://github.com/openai/codex) installed and configured as MCP server:
```bash
npm install -g @openai/codex
claude mcp add codex -s user -- codex mcp-server
```
### Install Skills
```bash
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
cd Auto-claude-code-research-in-sleep
# Install all skills globally
cp -r skills/* ~/.claude/skills/
# Or install specific skills
cp -r skills/auto-review-loop ~/.claude/skills/
cp -r skills/research-lit ~/.claude/skills/
```
### Usage
```
> /research-lit discrete diffusion language models
> /research-review my paper on training dynamics in D-LLMs
> /auto-review-loop ML paper on factorized gap diagnosis
> /analyze-results figures/*.json
> /monitor-experiment server5
```
### 🌙 Auto-Allow for Overnight Runs (Optional)
To run the auto-review loop without clicking permission prompts, add to `.claude/settings.local.json`:
```json
{
"permissions": {
"allow": [
"mcp__codex__codex",
"mcp__codex__codex-reply",
"Write",
"Edit",
"Skill(auto-review-loop)"
]
}
}
```
## 🏗️ How It Works
```
┌─────────────────────────────────────────────────┐
│ Claude Code │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Read │ │ Write │ │ SSH to │ │
│ │ project │───▶│ code & │───▶│ GPU │ │
│ │ context │ │ scripts │ │ server │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ Codex MCP (External LLM) │ │
│ │ │ │
│ │ Round 1: "Score 5/10. Weaknesses: ..." │ │
│ │ Round 2: "Score 6.5. Better, but ..." │ │
│ │ Round 3: "Score 7.0. Almost there..." │ │
│ │ Round 4: "Score 7.5. Ready." ✅ │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
```
The key insight: **Claude Code handles execution** (reading files, writing code, running experiments, collecting results) while **the external LLM handles evaluation** (scoring, identifying weaknesses, suggesting fixes). This separation creates a genuine feedback loop — neither model is grading its own work.
## 🎛️ Customization
Skills are plain Markdown files. Fork and customize:
- **`MAX_ROUNDS`** — increase for more thorough iteration (default: 4)
- **`POSITIVE_THRESHOLD`** — adjust the stop condition score
- **Prioritization rules** — change compute limits, what fixes to skip
- **Prompt templates** — tailor the review persona and evaluation criteria
- **`allowed-tools`** — restrict or expand what each skill can do
## License
MIT
Connection Info
You Might Also Like
markitdown
MarkItDown-MCP is a lightweight server for converting URIs to Markdown.
servers
Model Context Protocol Servers
Time
A Model Context Protocol server for time and timezone conversions.
Filesystem
Node.js MCP Server for filesystem operations with dynamic access control.
Sequential Thinking
A structured MCP server for dynamic problem-solving and reflective thinking.
git
A Model Context Protocol server for Git automation and interaction.