<div align="center">
<img src="assets/light-mode-logo.png" alt="MimikaStudio Logo" width="400"/>
<br><br>
<code>NEW</code> macOS · Windows · Web · CUDA · Apple MPS
<br><br>
<h1>Clone any voice <i>in seconds</i></h1>
<p>Local-first voice cloning, text-to-speech, PDF reader, and audiobook creator.<br>Runs on macOS (MPS), Windows (CUDA), and Web. Tested on RTX 4090 & 5090.</p>
<br>
<a href="https://boltzmannentropy.github.io/mimikastudio.github.io/"><strong>Get Started</strong></a> · <a href="https://github.com/BoltzmannEntropy/MimikaStudio"><strong>View on GitHub</strong></a>
<br><br>
macOS (MPS) · Windows (CUDA) · Web UI · Free & Open Source
<br><br>
</div>
> **Custom Voice Cloning** | **Text-to-Speech** | **PDF Read Aloud** | **Audiobook Creator** | **MCP & API Dashboard**
A local-first application for **macOS (MPS), Windows (CUDA), and Web**, with four core capabilities: **clone any voice** from just 3 seconds of audio using three voice cloning engines (Qwen3-TTS, Chatterbox, IndexTTS-2), generate **high-quality text-to-speech** with multiple engines and premium voices, **read PDFs aloud** with sentence-by-sentence highlighting, and **convert documents to audiobooks** with your choice of voice.
### Supported Models
| Model | Parameters | Type | Languages |
|-------|-----------|------|-----------|
| [Kokoro-82M](https://github.com/hexgrad/kokoro) | 82M | Fast TTS | English (British RP + American) |
| [Qwen3-TTS 0.6B Base](https://github.com/QwenLM/Qwen3-TTS) | 600M | Voice Cloning | 10 languages |
| [Qwen3-TTS 1.7B Base](https://github.com/QwenLM/Qwen3-TTS) | 1.7B | Voice Cloning | 10 languages |
| [Qwen3-TTS 0.6B CustomVoice](https://github.com/QwenLM/Qwen3-TTS) | 600M | Preset Speakers | 4 languages (en, zh, ja, ko) |
| [Qwen3-TTS 1.7B CustomVoice](https://github.com/QwenLM/Qwen3-TTS) | 1.7B | Preset Speakers | 4 languages (en, zh, ja, ko) |
| [Chatterbox Multilingual](https://github.com/resemble-ai/chatterbox) | — | Voice Cloning | 23 languages |
| [IndexTTS-2](https://github.com/IndexTeam/IndexTTS) | — (~24GB) | Voice Cloning | Multilingual |


---
## Audio Samples
All samples below were generated using philosophical texts. Click to listen.
### Kokoro TTS (Fast British/American Voices)
| Voice | Sample |
|-------|--------|
| **Emma** (British RP Female) | [kokoro-bf_emma-sample.wav](assets/audio-samples/kokoro-bf_emma-sample.wav) |
| **George** (British Male) | [kokoro-bm_george-sample.wav](assets/audio-samples/kokoro-bm_george-sample.wav) |
| **Lily** (British Female) | [kokoro-bf_lily-sample.wav](assets/audio-samples/kokoro-bf_lily-sample.wav) |
> *"The unexamined life is not worth living. To find yourself, think for yourself. I know that I know nothing, and in this lies my wisdom."*
### Qwen3-TTS CustomVoice (Preset Speakers)
| Speaker | Sample |
|---------|--------|
| **Ryan** (English, dynamic male) | [qwen3-custom-ryan-sample.wav](assets/audio-samples/qwen3-custom-ryan-sample.wav) |
| **Aiden** (English, sunny male) | [qwen3-custom-aiden-sample.wav](assets/audio-samples/qwen3-custom-aiden-sample.wav) |
> *"The unexamined life is not worth living. To find yourself, think for yourself. I know that I know nothing, and in this lies my wisdom."*
### Qwen3-TTS Voice Clone (3-Second Cloning)
| Cloned Voice | Sample |
|--------------|--------|
| **Natasha** | [qwen3-clone-natasha-sample.wav](assets/audio-samples/qwen3-clone-natasha-sample.wav) |
| **Suzan** | [qwen3-clone-suzan-sample.wav](assets/audio-samples/qwen3-clone-suzan-sample.wav) |
> *"We are what we repeatedly do. Excellence, then, is not an act, but a habit. It is the mark of an educated mind to be able to entertain a thought without accepting it."*
### Chatterbox Multilingual Voice Clone
| Cloned Voice | Sample |
|--------------|--------|
| **Natasha** | [chatterbox-clone-natasha-sample.wav](assets/audio-samples/chatterbox-clone-natasha-sample.wav) |
| **Suzan** | [chatterbox-clone-suzan-sample.wav](assets/audio-samples/chatterbox-clone-suzan-sample.wav) |
> *"Happiness depends upon ourselves. Knowing yourself is the beginning of all wisdom. The energy of the mind is the essence of life."*
---
## Installation
### System Requirements
| Component | Requirement |
|-----------|-------------|
| **OS** | macOS 12+ (Monterey or later) |
| **CPU** | Apple Silicon (M1/M2/M3/M4) or Intel |
| **RAM** | 8GB minimum, 16GB+ recommended |
| **Storage** | 10GB for models and dependencies |
| **Python** | 3.10 or later |
| **Flutter** | 3.x with desktop support (**required** for macOS, Windows, and web UI) |
### Flutter Requirement
**Flutter is required** to run the MimikaStudio GUI on **macOS**, **Windows**, and **web**. The backend (API server) runs without Flutter, but for the desktop or web UI you must install Flutter SDK 3.x:
- **macOS**: `brew install --cask flutter` or follow [Flutter macOS install guide](https://docs.flutter.dev/get-started/install/macos)
- **Windows**: Follow the [Flutter Windows install guide](https://docs.flutter.dev/get-started/install/windows)
- **Web**: Same Flutter SDK; the web UI is launched via `./bin/mimikactl up --web`
### Automated Install (Recommended)
A single `install.sh` in the project root handles everything: prerequisites,
virtual environment, all Python dependencies (including Qwen3-TTS, Chatterbox,
OmegaConf, Perth, etc.), database setup, and Flutter.
```bash
git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
./install.sh
```
The script will:
1. Check / install Homebrew, Python 3, espeak-ng, and ffmpeg
2. Create a Python venv in the project root (`./venv`)
3. Install **all** Python dependencies from the root `requirements.txt`
4. Install `chatterbox-tts` with `--no-deps` (its runtime deps are already in `requirements.txt`)
5. Download the **Dicta ONNX** Hebrew diacritizer model (~1.1 GB) for Chatterbox Hebrew TTS (skip with `SKIP_DICTA=1`)
6. Verify that every critical import works
7. Initialize the SQLite database
8. Set up Flutter (if installed)
Note: `./install.sh` creates the Python virtual environment and installs large dependencies, so the first run can take a few minutes.
After installation, start MimikaStudio:
```bash
source venv/bin/activate
./bin/mimikactl up # Backend + MCP + Flutter desktop
./bin/mimikactl up --web # Backend + MCP + Flutter web UI
```
### Manual Install
```bash
git clone https://github.com/BoltzmannEntropy/MimikaStudio.git
cd MimikaStudio
# System dependencies (macOS)
brew install espeak-ng ffmpeg python@3.11
# Python venv
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
# All Python dependencies (from project root)
pip install -r requirements.txt
# Chatterbox TTS (--no-deps to avoid version conflicts with its strict pins)
pip install --no-deps chatterbox-tts==0.1.6
# Initialize database
cd backend && python3 database.py && cd ..
# Flutter (needed only for the desktop/web UI; skip for backend-only use)
cd flutter_app && flutter pub get && cd ..
# Start
./bin/mimikactl up
```
### Download Models (Optional)
Models auto-download on first use (~3 GB total). To pre-download:
```bash
./bin/mimikactl models download kokoro # ~300 MB
./bin/mimikactl models download qwen3 # ~4 GB for 1.7B
```
The **Dicta ONNX** Hebrew diacritizer (~1.1 GB) is downloaded by `install.sh` automatically. If you skipped it (`SKIP_DICTA=1`) and need Hebrew TTS later, run:
```bash
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
```
### Verify Installation
```bash
source venv/bin/activate
python -c "import kokoro; print('Kokoro OK')"
python -c "from qwen_tts import QwenTTS; print('Qwen3-TTS OK')"
python -c "from chatterbox import ChatterboxTTS; print('Chatterbox OK')"
python -c "import omegaconf; print('OmegaConf OK')"
python -c "import perth; print('Perth OK')"
```
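The five checks above can also be run in one pass. The following is a minimal sketch rather than a script shipped with the repository: `missing_modules` is a hypothetical helper name, and it only confirms that each top-level package can be located, not that the specific classes import cleanly.

```python
import importlib.util

# Top-level packages named in the verification commands above.
REQUIRED = ["kokoro", "qwen_tts", "chatterbox", "omegaconf", "perth"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages found")
```

Run it inside the activated `venv` so that the lookup uses the same environment as the backend.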
---
## Quick Start
```bash
# Start all services (Backend + MCP + Flutter UI)
./bin/mimikactl up
# Or: Backend + MCP + Flutter Web UI
./bin/mimikactl up --web
# Then open http://127.0.0.1:5173
# Or: Backend + MCP only (no Flutter)
./bin/mimikactl up --no-flutter
# Check status
./bin/mimikactl status
# View logs
./bin/mimikactl logs backend
```
Example startup output:
```
=== Starting MimikaStudio ===
Starting backend...
Waiting for http://localhost:8000/api/health ...... OK
Starting MCP Server...
MCP Server started on port 8010
Starting Flutter UI (dev mode)...
```
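If you script against the backend, you may want to wait for the same health endpoint that the startup output polls. This is a rough sketch under two assumptions: that `/api/health` answers with HTTP 200 once the backend is ready, and that a one-second polling interval is acceptable. `wait_for_health` and its `probe` parameter are hypothetical names introduced for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:8000/api/health", timeout=60.0,
                    interval=1.0, probe=None):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    def default_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused or HTTP error: backend is not ready yet.
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The `probe` argument exists only so the loop can be exercised without a running server; in normal use, call `wait_for_health()` with no arguments.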

---
## Platforms
MimikaStudio ships two UIs backed by the same local FastAPI server:
**macOS Desktop App** (default): `./bin/mimikactl up`
**Web UI** (Flutter Web): `./bin/mimikactl up --web` then open http://127.0.0.1:5173
> The web UI uses the same backend and voice library as the desktop app.
> In web mode, use **Open Document** to upload PDFs from your machine.

---
## Why MimikaStudio?
MimikaStudio brings together the latest advances in neural text-to-speech into a unified desktop experience.
### Lightning-Fast British TTS with Kokoro
**[Kokoro TTS](https://github.com/hexgrad/kokoro)** delivers sub-200ms latency with crystal-clear British and American accents. The 82M parameter model runs effortlessly on any machine, generating natural-sounding speech with Emma, George, Lily, and other premium voices.
Kokoro also includes **Emma IPA** - a British phonetic transcription tool powered by your choice of LLM (Claude, OpenAI, Ollama).

### Voice Cloning Without Limits
Clone any voice from remarkably short audio samples. **[Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS)** requires just **3 seconds** of reference audio to capture a speaker's characteristics. Upload a voice memo, a podcast clip, or any audio snippet, and MimikaStudio will synthesize new speech in that voice.
For multilingual cloning, **[Chatterbox Multilingual TTS](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS)** supports 23 languages, while **[IndexTTS-2](https://github.com/IndexTeam/IndexTTS)** delivers high-fidelity clones with its large ~24GB model. All three engines share a unified voice library — upload a voice sample once and use it across all cloning engines.
### Premium Preset Speakers
MimikaStudio includes **9 premium preset speakers** across 4 languages (English, Chinese, Japanese, Korean), each with distinct personalities. These CustomVoice speakers require no audio samples at all.
### Multiple State-of-the-Art Models
| Model | Type | Strength |
|-------|------|----------|
| **[Kokoro-82M](https://github.com/hexgrad/kokoro)** | Fast TTS | Sub-200ms latency, British RP & American accents |
| **[Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) 0.6B/1.7B Base** | Voice Cloning | 3-second cloning, 10 languages |
| **[Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) 0.6B/1.7B CustomVoice** | Preset Speakers | 9 premium voices, style control |
| **[Chatterbox Multilingual TTS](https://huggingface.co/spaces/ResembleAI/Chatterbox-Multilingual-TTS)** | Voice Cloning | Multilingual cloning with prompt audio |
| **[IndexTTS-2](https://github.com/IndexTeam/IndexTTS)** | Voice Cloning | High-quality cloning, large model (~24GB) |

### Beyond Simple TTS
- **Emma IPA Transcription**: British IPA-like phonetic transcriptions using LLM providers (Claude, OpenAI, Ollama)
- **PDF Reader with Voice**: Read PDFs aloud with sentence-by-sentence highlighting
- **Audiobook Creator**: Convert documents (PDF, EPUB, TXT, MD, DOCX) into WAV/MP3/M4B audiobooks with smart chunking, crossfade merging, progress tracking, and chapter markers
- **Shared Voice Library**: Voice samples shared across all cloning engines (Qwen3, Chatterbox, IndexTTS-2)
- **Model Manager**: In-app model download manager — check status and download models on demand
- **Advanced Generation Controls**: Temperature, top_p, top_k, repetition penalty, seed
- **Style Instructions**: Tell speakers *how* to speak - "whisper softly", "speak with excitement", etc.
- **Real-time System Monitoring**: CPU, RAM, and GPU usage in the app header
- **Multi-LLM Support**: Claude, OpenAI, Ollama (local), or Claude Code CLI

---
## Features
- **Qwen3-TTS Voice Clone**: Clone any voice from as little as 3 seconds of audio
- **Qwen3-TTS Custom Voice**: 9 preset premium speakers (Ryan, Aiden, Vivian, Serena, Uncle Fu, Dylan, Eric, Ono Anna, Sohee)
- **Chatterbox Voice Clone**: Multilingual voice cloning with prompt audio
- **IndexTTS-2 Voice Clone**: High-quality voice cloning with a large model (~24GB)
- **Shared Voice Library**: Voice samples uploaded to any engine are available across all voice cloning models
- **Model Manager**: In-app UI to check model download status and download models on demand
- **Advanced Generation Controls**: Temperature, top_p, top_k, repetition penalty, seed
- **Model Size Selection**: 0.6B (Fast) or 1.7B (Quality)
- **Kokoro TTS**: Fast, high-quality English synthesis with 21 British/American voices
- **Default Voice Samples**: Natasha and Suzan ship with the app; user uploads stored in `backend/data/user_voices/`
- **User Voices in UI**: Uploaded voices appear under each engine's **Your Voices** section after refresh
- **Voice Previews**: Tap play/pause/stop to audition voices before generating
- **Emma IPA**: British phonetic transcription with multi-LLM support (Claude, OpenAI, Ollama)
- **Document Reader**: Read PDFs, TXT, and MD files aloud with Kokoro TTS
- **Audiobook Creator**: Convert full documents to audiobook files (WAV/MP3/M4B) with smart chunking, crossfade merging, progress tracking, and playback controls
- **CLI Tool**: Full command-line interface for Kokoro and Qwen3
- **MCP & API Dashboard**: Built-in tab showing all MCP tools and REST endpoints with live server status
- **MCP Server**: Full MCP integration for programmatic access to all API endpoints
- **Windows Installer**: PyInstaller + Inno Setup build script for standalone Windows distribution
- **60+ REST API endpoints** with FastAPI (auto-documented at `/docs`)
---
## Control Script (mimikactl)
```bash
# Service Commands
./bin/mimikactl up # Start all services
./bin/mimikactl up --no-flutter # Backend + MCP only
./bin/mimikactl up --web # Backend + MCP + Flutter Web UI
./bin/mimikactl down # Stop all services
./bin/mimikactl restart # Restart all
./bin/mimikactl status # Check status
# Backend Commands
./bin/mimikactl backend start # Start backend only
./bin/mimikactl backend stop # Stop backend
# Flutter Commands
./bin/mimikactl flutter start # Start Flutter (release mode)
./bin/mimikactl flutter start --dev # Start in dev mode
./bin/mimikactl flutter start --web # Start Flutter Web UI
./bin/mimikactl flutter stop # Stop Flutter
./bin/mimikactl flutter build # Build macOS app
# MCP Server Commands
./bin/mimikactl mcp start # Start MCP server (port 8010)
./bin/mimikactl mcp stop # Stop MCP server
./bin/mimikactl mcp status # Check MCP status
# Utility Commands
./bin/mimikactl logs [service] # Tail logs (backend|mcp|flutter|all)
./bin/mimikactl test # Run API tests
./bin/mimikactl clean # Clean logs and temp files
./bin/mimikactl version # Show version info
```
---
## CLI Tool (mimika)
Full command-line interface for voice cloning and TTS generation.
### Quick Examples
```bash
# Kokoro TTS (fast British/American voices)
./bin/mimika kokoro "Hello, world!" --voice bf_emma --output hello.wav
./bin/mimika kokoro input.txt --voice bm_george --speed 1.2
# Qwen3 Custom Voice (preset speakers)
./bin/mimika qwen3 "Hello, world!" --speaker Ryan --style "professional narration"
./bin/mimika qwen3 book.epub --speaker Sohee --output audiobook.wav
# Qwen3 Voice Clone (clone from reference audio)
./bin/mimika qwen3 "Hello, world!" --clone --reference Alina.wav
./bin/mimika qwen3 book.pdf --clone --reference Bella.wav --output book.wav
# List available voices
./bin/mimika voices --engine kokoro
./bin/mimika voices --engine qwen3
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MIMIKA_API_URL` | `http://localhost:8000` | Backend API URL |
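
Your own scripts can honor the same variable. The sketch below shows one way to do that; it is not how the `mimika` CLI itself reads the setting (that code is not shown here), and `api_url` is a hypothetical helper name.

```python
import os

DEFAULT_API_URL = "http://localhost:8000"  # default from the table above


def api_url(path, env=None):
    """Join the configured backend URL with an API path such as /api/health."""
    env = os.environ if env is None else env
    base = env.get("MIMIKA_API_URL", DEFAULT_API_URL).rstrip("/")
    return base + "/" + path.lstrip("/")
```

For example, with `MIMIKA_API_URL` unset, `api_url("/api/health")` resolves against `http://localhost:8000`.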
### `mimika kokoro` - Fast British/American TTS
```bash
./bin/mimika kokoro <input> [options]
```
| Parameter | Short | Default | Description |
|-----------|-------|---------|-------------|
| `input` | | *required* | Text string or file path (.txt, .pdf, .epub, .docx, .doc) |
| `--voice` | `-v` | `bf_emma` | Voice ID (see `mimika voices --engine kokoro`) |
| `--speed` | `-s` | `1.0` | Speech speed multiplier (0.5-2.0) |
| `--output` | `-o` | `<input>.wav` | Output WAV file path |
**Available Kokoro Voices:**
| Voice ID | Name | Gender | Accent |
|----------|------|--------|--------|
| `bf_emma` | Emma | Female | British RP |
| `bf_isabella` | Isabella | Female | British |
| `bf_alice` | Alice | Female | British |
| `bf_lily` | Lily | Female | British |
| `bm_george` | George | Male | British |
| `bm_lewis` | Lewis | Male | British |
| `bm_daniel` | Daniel | Male | British |
| `af_heart` | Heart | Female | American |
| `af_bella` | Bella | Female | American |
| `af_nicole` | Nicole | Female | American |
| `af_aoede` | Aoede | Female | American |
| `af_kore` | Kore | Female | American |
| `af_sarah` | Sarah | Female | American |
| `af_sky` | Sky | Female | American |
| `am_michael` | Michael | Male | American |
| `am_adam` | Adam | Male | American |
| `am_echo` | Echo | Male | American |
| `am_liam` | Liam | Male | American |
| `am_onyx` | Onyx | Male | American |
| `am_puck` | Puck | Male | American |
| `am_santa` | Santa | Male | American |
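
The voice IDs in the table follow a visible pattern: the first letter is the accent (`b` British, `a` American), the second is the gender (`f`, `m`), and the name follows the underscore. The sketch below decodes an ID on that assumption. `describe_voice` is a hypothetical helper, and the pattern is inferred from the table rather than documented, so it does not capture details such as Emma's "British RP" label.

```python
ACCENTS = {"b": "British", "a": "American"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(voice_id):
    """Split an ID like 'bf_emma' into (name, gender, accent)."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID: {voice_id!r}")
    try:
        accent, gender = ACCENTS[prefix[0]], GENDERS[prefix[1]]
    except KeyError:
        raise ValueError(f"unexpected voice ID: {voice_id!r}") from None
    return name.capitalize(), gender, accent
```

For the authoritative list, prefer `mimika voices --engine kokoro` over decoding IDs by hand.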
### `mimika qwen3` - Voice Clone & Custom Voice
```bash
./bin/mimika qwen3 <input> [options]
```
**Common Parameters:**
| Parameter | Short | Default | Description |
|-----------|-------|---------|-------------|
| `input` | | *required* | Text string or file path (.txt, .pdf, .epub, .docx, .doc) |
| `--output` | `-o` | `<input>.wav` | Output WAV file path |
| `--model` | `-m` | `1.7B` | Model size: `0.6B` (fast) or `1.7B` (quality) |
| `--language` | `-l` | `auto` | Language code (auto, en, zh, ja, ko, de, fr, ru, pt, es, it) |
| `--temperature` | | `0.9` | Generation randomness (0.1-2.0) |
| `--top-p` | | `0.9` | Nucleus sampling threshold (0.1-1.0) |
| `--top-k` | | `50` | Top-k sampling (1-100) |
**Custom Voice Mode (Preset Speakers):**
| Parameter | Short | Default | Description |
|-----------|-------|---------|-------------|
| `--speaker` | | `Ryan` | Preset speaker name |
| `--style` | | *see below* | Style instruction for voice |
Default style: `"Optimized for engaging, professional audiobook narration"`
**Available Preset Speakers:**
| Speaker | Language | Character |
|---------|----------|-----------|
| `Ryan` | English | Dynamic male, strong rhythm |
| `Aiden` | English | Sunny American male |
| `Vivian` | Chinese | Bright young female |
| `Serena` | Chinese | Warm gentle female |
| `Uncle_Fu` | Chinese | Seasoned male, mellow timbre |
| `Dylan` | Chinese | Beijing youthful male |
| `Eric` | Chinese | Sichuan lively male |
| `Ono_Anna` | Japanese | Playful female |
| `Sohee` | Korean | Warm emotional female |
**Voice Clone Mode:**
| Parameter | Short | Default | Description |
|-----------|-------|---------|-------------|
| `--clone` | | *flag* | Enable voice cloning mode |
| `--reference` | `-r` | *required* | Reference audio file (WAV, 3+ seconds) |
| `--reference-text` | | *optional* | Transcript of reference audio (improves quality) |
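
Because the reference clip must be a WAV file of at least 3 seconds, it can be worth checking the duration before starting a long job. Here is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes an uncompressed PCM WAV that Python's `wave` module can read.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum stated in the table above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path, minimum=MIN_REFERENCE_SECONDS):
    """True when the reference clip meets the minimum duration."""
    return wav_duration_seconds(path) >= minimum
```

A clip that passes this check may still clone poorly if it is noisy or mostly silent; the check covers duration only.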
### `mimika voices` - List Available Voices
```bash
./bin/mimika voices [--engine kokoro|qwen3]
```
### Supported File Formats
| Format | Extension | Requirements |
|--------|-----------|--------------|
| Plain Text | `.txt` | Built-in |
| PDF | `.pdf` | `PyPDF2`, `pymupdf` |
| EPUB | `.epub` | `ebooklib`, `beautifulsoup4` |
| Word Document | `.docx` | `python-docx` |
| Legacy Word | `.doc` | `docx2txt` |
| Markdown | `.md` | Built-in |
All format dependencies are included in `requirements.txt`.
---
## TTS Engines
### Kokoro TTS + Emma IPA
Fast, high-quality English synthesis with British and American voices (82M parameters, 24kHz) and integrated IPA transcription.
**Emma IPA** generates British phonetic transcriptions using your choice of LLM provider:
- **Claude** (Anthropic) - claude-sonnet-4, claude-opus-4, claude-haiku-3
- **OpenAI** - gpt-4, gpt-4-turbo, gpt-3.5-turbo
- **Ollama** (Local) - Any locally installed model
- **Claude Code CLI** - Use your local Claude CLI

### Qwen3-TTS
#### Voice Clone Mode (Base)
Clone any voice from as little as 3 seconds of reference audio.
**Models**:
- `Qwen3-TTS-12Hz-0.6B-Base` - Fast, 1.4GB
- `Qwen3-TTS-12Hz-1.7B-Base` - Higher quality, 3.6GB
**Languages**: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
**How It Works**:
1. Upload a 3+ second audio sample
2. (Optional) Provide transcript for better quality
3. Enter text to synthesize
4. Adjust generation parameters if needed
5. Generate!
#### Custom Voice Mode (Preset Speakers)
Use 9 premium preset speakers without any reference audio.
**Models**:
- `Qwen3-TTS-12Hz-0.6B-CustomVoice`
- `Qwen3-TTS-12Hz-1.7B-CustomVoice`
**Style Instructions**: Control tone with prompts like "Speak slowly", "Very happy", "Whisper", or use "Optimized for engaging, professional audiobook narration" for long-form content.
#### Advanced Parameters
| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| Temperature | 0.9 | 0.1-2.0 | Randomness in generation |
| Top P | 0.9 | 0.1-1.0 | Nucleus sampling threshold |
| Top K | 50 | 1-100 | Top-k sampling |
| Repetition Penalty | 1.0 | 1.0-2.0 | Reduce repetition |
| Seed | -1 | -1 or 0+ | Reproducible generation (-1=random) |
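
When calling the engine from a script, it can help to validate parameters against these ranges before sending a request. The sketch below encodes the table as data. The snake_case key names and the `generation_params` helper are assumptions for illustration; the actual request field names should be checked in the interactive API docs at `/docs`.

```python
# (minimum, maximum, default) taken from the table above.
RANGES = {
    "temperature": (0.1, 2.0, 0.9),
    "top_p": (0.1, 1.0, 0.9),
    "top_k": (1, 100, 50),
    "repetition_penalty": (1.0, 2.0, 1.0),
}


def generation_params(seed=-1, **overrides):
    """Return a full parameter dict, rejecting out-of-range values."""
    unknown = set(overrides) - set(RANGES)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if not isinstance(seed, int) or seed < -1:
        raise ValueError("seed must be -1 (random) or a non-negative integer")
    params = {"seed": seed}
    for name, (low, high, default) in RANGES.items():
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        params[name] = value
    return params
```

Rejecting bad values, rather than silently clamping them, keeps a fixed-seed run reproducible with exactly the parameters you asked for.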

### Chatterbox Multilingual TTS
Chatterbox adds multilingual voice cloning from a reference audio prompt. It uses the same voice library flow as Qwen3 (default samples + your uploads).
**23 Supported Languages**:
| Code | Language | Code | Language | Code | Language |
|------|----------|------|----------|------|----------|
| ar | Arabic | he | Hebrew | no | Norwegian |
| da | Danish | hi | Hindi | pl | Polish |
| de | German | it | Italian | pt | Portuguese |
| el | Greek | ja | Japanese | ru | Russian |
| en | English | ko | Korean | sv | Swedish |
| es | Spanish | ms | Malay | sw | Swahili |
| fi | Finnish | nl | Dutch | tr | Turkish |
| fr | French | | | zh | Chinese |
**Parameters**:
- Temperature (randomness)
- CFG weight (conditioning strength)
- Exaggeration (style intensity)
- Seed (reproducibility)
**Hebrew TTS**: Chatterbox Hebrew requires the **Dicta ONNX** diacritizer model (`dicta-1.0.onnx`, ~1.1 GB) which adds vowel marks (nikud) to unvocalized Hebrew text before synthesis. Without it, Hebrew output quality is severely degraded. The model is downloaded automatically by `install.sh` (skip with `SKIP_DICTA=1`) and stored at `backend/models/dicta-onnx/dicta-1.0.onnx`. To download manually:
```bash
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
```
**Note**: On Apple Silicon, Chatterbox runs on CPU due to MPS resampling limitations.

### IndexTTS-2
High-quality voice cloning with a large model (~24GB). IndexTTS-2 produces natural-sounding clones from reference audio.
**Model**: `IndexTeam/IndexTTS-v2` (~24GB, auto-downloaded from HuggingFace on first use)
**How It Works**:
1. Upload a reference audio sample (WAV)
2. Enter text to synthesize
3. Adjust speed if needed
4. Generate!
**Parameters**:
- Speed (playback speed multiplier)
- Unload after generation (free GPU memory)
**Note**: IndexTTS-2 requires significant disk space (~24GB) and benefits from CUDA GPU acceleration.
---
## API Reference
The backend exposes 60+ REST endpoints via FastAPI. Full interactive docs at **http://localhost:8000/docs**.
### System
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check |
| `/api/system/info` | GET | System information (Python, device, models, OS) |
| `/api/system/stats` | GET | Real-time CPU/RAM/GPU usage |
### Kokoro TTS
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/kokoro/generate` | POST | Generate speech with Kokoro |
| `/api/kokoro/voices` | GET | List available voices |
| `/api/kokoro/audio/list` | GET | List generated audio files |
| `/api/kokoro/audio/{filename}` | DELETE | Delete audio file |
### Qwen3-TTS
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/qwen3/generate` | POST | Generate audio (clone or custom mode) |
| `/api/qwen3/generate/stream` | POST | Streaming audio generation |
| `/api/qwen3/voices` | GET | List saved voice samples |
| `/api/qwen3/voices` | POST | Upload new voice sample |
| `/api/qwen3/voices/{name}` | PUT | Update voice sample |
| `/api/qwen3/voices/{name}` | DELETE | Delete voice sample |
| `/api/qwen3/voices/{name}/audio` | GET | Preview voice sample audio |
| `/api/qwen3/speakers` | GET | List 9 preset speakers |
| `/api/qwen3/models` | GET | List available models |
| `/api/qwen3/languages` | GET | List supported languages |
| `/api/qwen3/info` | GET | Model info and status |
| `/api/qwen3/clear-cache` | POST | Clear voice prompt cache |
### Chatterbox TTS
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/chatterbox/generate` | POST | Generate speech (voice clone) |
| `/api/chatterbox/voices` | GET | List saved voice samples |
| `/api/chatterbox/voices` | POST | Upload new voice sample |
| `/api/chatterbox/voices/{name}` | PUT | Update voice sample |
| `/api/chatterbox/voices/{name}` | DELETE | Delete voice sample |
| `/api/chatterbox/voices/{name}/audio` | GET | Preview voice sample audio |
| `/api/chatterbox/languages` | GET | List supported languages |
| `/api/chatterbox/info` | GET | Model info |
### IndexTTS-2
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/indextts2/generate` | POST | Generate speech (voice clone) |
| `/api/indextts2/voices` | GET | List saved voice samples |
| `/api/indextts2/voices` | POST | Upload new voice sample |
| `/api/indextts2/voices/{name}` | PUT | Update voice sample |
| `/api/indextts2/voices/{name}` | DELETE | Delete voice sample |
| `/api/indextts2/voices/{name}/audio` | GET | Preview voice sample audio |
| `/api/indextts2/info` | GET | Model info |
### Model Management
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/models/status` | GET | Check download status of all models |
| `/api/models/{model_name}/download` | POST | Trigger HuggingFace model download |
### Unified Voices
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/voices/custom` | GET | All custom voices across all engines |
### Audiobook Creator
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/audiobook/generate` | POST | Start audiobook generation from text |
| `/api/audiobook/generate-from-file` | POST | Generate from uploaded file (PDF/EPUB/TXT/DOCX) |
| `/api/audiobook/status/{job_id}` | GET | Job progress (chars/sec, ETA, chapters) |
| `/api/audiobook/cancel/{job_id}` | POST | Cancel in-progress job |
| `/api/audiobook/list` | GET | List generated audiobooks |
| `/api/audiobook/{job_id}` | DELETE | Delete audiobook file |
**Performance**: ~60 chars/sec on M2 MacBook Pro CPU.
**Output Formats**: WAV (lossless), MP3 (compressed), M4B (audiobook with chapter markers).
**Subtitle Formats**: SRT (VLC-compatible), VTT (web-compatible).
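
The two subtitle formats use nearly the same timestamp layout: SRT separates milliseconds with a comma (`HH:MM:SS,mmm`), while VTT uses a period (`HH:MM:SS.mmm`). The sketch below illustrates that difference only. It is not MimikaStudio's subtitle writer, and `format_timestamp` is a hypothetical helper.

```python
def format_timestamp(seconds, style="srt"):
    """Format a time offset as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    if style not in ("srt", "vtt"):
        raise ValueError("style must be 'srt' or 'vtt'")
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    separator = "," if style == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"
```

Working in whole milliseconds avoids floating-point drift when many cue times are computed from accumulated audio durations.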
### Audio Library
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/tts/audio/list` | GET | List Kokoro-generated audio |
| `/api/tts/audio/{filename}` | DELETE | Delete TTS audio file |
| `/api/voice-clone/audio/list` | GET | List Qwen3/Chatterbox/IndexTTS-2 clone audio |
| `/api/voice-clone/audio/{filename}` | DELETE | Delete clone audio file |
### Emma IPA
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/ipa/sample` | GET | Default sample text |
| `/api/ipa/samples` | GET | All saved IPA samples |
| `/api/ipa/generate` | POST | Generate British IPA transcription |
| `/api/ipa/pregenerated` | GET | Pregenerated IPA with audio |
| `/api/ipa/save-output` | POST | Save IPA output to history |
### LLM Configuration
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/llm/config` | GET | Current LLM config |
| `/api/llm/config` | POST | Update LLM provider settings |
| `/api/llm/ollama/models` | GET | List local Ollama models |
### Samples & Pregenerated Content
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/samples/{engine}` | GET | Sample texts for engine |
| `/api/pregenerated` | GET | Pregenerated audio samples |
| `/api/voice-samples` | GET | Voice sample sentences with audio |
### Audiobook Generation Example
```bash
# Start generation
curl -X POST http://localhost:8000/api/audiobook/generate \
-H "Content-Type: application/json" \
-d '{"text": "Your document text...", "title": "My Audiobook", "voice": "bf_emma", "output_format": "m4b"}'
# From file
curl -X POST http://localhost:8000/api/audiobook/generate-from-file \
-F "file=@mybook.pdf" -F "title=My Audiobook" -F "voice=bf_emma" -F "output_format=m4b"
# Poll progress (set JOB_ID to the job ID of the generation you started)
curl "http://localhost:8000/api/audiobook/status/$JOB_ID"
```
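The same JSON request can be built from Python with only the standard library. This sketch mirrors the fields in the curl example above (`text`, `title`, `voice`, `output_format`). The helper names are hypothetical, and the response schema is not documented in this README, so the code stops at building the request and the status URL.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def audiobook_request(text, title, voice="bf_emma", output_format="m4b"):
    """Build the POST request for /api/audiobook/generate (not yet sent)."""
    body = json.dumps({
        "text": text,
        "title": title,
        "voice": voice,
        "output_format": output_format,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/audiobook/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(job_id):
    """URL to poll for a job's progress."""
    return f"{BASE_URL}/api/audiobook/status/{job_id}"
```

To send the request, pass it to `urllib.request.urlopen(...)` and inspect the JSON reply to find the job ID, then poll `status_url(job_id)`.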
---
## MCP Server
MimikaStudio includes a full MCP (Model Context Protocol) server that exposes every API endpoint as MCP tools for programmatic access via Claude Code CLI, Codex, or any MCP-compatible client.
**Start:** `./bin/mimikactl mcp start` (port 8010)
The MCP server provides 50+ tools for:
- TTS generation (Kokoro, Qwen3, Chatterbox, IndexTTS-2)
- Voice management (list, upload, delete, update, preview)
- Audiobook generation and management
- System info and monitoring
- LLM configuration
- IPA transcription
- Audio library management
### MCP & API Dashboard (In-App)
The **MCP & API** tab in the Flutter app provides a live dashboard showing:
- **Server status** — Backend API (port 8000), MCP Server (port 8010), and API Docs availability with green/red indicators
- **All MCP tools** grouped by category (System, Kokoro, Qwen3, Chatterbox, IndexTTS-2, Audiobook, Voice Management, Models, Samples, LLM, IPA) with expandable parameter details
- **All 60+ REST API endpoints** grouped by category with HTTP method badges (GET/POST/PUT/DELETE)
- **Search** — Filter tools and endpoints by name, path, or description
The dashboard fetches MCP tools live from the MCP server via JSON-RPC, so it always reflects the current tool set.
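For reference, tool discovery in MCP is a JSON-RPC 2.0 call to the `tools/list` method, and the result carries a `tools` array whose entries each have a `name`. The sketch below builds that payload and reads tool names from a reply. How the server on port 8010 expects the request to be delivered (URL path, session initialization) is not shown in this README, so transport is left out, and both helper names are hypothetical.

```python
import json


def tools_list_request(request_id=1):
    """JSON-RPC 2.0 payload asking an MCP server to enumerate its tools."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


def tool_names(response_text):
    """Extract tool names from a tools/list response body."""
    response = json.loads(response_text)
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "MCP error"))
    return [tool["name"] for tool in response["result"]["tools"]]
```

Most MCP-compatible clients perform this exchange for you, so hand-built requests are mainly useful for debugging.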

---
## Running Tests
```bash
source venv/bin/activate
cd backend
# Run all tests (fast, no model loading required)
pytest tests/ -v
# Run specific test file
pytest tests/test_all_endpoints.py -v
# Run with actual model tests (slow, requires models downloaded)
RUN_MODEL_TESTS=1 pytest tests/
```
---
## Architecture
```
MimikaStudio/
├── install.sh # Single install script (run this first)
├── requirements.txt # All Python dependencies
├── venv/ # Python virtual environment (created by install.sh)
│
├── bin/
│ ├── mimikactl # Service control script
│ ├── mimika # CLI tool for TTS/voice cloning
│ └── tts_mcp_server.py # MCP server for programmatic access
│
├── pdf/ # Place PDFs here for the PDF Reader
│
├── flutter_app/ # Flutter desktop + web application (~10,100 lines Dart)
│ ├── lib/
│ │ ├── main.dart # App entry, 6-tab navigation + Model Manager
│ │ ├── screens/
│ │ │ ├── quick_tts_screen.dart # Kokoro TTS + Emma IPA
│ │ │ ├── qwen3_clone_screen.dart # Qwen3 voice cloning
│ │ │ ├── chatterbox_clone_screen.dart # Chatterbox voice cloning
│ │ │ ├── indextts2_screen.dart # IndexTTS-2 voice cloning
│ │ │ ├── pdf_reader_screen.dart # PDF reader with TTS
│ │ │ ├── mcp_endpoints_screen.dart # MCP & API dashboard
│ │ │ └── models_dialog.dart # Model download manager
│ │ ├── widgets/
│ │ │ ├── audio_player_widget.dart # Shared audio player
│ │ │ ├── emma_ipa_widget.dart # IPA transcription widget
│ │ │ └── multi_layer_text.dart # Text overlay widget
│ │ └── services/
│ │   └── api_service.dart # Backend API client (823 lines)
│ └── macos/ # macOS configuration
│
├── backend/ # FastAPI Python backend (~8,500 lines Python, 60+ endpoints)
│ ├── main.py # API endpoints (2,078 lines)
│ ├── database.py # SQLite initialization and seeding
│ ├── requirements.txt # (legacy, use root requirements.txt)
│ ├── tts/ # TTS engine wrappers
│ │ ├── kokoro_engine.py
│ │ ├── qwen3_engine.py # Clone + CustomVoice
│ │ ├── chatterbox_engine.py # Multilingual voice clone
│ │ ├── indextts2_engine.py # IndexTTS-2 voice clone
│ │ ├── text_chunking.py # Smart text chunking for audiobooks
│ │ ├── audio_utils.py # Audio processing utilities
│ │ └── audiobook.py # Audiobook generation logic (822 lines)
│ ├── language/
│ │ └── ipa_generator.py # British IPA transcription
│ ├── llm/ # LLM provider integration
│ │ ├── factory.py # Claude, OpenAI, Ollama support
│ │ ├── claude_provider.py
│ │ ├── openai_provider.py
│ │ └── codex_provider.py
│ ├── models/
│ │ ├── registry.py # Model registry (all engines)
│ │ └── dicta-onnx/ # Hebrew diacritizer (~1.1 GB, downloaded by install.sh)
│ ├── tests/ # Comprehensive test suite
│ └── data/
│ ├── samples/ # Shipped voice samples (shared across engines)
│ │ ├── qwen3_voices/ # Natasha, Suzan
│ │ ├── chatterbox_voices/ # Natasha, Suzan, Hebrew_Natasha
│ │ ├── indextts2_voices/
│ │ └── kokoro/ # Pre-generated Kokoro samples
│ ├── user_voices/ # User uploads (git-ignored, shared across engines)
│ │ ├── qwen3/
│ │ ├── chatterbox/
│ │ └── indextts2/
│ └── outputs/ # Generated audio files
│
├── scripts/ # Build & installer scripts
│ ├── build_installer.ps1 # Windows installer build (PyInstaller + Inno Setup)
│ ├── mimikastudio.spec # PyInstaller spec file
│ ├── mimikastudio.iss # Inno Setup installer script
│ ├── install_macos.sh # (legacy, use root install.sh)
│ └── setup.sh # (legacy, use root install.sh)
```
---
## Codebase Statistics
| Language | Lines of Code | Files |
|----------|--------------|-------|
| **Python** (backend, scripts, MCP server) | ~8,500 | 20+ |
| **Dart** (Flutter UI) | ~10,100 | 13 |
| **Total** | **~18,600** | **33+** |
### Python Breakdown
| Directory | Lines | Description |
|-----------|-------|-------------|
| `backend/main.py` | 2,078 | FastAPI endpoints |
| `backend/tts/` | 2,037 | TTS engine wrappers (Kokoro, Qwen3, Chatterbox, IndexTTS-2) |
| `backend/tests/` | 1,567 | Comprehensive test suite |
| `bin/tts_mcp_server.py` | 1,438 | MCP server |
| `backend/llm/` | 409 | LLM provider integration |
| `backend/models/` | 163 | Model registry |
| `scripts/` | 377 | Build & installer scripts |
### Flutter/Dart Breakdown
| Directory | Lines | Description |
|-----------|-------|-------------|
| `lib/screens/` | 7,080 | 7 screens (TTS, Qwen3, Chatterbox, IndexTTS-2, PDF, MCP, Models) |
| `lib/services/` | 823 | API service client |
| `lib/widgets/` | 952 | Shared widgets (audio player, IPA, text overlay) |
| `lib/main.dart` | 270 | App entry + 6-tab navigation |
### Largest Files
| File | Lines |
|------|-------|
| `screens/pdf_reader_screen.dart` | 2,147 |
| `backend/main.py` | 2,078 |
| `screens/qwen3_clone_screen.dart` | 1,482 |
| `bin/tts_mcp_server.py` | 1,438 |
| `screens/chatterbox_clone_screen.dart` | 1,243 |
| `screens/quick_tts_screen.dart` | 1,085 |
| `screens/indextts2_screen.dart` | 1,053 |
| `backend/tests/test_all_endpoints.py` | 927 |
| `services/api_service.dart` | 823 |
| `backend/tts/audiobook.py` | 822 |
---
## Troubleshooting
### Common Issues
**"espeak-ng not found"**
```bash
brew install espeak-ng
```
**"ffmpeg not found" (for MP3/M4B export)**
```bash
brew install ffmpeg
```
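Before starting the backend, it can help to confirm that both command-line tools are on `PATH`. The snippet below is a small sketch using only the Python standard library; the function name `missing_tools` is hypothetical and not part of MimikaStudio.

```python
import shutil

# Tool names taken from the two error messages above.
REQUIRED_TOOLS = ("espeak-ng", "ffmpeg")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found")
```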
**"No module named 'perth'" or "No module named 'omegaconf'"**

These are Chatterbox runtime dependencies. Run `./install.sh`, or install them manually:
```bash
source venv/bin/activate
pip install resemble-perth omegaconf conformer diffusers pyloudnorm pykakasi spacy-pkuseg
pip install --no-deps chatterbox-tts==0.1.6
```
**Hebrew TTS sounds garbled or robotic**

The Dicta ONNX diacritizer model is missing. Chatterbox requires it to add vowel marks (nikud) to Hebrew text. Download it:
```bash
mkdir -p backend/models/dicta-onnx
curl -L -o backend/models/dicta-onnx/dicta-1.0.onnx \
https://github.com/thewh1teagle/dicta-onnx/releases/download/model-files-v1.0/dicta-1.0.onnx
```
Then restart the backend. You should see `[Chatterbox] Hebrew diacritizer loaded` in the logs.

**"spaCy not available" (warning, not error)**
```bash
pip install spacy
# The app will use regex fallback if spaCy is not installed
```
**Models not downloading**
- Ensure you have internet access
- Models are stored in `~/.cache/huggingface/` (Qwen3) and `backend/models/` (Kokoro)
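
To see whether a download actually wrote anything to those two locations, a quick size check can help. This is a sketch only: `dir_size_bytes` is a hypothetical helper, and the relative `backend/models` path assumes the script is run from the repository root.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under `root`; return 0 if the directory is absent.

    Symlinks are skipped so that caches which link to shared blobs are not double counted.
    """
    if not root.is_dir():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


if __name__ == "__main__":
    locations = (
        ("Hugging Face cache", Path.home() / ".cache" / "huggingface"),
        ("backend/models", Path("backend/models")),
    )
    for label, path in locations:
        print(f"{label}: {dir_size_bytes(path) / 1e6:.1f} MB at {path}")
```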

**Flutter build fails**
```bash
flutter clean && flutter pub get && flutter build macos --release
```
**Port 8000 already in use**
```bash
lsof -i :8000
kill -9 <PID>
```
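The same check can be scripted, for example before launching the backend. The sketch below tests whether anything is accepting TCP connections on a port; the helper name `port_in_use` is an assumption for illustration, and the default host assumes the backend listens on localhost.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8000):
        print("Port 8000 is busy; find the owner with `lsof -i :8000`.")
    else:
        print("Port 8000 is free.")
```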
### Performance Tips (Apple Silicon + MPS)
MimikaStudio is optimized for Apple Silicon Macs with MPS (Metal Performance Shaders) acceleration where supported:
- **Kokoro TTS**: Uses MPS for GPU-accelerated inference — sub-200ms latency
- **Qwen3-TTS**: Runs on CPU (MPS support planned); still fast on M-series chips
- **Chatterbox**: Runs on CPU due to MPS resampling limitations
- **IndexTTS-2**: Benefits from CUDA on Linux/Windows; runs on CPU on macOS
- **Audiobook generation**: Expect ~60 chars/sec on an M2 MacBook Pro (matching the audiblez benchmark)
- **Memory**: Close other apps when generating long audiobooks with the 1.7B model
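
The ~60 chars/sec figure gives a quick way to estimate how long an audiobook will take to generate. The helper names below are hypothetical, and the 500,000-character book is a made-up example size; real throughput depends on the engine, the model size, and the hardware.

```python
def estimate_generation_seconds(num_chars: int, chars_per_sec: float = 60.0) -> float:
    """Estimate generation time in seconds at a given throughput."""
    if chars_per_sec <= 0:
        raise ValueError("chars_per_sec must be positive")
    return num_chars / chars_per_sec


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as 'Hh MMm SSs'."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    # 500,000 characters at 60 chars/sec is about 8,333 seconds.
    print(format_duration(estimate_generation_seconds(500_000)))  # 2h 18m 53s
```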
---
## Author
| | |
|---|---|
| **Author** | Shlomo Kashani |
| **Affiliation** | Johns Hopkins University, Maryland, U.S.A. |
---
## Citation
```bibtex
@software{kashani2025mimikastudio,
title={MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application},
author={Kashani, Shlomo},
year={2025},
institution={Johns Hopkins University},
url={https://github.com/BoltzmannEntropy/MimikaStudio},
note={Comprehensive desktop application integrating Qwen3-TTS and Kokoro for voice cloning and synthesis}
}
```
**APA Format:**

Kashani, S. (2025). *MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application*. Johns Hopkins University. https://github.com/BoltzmannEntropy/MimikaStudio

**IEEE Format:**

S. Kashani, "MimikaStudio: Local-First Voice Cloning and Text-to-Speech Desktop Application," Johns Hopkins University, 2025. [Online]. Available: https://github.com/BoltzmannEntropy/MimikaStudio

---
## Similar Projects
| Project | Description | Key Features |
|---------|-------------|--------------|
| [**audiblez**](https://github.com/santinic/audiblez) | EPUB to audiobook converter using Kokoro TTS | spaCy sentence tokenization, M4B output with chapters |
| [**pdf-narrator**](https://github.com/mateogon/pdf-narrator) | PDF to audiobook with smart text extraction | Skips headers/footers/page numbers, TOC-based chapter splitting |
| [**abogen**](https://github.com/denizsafak/abogen) | Full-featured audiobook generator GUI | Voice mixer, subtitle generation, batch processing |
| [**Qwen3-Audiobook-Converter**](https://github.com/WhiskeyCoder/Qwen3-Audiobook-Converter) | Qwen3-TTS audiobook tool | Style instructions for professional narration |
### What MimikaStudio Adds
- **From audiblez**: spaCy-based sentence tokenization, character-based progress tracking, M4B with chapters
- **From pdf-narrator**: Smart PDF extraction that skips headers/footers/page numbers, TOC-based chapters
- **From abogen**: Multiple output formats (WAV/MP3/M4B), real-time progress with ETA
- **Unique to MimikaStudio**: Native macOS Flutter UI, 3-second voice cloning, voice library management, Emma IPA transcription, full MCP server integration, 60+ REST API endpoints, in-app MCP & API dashboard
---
## License
MIT License
## Acknowledgments
- [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) - 3-second voice cloning with CustomVoice
- [Kokoro TTS](https://github.com/hexgrad/kokoro) - Fast, high-quality English TTS
- [Chatterbox](https://github.com/resemble-ai/chatterbox) - Multilingual voice cloning
- [Dicta ONNX](https://github.com/thewh1teagle/dicta-onnx) - Hebrew diacritization for Chatterbox TTS
- [IndexTTS-2](https://github.com/IndexTeam/IndexTTS) - High-quality voice cloning
- [Flutter](https://flutter.dev) - Cross-platform UI framework
- [FastAPI](https://fastapi.tiangolo.com) - Python API framework
- [spaCy](https://spacy.io) - Industrial-strength NLP for sentence tokenization
- [PyMuPDF](https://pymupdf.readthedocs.io) - Smart PDF text extraction