Build Your Own Local Text-to-Speech Stack with Python
A hands-on guide to deploying 3 TTS engines (Edge TTS, Piper, Coqui XTTS v2) and a speech-to-text service — from a simple API to desktop keyboard shortcuts that read any selected text aloud.
Most developers use cloud TTS APIs — pay per character, send your data to someone else's servers, and hope the pricing doesn't change. But what if you could run high-quality text-to-speech entirely on your own machine? No API keys, no subscriptions, no data leaving your network.
In this post, we'll build a complete local TTS stack. We'll deploy three different engines (each with different trade-offs), add speech-to-text in the other direction, and wire everything into desktop shortcuts so you can select any text and hear it spoken with one keypress.
The Three Engines
Not all TTS engines are equal. Some need a GPU, some need the internet, some run on a Raspberry Pi. Here's what we're working with:
| Engine | RAM | GPU | Quality | Internet | Voice Clone | Best For |
|---|---|---|---|---|---|---|
| Edge TTS | ~50MB | No | Great | Yes | No | Zero-cost, best quality |
| Piper | ~100MB | No | Good | No | No | Offline, instant, low-resource |
| Coqui XTTS v2 | ~4GB | 2-3GB VRAM | Excellent | No | Yes | Voice cloning, multilingual |
The decision logic is simple:
IF internet_available AND quality_matters AND no_privacy_concern:
    engine = "edge-tts"     # zero resource cost, great quality
ELIF gpu_available AND vram >= 3GB AND voice_clone_needed:
    engine = "coqui-xtts"   # best quality, voice cloning
ELIF low_resources OR need_instant_response:
    engine = "piper"        # lightweight, fastest, fully offline
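The same logic as a small Python helper (a sketch; the parameter names are mine, and the thresholds mirror the table above):

```python
def choose_engine(internet: bool, privacy_ok: bool = True,
                  gpu_vram_gb: float = 0.0, need_clone: bool = False) -> str:
    """Pick a TTS engine following the decision table above."""
    if internet and privacy_ok and not need_clone:
        return "edge-tts"    # zero local resource cost, great quality
    if need_clone and gpu_vram_gb >= 2:
        return "coqui-xtts"  # best quality, voice cloning (2-3GB VRAM)
    return "piper"           # lightweight, instant, fully offline

print(choose_engine(internet=True))                                   # edge-tts
print(choose_engine(internet=False, gpu_vram_gb=8, need_clone=True))  # coqui-xtts
print(choose_engine(internet=False))                                  # piper
```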
Hear the Difference
Before we dive into code, listen to the same sentence spoken by each engine. All samples say the same text:
"Welcome to my blog. Today we are building a complete text-to-speech stack that runs entirely on your own machine. No cloud APIs, no subscriptions, just your hardware and open-source software."
Edge TTS — Aria (US Female, Neural):
Edge TTS — Guy (US Male, Neural):
Edge TTS — Sonia (British Female, Neural):
Edge TTS — Ava (US Female, Conversational):
Piper — Lessac (US English, Medium quality, fully offline):
Notice how Edge TTS voices sound more natural and expressive — they're neural network models running on Microsoft's servers. Piper sounds more robotic but runs instantly with zero internet dependency. That's the trade-off.
Part 1: Edge TTS — Great Quality, Zero Resources
Edge TTS uses Microsoft's neural voice API (the same voices behind Microsoft Edge's Read Aloud feature). The quality is excellent, but it requires an internet connection.
Install
pip install edge-tts fastapi uvicorn
The FastAPI Server
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import edge_tts
import io

app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    voice: str = "en-US-AriaNeural"  # See /voices for options

@app.get("/voices")
async def list_voices():
    """List all available voices"""
    voices = await edge_tts.list_voices()
    return [
        {"name": v["ShortName"], "gender": v["Gender"], "locale": v["Locale"]}
        for v in voices
    ]

@app.post("/speak")
async def speak(req: TTSRequest):
    communicate = edge_tts.Communicate(req.text, req.voice)
    buffer = io.BytesIO()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            buffer.write(chunk["data"])
    buffer.seek(0)
    return StreamingResponse(
        buffer,
        media_type="audio/mpeg",  # registered MIME type for MP3
        headers={"Content-Disposition": "attachment; filename=speech.mp3"}
    )
That's the entire server, about 30 lines of Python. Run it:
uvicorn main:app --host 0.0.0.0 --port 8002
Test It
# List available voices (there are 100+)
curl -s http://localhost:8002/voices | python3 -m json.tool | head -20
# Generate speech
curl -s -X POST http://localhost:8002/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello from Edge TTS", "voice": "en-US-AriaNeural"}' \
-o /tmp/speech.mp3 && mpv /tmp/speech.mp3
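The same call from Python, using only the standard library (a sketch; it assumes the server above is running on localhost:8002, and the helper names are mine):

```python
import json
import urllib.request

API_URL = "http://localhost:8002"  # the Edge TTS server from this section

def build_speak_request(text: str, voice: str = "en-US-AriaNeural") -> urllib.request.Request:
    """Build the same POST /speak request the curl example sends."""
    body = json.dumps({"text": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{API_URL}/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def speak_to_file(text: str, path: str, voice: str = "en-US-AriaNeural") -> None:
    """Fetch synthesized speech and save it as an MP3 file."""
    with urllib.request.urlopen(build_speak_request(text, voice)) as resp:
        with open(path, "wb") as f:
            f.write(resp.read())

# speak_to_file("Hello from Python", "/tmp/speech.mp3")
```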
Popular Voices
| Voice | Gender | Accent | Style |
|---|---|---|---|
| en-US-AriaNeural | Female | American | Neutral, clear |
| en-US-GuyNeural | Male | American | Warm, natural |
| en-US-AvaNeural | Female | American | Conversational (Copilot style) |
| en-GB-SoniaNeural | Female | British | Professional |
| en-AU-NatashaNeural | Female | Australian | Friendly |
Docker
FROM python:3.11-slim
WORKDIR /app
COPY main.py .
RUN pip3 install --no-cache-dir fastapi uvicorn edge-tts
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docker build -t local-tts-edge .
docker run -d --name local-tts -p 8002:8000 local-tts-edge
Part 2: Piper — Offline, Instant, Lightweight
Piper is the opposite philosophy. It runs ONNX models locally — no internet, no GPU, ~100MB of RAM. The quality is lower than Edge TTS, but it's instant and completely private.
Install
pip install piper-tts fastapi uvicorn
Download a voice model:
mkdir -p /opt/tts/voices && cd /opt/tts/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json
The FastAPI Server
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import subprocess
import io
import wave

app = FastAPI()

VOICE_MODEL = "/opt/tts/voices/en_US-lessac-medium.onnx"
SAMPLE_RATE = 22050  # medium-quality Piper voices output 22.05 kHz

class TTSRequest(BaseModel):
    text: str

@app.post("/speak")
async def speak(req: TTSRequest):
    # --output-raw writes headerless 16-bit mono PCM to stdout
    process = subprocess.run(
        ["piper", "--model", VOICE_MODEL, "--output-raw"],
        input=req.text.encode(),
        capture_output=True
    )
    # Wrap the raw PCM in a WAV container so audio players recognize it
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(process.stdout)
    buffer.seek(0)
    return StreamingResponse(
        buffer,
        media_type="audio/wav",
        headers={"Content-Disposition": "attachment; filename=speech.wav"}
    )
Test It
curl -s -X POST http://localhost:8002/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello from Piper"}' \
-o /tmp/speech.wav && aplay /tmp/speech.wav
When to Use Piper
Piper shines in scenarios where Edge TTS can't:
- No internet — air-gapped servers, field deployments
- Privacy-critical — medical, legal, or classified environments
- Resource-constrained — Raspberry Pi, cheap VPS, shared hosting
- Latency-critical — response time measured in milliseconds
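For latency-sensitive reading of long documents, a common trick is to split the input into sentence-sized chunks and synthesize them one at a time, so playback can start before the whole text is rendered. A minimal splitter (a hypothetical helper, not part of Piper):

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=8))  # ['One.', 'Two.', 'Three.']
```

Feed each chunk to the /speak endpoint in sequence and queue the audio for playback.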
Part 3: Coqui XTTS v2 — Voice Cloning and Multilingual
This is the heavy hitter. Coqui XTTS v2 needs a GPU (2-3GB VRAM) and ~4GB RAM, but it delivers the best quality AND can clone any voice from just a 6-second audio sample.
Install
pip install TTS fastapi uvicorn python-multipart soundfile
The FastAPI Server
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from TTS.api import TTS
import io
import torch
import soundfile as sf

app = FastAPI()

# Load XTTS v2 model (downloads ~2GB on first run)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

SAMPLE_RATE = 24000  # XTTS v2 generates 24 kHz audio

class TTSRequest(BaseModel):
    text: str
    language: str = "en"

@app.post("/speak")
async def speak(req: TTSRequest):
    wav = tts.tts(text=req.text, language=req.language)
    buffer = io.BytesIO()
    sf.write(buffer, wav, SAMPLE_RATE, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

@app.post("/speak-clone")
async def speak_clone(
    text: str,
    language: str = "en",
    speaker_audio: UploadFile = File(...)
):
    """Clone a voice from a 6+ second audio sample"""
    audio_data = await speaker_audio.read()
    temp_path = "/tmp/speaker_sample.wav"
    with open(temp_path, "wb") as f:
        f.write(audio_data)
    wav = tts.tts(text=text, speaker_wav=temp_path, language=language)
    buffer = io.BytesIO()
    sf.write(buffer, wav, SAMPLE_RATE, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")
Two endpoints: /speak for standard TTS and /speak-clone for voice cloning. The cloning endpoint takes an audio file upload — record yourself saying anything for 6+ seconds, and the model will speak new text in your voice.
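Calling /speak-clone from Python means sending a multipart file upload. A stdlib-only sketch (assumes the server above on localhost:8003; the helper names are mine):

```python
import uuid
import urllib.request
from urllib.parse import urlencode

API_URL = "http://localhost:8003"  # the Coqui server from this section

def multipart_body(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: audio/wav\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def speak_cloned(text: str, speaker_wav: str, out_path: str, language: str = "en") -> None:
    """Upload a speaker sample and save the cloned-voice audio."""
    with open(speaker_wav, "rb") as f:
        body, ctype = multipart_body("speaker_audio", "sample.wav", f.read())
    url = f"{API_URL}/speak-clone?" + urlencode({"text": text, "language": language})
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": ctype}, method="POST")
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())

# speak_cloned("This is my cloned voice", "/tmp/my_voice.wav", "/tmp/cloned.wav")
```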
Docker (GPU Required)
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
python3-pip ffmpeg libsndfile1 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY main.py .
RUN pip3 install --no-cache-dir fastapi uvicorn python-multipart TTS soundfile
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docker build -f Dockerfile.coqui -t local-tts-coqui .
docker run --gpus all -d --name local-tts-coqui -p 8003:8000 local-tts-coqui
Voice Cloning in Action
# Step 1: Record your voice (6+ seconds)
arecord -d 8 -f S16_LE -r 22050 -c 1 /tmp/my_voice.wav
# Step 2: Clone it
curl -s -X POST "http://localhost:8003/speak-clone?text=This+is+my+cloned+voice&language=en" \
-F "speaker_audio=@/tmp/my_voice.wav" \
-o /tmp/cloned.wav && aplay /tmp/cloned.wav
That's it. Record yourself, upload the sample, and the model speaks new text in your voice. The quality depends on your recording — a clean, quiet recording with natural speech gives the best results.
Supported Languages
XTTS v2 supports 20+ languages out of the box: English, Portuguese, French, Italian, Spanish, German, Polish, Dutch, Russian, Turkish, Greek, Czech, Danish, Finnish, Hungarian, Korean, Chinese, Japanese, and more. Change the language parameter to switch.
A Note on Voice Cloning Ethics
Only clone your own voice or voices you have explicit permission to use. Cloning celebrity or public figure voices without consent is illegal in many jurisdictions (Tennessee's ELVIS Act, California's right of publicity laws, EU AI Act, etc.). Use your own voice for demos — it's actually more impressive to readers.
Part 4: The Other Direction — Speech-to-Text
A voice stack isn't complete without STT. We'll use Faster-Whisper — a CTranslate2 port of OpenAI's Whisper that runs up to 4x faster than the original with lower memory use.
Install
pip install faster-whisper fastapi uvicorn python-multipart
The FastAPI Server
from fastapi import FastAPI, UploadFile, File
from faster_whisper import WhisperModel
import io

app = FastAPI()

# Adapt based on your resources:
# tiny (~500MB RAM), base (~1GB), small (~2GB), medium (~4GB)
MODEL_SIZE = "small"
DEVICE = "cuda"                # or "cpu"
COMPUTE_TYPE = "int8_float16"  # "int8" for CPU

model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    audio_data = await file.read()
    segments, info = model.transcribe(io.BytesIO(audio_data), beam_size=5)
    text = " ".join(segment.text for segment in segments)
    return {"language": info.language, "transcription": text.strip()}
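Note that transcribe() yields segments carrying start/end timestamps, which the endpoint above discards when joining the text. If you ever want subtitles or word timing, a small formatting helper can keep them (a sketch; Seg below is just a stand-in with the same attribute names as faster-whisper's Segment objects):

```python
from collections import namedtuple

def format_segments(segments) -> list[dict]:
    """Turn Whisper segments into JSON-friendly dicts with timestamps."""
    return [
        {"start": round(seg.start, 2), "end": round(seg.end, 2), "text": seg.text.strip()}
        for seg in segments
    ]

# Stand-in mimicking faster-whisper's Segment attributes for a quick demo
Seg = namedtuple("Seg", "start end text")
print(format_segments([Seg(0.0, 1.234, " Hello there. ")]))
# [{'start': 0.0, 'end': 1.23, 'text': 'Hello there.'}]
```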
Model Size Decision
IF GPU available AND vram >= 6GB:
    model = "medium", device = "cuda"   # Best accuracy
ELIF GPU available AND vram >= 4GB:
    model = "small", device = "cuda"    # Best balance
ELIF CPU only AND ram >= 3GB:
    model = "small", device = "cpu"     # Slower but accurate
ELSE:
    model = "tiny", device = "cpu"      # Basic but fast
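In code, the selection might look like this (a sketch; the helper and parameter names are mine):

```python
def choose_whisper_model(vram_gb: float = 0.0, ram_gb: float = 0.0) -> tuple[str, str]:
    """Return (model_size, device) following the decision logic above."""
    if vram_gb >= 6:
        return "medium", "cuda"  # best accuracy
    if vram_gb >= 4:
        return "small", "cuda"   # best balance
    if ram_gb >= 3:
        return "small", "cpu"    # slower but accurate
    return "tiny", "cpu"         # basic but fast

print(choose_whisper_model(vram_gb=8))  # ('medium', 'cuda')
print(choose_whisper_model(ram_gb=4))   # ('small', 'cpu')
```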
Docker (GPU)
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
python3-pip ffmpeg \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY main.py requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Test It
# Record a 5-second clip
arecord -d 5 -f S16_LE -r 16000 -c 1 /tmp/test.wav
# Transcribe
curl -X POST http://localhost:8001/transcribe \
-F "file=@/tmp/test.wav"
# Response:
# {"language": "en", "transcription": "Hello, this is a test recording."}
Part 5: Desktop Integration — Keyboard Shortcuts
Once your TTS and STT services are running, you can wire them into keyboard shortcuts: select text and press a combo to hear it spoken, or press a shortcut to record your voice and have the transcription pasted at the cursor. The setup differs per OS. On Linux and macOS, bind a custom shortcut in your keyboard settings to a small script that calls the API endpoints above; on Windows, the same works with a PowerShell script triggered from your shortcut settings.
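As a concrete sketch for Linux/X11 (the tool choices are assumptions: xclip for reading the selection, mpv for playback, and the Edge TTS server from Part 1 on port 8002), here is a script you could bind to a shortcut like Ctrl+Shift+S:

```python
#!/usr/bin/env python3
"""Speak the currently highlighted text via the local TTS API.

Assumes: xclip and mpv installed, Edge TTS server from Part 1 on :8002.
Bind this script to a custom keyboard shortcut in your desktop settings.
"""
import json
import subprocess
import tempfile
import urllib.request

API_URL = "http://localhost:8002"

def get_selection() -> str:
    """Read the X11 primary selection (whatever text is highlighted)."""
    result = subprocess.run(["xclip", "-o", "-selection", "primary"],
                           capture_output=True, text=True)
    return result.stdout.strip()

def build_request(text: str) -> urllib.request.Request:
    """Construct the POST /speak request for the TTS API."""
    return urllib.request.Request(
        f"{API_URL}/speak",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def speak(text: str) -> None:
    """Fetch the audio and play it with mpv."""
    with urllib.request.urlopen(build_request(text)) as resp:
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            f.write(resp.read())
    subprocess.run(["mpv", "--really-quiet", f.name])

# Uncomment when installing as a shortcut:
# speak(get_selection())
```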
Part 6: Systemd Services — Run on Boot (Linux/Ubuntu)
You don't want to manually start your TTS/STT servers every time you reboot. Create systemd services:
TTS Service
Create /etc/systemd/system/tts.service:
[Unit]
Description=Text-to-Speech API
After=network.target
[Service]
Type=simple
WorkingDirectory=/opt/tts
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8002
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
STT Service
Create /etc/systemd/system/stt.service:
[Unit]
Description=Speech-to-Text API
After=network.target
[Service]
Type=simple
WorkingDirectory=/opt/stt
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8001
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Enable both:
systemctl daemon-reload
systemctl enable --now tts stt
The Full Architecture
Here's what we built:
┌─────────────────────────────────────────────────────────────┐
│ DESKTOP SHORTCUTS │
│ Ctrl+Shift+S → Select text → Hear it spoken │
│ Ctrl+Shift+R → Speak → Text pasted at cursor │
└──────────┬────────────────────────────┬─────────────────────┘
│ │
┌──────────▼──────────┐ ┌───────────▼──────────────┐
│ TTS API (:8002) │ │ STT API (:8001) │
│ │ │ │
│ Engine Options: │ │ Faster-Whisper │
│ ┌───────────────┐ │ │ tiny/base/small/medium │
│ │ Edge TTS │ │ │ CPU or GPU │
│ │ (internet) │ │ │ │
│ ├───────────────┤ │ │ POST /transcribe │
│ │ Piper │ │ │ → {"transcription":...} │
│ │ (offline, CPU)│ │ │ │
│ ├───────────────┤ │ └──────────────────────────┘
│ │ Coqui XTTS │ │
│ │ (GPU, clone) │ │
│ └───────────────┘ │
│ │
│ POST /speak │
│ POST /speak-clone │
│ GET /voices │
└─────────────────────┘
Troubleshooting
| Problem | Solution |
|---|---|
| Edge TTS: connection error | Check internet connectivity |
| Piper: no voice model found | Download from Hugging Face |
| Coqui: CUDA out of memory | Use a smaller model or switch to Piper/Edge TTS |
| No audio output | Install alsa-utils (apt install alsa-utils) or try mpv |
| Port already in use | Change port: --port 8003 |
| STT: slow on CPU | Use tiny model for faster (but less accurate) results |
| xclip not found | apt install xclip |
| Keyboard shortcut not working | Check GNOME Settings > Keyboard > Custom Shortcuts |
Key Takeaways
- You don't need cloud APIs for TTS. Edge TTS gives you neural-quality voices for free. Piper runs offline on a Raspberry Pi. Coqui XTTS v2 clones voices with a 6-second sample. Pick the one that fits your constraints.
- The FastAPI pattern is the same for all engines. Every TTS server follows the same structure: receive text, generate audio, return a streaming response. Swapping engines means changing ~10 lines of code.
- Desktop integration is the killer feature. Having TTS/STT bound to keyboard shortcuts transforms how you use your computer. Select text → hear it. Speak → text appears. Once you set it up, you'll wonder how you worked without it.
- Voice cloning is powerful, but use it responsibly. Clone your own voice for demos and personal use. Never clone someone else's voice without explicit consent — it's increasingly illegal and always unethical.
- Start with Edge TTS, add more later. It requires zero local resources and sounds great. Once you've confirmed your workflow, add Piper for offline capability or Coqui for voice cloning. You don't need all three on day one.
- Systemd + Docker = set it and forget it. Once deployed, your TTS/STT services survive reboots, auto-restart on crashes, and run in isolated containers. Zero maintenance.