Build Your Own Local Text-to-Speech Stack with Python

A hands-on guide to deploying 3 TTS engines (Edge TTS, Piper, Coqui XTTS v2) and a speech-to-text service — from a simple API to desktop keyboard shortcuts that read any selected text aloud.

Most developers use cloud TTS APIs — pay per character, send your data to someone else's servers, and hope the pricing doesn't change. But what if you could run high-quality text-to-speech entirely on your own machine? No API keys, no subscriptions, no data leaving your network.

In this post, we'll build a complete local TTS stack. We'll deploy three different engines (each with different trade-offs), add speech-to-text in the other direction, and wire everything into desktop shortcuts so you can select any text and hear it spoken with one keypress.


The Three Engines

Not all TTS engines are equal. Some need a GPU, some need the internet, some run on a Raspberry Pi. Here's what we're working with:

Engine          RAM      GPU          Quality    Internet  Voice Clone  Best For
Edge TTS        ~50MB    No           Great      Yes       No           Zero-cost, best quality
Piper           ~100MB   No           Good       No        No           Offline, instant, low-resource
Coqui XTTS v2   ~4GB     2-3GB VRAM   Excellent  No        Yes          Voice cloning, multilingual

The decision logic is simple:

IF voice_clone_needed AND gpu_available AND vram >= 3GB:
    engine = "coqui-xtts"    # only option that can clone voices
ELIF internet_available AND quality_matters AND no_privacy_concern:
    engine = "edge-tts"      # zero resource cost, great quality
ELSE:
    engine = "piper"         # lightweight, fastest, fully offline
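
That logic translates to a small Python helper (names and thresholds are illustrative, with voice cloning checked first since only Coqui supports it):

```python
def choose_engine(
    internet_available: bool,
    gpu_available: bool,
    vram_gb: float,
    voice_clone_needed: bool,
    privacy_ok: bool,
) -> str:
    """Pick a TTS engine from the constraints discussed above."""
    if voice_clone_needed and gpu_available and vram_gb >= 3:
        return "coqui-xtts"  # the only engine here that can clone voices
    if internet_available and privacy_ok:
        return "edge-tts"    # best free quality, zero local resources
    return "piper"           # offline fallback: lightweight and instant
```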

Hear the Difference

Before we dive into code, listen to the same sentence spoken by each engine. All samples say the same text:

"Welcome to my blog. Today we are building a complete text-to-speech stack that runs entirely on your own machine. No cloud APIs, no subscriptions, just your hardware and open-source software."

Edge TTS — Aria (US Female, Neural):

Edge TTS — Guy (US Male, Neural):

Edge TTS — Sonia (British Female, Neural):

Edge TTS — Ava (US Female, Conversational):

Piper — Lessac (US English, Medium quality, fully offline):

Notice how Edge TTS voices sound more natural and expressive — they're neural network models running on Microsoft's servers. Piper sounds more robotic but runs instantly with zero internet dependency. That's the trade-off.


Part 1: Edge TTS — Great Quality, Zero Resources

Edge TTS uses Microsoft's neural voice API (the same voices behind Microsoft Edge's Read Aloud feature). The quality is excellent, but it requires an internet connection.

Install

pip install edge-tts fastapi uvicorn

The FastAPI Server

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import edge_tts
import io

app = FastAPI()

class TTSRequest(BaseModel):
    text: str
    voice: str = "en-US-AriaNeural"  # See /voices for options

@app.get("/voices")
async def list_voices():
    """List all available voices"""
    voices = await edge_tts.list_voices()
    return [
        {"name": v["ShortName"], "gender": v["Gender"], "locale": v["Locale"]}
        for v in voices
    ]

@app.post("/speak")
async def speak(req: TTSRequest):
    communicate = edge_tts.Communicate(req.text, req.voice)
    buffer = io.BytesIO()
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":
            buffer.write(chunk["data"])
    buffer.seek(0)
    return StreamingResponse(
        buffer,
        media_type="audio/mpeg",  # standard MIME type for MP3
        headers={"Content-Disposition": "attachment; filename=speech.mp3"}
    )

That's the entire server, about 30 lines of Python. Run it:

uvicorn main:app --host 0.0.0.0 --port 8002

Test It

# List available voices (there are 100+)
curl -s http://localhost:8002/voices | python3 -m json.tool | head -20

# Generate speech
curl -s -X POST http://localhost:8002/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Edge TTS", "voice": "en-US-AriaNeural"}' \
  -o /tmp/speech.mp3 && mpv /tmp/speech.mp3

Recommended Voices

Voice                 Gender   Accent       Style
en-US-AriaNeural      Female   American     Neutral, clear
en-US-GuyNeural       Male     American     Warm, natural
en-US-AvaNeural       Female   American     Conversational (Copilot style)
en-GB-SoniaNeural     Female   British      Professional
en-AU-NatashaNeural   Female   Australian   Friendly

Docker

FROM python:3.11-slim

WORKDIR /app
COPY main.py .

RUN pip3 install --no-cache-dir fastapi uvicorn edge-tts

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docker build -t local-tts-edge .
docker run -d --name local-tts -p 8002:8000 local-tts-edge

Part 2: Piper — Offline, Instant, Lightweight

Piper is the opposite philosophy. It runs ONNX models locally — no internet, no GPU, ~100MB of RAM. The quality is lower than Edge TTS, but it's instant and completely private.

Install

pip install piper-tts fastapi uvicorn

Download a voice model:

mkdir -p /opt/tts/voices && cd /opt/tts/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

The FastAPI Server

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import subprocess
import io

app = FastAPI()

VOICE_MODEL = "/opt/tts/voices/en_US-lessac-medium.onnx"

class TTSRequest(BaseModel):
    text: str

@app.post("/speak")
async def speak(req: TTSRequest):
    # "--output_file -" writes a complete WAV (header included) to stdout;
    # "--output-raw" would emit headerless PCM that most players reject.
    process = subprocess.run(
        ["piper", "--model", VOICE_MODEL, "--output_file", "-"],
        input=req.text.encode(),
        capture_output=True,
        check=True,
    )
    return StreamingResponse(
        io.BytesIO(process.stdout),
        media_type="audio/wav",
        headers={"Content-Disposition": "attachment; filename=speech.wav"}
    )

Test It

curl -s -X POST http://localhost:8002/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from Piper"}' \
  -o /tmp/speech.wav && aplay /tmp/speech.wav

When to Use Piper

Piper shines in scenarios where Edge TTS can't:

  • No internet — air-gapped servers, field deployments
  • Privacy-critical — medical, legal, or classified environments
  • Resource-constrained — Raspberry Pi, cheap VPS, shared hosting
  • Latency-critical — response time measured in milliseconds
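
Because Piper covers the offline case, a client can try Edge TTS first and fall back to Piper when the network is down. A sketch using only the standard library (URLs assume the servers from this post; `speak_with_fallback` takes plain callables so the routing is easy to test):

```python
import json
import urllib.request

def tts_request(url: str, payload: dict, timeout: float = 15.0) -> bytes:
    """POST a JSON TTS request and return the raw audio bytes."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

def speak_with_fallback(text: str, primary, fallback) -> bytes:
    """Try the primary engine; on any failure, use the fallback."""
    try:
        return primary(text)
    except Exception:
        return fallback(text)

# Usage (assuming Edge TTS on :8002 and a Piper server on :8003):
# audio = speak_with_fallback(
#     "Hello",
#     lambda t: tts_request("http://localhost:8002/speak",
#                           {"text": t, "voice": "en-US-AriaNeural"}),
#     lambda t: tts_request("http://localhost:8003/speak", {"text": t}),
# )
```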

Part 3: Coqui XTTS v2 — Voice Cloning and Multilingual

This is the heavy hitter. Coqui XTTS v2 needs a GPU (2-3GB VRAM) and ~4GB RAM, but it delivers the best quality AND can clone any voice from just a 6-second audio sample.

Install

pip install TTS fastapi uvicorn python-multipart soundfile

The FastAPI Server

from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from TTS.api import TTS
import io
import tempfile
import torch
import soundfile as sf

app = FastAPI()

# Load XTTS v2 model (downloads ~2GB on first run)
device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

SAMPLE_RATE = 24000  # XTTS v2 generates 24 kHz audio

class TTSRequest(BaseModel):
    text: str
    language: str = "en"

@app.post("/speak")
async def speak(req: TTSRequest):
    # XTTS v2 is multi-speaker: without a reference clip you must name
    # one of its built-in speakers (e.g. "Ana Florence").
    wav = tts.tts(text=req.text, language=req.language, speaker="Ana Florence")
    buffer = io.BytesIO()
    sf.write(buffer, wav, SAMPLE_RATE, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

@app.post("/speak-clone")
async def speak_clone(
    text: str,
    language: str = "en",
    speaker_audio: UploadFile = File(...)
):
    """Clone a voice from a 6+ second audio sample"""
    audio_data = await speaker_audio.read()
    # A unique temp file per request avoids clobbering under concurrency.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_data)
        temp_path = f.name

    wav = tts.tts(text=text, speaker_wav=temp_path, language=language)
    buffer = io.BytesIO()
    sf.write(buffer, wav, SAMPLE_RATE, format="WAV")
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/wav")

Two endpoints: /speak for standard TTS and /speak-clone for voice cloning. The cloning endpoint takes an audio file upload — record yourself saying anything for 6+ seconds, and the model will speak new text in your voice.

Docker (GPU Required)

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3-pip ffmpeg libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY main.py .

RUN pip3 install --no-cache-dir fastapi uvicorn python-multipart TTS soundfile

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docker build -f Dockerfile.coqui -t local-tts-coqui .
docker run --gpus all -d --name local-tts-coqui -p 8003:8000 local-tts-coqui

Voice Cloning in Action

# Step 1: Record your voice (6+ seconds)
arecord -d 8 -f S16_LE -r 22050 -c 1 /tmp/my_voice.wav

# Step 2: Clone it
curl -s -X POST "http://localhost:8003/speak-clone?text=This+is+my+cloned+voice&language=en" \
  -F "speaker_audio=@/tmp/my_voice.wav" \
  -o /tmp/cloned.wav && aplay /tmp/cloned.wav

That's it. Record yourself, upload the sample, and the model speaks new text in your voice. The quality depends on your recording — a clean, quiet recording with natural speech gives the best results.

Supported Languages

XTTS v2 supports 17 languages out of the box: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Hungarian, Korean, and Hindi. Change the language parameter to switch.
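
An unsupported code surfaces as a confusing error deep inside the model, so it's worth validating up front. A small guard (the language set below is my reading of XTTS v2's supported codes; verify it against your installed TTS version):

```python
# XTTS v2 language codes (an assumption; check your installed TTS version).
XTTS_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
}

def validate_language(code: str) -> str:
    """Normalize a language code, raising early if the model can't handle it."""
    normalized = code.strip().lower()
    if normalized not in XTTS_LANGUAGES:
        raise ValueError(f"unsupported language: {code!r}")
    return normalized
```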

A Note on Voice Cloning Ethics

Only clone your own voice or voices you have explicit permission to use. Cloning celebrity or public figure voices without consent is illegal in many jurisdictions (Tennessee's ELVIS Act, California's right of publicity laws, EU AI Act, etc.). Use your own voice for demos — it's actually more impressive to readers.


Part 4: The Other Direction — Speech-to-Text

A voice stack isn't complete without STT. We'll use Faster-Whisper — a CTranslate2 port of OpenAI's Whisper that runs 4x faster with lower memory.

Install

pip install faster-whisper fastapi uvicorn python-multipart

The FastAPI Server

from fastapi import FastAPI, UploadFile, File
from faster_whisper import WhisperModel
import io

app = FastAPI()

# Adapt based on your resources:
# tiny (~500MB RAM), base (~1GB), small (~2GB), medium (~4GB)
MODEL_SIZE = "small"
DEVICE = "cuda"           # or "cpu"
COMPUTE_TYPE = "int8_float16"  # "int8" for CPU

model = WhisperModel(MODEL_SIZE, device=DEVICE, compute_type=COMPUTE_TYPE)

@app.post("/transcribe")
async def transcribe(file: UploadFile = File(...)):
    audio_data = await file.read()
    segments, info = model.transcribe(io.BytesIO(audio_data), beam_size=5)
    text = " ".join([segment.text for segment in segments])
    return {"language": info.language, "transcription": text.strip()}

Model Size Decision

IF GPU available AND vram >= 6GB:
    model = "medium", device = "cuda"   # Best accuracy
ELIF GPU available AND vram >= 4GB:
    model = "small", device = "cuda"    # Best balance
ELIF CPU only AND ram >= 3GB:
    model = "small", device = "cpu"     # Slower but accurate
ELSE:
    model = "tiny", device = "cpu"      # Basic but fast
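
As Python, with conditions checked from largest to smallest so a 6GB card actually gets the medium model (a hypothetical helper, not part of the server code):

```python
def choose_whisper_config(has_gpu: bool, vram_gb: float, ram_gb: float) -> tuple:
    """Map available hardware to a (model_size, device) pair for faster-whisper."""
    if has_gpu and vram_gb >= 6:
        return ("medium", "cuda")  # best accuracy
    if has_gpu and vram_gb >= 4:
        return ("small", "cuda")   # best balance
    if ram_gb >= 3:
        return ("small", "cpu")    # slower but accurate
    return ("tiny", "cpu")         # basic but fast
```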

Docker (GPU)

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3-pip ffmpeg \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY main.py .

RUN pip3 install --no-cache-dir faster-whisper fastapi uvicorn python-multipart

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Test It

# Record a 5-second clip
arecord -d 5 -f S16_LE -r 16000 -c 1 /tmp/test.wav

# Transcribe
curl -X POST http://localhost:8001/transcribe \
  -F "file=@/tmp/test.wav"

# Response:
# {"language": "en", "transcription": "Hello, this is a test recording."}

Part 5: Desktop Integration — Keyboard Shortcuts

Once your TTS and STT services are running, you can wire them into desktop keyboard shortcuts: select text and press a key combo to hear it spoken, or press a shortcut to record your voice and paste the transcription at the cursor. The setup differs per OS. On Linux and macOS, bind a shell script that calls the API endpoints above to a custom shortcut; on Windows, do the same with a PowerShell script. Each OS exposes this in its keyboard-shortcut settings (on GNOME, for example, Settings > Keyboard > Custom Shortcuts).
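
As a concrete Linux sketch of the TTS side (assumes xclip, mpv or mpg123, and the Edge TTS server from Part 1 on port 8002; bind the script to a shortcut in your keyboard settings):

```python
#!/usr/bin/env python3
"""speak_selection.py — speak the currently highlighted text aloud."""
import json
import shutil
import subprocess
import urllib.request

TTS_URL = "http://localhost:8002/speak"
VOICE = "en-US-AriaNeural"

def build_payload(text: str, voice: str) -> bytes:
    """JSON-encode the request body (handles quotes and newlines safely)."""
    return json.dumps({"text": text, "voice": voice}).encode()

def get_selection() -> str:
    """Read the X11 primary selection via xclip; empty string on failure."""
    try:
        out = subprocess.run(
            ["xclip", "-o", "-selection", "primary"],
            capture_output=True, text=True,
        )
    except FileNotFoundError:  # xclip not installed
        return ""
    return out.stdout.strip()

def main() -> None:
    text = get_selection()
    if not text:
        return
    req = urllib.request.Request(
        TTS_URL,
        data=build_payload(text, VOICE),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        audio = resp.read()
    with open("/tmp/selection.mp3", "wb") as f:
        f.write(audio)
    player = shutil.which("mpv") or shutil.which("mpg123")
    if player:
        subprocess.run([player, "/tmp/selection.mp3"])

if __name__ == "__main__":
    main()
```

The STT direction works the same way in reverse: record a clip with arecord, POST it to the /transcribe endpoint, and insert the returned text at the cursor (e.g. with xdotool).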


Part 6: Systemd Services — Run on Boot (Linux/Ubuntu)

You don't want to manually start your TTS/STT servers every time you reboot. Create systemd services:

TTS Service

Create /etc/systemd/system/tts.service:

[Unit]
Description=Text-to-Speech API
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/tts
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8002
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

STT Service

Create /etc/systemd/system/stt.service:

[Unit]
Description=Speech-to-Text API
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/stt
ExecStart=/usr/local/bin/uvicorn main:app --host 0.0.0.0 --port 8001
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable both:

systemctl daemon-reload
systemctl enable --now tts stt

The Full Architecture

Here's what we built:

┌─────────────────────────────────────────────────────────────┐
│                    DESKTOP SHORTCUTS                         │
│   Ctrl+Shift+S → Select text → Hear it spoken               │
│   Ctrl+Shift+R → Speak → Text pasted at cursor              │
└──────────┬────────────────────────────┬─────────────────────┘
           │                            │
┌──────────▼──────────┐    ┌───────────▼──────────────┐
│   TTS API (:8002)   │    │    STT API (:8001)       │
│                     │    │                          │
│  Engine Options:    │    │  Faster-Whisper          │
│  ┌───────────────┐  │    │  tiny/base/small/medium  │
│  │ Edge TTS      │  │    │  CPU or GPU              │
│  │ (internet)    │  │    │                          │
│  ├───────────────┤  │    │  POST /transcribe        │
│  │ Piper         │  │    │  → {"transcription":...} │
│  │ (offline, CPU)│  │    │                          │
│  ├───────────────┤  │    └──────────────────────────┘
│  │ Coqui XTTS    │  │
│  │ (GPU, clone)  │  │
│  └───────────────┘  │
│                     │
│  POST /speak        │
│  POST /speak-clone  │
│  GET  /voices       │
└─────────────────────┘

Troubleshooting

Problem                          Solution
Edge TTS: connection error       Check internet connectivity
Piper: no voice model found      Download from Hugging Face
Coqui: CUDA out of memory        Use a smaller model or switch to Piper/Edge TTS
No audio output                  Install alsa-utils (apt install alsa-utils) or try mpv
Port already in use              Change port: --port 8003
STT: slow on CPU                 Use tiny model for faster (but less accurate) results
xclip not found                  apt install xclip
Keyboard shortcut not working    Check GNOME Settings > Keyboard > Custom Shortcuts

Key Takeaways

  1. You don't need cloud APIs for TTS. Edge TTS gives you neural-quality voices for free. Piper runs offline on a Raspberry Pi. Coqui XTTS v2 clones voices with a 6-second sample. Pick the one that fits your constraints.

  2. The FastAPI pattern is the same for all engines. Every TTS server follows the same structure: receive text, generate audio, return a streaming response. Swapping engines means changing ~10 lines of code.

  3. Desktop integration is the killer feature. Having TTS/STT bound to keyboard shortcuts transforms how you use your computer. Select text → hear it. Speak → text appears. Once you set it up, you'll wonder how you worked without it.

  4. Voice cloning is powerful but use it responsibly. Clone your own voice for demos and personal use. Never clone someone else's voice without explicit consent — it's increasingly illegal and always unethical.

  5. Start with Edge TTS, add more later. It requires zero local resources and sounds great. Once you've confirmed your workflow, add Piper for offline capability or Coqui for voice cloning. You don't need all three on day one.

  6. Systemd + Docker = set it and forget it. Once deployed, your TTS/STT services survive reboots, auto-restart on crashes, and run in isolated containers. Zero maintenance.
