Speech to Text
A tiny local web app: drop an audio file into your browser and get a transcript back.
Runs entirely on your machine. GPU acceleration is used automatically when available (NVIDIA CUDA) — otherwise it falls back to CPU.
Quick start
./start.sh
This single command:
- Creates a Python virtual environment (if missing)
- Installs dependencies (if missing) —
faster-whisperwith CUDA support - Starts the web UI at http://127.0.0.1:8080
Open that URL, drag in an m4a/ogg/mp3/wav/flac/aac/opus/webm file, and get a transcript.
Requirements: Python 3, ffmpeg, and (for GPU) NVIDIA drivers. The first run may take a few minutes to download the Whisper model.
How it works
browser ──drag&drop──▶ app.py (stdlib HTTP) ──▶ faster-whisper (CTranslate2 / GPU)
app.py— serves a single-page drag-and-drop UI and handles transcription via Python. Zero pip installs at runtime; all dependencies live in the venv.- faster-whisper — loads the Whisper model (default:
large-v3) and runs inference via CTranslate2. On first use the model is downloaded from Hugging Face and cached. - Parakeet backend — an optional alternative engine. See Parakeet backend below.
GPU support
- If an NVIDIA GPU with CUDA drivers is available, transcription runs on the GPU automatically.
- The
nvidia-cublas,nvidia-cudnn, andnvidia-cuda_nvrtcpackages installed into the venv provide the CUDA runtime — no system-wide CUDA toolkit needed. - Without a GPU, everything runs on CPU (slower but functional).
- Set
STT_WHISPER_DEVICE=cputo force CPU even if a GPU is present.
Configuration
Optionally create ~/.local/share/speech-to-text/speech-to-text.env to override defaults:
STT_HOST=127.0.0.1 # bind address (0.0.0.0 to accept LAN connections)
STT_PORT=8080 # UI port
STT_LANG=fr # default language
STT_MAX_MB=2000 # max upload size in MB
STT_MODEL=whisper # "whisper" (default) or "parakeet"
STT_WHISPER_MODEL=large-v3 # whisper model size
STT_WHISPER_DEVICE=cuda # "cuda" (default) or "cpu"
STT_WHISPER_COMPUTE=int8_float16 # compute type
STT_CHUNK_SECONDS=120 # seconds per chunk for long files
STT_WORKERS=4 # parallel chunk workers
STT_LLM_MODEL=gpt-4o-mini # LLM model for summarization
No config file is needed for the default setup — sensible defaults are baked in.
Summarization
Each transcription has a Summarize button that calls an OpenAI-compatible LLM endpoint to produce a structured summary with bullet points, action items, and key decisions.
The LLM endpoint is configured by clicking the ⚙ gear icon in the top-right corner of the UI. Settings are stored in your browser's localStorage and persist across sessions.
You can also preconfigure the endpoint via environment variables:
| Variable | Meaning | Default |
|---|---|---|
OPENAI_COMPATIBLE_ENDPOINT |
LLM base URL (e.g. https://api.openai.com/v1) |
(empty) |
OPENAI_API_KEY |
API key for the LLM | (empty) |
STT_LLM_MODEL |
Model name for summarization | gpt-4o-mini |
Parakeet backend (alternative ASR engine)
By default the app uses Whisper via faster-whisper. You can switch to the
NVIDIA Parakeet TDT 0.6B model by setting
STT_MODEL=parakeet. This requires the external Go ASR server and ONNX model files.
# One-time setup (Arch Linux): downloads Go binary + ONNX model
./setup.sh
setup.sh is idempotent and does the following:
- Ensures ffmpeg and ONNX Runtime are installed.
- Downloads the parakeet Go server binary from achetronic/parakeet.
- Downloads the Parakeet TDT 0.6B v3 int8 ONNX model files (~670MB), or symlinks them from an existing Handy install if found.
- Writes a resolved env file to
~/.local/share/speech-to-text/speech-to-text.env.
When STT_MODEL=parakeet, start.sh automatically launches the Go server before
starting the Python UI, and the Python server proxies transcription requests to it.
Configuration reference (Parakeet backend)
| Variable | Meaning | Default |
|---|---|---|
STT_PARAKEET_BIN |
Path to parakeet Go binary | $HOME/.local/share/speech-to-text/bin/parakeet |
STT_PARAKEET_PORT |
Go server port | 5092 |
STT_PARAKEET_MODELS_DIR |
ONNX model directory | $HOME/.local/share/speech-to-text/models |
STT_BACKEND_URL |
Where the Python UI reaches the Go server | http://127.0.0.1:5092 |
STT_API_KEY |
Bearer token if Go server auth is enabled | (unset) |
STT_WORKERS |
Concurrent inference workers | 4 |
Run on boot (systemd)
For always-on deployment, copy the service units and enable them:
DEST=~/.local/share/speech-to-text
cp app.py "$DEST/app.py"
cp -r lib "$DEST/lib"
mkdir -p ~/.config/systemd/user
cp speech-to-text.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now speech-to-text.service
License
- App code: AGPL-3.0-or-later.
- parakeet server (optional): MIT.
- Whisper model weights: MIT (OpenAI).
- Parakeet model weights: CC-BY-4.0 (NVIDIA).