johnride/speech-to-text

Fork 0

Go to file

Jean-Gabriel Gill-Couture 9d030c5443 feat: Add start_cpu.sh setting STT_WHISPER_DEVICE=cpu

2026-05-29 09:48:03 -04:00

lib

audio player

2026-05-28 16:12:58 -04:00

.gitignore

initial commit

2026-05-22 17:10:24 -04:00

app.py

feat: progress bar and streaming chunks as they come

2026-05-23 07:26:00 -04:00

backend.sh

Feat: rename app to speech-to-text and improve README

2026-05-22 17:43:44 -04:00

README.md

feat: Add summarization feature calling out to an openai compatible endpoint

2026-05-23 08:13:05 -04:00

requirements.txt

Feat: rename app to speech-to-text and improve README

2026-05-22 17:43:44 -04:00

setup.sh

Feat: rename app to speech-to-text and improve README

2026-05-22 17:43:44 -04:00

start_cpu.sh

feat: Add start_cpu.sh setting STT_WHISPER_DEVICE=cpu

2026-05-29 09:48:03 -04:00

start.sh

feat: progress bar and streaming chunks as they come

2026-05-23 07:26:00 -04:00

README.md

Speech to Text

A tiny local web app: drop an audio file into your browser and get a transcript back.

Runs entirely on your machine. GPU acceleration is used automatically when available (NVIDIA CUDA) — otherwise it falls back to CPU.

Quick start

./start.sh

This single command:

Creates a Python virtual environment (if missing)
Installs dependencies (if missing) — faster-whisper with CUDA support
Starts the web UI at http://127.0.0.1:8080

Open that URL, drag in an m4a/ogg/mp3/wav/flac/aac/opus/webm file, and get a transcript.

Requirements: Python 3, ffmpeg, and (for GPU) NVIDIA drivers. The first run may take a few minutes to download the Whisper model.

How it works

browser  ──drag&drop──▶  app.py (stdlib HTTP)  ──▶  faster-whisper (CTranslate2 / GPU)

app.py — serves a single-page drag-and-drop UI and handles transcription via Python. Zero pip installs at runtime; all dependencies live in the venv.
faster-whisper — loads the Whisper model (default: large-v3) and runs inference via CTranslate2. On first use the model is downloaded from Hugging Face and cached.
Parakeet backend — an optional alternative engine. See Parakeet backend below.

GPU support

If an NVIDIA GPU with CUDA drivers is available, transcription runs on the GPU automatically.
The nvidia-cublas, nvidia-cudnn, and nvidia-cuda_nvrtc packages installed into the venv provide the CUDA runtime — no system-wide CUDA toolkit needed.
Without a GPU, everything runs on CPU (slower but functional).
Set STT_WHISPER_DEVICE=cpu to force CPU even if a GPU is present.

Configuration

Optionally create ~/.local/share/speech-to-text/speech-to-text.env to override defaults:

STT_HOST=127.0.0.1       # bind address (0.0.0.0 to accept LAN connections)
STT_PORT=8080             # UI port
STT_LANG=fr               # default language
STT_MAX_MB=2000           # max upload size in MB
STT_MODEL=whisper         # "whisper" (default) or "parakeet"
STT_WHISPER_MODEL=large-v3        # whisper model size
STT_WHISPER_DEVICE=cuda           # "cuda" (default) or "cpu"
STT_WHISPER_COMPUTE=int8_float16  # compute type
STT_CHUNK_SECONDS=120              # seconds per chunk for long files
STT_WORKERS=4                       # parallel chunk workers
STT_LLM_MODEL=gpt-4o-mini          # LLM model for summarization

No config file is needed for the default setup — sensible defaults are baked in.

Summarization

Each transcription has a Summarize button that calls an OpenAI-compatible LLM endpoint to produce a structured summary with bullet points, action items, and key decisions.

The LLM endpoint is configured by clicking the ⚙ gear icon in the top-right corner of the UI. Settings are stored in your browser's localStorage and persist across sessions.

You can also preconfigure the endpoint via environment variables:

Variable	Meaning	Default
`OPENAI_COMPATIBLE_ENDPOINT`	LLM base URL (e.g. `https://api.openai.com/v1`)	(empty)
`OPENAI_API_KEY`	API key for the LLM	(empty)
`STT_LLM_MODEL`	Model name for summarization	`gpt-4o-mini`

Parakeet backend (alternative ASR engine)

By default the app uses Whisper via faster-whisper. You can switch to the NVIDIA Parakeet TDT 0.6B model by setting STT_MODEL=parakeet. This requires the external Go ASR server and ONNX model files.

# One-time setup (Arch Linux): downloads Go binary + ONNX model
./setup.sh

setup.sh is idempotent and does the following:

Ensures ffmpeg and ONNX Runtime are installed.
Downloads the parakeet Go server binary from achetronic/parakeet.
Downloads the Parakeet TDT 0.6B v3 int8 ONNX model files (~670MB), or symlinks them from an existing Handy install if found.
Writes a resolved env file to ~/.local/share/speech-to-text/speech-to-text.env.

When STT_MODEL=parakeet, start.sh automatically launches the Go server before starting the Python UI, and the Python server proxies transcription requests to it.

Configuration reference (Parakeet backend)

Variable	Meaning	Default
`STT_PARAKEET_BIN`	Path to parakeet Go binary	`$HOME/.local/share/speech-to-text/bin/parakeet`
`STT_PARAKEET_PORT`	Go server port	`5092`
`STT_PARAKEET_MODELS_DIR`	ONNX model directory	`$HOME/.local/share/speech-to-text/models`
`STT_BACKEND_URL`	Where the Python UI reaches the Go server	`http://127.0.0.1:5092`
`STT_API_KEY`	Bearer token if Go server auth is enabled	(unset)
`STT_WORKERS`	Concurrent inference workers	`4`

Run on boot (systemd)

For always-on deployment, copy the service units and enable them:

DEST=~/.local/share/speech-to-text
cp app.py "$DEST/app.py"
cp -r lib "$DEST/lib"

mkdir -p ~/.config/systemd/user
cp speech-to-text.service ~/.config/systemd/user/

systemctl --user daemon-reload
systemctl --user enable --now speech-to-text.service

License

App code: AGPL-3.0-or-later.
parakeet server (optional): MIT.
Whisper model weights: MIT (OpenAI).
Parakeet model weights: CC-BY-4.0 (NVIDIA).