134 lines
5.2 KiB
Markdown
134 lines
5.2 KiB
Markdown
# Speech to Text
|
|
|
|
A tiny local web app: drop an audio file into your browser and get a transcript back.
|
|
|
|
Runs entirely on your machine. **GPU acceleration is used automatically when available** (NVIDIA CUDA) — otherwise it falls back to CPU.
|
|
|
|
## Quick start
|
|
|
|
```bash
|
|
./start.sh
|
|
```
|
|
|
|
This single command:
|
|
|
|
1. Creates a Python virtual environment (if missing)
|
|
2. Installs dependencies (if missing) — `faster-whisper` with CUDA support
|
|
3. Starts the web UI at <http://127.0.0.1:8080>
|
|
|
|
Open that URL, drag in an m4a/ogg/mp3/wav/flac/aac/opus/webm file, and get a transcript.
|
|
|
|
> **Requirements:** Python 3, ffmpeg, and (for GPU) NVIDIA drivers. The first run may take a few minutes to download the Whisper model.
|
|
|
|
## How it works
|
|
|
|
```
|
|
browser ──drag&drop──▶ app.py (stdlib HTTP) ──▶ faster-whisper (CTranslate2 / GPU)
|
|
```
|
|
|
|
- **`app.py`** — serves a single-page drag-and-drop UI and handles transcription via Python.
|
|
Zero pip installs at runtime; all dependencies live in the venv.
|
|
- **faster-whisper** — loads the Whisper model (default: `large-v3`) and runs inference via
|
|
CTranslate2. On first use the model is downloaded from Hugging Face and cached.
|
|
- **Parakeet backend** — an optional alternative engine. See [Parakeet backend](#parakeet-backend) below.
|
|
|
|
### GPU support
|
|
|
|
- If an NVIDIA GPU with CUDA drivers is available, transcription runs on the GPU automatically.
|
|
- The `nvidia-cublas`, `nvidia-cudnn`, and `nvidia-cuda_nvrtc` packages installed into the venv
|
|
provide the CUDA runtime — no system-wide CUDA toolkit needed.
|
|
- Without a GPU, everything runs on CPU (slower but functional).
|
|
- Set `STT_WHISPER_DEVICE=cpu` to force CPU even if a GPU is present.
|
|
|
|
## Configuration
|
|
|
|
Optionally create `~/.local/share/speech-to-text/speech-to-text.env` to override defaults:
|
|
|
|
```bash
|
|
STT_HOST=127.0.0.1 # bind address (0.0.0.0 to accept LAN connections)
|
|
STT_PORT=8080 # UI port
|
|
STT_LANG=fr # default language
|
|
STT_MAX_MB=2000 # max upload size in MB
|
|
STT_MODEL=whisper # "whisper" (default) or "parakeet"
|
|
STT_WHISPER_MODEL=large-v3 # whisper model size
|
|
STT_WHISPER_DEVICE=cuda # "cuda" (default) or "cpu"
|
|
STT_WHISPER_COMPUTE=int8_float16 # compute type
|
|
STT_CHUNK_SECONDS=120 # seconds per chunk for long files
|
|
STT_WORKERS=4 # parallel chunk workers
|
|
STT_LLM_MODEL=gpt-4o-mini # LLM model for summarization
|
|
```
|
|
|
|
No config file is needed for the default setup — sensible defaults are baked in.
|
|
|
|
## Summarization
|
|
|
|
Each transcription has a **Summarize** button that calls an OpenAI-compatible LLM endpoint
|
|
to produce a structured summary with bullet points, action items, and key decisions.
|
|
|
|
The LLM endpoint is configured by clicking the ⚙ gear icon in the top-right corner of the UI.
|
|
Settings are stored in your browser's localStorage and persist across sessions.
|
|
|
|
You can also preconfigure the endpoint via environment variables:
|
|
|
|
| Variable | Meaning | Default |
|
|
| --- | --- | --- |
|
|
| `OPENAI_COMPATIBLE_ENDPOINT` | LLM base URL (e.g. `https://api.openai.com/v1`) | _(empty)_ |
|
|
| `OPENAI_API_KEY` | API key for the LLM | _(empty)_ |
|
|
| `STT_LLM_MODEL` | Model name for summarization | `gpt-4o-mini` |
|
|
|
|
## Parakeet backend (alternative ASR engine)
|
|
|
|
By default the app uses Whisper via `faster-whisper`. You can switch to the
|
|
[NVIDIA Parakeet TDT 0.6B](https://github.com/achetronic/parakeet) model by setting
|
|
`STT_MODEL=parakeet`. This requires the external Go ASR server and ONNX model files.
|
|
|
|
```bash
|
|
# One-time setup (Arch Linux): downloads Go binary + ONNX model
|
|
./setup.sh
|
|
```
|
|
|
|
`setup.sh` is idempotent and does the following:
|
|
|
|
1. Ensures **ffmpeg** and **ONNX Runtime** are installed.
|
|
2. Downloads the **parakeet** Go server binary from
|
|
[achetronic/parakeet](https://github.com/achetronic/parakeet).
|
|
3. Downloads the **Parakeet TDT 0.6B v3 int8** ONNX model files (~670MB),
|
|
or symlinks them from an existing Handy install if found.
|
|
4. Writes a resolved env file to `~/.local/share/speech-to-text/speech-to-text.env`.
|
|
|
|
When `STT_MODEL=parakeet`, `start.sh` automatically launches the Go server before
|
|
starting the Python UI, and the Python server proxies transcription requests to it.
|
|
|
|
### Configuration reference (Parakeet backend)
|
|
|
|
| Variable | Meaning | Default |
|
|
| --- | --- | --- |
|
|
| `STT_PARAKEET_BIN` | Path to parakeet Go binary | `$HOME/.local/share/speech-to-text/bin/parakeet` |
|
|
| `STT_PARAKEET_PORT` | Go server port | `5092` |
|
|
| `STT_PARAKEET_MODELS_DIR` | ONNX model directory | `$HOME/.local/share/speech-to-text/models` |
|
|
| `STT_BACKEND_URL` | Where the Python UI reaches the Go server | `http://127.0.0.1:5092` |
|
|
| `STT_API_KEY` | Bearer token if Go server auth is enabled | _(unset)_ |
|
|
| `STT_WORKERS` | Concurrent inference workers | `4` |
|
|
|
|
## Run on boot (systemd)
|
|
|
|
For always-on deployment, copy the service units and enable them:
|
|
|
|
```bash
|
|
DEST=~/.local/share/speech-to-text
|
|
cp app.py "$DEST/app.py"
|
|
cp -r lib "$DEST/lib"
|
|
|
|
mkdir -p ~/.config/systemd/user
|
|
cp speech-to-text.service ~/.config/systemd/user/
|
|
|
|
systemctl --user daemon-reload
|
|
systemctl --user enable --now speech-to-text.service
|
|
```
|
|
|
|
## License
|
|
|
|
- App code: AGPL-3.0-or-later.
|
|
- parakeet server (optional): MIT.
|
|
- Whisper model weights: MIT (OpenAI).
|
|
- Parakeet model weights: CC-BY-4.0 (NVIDIA). |