Vuesub turns audio and video into accurate subtitles with Whisper, and clones any voice from a short clip with CosyVoice 3. Runs entirely offline. No accounts, no telemetry, no upload required. Available for macOS, Windows, and Linux.
Switch with a click on the left rail.
Drop in audio, get a fully-edited subtitle track.
5–30s clip in. Any text, in that voice, out.
Caption. Built for creators, podcasters, and anyone who needs accurate timed text. Editable transcript with draggable caption clips on a waveform; SRT, VTT, plain text, or FCPXML for Final Cut Pro; 16 languages auto-detected, including Mandarin and Cantonese; optional cloud route via OpenAI whisper-1.
Voice Clone. Powered by CosyVoice 3 — runs locally, your reference clips never leave the machine. Zero-shot cloning, no fine-tuning. Nine languages: 中文 · English · 日本語 · 한국어 · Deutsch · Español · Français · Italiano · Русский. Whisper auto-transcribes the reference; transcript is editable inline. Lazy install — model downloads only when you opt in.
Drag clips. Zoom the waveform. Round-trip every edit.
Per-pixel waveform. Draggable clips. Inline edits.
16 languages, auto-detected. Plays anything FFmpeg can read.
Upload, verify, generate.
5–30 seconds. WAV or MP3. Single speaker.
Whisper transcribes. One click fixes anything wrong.
~60s cold start. ~3.5× realtime warm.
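The clip requirements above (5 to 30 seconds, WAV or MP3) can be pre-checked before upload. This is a minimal sketch and not Vuesub's own validation: `check_reference_clip` and `wav_duration` are hypothetical names, duration is only measured for WAV files because the Python standard library cannot decode MP3, and the single-speaker requirement is not checked at all.

```python
import wave
from pathlib import Path

# Limits taken from the page: 5 to 30 seconds of reference audio.
MIN_SECONDS, MAX_SECONDS = 5.0, 30.0


def wav_duration(path) -> float:
    """Return the length of a WAV file in seconds."""
    with wave.open(str(path), "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def check_reference_clip(path) -> list:
    """Return a list of problems; an empty list means the clip looks usable."""
    problems = []
    suffix = Path(path).suffix.lower()
    if suffix not in {".wav", ".mp3"}:
        problems.append("format must be WAV or MP3")
    elif suffix == ".wav":
        # MP3 duration is skipped: decoding it needs a third-party library.
        duration = wav_duration(path)
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"duration {duration:.1f}s is outside 5-30s")
    return problems
```

Returning a list of problems rather than raising lets a caller show every issue at once.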
Cloned entirely from the 18s reference — no fine-tuning, no training data, generated in seconds on a Mac.
The CosyVoice runtime is too heavy to bundle (~6.5 GB), so it installs on demand the first time you download a model. Vuesub detects what's missing and shows install snippets inline. You'll need three tools on your PATH first:
brew install [email protected]
curl -LsSf https://astral.sh/uv/install.sh | sh
xcode-select --install
Heads up. Voice Clone is fully tested only on Apple Silicon Macs. Windows and Linux binaries are produced by CI, but the runtime path hasn't been live-tested on those platforms. Please open an issue if anything breaks.
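To see which prerequisites are missing before starting the install, a PATH lookup is enough. This is a minimal sketch, assuming the tools are found by executable name; `missing_tools` is a hypothetical helper and says nothing about how Vuesub performs its own detection.

```python
import shutil


def missing_tools(required):
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    # "uv" and "xcode-select" come from the install snippets above;
    # add any other executable names you need to check.
    for name in missing_tools(["uv", "xcode-select"]):
        print(f"missing: {name}")
```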
Download lazily on first use.
Universal. Plays everywhere.
Native Final Cut Pro captions. Frame-aligned.
HTML5 <track>. For self-hosted video.
Just the words. Feed an LLM, write a blog post.
SRT. SubRip — the universal timed-subtitle format. Plain text, numbered cues, HH:MM:SS,mmm ranges. Players: VLC, IINA, mpv, QuickTime, smart TVs. Upload: YouTube, Vimeo accept SRT directly. Editors: Premiere Pro, DaVinci Resolve, Final Cut Pro, CapCut, Camtasia. Burn-in: ffmpeg -vf subtitles=in.srt.
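The cue layout described above (a number, an HH:MM:SS,mmm range, then the text) is simple enough to generate by hand. A minimal sketch, assuming segments arrive as `(start_seconds, end_seconds, text)` tuples; the function names are illustrative and not part of Vuesub.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        cues.append(f"{index}\n{timing}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

For example, `to_srt([(0.0, 1.25, "Hello")])` produces a single cue numbered 1 with the range `00:00:00,000 --> 00:00:01,250`.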
FCPXML. A real Final Cut Pro project (FCPXML 1.11) where every segment is a native, restyleable <caption> on lane 1, not burnt-in titles. Frame-aligned to your chosen rate (1080p · 30 / 24 / 25 / 29.97 / 60). Drop-frame timecode for 29.97 and 59.94, non-drop-frame (NDF) for everything else. Import: File → Import → XML… → pick the .fcpxml.
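Frame alignment means each caption boundary falls on a whole number of frames. FCPXML writes times as rational seconds (for example `1001/30000s` for one frame at 29.97 fps), so exact fractions avoid rounding drift. A minimal sketch of that snapping, with hypothetical helper names; Vuesub's exporter may work differently.

```python
from fractions import Fraction

# Duration of one frame in seconds. 29.97 fps is really 30000/1001 fps.
FRAME_DURATIONS = {
    "24": Fraction(1, 24),
    "25": Fraction(1, 25),
    "29.97": Fraction(1001, 30000),
    "30": Fraction(1, 30),
    "60": Fraction(1, 60),
}


def snap_to_frame(seconds: float, rate: str) -> Fraction:
    """Round a time to the nearest whole frame at the given rate."""
    frame = FRAME_DURATIONS[rate]
    frames = round(Fraction(seconds) / frame)
    return frames * frame


def fcpxml_time(value: Fraction) -> str:
    """Render a time as a rational-seconds string such as '13/25s'."""
    if value.denominator == 1:
        return f"{value.numerator}s"
    return f"{value.numerator}/{value.denominator}s"
```

At 29.97 fps a cue starting at 1.0 s snaps to frame 30, which is 1.001 s, so the aligned start is slightly later than the transcript time.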
VTT. WebVTT for HTML5 <video> with a <track> element. Same shape as SRT, with . instead of , and a WEBVTT header. Pick this for self-hosted video on a website you control.
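Because the two formats differ only in the header and the millisecond separator, an SRT file can be converted to WebVTT in a few lines. A minimal sketch, assuming well-formed SRT input; `srt_to_vtt` is an illustrative name. Only timing lines are rewritten, so commas inside caption text are left alone.

```python
import re

# Matches an SRT timing line such as "00:00:01,250 --> 00:00:03,000".
_TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, use '.' in timestamps."""
    out = ["WEBVTT", ""]
    for line in srt_text.splitlines():
        match = _TIMING.match(line.strip())
        if match:
            start, start_ms, end, end_ms = match.groups()
            line = f"{start}.{start_ms} --> {end}.{end_ms}"
        out.append(line)
    return "\n".join(out).rstrip("\n") + "\n"
```

The SRT cue numbers are kept as they are, since WebVTT allows an optional identifier line before each cue.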
Plain text. Just the lines, one per segment. No timing. Use it for the content, not the timing — turning a podcast into a blog post, dropping a transcript into Notion, feeding text to an LLM, sharing notes.
Strip the quarantine attribute so Gatekeeper opens the app:
sudo xattr -rd com.apple.quarantine /Applications/Vuesub.app
This has the same practical effect as a Developer ID signature, done by hand: Gatekeeper stops blocking the app.