v0.5.0 · Voice Clone now playable

Subtitles & voice cloning,
on your machine.

No account, no telemetry · Offline by default · macOS · Windows · Linux
vuesub · interview-segment-3.wav
"and that's when the model figured it out"
"we shipped on a friday"
"no regrets"

Vuesub turns audio and video into accurate subtitles with Whisper, and clones a voice from a short 5–30 second clip with CosyVoice 3. It runs offline by default: no accounts, no telemetry, and no uploads required. Available for macOS, Windows, and Linux.

3.5× realtime voice synthesis after warmup
16 languages auto-detected for transcription
5–30s reference clip is all Voice Clone needs
0 accounts, telemetry, or uploads required
Two tools, one app

Caption or clone.

Switch with a click on the left rail.

Caption

Drop in audio, get a fully-edited subtitle track.

Voice Clone new

5–30s clip in. Any text, in that voice, out.

What's inside each mode →

Caption. Built for creators, podcasters, and anyone who needs accurate timed text. Editable transcript with draggable caption clips on a waveform; SRT, VTT, plain text, or FCPXML for Final Cut Pro; 16 languages auto-detected, including Mandarin and Cantonese; optional cloud route via OpenAI whisper-1.

Voice Clone. Powered by CosyVoice 3. It runs locally, so your reference clips never leave the machine. Zero-shot cloning, no fine-tuning. Nine languages: 中文 · English · 日本語 · 한국어 · Deutsch · Español · Français · Italiano · Русский. Whisper auto-transcribes the reference, and the transcript is editable inline. Lazy install: the model downloads only when you opt in.

Caption

A Final-Cut-style editor on top of Whisper.

Drag clips. Zoom the waveform. Round-trip every edit.

Editor

Per-pixel waveform. Draggable clips. Inline edits.

Languages & encodings

16 languages, auto-detected. Plays anything FFmpeg can read.

中文 · English · 日本語 · 한국어 · Español · Deutsch · Français · Italiano · Português · Русский · العربية · हिन्दी · Türkçe · Tiếng Việt · Bahasa · 粵語
interview-segment-3.wav · transcript
READY
Edits round-trip back to the timeline. ⌘Z to undo.
Editor shortcuts & details →
  • Waveform timeline with per-pixel detail at any zoom (⌘+ / ⌘− / ⌘0)
  • Draggable caption clips — body to move, edges to resize
  • Inline text edits with auto-grow textareas, ⌘Z undo
  • Insert / split / merge / delete — per-row + button (auto-fills the gap), in-row ✂ to split at the caret, ↑ to merge with the row above, ⌘J to join a multi-row range
  • Variable playback (1× / 1.25× / 1.5× / 2×) without changing pitch
  • Near-instant 简/繁 conversion for Chinese transcripts (OpenCC)
  • Plays anything: WebM with Opus + AV1, MKV, FLAC, .ogg via Web Audio fallback
  • Local mode never leaves your machine — useful for NDA'd content
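The split and merge operations listed above can be pictured with a small data model. This is only an illustrative sketch, not Vuesub's implementation: the `Segment` class, the proportional-by-character time split, and the space-joined merge are all assumptions made for the example (a space join would be wrong for Chinese text, for instance).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical caption row: start/end in seconds plus its text."""
    start: float
    end: float
    text: str


def split_at_caret(seg: Segment, caret: int) -> tuple:
    """Split one row in two at a character offset.

    Assumption: the cut time is placed proportionally to the caret
    position within the text, since no word timings are available here.
    """
    ratio = caret / len(seg.text) if seg.text else 0.5
    cut = seg.start + (seg.end - seg.start) * ratio
    left = Segment(seg.start, cut, seg.text[:caret].rstrip())
    right = Segment(cut, seg.end, seg.text[caret:].lstrip())
    return left, right


def merge(upper: Segment, lower: Segment) -> Segment:
    """Merge a row with the row above it, spanning both time ranges."""
    return Segment(upper.start, lower.end, f"{upper.text} {lower.text}".strip())
```

Splitting and then merging the same row returns the original text and time range, which is the property an undo-friendly editor would want.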
Voice Clone

Three steps to a cloned voice.

Upload, verify, generate.

01 — UPLOAD

Drop in a clean clip

5–30 seconds. WAV or MP3. Single speaker.
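If you want to check a clip before dropping it in, a few lines of Python can confirm the length. This is a minimal sketch that only handles uncompressed WAV files via the standard library; the function names are made up for the example, and the app's own validation may differ.

```python
import wave

MIN_SECONDS = 5.0   # lower bound quoted on this page
MAX_SECONDS = 30.0  # upper bound quoted on this page


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the 5-30 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```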

02 — VERIFY

Edit the auto-transcript

Whisper transcribes. One click fixes anything wrong.

03 — GENERATE

Type and hit play

~60s cold start. ~3.5× realtime warm.

Live demo · Click to play
Reference · 0:18 Steve Jobs (real)
Cloned · 0:13 Same voice, new text

Cloned entirely from the 18s reference — no fine-tuning, no training data, generated in seconds on a Mac.
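As a rough back-of-the-envelope check on "generated in seconds", the sketch below assumes that "3.5× realtime" means about 3.5 seconds of audio per second of wall-clock time, and that the roughly 60 second cold start is a flat one-time cost. Both readings are assumptions, and actual timings depend on hardware.

```python
def estimated_seconds(audio_seconds: float,
                      realtime_factor: float = 3.5,
                      cold_start_seconds: float = 0.0) -> float:
    """Estimate wall-clock time to synthesize a clip of the given length."""
    return cold_start_seconds + audio_seconds / realtime_factor


# The 13-second demo clip, warm versus cold (figures are estimates only).
warm = estimated_seconds(13)
cold = estimated_seconds(13, cold_start_seconds=60)
print(f"warm: ~{warm:.1f}s, cold: ~{cold:.1f}s")
```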

One-time setup & prerequisites →

The CosyVoice runtime is too heavy to bundle (~6.5 GB), so it installs on demand the first time you download a model. Vuesub detects what's missing and shows install snippets inline. You'll need three tools on your PATH first:

  • python 3.10–3.12 — CosyVoice's torch pin. brew install python@3.x (any x from 10 to 12)
  • uv — venv + dep installer, 10× faster than pip. curl -LsSf https://astral.sh/uv/install.sh | sh
  • git — shallow-clones the CosyVoice repo. xcode-select --install
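To check the three prerequisites yourself before opting in, something like the following works. It is a sketch, not what Vuesub runs internally: the executable name `python3` is an assumption (it may be `python` on Windows), and the helper names are hypothetical.

```python
import re
import shutil
import subprocess

# Executable names are assumptions; adjust for your platform.
REQUIRED_TOOLS = ("python3", "uv", "git")


def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_version_supported(version_text: str) -> bool:
    """Check 'Python 3.11.7'-style output against the 3.10-3.12 range."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return major == 3 and 10 <= minor <= 12


if __name__ == "__main__":
    print("missing:", missing_tools() or "nothing")
    if shutil.which("python3"):
        out = subprocess.run(["python3", "--version"],
                             capture_output=True, text=True, check=False)
        print("python ok:", python_version_supported(out.stdout or out.stderr))
```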

Heads up. Voice Clone is fully tested only on Apple Silicon Macs. Windows and Linux binaries are built in CI, but the runtime path hasn't been live-tested on those platforms. Please open an issue if anything breaks.

Whisper models

Pick the smallest that's accurate enough.

Download lazily on first use.

tiny · ~75 MB · ~1 GB RAM
base · ~145 MB · ~1 GB RAM
small · ~485 MB · ~2 GB RAM
medium · ~1.5 GB · ~5 GB RAM
large-v3 · ~3.1 GB · ~10 GB RAM
whisper-1 · cloud · OpenAI
Model download failing? Grab a pre-packaged model from our mirror and import via Settings → Models → Import: Google Drive mirror →
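One way to narrow the choice is to filter the local models by the memory you can spare, then try the smallest first. The figures below are the approximate sizes from this page; the function name and the idea of filtering by free RAM are illustrative assumptions, not a feature of the app.

```python
# (name, approx. download size in MB, approx. RAM in GB), as listed on this page
WHISPER_MODELS = [
    ("tiny", 75, 1),
    ("base", 145, 1),
    ("small", 485, 2),
    ("medium", 1500, 5),
    ("large-v3", 3100, 10),
]


def models_that_fit(free_ram_gb: float) -> list:
    """Return local model names whose approximate RAM need fits, smallest first."""
    return [name for name, _size_mb, ram_gb in WHISPER_MODELS
            if ram_gb <= free_ram_gb]


# With about 4 GB to spare, start with "tiny" and step up only if
# accuracy is not good enough.
print(models_that_fit(4))
```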
Export

Four formats. Every workflow.

.srt

SRT default

Universal. Plays everywhere.

.fcpxml

FCPXML

Native Final Cut Pro captions. Frame-aligned.

.vtt

VTT

HTML5 <track>. For self-hosted video.

.txt

Plain text

Just the words. Feed an LLM, write a blog post.

Format details & player support →

SRT. SubRip, the universal timed-subtitle format: plain text, numbered cues, HH:MM:SS,mmm ranges. Players: VLC, IINA, mpv, QuickTime, smart TVs. Upload: YouTube and Vimeo accept SRT directly. Editors: Premiere Pro, DaVinci Resolve, Final Cut Pro, CapCut, Camtasia. Burn-in: ffmpeg -i in.mp4 -vf subtitles=in.srt out.mp4.
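For readers who have not looked inside an SRT file, the sketch below shows the cue layout described above: a running number, an HH:MM:SS,mmm range, then the text. The `(start, end, text)` tuple shape is an assumption for the example, not Vuesub's internal format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (comma before the milliseconds)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(cues)  # blank line between cues


print(to_srt([(0.0, 1.2, "we shipped on a friday"), (1.2, 2.0, "no regrets")]))
```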

FCPXML. A real Final Cut Pro project (FCPXML 1.11) where every segment is a native, restyleable <caption> on lane 1 — not burnt-in titles. Frame-aligned to your chosen rate (1080p · 30 / 24 / 25 / 29.97 / 60). Drop-frame timecode for 29.97 and 59.94, NDF for everything else. Import: File → Import → XML… → pick the .fcpxml.
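Frame alignment matters because FCPXML expresses times as rational seconds (for example `1001/30000s` for one frame at 29.97), so a caption boundary has to land on a whole number of frames. The sketch below illustrates that rounding under stated assumptions: the table of frame durations and both helper names are made up for the example, and how Vuesub actually snaps times is not documented here.

```python
from fractions import Fraction

# One frame's duration as (numerator, denominator) in seconds.
FRAME_DURATION = {
    "24": (1, 24),
    "25": (1, 25),
    "29.97": (1001, 30000),
    "30": (1, 30),
    "60": (1, 60),
}


def snap_to_frame(seconds: float, rate: str) -> int:
    """Return the nearest whole frame count for a time in seconds."""
    num, den = FRAME_DURATION[rate]
    # Exact rational arithmetic avoids drift from float rounding.
    return round(Fraction(seconds) * den / num)


def fcpxml_time(frames: int, rate: str) -> str:
    """Format a frame count as an FCPXML-style rational time string."""
    num, den = FRAME_DURATION[rate]
    return f"{frames * num}/{den}s"


frames = snap_to_frame(1.0, "29.97")
print(frames, fcpxml_time(frames, "29.97"))
```

At 29.97, one second of wall-clock time snaps to frame 30, which is slightly more than one second. That small offset is why rational times are used instead of decimals.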

VTT. WebVTT for HTML5 <video> with a <track> element. Same shape as SRT, but with a WEBVTT header line and . instead of , before the milliseconds. Pick this for self-hosted video on a website you control.
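The two differences from SRT are small enough to show in code. The following is a minimal sketch of a converter, assuming well-formed SRT input. It only touches timing lines, so commas in the caption text are left alone, and it is not necessarily how Vuesub produces its VTT export.

```python
import re

# Matches HH:MM:SS,mmm so the comma can be swapped for a dot.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, fix the separators."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:  # timing line
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


print(srt_to_vtt("1\n00:00:00,000 --> 00:00:01,200\nno regrets, really\n"))
```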

Plain text. Just the lines, one per segment, with no timing. Use it when you need the words rather than the cues: turning a podcast into a blog post, dropping a transcript into Notion, feeding text to an LLM, sharing notes.

Setup

First launch & system requirements.

First launch on macOS

Strip the quarantine attribute so Gatekeeper opens the app:

sudo xattr -rd com.apple.quarantine /Applications/Vuesub.app

This manually clears the quarantine flag that Gatekeeper checks, which is the step a Developer ID signature would otherwise make unnecessary.

System requirements

  • macOS 11+ · Apple Silicon native
  • Windows 10 / 11 · x64 · WebView2
  • Linux · x64 Debian/Ubuntu · libwebkit2gtk-4.1
  • Caption: 8 GB RAM · Voice Clone: 16 GB+
  • Disk: ~1 GB app + 75 MB–3.1 GB models · +6.5 GB Voice Clone

Ready to ship subtitles?

Download once. No account. No upload. Runs offline.