Voice-Pro is an open-source tool that integrates speech recognition, translation, subtitle generation, AI dubbing, and voice cloning into a single local application. The core features need no cloud services and no API keys; they run on your machine. For video creators who need to localize content across languages, this replaces an entire pipeline of separate tools.
What It Does
Input: Video / Audio / YouTube URL
                  ↓
┌─────────────────────────────────────┐
│         Voice-Pro Pipeline          │
├─────────┬──────────┬────────────────┤
│ Speech  │ Translate│ Generate       │
│ Recog.  │ Text     │ New Audio      │
│         │          │                │
│ Whisper │ 100+     │ Voice clone    │
│ WhisperX│ languages│ (F5-TTS)       │
│ Faster- │          │ Edge-TTS       │
│ Whisper │          │ kokoro         │
├─────────┴──────────┴────────────────┤
│ + Vocal isolation (Demucs)          │
│ + Subtitle generation               │
│ + YouTube download & extraction     │
└─────────────────────────────────────┘
                  ↓
Output: Dubbed video in target language
        with cloned voice + synced subtitles
Key Features
| Feature | What It Does | Engine |
| --- | --- | --- |
| Speech recognition | Audio → text with timestamps | Whisper, Faster-Whisper, WhisperX |
| Translation | Text in 100+ languages | Built-in translation |
| Voice cloning | Clone any voice from a short sample | F5-TTS, E2-TTS, CosyVoice (zero-shot) |
| Text-to-speech | Generate natural speech | Edge-TTS (100+ languages), kokoro |
| Vocal isolation | Separate voice from background audio | Demucs |
| Subtitle generation | Word-level highlighted captions | Whisper-Timestamped |
| Video download | Download from YouTube and other platforms | Built-in |
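To make the link between timestamped recognition and subtitle generation concrete, here is a minimal sketch that turns recognizer segments into SRT text. It assumes segments arrive as dicts with `start`, `end` (in seconds) and `text` keys, the shape Whisper-family recognizers commonly return. The function names and sample data are hypothetical, and this is not Voice-Pro's own code.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Turn [{'start': s, 'end': s, 'text': str}, ...] into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each SRT cue: running number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Hypothetical recognizer output, made up for illustration.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.04, "text": " Today we look at dubbing."},
]
srt_text = segments_to_srt(example_segments)
```

Word-level highlighting would need per-word timestamps rather than per-segment ones, which this sketch does not cover.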
Web Interface Modules
Voice-Pro runs as a local web app (Gradio) with four main modules:
| Module | Purpose |
| --- | --- |
| Dubbing Studio | All-in-one: download → recognize → translate → dub |
| Whisper Caption | Subtitle-focused recognition with word-level highlighting |
| Translate | Real-time speech-to-text translation |
| Speech Generation | Podcast creation, multilingual audio generation |
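The Dubbing Studio flow (download → recognize → translate → dub) can be pictured as four stages chained together. The sketch below only illustrates that data flow: every name in it (`Segment`, `dub_pipeline`, the stand-in stage functions) is hypothetical, and the stages are stubs rather than real engines.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def dub_pipeline(
    source: str,
    download: Callable[[str], str],
    recognize: Callable[[str], list[Segment]],
    translate: Callable[[str], str],
    synthesize: Callable[[list[Segment]], str],
) -> str:
    """Chain the four stages; each stage is passed in so engines can be swapped."""
    media_path = download(source)
    segments = recognize(media_path)
    # Translate the text only; timestamps are kept so the dub can stay aligned.
    translated = [replace(seg, text=translate(seg.text)) for seg in segments]
    return synthesize(translated)


# Stand-in stages (hypothetical, for illustration only).
captured: list[Segment] = []


def fake_download(url: str) -> str:
    return "video.mp4"


def fake_recognize(path: str) -> list[Segment]:
    return [Segment(0.0, 1.5, "hello"), Segment(1.5, 3.0, "world")]


def fake_synthesize(segments: list[Segment]) -> str:
    captured.extend(segments)
    return "dubbed.wav"


output_path = dub_pipeline(
    "https://example.com/video", fake_download, fake_recognize, str.upper, fake_synthesize
)
```

Passing stages in as callables mirrors the way the tool offers several interchangeable engines per step, although how Voice-Pro wires them internally is not described here.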
Installation
# Windows
configure.bat # Install dependencies
start.bat # Launch web interface
# Mac / Linux
./configure.sh
./start.sh
Requirements
| Requirement | Minimum |
| --- | --- |
| OS | Windows 10/11, Linux, macOS |
| GPU | NVIDIA with CUDA 12.4 (recommended) |
| VRAM | 4GB+ |
| Storage | 20GB+ |
| Python | 3.10.15 |
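Before running the configure script, a few of these requirements can be checked with the Python standard library. The sketch below is an assumption-laden pre-flight check, not part of Voice-Pro: it treats any Python 3.10.x as acceptable, and it uses the presence of `nvidia-smi` on the PATH only as a hint that an NVIDIA driver is installed. It does not verify the CUDA version or the amount of VRAM.

```python
import shutil
import sys


def check_environment(path: str = ".", min_free_gb: float = 20.0) -> dict[str, bool]:
    """Return a pass/fail map for a few of the listed requirements."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        # The table lists Python 3.10.15; any 3.10.x counts as a match here (assumption).
        "python_3_10": sys.version_info[:2] == (3, 10),
        "storage_20gb": free_gb >= min_free_gb,
        # Finding nvidia-smi is only a hint that an NVIDIA driver is present.
        "nvidia_driver": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'not satisfied'}")
```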
Real-World Use Cases
| Who | How They Use It |
| --- | --- |
| Video creators | Localize content for international audiences without hiring voice actors |
| Educators | Translate lecture recordings into multiple languages with natural voice |
| Podcasters | Generate multilingual versions of episodes |
| Researchers | Transcribe and translate interview recordings |
| Content teams | Rapid dubbing for marketing videos across markets |
Voice Cloning: Power and Responsibility
Zero-shot voice cloning (cloning a voice from a short audio sample) is the most powerful, and the most ethically sensitive, feature. It can:
- Preserve a speaker's voice across translated content
- Create consistent narration in any language
- Potentially be misused for impersonation
For educators, this is an excellent case study for AI ethics discussions: the same technology that enables accessibility (translating lectures) also enables deepfakes.
How LearnAI Team Could Use This
- Create a localization workflow for course videos, tutorials, and short-form learning clips.
- Use the tool as a hands-on example of local-first AI media pipelines.
- Build an ethics lesson around consent, disclosure, and misuse risks in voice cloning.
See Also: MOSS-TTS-Nano, a Tiny TTS That Runs on CPU
MOSS-TTS-Nano (released April 2026) is a 0.1B-parameter open-source TTS model from MOSI.AI and the OpenMOSS team. Unlike Voice-Pro, which bundles a full pipeline, MOSS-TTS-Nano is a standalone text-to-speech engine designed for lightweight deployment.
| Feature | MOSS-TTS-Nano | Edge-TTS (in Voice-Pro) |
| --- | --- | --- |
| Parameters | 0.1B | Cloud-based |
| Runs on | CPU only (4 cores) | Requires internet |
| Languages | Chinese, English, Japanese, Korean, Arabic + more | 100+ languages |
| Audio quality | 48kHz stereo | 24kHz mono |
| Voice cloning | Yes (streaming) | No |
| Privacy | Fully local | Cloud (Microsoft) |
| Cost | Free | Free |
Why it matters: MOSS-TTS-Nano shows that high-quality multilingual TTS at 48kHz stereo is possible from a model small enough to run on a laptop CPU, with no GPU, no cloud, and no API key. For anyone building local-first voice pipelines (audiobooks, narration, accessibility), this is a compelling alternative to Edge-TTS when privacy matters or internet isn't available.
# Quick start
pip install moss-tts-nano
python -m moss_tts_nano.infer --text "Hello world" --output hello.wav
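One way to check the sample rate and channel count of a generated file such as `hello.wav` is to read its WAV header with Python's standard `wave` module. The sketch below assumes the output is an uncompressed PCM WAV file. The helper names are hypothetical, and the silent file it writes exists only so the example has something to inspect.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict[str, float]:
    """Read sample rate, channel count, bit depth, and duration from a WAV header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


def write_silence(path: str, seconds: float = 1.0, rate: int = 48_000, channels: int = 2) -> None:
    """Write 16-bit PCM silence, used here only to have a file to inspect."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(bytes(int(seconds * rate) * channels * 2))


# Demo on a generated file; point describe_wav at hello.wav to inspect real output.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_silence(demo_path)
info = describe_wav(demo_path)
print(info)
```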