Voice-Pro: Local AI Dubbing, Translation, and Voice Cloning in One Tool

Voice-Pro is an open-source tool that integrates speech recognition, translation, subtitle generation, AI dubbing, and voice cloning into a single local application. The core features need no API keys and run on your machine; the one cloud dependency noted in this article is the Edge-TTS engine. For video creators who need to localize content across languages, this replaces an entire pipeline of separate tools.

*Source: GitHub: abus-aikorea/voice-pro, 张岱橙 Weibo recommendation*

What It Does

Input: Video / Audio / YouTube URL
         ↓
┌──────────────────────────────────────┐
│          Voice-Pro Pipeline          │
├──────────┬───────────┬───────────────┤
│ Speech   │ Translate │ Generate      │
│ Recog.   │ Text      │ New Audio     │
│          │           │               │
│ Whisper  │ 100+      │ Voice clone   │
│ WhisperX │ languages │ (F5-TTS)      │
│ Faster-  │           │ Edge-TTS      │
│ Whisper  │           │ kokoro        │
├──────────┴───────────┴───────────────┤
│ + Vocal isolation (Demucs)           │
│ + Subtitle generation                │
│ + YouTube download & extraction      │
└──────────────────────────────────────┘
         ↓
Output: Dubbed video in target language
        with cloned voice + synced subtitles
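
The flow in the diagram (recognize, translate, generate audio) can be pictured as a chain of stages that each take and return timestamped segments. The sketch below is only an illustration of that shape, not Voice-Pro's actual code: `Segment`, `run_pipeline`, `fake_translate`, and `fake_synthesize` are hypothetical names, and the two stages are placeholders standing in for a real translator and a real TTS engine.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Segment:
    """One recognized utterance with its timing (hypothetical structure)."""
    start: float          # seconds from the beginning of the media
    end: float            # seconds from the beginning of the media
    text: str
    audio_path: str = ""  # filled in by the speech-generation stage


# A stage takes the full list of segments and returns a new list.
Stage = Callable[[List[Segment]], List[Segment]]


def run_pipeline(segments: List[Segment], stages: List[Stage]) -> List[Segment]:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        segments = stage(segments)
    return segments


def fake_translate(segments: List[Segment]) -> List[Segment]:
    """Placeholder: tags the text instead of calling a real translator."""
    return [replace(s, text=f"[es] {s.text}") for s in segments]


def fake_synthesize(segments: List[Segment]) -> List[Segment]:
    """Placeholder: assigns an output file name instead of generating audio."""
    return [replace(s, audio_path=f"seg_{i:03d}.wav") for i, s in enumerate(segments)]


if __name__ == "__main__":
    recognized = [
        Segment(0.0, 1.8, "Hello everyone"),
        Segment(1.8, 4.2, "Welcome back"),
    ]
    for seg in run_pipeline(recognized, [fake_translate, fake_synthesize]):
        print(seg)
```

Keeping the timestamps untouched through every stage is what lets the final step line up dubbed audio and subtitles with the original video.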

Key Features

| Feature | What It Does | Engine |
|---|---|---|
| Speech recognition | Audio → text with timestamps | Whisper, Faster-Whisper, WhisperX |
| Translation | Text in 100+ languages | Built-in translation |
| Voice cloning | Clone any voice from a short sample | F5-TTS, E2-TTS, CosyVoice (zero-shot) |
| Text-to-speech | Generate natural speech | Edge-TTS (100+ languages), kokoro |
| Vocal isolation | Separate voice from background audio | Demucs |
| Subtitle generation | Word-level highlighted captions | Whisper-Timestamped |
| Video download | Download from YouTube and other platforms | Built-in |
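
To make the link between "text with timestamps" and subtitle files concrete, here is a minimal sketch that turns `(start, end, text)` tuples into SubRip (SRT) text. It assumes segment-level timing only and does not reproduce Voice-Pro's word-level highlighting; `srt_timestamp` and `to_srt` are hypothetical helper names, not part of the tool.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text: a 1-based index, a time range, then the caption text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text.strip()}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 1.8, "Hello everyone"), (1.8, 4.2, "Welcome back")]))
```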

Web Interface Modules

Voice-Pro runs as a local web app (Gradio) with four main modules:

| Module | Purpose |
|---|---|
| Dubbing Studio | All-in-one: download → recognize → translate → dub |
| Whisper Caption | Subtitle-focused recognition with word-level highlighting |
| Translate | Real-time speech-to-text translation |
| Speech Generation | Podcast creation, multilingual audio generation |

Installation

# Windows
configure.bat    # Install dependencies
start.bat        # Launch web interface

# Mac / Linux
./configure.sh   # Install dependencies
./start.sh       # Launch web interface

Requirements

| Requirement | Minimum |
|---|---|
| OS | Windows 10/11, Linux, macOS |
| GPU | NVIDIA with CUDA 12.4 (recommended) |
| VRAM | 4 GB+ |
| Storage | 20 GB+ |
| Python | 3.10.15 |
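
Before running the installer, it can help to check a machine against the list above. The following is a rough sketch using only the Python standard library; `preflight` is a hypothetical helper, not something shipped with Voice-Pro. It checks the Python minor version and free disk space, and treats the presence of the `nvidia-smi` command as a hint that an NVIDIA driver is installed. It does not verify the CUDA version or the amount of VRAM.

```python
import shutil
import sys


def preflight(path: str = ".", min_free_gb: float = 20.0) -> dict:
    """Report a few basic facts about this machine (illustrative only)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # The requirements list Python 3.10.15; only major.minor is compared here.
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        # Finding nvidia-smi on PATH suggests a driver is present, nothing more.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```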

Real-World Use Cases

| Who | How They Use It |
|---|---|
| Video creators | Localize content for international audiences without hiring voice actors |
| Educators | Translate lecture recordings into multiple languages with natural voice |
| Podcasters | Generate multilingual versions of episodes |
| Researchers | Transcribe and translate interview recordings |
| Content teams | Rapid dubbing for marketing videos across markets |

Voice Cloning: Power and Responsibility

Zero-shot voice cloning (cloning a voice from a short audio sample) is the most powerful feature and also the most ethically sensitive one. It can:

  • Preserve a speaker's voice across translated content
  • Create consistent narration in any language
  • Potentially be misused for impersonation

For educators, this is an excellent case study for AI ethics discussions: the same technology that enables accessibility (translating lectures) also enables deepfakes.

How LearnAI Team Could Use This

  • Create a localization workflow for course videos, tutorials, and short-form learning clips.
  • Use the tool as a hands-on example of local-first AI media pipelines.
  • Build an ethics lesson around consent, disclosure, and misuse risks in voice cloning.

See Also: MOSS-TTS-Nano, a Tiny TTS That Runs on CPU

MOSS-TTS-Nano (released April 2026) is a 0.1B-parameter open-source TTS model from MOSI.AI and the OpenMOSS team. Unlike Voice-Pro, which bundles a full pipeline, MOSS-TTS-Nano is a standalone text-to-speech engine designed for lightweight deployment.

| Feature | MOSS-TTS-Nano | Edge-TTS (in Voice-Pro) |
|---|---|---|
| Parameters | 0.1B | Cloud-based |
| Runs on | CPU only (4 cores) | Requires internet |
| Languages | Chinese, English, Japanese, Korean, Arabic + more | 100+ languages |
| Audio quality | 48 kHz stereo | 24 kHz mono |
| Voice cloning | Yes (streaming) | No |
| Privacy | Fully local | Cloud (Microsoft) |
| Cost | Free | Free |

Why it matters: MOSS-TTS-Nano shows that high-quality multilingual TTS at 48 kHz stereo is possible from a model small enough to run on a laptop CPU, with no GPU, no cloud, and no API key. For anyone building local-first voice pipelines (audiobooks, narration, accessibility), this is a compelling alternative to Edge-TTS when privacy matters or internet isn't available.

# Quick start
pip install moss-tts-nano
python -m moss_tts_nano.infer --text "Hello world" --output hello.wav

*Source: GitHub Demo*
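
One way to sanity-check the audio-quality row in the comparison is to read the header of a generated file such as `hello.wav`. The sketch below uses Python's standard `wave` module; `wav_info` is a hypothetical helper name. It assumes the file is plain PCM WAV, since `wave` raises `wave.Error` on formats it does not support.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic format facts from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()      # samples per second, per channel
        channels = wav.getnchannels()  # 1 = mono, 2 = stereo
        width = wav.getsampwidth()     # bytes per sample
        frames = wav.getnframes()
    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "bits_per_sample": width * 8,
        "duration_s": frames / rate if rate else 0.0,
        # Uncompressed data rate: rate x channels x bytes per sample.
        "bytes_per_second": rate * channels * width,
    }


if __name__ == "__main__":
    # Assumes hello.wav exists, e.g. produced by the quick-start command.
    print(wav_info("hello.wav"))
```

Assuming 16-bit PCM in both cases, 48 kHz stereo works out to 48,000 × 2 × 2 = 192,000 bytes per second, four times the 24,000 × 1 × 2 = 48,000 bytes per second of 24 kHz mono.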