Voice-Pro — Local AI Dubbing, Translation, and Voice Cloning in One Tool

Voice-Pro is an open-source tool that integrates speech recognition, translation, subtitle generation, AI dubbing, and voice cloning into a single local application. The core features need no cloud services and no API keys, and they run on your machine; the exception noted in the comparison table later in this article is the Edge-TTS engine, which requires an internet connection. For video creators who need to localize content across languages, this replaces an entire pipeline of separate tools.

*Source: GitHub: abus-aikorea/voice-pro; 张岱橙 Weibo recommendation*

What It Does

Input: Video / Audio / YouTube URL
         ↓
┌─────────────────────────────────────┐
│         Voice-Pro Pipeline          │
├─────────┬──────────┬────────────────┤
│ Speech  │ Translate│ Generate       │
│ Recog.  │ Text     │ New Audio      │
│         │          │                │
│ Whisper │ 100+     │ Voice clone    │
│ WhisperX│ languages│ (F5-TTS)       │
│ Faster- │          │ Edge-TTS       │
│ Whisper │          │ kokoro         │
├─────────┴──────────┴────────────────┤
│ + Vocal isolation (Demucs)          │
│ + Subtitle generation               │
│ + YouTube download & extraction     │
└─────────────────────────────────────┘
         ↓
Output: Dubbed video in target language
        with cloned voice + synced subtitles
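
The stages in the diagram can be pictured as functions that hand timestamped segments to one another. The following is a minimal Python sketch under stated assumptions, not Voice-Pro's actual code: `Segment`, `run_pipeline`, and the three callables are hypothetical names introduced for illustration, and each callable stands in for whichever engine (Whisper, a translator, F5-TTS, and so on) is selected.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segment:
    """One recognized utterance with its position in the source audio."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Hypothetical stage signatures; real engines would be wrapped to fit them.
Recognizer = Callable[[str], List[Segment]]   # media path -> segments
Translator = Callable[[str, str], str]        # (text, target language) -> text
Synthesizer = Callable[[str], bytes]          # text -> audio bytes


def run_pipeline(
    media_path: str,
    target_lang: str,
    recognize: Recognizer,
    translate: Translator,
    synthesize: Synthesizer,
) -> List[Tuple[Segment, bytes]]:
    """Recognize speech, translate each segment, then synthesize new audio.

    Timestamps are carried over unchanged so that subtitles and dubbed
    audio can be aligned with the original video later.
    """
    results = []
    for seg in recognize(media_path):
        translated = replace(seg, text=translate(seg.text, target_lang))
        results.append((translated, synthesize(translated.text)))
    return results
```

Keeping the stages as plain callables mirrors the diagram's columns: swapping Whisper for Faster-Whisper, or Edge-TTS for a cloned voice, would change only the function passed in, not the flow.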

Key Features

Feature	What It Does	Engine
Speech recognition	Audio → text with timestamps	Whisper, Faster-Whisper, WhisperX
Translation	Translate text across 100+ languages	Built-in translation
Voice cloning	Clone any voice from a short sample	F5-TTS, E2-TTS, CosyVoice (zero-shot)
Text-to-speech	Generate natural speech	Edge-TTS (100+ languages), kokoro
Vocal isolation	Separate voice from background audio	Demucs
Subtitle generation	Word-level highlighted captions	Whisper-Timestamped
Video download	Download from YouTube and other platforms	Built-in
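
Recognition output with timestamps maps directly onto subtitle files. As a minimal sketch, assuming segments shaped like the `start`/`end`/`text` dictionaries that openai-whisper returns from `transcribe`, the following converts them to SubRip (SRT) text. `format_timestamp` and `segments_to_srt` are illustrative helper names, not part of Voice-Pro.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from an iterable of {'start', 'end', 'text'} dicts."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

Because translation keeps each segment's timestamps, the same function can produce subtitles in the source language and in every target language.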

Web Interface Modules

Voice-Pro runs as a local web app (Gradio) with four main modules:

Module	Purpose
Dubbing Studio	All-in-one: download → recognize → translate → dub
Whisper Caption	Subtitle-focused recognition with word-level highlighting
Translate	Real-time speech-to-text translation
Speech Generation	Podcast creation, multilingual audio generation
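
Word-level highlighting can be understood as emitting one caption cue per word, with the full line visible and the active word marked. The sketch below assumes word entries shaped like `{'word', 'start', 'end'}` (the shape of word-level timestamp output in the Whisper family) and uses `<u>` tags as the marker; `highlight_cues` is a hypothetical helper, and how Voice-Pro renders its highlights is not confirmed here.

```python
def highlight_cues(words, open_tag="<u>", close_tag="</u>"):
    """Yield (start, end, line) cues, one per word, with that word marked.

    `words` is a list of {'word', 'start', 'end'} dicts for a single
    caption line. Words are joined with spaces, so this simple version
    does not suit languages written without spaces.
    """
    tokens = [w["word"].strip() for w in words]
    for i, w in enumerate(words):
        marked = tokens[:i] + [f"{open_tag}{tokens[i]}{close_tag}"] + tokens[i + 1:]
        yield w["start"], w["end"], " ".join(marked)
```

Each yielded cue can then be written out with an ordinary subtitle formatter, using the word's own start and end times.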

Installation

# Windows
configure.bat    # Install dependencies
start.bat        # Launch web interface

# Mac / Linux
./configure.sh
./start.sh

Requirements

Requirement	Minimum / recommended
OS	Windows 10/11, Linux, macOS
GPU	NVIDIA with CUDA 12.4 (recommended)
VRAM	4GB+
Storage	20GB+
Python	3.10.15
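
Before running the installer, it can help to check a machine against the table above. This is a rough sketch using only the Python standard library; `preflight` is a hypothetical helper, the thresholds are copied from the table, and the check compares only the major and minor Python version. It does not verify the CUDA version or the amount of VRAM.

```python
import shutil
import sys


def preflight(path=".", min_free_gb=20, wanted_python=(3, 10)):
    """Check the local machine against the requirements table.

    Returns a dict of booleans, one per check.
    """
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_matches": tuple(sys.version_info[:2]) == tuple(wanted_python),
        "enough_disk": free_gb >= min_free_gb,
        # nvidia-smi on PATH is only a hint that an NVIDIA driver is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
```

Calling `preflight()` from the folder where Voice-Pro will be installed reports free space on that drive; a `False` for `nvidia_smi_found` suggests no NVIDIA driver is visible, not that the tool cannot run at all.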

Real-World Use Cases

Who	How They Use It
Video creators	Localize content for international audiences without hiring voice actors
Educators	Translate lecture recordings into multiple languages with natural voice
Podcasters	Generate multilingual versions of episodes
Researchers	Transcribe and translate interview recordings
Content teams	Rapid dubbing for marketing videos across markets

Voice Cloning: Power and Responsibility

Zero-shot voice cloning (cloning a voice from a short audio sample) is the tool's most powerful and most ethically sensitive feature. It can:

Preserve a speaker’s voice across translated content
Create consistent narration in any language
Potentially be misused for impersonation

For educators: this is an excellent case study for AI ethics discussions — the same technology that enables accessibility (translating lectures) also enables deepfakes.

How the LearnAI Team Could Use This

Create a localization workflow for course videos, tutorials, and short-form learning clips.
Use the tool as a hands-on example of local-first AI media pipelines.
Build an ethics lesson around consent, disclosure, and misuse risks in voice cloning.

See Also: MOSS-TTS-Nano — Tiny TTS That Runs on CPU

MOSS-TTS-Nano (released April 2026) is a 0.1B-parameter open-source TTS model from MOSI.AI and the OpenMOSS team. Unlike Voice-Pro, which bundles a full pipeline, MOSS-TTS-Nano is a standalone text-to-speech engine designed for lightweight deployment.

Feature	MOSS-TTS-Nano	Edge-TTS (in Voice-Pro)
Parameters	0.1B	Not applicable (cloud service)
Runs on	CPU, no GPU needed (4 cores)	Requires internet
Languages	Chinese, English, Japanese, Korean, Arabic + more	100+ languages
Audio quality	48 kHz stereo	24 kHz mono
Voice cloning	Yes (streaming)	No
Privacy	Fully local	Cloud (Microsoft)
Cost	Free	Free

Why it matters: MOSS-TTS-Nano shows that high-quality multilingual TTS at 48 kHz stereo is possible from a model small enough to run on a laptop CPU, with no GPU, no cloud, and no API key. For anyone building local-first voice pipelines (audiobooks, narration, accessibility), this is a compelling alternative to Edge-TTS when privacy matters or internet access isn’t available.

# Quick start
pip install moss-tts-nano
python -m moss_tts_nano.infer --text "Hello world" --output hello.wav
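
To confirm that a generated file such as `hello.wav` matches the 48 kHz stereo figure in the table above, the header can be inspected with Python's standard `wave` module. This is a small sketch; `describe_wav` is an illustrative helper name, and the `wave` module reads only uncompressed PCM WAV files, so other encodings would raise an error.

```python
import wave


def describe_wav(path):
    """Return sample rate, channel count, bit depth, and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate if rate else 0.0,
        }
```

For a 48 kHz stereo file, `describe_wav("hello.wav")` would report `sample_rate_hz` of 48000 and `channels` of 2; a 24 kHz mono file would report 24000 and 1.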

*Source: GitHub, Demo*