Zonos-v0.1 is an open-weight text-to-speech (TTS) model focused on expressive, natural-sounding speech. Given a text prompt and either a speaker embedding or an audio prefix, it generates speech and can clone a voice from just 5 to 30 seconds of reference audio. It also offers control over speech characteristics, including speaking rate, pitch variation, and emotional tones such as happiness and anger, and produces output at 44 kHz.
Zonos was trained on approximately 200,000 hours of predominantly English speech data. Input text is first normalized and phonemized, and the model then predicts DAC (Descript Audio Codec) tokens using either a transformer or a hybrid architecture. It supports multiple languages, including English, Japanese, Chinese, French, and German, and comes with a web interface for generating speech.
Zonos-v0.1 is fast, running at a real-time factor of roughly 2x on an RTX 4090, and is simple to install and deploy. Combined with its voice cloning and fine-grained speech controls, this makes it a practical option for a wide range of text-to-speech applications.
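To make the speed figure concrete, here is a minimal sketch of the arithmetic. It assumes that "real-time factor of ~2x" means the model produces about two seconds of audio per second of compute, which is an interpretation rather than something the announcement spells out. The function name and its parameters are hypothetical and are not part of Zonos.

```python
def estimated_generation_seconds(audio_seconds: float, speedup: float = 2.0) -> float:
    """Estimate wall-clock time needed to synthesize a clip.

    Assumption: `speedup` is seconds of audio produced per second of
    compute (2.0 corresponds to the ~2x figure quoted for an RTX 4090).
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # One minute of speech at roughly 2x real time.
    print(f"{estimated_generation_seconds(60.0):.1f} s")  # prints "30.0 s"
```

Under this reading, a one-minute clip would take roughly 30 seconds to generate on the quoted hardware. Actual timings depend on the GPU, the model variant, and the length of the prompt.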