AI has drastically altered the way people go about their daily lives. Voice recognition has simplified activities such as taking notes and dictating documents, and its speed and efficiency are what make it so popular. With the progress made in AI, many voice recognition applications have been created: Google Assistant, Alexa, and Siri are a few examples of virtual assistants that use voice recognition software to communicate with users. Additionally, text-to-speech, speech-to-text, and text-to-text systems have been widely adopted across a range of applications.
Creating human-level speech is essential for Artificial Intelligence (AI), especially when it comes to chatbots. Recent advances in deep learning have drastically improved the quality of synthesized speech produced by neural Text-to-Speech (TTS) systems. However, most of the data used to train these systems has been limited to recordings from controlled environments, such as reading aloud or performing a script. Human beings, on the other hand, speak spontaneously with varied prosody that conveys paralinguistic information, such as subtle emotions, an ability acquired through long exposure to real-world speech.
Here are three new approaches that will dramatically improve text-to-speech.
Multi-codebook vector quantized TTS
Researchers at Carnegie Mellon University have developed an artificial intelligence (AI) system that can be trained to generate text-to-speech with a wide range of voices. To do this, they worked with real speech taken from YouTube videos and podcasts. By using existing recordings rather than collecting studio data, they could focus on the text-to-speech model itself, and they hope this approach will replicate the success of large language models such as GPT-3, which are trained on vast amounts of found text.
With limited resources, these systems can be tailored to particular speaker qualities or recording conditions. Their paper examines the new challenges that arise when training TTS systems on real-world speech, such as increased prosodic variance and background noise that are not found in speech recorded in controlled environments. The authors show that mel-spectrogram-based autoregressive models cannot maintain accurate text-audio alignment when applied to real-world speech, resulting in distorted output. They attribute this inference-time alignment failure to errors that accumulate during decoding, since they also demonstrate that precise alignments can be learned during the training phase.
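To see why decoding errors accumulate, here is a toy sketch, written in PyTorch and not taken from the authors' code, of the mismatch between teacher-forced training and free-running inference; the single linear "decoder step" and the random "mel" frames are purely illustrative assumptions.

```python
# Teacher forcing: every step is conditioned on the true previous frame.
# Free running (inference): every step is conditioned on the model's own
# previous prediction, so small errors compound frame by frame.
import torch

torch.manual_seed(0)
decoder_step = torch.nn.Linear(80, 80)   # stand-in for one decoder step over 80 mel bins
ground_truth = torch.randn(100, 80)      # pretend mel-spectrogram with 100 frames

with torch.no_grad():
    # Training regime: predictions made from the true previous frames.
    teacher_forced = torch.stack([decoder_step(frame) for frame in ground_truth[:-1]])

    # Inference regime: predictions fed back into the model.
    frame, free_running = ground_truth[0], []
    for _ in range(ground_truth.shape[0] - 1):
        frame = decoder_step(frame)
        free_running.append(frame)
    free_running = torch.stack(free_running)

# The gap between the two regimes starts at zero and widens as decoding proceeds.
drift = (teacher_forced - free_running).norm(dim=-1)
print(drift[0].item(), drift[-1].item())
```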
The researchers found that replacing the mel-spectrogram with a learned discrete codebook could solve the problem, because discrete representations are more resistant to input noise. However, their experiments showed that a single codebook still produced distorted speech even when the codebook was increased in size; spontaneous speech appears to contain too many prosody patterns for a single codebook to capture. They therefore used multiple codebooks and designed architectures for multi-code sampling and monotonic alignment. During inference, a pure-silence audio prompt is used to ensure that the model produces clean speech despite being trained on a noisy corpus.
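The exact codebook design lives in the paper and its released code; as a rough sketch of the general multi-codebook idea only, here is a residual-style quantizer in PyTorch in which each codebook encodes what the previous ones missed. The codebook count, sizes, and the residual scheme itself are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

num_codebooks, codebook_size, dim = 4, 160, 64
# Random codebooks purely for illustration; in a real system they are learned
# jointly with a speech encoder.
codebooks = [torch.randn(codebook_size, dim) for _ in range(num_codebooks)]

def quantize(frames):
    """Map continuous speech frames (T, dim) to one code index per codebook."""
    residual, codes = frames, []
    for book in codebooks:
        # Each codebook picks the nearest code word to what is left over after
        # the previous codebooks have had their turn.
        idx = torch.cdist(residual, book).argmin(dim=-1)   # (T,)
        codes.append(idx)
        residual = residual - book[idx]
    return torch.stack(codes, dim=-1)                      # (T, num_codebooks)

frames = torch.randn(50, dim)        # 50 pretend encoder frames
codes = quantize(frames)
print(codes.shape)                   # torch.Size([50, 4]): four code indices per frame
```

The appeal of the multi-codebook layout is combinatorial: four codebooks of 160 entries can jointly distinguish 160^4 patterns, far more than any single codebook of practical size, which matches the observation that one codebook cannot cover the prosodic variety of spontaneous speech.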
In their paper, the authors present the resulting system, MQTTS (multi-codebook vector-quantized TTS). To assess its potential for real-world voice synthesis, they compare it against mel-spectrogram-based systems in Section 5 and carry out an ablation analysis. They also compare MQTTS with non-autoregressive models, finding that it delivers better intelligibility and speaker transferability as well as greater prosody diversity and naturalness, though the non-autoregressive models offer faster inference and higher robustness. Furthermore, conditioned on a clean, silent prompt, MQTTS can produce output with a higher signal-to-noise ratio (SNR). The authors make their source code available on GitHub for public use.
SpeechT5 in Hugging Face Transformers
The highly successful T5 (Text-To-Text Transfer Transformer) was the inspiration for the SpeechT5 framework, a unified model that uses encoder-decoder pre-training for self-supervised learning of speech and text representations. SpeechT5 has now been added to Hugging Face Transformers, an open-source library offering easy access to state-of-the-art machine learning models, and it is the library's first text-to-speech model.
SpeechT5 uses a conventional encoder-decoder design to build joint contextual representations for both speech and text. It comes as three distinct models: text-to-speech (for synthesizing audio from text), speech-to-text (for automatic speech recognition), and speech-to-speech (for speech enhancement or voice conversion).
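For orientation, the three variants map onto three model classes and checkpoints in Transformers; this is only a loading sketch, with no inference shown yet.

```python
from transformers import (
    SpeechT5ForTextToSpeech,    # text-to-speech
    SpeechT5ForSpeechToText,    # automatic speech recognition
    SpeechT5ForSpeechToSpeech,  # voice conversion / speech enhancement
)

tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
asr = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")
vc = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
```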
The core concept of SpeechT5 is to pre-train a single model on a combination of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data, which encourages the model to learn from both speech and written text. At the base of SpeechT5 is a standard Transformer encoder-decoder, which, like any other Transformer, performs sequence transformations over hidden representations. Pre-nets and post-nets make the same Transformer suitable for both text and audio: the pre-nets convert text or speech input into the Transformer's hidden representations, while the post-nets convert the Transformer's outputs back into text or speech. To train the model across these modalities, the team feeds it text or speech as input and has it produce the corresponding output as text or speech.
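Putting the pieces together for the text-to-speech case, here is a minimal sketch using the public Transformers API: the processor handles the text tokenization feeding the pre-net, the model's post-net emits a mel-spectrogram, and a separate HiFi-GAN vocoder turns that into a waveform. The zero speaker embedding is only a placeholder assumption (real usage loads a 512-dimensional x-vector, for example from the CMU ARCTIC speaker embeddings), so the audio it yields will not sound like any particular speaker.

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Text-to-speech is getting better every year.", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))  # placeholder x-vector, see note above

# generate_speech runs the Transformer plus post-net, then the vocoder.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speecht5_demo.wav", speech.numpy(), samplerate=16000)
```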
SpeechT5 stands out from other models because multiple tasks can be handled with one architecture simply by swapping the pre-nets and post-nets. The model has been fine-tuned for a variety of tasks, and the experiments show that it outperforms all baseline models on a number of spoken language processing tasks. To improve the model further, the researchers plan to pre-train SpeechT5 with a larger model and more unlabeled data. They are also exploring ways to extend the framework to spoken language processing tasks in multiple languages.
VALL-E by Microsoft
Microsoft has developed a revolutionary language model for text-to-speech synthesis (TTS) known as VALL-E. The system uses audio codec codes as intermediate representations and is capable of replicating someone's voice from only three seconds of audio input. VALL-E is a neural codec language model that tokenizes speech and then generates waveforms that sound like the speaker, even replicating their unique timbre and emotional tone. As stated in the research paper, VALL-E can produce high-quality personalized speech with just a three-second sample of the speaker's voice, without additional structural engineering, pre-designed acoustic features, or fine-tuning, and it supports in-context learning and prompt-based zero-shot TTS. Demonstration audio clips accompany the paper: one sample is the three-second prompt that VALL-E must imitate, another is a previously recorded phrase by the same speaker (the "ground truth"), the "baseline" sample is a typical text-to-speech synthesis example, and the "VALL-E" sample is the output of the VALL-E model.
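VALL-E itself has not been publicly released, but the neural audio codec it builds on, EnCodec, is available in Transformers. The hedged sketch below shows only the step the description above hinges on, turning a short waveform into discrete codec codes that a language model can then treat like text tokens; the random waveform stands in for a real three-second prompt.

```python
import numpy as np
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# Three seconds of placeholder 24 kHz audio; a real prompt would be a recording.
waveform = np.random.randn(24_000 * 3).astype(np.float32)
inputs = processor(raw_audio=waveform, sampling_rate=24_000, return_tensors="pt")

# Encode the waveform into stacks of discrete codes, one sequence per codebook.
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)
```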