Emilia: A Large-Scale Multilingual Speech Dataset for Enhanced Speech Generation

A Milestone for Speech Generation: Emilia, an Extensive Multilingual Dataset
The quality of a speech generation model depends heavily on its training data. Existing models have been trained primarily on audiobooks, and they reach their limits when asked to reproduce the spontaneity and variability of human speech. Audiobooks offer clear, easily understandable speech, but they mostly represent a formal reading style and do not reflect the diversity of everyday conversation.
Emilia, an extensive multilingual dataset for speech generation, was developed to close this gap. It is built from so-called "in-the-wild" data, which captures spontaneous human speech in real-world contexts. The data was extracted and processed with Emilia-Pipe, an open-source preprocessing pipeline designed to produce high-quality training data.
The dataset comprises over 101,000 hours of speech in six languages: English, Chinese, German, French, Japanese, and Korean. An extended version, Emilia-Large, contains over 216,000 hours, making it one of the largest openly available datasets for speech generation.
From Audiobooks to "In-the-Wild" Data: A Paradigm Shift
The focus on "in-the-wild" data represents a paradigm shift in speech generation research. In contrast to the controlled recording conditions of audiobooks, this data reflects the natural variability of human speech, including different speaking styles, accents, emotions, and background noise.
Emilia-Pipe is crucial to the quality of the dataset. The pipeline filters out irrelevant material, removes background noise, and keeps the processed data consistent.
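The filtering idea can be illustrated with a toy example. The following is a minimal sketch, not Emilia-Pipe's actual code: the Segment fields, the language codes, the stage functions, and every threshold are hypothetical and chosen only for illustration. Each stage drops the segments that fail one criterion, and the stages are applied in sequence.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Segment:
    """One candidate speech segment (all fields are hypothetical)."""
    text: str
    language: str
    duration_s: float
    quality: float  # illustrative quality score in [0, 1], higher is cleaner


# A stage takes segments and returns the ones that survive it.
Stage = Callable[[Iterable[Segment]], List[Segment]]


def keep_languages(allowed: Set[str]) -> Stage:
    """Keep only segments whose language code is in `allowed`."""
    def stage(segments: Iterable[Segment]) -> List[Segment]:
        return [s for s in segments if s.language in allowed]
    return stage


def keep_duration(min_s: float, max_s: float) -> Stage:
    """Keep only segments whose duration lies within [min_s, max_s]."""
    def stage(segments: Iterable[Segment]) -> List[Segment]:
        return [s for s in segments if min_s <= s.duration_s <= max_s]
    return stage


def keep_quality(threshold: float) -> Stage:
    """Keep only segments whose quality score is at least `threshold`."""
    def stage(segments: Iterable[Segment]) -> List[Segment]:
        return [s for s in segments if s.quality >= threshold]
    return stage


def run_pipeline(segments: Iterable[Segment], stages: List[Stage]) -> List[Segment]:
    """Apply each stage in order and return the surviving segments."""
    result = list(segments)
    for stage in stages:
        result = stage(result)
    return result


candidates = [
    Segment("hello there", "en", 5.2, 0.91),
    Segment("too short", "en", 0.8, 0.95),
    Segment("noisy clip", "de", 7.0, 0.30),
    Segment("hola", "es", 6.0, 0.90),
    Segment("bonjour", "fr", 12.5, 0.88),
]

# Assumed language codes for the six languages; thresholds are made up.
stages = [
    keep_languages({"en", "zh", "de", "fr", "ja", "ko"}),
    keep_duration(3.0, 30.0),
    keep_quality(0.8),
]

kept = run_pipeline(candidates, stages)
print([s.text for s in kept])  # ['hello there', 'bonjour']
```

Writing each criterion as its own small function keeps the stages independent, so a threshold can be changed or a stage removed without touching the rest.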
Emilia in Comparison: Convincing Results
According to the reported experiments, models trained on Emilia achieve significantly better results than models trained on traditional audiobook datasets. The generated speech sounds more spontaneous and human-like, with a greater variety of vocal timbres and speaking styles.
The results underscore the importance of large datasets for progress in speech generation and demonstrate the effectiveness of Emilia for both multilingual and cross-lingual applications.
The Future of Speech Generation: Scalability and Diversity
Emilia and Emilia-Large are important steps towards more realistic and diverse speech generation. The size and linguistic diversity of the dataset enable the development of models that can capture the nuances of human speech in different contexts and cultures.
The open-source nature of Emilia and Emilia-Pipe also promotes collaboration and exchange within the research community and accelerates progress in this field.