Dataset:

gigant/romanian_speech_synthesis_0_8_1

Dataset Summary

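The Romanian speech synthesis (RSS) corpus was recorded in a hemianechoic chamber (anechoic walls and ceiling; floor partially anechoic) at the University of Edinburgh. We used three high-quality studio microphones: a Neumann U89i (large-diaphragm condenser), a Sennheiser MKH 800 (small-diaphragm condenser with very wide bandwidth), and a DPA 4035 (headset-mounted condenser). The current release includes only the speech data recorded with the Sennheiser MKH 800, but we may release the data recorded with the other microphones in the future. All recordings were made at a 96 kHz sampling frequency with 24 bits per sample, then downsampled to 48 kHz. We used ProTools HD hardware and software for recording, downsampling, and bit rate conversion. We conducted 8 sessions over the course of a month, recording about 500 sentences in each session. At the start of each session, the speaker listened to a previously recorded sample in order to attain a similar voice quality and intonation.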
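As a rough illustration of these figures, the sketch below computes how many samples a clip occupies at each sampling frequency and the raw PCM data rate of the 24-bit, 96 kHz recording format. The 5-second clip length and the single-channel layout are assumptions made only for the arithmetic. The card does not state the bit depth of the released files, so only the recording format is sized in bytes.

```python
# Arithmetic sketch for the recording format described above.
# Assumptions (not from the dataset card): mono audio, a 5-second clip.

RECORDING_RATE_HZ = 96_000   # original sampling frequency
RELEASE_RATE_HZ = 48_000     # sampling frequency after downsampling
RECORDING_BITS = 24          # bits per sample at recording time


def num_samples(duration_s: float, rate_hz: int) -> int:
    """Number of samples needed to store duration_s seconds at rate_hz."""
    return round(duration_s * rate_hz)


def pcm_bytes_per_second(rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return rate_hz * (bits_per_sample // 8) * channels


clip_s = 5.0  # hypothetical clip duration
recorded = num_samples(clip_s, RECORDING_RATE_HZ)
released = num_samples(clip_s, RELEASE_RATE_HZ)
decimation_factor = RECORDING_RATE_HZ // RELEASE_RATE_HZ

print(recorded, released, decimation_factor)  # 480000 240000 2
print(pcm_bytes_per_second(RECORDING_RATE_HZ, RECORDING_BITS))  # 288000
```

Downsampling from 96 kHz to 48 kHz halves the number of samples per clip.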

Languages

Romanian

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file, stored in the audio field, and the corresponding sentence.

Data Fields

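  • audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that accessing the audio column via dataset[0]["audio"] automatically decodes the audio file and resamples it to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].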

  • sentence: The sentence the speaker was prompted to read.
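The access-order advice for the audio field can be illustrated with a toy model. This is a minimal sketch of the behavior the note describes, not the datasets library's implementation. The class LazyAudioTable, its methods, and the placeholder file names are hypothetical.

```python
# Toy model of lazy audio decoding; NOT the datasets library's implementation.
# LazyAudioTable, row(), column() and the clip names are hypothetical.

class LazyAudioTable:
    """Stores encoded clips and decodes them only when they are accessed."""

    def __init__(self, encoded_clips):
        self._encoded = list(encoded_clips)
        self.decode_calls = 0  # counts how many clips have been decoded

    def _decode(self, clip):
        # Stand-in for real decoding and resampling work.
        self.decode_calls += 1
        return {"path": clip, "array": [0.0], "sampling_rate": 48_000}

    def row(self, index):
        """Row-first access, like dataset[0]["audio"]: decodes one clip."""
        return {"audio": self._decode(self._encoded[index])}

    def column(self, name):
        """Column-first access, like dataset["audio"][0]: decodes every clip."""
        if name != "audio":
            raise KeyError(name)
        return [self._decode(clip) for clip in self._encoded]


table = LazyAudioTable(f"clip_{i}.wav" for i in range(3180))

table.row(0)["audio"]
row_first_cost = table.decode_calls       # 1 clip decoded

table.decode_calls = 0
table.column("audio")[0]
column_first_cost = table.decode_calls    # all 3180 clips decoded

print(row_first_cost, column_first_cost)
```

In this model, indexing the row first decodes a single clip, while materializing the column first decodes every clip before one is selected.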

Data Splits

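The speech material has been subdivided into a train split and a test split. The train split consists of 3180 audio clips and the related sentences. The test split consists of 536 audio clips and the related sentences.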
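A minimal sketch of how the split sizes might be checked follows. It assumes that the datasets package is installed, that the Hub is reachable, and that the splits are named "train" and "test". The helper names split_fractions and load_split_sizes are illustrative, and depending on the library version extra loading arguments may be needed.

```python
# Sketch: compare the hosted split sizes with the figures stated above.
# Assumes the `datasets` package and network access for load_split_sizes();
# the split names "train" and "test" are assumed from the description.

EXPECTED_SIZES = {"train": 3180, "test": 536}


def split_fractions(sizes):
    """Return each split's share of the total number of clips."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_split_sizes(repo_id="gigant/romanian_speech_synthesis_0_8_1"):
    """Download the dataset and return the number of rows per split."""
    from datasets import load_dataset  # imported lazily; needs network access

    dataset = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset.items()}


if __name__ == "__main__":
    print(split_fractions(EXPECTED_SIZES))
    # Uncomment to verify against the hosted data:
    # assert load_split_sizes() == EXPECTED_SIZES
```

With the stated sizes, the train split holds roughly 86% of the 3716 clips and the test split roughly 14%.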

Citation Information

@article{Stan2011442,
  author = {Adriana Stan and Junichi Yamagishi and Simon King and
                   Matthew Aylett},
  title = {The {R}omanian speech synthesis ({RSS}) corpus:
                   Building a high quality {HMM}-based speech synthesis
                   system using a high sampling rate},
  journal = {Speech Communication},
  volume = {53},
  number = {3},
  pages = {442--450},
  note = {},
  abstract = {This paper first introduces a newly-recorded high
                   quality Romanian speech corpus designed for speech
                   synthesis, called ''RSS'', along with Romanian
                   front-end text processing modules and HMM-based
                   synthetic voices built from the corpus. All of these
                   are now freely available for academic use in order to
                   promote Romanian speech technology research. The RSS
                   corpus comprises 3500 training sentences and 500 test
                   sentences uttered by a female speaker and was recorded
                   using multiple microphones at 96 kHz sampling
                   frequency in a hemianechoic chamber. The details of the
                   new Romanian text processor we have developed are also
                   given. Using the database, we then revisit some basic
                   configuration choices of speech synthesis, such as
                   waveform sampling frequency and auditory frequency
                   warping scale, with the aim of improving speaker
                   similarity, which is an acknowledged weakness of
                   current HMM-based speech synthesisers. As we
                   demonstrate using perceptual tests, these configuration
                   choices can make substantial differences to the quality
                   of the synthetic speech. Contrary to common practice in
                   automatic speech recognition, higher waveform sampling
                   frequencies can offer enhanced feature extraction and
                   improved speaker similarity for HMM-based speech
                   synthesis.},
  doi = {10.1016/j.specom.2010.12.002},
  issn = {0167-6393},
  keywords = {Speech synthesis, HTS, Romanian, HMMs, Sampling
                   frequency, Auditory scale},
  url = {http://www.sciencedirect.com/science/article/pii/S0167639310002074},
  year = 2011
}

Contributions

@gigant added this dataset.