Dataset:

common_language

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

crowdsourced

Annotation creators:

crowdsourced

License:

cc-by-4.0

Dataset Card for common_language

Dataset Summary

This dataset is composed of speech recordings from languages that were carefully selected from the CommonVoice database. The total duration of the audio recordings is 45.1 hours (i.e., roughly 1 hour of material for each of the 45 languages). The dataset was extracted from CommonVoice to train language identification (language-id) systems.

Supported Tasks and Leaderboards

The baselines for language-id are available in the SpeechBrain toolkit (see recipes/CommonLanguage): https://github.com/speechbrain/speechbrain

Languages

List of included languages:

Arabic, Basque, Breton, Catalan, Chinese_China, Chinese_Hongkong, Chinese_Taiwan, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Hakha_Chin, Indonesian, Interlingua, Italian, Japanese, Kabyle, Kinyarwanda, Kyrgyz, Latvian, Maltese, Mongolian, Persian, Polish, Portuguese, Romanian, Romansh_Sursilvan, Russian, Sakha, Slovenian, Spanish, Swedish, Tamil, Tatar, Turkish, Ukrainian, Welsh

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file and its language label. Additional fields include age, client_id, gender and sentence.

{
  'client_id': 'itln_trn_sp_175',
  'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
  'audio': {'path': '/path/common_voice_kpd/Italian/train/itln_trn_sp_175/common_voice_it_18279446.wav',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 48000},
  'sentence': 'Con gli studenti è leggermente simile.',
  'age': 'not_defined',
  'gender': 'not_defined',
  'language': 22
}
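The language field in the instance above is an integer class index rather than a name. The following is a minimal sketch of mapping it back to a language name. It assumes that class indices follow the alphabetical order of the language names, which is consistent with the sample above (Italian is 22) but is not stated explicitly in this card. The LANGUAGES list and the language_name helper are illustrative names, not part of the dataset.

```python
# Assumption: class indices follow the alphabetical order of the label names.
# Two names use the dataset's own spelling ("Mangolian", "Ukranian"), as noted
# under Other Known Limitations.
LANGUAGES = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China",
    "Chinese_Hongkong", "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi",
    "Dutch", "English", "Esperanto", "Estonian", "French", "Frisian",
    "Georgian", "German", "Greek", "Hakha_Chin", "Indonesian", "Interlingua",
    "Italian", "Japanese", "Kabyle", "Kinyarwanda", "Kyrgyz", "Latvian",
    "Maltese", "Mangolian", "Persian", "Polish", "Portuguese", "Romanian",
    "Romansh_Sursilvan", "Russian", "Sakha", "Slovenian", "Spanish",
    "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh",
]


def language_name(label: int) -> str:
    """Return the language name for an integer class index (hypothetical helper)."""
    return LANGUAGES[label]


# A trimmed copy of the sample shown above.
sample = {"sentence": "Con gli studenti è leggermente simile.", "language": 22}
print(language_name(sample["language"]))
```

When the dataset is loaded through a library that exposes the ClassLabel feature, the names stored in that feature should be preferred over a hard-coded list like this one.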

Data Fields

client_id (string): An ID identifying the client (voice) that made the recording.

path (string): The path to the audio file.

audio (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, as in dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so it is important to query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].

language (ClassLabel): The language of the recording (see the Languages section above).

sentence (string): The sentence the user was prompted to speak.

age (string): The age of the speaker.

gender (string): The gender of the speaker.
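The advice on access order for the audio field can be illustrated with a toy model. The sketch below is not the implementation of any real library: ToyAudioDataset is a hypothetical stand-in that only counts simulated decode operations, assuming that row access decodes one file while column access decodes every file before indexing.

```python
class ToyAudioDataset:
    """Hypothetical stand-in for a dataset whose "audio" column decodes on access."""

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0  # number of simulated decode operations

    def _decode(self, path):
        # Placeholder for real decoding and resampling work.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested row is decoded.
            path = self._paths[key]
            return {"path": path, "audio": self._decode(path)}
        if key == "audio":
            # Column access: every row is decoded before indexing.
            return [self._decode(p) for p in self._paths]
        if key == "path":
            return list(self._paths)
        raise KeyError(key)


dataset = ToyAudioDataset(f"clip_{i}.wav" for i in range(1000))

_ = dataset[0]["audio"]            # sample index first
row_first = dataset.decode_calls

dataset.decode_calls = 0
_ = dataset["audio"][0]            # column first
column_first = dataset.decode_calls

print(f"row first: {row_first} decode(s), column first: {column_first} decode(s)")
```

In this toy model the cost of the column-first form grows with the number of rows, which is the reason for querying the sample index first.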

Data Splits

The dataset is already balanced and split into train, dev (validation) and test sets.

| Name | Train | Dev | Test |
|---|---|---|---|
| # of utterances | 177552 | 47104 | 47704 |
| # unique speakers | 11189 | 1297 | 1322 |
| Total duration, hr | 30.04 | 7.53 | 7.53 |
| Min duration, sec | 0.86 | 0.98 | 0.89 |
| Mean duration, sec | 4.87 | 4.61 | 4.55 |
| Max duration, sec | 21.72 | 105.67 | 29.83 |
| Duration per language, min | ~40 | ~10 | ~10 |
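As a quick arithmetic check, the per-split durations can be related to the 45.1-hour total from the Dataset Summary and to the approximate per-language figures. The sketch below assumes 45 languages, counted from the list in the Languages section; the variable names are illustrative.

```python
# Total duration per split in hours, copied from the Data Splits figures.
hours = {"train": 30.04, "dev": 7.53, "test": 7.53}

# Assumption: 45 languages, counted from the Languages section.
NUM_LANGUAGES = 45

total_hours = sum(hours.values())
minutes_per_language = {
    split: h * 60 / NUM_LANGUAGES for split, h in hours.items()
}

print(f"total: {total_hours:.2f} h")
for split, minutes in minutes_per_language.items():
    print(f"{split}: {minutes:.1f} min per language")
```

The sum matches the 45.1 hours quoted in the summary, and the per-language values land close to the ~40 and ~10 minute figures.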

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for Using the Data

Social Impact of Dataset

The dataset consists of recordings from people who have donated their voice online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Discussion of Biases

More Information Needed

Other Known Limitations

The Mongolian and Ukrainian languages are spelled as "Mangolian" and "Ukranian" in this version of the dataset.

Additional Information

Dataset Curators

Ganesh Sinisetty; Pavlo Ruban; Oleksandr Dymov; Mirco Ravanelli

Licensing Information

Creative Commons Attribution 4.0 International

Citation Information

@dataset{ganesh_sinisetty_2021_5036977,
  author       = {Ganesh Sinisetty and
                  Pavlo Ruban and
                  Oleksandr Dymov and
                  Mirco Ravanelli},
  title        = {CommonLanguage},
  month        = jun,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5036977},
  url          = {https://doi.org/10.5281/zenodo.5036977}
}

Contributions

Thanks to @anton-l for adding this dataset.