Dataset:

MLCommons/ml_spoken_words

Multilinguality:

multilingual

Size:

10M<n<100M

Language creators:

other

Annotations creators:

machine-generated

License:

cc-by-4.0

Dataset Card for Multilingual Spoken Words

Dataset Summary

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages collectively spoken by over 5 billion people, for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. This dataset is generated by applying forced alignment on crowd-sourced sentence-level audio to produce per-word timing estimates for extraction. All alignments are included in the dataset.
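As a quick sanity check on these figures, the hour count can be derived from the number of examples. This is only a rough sketch: it assumes every example lasts exactly one second, and the variable names are illustrative.

```python
# Rough check of the summary figures: 23.4 million one-second clips.
NUM_EXAMPLES = 23_400_000   # from the summary above
SECONDS_PER_EXAMPLE = 1     # assumption: every clip is exactly one second long

total_hours = NUM_EXAMPLES * SECONDS_PER_EXAMPLE / 3600
print(f"{total_hours:.0f} hours")  # 6500 hours
```

The result agrees with the "over 6,000 hours" stated in the summary.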

Data is provided in two formats: wav (16 kHz) and opus (48 kHz). Default configuration names follow the pattern "{lang}_{format}", so to load, for example, Tatar in wav format, do:

from datasets import load_dataset
ds = load_dataset("MLCommons/ml_spoken_words", "tt_wav")

To download multiple languages in a single dataset, pass a list of languages to the languages argument:

ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])

To download a specific format, pass it to the format argument (the default format is wav):

ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"], format="opus")

Note that each time you provide a different set of languages, examples are generated from scratch, even if you have already loaded one or more of those languages before, because a custom configuration is created each time (the data is not redownloaded, though).
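The default configuration naming described above can be sketched as a small helper. The function name config_name and its validation are hypothetical and not part of the dataset loader. The two formats are the ones listed in this card.

```python
# Formats listed in this card; the helper itself is illustrative only.
SUPPORTED_FORMATS = ("wav", "opus")

def config_name(lang: str, fmt: str = "wav") -> str:
    """Build a default configuration name such as "tt_wav"."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return f"{lang}_{fmt}"

print(config_name("tt"))          # tt_wav
print(config_name("br", "opus"))  # br_opus
```

The resulting string can be passed as the second argument to load_dataset, as in the Tatar example.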

Supported Tasks and Leaderboards

Keyword spotting, Spoken term search

Languages

The dataset is multilingual. To download several languages, pass a list of them to the languages argument:

ds = load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt", "br"])

The dataset contains data for the following languages:

Low-resourced (<10 hours):

  • Arabic (0.1G, 7.6h)
  • Assamese (0.9M, 0.1h)
  • Breton (69M, 5.6h)
  • Chuvash (28M, 2.1h)
  • Chinese (zh-CN) (42M, 3.1h)
  • Dhivehi (0.7M, 0.04h)
  • Frisian (0.1G, 9.6h)
  • Georgian (20M, 1.4h)
  • Guarani (0.7M, 1.3h)
  • Greek (84M, 6.7h)
  • Hakha Chin (26M, 0.1h)
  • Hausa (90M, 1.0h)
  • Interlingua (58M, 4.0h)
  • Irish (38M, 3.2h)
  • Latvian (51M, 4.2h)
  • Lithuanian (21M, 0.46h)
  • Maltese (88M, 7.3h)
  • Oriya (0.7M, 0.1h)
  • Romanian (59M, 4.5h)
  • Sakha (42M, 3.3h)
  • Slovenian (43M, 3.0h)
  • Slovak (31M, 1.9h)
  • Sursilvan (61M, 4.8h)
  • Tamil (8.8M, 0.6h)
  • Vallader (14M, 1.2h)
  • Vietnamese (1.2M, 0.1h)

Medium-resourced (>10 & <100 hours):

  • Czech (0.3G, 24h)
  • Dutch (0.8G, 70h)
  • Estonian (0.2G, 19h)
  • Esperanto (1.3G, 77h)
  • Indonesian (0.1G, 11h)
  • Kyrgyz (0.1G, 12h)
  • Mongolian (0.1G, 12h)
  • Portuguese (0.7G, 58h)
  • Swedish (0.1G, 12h)
  • Tatar (4G, 30h)
  • Turkish (1.3G, 29h)
  • Ukrainian (0.2G, 18h)

High-resourced (>100 hours):

  • Basque (1.7G, 118h)
  • Catalan (8.7G, 615h)
  • English (26G, 1957h)
  • French (9.3G, 754h)
  • German (14G, 1083h)
  • Italian (2.2G, 155h)
  • Kinyarwanda (6.1G, 422h)
  • Persian (4.5G, 327h)
  • Polish (1.8G, 130h)
  • Russian (2.1G, 137h)
  • Spanish (4.9G, 349h)
  • Welsh (4.5G, 108h)
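The three tiers above can be expressed as a simple threshold function. This is a minimal sketch: the name resource_tier is hypothetical, and the handling of exactly 10 or 100 hours is an assumption, since the card does not say which tier those boundary values fall into.

```python
def resource_tier(hours: float) -> str:
    """Map hours of audio to the tier labels used in the list above.

    Boundary handling is an assumption: exactly 10 h counts as medium
    and exactly 100 h counts as high.
    """
    if hours < 10:
        return "low"
    if hours < 100:
        return "medium"
    return "high"

# Hour figures copied from the list above.
for name, hours in [("Arabic", 7.6), ("Tatar", 30), ("English", 1957)]:
    print(name, resource_tier(hours))
```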

Dataset Structure

Data Instances

{'file': 'абзар_common_voice_tt_17737010.opus',
 'is_valid': True,
 'language': 0,
 'speaker_id': '687025afd5ce033048472754c8d2cb1cf8a617e469866bbdb3746e2bb2194202094a715906f91feb1c546893a5d835347f4869e7def2e360ace6616fb4340e38',
 'gender': 0,
 'keyword': 'абзар',
 'audio': {'path': 'абзар_common_voice_tt_17737010.opus',
  'array': array([2.03458695e-34, 2.03458695e-34, 2.03458695e-34, ...,
         2.03458695e-34, 2.03458695e-34, 2.03458695e-34]),
  'sampling_rate': 48000}}

Data Fields

  • file: string, relative audio path inside the archive
  • is_valid: whether a sample is valid
  • language: language of an instance. Makes sense only when providing multiple languages to the dataset loader (for example, load_dataset("MLCommons/ml_spoken_words", languages=["ar", "tt"]))
  • speaker_id: unique id of a speaker. Can be "NA" if an instance is invalid
  • gender: speaker gender. Can be one of ["MALE", "FEMALE", "OTHER", "NAN"]
  • keyword: word spoken in the current sample
  • audio: a dictionary containing the relative path to the audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column via dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time, so query the sample index before the "audio" column: dataset[0]["audio"] should always be preferred over dataset["audio"][0].
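The fields above can be combined to summarize a single sample. The sketch below uses a stand-in dictionary shaped like the data instance rather than a real download. The helper name describe_sample is hypothetical, and the mapping from integer gender labels to names assumes the labels index into the list in the order given above, which this card does not confirm.

```python
# Assumption: integer labels index into the names in the order the card lists them.
GENDER_NAMES = ["MALE", "FEMALE", "OTHER", "NAN"]

def describe_sample(sample: dict) -> dict:
    """Return the keyword, gender name, and clip duration for one sample."""
    audio = sample["audio"]
    return {
        "keyword": sample["keyword"],
        "gender": GENDER_NAMES[sample["gender"]],
        "duration_s": len(audio["array"]) / audio["sampling_rate"],
    }

# Stand-in sample shaped like the instance above; real arrays are NumPy arrays,
# and a one-second clip at 48 kHz is assumed here.
sample = {
    "keyword": "абзар",
    "gender": 0,
    "audio": {"array": [0.0] * 48000, "sampling_rate": 48000},
}
print(describe_sample(sample))
```

With a loaded dataset, the same function would be called as describe_sample(dataset[0]), which decodes only that one audio file. If the gender column is a ClassLabel feature, dataset.features["gender"].int2str(...) is likely a more reliable way to recover the label name than the hard-coded list.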

Data Splits

The data for each language is split into train / validation / test parts.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

The data comes from the Common Voice dataset.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voice online. You agree to not attempt to determine the identity of speakers.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is licensed under CC-BY 4.0 and can be used for academic research and commercial applications in keyword spotting and spoken term search.

Citation Information

@inproceedings{mazumder2021multilingual,
  title={Multilingual Spoken Words Corpus},
  author={Mazumder, Mark and Chitlangia, Sharad and Banbury, Colby and Kang, Yiping and Ciro, Juan Manuel and Achorn, Keith and Galvez, Daniel and Sabini, Mark and Mattson, Peter and Kanter, David and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  year={2021}
}

Contributions

Thanks to @polinaeterna for adding this dataset.