Datasets:

mozilla-foundation/common_voice_11_0

Tasks:

Automatic Speech Recognition

Multilinguality:

multilingual

Language Creators:

crowdsourced

Annotations Creators:

crowdsourced

Source Datasets:

extended|common_voice

ArXiv:

arxiv:1912.06670

License:

cc0-1.0


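Dataset Card for Common Voice Corpus 11.0

Dataset Summary

The Common Voice dataset consists of unique MP3 files, each paired with a corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata such as age, gender, and accent, which can help improve the accuracy of speech recognition engines.

The dataset currently contains 16413 validated hours in 100 languages, and more voices and languages are added continually. See the Languages page to request a language or to start contributing.

Supported Tasks and Leaderboards

Results for models trained on the Common Voice datasets are available via the Autoevaluate Leaderboard.

Languages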

Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Kurmanji Kurdish, Kyrgyz, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh

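How to use

The datasets library lets you load and pre-process datasets at scale in pure Python. A single call to the load_dataset function downloads the dataset and prepares it on your local drive.

For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi" for Hindi):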

from datasets import load_dataset

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")

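Using the datasets library, you can also stream the dataset on the fly by adding the streaming=True argument to the load_dataset call. In streaming mode, samples are loaded one at a time instead of the entire dataset being downloaded to disk.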

from datasets import load_dataset

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)

print(next(iter(cv_11)))
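
Because a streamed dataset is consumed as an iterator, the first few samples can be inspected without reading the rest. The sketch below illustrates this with a hypothetical generator, fake_stream, standing in for the streamed cv_11 object so that it runs offline; the helper take_first is likewise an assumption, not part of the datasets library. With the real dataset, cv_11 would be passed to take_first in place of the generator.

```python
from itertools import islice


def fake_stream(num_samples):
    """Hypothetical stand-in for a streamed dataset: yields one sample at a time."""
    for i in range(num_samples):
        # Each sample mimics only the 'sentence' and 'locale' fields of a row.
        yield {"sentence": f"sample sentence {i}.", "locale": "hi"}


def take_first(stream, n):
    """Return the first n samples of an iterable without consuming the rest."""
    return list(islice(stream, n))


# Only three samples are ever generated, even though the stream is long.
first_three = take_first(fake_stream(1_000_000), 3)
for sample in first_three:
    print(sample["sentence"])
```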

Bonus: create a PyTorch dataloader directly from your own datasets (local or streamed).

Local

from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
# Draw random indices and group them into batches of 32, keeping the last partial batch
batch_sampler = BatchSampler(RandomSampler(cv_11), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_11, batch_sampler=batch_sampler)

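Streaming

from datasets import load_dataset
from torch.utils.data import DataLoader

# streaming=True returns an iterable dataset instead of downloading everything first
cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)
dataloader = DataLoader(cv_11, batch_size=32)

To find out more about loading and preparing audio datasets, head over to hf.co/blog/audio-datasets.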

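Example scripts

Train your own CTC or Seq2Seq Automatic Speech Recognition models on Common Voice 11 with transformers - here.

Dataset Structure

Data Instances

A typical data point comprises the path to the audio file and its sentence. Additional fields include accent, age, client_id, up_votes, down_votes, gender, locale, and segment.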

{
  'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5', 
  'path': 'et/clips/common_voice_et_18318995.mp3', 
  'audio': {
    'path': 'et/clips/common_voice_et_18318995.mp3', 
    'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32), 
    'sampling_rate': 48000
  }, 
  'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.', 
  'up_votes': 2, 
  'down_votes': 0, 
  'age': 'twenties', 
  'gender': 'male', 
  'accent': '', 
  'locale': 'et', 
  'segment': ''
}
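
As a small illustration of how the audio fields relate to each other, the duration of a decoded clip is the number of samples divided by the sampling rate. The sketch below assumes a mono clip whose array is one-dimensional; the helper clip_duration_seconds and the synthetic example are hypothetical and are not real Common Voice data.

```python
def clip_duration_seconds(example):
    """Duration of a decoded mono clip: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic example (not real Common Voice data): 96000 samples at 48 kHz.
example = {
    "audio": {"path": "clip.mp3", "array": [0.0] * 96000, "sampling_rate": 48000},
    "sentence": "placeholder sentence.",
}

duration = clip_duration_seconds(example)
print(duration)  # 96000 / 48000 = 2.0 seconds
```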

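Data Fields

client_id (string): An ID for the client (voice) that made the recording

path (string): The path to the audio file

audio (dict): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, as in dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a long time. It is therefore important to query the sample index first and the "audio" column second, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].

sentence (string): The sentence the user was prompted to speak

up_votes (int64): How many upvotes the audio file has received from reviewers

down_votes (int64): How many downvotes the audio file has received from reviewers

age (string): The age of the speaker (e.g. teens, twenties, fifties)

gender (string): The gender of the speaker

accent (string): The accent of the speaker

locale (string): The locale of the speaker

segment (string): Usually an empty field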
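
The access-order advice for the audio field can be illustrated with a toy model. The class below, ToyAudioDataset, is hypothetical and is not how the datasets library is implemented; it only assumes that row access decodes one file while column access decodes every file, and it counts the decode calls in each case.

```python
class ToyAudioDataset:
    """Toy model of a dataset whose audio column is decoded lazily on access.

    Illustration only; this is not the datasets library's implementation.
    """

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been "decoded" so far

    def _decode(self, path):
        # Pretend to decode a file; a real decoder would read and resample audio.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 48000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested file.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: decode every file before anything can be indexed.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.mp3" for i in range(1000)]

row_first = ToyAudioDataset(paths)
_ = row_first[0]["audio"]       # index the row, then take the column
print(row_first.decode_calls)

column_first = ToyAudioDataset(paths)
_ = column_first["audio"][0]    # take the column, then index into it
print(column_first.decode_calls)
```

Under these assumptions the row-first access decodes a single file, while the column-first access decodes all 1000 before returning the first one.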

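Data Splits

The speech material has been subdivided into dev, train, test, validated, invalidated, reported, and other portions.

The validated data is high-quality data that has been validated by reviewers.

The invalidated data is data that reviewers have judged to be of low quality.

The reported data is data that has been reported as problematic.

The other data is data that has not yet been reviewed.

The dev, test, and train portions all consist of data that has been reviewed, deemed to be of high quality, and split into dev, test, and train.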

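Data Preprocessing Recommended by Hugging Face

The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them into practice.

Many examples in this dataset are wrapped in quotation marks, e.g. "the cat sat on the mat.". These quotation marks do not change the meaning of the sentence, and it is nearly impossible to infer from the audio alone whether a sentence is a quotation. In these cases, it is advised to strip the quotation marks, leaving: the cat sat on the mat.

In addition, the majority of training sentences end in punctuation (. or ? or !), whereas only a small proportion do not. In the dev set, almost all sentences end in punctuation. It is therefore recommended to append a full stop (.) to the small number of training examples that do not end in punctuation.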

from datasets import load_dataset

ds = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True)

def prepare_dataset(batch):
  """Function to preprocess the dataset with the .map method"""
  transcription = batch["sentence"]
  
  if transcription.startswith('"') and transcription.endswith('"'):
    # we can remove trailing quotation marks as they do not affect the transcription
    transcription = transcription[1:-1]
  
  if transcription[-1] not in [".", "?", "!"]:
    # append a full-stop to sentences that do not end in punctuation
    transcription = transcription + "."
  
  batch["sentence"] = transcription
  
  return batch

ds = ds.map(prepare_dataset, desc="preprocess dataset")
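
The two rules can also be tried on plain strings without downloading the dataset. The sketch below uses a hypothetical helper, normalize_transcription, that mirrors the logic of prepare_dataset; the empty-string guard is an addition not present in the snippet above, and like that snippet it only handles ASCII double quotes.

```python
def normalize_transcription(text):
    """Strip one pair of enclosing double quotes, then ensure end punctuation."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        text = text[1:-1]
    # Guard against empty strings, for which text[-1] would raise IndexError.
    if text and text[-1] not in ".?!":
        text += "."
    return text


examples = [
    '"the cat sat on the mat."',  # quoted, already punctuated
    "the cat sat on the mat",     # unquoted, missing full stop
    "Is it raining?",             # already ends in punctuation
    "",                           # empty transcription is left untouched
]
normalized = [normalize_transcription(s) for s in examples]
print(normalized)
```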

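Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for Using the Data

Social Impact of Dataset

The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Public Domain, CC-0

Citation Information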

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

Author:

mozilla-foundation

Dataset size:

286.13 GB