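Dataset: tobiolatunji/afrispeech-200
Tasks: Automatic Speech Recognition
Languages: en
Multilinguality: monolingual
Size: 10K<n<100K
Annotations creators: expert-generated
Source datasets: original
License: cc-by-nc-sa-4.0

AFRISPEECH-200 is a 200-hour pan-African speech corpus for English accented automatic speech recognition (ASR) in the clinical and general domains. The dataset covers 120 African accents from 13 countries and 2,463 unique African speakers. Our goal is to raise awareness of pan-African English ASR research and to advance it, especially for the clinical domain.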
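The datasets library lets you load and preprocess the dataset in pure Python, at scale. With the `load_dataset` function, the dataset can be downloaded and prepared on your local drive in a single call.

    from datasets import load_dataset

    # Download and prepare every accent configuration
    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")

The full dataset is about 120 GB; depending on your network speed and bandwidth, the download may take about 2 hours. If you are constrained by disk space or bandwidth, use the streaming mode described below to work with smaller subsets of the data.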
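You can also download only the subset of data for a specific accent by passing its config name to `load_dataset`. For example, to download the `isizulu` config, specify the corresponding accent config name. The supported accents are given in the `Accent list` section below:

    from datasets import load_dataset

    # Download only the train split of the isizulu accent configuration
    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")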
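With the datasets library you can also stream the dataset on the fly by adding `streaming=True` to the `load_dataset` call. In streaming mode, individual samples are loaded one at a time instead of downloading the entire dataset to disk.

    from datasets import load_dataset

    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)

    # Fetch the first sample
    print(next(iter(afrispeech)))

    # Fetch the first five samples
    print(list(afrispeech.take(5)))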
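Because a streamed dataset is consumed one sample at a time, subsets can be selected lazily without materializing anything on disk. The following is a minimal sketch in plain Python, not part of the datasets library: `iter_domain` and `first_n` are hypothetical helper names, the records in `demo_stream` are made-up stand-ins for real samples, and the `"general"` value of the `domain` field is an assumption (the sample shown in this card only confirms `"clinical"`).

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def iter_domain(samples: Iterable[Dict[str, Any]], domain: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the samples whose `domain` field equals `domain`."""
    for sample in samples:
        if sample.get("domain") == domain:
            yield sample


def first_n(samples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Consume at most `n` samples from an iterable and return them as a list."""
    return list(islice(samples, n))


# Made-up stand-in records; a streamed dataset could be passed in the same way.
demo_stream = [
    {"transcript": "a", "domain": "clinical"},
    {"transcript": "b", "domain": "general"},
    {"transcript": "c", "domain": "clinical"},
]

clinical_preview = first_n(iter_domain(demo_stream, "clinical"), 2)
```

Since both helpers only iterate, they should work equally on a list, a generator, or the streamed `afrispeech` object from the example above.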

    # Local (downloaded) dataset with a PyTorch DataLoader
    from datasets import load_dataset
    from torch.utils.data import DataLoader
    from torch.utils.data.sampler import BatchSampler, RandomSampler

    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train")
    # Draw random batches of 32 samples, keeping the last partial batch
    batch_sampler = BatchSampler(RandomSampler(afrispeech), batch_size=32, drop_last=False)
    dataloader = DataLoader(afrispeech, batch_sampler=batch_sampler)

    # Streamed dataset with a PyTorch DataLoader
    from datasets import load_dataset
    from torch.utils.data import DataLoader

    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "isizulu", split="train", streaming=True)
    dataloader = DataLoader(afrispeech, batch_size=32)
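Audio clips differ in length, so batching them usually requires padding. The sketch below shows one possible collate function in plain Python; `pad_collate` and the output keys `input_values`, `lengths`, and `transcripts` are hypothetical names chosen for illustration, and the tiny `demo_batch` is made up. It assumes each sample carries `audio["array"]` and `transcript` as in the data instance shown in this card.

```python
from typing import Any, Dict, List


def pad_collate(batch: List[Dict[str, Any]], pad_value: float = 0.0) -> Dict[str, Any]:
    """Right-pad every waveform in the batch to the length of the longest one."""
    waveforms = [list(sample["audio"]["array"]) for sample in batch]
    lengths = [len(w) for w in waveforms]
    max_len = max(lengths, default=0)  # default=0 keeps an empty batch from raising
    padded = [w + [pad_value] * (max_len - len(w)) for w in waveforms]
    return {
        "input_values": padded,   # equal-length waveforms
        "lengths": lengths,       # original lengths, useful for masking
        "transcripts": [sample["transcript"] for sample in batch],
    }


# Made-up batch with two very short "waveforms" of different lengths.
demo_batch = [
    {"audio": {"array": [0.1, 0.2, 0.3], "sampling_rate": 44100}, "transcript": "one"},
    {"audio": {"array": [0.4], "sampling_rate": 44100}, "transcript": "two"},
]

collated = pad_collate(demo_batch)
```

A function of this shape can be passed to a PyTorch `DataLoader` through its `collate_fn` argument; converting the padded lists to tensors is left to the caller.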
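Note that until May 19, 2023, the transcripts in the validation set are hidden and the test set is not released.
For a step-by-step guide to fine-tuning a wav2vec2 model on the afrispeech-200 dataset with the transformers library, see this Colab notebook.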
English (accented)
A typical data point comprises the path to the audio file, called `path`, and its transcription, called `transcript`. Some additional information about the speaker is also provided.

    {
        'speaker_id': 'b545a4ca235a7b72688a1c0b3eb6bde6',
        'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav',
        'audio_id': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397',
        'audio': {
            'path': 'aad9bd69-7ca0-4db1-b650-1eeea17a0153/5dcb6ee086e392376cd3b7131a250397.wav',
            'array': array([0.00018311, 0.00061035, 0.00012207, ..., 0.00192261, 0.00195312, 0.00216675]),
            'sampling_rate': 44100
        },
        'transcript': 'His mother is in her 50 s and has hypertension .',
        'age_group': '26-40',
        'gender': 'Male',
        'accent': 'yoruba',
        'domain': 'clinical',
        'country': 'US',
        'duration': 3.241995464852608
    }

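speaker_id: ID of the speaker (voice) who made the recording
path: Path to the audio file
audio: A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when the audio column is accessed, as in `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files can take a long time, so query the sample index before the `"audio"` column: always prefer `dataset[0]["audio"]` over `dataset["audio"][0]`.
transcript: The sentence the speaker was asked to read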
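The fields above can be cross-checked against each other. As a minimal sketch, the clip length in seconds is the number of decoded samples divided by the sampling rate; `clip_duration_seconds` is a hypothetical helper, and the sample count of 142972 in `demo_sample` is not taken from the dataset but derived here from the `duration` and `sampling_rate` values of the data instance shown earlier, with a silent placeholder waveform.

```python
from typing import Any, Dict


def clip_duration_seconds(sample: Dict[str, Any]) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Placeholder sample: silent waveform whose length is an assumption derived from
# duration * sampling_rate (3.241995... s * 44100 Hz is about 142972 samples).
demo_sample = {
    "audio": {"array": [0.0] * 142972, "sampling_rate": 44100},
    "duration": 3.241995464852608,
}

computed = clip_duration_seconds(demo_sample)
# True when the computed length agrees with the stored `duration` field
matches_metadata = abs(computed - demo_sample["duration"]) < 1e-6
```

On real data, the helper should be called on an indexed row such as `dataset[0]`, which decodes only that one clip.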
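The speech material has been subdivided into train, dev, and test portions.
Speech was recorded in a quiet environment with a high-quality microphone; speakers were asked to read one sentence at a time.
Item | Train | Dev | Test |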
---|---|---|---|
# Speakers | 1466 | 247 | 750 |
# Seconds | 624228.83 | 31447.09 | 67559.10 |
# Hours | 173.4 | 8.74 | 18.77 |
# Accents | 71 | 45 | 108 |
Avg secs/speaker | 425.81 | 127.32 | 90.08 |
Avg num clips/speaker | 39.56 | 13.08 | 8.46 |
Avg num speakers/accent | 20.65 | 5.49 | 6.94 |
Avg secs/accent | 8791.96 | 698.82 | 625.55 |
# clips general domain | 21682 | 1407 | 2723 |
# clips clinical domain | 36318 | 1824 | 3623 |
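Several rows of the table follow from others by simple arithmetic. The sketch below recomputes hours and average clips per speaker from the speaker, second, and clip counts copied from the table; `SPLITS` and `derive` are names introduced here for illustration only.

```python
# Values copied from the split statistics table above.
SPLITS = {
    "train": {"speakers": 1466, "seconds": 624228.83, "general": 21682, "clinical": 36318},
    "dev": {"speakers": 247, "seconds": 31447.09, "general": 1407, "clinical": 1824},
    "test": {"speakers": 750, "seconds": 67559.10, "general": 2723, "clinical": 3623},
}


def derive(stats):
    """Recompute hours, total clips, and average clips per speaker for one split."""
    clips = stats["general"] + stats["clinical"]
    return {
        "hours": round(stats["seconds"] / 3600, 2),
        "clips": clips,
        "clips_per_speaker": round(clips / stats["speakers"], 2),
    }


derived = {name: derive(stats) for name, stats in SPLITS.items()}
total_clips = sum(d["clips"] for d in derived.values())
total_speakers = sum(stats["speakers"] for stats in SPLITS.values())

for name, d in derived.items():
    print(name, d)
```

The recomputed totals agree with the figures quoted elsewhere in this card: 67,577 clips and 2,463 speakers across the three splits.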
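Africa has a very low doctor-to-patient ratio. At very busy clinics, doctors may see more than 30 patients a day, a heavy patient burden compared with developed countries, yet these overworked clinicians lack tools such as clinical automatic speech recognition (ASR). Clinical ASR, however, is mature and even ubiquitous in developed countries, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. In addition, recent general-domain ASR performance approaches human accuracy. Several gaps remain. Several papers have highlighted racial bias in speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark evaluating clinical ASR on African accents, and speech data does not exist for most African accents. We release AfriSpeech, 200 hours of pan-African speech comprising 67,577 clips from 120 indigenous accents across 13 countries, for clinical and general-domain ASR, together with a benchmark test set and publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.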
Country | Clips | Speakers | Duration (seconds) | Duration (hrs) |
---|---|---|---|---|
NG | 45875 | 1979 | 512646.88 | 142.40 |
KE | 8304 | 137 | 75195.43 | 20.89 |
ZA | 7870 | 223 | 81688.11 | 22.69 |
GH | 2018 | 37 | 18581.13 | 5.16 |
BW | 1391 | 38 | 14249.01 | 3.96 |
UG | 1092 | 26 | 10420.42 | 2.89 |
RW | 469 | 9 | 5300.99 | 1.47 |
US | 219 | 5 | 1900.98 | 0.53 |
TR | 66 | 1 | 664.01 | 0.18 |
ZW | 63 | 3 | 635.11 | 0.18 |
MW | 60 | 1 | 554.61 | 0.15 |
TZ | 51 | 2 | 645.51 | 0.18 |
LS | 7 | 1 | 78.40 | 0.02 |
Accent | Clips | Speakers | Duration (s) | Country | Splits |
---|---|---|---|---|---|
yoruba | 15407 | 683 | 161587.55 | US,NG | train,test,dev |
igbo | 8677 | 374 | 93035.79 | US,NG,ZA | train,test,dev |
swahili | 6320 | 119 | 55932.82 | KE,TZ,ZA,UG | train,test,dev |
hausa | 5765 | 248 | 70878.67 | NG | train,test,dev |
ijaw | 2499 | 105 | 33178.9 | NG | train,test,dev |
afrikaans | 2048 | 33 | 20586.49 | ZA | train,test,dev |
idoma | 1877 | 72 | 20463.6 | NG | train,test,dev |
zulu | 1794 | 52 | 18216.97 | ZA,TR,LS | dev,train,test |
setswana | 1588 | 39 | 16553.22 | BW,ZA | dev,test,train |
twi | 1566 | 22 | 14340.12 | GH | test,train,dev |
isizulu | 1048 | 48 | 10376.09 | ZA | test,train,dev |
igala | 919 | 31 | 9854.72 | NG | train,test |
izon | 838 | 47 | 9602.53 | NG | train,dev,test |
kiswahili | 827 | 6 | 8988.26 | KE | train,test |
ebira | 757 | 42 | 7752.94 | NG | train,test,dev |
luganda | 722 | 22 | 6768.19 | UG,BW,KE | test,dev,train |
urhobo | 646 | 32 | 6685.12 | NG | train,dev,test |
nembe | 578 | 16 | 6644.72 | NG | train,test,dev |
ibibio | 570 | 39 | 6489.29 | NG | train,test,dev |
pidgin | 514 | 20 | 5871.57 | NG | test,train,dev |
luhya | 508 | 4 | 4497.02 | KE | train,test |
kinyarwanda | 469 | 9 | 5300.99 | RW | train,test,dev |
xhosa | 392 | 12 | 4604.84 | ZA | train,dev,test |
tswana | 387 | 18 | 4148.58 | ZA,BW | train,test,dev |
esan | 380 | 13 | 4162.63 | NG | train,test,dev |
alago | 363 | 8 | 3902.09 | NG | train,test |
tshivenda | 353 | 5 | 3264.77 | ZA | test,train |
fulani | 312 | 18 | 5084.32 | NG | test,train |
isoko | 298 | 16 | 4236.88 | NG | train,test,dev |
akan (fante) | 295 | 9 | 2848.54 | GH | train,dev,test |
ikwere | 293 | 14 | 3480.43 | NG | test,train,dev |
sepedi | 275 | 10 | 2751.68 | ZA | dev,test,train |
efik | 269 | 11 | 2559.32 | NG | test,train,dev |
edo | 237 | 12 | 1842.32 | NG | train,test,dev |
luo | 234 | 4 | 2052.25 | UG,KE | test,train,dev |
kikuyu | 229 | 4 | 1949.62 | KE | train,test,dev |
bekwarra | 218 | 3 | 2000.46 | NG | train,test |
isixhosa | 210 | 9 | 2100.28 | ZA | train,dev,test |
hausa/fulani | 202 | 3 | 2213.53 | NG | test,train |
epie | 202 | 6 | 2320.21 | NG | train,test |
isindebele | 198 | 2 | 1759.49 | ZA | train,test |
venda and xitsonga | 188 | 2 | 2603.75 | ZA | train,test |
sotho | 182 | 4 | 2082.21 | ZA | dev,test,train |
akan | 157 | 6 | 1392.47 | GH | test,train |
nupe | 156 | 9 | 1608.24 | NG | dev,train,test |
anaang | 153 | 8 | 1532.56 | NG | test,dev |
english | 151 | 11 | 2445.98 | NG | dev,test |
afemai | 142 | 2 | 1877.04 | NG | train,test |
shona | 138 | 8 | 1419.98 | ZA,ZW | test,train,dev |
eggon | 137 | 5 | 1833.77 | NG | test |
luganda and kiswahili | 134 | 1 | 1356.93 | UG | train |
ukwuani | 133 | 7 | 1269.02 | NG | test |
sesotho | 132 | 10 | 1397.16 | ZA | train,dev,test |
benin | 124 | 4 | 1457.48 | NG | train,test |
kagoma | 123 | 1 | 1781.04 | NG | train |
nasarawa eggon | 120 | 1 | 1039.99 | NG | train |
tiv | 120 | 14 | 1084.52 | NG | train,test,dev |
south african english | 119 | 2 | 1643.82 | ZA | train,test |
borana | 112 | 1 | 1090.71 | KE | train |
swahili ,luganda ,arabic | 109 | 1 | 929.46 | UG | train |
ogoni | 109 | 4 | 1629.7 | NG | train,test |
mada | 109 | 2 | 1786.26 | NG | test |
bette | 106 | 4 | 930.16 | NG | train,test |
berom | 105 | 4 | 1272.99 | NG | dev,test |
bini | 104 | 4 | 1499.75 | NG | test |
ngas | 102 | 3 | 1234.16 | NG | train,test |
etsako | 101 | 4 | 1074.53 | NG | train,test |
okrika | 100 | 3 | 1887.47 | NG | train,test |
venda | 99 | 2 | 938.14 | ZA | train,test |
siswati | 96 | 5 | 1367.45 | ZA | dev,train,test |
damara | 92 | 1 | 674.43 | NG | train |
yoruba, hausa | 89 | 5 | 928.98 | NG | test |
southern sotho | 89 | 1 | 889.73 | ZA | train |
kanuri | 86 | 7 | 1936.78 | NG | test,dev |
itsekiri | 82 | 3 | 778.47 | NG | test,dev |
ekpeye | 80 | 2 | 922.88 | NG | test |
mwaghavul | 78 | 2 | 738.02 | NG | test |
bajju | 72 | 2 | 758.16 | NG | test |
luo, swahili | 71 | 1 | 616.57 | KE | train |
dholuo | 70 | 1 | 669.07 | KE | train |
ekene | 68 | 1 | 839.31 | NG | test |
jaba | 65 | 2 | 540.66 | NG | test |
ika | 65 | 4 | 576.56 | NG | test,dev |
angas | 65 | 1 | 589.99 | NG | test |
ateso | 63 | 1 | 624.28 | UG | train |
brass | 62 | 2 | 900.04 | NG | test |
ikulu | 61 | 1 | 313.2 | NG | test |
eleme | 60 | 2 | 1207.92 | NG | test |
chichewa | 60 | 1 | 554.61 | MW | train |
oklo | 58 | 1 | 871.37 | NG | test |
meru | 58 | 2 | 865.07 | KE | train,test |
agatu | 55 | 1 | 369.11 | NG | test |
okirika | 54 | 1 | 792.65 | NG | test |
igarra | 54 | 1 | 562.12 | NG | test |
ijaw(nembe) | 54 | 2 | 537.56 | NG | test |
khana | 51 | 2 | 497.42 | NG | test |
ogbia | 51 | 4 | 461.15 | NG | test,dev |
gbagyi | 51 | 4 | 693.43 | NG | test |
portuguese | 50 | 1 | 525.02 | ZA | train |
delta | 49 | 2 | 425.76 | NG | test |
bassa | 49 | 1 | 646.13 | NG | test |
etche | 49 | 1 | 637.48 | NG | test |
kubi | 46 | 1 | 495.21 | NG | test |
jukun | 44 | 2 | 362.12 | NG | test |
igbo and yoruba | 43 | 2 | 466.98 | NG | test |
urobo | 43 | 3 | 573.14 | NG | test |
kalabari | 42 | 5 | 305.49 | NG | test |
ibani | 42 | 1 | 322.34 | NG | test |
obolo | 37 | 1 | 204.79 | NG | test |
idah | 34 | 1 | 533.5 | NG | test |
bassa-nge/nupe | 31 | 3 | 267.42 | NG | test,dev |
yala mbembe | 29 | 1 | 237.27 | NG | test |
eket | 28 | 1 | 238.85 | NG | test |
afo | 26 | 1 | 171.15 | NG | test |
ebiobo | 25 | 1 | 226.27 | NG | test |
nyandang | 25 | 1 | 230.41 | NG | test |
ishan | 23 | 1 | 194.12 | NG | test |
bagi | 20 | 1 | 284.54 | NG | test |
estako | 20 | 1 | 480.78 | NG | test |
gerawa | 13 | 1 | 342.15 | NG | test |
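[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
The dataset consists of people who have donated their voice online. You agree not to attempt to determine the identity of speakers in this dataset.
[More Information Needed]
[More Information Needed]
The dataset is provided for research purposes only. Please check the dataset license for additional information.
The dataset was initially prepared by Intron and refined for public release by CLAIR Lab.
Public Domain, Creative Commons Attribution NonCommercial ShareAlike v4.0 (CC BY-NC-SA 4.0)
[More Information Needed]
Thanks to @tobiolatunji for adding this dataset.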