Model:

voidful/wav2vec2-xlsr-multilingual-56

Task:

Automatic Speech Recognition

Libraries:

PyTorch Safetensors Transformers

Datasets:

common_voice

Languages:

multilingual

Other:

wav2vec2 audio hf-asr-leaderboard robust-speech-event speech xlsr-fine-tuning-week Eval Results

Preprint:

arxiv:1910.09700

License:

apache-2.0

wav2vec2-xlsr-multilingual-56

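Model Details

Model Description

Developed by: voidful
Shared by [optional]: Hugging Face
Model type: Automatic Speech Recognition
Language(s) (NLP): Multilingual (56 languages, one model for multilingual ASR)
License: Apache-2.0
Related models:
- Parent model: wav2vec
Resources for more information:
- GitHub Repo
- Model Space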

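Uses

Direct Use

This model can be used for automatic speech recognition.

Downstream Use [optional]

More information needed.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues in language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes about protected classes, identity characteristics, and sensitive social and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed before further recommendations can be given.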

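Training Details

Training Data

See the common_voice dataset card. The model was fine-tuned on Common Voice in 56 languages.

Training Procedure

Preprocessing

More information needed.

Speeds, Sizes, Times

When using this model, make sure that the speech input is sampled at 16 kHz.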
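Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only the Python standard library; it assumes the audio is an uncompressed PCM WAV file, and the helper names `wav_sampling_rate` and `needs_resampling` are hypothetical, not part of this model's code.

```python
import os
import tempfile
import wave


def wav_sampling_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=16_000):
    """True when the file's sampling rate differs from the rate the model expects."""
    return wav_sampling_rate(path) != target_rate


# Demo: write one second of 8 kHz mono silence, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_8k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(8_000)  # 8 kHz, so resampling is required
    wav_file.writeframes(b"\x00\x00" * 8_000)

print(wav_sampling_rate(demo_path))  # 8000
print(needs_resampling(demo_path))   # True
```

A file flagged this way would need to be resampled to 16 kHz before being passed to the model.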

Evaluation

Testing Data, Factors & Metrics

Testing Data

More information needed.

Factors

Metrics

More information needed.

Results

Common Voice language	Num. of data	Hours	WER (%)	CER (%)
ar	21744	81.5	75.29	31.23
as	394	1.1	95.37	46.05
br	4777	7.4	93.79	41.16
ca	301308	692.8	24.80	10.39
cnh	1563	2.4	68.11	23.10
cs	9773	39.5	67.86	12.57
cv	1749	5.9	95.43	34.03
cy	11615	106.7	67.03	23.97
de	262113	822.8	27.03	6.50
dv	4757	18.6	92.16	30.15
el	3717	11.1	94.48	58.67
en	580501	1763.6	34.87	14.84
eo	28574	162.3	37.77	6.23
es	176902	337.7	19.63	5.41
et	5473	35.9	86.87	20.79
eu	12677	90.2	44.80	7.32
fa	12806	290.6	53.81	15.09
fi	875	2.6	93.78	27.57
fr	314745	664.1	33.16	13.94
fy-NL	6717	27.2	72.54	26.58
ga-IE	1038	3.5	92.57	51.02
hi	292	2.0	90.95	57.43
hsb	980	2.3	89.44	27.19
hu	4782	9.3	97.15	36.75
ia	5078	10.4	52.00	11.35
id	3965	9.9	82.50	22.82
it	70943	178.0	39.09	8.72
ja	1308	8.2	99.21	62.06
ka	1585	4.0	90.53	18.57
ky	3466	12.2	76.53	19.80
lg	1634	17.1	98.95	43.84
lt	1175	3.9	92.61	26.81
lv	4554	6.3	90.34	30.81
mn	4020	11.6	82.68	30.14
mt	3552	7.8	84.18	22.96
nl	14398	71.8	57.18	19.01
or	517	0.9	90.93	27.34
pa-IN	255	0.8	87.95	42.03
pl	12621	112.0	56.14	12.06
pt	11106	61.3	53.24	16.32
rm-sursilv	2589	5.9	78.17	23.31
rm-vallader	931	2.3	73.67	21.76
ro	4257	8.7	83.84	21.95
ru	23444	119.1	61.83	15.18
sah	1847	4.4	94.38	38.46
sl	2594	6.7	84.21	20.54
sv-SE	4350	20.8	83.68	30.79
ta	3788	18.4	84.19	21.60
th	4839	11.7	141.87	37.16
tr	3478	22.3	66.77	15.55
tt	13338	26.7	86.80	33.57
uk	7271	39.4	70.23	14.34
vi	421	1.7	96.06	66.25
zh-CN	27284	58.7	89.67	23.96
zh-HK	12678	92.1	81.77	18.82
zh-TW	6402	56.6	85.08	29.07
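The WER and CER columns are word and character error rates. The card does not say how they were computed, so the following is only a sketch of the standard definition: the edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. The function names are hypothetical, and splitting words on whitespace is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (words split on whitespace)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent (spaces count as characters)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


print(cer("hello", "hallo"))   # 20.0
print(wer("yes", "yes it is")) # 200.0
```

The second example shows why a WER above 100, as in the th row, is possible: inserted words count as errors, while the denominator is the number of reference words.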

Model Examination

More information needed.

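Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: More information needed
Hours used: More information needed
Cloud Provider: More information needed
Compute Region: More information needed
Carbon Emitted: More information needed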
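Estimates of this kind are commonly formed by multiplying the energy consumed by the carbon intensity of the electricity grid. The sketch below assumes that simple model and is not the calculator's actual implementation. The function name and all input values are hypothetical placeholders, since the card reports none of these figures.

```python
def estimate_co2_kg(power_watts, hours, carbon_intensity_kg_per_kwh):
    """Rough CO2-equivalent estimate: energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * carbon_intensity_kg_per_kwh


# Hypothetical inputs for illustration only; they do not describe this model's training run.
example_kg = estimate_co2_kg(power_watts=250, hours=100, carbon_intensity_kg_per_kwh=0.4)
print(round(example_kg, 2))  # 10.0
```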

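Technical Specifications [optional]

Model Architecture and Objective

More information needed.

Compute Infrastructure

More information needed.

Hardware

More information needed.

Software

More information needed.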

Citation

BibTeX:

More information needed

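Glossary [optional]

More information needed.

Model Card Authors [optional]

voidful, in collaboration with Ezi Ozoani and the Hugging Face team

Model Card Contact

More information needed.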

How to Get Started with the Model

Use the code below to get started with the model.

Environment setup:

!pip install torchaudio
!pip install datasets transformers
!pip install asrp
!wget -O lang_ids.pk https://huggingface.co/voidful/wav2vec2-xlsr-multilingual-56/raw/main/lang_ids.pk

Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "voidful/wav2vec2-xlsr-multilingual-56"
processor_name = "voidful/wav2vec2-xlsr-multilingual-56"
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
 
import pickle
# lang_ids is indexed by language code (e.g. 'en') and yields the vocabulary ids allowed for that language.
with open("lang_ids.pk", 'rb') as f:
    lang_ids = pickle.load(f)
    
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)
 
model.eval()
 
def load_file_to_data(file, sampling_rate=16_000):
    # `sampling_rate` is the sampling rate of the input file; the model expects 16 kHz.
    batch = {}
    speech, _ = torchaudio.load(file)
    if sampling_rate != 16_000:
        # Resample to 16 kHz when the file was recorded at a different rate.
        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
        batch["speech"] = resampler(speech.squeeze(0)).numpy()
    else:
        batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    return batch
 
 
def predict(data):
    # Convert the raw waveform into padded model inputs.
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            # Greedy prediction per frame.
            pred_ids = torch.argmax(logit, dim=-1)
            # Keep only frames whose predicted id is >= 1 (this assumes id 0 is the pad token).
            mask = pred_ids.ge(1).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            comb_pred_ids = torch.argmax(voice_prob, dim=-1)
            decoded_results.append(processor.decode(comb_pred_ids))

    return decoded_results
 
def predict_lang_specific(data,lang_code):
    # Same as predict(), but restricts each frame's prediction to the vocabulary of `lang_code`.
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
        decoded_results = []
        for logit in logits:
            pred_ids = torch.argmax(logit, dim=-1)
            # Drop frames whose greedy prediction is the pad token.
            mask = ~pred_ids.eq(processor.tokenizer.pad_token_id).unsqueeze(-1).expand(logit.size())
            vocab_size = logit.size()[-1]
            voice_prob = torch.nn.functional.softmax((torch.masked_select(logit, mask).view(-1,vocab_size)),dim=-1)
            filtered_input = pred_ids[pred_ids!=processor.tokenizer.pad_token_id].view(1,-1).to(device)
            if len(filtered_input[0]) == 0:
                # Every frame was predicted as padding: nothing to decode.
                decoded_results.append("")
            else:
                # Build a 0/1 mask over the vocabulary that is 1 only for this language's ids.
                lang_mask = torch.empty(voice_prob.shape[-1]).fill_(0)
                lang_index = torch.tensor(sorted(lang_ids[lang_code]))
                lang_mask.index_fill_(0, lang_index, 1)
                lang_mask = lang_mask.to(device)
                # Zero out other languages' tokens before taking the per-frame argmax.
                comb_pred_ids = torch.argmax(lang_mask*voice_prob, dim=-1)
                decoded_results.append(processor.decode(comb_pred_ids))

    return decoded_results
 
 
predict(load_file_to_data('audio file path',sampling_rate=16_000)) # beware of the audio file sampling rate
 
predict_lang_specific(load_file_to_data('audio file path',sampling_rate=16_000),'en') # beware of the audio file sampling rate
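The language restriction in `predict_lang_specific` comes down to a masked argmax: per frame, only the ids listed for the chosen language compete. The toy sketch below illustrates that idea in plain Python with a made-up five-token vocabulary. The function name `masked_greedy_ids` and the probabilities are hypothetical, and it is not a replacement for the tensor code above.

```python
def masked_greedy_ids(frame_probs, allowed_ids):
    """For each frame, return the most probable token id among allowed_ids."""
    allowed = sorted(set(allowed_ids))
    return [max(allowed, key=lambda token_id: probs[token_id]) for probs in frame_probs]


# Two frames over a made-up vocabulary of five tokens (ids 0..4).
frames = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.60, 0.20],
]

print(masked_greedy_ids(frames, range(5)))  # [1, 3]  unrestricted argmax
print(masked_greedy_ids(frames, {2, 4}))    # [2, 4]  only ids 2 and 4 may win
```

With the restriction in place, a frame can no longer be decoded as a character outside the allowed set, even when that character has the highest overall probability.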

Author:

voidful

Dataset size:

2.43 GB
