Model:
asapp/sew-tiny-100k-ft-ls100h
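The base model pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. Note that this model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, or Emotion Recognition.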
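Because the model expects 16kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `is_16khz` are hypothetical and not part of the model's tooling, and the check only covers uncompressed WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) required by this model card


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """Return True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE
```

If the check fails, the audio would need to be resampled to 16kHz with an audio library of your choice before it is passed to the processor.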
Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
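Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
The original model can be found at https://github.com/asappresearch/sew#model-checkpoints.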
To transcribe audio files, the model can be used as a standalone acoustic model as follows:

    from transformers import Wav2Vec2Processor, SEWForCTC
    from datasets import load_dataset
    import torch

    # load the model and preprocessor
    processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")
    model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")

    # load the dummy dataset with speech samples
    ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

    # preprocess (the audio must be sampled at 16kHz)
    input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1

    # retrieve logits
    logits = model(input_values).logits

    # take argmax and decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
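The final two steps of the snippet above (argmax over the logits, then `batch_decode`) amount to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary `VOCAB`, the blank id `0`, and both function names are hypothetical and chosen only for illustration; the real tokenizer additionally handles word delimiters and special tokens.

```python
# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
VOCAB = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O"}


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeated ids, then drop blank ids."""
    collapsed = []
    previous = None
    for token_id in ids:
        # Keep an id only when it differs from the previous frame
        # and is not the blank token.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


def decode(ids, vocab=VOCAB, blank_id=0):
    """Turn per-frame argmax ids into a string using the toy vocabulary."""
    return "".join(vocab[i] for i in ctc_greedy_collapse(ids, blank_id))
```

For example, `decode([1, 1, 0, 2, 3, 3, 0, 3, 4, 4])` yields `"HELLO"`: the blank between the two runs of `3` is what allows the double letter to survive the merge step.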
This code snippet shows how to evaluate asapp/sew-tiny-100k-ft-ls100h on LibriSpeech's "clean" and "other" test data.

    from datasets import load_dataset
    from transformers import SEWForCTC, Wav2Vec2Processor
    import torch
    from jiwer import wer

    # use "other" in place of "clean" to evaluate on the "other" test set
    librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

    model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-ls100h").to("cuda")
    processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")

    def map_to_pred(batch):
        # batch_size=1 below, so each batch holds a single audio example
        input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                                 return_tensors="pt", padding="longest").input_values
        with torch.no_grad():
            logits = model(input_values.to("cuda")).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = processor.batch_decode(predicted_ids)
        batch["transcription"] = transcription
        return batch

    result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

    print("WER:", wer(result["text"], result["transcription"]))
Result (WER):
| "clean" | "other" |
|---|---|
| 10.61 | 23.74 |
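For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference words. The following is a minimal single-sentence sketch of that definition in plain Python; the function name `word_error_rate` is hypothetical, and it is not a replacement for `jiwer`, which the evaluation snippet uses. The figures in the table are presumably percentages, whereas this sketch returns a fraction.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

As a usage example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words, so the function returns 2/6.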