Model:

openai/whisper-small

Task:

Automatic Speech Recognition

Libraries:

PyTorch TensorFlow JAX Transformers

Languages:

Other:

whisper audio hf-asr-leaderboard Eval Results

Preprint:

arxiv:2212.04356

License:

apache-2.0

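Whisper

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680,000 hours of labelled data, Whisper models generalize well to many datasets and domains without fine-tuning.

Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here.

Disclaimer: Content for this model card has partly been written by the Hugging Face team, and parts of it were copied and pasted from the original model card.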

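Model details

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680,000 hours of labelled speech data annotated using large-scale weak supervision.

The models were trained on either English-only data or multilingual data. The English-only models were trained on the task of speech recognition. The multilingual models were trained on both speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.

Whisper checkpoints come in five configurations of varying model size. The smallest four are trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten pre-trained checkpoints are available on the Hugging Face Hub. The following table summarizes the checkpoints: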

Size	Parameters	English-only	Multilingual
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓
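The ten checkpoints in the table can be enumerated programmatically. The sketch below assumes the Hub naming scheme suggested by this card's own model ID (openai/whisper-small) and an assumed `.en` suffix for the English-only variants; `checkpoint_ids` is a hypothetical helper for illustration, not part of any library.

```python
# Sizes as listed in the table; the two "large" rows are multilingual only.
SIZES = ["tiny", "base", "small", "medium", "large", "large-v2"]
MULTILINGUAL_ONLY = {"large", "large-v2"}


def checkpoint_ids(org="openai"):
    """Return the assumed Hub IDs for every checkpoint in the table."""
    ids = []
    for size in SIZES:
        if size not in MULTILINGUAL_ONLY:
            # Assumption: English-only variants carry a ".en" suffix.
            ids.append(f"{org}/whisper-{size}.en")
        ids.append(f"{org}/whisper-{size}")
    return ids


# 4 sizes x 2 variants + 2 multilingual-only checkpoints
print(len(checkpoint_ids()))
```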

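Usage

To transcribe audio samples, the model has to be used alongside a WhisperProcessor.

The WhisperProcessor is used to:

Pre-process the audio inputs (converting them to log-Mel spectrograms for the model)

Post-process the model outputs (converting them from tokens to text)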
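As an illustration of the pre-processing step, the sketch below shows only the fixed-length framing that typically precedes the spectrogram computation. It assumes 16 kHz audio and a 30-second input window (both figures appear elsewhere in this card); `pad_or_trim` is a hypothetical helper, not the WhisperProcessor code, and the log-Mel transform itself is omitted.

```python
SAMPLING_RATE = 16_000   # Hz, as used by the examples in this card
CHUNK_SECONDS = 30       # assumed fixed input window of the model
N_SAMPLES = SAMPLING_RATE * CHUNK_SECONDS


def pad_or_trim(samples, length=N_SAMPLES):
    """Zero-pad or truncate a sequence of samples to exactly `length` items."""
    trimmed = list(samples[:length])
    return trimmed + [0.0] * (length - len(trimmed))


short_clip = [0.1] * (5 * SAMPLING_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLING_RATE)    # 45 seconds of audio
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))
```

Audio longer than the window is cut off in this sketch, which is why long recordings need the chunked approach described under long-form transcription.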

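The model is informed of which task to perform (transcription or translation) by passing the appropriate "context tokens". These context tokens are a sequence of tokens given to the decoder at the start of the decoding process, in the following order:

The transcription always starts with the <|startoftranscript|> token

The second token is the language token (e.g. <|en|> for English)

The third token is the "task token". It can take one of two values: <|transcribe|> for speech recognition or <|translate|> for speech translation

In addition, a <|notimestamps|> token is added if the model should not include timestamp prediction

Thus, a typical sequence of context tokens might look as follows:

<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>

This tells the model to decode in English, under the task of speech recognition, and not to predict timestamps.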
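The ordering rules above can be sketched as a small function. `build_context_tokens` is a hypothetical helper written for this illustration; it only assembles the token strings and does not reflect how the tokenizer stores them internally.

```python
def build_context_tokens(language="en", task="transcribe", timestamps=False):
    """Assemble the context tokens in the order described above."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = [
        "<|startoftranscript|>",  # always first
        f"<|{language}|>",        # language token
        f"<|{task}|>",            # task token
    ]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # only when timestamps are not wanted
    return tokens


print(" ".join(build_context_tokens()))
# <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
```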

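These tokens can either be forced or un-forced. If they are forced, the model is made to predict each token at each position. This allows one to control the output language and task for the Whisper model. If they are un-forced, the Whisper model will automatically predict the output language and task itself.

The context tokens can be set accordingly, where processor is a WhisperProcessor instance:

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")

This forces the model to predict in English under the task of speech recognition.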
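To make the forced versus un-forced distinction concrete, the toy decoding loop below overrides the model's proposal at every position listed in a forced-ID table. It assumes the table is a list of (position, token_id) pairs; the token IDs and `fake_model_step` are made up for illustration and do not correspond to real Whisper vocabulary entries or to the library's generation code.

```python
def decode(model_step, forced_decoder_ids=None, start_id=0, max_len=6):
    """Greedy toy decoder: forced positions replace the model's own choice."""
    forced = dict(forced_decoder_ids or [])
    tokens = [start_id]
    while len(tokens) < max_len:
        position = len(tokens)
        proposed = model_step(tokens)  # what the model would pick by itself
        tokens.append(forced.get(position, proposed))
    return tokens


def fake_model_step(tokens):
    """Stand-in for a real model: returns an arbitrary, deterministic ID."""
    return 100 + len(tokens)


# Made-up IDs standing in for the language, task and no-timestamps tokens.
forced = [(1, 11), (2, 22), (3, 33)]

print(decode(fake_model_step))          # un-forced: [0, 101, 102, 103, 104, 105]
print(decode(fake_model_step, forced))  # forced:    [0, 11, 22, 33, 104, 105]
```

After the forced positions are exhausted, decoding continues with the model's own predictions in both cases.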

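Transcription

English to English

In this example, the context tokens are "unforced", meaning the model automatically predicts the output language (English) and task (transcribe).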

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> model.config.forced_decoder_ids = None

>>> # load dummy dataset and read audio files
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']

The context tokens can be removed from the start of the transcription by setting skip_special_tokens=True.
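The effect of skipping special tokens can be imitated on the decoded string from the example above. This is a toy sketch that assumes every special token has the form `<|...|>`; `strip_special_tokens` is a hypothetical helper, and the real tokenizer filters by token ID rather than by pattern matching.

```python
import re

# Assumption: special tokens look like <|name|> with no "|" inside the name.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove every <|...|> marker from a decoded string."""
    return SPECIAL_TOKEN.sub("", text)


decoded = (
    "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
    " Mr. Quilter is the apostle of the middle classes and we are glad"
    " to welcome his gospel.<|endoftext|>"
)
print(repr(strip_special_tokens(decoded)))
```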

French to French

The following example demonstrates French to French transcription by setting the decoder ids appropriately.

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']

Translation

Setting the task to "translate" forces the Whisper model to perform speech translation.

French to English

>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']

Evaluation

This code snippet shows how to evaluate Whisper Small on LibriSpeech test-clean:

>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...     # generate without tracking gradients, then decode and normalize
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.432213777886737
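For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, insertions and deletions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration that pools errors over all utterances; `edit_distance` and `corpus_wer` are hypothetical helpers, not the implementation behind the `wer` metric loaded above.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total = sum(len(r.split()) for r in references)
    if total == 0:
        raise ValueError("references contain no words")
    return errors / total


refs = ["the cat sat", "hello world"]
preds = ["the cat sit", "hello world"]
print(100 * corpus_wer(refs, preds))  # one substitution over five reference words
```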

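Long-form transcription

The Whisper model is intrinsically designed to work on audio samples of up to 30 seconds in duration. However, by using a chunking algorithm, it can be used to transcribe audio samples of arbitrary length. Chunking is enabled by setting chunk_length_s=30 when instantiating the Transformers pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also be extended to predict sequence-level timestamps by passing return_timestamps=True: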

>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...   "automatic-speech-recognition",
...   model="openai/whisper-small",
...   chunk_length_s=30,
...   device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.',
  'timestamp': (0.0, 5.44)}]

Refer to the blog post ASR Chunking for more details on the chunking algorithm.
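The core idea of chunking can be sketched as follows: the audio is split into overlapping 30-second windows so that a word cut at one window's edge is seen in full by its neighbour. The 5-second overlap on each side (`stride_s`) is an assumption chosen for illustration, and `chunk_spans` is a hypothetical helper rather than the pipeline's own implementation.

```python
def chunk_spans(total_s, chunk_length_s=30.0, stride_s=5.0):
    """Return (start, end) times in seconds of overlapping windows over the audio."""
    if chunk_length_s <= 2 * stride_s:
        raise ValueError("chunk_length_s must exceed twice the stride")
    step = chunk_length_s - 2 * stride_s  # new audio covered by each window
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        start += step
    return spans


print(chunk_spans(70.0))
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

A clip shorter than one window, such as the 5.44-second sample above, yields a single span, so chunking leaves short inputs unchanged.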

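Fine-tuning

The pre-trained Whisper model demonstrates a strong ability to generalize to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.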

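Evaluated use

The primary intended users of these models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only "intended" uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and on speech translation into English. They show strong ASR results in about 10 languages. They may exhibit additional capabilities, particularly if fine-tuned on tasks such as voice activity detection, speaker classification, or speaker diarization, but they have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in their particular context and domain before deploying them.

In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent, or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains such as decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them to infer human attributes is not only not evaluated but also not appropriate.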

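Training data

The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, and the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages.

As discussed in the accompanying paper, we see that transcription performance in a given language is directly correlated with the amount of training data we employ in that language.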

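Performance and limitations

Our studies show that, compared with many existing ASR systems, the models exhibit improved robustness to accents, background noise, and technical language, as well as zero-shot translation from multiple languages into English, and that accuracy on speech recognition and translation is near the state-of-the-art level.

However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include text that is not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in the audio with trying to transcribe the audio itself.

Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include a higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release.

In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive text, which can be mitigated to some degree by beam search and temperature scheduling, but not perfectly. Further analysis of these limitations is provided in the paper. This behavior and hallucinations are likely to be worse on lower-resource and/or lower-discoverability languages.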

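Broader implications

We anticipate that the transcription capabilities of Whisper models may be used to improve accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

There are also potential dual-use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capability to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and to disparate performance. In practice, we expect that the cost of transcription is not the limiting factor in scaling up surveillance projects.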

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Author:

OpenAI

Dataset size:

2.71 GB