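Whisper was proposed by Alec Radford et al. from OpenAI in the paper Robust Speech Recognition via Large-Scale Weak Supervision. The original code repository can be found here.
Disclaimer: Part of this model card was written by the Hugging Face team, and part was copied and pasted from the original model card.
Whisper checkpoints come in five configurations of varying model size. The smallest four are trained on either English-only or multilingual data. The largest checkpoints are multilingual only. All ten pre-trained checkpoints are available on the Hugging Face Hub. The following table summarises the checkpoints and links to the models on the Hub: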
Size | Parameters | English-only | Multilingual |
tiny | 39 M | ✓ | ✓ |
base | 74 M | ✓ | ✓ |
small | 244 M | ✓ | ✓ |
medium | 769 M | ✓ | ✓ |
large | 1550 M | x | ✓ |
large-v2 | 1550 M | x | ✓ |
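The rows of the table can be turned into Hub identifiers programmatically. The sketch below is illustrative only: `checkpoint_id` is a hypothetical helper, and the naming pattern (`openai/whisper-<size>`, with a `.en` suffix for English-only checkpoints) is an assumption extrapolated from the `openai/whisper-small` identifier used in the examples of this card.

```python
# Sizes as listed in the table; only the four smallest have English-only variants.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
ALL_SIZES = ENGLISH_ONLY_SIZES | {"large", "large-v2"}


def checkpoint_id(size: str, english_only: bool = False) -> str:
    """Build a Hub model id for a table row (hypothetical helper, assumed naming)."""
    if size not in ALL_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only checkpoint for size {size!r}")
        return f"openai/whisper-{size}.en"
    return f"openai/whisper-{size}"


# Enumerate every (size, variant) combination that the table marks as available.
all_ids = [
    checkpoint_id(size, english_only)
    for size in sorted(ALL_SIZES)
    for english_only in (False, True)
    if not english_only or size in ENGLISH_ONLY_SIZES
]
print(len(all_ids))  # 10
```

The count of ten matches the number of checkpoints stated above: six multilingual plus four English-only.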
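To transcribe audio samples, the model has to be used alongside a WhisperProcessor.
The model is told which task to perform by a sequence of context tokens given to the decoder at the start of decoding, for example:
<|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
These tokens can be forced at generation time by setting forced_decoder_ids on the model configuration, where processor is a WhisperProcessor instance:
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")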
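To make the structure of the context-token sequence concrete, here is a minimal sketch that assembles it as plain strings. `build_context_tokens` is a hypothetical helper written for illustration; it is not part of the transformers API, and the real `get_decoder_prompt_ids` returns token ids rather than strings.

```python
from typing import List


def build_context_tokens(language_code: str, task: str, timestamps: bool = False) -> List[str]:
    """Assemble the decoder context tokens in order (hypothetical helper)."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"task must be 'transcribe' or 'translate', got {task!r}")
    tokens = [
        "<|startoftranscript|>",   # decoding always starts here
        f"<|{language_code}|>",    # language token, e.g. <|en|> or <|fr|>
        f"<|{task}|>",             # task token
    ]
    if not timestamps:
        # Ask the model not to predict timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


print(" ".join(build_context_tokens("en", "transcribe")))
# <|startoftranscript|> <|en|> <|transcribe|> <|notimestamps|>
```

Changing the second argument to `"translate"` yields the sequence used for speech translation.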
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> model.config.forced_decoder_ids = None

>>> # load dummy dataset and read audio files
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]
>>> input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
['<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']

>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration
>>> from datasets import Audio, load_dataset

>>> # load model and processor
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
>>> forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="translate")

>>> # load streaming dataset and read first audio sample
>>> ds = load_dataset("common_voice", "fr", split="test", streaming=True)
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
>>> input_speech = next(iter(ds))["audio"]
>>> input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features

>>> # generate token ids
>>> predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
>>> # decode token ids to text
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' A very interesting work, we will finally be given on this subject.']
The code snippet below shows how to evaluate Whisper Small on LibriSpeech test-clean:
>>> from datasets import load_dataset
>>> from transformers import WhisperForConditionalGeneration, WhisperProcessor
>>> import torch
>>> from evaluate import load

>>> librispeech_test_clean = load_dataset("librispeech_asr", "clean", split="test")

>>> processor = WhisperProcessor.from_pretrained("openai/whisper-small")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to("cuda")

>>> def map_to_pred(batch):
...     audio = batch["audio"]
...     input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
...     batch["reference"] = processor.tokenizer._normalize(batch['text'])
...
...     with torch.no_grad():
...         predicted_ids = model.generate(input_features.to("cuda"))[0]
...     transcription = processor.decode(predicted_ids)
...     batch["prediction"] = processor.tokenizer._normalize(transcription)
...     return batch

>>> result = librispeech_test_clean.map(map_to_pred)

>>> wer = load("wer")
>>> print(100 * wer.compute(references=result["reference"], predictions=result["prediction"]))
3.432213777886737
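The figure printed above is a word error rate (WER), expressed as a percentage. As a rough illustration of what that number measures, the sketch below computes a corpus-level WER from a word-level edit distance. It is a simplified stand-in written for this card, not the implementation behind `load("wer")`, and it assumes the texts have already been normalised.

```python
from typing import Sequence


def word_error_rate(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Total word-level edits divided by total reference words (illustrative sketch)."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # prev[j] holds the edit distance between the reference prefix
        # processed so far and the first j hypothesis words.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
                deletion = prev[j] + 1       # reference word missing from hypothesis
                insertion = curr[j - 1] + 1  # extra word in hypothesis
                curr.append(min(substitution, deletion, insertion))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
print(100 * word_error_rate(refs, hyps))  # 25.0
```

In this toy example there is one deletion and one substitution across eight reference words, hence 25%.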
Whisper is designed for audio samples of up to 30 seconds. Longer audio can be transcribed with the Transformers pipeline by enabling chunking, which is done by setting chunk_length_s=30 when instantiating the pipeline:
>>> import torch
>>> from transformers import pipeline
>>> from datasets import load_dataset

>>> device = "cuda:0" if torch.cuda.is_available() else "cpu"

>>> pipe = pipeline(
...     "automatic-speech-recognition",
...     model="openai/whisper-small",
...     chunk_length_s=30,
...     device=device,
... )

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]

>>> prediction = pipe(sample.copy(), batch_size=8)["text"]
" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel."

>>> # we can also return timestamps for the predictions
>>> prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
[{'text': ' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.', 'timestamp': (0.0, 5.44)}]
For details of the chunking algorithm, see the blog post ASR Chunking.
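The core idea of chunking, splitting long audio into overlapping fixed-length windows, can be sketched as follows. `chunk_spans` is a hypothetical helper, and the 5-second overlap on each side is an illustrative assumption rather than a value taken from this card; the pipeline's actual handling of overlaps when merging transcriptions is more involved.

```python
from typing import List, Tuple


def chunk_spans(
    num_samples: int,
    sampling_rate: int = 16_000,
    chunk_length_s: float = 30.0,
    stride_s: float = 5.0,  # assumed overlap on each side of a chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks (illustrative sketch)."""
    chunk = int(chunk_length_s * sampling_rate)
    stride = int(stride_s * sampling_rate)
    if chunk <= 2 * stride:
        raise ValueError("chunk length must exceed twice the stride")
    # Each new chunk advances by the part of the window not shared with its neighbours.
    step = chunk - 2 * stride
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break
        start += step
    return spans


# A 70-second recording sampled at 16 kHz.
spans = chunk_spans(70 * 16_000)
print([(s / 16_000, e / 16_000) for s, e in spans])
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Consecutive windows share 10 seconds of audio here, which gives the model context on both sides of every chunk boundary.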
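The pre-trained Whisper model generalises strongly across datasets and domains. Its predictive capabilities for specific languages and tasks can nevertheless be improved further through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.
As discussed in the accompanying paper, we find that transcription performance in a given language is directly correlated with the amount of training data we use in that language.
Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages, or languages for which we have less training data. The models also perform differently across accents and dialects of a given language, which may include higher word error rates for speakers of different genders, races, ages, or other demographic groups. Our full evaluation results are presented in the paper accompanying this release.
In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive text. Beam search and temperature scheduling mitigate this to some degree, but not perfectly. Further analysis of these limitations is provided in the paper. This behaviour, and hallucinations, are likely to be worse on lower-resource and/or lower-discoverability languages.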
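One way to picture temperature scheduling as a guard against repetitive output is a retry loop that re-decodes at a higher temperature whenever the text looks degenerate. The sketch below uses a compression ratio as the repetition signal. `decode_fn` is a hypothetical callable standing in for a real decoding step (which could itself use beam search), and the temperature schedule and threshold are illustrative assumptions, not settings prescribed by this card.

```python
import zlib
from typing import Callable, Sequence


def compression_ratio(text: str) -> float:
    """Raw size over compressed size; highly repetitive text compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn: Callable[[float], str],  # hypothetical: temperature -> decoded text
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # assumed schedule
    max_ratio: float = 2.4,  # assumed threshold for "too repetitive"
) -> str:
    """Retry decoding at increasing temperature until the output is not degenerate."""
    text = ""
    for temperature in temperatures:
        text = decode_fn(temperature)
        if compression_ratio(text) <= max_ratio:
            break  # output looks acceptable, stop escalating
    return text


# Toy decoder: loops at low temperature, recovers once sampling is noisier.
def fake_decode(temperature: float) -> str:
    if temperature < 0.4:
        return "la la la " * 40
    return " Mr. Quilter is the apostle of the middle classes."


print(decode_with_fallback(fake_decode))
```

If every temperature yields repetitive text, the function returns the last attempt, which mirrors the caveat above that the mitigation is imperfect.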
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}