Model:
nvidia/stt_en_fastconformer_transducer_xxlarge
License:
cc-by-4.0
Preprint:
arxiv:2305.05084
Language:
en
Datasets:
MLCommons/peoples_speech, mozilla-foundation/common_voice_8_0, Multilingual-LibriSpeech-(2000-hours), Europarl-ASR-(EN), VoxPopuli-(EN), vctk, National-Singapore-Corpus-Part-6, National-Singapore-Corpus-Part-1, WSJ-1, WSJ-0, Switchboard-1, fisher_corpus, librispeech_asr
Task:
Automatic speech recognition
This model transcribes speech in the lowercase English alphabet. It is an "extra extra large" version of the FastConformer Transducer model (around 1.2B parameters). See the model architecture section and the NeMo documentation for complete architecture details.
To train, fine-tune or play with the model you will need to install NVIDIA NeMo. We recommend you install it after you've installed the latest PyTorch version.
pip install nemo_toolkit['all']
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_transducer_xxlarge")
First, let's get a sample:
wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
Then simply do:
asr_model.transcribe(['2086-149220-0033.wav'])
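The shape of the value returned by `transcribe` is not described here, and it may differ between NeMo releases. The sketch below assumes three possibilities that you should verify against your installed version: a plain list of strings, a `(best_hypotheses, all_hypotheses)` tuple, or a list of objects exposing a `.text` attribute. The helper name `extract_texts` is hypothetical and not part of NeMo.

```python
def extract_texts(result):
    """Normalize a transcribe() result into a list of plain strings.

    Assumption: `result` is a list of str, a (best, all) tuple, or a list of
    objects with a `.text` attribute. Check your NeMo version's documentation.
    """
    # Assumed tuple form: keep only the best hypotheses.
    if isinstance(result, tuple):
        result = result[0]

    texts = []
    for item in result:
        # Strings pass through; anything else is assumed to carry `.text`.
        texts.append(item if isinstance(item, str) else item.text)
    return texts
```

With a loaded model, usage would look like `extract_texts(asr_model.transcribe(['2086-149220-0033.wav']))`.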
To transcribe many audio files:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="nvidia/stt_en_fastconformer_transducer_xxlarge" audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
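Before transcribing, it can help to confirm that a file matches the input format stated above. The following is a minimal sketch using only Python's standard `wave` module, which reads PCM WAV files; the helper name `check_wav` is hypothetical and not part of NeMo.

```python
import wave

EXPECTED_RATE = 16000   # Hz, as stated above
EXPECTED_CHANNELS = 1   # mono


def check_wav(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse (e.g. non-PCM WAV).
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

For example, `check_wav('2086-149220-0033.wav')` returns an empty list if the sample is 16 kHz mono. Files that fail the check would need to be resampled or downmixed with an audio tool of your choice first.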
This model provides transcribed speech as a string for a given audio sample.
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a Transducer decoder (RNNT) loss. More details on FastConformer are available here: Fast-Conformer Model.
The NeMo toolkit [3] was used to train the models for several hundred epochs. These models are trained with this example script and this base config.
The tokenizers for these models were built from the text transcripts of the train set with this script.
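To give a feel for what 8x downsampling means for sequence length, here is a small arithmetic sketch. It assumes a 10 ms hop between input feature frames; that value is an assumption made for illustration and is not stated in this card. Exact frame counts also depend on padding, so the result is approximate.

```python
FEATURE_HOP_MS = 10      # assumption: 10 ms between input feature frames
DOWNSAMPLING_FACTOR = 8  # from the text: 8x convolutional downsampling


def encoder_frame_ms(hop_ms=FEATURE_HOP_MS, factor=DOWNSAMPLING_FACTOR):
    """Audio duration covered by one encoder output frame, in milliseconds."""
    return hop_ms * factor


def approx_encoder_frames(duration_ms, hop_ms=FEATURE_HOP_MS,
                          factor=DOWNSAMPLING_FACTOR):
    """Approximate number of encoder output frames for a clip (ignores padding)."""
    feature_frames = duration_ms // hop_ms
    # Ceiling division: a partial group of feature frames still yields a frame.
    return -(-feature_frames // factor)
```

Under these assumptions each encoder frame spans 80 ms, so a 10-second clip is reduced from about 1,000 feature frames to about 125 encoder frames.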
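The model in this collection is trained on a composite dataset (NeMo ASRSet En) comprising several thousand hours of English speech:
- Librispeech, 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN), 2,000-hour subset
- Mozilla Common Voice (v7.0)
- People's Speech, 12,000-hour subset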
The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER). Since this model is trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio from a wide range of sources.
The following table summarizes the performance of the available models in this collection with the Transducer decoder. Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.
Version | Tokenizer | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MCV Test 7.0 | Train Dataset |
---|---|---|---|---|---|---|---|---|---|---|
1.20.0 | SentencePiece Unigram | 1024 | 3.04 | 1.59 | 1.27 | 2.13 | 5.84 | 4.88 | 5.11 | NeMo ASRSET 3.0 |
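For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below computes it for a single utterance pair with plain whitespace tokenization; it is an illustration, not the evaluation code used to produce the numbers above, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref)
```

For instance, dropping one word from a six-word reference gives one deletion out of six words, or roughly 16.7% WER. Corpus-level figures are usually obtained by summing errors and reference words over all utterances rather than averaging per-utterance rates.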
Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
NVIDIA Riva is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. Additionally, Riva provides:
Although this model isn't supported yet by Riva, the list of supported models is here. Check out the Riva live demo.
[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Google Sentencepiece Tokenizer
[3] NVIDIA NeMo Toolkit
The license to use this model is covered by CC-BY-4.0. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.