NVIDIA FastConformer-Transducer XXLarge (en)

This model transcribes speech into the lowercase English alphabet. It is an "extra extra large" version of the FastConformer Transducer model (around 1.2B parameters). See the Model Architecture section and the NeMo documentation for complete architecture details.

NVIDIA NeMo: Training

To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install nemo_toolkit['all']

How to Use this Model

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_transducer_xxlarge")

Transcribing using Python

First, download a sample audio file:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

Then transcribe it:

asr_model.transcribe(['2086-149220-0033.wav'])

Transcribing many audio files

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="nvidia/stt_en_fastconformer_transducer_xxlarge" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
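The same batch workflow can also be sketched directly in Python. The helper names below (list_wav_files, transcribe_directory) are hypothetical and not part of NeMo; the sketch only assumes that the loaded model exposes transcribe() taking a list of file paths and a batch_size keyword. The structure of the returned transcriptions can differ between NeMo versions, so the results are passed through unchanged.

```python
from pathlib import Path


def list_wav_files(audio_dir):
    """Return the sorted paths (as strings) of the .wav files directly inside audio_dir."""
    return sorted(str(p) for p in Path(audio_dir).glob("*.wav"))


def transcribe_directory(asr_model, audio_dir, batch_size=4):
    """Transcribe every .wav file in audio_dir with an already-loaded ASR model.

    asr_model is assumed to provide transcribe(list_of_paths, batch_size=...).
    Returns (files, results); results is whatever the model returns.
    """
    files = list_wav_files(audio_dir)
    if not files:
        # Nothing to transcribe, so skip the model call entirely.
        return files, []
    results = asr_model.transcribe(files, batch_size=batch_size)
    return files, results
```

With the model instantiated as shown earlier, a call such as transcribe_directory(asr_model, "my_audio_dir") would return the file list alongside the transcriptions, which makes it easy to pair each file with its text.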

Input

This model accepts 16000 Hz mono-channel audio (WAV files) as input.
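Before transcribing, it can help to confirm that a file matches this format. The following is a minimal sketch using only the Python standard library; check_wav_format is a hypothetical helper, and it only handles uncompressed PCM WAV files that the wave module can read.

```python
import wave

EXPECTED_RATE = 16000    # Hz, as stated in the Input section
EXPECTED_CHANNELS = 1    # mono


def check_wav_format(path):
    """Return (ok, sample_rate, channels) for a PCM WAV file at path."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    ok = rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
    return ok, rate, channels
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.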

Output

This model provides transcribed speech as a string for a given audio sample.

Model Architecture

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a Transducer decoder (RNNT) loss. You can find more details on FastConformer here: Fast-Conformer Model.
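To get a feel for what 8x downsampling means for sequence length, here is a rough sketch. It assumes a 10 ms hop between input feature frames, which is not stated in this card, and it ignores padding and edge effects, so the result is an approximation. The function name is hypothetical.

```python
FEATURE_HOP_MS = 10   # ASSUMPTION: 10 ms between input feature frames (not stated in this card)
SUBSAMPLING = 8       # 8x downsampling, as described above


def approx_encoder_frames(duration_ms):
    """Approximate number of encoder output frames for audio of the given length.

    duration_ms is an integer number of milliseconds. Padding is ignored.
    """
    feature_frames = duration_ms // FEATURE_HOP_MS
    return feature_frames // SUBSAMPLING
```

Under these assumptions, each encoder frame covers roughly 80 ms of audio, so a 10-second clip yields about 125 encoder frames.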

Training

The NeMo toolkit [3] was used to train the models for several hundred epochs. These models are trained with this example script and this base config.

The tokenizers for these models were built using the text transcripts of the train set with this script.

Datasets

The model in this collection is trained on a composite dataset (NeMo ASRSet En) comprising several thousand hours of English speech:

  • Librispeech 960 hours of English speech
  • Fisher Corpus
  • Switchboard-1 Dataset
  • WSJ-0 and WSJ-1
  • National Speech Corpus (Part 1, Part 6)
  • VCTK
  • VoxPopuli (EN)
  • Europarl-ASR (EN)
  • Multilingual Librispeech (MLS EN) - 2,000 hrs subset
  • Mozilla Common Voice (v7.0)
  • People's Speech - 12,000 hrs subset

Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Because this model is trained on a much larger corpus spanning multiple domains, it will generally perform better at transcribing general audio.

The following table summarizes the performance of the model in this collection with the Transducer decoder. ASR performance is reported as Word Error Rate (WER%) with greedy decoding.
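For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and its reference, divided by the number of reference words. The sketch below is a minimal single-utterance illustration with a hypothetical function name, not the evaluation code used to produce the numbers in this card; benchmark figures are normally aggregated over a whole test set and may involve text normalization not shown here.

```python
def wer_percent(reference, hypothesis):
    """Word Error Rate in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words.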

| Version | Tokenizer | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MCV Test 7.0 | Train Dataset |
|---|---|---|---|---|---|---|---|---|---|---|
| 1.20.0 | SentencePiece Unigram | 1024 | 3.04 | 1.59 | 1.27 | 2.13 | 5.84 | 4.88 | 5.11 | NeMo ASRSET 3.0 |

Limitations

Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

NVIDIA Riva: Deployment

NVIDIA Riva is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. Additionally, Riva provides:

  • World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
  • Best in class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
  • Streaming speech recognition, Kubernetes compatible scaling, and enterprise-grade support

Although this model is not yet supported by Riva, the list of supported models is here. Check out the Riva live demo.

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[2] Google Sentencepiece Tokenizer

[3] NVIDIA NeMo Toolkit

License

Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
