dominguesm/stt_pt_quartznet15x5_ctc_small

Model:

dominguesm/stt_pt_quartznet15x5_ctc_small

Task:

Automatic Speech Recognition

Libraries:

NeMo PyTorch

Datasets:

mozilla-foundation/common_voice_9_0

Language:

Other:

speech audio CTC QuartzNet Transformer NeMo Eval Results

License:

cc-by-4.0


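Model Overview

This model transcribes Portuguese speech into lowercase letters and spaces. It is a "small" version of the QuartzNet-CTC model.

NVIDIA NeMo: Training

To train, fine-tune, or use the model, you need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.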

pip install "nemo_toolkit[all]"

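How to Use this Model

The model is available in the NeMo toolkit [1] and can be used as a pretrained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model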

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("dominguesm/stt_pt_quartznet15x5_ctc_small")

Transcribing using Python

First, get a sample:

wget https://github.com/DominguesM/stt_pt_quartznet15x5_ctc_small/raw/main/audios/common_voice_pt_25555332.mp3

Then simply run:

asr_model.transcribe(['common_voice_pt_25555332.mp3'])

Transcribing many audio files

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py  pretrained_name="dominguesm/stt_pt_quartznet15x5_ctc_small"  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
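When calling the script is inconvenient, the same batch workflow can be sketched in Python. The helper `collect_audio_files` and the extension set below are illustrative assumptions and are not part of NeMo. The NeMo call is left as a comment because it needs the toolkit and the downloaded checkpoint.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to match your data.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def collect_audio_files(audio_dir):
    """Hypothetical helper: sorted paths of audio files directly inside audio_dir."""
    return sorted(
        str(path)
        for path in Path(audio_dir).iterdir()
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


# With NeMo installed and asr_model instantiated as shown earlier:
# files = collect_audio_files("<DIRECTORY CONTAINING AUDIO FILES>")
# transcripts = asr_model.transcribe(files)
```

Sorting the paths keeps the order of the returned transcripts predictable, which makes it easier to pair each transcript with its file.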

Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
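A quick way to confirm that a file meets this input requirement is to read its header. The sketch below uses Python's standard `wave` module, which handles PCM WAV files only. The function name `check_wav_format` is an illustrative choice, not something the model card or NeMo defines.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, per the input requirement above
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return (ok, sample_rate, channels) for a PCM WAV file at the given path."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    ok = sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
    return ok, sample_rate, channels
```

A file that fails the check would need to be resampled or downmixed with an audio tool of your choice before transcription.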

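Output

This model returns the transcribed speech as a string for a given audio sample.

Model Architecture

The model is based on the QuartzNet architecture, a variant of Jasper that uses 1D time-channel separable convolutional layers in its convolutional residual blocks and is therefore smaller than Jasper models.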
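The size reduction from time-channel separable convolutions can be illustrated by counting weights. A standard 1D convolution with kernel size K, C_in input channels, and C_out output channels has K * C_in * C_out weights. The separable form splits this into a depthwise convolution (K * C_in weights) followed by a pointwise 1x1 convolution (C_in * C_out weights). The sketch below ignores biases and normalization layers, and the kernel size and channel count in the usage note are illustrative values, not taken from this model's configuration.

```python
def standard_conv1d_weights(kernel_size, in_channels, out_channels):
    """Weight count of a regular 1D convolution (bias ignored)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_weights(kernel_size, in_channels, out_channels):
    """Weight count of a depthwise conv plus a pointwise 1x1 conv (bias ignored)."""
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise
```

For example, with a kernel size of 33 and 256 channels in and out, `standard_conv1d_weights(33, 256, 256)` gives 2,162,688 weights while `separable_conv1d_weights(33, 256, 256)` gives 73,984, roughly 29 times fewer.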

QuartzNet models take in audio segments and transcribe them to letter, byte-pair, or word-piece sequences.
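Because this is a CTC model, the per-frame predictions have to be collapsed into a letter sequence. A minimal sketch of greedy CTC decoding follows: merge consecutive repeated labels, then drop the blank label. The toy vocabulary and the choice of the last index as blank are assumptions made for illustration. This is not NeMo's decoding code.

```python
def ctc_greedy_decode(frame_ids, vocabulary, blank_id):
    """Collapse consecutive repeated ids, drop blanks, and map ids to characters."""
    decoded = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != previous and idx != blank_id:
            decoded.append(vocabulary[idx])
        previous = idx
    return "".join(decoded)


# Toy vocabulary (assumed for illustration); the blank is placed at the last index.
vocabulary = [" ", "a", "o", "l"]
blank_id = len(vocabulary)
print(ctc_greedy_decode([2, 2, blank_id, 3, 3, blank_id, 1, 1], vocabulary, blank_id))
```

The blank label is what allows doubled letters: the frames `[1, blank_id, 1]` decode to "aa", whereas `[1, 1]` decodes to a single "a".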

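Training

All training scripts can be found in this repository: DominguesM/stt_pt_quartznet15x5_ctc_small

Datasets

The model was trained on a portion of the Portuguese Common Voice 9.0 dataset, totaling 26 hours of audio.

Mozilla Common Voice (v9.0)

Performance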

Metric	Score
WER	49%
CER	18%
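For readers unfamiliar with these metrics, both are edit distances normalized by the reference length: WER counts word-level substitutions, insertions, and deletions, while CER does the same at the character level. The sketch below computes them for a single utterance and assumes a non-empty reference. It is not the evaluation script used for the scores above, and the Portuguese sentences in the usage note are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current_row.append(min(
                previous_row[j] + 1,         # deletion
                current_row[j - 1] + 1,      # insertion
                previous_row[j - 1] + cost,  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

With the reference "o gato preto" and the hypothesis "o gato preta", one of three words is wrong (WER of 1/3) but only one of twelve characters (CER of 1/12), which shows why CER is typically much lower than WER.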

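These metrics were obtained with the following code:

Note: the steps below must be run after downloading the dataset (Mozilla Common Voice 9.0 PT) and following the steps for preprocessing the audio data and manifest files contained in the file notebooks/Finetuning CTC model Portuguese.ipynb.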

$ wget -P scripts/ "https://raw.githubusercontent.com/NVIDIA/NeMo/v1.9.0/examples/asr/speech_to_text_eval.py"

$ wget -P scripts/ "https://raw.githubusercontent.com/NVIDIA/NeMo/v1.9.0/examples/asr/transcribe_speech.py"

$ python scripts/speech_to_text_eval.py \
    pretrained_name="dominguesm/stt_pt_quartznet15x5_ctc_small" \
    dataset_manifest="manifests/pt/commonvoice_test_manifest_processed.json" \
    output_filename="./evaluation_transcripts.json" \
    batch_size=32 \
    amp=true \
    use_cer=false
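The `dataset_manifest` argument above points to a manifest file. As a rough sketch, and assuming the usual NeMo ASR convention of one JSON object per line with `audio_filepath`, `duration` (in seconds), and `text` fields, such a file could be written as follows. The helper name `write_manifest` and the example entry are hypothetical. The notebook referenced above remains the authoritative source for the actual preprocessing.

```python
import json


def write_manifest(entries, manifest_path):
    """Write (audio_filepath, duration_seconds, text) tuples as JSON lines."""
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for audio_filepath, duration, text in entries:
            record = {
                "audio_filepath": audio_filepath,
                "duration": duration,
                "text": text,
            }
            # ensure_ascii=False keeps accented Portuguese characters readable.
            manifest_file.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical entry, for illustration only.
example_entries = [("clips/example_0001.wav", 3.2, "olá mundo")]
# write_manifest(example_entries, "manifests/pt/example_manifest.json")
```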

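Limitations

Because the model was trained on publicly available speech datasets, its performance may degrade on speech that contains technical terms or vernacular the model was not trained on. The model may also perform worse on accented speech.

Citation

If you use our work, please cite: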

@misc{domingues2022quartznet15x15-small-portuguese,
  title={Fine-tuned {Quartznet}-15x5 CTC small model for speech recognition in {P}ortuguese},
  author={Domingues, Maicon},
  howpublished={\url{https://huggingface.co/dominguesm/stt_pt_quartznet15x5_ctc_small}},
  year={2022}
}

References

[1] NVIDIA NeMo Toolkit

Author:

Maicon Domingues

Dataset size:

72.79 MB