Model:
speechbrain/vad-crdnn-libriparty
This repository provides all the tools needed to perform voice activity detection with a model pretrained on LibriParty using SpeechBrain.
The pretrained system can process both short and long speech recordings and outputs the segments where speech activity is detected. The output of the system looks like this:
```
segment_001  0.00  2.57 NON_SPEECH
segment_002  2.57  8.20 SPEECH
segment_003  8.20  9.10 NON_SPEECH
segment_004  9.10 10.93 SPEECH
segment_005 10.93 12.00 NON_SPEECH
segment_006 12.00 14.40 SPEECH
segment_007 14.40 15.00 NON_SPEECH
segment_008 15.00 17.70 SPEECH
```
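Records of this form ( `segment_id start end LABEL` ) are easy to post-process with a few lines of Python. The helper below is hypothetical, written for illustration only, and is not part of the SpeechBrain API:

```python
def parse_vad_output(text):
    """Parse 'segment_id start end LABEL' records into (start, end, label) tuples.

    Works whether the records are newline-separated or all on one line,
    since it simply groups whitespace-separated tokens four at a time.
    """
    segments = []
    for seg_id, start, end, label in zip(*[iter(text.split())] * 4):
        segments.append((float(start), float(end), label))
    return segments
```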
The system expects input recordings sampled at 16 kHz (single channel). If your signal has a different sample rate, resample it before using the interface (e.g. with torchaudio or sox).
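As a minimal resampling sketch, here using `scipy.signal.resample_poly` (torchaudio or sox, as mentioned above, work just as well); the 44.1 kHz input rate below is only an assumed example:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(signal, orig_sr):
    """Resample a 1-D signal from orig_sr to the 16 kHz mono input the VAD expects."""
    g = gcd(orig_sr, 16000)
    return resample_poly(signal, up=16000 // g, down=orig_sr // g)

# Example: one second of audio at an assumed 44.1 kHz input rate.
one_second = np.zeros(44100, dtype=np.float32)
resampled = to_16k(one_second, 44100)  # ~16000 samples after resampling
```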
For a better experience, we encourage you to learn more about SpeechBrain.
The performance of the model on the LibriParty test set is:
| Release | Hyperparams file | Test Precision | Test Recall | Test F-Score | Model link | GPUs |
|---|---|---|---|---|---|---|
| 2021-09-09 | train.yaml | 0.9518 | 0.9437 | 0.9477 | 12312321 | 1xV100 16GB |
The system is composed of a CRDNN that outputs posterior probabilities close to one for speech frames and close to zero for non-speech frames. A threshold is applied on top of the posteriors to detect candidate speech boundaries.
Depending on the options enabled, these boundaries can be post-processed (e.g. merging segments that are close together, removing segments that are too short, etc.) to further improve performance. See more details below.
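Conceptually, the merging and pruning steps behave like the following simplified sketch (plain Python for illustration only; SpeechBrain's actual implementation operates on tensors and may differ in detail):

```python
def merge_close_segments(boundaries, close_th=0.250):
    """Merge speech segments whose silent gap is shorter than close_th seconds."""
    merged = []
    for start, end in boundaries:
        if merged and start - merged[-1][1] < close_th:
            merged[-1][1] = end  # extend the previous segment instead of opening a new one
        else:
            merged.append([start, end])
    return merged

def remove_short_segments(boundaries, len_th=0.250):
    """Drop speech segments shorter than len_th seconds."""
    return [[s, e] for s, e in boundaries if e - s >= len_th]
```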
```bash
pip install speechbrain
```
Please note that we encourage you to read our tutorials and learn more about SpeechBrain.
```python
from speechbrain.pretrained import VAD

VAD = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", savedir="pretrained_models/vad-crdnn-libriparty")
boundaries = VAD.get_speech_segments("speechbrain/vad-crdnn-libriparty/example_vad.wav")

# Print the output
VAD.save_boundaries(boundaries)
```
The output is a tensor containing the start/end second of each detected speech segment. You can save the boundaries to a file with:
```python
VAD.save_boundaries(boundaries, save_path='VAD_file.txt')
```
Sometimes it is useful to visualize the VAD output alongside the input signal. This helps to quickly check whether the VAD is working as expected.
To do so:
```python
import torchaudio

upsampled_boundaries = VAD.upsample_boundaries(boundaries, 'pretrained_model_checkpoints/example_vad.wav')
torchaudio.save('vad_final.wav', upsampled_boundaries.cpu(), 16000)
```
This creates a "VAD signal" with the same dimensionality as the original signal.
You can now open vad_final.wav and pretrained_model_checkpoints/example_vad.wav with software such as Audacity to visualize them jointly.
The pipeline for detecting speech segments is the following:
We designed the VAD so that you have access to all of these steps (which might help with debugging):
```python
from speechbrain.pretrained import VAD

VAD = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", savedir="pretrained_models/vad-crdnn-libriparty")

# 1- Let's compute frame-level posteriors first
audio_file = 'pretrained_model_checkpoints/example_vad.wav'
prob_chunks = VAD.get_speech_prob_file(audio_file)

# 2- Let's apply a threshold on top of the posteriors
prob_th = VAD.apply_threshold(prob_chunks).float()

# 3- Let's now derive the candidate speech segments
boundaries = VAD.get_boundaries(prob_th)

# 4- Apply energy VAD within each candidate speech segment (optional)
boundaries = VAD.energy_VAD(audio_file, boundaries)

# 5- Merge segments that are too close
boundaries = VAD.merge_close_segments(boundaries, close_th=0.250)

# 6- Remove segments that are too short
boundaries = VAD.remove_short_segments(boundaries, len_th=0.250)

# 7- Double-check speech segments (optional).
boundaries = VAD.double_check_speech_segments(boundaries, audio_file, speech_th=0.5)
```
To perform inference on GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
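For instance, a small sketch that falls back to CPU when no GPU is present (the fallback logic is a convenience added here, not part of the model card):

```python
import torch

# Choose "cuda" when a GPU is available, otherwise fall back to "cpu".
device = "cuda" if torch.cuda.is_available() else "cpu"

# The chosen device is then passed through run_opts, e.g.:
# VAD = VAD.from_hparams(
#     source="speechbrain/vad-crdnn-libriparty",
#     savedir="pretrained_models/vad-crdnn-libriparty",
#     run_opts={"device": device},
# )
```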
The model was trained with SpeechBrain (ea17d22). To train it from scratch, follow these steps:
```bash
git clone https://github.com/speechbrain/speechbrain/
cd speechbrain
pip install -r requirements.txt
pip install -e .
```
Download the required datasets:

- LibriParty: https://drive.google.com/file/d/1--cAS5ePojMwNY5fewioXAv9YlYAWzIJ/view?usp=sharing
- Musan: https://www.openslr.org/resources/17/musan.tar.gz
- CommonLanguage: https://zenodo.org/record/5036977/files/CommonLanguage.tar.gz?download=1
```bash
cd recipes/LibriParty/VAD
python train.py hparams/train.yaml --data_folder=/path/to/LibriParty --musan_folder=/path/to/musan/ --commonlanguage_folder=/path/to/common_voice_kpd
```
The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```