Model:
speechbrain/vad-crdnn-libriparty
This repository provides all the tools needed to perform voice activity detection with a model pretrained on LibriParty using SpeechBrain.
The pretrained system can process both short and long speech recordings and outputs the segments where speech activity is detected. The output of the system looks like this:
```
segment_001  0.00  2.57 NON_SPEECH
segment_002  2.57  8.20 SPEECH
segment_003  8.20  9.10 NON_SPEECH
segment_004  9.10 10.93 SPEECH
segment_005 10.93 12.00 NON_SPEECH
segment_006 12.00 14.40 SPEECH
segment_007 14.40 15.00 NON_SPEECH
segment_008 15.00 17.70 SPEECH
```
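Records of this form ( `segment_id start end LABEL` ) are easy to post-process with a few lines of Python. The helper below is hypothetical, written for illustration only, and is not part of the SpeechBrain API:

```python
def parse_vad_output(text):
    """Parse 'segment_id start end LABEL' records into (start, end, label) tuples.

    Works whether the records are newline-separated or all on one line,
    since it simply groups whitespace-separated tokens four at a time.
    """
    segments = []
    for seg_id, start, end, label in zip(*[iter(text.split())] * 4):
        segments.append((float(start), float(end), label))
    return segments
```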
The system expects input recordings sampled at 16 kHz (single channel). If your signal has a different sample rate, resample it before using the interface (e.g. with torchaudio or sox).
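As a minimal resampling sketch, here using `scipy.signal.resample_poly` (torchaudio or sox, as mentioned above, work just as well); the 44.1 kHz input rate below is only an assumed example:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def to_16k(signal, orig_sr):
    """Resample a 1-D signal from orig_sr to the 16 kHz mono input the VAD expects."""
    g = gcd(orig_sr, 16000)
    return resample_poly(signal, up=16000 // g, down=orig_sr // g)

# Example: one second of audio at an assumed 44.1 kHz input rate.
one_second = np.zeros(44100, dtype=np.float32)
resampled = to_16k(one_second, 44100)  # ~16000 samples after resampling
```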
For a better experience, we encourage you to learn more about SpeechBrain.
The performance of the model on the LibriParty test set is:
| Release | Hyperparams file | Test Precision | Test Recall | Test F-Score | Model link | GPUs |
|---|---|---|---|---|---|---|
| 2021-09-09 | train.yaml | 0.9518 | 0.9437 | 0.9477 | 12312321 | 1xV100 16GB |
The system is composed of a CRDNN that outputs posterior probabilities close to one for speech frames and close to zero for non-speech frames. A threshold is applied on top of the posteriors to detect candidate speech boundaries.
Depending on the options enabled, these boundaries can be post-processed (e.g. merging segments that are close together, removing segments that are too short, etc.) to further improve performance. See more details below.
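Conceptually, the merging and pruning steps behave like the following simplified sketch (plain Python for illustration only; SpeechBrain's actual implementation operates on tensors and may differ in detail):

```python
def merge_close_segments(boundaries, close_th=0.250):
    """Merge speech segments whose silent gap is shorter than close_th seconds."""
    merged = []
    for start, end in boundaries:
        if merged and start - merged[-1][1] < close_th:
            merged[-1][1] = end  # extend the previous segment instead of opening a new one
        else:
            merged.append([start, end])
    return merged

def remove_short_segments(boundaries, len_th=0.250):
    """Drop speech segments shorter than len_th seconds."""
    return [[s, e] for s, e in boundaries if e - s >= len_th]
```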
```bash
pip install speechbrain
```
Please note that we encourage you to read our tutorials and learn more about SpeechBrain.
```python
from speechbrain.pretrained import VAD

VAD = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", savedir="pretrained_models/vad-crdnn-libriparty")
boundaries = VAD.get_speech_segments("speechbrain/vad-crdnn-libriparty/example_vad.wav")

# Print the output
VAD.save_boundaries(boundaries)
```
The output is a tensor containing the start/end second of each detected speech segment. You can save the boundaries to a file with:
```python
VAD.save_boundaries(boundaries, save_path='VAD_file.txt')
```
Sometimes it is useful to visualize the VAD output alongside the input signal. This helps to quickly check whether the VAD is working as expected.
To do so:
```python
import torchaudio

upsampled_boundaries = VAD.upsample_boundaries(boundaries, 'pretrained_model_checkpoints/example_vad.wav')
torchaudio.save('vad_final.wav', upsampled_boundaries.cpu(), 16000)
```
This creates a "VAD signal" with the same dimensionality as the original signal.
You can now open vad_final.wav and pretrained_model_checkpoints/example_vad.wav with software such as Audacity to visualize them jointly.
The pipeline for detecting speech segments is the following:
We designed the VAD so that you have access to all of these steps (which might help with debugging):
```python
from speechbrain.pretrained import VAD

VAD = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", savedir="pretrained_models/vad-crdnn-libriparty")

# 1- Let's compute frame-level posteriors first
audio_file = 'pretrained_model_checkpoints/example_vad.wav'
prob_chunks = VAD.get_speech_prob_file(audio_file)

# 2- Let's apply a threshold on top of the posteriors
prob_th = VAD.apply_threshold(prob_chunks).float()

# 3- Let's now derive the candidate speech segments
boundaries = VAD.get_boundaries(prob_th)

# 4- Apply energy VAD within each candidate speech segment (optional)
boundaries = VAD.energy_VAD(audio_file, boundaries)

# 5- Merge segments that are too close
boundaries = VAD.merge_close_segments(boundaries, close_th=0.250)

# 6- Remove segments that are too short
boundaries = VAD.remove_short_segments(boundaries, len_th=0.250)

# 7- Double-check speech segments (optional).
boundaries = VAD.double_check_speech_segments(boundaries, audio_file, speech_th=0.5)
```
To perform inference on GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
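For instance, a small sketch that falls back to CPU when no GPU is present (the fallback logic is a convenience added here, not part of the model card):

```python
import torch

# Choose "cuda" when a GPU is available, otherwise fall back to "cpu".
device = "cuda" if torch.cuda.is_available() else "cpu"

# The chosen device is then passed through run_opts, e.g.:
# VAD = VAD.from_hparams(
#     source="speechbrain/vad-crdnn-libriparty",
#     savedir="pretrained_models/vad-crdnn-libriparty",
#     run_opts={"device": device},
# )
```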
The model was trained with SpeechBrain (ea17d22). To train it from scratch, follow these steps:
```bash
git clone https://github.com/speechbrain/speechbrain/
cd speechbrain
pip install -r requirements.txt
pip install -e .
```
Download the required datasets:

- LibriParty: https://drive.google.com/file/d/1--cAS5ePojMwNY5fewioXAv9YlYAWzIJ/view?usp=sharing
- Musan: https://www.openslr.org/resources/17/musan.tar.gz
- CommonLanguage: https://zenodo.org/record/5036977/files/CommonLanguage.tar.gz?download=1
```bash
cd recipes/LibriParty/VAD
python train.py hparams/train.yaml --data_folder=/path/to/LibriParty --musan_folder=/path/to/musan/ --commonlanguage_folder=/path/to/common_voice_kpd
```
The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
Please cite SpeechBrain if you use it for your research or business.
```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```