
Speaker diarization

Relies on pyannote.audio 2.0: see the installation instructions.

Overview

# load the pipeline from the Hugging Face Hub
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2022.07")

# apply the pipeline to an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
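
The RTTM file written above is plain text, with one line per speaker turn. As a minimal sketch (not part of pyannote.audio), the snippet below reads such lines back using only the standard library. It assumes the usual ten-field RTTM layout, in which fields 4, 5 and 8 hold the onset, the duration and the speaker label. The helper name parse_rttm and the sample lines are hypothetical.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    start: float    # turn onset, in seconds
    end: float      # turn end, in seconds
    speaker: str    # speaker label


def parse_rttm(lines) -> List[Turn]:
    """Parse RTTM 'SPEAKER' lines into (start, end, speaker) turns."""
    turns = []
    for line in lines:
        fields = line.split()
        # skip blank lines and anything that is not a SPEAKER record
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        turns.append(Turn(onset, onset + duration, fields[7]))
    return turns


# hypothetical sample content, in the layout assumed above
sample = [
    "SPEAKER audio 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>",
    "SPEAKER audio 1 3.000 1.500 <NA> <NA> SPEAKER_01 <NA> <NA>",
]
turns = parse_rttm(sample)
for turn in turns:
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {turn.speaker}")
```

Because parse_rttm accepts any iterable of lines, an open file handle on "audio.rttm" can be passed to it directly.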

Advanced usage

If the number of speakers is known in advance, use the num_speakers option:

diarization = pipeline("audio.wav", num_speakers=2)

Lower and/or upper bounds on the number of speakers can be provided with the min_speakers and max_speakers options:

diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

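If you feel adventurous, you can experiment with the pipeline hyper-parameters. For instance, increasing the segmentation_onset threshold makes voice activity detection more aggressive: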

hparams = pipeline.parameters(instantiated=True)
hparams["segmentation_onset"] += 0.1
pipeline.instantiate(hparams)
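
To build some intuition for why a higher onset threshold is more selective, here is a simplified illustration of onset/offset (hysteresis) thresholding applied to per-frame speech scores. It is a sketch for intuition only and is not pyannote.audio's implementation: the function binarize, the offset value, and the scores are all made up for the example.

```python
def binarize(scores, onset=0.5, offset=0.5):
    """Mark frames as speech using hysteresis thresholding.

    A speech region starts when the score rises above `onset`
    and ends when the score falls below `offset`.
    """
    active = False
    decisions = []
    for score in scores:
        if active:
            # stay active until the score drops below the offset threshold
            active = score >= offset
        else:
            # become active only once the score exceeds the onset threshold
            active = score > onset
        decisions.append(active)
    return decisions


# made-up per-frame speech scores
scores = [0.2, 0.55, 0.7, 0.65, 0.4, 0.58, 0.3]

default = binarize(scores, onset=0.5, offset=0.5)
stricter = binarize(scores, onset=0.6, offset=0.5)

print(sum(default), "speech frames with onset=0.5")
print(sum(stricter), "speech frames with onset=0.6")
```

With the higher onset, the weakly scored frames no longer open a speech region, so fewer frames are kept as speech.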

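Benchmarks

Real-time factor

The real-time factor is around 5% when using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).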

In other words, it takes approximately 3 minutes to process a one-hour conversation.
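
The arithmetic behind that figure is a single multiplication: processing time equals audio duration times the real-time factor. A small sketch follows; the function name processing_time is only illustrative.

```python
def processing_time(audio_seconds: float, real_time_factor: float) -> float:
    """Return the estimated processing time, in seconds."""
    return audio_seconds * real_time_factor


one_hour = 60 * 60  # seconds of audio
seconds = processing_time(one_hour, 0.05)  # 5% real-time factor
print(f"{seconds / 60:.0f} minutes")
```

The estimate applies only to hardware comparable to the setup described above.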

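Accuracy

This pipeline is benchmarked on a growing collection of datasets.

Processing is fully automatic:

  • no manual voice activity detection (as is sometimes the case in the literature)
  • no manual number of speakers (though it is possible to provide it to the pipeline)
  • no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... and is evaluated with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):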

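  • no forgiveness collar
  • evaluation of overlapped speech

| Benchmark         | DER%  | FA%  | Miss% | Conf% | Expected output | File-level evaluation |
| ----------------- | ----- | ---- | ----- | ----- | --------------- | --------------------- |
| 1238321           | 14.61 | 3.31 | 4.35  | 6.95  | RTTM            | eval                  |
| 1239321 12310321  | 18.21 | 3.28 | 11.07 | 3.87  | RTTM            | eval                  |
| 12311321 12310321 | 29.00 | 2.71 | 21.61 | 4.68  | RTTM            | eval                  |
| 12313321 12314321 | 30.24 | 3.71 | 16.86 | 9.66  | RTTM            | eval                  |
| 12315321          | 20.99 | 4.25 | 10.74 | 6.00  | RTTM            | eval                  |
| 12316321          | 12.62 | 1.55 | 3.30  | 7.76  | RTTM            | eval                  |
| 12317321          | 12.76 | 3.45 | 3.85  | 5.46  | RTTM            | eval                  |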
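
DER is the sum of three error components: false alarm (FA), missed detection (Miss), and speaker confusion (Conf). As a quick sanity check, the sketch below recomputes that sum from the figures in the table and compares it with the reported DER, allowing for rounding to two decimals. The function name and the tolerance are choices made for this example.

```python
# (DER%, FA%, Miss%, Conf%) copied from the table rows, in order
rows = [
    (14.61, 3.31, 4.35, 6.95),
    (18.21, 3.28, 11.07, 3.87),
    (29.00, 2.71, 21.61, 4.68),
    (30.24, 3.71, 16.86, 9.66),
    (20.99, 4.25, 10.74, 6.00),
    (12.62, 1.55, 3.30, 7.76),
    (12.76, 3.45, 3.85, 5.46),
]


def der_from_components(false_alarm, missed, confusion):
    """Diarization error rate as the sum of its three components."""
    return false_alarm + missed + confusion


# every figure is rounded to two decimals, so allow a small tolerance
TOLERANCE = 0.02

gaps = [
    abs(der - der_from_components(fa, miss, conf))
    for der, fa, miss, conf in rows
]
for index, gap in enumerate(gaps, start=1):
    status = "ok" if gap <= TOLERANCE else "mismatch"
    print(f"row {index}: gap = {gap:.2f} ({status})")
```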

Support

For commercial enquiries and scientific consulting, please contact me. For technical questions and bug reports, please check the pyannote.audio GitHub repository.

Citation

@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Address = {Brno, Czech Republic},
  Month = {August},
  Year = {2021},
}
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}