UniSpeech-SAT-Base for Speaker Verification

Microsoft's UniSpeech

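The model was pre-trained on speech audio sampled at 16 kHz with an utterance and speaker contrastive loss. When using the model, make sure that your speech input is also sampled at 16 kHz.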
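Audio recorded at a different rate has to be resampled before it reaches the feature extractor. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed and the input is a mono waveform; the helper name `to_16k` is hypothetical and not part of the model's API.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sampling rate the model was pre-trained on


def to_16k(waveform: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz with a polyphase filter."""
    if orig_sr == TARGET_SR:
        return waveform.astype(np.float32)
    g = gcd(orig_sr, TARGET_SR)
    # up / down is the reduced form of TARGET_SR / orig_sr
    resampled = resample_poly(waveform, TARGET_SR // g, orig_sr // g)
    return resampled.astype(np.float32)


# one second of a 440 Hz tone recorded at 44.1 kHz (synthetic test signal)
t = np.arange(44_100) / 44_100
tone_16k = to_16k(np.sin(2 * np.pi * 440 * t), orig_sr=44_100)
```

The resampled array can then be passed to the feature extractor together with `sampling_rate=16000`.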

The model was pre-trained on:

60,000 hours of Libri-Light
10,000 hours of GigaSpeech
24,000 hours of VoxPopuli

Paper: UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING

Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu

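Abstract: Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have seen great success in applying self-supervised learning to speech recognition, while only limited work has explored SSL for modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced to enhance unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, integrating an utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, in which additional overlapped utterances are created without supervision and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker-identification-oriented tasks. An ablation study verifies the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement on all SUPERB tasks.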
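The abstract does not spell out how overlapped utterances are built, so the following is only a loose sketch of the utterance mixing idea: a random chunk of a second utterance is added onto a random region of the main one. The 50% overlap cap, the unit gain, and the helper name `mix_utterances` are assumptions made for illustration, not details confirmed by this card.

```python
import numpy as np


def mix_utterances(main, other, rng, max_overlap=0.5, gain=1.0):
    """Add a random chunk of `other` onto a random region of `main`.

    max_overlap and gain are illustrative values (assumptions).
    """
    main = np.asarray(main, dtype=np.float32)
    other = np.asarray(other, dtype=np.float32)
    # the chunk is limited by the overlap cap and by the length of `other`
    max_len = min(int(len(main) * max_overlap), len(other))
    if max_len < 1:
        return main.copy()
    chunk_len = int(rng.integers(1, max_len + 1))
    src = int(rng.integers(0, len(other) - chunk_len + 1))  # start in `other`
    dst = int(rng.integers(0, len(main) - chunk_len + 1))   # start in `main`
    mixed = main.copy()
    mixed[dst:dst + chunk_len] += gain * other[src:src + chunk_len]
    return mixed


rng = np.random.default_rng(0)
main = np.zeros(16_000, dtype=np.float32)   # stand-in for the main utterance
other = np.ones(8_000, dtype=np.float32)    # stand-in for a second speaker
mixed = mix_utterances(main, other, rng)
```

The output keeps the length of the main utterance, so the frame-level training targets of the main utterance still line up with the mixed input.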

The original model can be found at https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT.

Fine-tuning details

The model is fine-tuned on the VoxCeleb1 dataset using an X-Vector head with an Additive Margin Softmax loss.

X-Vectors: Robust DNN Embeddings for Speaker Recognition
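For readers unfamiliar with the Additive Margin Softmax loss, here is a small NumPy sketch of the general technique: the cosine between an embedding and its target-class weight is reduced by a margin before scaling and applying cross-entropy. The scale of 30 and margin of 0.4 are illustrative values only; the settings actually used for fine-tuning this model are not stated here, and `am_softmax_loss` is a hypothetical helper.

```python
import numpy as np


def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.4):
    """Mean AM-Softmax loss for a batch.

    embeddings: (batch, dim), class_weights: (dim, n_classes), labels: (batch,)
    scale and margin are illustrative defaults (assumptions).
    """
    # cosine similarity between every embedding and every class weight vector
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = emb @ w
    # subtract the margin from the target-class cosine only, then scale
    one_hot = np.eye(w.shape[1])[labels]
    logits = scale * (cos - margin * one_hot)
    # numerically stable log-softmax followed by cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))        # random stand-ins for speaker embeddings
weights = rng.normal(size=(8, 3))    # one weight column per speaker class
labels = np.array([0, 2, 1, 0])
loss_with_margin = am_softmax_loss(emb, weights, labels)
loss_no_margin = am_softmax_loss(emb, weights, labels, margin=0.0)
```

With the margin set to zero this reduces to a scaled cosine softmax; a positive margin makes the target class harder to satisfy, which pushes embeddings of the same speaker closer together.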

Usage

Speaker Verification

from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatForXVector
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/unispeech-sat-base-plus-sv')
model = UniSpeechSatForXVector.from_pretrained('microsoft/unispeech-sat-base-plus-sv')

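# audio files are decoded on the fly; collect the raw waveforms of the first two samples
waveforms = [audio["array"] for audio in dataset[:2]["audio"]]
# pad to a common length in case the utterances differ in duration
inputs = feature_extractor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).embeddings
embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()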

# the resulting embeddings can be used for cosine similarity-based retrieval
cosine_sim = torch.nn.CosineSimilarity(dim=-1)
similarity = cosine_sim(embeddings[0], embeddings[1])
threshold = 0.89  # the optimal threshold is dataset-dependent
if similarity < threshold:
    print("Speakers are not the same!")
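Because the best threshold depends on the dataset, it is usually estimated from labeled trial pairs. The sketch below picks the threshold where the false accept and false reject rates are closest (the equal error rate point). The helper `eer_threshold` is hypothetical, and the scores are made-up toy values, not outputs of this model.

```python
import numpy as np


def eer_threshold(scores, same_speaker):
    """Return (threshold, eer) at the point where FAR and FRR are closest."""
    scores = np.asarray(scores, dtype=float)
    same = np.asarray(same_speaker, dtype=bool)
    best = None
    for thr in np.unique(scores):
        accept = scores >= thr            # pairs judged "same speaker"
        far = np.mean(accept[~same])      # different speakers accepted
        frr = np.mean(~accept[same])      # same speaker rejected
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, float(thr), (far + frr) / 2)
    return best[1], float(best[2])


# toy cosine similarities for eight trial pairs (illustrative only)
scores = [0.95, 0.91, 0.88, 0.80, 0.92, 0.70, 0.65, 0.60]
same_speaker = [1, 1, 1, 1, 0, 0, 0, 0]
threshold, eer = eer_threshold(scores, same_speaker)
```

The returned threshold can replace the hard-coded value in the snippet above once it has been estimated on trials that resemble the target data.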

License

The official license can be found here

Author:

Microsoft

Dataset size:

385.82 MB