There is one .npy file for each utterance in the dataset, 7931 files in total. The speaker embeddings are 512-element X-vectors.
The CMU ARCTIC dataset divides the utterances among the following speakers:
The X-vectors were extracted using this script, which uses the speechbrain/spkrec-xvect-voxceleb model.
Usage:
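import torch
from datasets import load_dataset

# Load the precomputed speaker embeddings.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
# Pick one utterance's 512-element X-vector and add a batch dimension: shape (1, 512).
speaker_embeddings = embeddings_dataset[7306]["xvector"]
speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)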
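
The usage snippet above selects an utterance by a fixed index (7306). To choose an utterance from a particular speaker instead, one option is to group the dataset indices by speaker. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that each record also carries a `filename` field shaped like `cmu_us_bdl_arctic-wav-arctic_a0001`, with the speaker ID as the third underscore-separated part. The helper names `speaker_of` and `indices_by_speaker` and the example filenames are hypothetical.

```python
from collections import defaultdict


def speaker_of(filename: str) -> str:
    """Return the speaker ID embedded in an utterance filename.

    ASSUMPTION: names look like 'cmu_us_bdl_arctic-wav-arctic_a0001',
    so the third underscore-separated field is the speaker ID.
    """
    parts = filename.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected filename format: {filename!r}")
    return parts[2]


def indices_by_speaker(filenames):
    """Map each speaker ID to the list of indices of its utterances."""
    groups = defaultdict(list)
    for index, name in enumerate(filenames):
        groups[speaker_of(name)].append(index)
    return dict(groups)


# Hypothetical filenames standing in for the dataset's `filename` column.
example_filenames = [
    "cmu_us_bdl_arctic-wav-arctic_a0001",
    "cmu_us_slt_arctic-wav-arctic_a0001",
    "cmu_us_bdl_arctic-wav-arctic_a0002",
]

groups = indices_by_speaker(example_filenames)
print(groups)  # {'bdl': [0, 2], 'slt': [1]}
```

If the assumption about the `filename` field holds, passing `embeddings_dataset["filename"]` to `indices_by_speaker` should yield indices that can be used in place of 7306 in the usage snippet.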