There is one .npy file for each utterance in the dataset, 7931 files in total. The speaker embeddings are 512-element X-vectors.
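As a rough illustration of the layout described above, the sketch below writes a placeholder 512-element vector to a .npy file and reads it back with NumPy. The file name `example_utterance.npy`, the random values, and the assumption that each file stores a flat array of shape `(512,)` are made up for the example. Real files come from the dataset itself.

```python
import os
import tempfile

import numpy as np

XVECTOR_DIM = 512  # embedding size stated in the description above


def load_xvector(path):
    """Load one .npy file and check that it holds a 512-element vector."""
    vec = np.load(path)
    # Assumption: each file stores a flat array of shape (512,).
    if vec.shape != (XVECTOR_DIM,):
        raise ValueError(f"expected shape ({XVECTOR_DIM},), got {vec.shape}")
    return vec


# Placeholder data standing in for a real X-vector (hypothetical file name).
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example_utterance.npy")
np.save(example_path, np.random.rand(XVECTOR_DIM).astype(np.float32))

xvector = load_xvector(example_path)
print(xvector.shape, xvector.dtype)
```

The shape check is there only to catch files that do not match the assumed layout. Adjust it if the actual files turn out to carry an extra leading dimension.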
The CMU ARCTIC dataset divides the utterances among the following speakers:
The X-vectors were extracted using this script, which uses the speechbrain/spkrec-xvect-voxceleb model.
Usage:
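import torch
from datasets import load_dataset

# Load the precomputed X-vectors and pick the embedding for one utterance.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = embeddings_dataset[7306]["xvector"]

# Convert to a tensor and add a leading batch dimension.
speaker_embeddings = torch.tensor(speaker_embeddings).unsqueeze(0)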
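The final `unsqueeze(0)` call in the usage snippet adds a leading batch dimension, turning the 512-element vector into a 1 x 512 array. A minimal NumPy sketch of the same reshaping, using a placeholder vector of zeros rather than a real dataset entry:

```python
import numpy as np

# Placeholder standing in for one 512-element X-vector (not real data).
embedding = np.zeros(512, dtype=np.float32)

# Equivalent of torch's unsqueeze(0): insert a new axis at position 0.
batched = embedding[np.newaxis, :]

print(embedding.shape)  # (512,)
print(batched.shape)    # (1, 512)
```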