中文

Dataset Card for SUPERB

Dataset Summary

SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.

Supported Tasks and Leaderboards

The SUPERB leaderboard can be found here https://superbbenchmark.org/leaderboard and consists of the following tasks:

pr

Phoneme Recognition (PR) transcribes an utterance into the smallest content units. This task includes alignment modeling to avoid potentially inaccurate forced alignment. LibriSpeech train-clean-100/dev-clean/test-clean subsets are adopted in SUPERB for training/validation/testing. Phoneme transcriptions are obtained from the LibriSpeech official g2p-model-5 and the conversion script in Kaldi librispeech s5 recipe. The evaluation metric is phone error rate (PER).

asr

Automatic Speech Recognition (ASR) transcribes utterances into words. While PR analyzes the improvement in modeling phonetics, ASR reflects the significance of the improvement in a real-world scenario. LibriSpeech train-clean-100/devclean/test-clean subsets are used for training/validation/testing. The evaluation metric is word error rate (WER).

ks

Keyword Spotting (KS) detects preregistered keywords by classifying utterances into a predefined set of words. The task is usually performed on-device for the fast response time. Thus, accuracy, model size, and inference time are all crucial. SUPERB uses the widely used Speech Commands dataset v1.0 for the task. The dataset consists of ten classes of keywords, a class for silence, and an unknown class to include the false positive. The evaluation metric is accuracy (ACC)

Example of usage:

Use these auxillary functions to:

  • load the audio file into an audio data array
  • sample from long _silence_ audio clips

For other examples of handling long _silence_ clips see the S3PRL or TFDS implementations.

def map_to_array(example):
    import soundfile as sf

    speech_array, sample_rate = sf.read(example["file"])
    example["speech"] = speech_array
    example["sample_rate"] = sample_rate
    return example


def sample_noise(example):
    # Use this function to extract random 1 sec slices of each _silence_ utterance,
    # e.g. inside `torch.utils.data.Dataset.__getitem__()`
    from random import randint

    if example["label"] == "_silence_":
        random_offset = randint(0, len(example["speech"]) - example["sample_rate"] - 1)
        example["speech"] = example["speech"][random_offset : random_offset + example["sample_rate"]]

    return example
qbe

Query by Example Spoken Term Detection (QbE) detects a spoken term (query) in an audio database (documents) by binary discriminating a given pair of query and document into a match or not. The English subset in QUESST 2014 challenge is adopted since we focus on investigating English as the first step. The evaluation metric is maximum term weighted value (MTWV) which balances misses and false alarms.

ic

Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. SUPERB uses the Fluent Speech Commands dataset , where each utterance is tagged with three intent labels: action, object, and location. The evaluation metric is accuracy (ACC).

sf

Slot Filling (SF) predicts a sequence of semantic slot-types from an utterance, like a slot-type FromLocation for a spoken word Taipei, which is known as a slot-value. Both slot-types and slot-values are essential for an SLU system to function. The evaluation metrics thus include slot-type F1 score and slotvalue CER. Audio SNIPS is adopted, which synthesized multi-speaker utterances for SNIPS. Following the standard split in SNIPS, US-accent speakers are further selected for training, and others are for validation/testing.

si

Speaker Identification (SI) classifies each utterance for its speaker identity as a multi-class classification, where speakers are in the same predefined set for both training and testing. The widely used VoxCeleb1 dataset is adopted, and the evaluation metric is accuracy (ACC).

asv

Automatic Speaker Verification (ASV) verifies whether the speakers of a pair of utterances match as a binary classification, and speakers in the testing set may not appear in the training set. Thus, ASV is more challenging than SID. VoxCeleb1 is used without VoxCeleb2 training data and noise augmentation. The evaluation metric is equal error rate (EER).

sd

Speaker Diarization (SD) predicts who is speaking when for each timestamp, and multiple speakers can speak simultaneously. The model has to encode rich speaker characteristics for each frame and should be able to represent mixtures of signals. LibriMix is adopted where LibriSpeech train-clean-100/dev-clean/test-clean are used to generate mixtures for training/validation/testing. We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using alignments from Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER).

Example of usage

Use these auxiliary functions to:

  • load the audio file into an audio data array
  • generate the label array
def load_audio_file(example, frame_shift=160):
    import soundfile as sf

    example["array"], example["sample_rate"] = sf.read(
        example["file"], start=example["start"] * frame_shift, stop=example["end"] * frame_shift
    )
    return example


def generate_label(example, frame_shift=160, num_speakers=2, rate=16000):
    import numpy as np

    start = example["start"]
    end = example["end"]
    frame_num = end - start
    speakers = sorted({speaker["speaker_id"] for speaker in example["speakers"]})
    label = np.zeros((frame_num, num_speakers), dtype=np.int32)
    for speaker in example["speakers"]:
        speaker_index = speakers.index(speaker["speaker_id"])
        start_frame = np.rint(speaker["start"] * rate / frame_shift).astype(int)
        end_frame = np.rint(speaker["end"] * rate / frame_shift).astype(int)
        rel_start = rel_end = None
        if start <= start_frame < end:
            rel_start = start_frame - start
        if start < end_frame <= end:
            rel_end = end_frame - start
        if rel_start is not None or rel_end is not None:
            label[rel_start:rel_end, speaker_index] = 1
    example["label"] = label
    return example
er

Emotion Recognition (ER) predicts an emotion class for each utterance. The most widely used ER dataset IEMOCAP is adopted, and we follow the conventional evaluation protocol: we drop the unbalance emotion classes to leave the final four classes with a similar amount of data points and cross-validates on five folds of the standard splits. The evaluation metric is accuracy (ACC).

Languages

The language data in SUPERB is in English (BCP-47 en )

Dataset Structure

Data Instances

pr

More Information Needed

asr

An example from each split looks like:

{'chapter_id': 1240,
 'file': 'path/to/file.flac',
 'audio': {'path': 'path/to/file.flac',
           'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
           'sampling_rate': 16000},
 'id': '103-1240-0000',
 'speaker_id': 103,
 'text': 'CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE '
         'LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE '
         'HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A '
         'BROOK'}
ks

An example from each split looks like:

{
  'file': '/path/yes/af7a8296_nohash_1.wav', 
  'audio': {'path': '/path/yes/af7a8296_nohash_1.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
            'sampling_rate': 16000},
  'label': 0  # 'yes'
}
qbe

More Information Needed

ic
{
  'file': "/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav",
  'audio': {'path': '/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
            'sampling_rate': 16000},
  'speaker_id': '2BqVo8kVB2Skwgyb',
  'text': 'Turn the bedroom lights off',
  'action': 3,  # 'deactivate'
  'object': 7,  # 'lights'
  'location': 0  # 'bedroom'
}
sf

More Information Needed

si
{
  'file': '/path/wav/id10003/na8-QEFmj44/00003.wav', 
  'audio': {'path': '/path/wav/id10003/na8-QEFmj44/00003.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
            'sampling_rate': 16000},
  'label': 2  # 'id10003'
}
asv

More Information Needed

sd

An example from each split looks like:

{
  'record_id': '1578-6379-0038_6415-111615-0009',
  'file': 'path/to/file.wav',
  'audio': {'path': 'path/to/file.wav',
            'array': array([-0.00048828, -0.00018311, -0.00137329, ...,  0.00079346, 0.00091553,  0.00085449], dtype=float32),
            'sampling_rate': 16000},
  'start': 0,
  'end': 1590,
  'speakers': [
    {'speaker_id': '1578', 'start': 28, 'end': 657},
    {'speaker_id': '6415', 'start': 28, 'end': 1576}
  ]
}
er

More Information Needed

Data Fields

####Note abouth the audio fields

When accessing the audio column: dataset[0]["audio"] the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate . Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0] .

pr

More Information Needed

asr
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • text ( string ): The transcription of the audio file.
  • speaker_id ( integer ): A unique ID of the speaker. The same speaker id can be found for multiple data samples.
  • chapter_id ( integer ): ID of the audiobook chapter which includes the transcription.
  • id ( string ): A unique ID of the data sample.
ks
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • label ( ClassLabel ): Label of the spoken command. Possible values:
    • 0: "yes", 1: "no", 2: "up", 3: "down", 4: "left", 5: "right", 6: "on", 7: "off", 8: "stop", 9: "go", 10: "_silence_", 11: "_unknown_"
qbe

More Information Needed

ic
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • speaker_id ( string ): ID of the speaker.
  • text ( string ): Transcription of the spoken command.
  • action ( ClassLabel ): Label of the command's action. Possible values:
    • 0: "activate", 1: "bring", 2: "change language", 3: "deactivate", 4: "decrease", 5: "increase"
  • object ( ClassLabel ): Label of the command's object. Possible values:
    • 0: "Chinese", 1: "English", 2: "German", 3: "Korean", 4: "heat", 5: "juice", 6: "lamp", 7: "lights", 8: "music", 9: "newspaper", 10: "none", 11: "shoes", 12: "socks", 13: "volume"
  • location ( ClassLabel ): Label of the command's location. Possible values:
    • 0: "bedroom", 1: "kitchen", 2: "none", 3: "washroom"
sf

More Information Needed

si
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • label ( ClassLabel ): Label (ID) of the speaker. Possible values:
    • 0: "id10001", 1: "id10002", 2: "id10003", ..., 1250: "id11251"
asv

More Information Needed

sd

The data fields in all splits are:

  • record_id ( string ): ID of the record.
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • start ( integer ): Start frame of the audio.
  • end ( integer ): End frame of the audio.
  • speakers ( list of dict ): List of speakers in the audio. Each item contains the fields:
    • speaker_id ( string ): ID of the speaker.
    • start ( integer ): Frame when the speaker starts speaking.
    • end ( integer ): Frame when the speaker stops speaking.
er
  • file ( string ): Path to the WAV audio file.
  • audio ( dict ): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
  • label ( ClassLabel ): Label of the speech emotion. Possible values:
    • 0: "neu", 1: "hap", 2: "ang", 3: "sad"

Data Splits

pr

More Information Needed

asr
train validation test
asr 28539 2703 2620
ks
train validation test
ks 51094 6798 3081
qbe

More Information Needed

ic
train validation test
ic 23132 3118 3793
sf

More Information Needed

si
train validation test
si 138361 6904 8251
asv

More Information Needed

sd

The data is split into "train", "dev" and "test" sets, each containing the following number of examples:

train dev test
sd 13901 3014 3002
er

The data is split into 5 sets intended for 5-fold cross-validation:

session1 session2 session3 session4 session5
er 1085 1023 1151 1031 1241

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{DBLP:journals/corr/abs-2105-01051,
  author    = {Shu{-}Wen Yang and
               Po{-}Han Chi and
               Yung{-}Sung Chuang and
               Cheng{-}I Jeff Lai and
               Kushal Lakhotia and
               Yist Y. Lin and
               Andy T. Liu and
               Jiatong Shi and
               Xuankai Chang and
               Guan{-}Ting Lin and
               Tzu{-}Hsien Huang and
               Wei{-}Cheng Tseng and
               Ko{-}tik Lee and
               Da{-}Rong Liu and
               Zili Huang and
               Shuyan Dong and
               Shang{-}Wen Li and
               Shinji Watanabe and
               Abdelrahman Mohamed and
               Hung{-}yi Lee},
  title     = {{SUPERB:} Speech processing Universal PERformance Benchmark},
  journal   = {CoRR},
  volume    = {abs/2105.01051},
  year      = {2021},
  url       = {https://arxiv.org/abs/2105.01051},
  archivePrefix = {arXiv},
  eprint    = {2105.01051},
  timestamp = {Thu, 01 Jul 2021 13:30:22 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2105-01051.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Note that each SUPERB dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

Contributions

Thanks to @lewtun , @albertvillanova and @anton-l for adding this dataset.