Dataset:

a6kme/minds14-mirror

Tasks:

Automatic Speech Recognition

Sub-tasks:

keyword-spotting

Languages:

Multilinguality:

multilingual

Size:

10K<n<100K

Language creators:

crowdsourced expert-generated

Annotation creators:

expert-generated crowdsourced machine-generated

ArXiv:

arxiv:2104.08524

License:

cc-by-4.0

Dataset Card

MInDS-14

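MInDS-14 is a training and evaluation resource for intent detection from spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, with spoken examples in 14 diverse language varieties.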

Example

MInDS-14 can be downloaded and used as follows:

from datasets import load_dataset

minds_14 = load_dataset("PolyAI/minds14", "fr-FR") # for French
# to download all data for multilingual fine-tuning, uncomment the following line
# minds_14 = load_dataset("PolyAI/minds14", "all")

# see structure
print(minds_14)

# load audio sample on the fly
audio_input = minds_14["train"][0]["audio"]  # first decoded audio sample
intent_class = minds_14["train"][0]["intent_class"]  # intent class ID of the first sample
intent = minds_14["train"].features["intent_class"].names[intent_class]

# use audio_input and intent_class to fine-tune your model for audio classification

Dataset Structure

We show detailed information for the fr-FR configuration below. All other configurations have the same structure.

Data Instances

fr-FR

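Size of downloaded dataset files: 471 MB
Size of the generated dataset: 300 KB
Total amount of disk used: 471 MB

An example of a data instance of the fr-FR configuration looks as follows: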

{
    "path": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/3ebe2265b2f102203be5e64fa8e533e0c6742e72268772c8ac1834c5a1a921e3/fr-FR~ADDRESS/response_4.wav",
    "audio": {
        "path": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/3ebe2265b2f102203be5e64fa8e533e0c6742e72268772c8ac1834c5a1a921e3/fr-FR~ADDRESS/response_4.wav",
        "array": array(
            [0.0, 0.0, 0.0, ..., 0.0, 0.00048828, -0.00024414], dtype=float32
        ),
        "sampling_rate": 8000,
    },
    "transcription": "je souhaite changer mon adresse",
    "english_transcription": "I want to change my address",
    "intent_class": 1,
    "lang_id": 6,
}
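
Because the audio entry carries both the decoded waveform and its sampling rate, the clip duration can be derived from the two. The following is a minimal sketch that uses a hand-built stand-in dict rather than a downloaded sample; the helper name `clip_duration_seconds` and the sample values are hypothetical, and in the real dataset "array" would be a NumPy array rather than a list.

```python
def clip_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in for minds_14["train"][0]["audio"]:
# 16000 zero-valued samples at 8 kHz.
fake_audio = {
    "path": "response_4.wav",
    "array": [0.0] * 16000,
    "sampling_rate": 8000,
}

duration = clip_duration_seconds(fake_audio)
print(f"{duration:.2f} s")
```

The same function should work unchanged on a real sample, since `len()` also applies to a one-dimensional NumPy array.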

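Data Fields

The data fields are the same among all splits.

path (str): Path to the audio file
audio (dict): Audio object including the loaded audio array, sampling rate, and path to the audio file
transcription (str): Transcription of the audio file
english_transcription (str): English transcription of the audio file
intent_class (int): Class ID of the intent
lang_id (int): ID of the language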
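
The field list above can be turned into a quick sanity check before records are passed to a training pipeline. This is only a sketch: `EXPECTED_FIELDS` and `validate_example` are hypothetical names, not part of the `datasets` library, and the record below is a stand-in with a shortened path and a plain list in place of the audio array.

```python
# Field names and Python types as listed in the data fields description.
EXPECTED_FIELDS = {
    "path": str,
    "audio": dict,
    "transcription": str,
    "english_transcription": str,
    "intent_class": int,
    "lang_id": int,
}


def validate_example(example: dict) -> list:
    """Return a list of problems; an empty list means the record matches the field list."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in example:
            problems.append(f"missing field: {name}")
        elif not isinstance(example[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # The audio dict is documented to hold the path, the decoded array, and the sampling rate.
    audio = example.get("audio")
    if isinstance(audio, dict):
        for key in ("path", "array", "sampling_rate"):
            if key not in audio:
                problems.append(f"audio is missing key: {key}")
    return problems


# Hypothetical stand-in record modeled on the fr-FR instance (path shortened).
example = {
    "path": "fr-FR~ADDRESS/response_4.wav",
    "audio": {
        "path": "fr-FR~ADDRESS/response_4.wav",
        "array": [0.0, 0.0, 0.00048828, -0.00024414],
        "sampling_rate": 8000,
    },
    "transcription": "je souhaite changer mon adresse",
    "english_transcription": "I want to change my address",
    "intent_class": 1,
    "lang_id": 6,
}

print(validate_example(example))
```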

Data Splits

Every configuration has only the "train" split, containing approximately 600 examples.
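
Since no validation or test split is provided, users who need held-out data have to carve it out of "train" themselves. Below is a minimal, library-free sketch of a per-intent (stratified) index split. The function name `stratified_split`, the 20% test fraction, the seed, and the evenly distributed labels are all assumptions made for illustration, not properties of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices into train/test so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out at least one example per label.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 600 examples spread over 14 intent classes.
labels = [i % 14 for i in range(600)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With the `datasets` library, the resulting index lists could then be passed to `Dataset.select`. The library's own `Dataset.train_test_split` method is a simpler alternative when a plain random split is enough.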

Dataset Creation

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

All datasets are licensed under the Creative Commons license (CC-BY).

Citation Information

@article{DBLP:journals/corr/abs-2104-08524,
  author    = {Daniela Gerz and
               Pei{-}Hao Su and
               Razvan Kusztos and
               Avishek Mondal and
               Michal Lis and
               Eshan Singhal and
               Nikola Mrksic and
               Tsung{-}Hsien Wen and
               Ivan Vulic},
  title     = {Multilingual and Cross-Lingual Intent Detection from Spoken Data},
  journal   = {CoRR},
  volume    = {abs/2104.08524},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.08524},
  eprinttype = {arXiv},
  eprint    = {2104.08524},
  timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2104-08524.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @patrickvonplaten for adding this dataset.

Author:

a6kme

Dataset size:

12.57 KB