Datasets:

PolyAI/minds14

Tasks:

Automatic Speech Recognition

Sub-tasks:

keyword-spotting

Languages:

Multilinguality:

multilingual

Size:

10K<n<100K

Language creators:

crowdsourced expert-generated

Annotations creators:

expert-generated crowdsourced machine-generated

ArXiv:

arxiv:2104.08524

License:

cc-by-4.0


MInDS-14

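MInDS-14 is a training and evaluation resource for the intent detection task with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties.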

Example

MInDS-14 can be downloaded and used as follows:

from datasets import load_dataset

minds_14 = load_dataset("PolyAI/minds14", "fr-FR") # for French
# to download all data for multilingual fine-tuning, uncomment the following line
# minds_14 = load_dataset("PolyAI/minds14", "all")

# see structure
print(minds_14)

# load audio sample on the fly
audio_input = minds_14["train"][0]["audio"]  # first decoded audio sample
intent_class = minds_14["train"][0]["intent_class"]  # intent class ID of the first sample
intent = minds_14["train"].features["intent_class"].names[intent_class]

# use audio_input and intent_class to fine-tune your model for audio classification

Dataset Structure

We show detailed information for the example configuration fr-FR. All other configurations have the same structure.

Data Instances

fr-FR

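Size of downloaded dataset files: 471 MB
Size of the generated dataset: 300 KB
Total amount of disk used: 471 MB

An example of a data instance of the config fr-FR looks as follows: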

{
    "path": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/3ebe2265b2f102203be5e64fa8e533e0c6742e72268772c8ac1834c5a1a921e3/fr-FR~ADDRESS/response_4.wav",
    "audio": {
        "path": "/home/patrick/.cache/huggingface/datasets/downloads/extracted/3ebe2265b2f102203be5e64fa8e533e0c6742e72268772c8ac1834c5a1a921e3/fr-FR~ADDRESS/response_4.wav",
        "array": array(
            [0.0, 0.0, 0.0, ..., 0.0, 0.00048828, -0.00024414], dtype=float32
        ),
        "sampling_rate": 8000,
    },
    "transcription": "je souhaite changer mon adresse",
    "english_transcription": "I want to change my address",
    "intent_class": 1,
    "lang_id": 6,
}
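The `array` and `sampling_rate` entries together determine how long a clip is. The following is a minimal sketch, assuming a record shaped like the instance above; the 20,000-sample list is a made-up stand-in for the decoded waveform, and `clip_duration_seconds` is a hypothetical helper, not part of the dataset or of any library.

```python
# A hand-built record shaped like the fr-FR instance above. The sample list
# is a made-up stand-in for the real decoded waveform.
example = {
    "audio": {
        "array": [0.0] * 20000,  # hypothetical: 20,000 samples of silence
        "sampling_rate": 8000,   # 8 kHz, as in the instance above
    },
    "transcription": "je souhaite changer mon adresse",
    "intent_class": 1,
}


def clip_duration_seconds(audio: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


duration = clip_duration_seconds(example["audio"])
print(f"{duration:.2f} s")  # 20000 / 8000 = 2.50 s
```

Note that the audio is stored at 8 kHz, so a model that expects a different input rate would need the clips resampled first.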

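Data Fields

The data fields are the same among all splits.

path (str): Path to the audio file
audio (dict): Audio object including the loaded audio array, the sampling rate, and the path to the audio file
transcription (str): Transcription of the audio file
english_transcription (str): English transcription of the audio file
intent_class (int): Class ID of the intent
lang_id (int): ID of the language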
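The field list above can be turned into a lightweight sanity check for records before they enter a training pipeline. This is only a sketch: `EXPECTED_FIELDS`, `AUDIO_KEYS`, and `check_record` are hypothetical names introduced here, and the sample record is hand-built with placeholder values rather than loaded from the dataset.

```python
# Field names and Python types, taken from the field list above.
EXPECTED_FIELDS = {
    "path": str,
    "audio": dict,
    "transcription": str,
    "english_transcription": str,
    "intent_class": int,
    "lang_id": int,
}
# Keys of the nested audio object, as described in the field list.
AUDIO_KEYS = {"path", "array", "sampling_rate"}


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    audio = record.get("audio")
    if isinstance(audio, dict):
        missing = AUDIO_KEYS - audio.keys()
        if missing:
            problems.append(f"audio is missing keys: {sorted(missing)}")
    return problems


# Hand-built record with placeholder path and waveform values.
sample = {
    "path": "response_4.wav",
    "audio": {"path": "response_4.wav", "array": [0.0, 0.0], "sampling_rate": 8000},
    "transcription": "je souhaite changer mon adresse",
    "english_transcription": "I want to change my address",
    "intent_class": 1,
    "lang_id": 6,
}
print(check_record(sample))
```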

Data Splits

Each config has only a "train" split, containing about 600 examples.
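Because only a "train" split is provided, a held-out evaluation set has to be carved out by the user. The sketch below shows one possible approach, a per-intent (stratified) split over example indices in plain Python. The function `stratified_split`, the 80/20 ratio, the seed, and the label list are all assumptions made for illustration; the labels are synthetic and do not reflect the real intent distribution.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split example indices into (train, test) so that each label keeps
    roughly the same share in both parts."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out at least one example per label.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic labels: 14 intents with 40 examples each (560 in total).
labels = [i % 14 for i in range(560)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 448 112
```

With the `datasets` library, the resulting index lists could be passed to `Dataset.select`, or the library's own `Dataset.train_test_split` could be used in place of this helper.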

Dataset Creation

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

All datasets are licensed under the Creative Commons license (CC-BY).

Citation Information

@article{DBLP:journals/corr/abs-2104-08524,
  author    = {Daniela Gerz and
               Pei{-}Hao Su and
               Razvan Kusztos and
               Avishek Mondal and
               Michal Lis and
               Eshan Singhal and
               Nikola Mrksic and
               Tsung{-}Hsien Wen and
               Ivan Vulic},
  title     = {Multilingual and Cross-Lingual Intent Detection from Spoken Data},
  journal   = {CoRR},
  volume    = {abs/2104.08524},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.08524},
  eprinttype = {arXiv},
  eprint    = {2104.08524},
  timestamp = {Mon, 26 Apr 2021 17:25:10 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2104-08524.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @patrickvonplaten for adding this dataset.

Author:

PolyAI

Dataset size:

12.55 KB