Datasets:

albertvillanova/meqsum

Other:

medical

Source datasets:

original

Size categories:

n<1K

Multilinguality:

monolingual

Languages:

en
English

MeQSum Dataset Card

Dataset Summary

The MeQSum corpus is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English (en).

Dataset Structure

Data Instances

{
  "CHQ": "SUBJECT: who and where to get cetirizine - D\\nMESSAGE: I need\\/want to know who manufscturs Cetirizine. My Walmart is looking for a new supply and are not getting the recent",
  "Summary": "Who manufactures cetirizine?",
  "File": "1-131188152.xml.txt"
}

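Data Fields

  • CHQ (str): Consumer health question.
  • Summary (str): Question summary, i.e., a condensed question that expresses the minimum information needed to convey the original question.
  • File (str): File name.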
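The sample record above suggests that a CHQ string carries a "SUBJECT:" part and a "MESSAGE:" part. The following is a minimal sketch of separating the two, assuming every record follows the layout of that one sample (the card does not confirm this). The helper name `split_chq` is hypothetical, and the separator is matched both as a real newline and as the two-character sequence backslash + "n" that appears in the sample.

```python
import re

# Matches "SUBJECT: ... <separator> MESSAGE: ...". The separator may be a real
# newline or the literal two characters backslash + "n", as in the card's sample.
_CHQ_PATTERN = re.compile(
    r"SUBJECT:\s*(.*?)\s*(?:\\n|\n)\s*MESSAGE:\s*(.*)", re.DOTALL
)


def split_chq(chq: str) -> dict:
    """Hypothetical helper: split a CHQ string into subject and message."""
    match = _CHQ_PATTERN.match(chq)
    if match is None:
        # Layout differs from the sample: treat the whole text as the message.
        return {"subject": "", "message": chq.strip()}
    return {"subject": match.group(1), "message": match.group(2).strip()}


# Record copied from the card's data instance (typos in the data are kept as-is).
record = {
    "CHQ": "SUBJECT: who and where to get cetirizine - D\\nMESSAGE: I need\\/want to know who manufscturs Cetirizine. My Walmart is looking for a new supply and are not getting the recent",
    "Summary": "Who manufactures cetirizine?",
    "File": "1-131188152.xml.txt",
}

parts = split_chq(record["CHQ"])
print(parts["subject"])
print(parts["message"])
```

Falling back to the full text when the pattern does not match keeps the helper usable on records that lack the SUBJECT/MESSAGE layout.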

Data Splits

The dataset consists of a single train split containing 1,000 examples.
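Because only a train split is provided, anyone who needs a validation set has to carve one out themselves. The sketch below shows one way to do this with a seeded shuffle of the 1,000 example indices. The 10% holdout fraction, the seed, and the function names are illustrative assumptions and are not part of the dataset. `load_with_holdout` additionally assumes that the Hugging Face `datasets` library is installed and that the repository id from this card loads with the installed version.

```python
import random


def make_holdout_indices(n_examples: int = 1000,
                         holdout_fraction: float = 0.1,
                         seed: int = 0):
    """Return (train_indices, holdout_indices) from a seeded shuffle."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    n_holdout = round(n_examples * holdout_fraction)
    return sorted(indices[n_holdout:]), sorted(indices[:n_holdout])


def load_with_holdout():
    """Assumed usage with the `datasets` library (requires network access)."""
    from datasets import load_dataset  # imported lazily so the helper above has no dependency

    train = load_dataset("albertvillanova/meqsum", split="train")
    train_idx, holdout_idx = make_holdout_indices(len(train))
    return train.select(train_idx), train.select(holdout_idx)


train_idx, holdout_idx = make_holdout_indices()
print(len(train_idx), len(holdout_idx))
```

Fixing the seed keeps the holdout set identical across runs, which matters when comparing models on such a small corpus.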

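Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]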

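Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]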

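Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

If you use the MeQSum corpus, please cite: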

@inproceedings{ben-abacha-demner-fushman-2019-summarization,
    title = "On the Summarization of Consumer Health Questions",
    author = "Ben Abacha, Asma  and
      Demner-Fushman, Dina",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P19-1215",
    doi = "10.18653/v1/P19-1215",
    pages = "2228--2234",
    abstract = "Question understanding is one of the main challenges in question answering. In real world applications, users often submit natural language questions that are longer than needed and include peripheral information that increases the complexity of the question, leading to substantially more false positives in answer retrieval. In this paper, we study neural abstractive models for medical question summarization. We introduce the MeQSum corpus of 1,000 summarized consumer health questions. We explore data augmentation methods and evaluate state-of-the-art neural abstractive models on this new task. In particular, we show that semantic augmentation from question datasets improves the overall performance, and that pointer-generator networks outperform sequence-to-sequence attentional models on this task, with a ROUGE-1 score of 44.16{\%}. We also present a detailed error analysis and discuss directions for improvement that are specific to question summarization.",
}

Contributions

Thanks to @albertvillanova for adding this dataset.