Dataset:

lmqg/qg_squad

Language: English

Dataset Card for "lmqg/qg_squad"

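Dataset Summary

This is a subset of QG-Bench, a unified question generation benchmark proposed in "Generative Language Models for Paragraph-Level Question Generation: A Unified Benchmark and Evaluation" (EMNLP 2022 main conference). It is the SQuAD dataset adapted for the question generation (QG) task. The train/development/test split follows the "Neural Question Generation" work and is compatible with the leaderboard.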

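Supported Tasks and Leaderboards

  • question-generation: The dataset is intended for training question generation models. Success on this task is typically measured by achieving high BLEU4/METEOR/ROUGE-L/BERTScore/MoverScore (see our paper for more details). The task has an active leaderboard.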

Languages

English (en)

Dataset Structure

An example from 'train' looks as follows.

{
  "question": "What is heresy mainly at odds with?",
  "paragraph": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "answer": "established beliefs or customs",
  "sentence": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs .",
  "paragraph_sentence": "<hl> Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs . <hl> A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "paragraph_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl>. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "sentence_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl> ."
}
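
The record above can be sanity-checked by recovering the highlighted span from one of the marked-up fields and comparing it with the answer field. The following is a minimal sketch, not part of the dataset tooling: the helper name highlighted_span is hypothetical, and it assumes that each highlighted field contains exactly two `<hl>` markers, as in the example.

```python
HL = "<hl>"  # highlight token as it appears in the example record


def highlighted_span(text: str, token: str = HL) -> str:
    """Return the text enclosed by the two `token` markers in `text`.

    Hypothetical helper: assumes exactly two markers, as in the example record.
    """
    parts = text.split(token)
    if len(parts) != 3:
        raise ValueError(
            f"expected exactly two {token!r} markers, found {len(parts) - 1}"
        )
    # Drop the spaces that separate the markers from the span.
    return parts[1].strip()


# Two fields copied from the example record above.
record = {
    "answer": "established beliefs or customs",
    "sentence_answer": (
        "Heresy is any provocative belief or theory that is strongly at "
        "variance with <hl> established beliefs or customs <hl> ."
    ),
}

# The span highlighted in sentence_answer should match the answer field.
assert highlighted_span(record["sentence_answer"]) == record["answer"]
```

The same helper applied to paragraph_sentence would return the highlighted sentence rather than the answer.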

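The data fields are the same across all splits.

  • question: a string feature.
  • paragraph: a string feature.
  • answer: a string feature.
  • sentence: a string feature.
  • paragraph_answer: a string feature; the same as paragraph, but with the answer highlighted by the special token "<hl>".
  • paragraph_sentence: a string feature; the same as paragraph, but with the sentence containing the answer highlighted by the special token "<hl>".
  • sentence_answer: a string feature; the same as sentence, but with the answer highlighted by the special token "<hl>".

Each of the paragraph_answer, paragraph_sentence, and sentence_answer features is intended for training a question generation model, but each carries different information. The paragraph_answer and sentence_answer features are for answer-aware question generation, while the paragraph_sentence feature is for sentence-aware question generation.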
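
As an illustration of how an answer-aware input relates to the plain fields, the sketch below wraps the answer inside a paragraph with the `<hl>` token. It is not the dataset authors' preprocessing code: the function name highlight is hypothetical, and marking the first occurrence of the answer is a simplifying assumption that can pick the wrong position when the answer string appears more than once.

```python
HL = "<hl>"  # highlight token as it appears in the example record


def highlight(text: str, span: str, token: str = HL) -> str:
    """Wrap the first occurrence of `span` in `text` with highlight tokens.

    Hypothetical helper; assumes the first match is the intended one.
    """
    start = text.find(span)
    if start == -1:
        raise ValueError("span not found in text")
    end = start + len(span)
    return f"{text[:start]}{token} {span} {token}{text[end:]}"


# First two sentences of the example paragraph, shortened for readability.
paragraph = (
    "Heresy is any provocative belief or theory that is strongly at variance "
    "with established beliefs or customs. A heretic is a proponent of such "
    "claims or beliefs."
)
answer = "established beliefs or customs"

# Answer-aware input in the style of the paragraph_answer feature.
paragraph_answer = highlight(paragraph, answer)
print(paragraph_answer)
```

Applying the same function to a whole sentence gives a sentence-aware input in the style of paragraph_sentence. In the example record the sentence field is tokenized slightly differently from the paragraph (there is a space before the final period), so output built this way will not match that feature character for character.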

Data Splits

train    validation    test
75722    10570         11877
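
To put the split sizes in proportion, the short sketch below computes the total number of examples and each split's share. It only restates the counts listed above; the variable names are illustrative.

```python
# Split sizes as listed in the card.
splits = {"train": 75722, "validation": 10570, "test": 11877}

total = sum(splits.values())

# Percentage of all examples in each split, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<12}{count:>7}{shares[name]:>7}%")
print(f"{'total':<12}{total:>7}")
```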

Citation Information

@inproceedings{ushio-etal-2022-generative,
    title = "{G}enerative {L}anguage {M}odels for {P}aragraph-{L}evel {Q}uestion {G}eneration",
    author = "Ushio, Asahi  and
        Alva-Manchego, Fernando  and
        Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}