Dataset: McGill-NLP/mlquestions

Language: English

Dataset Card for MLQuestions

Dataset Summary

The MLQuestions dataset consists of questions sampled from Google search queries and passages from Wikipedia pages related to the machine learning domain. It was created to support research on domain adaptation of question generation and passage retrieval models.

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

We release development and test sets in which a typical data point consists of a passage, given under the input_text field, and a question, given under the target_text field.

An example from the MLQuestions test set looks as follows:

{'input_text': 'Bayesian learning uses Bayes theorem to determine the conditional probability of a hypothesis given some evidence or observations.', 'target_text': 'what is bayesian learning in machine learning'}

We also provide unaligned questions and passages in two separate files, 'passages_unaligned.csv' and 'questions_unaligned.csv', which carry the input_text and target_text fields respectively.
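
As a rough illustration of how these files might be read, here is a minimal Python sketch using pandas; the local file paths are assumptions for illustration (the CSVs are presumed to have been downloaded to the working directory) and are not part of the release itself.

    import pandas as pd

    # Minimal sketch: load the unaligned files released with MLQuestions.
    # Paths are illustrative; adjust them to wherever the CSVs were downloaded.
    passages = pd.read_csv("passages_unaligned.csv")    # unaligned passages, column: input_text
    questions = pd.read_csv("questions_unaligned.csv")  # unaligned questions, column: target_text

    print(passages["input_text"].iloc[0])    # first unaligned passage
    print(questions["target_text"].iloc[0])  # first unaligned question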

Additional Information

Licensing Information

https://github.com/McGill-NLP/MLQuestions/blob/main/LICENSE.md

Citation Information

If you find this dataset useful in your research, please consider citing:

@inproceedings{kulshreshtha-etal-2021-back,
    title = "Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval",
    author = "Kulshreshtha, Devang  and
      Belfer, Robert  and
      Serban, Iulian Vlad  and
      Reddy, Siva",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.566",
    pages = "7064--7078",
    abstract = "In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between target domain and synthetic data distribution, and reduces model overfitting to source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6{\%} top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset - MLQuestions containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.",
}