Dataset:

castorini/msmarco_v1_doc_segmented_doc2query-t5_expansions

Language: English

Dataset Summary

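This repository provides the queries generated for the MS MARCO V1 segmented document corpus using docTTTTTquery (sometimes written as docT5query or doc2query-T5), the latest version of the doc2query family of document expansion models. The basic idea is to train a model that, given an input document, generates questions the document might answer (or, more broadly, queries for which the document might be relevant). These predicted questions (or queries) are then appended to the original document, and the expanded document is indexed as before. The docTTTTTquery model gets its name from its use of T5 as the expansion model.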
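The append-then-index step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper name `expand_document`, the space separator, and the sample segment text are all assumptions made for the example.

```python
from typing import List


def expand_document(text: str, predicted_queries: List[str]) -> str:
    """Append predicted queries to a document's text (hypothetical helper)."""
    # Joining with single spaces is an assumption; any separator the
    # indexer tokenizes on would serve the same purpose.
    return " ".join([text, *predicted_queries])


# Hypothetical segment text; the queries are taken from the sample record
# shown in this card.
segment = "The radius of a star can be derived from its luminosity and temperature."
queries = ["what is radius of star", "how to find out radius of star"]

expanded = expand_document(segment, queries)
print(expanded)
```

The expanded string, rather than the original segment, is what would be handed to the indexer, so query terms missing from the original text can still match at retrieval time.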

Dataset Structure

All three folds (train, dev, and test) share the same corpus. An example data entry looks as follows:

{
    "id": "D1555982#0", "predicted_queries": ["when find radius of star r", "what is r radius", "how to find out radius of star", "what is radius r", "what is radius of r", "how do you find radius of star igel", "which law states that radiation is proportional to radiation?", "what is the radius of a spherical star", "what is the radius of the star", "what is radius of star"]
}
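A record of this shape can be read with the standard `json` module. The sketch below assumes that the part of `id` after `#` is the index of the segment within its parent document; the card does not state this explicitly, and `split_segment_id` is a hypothetical helper. The sample record is shortened to two queries.

```python
import json

# Shortened copy of the sample record above (two of the ten queries).
record_json = (
    '{"id": "D1555982#0", "predicted_queries": '
    '["what is radius of star", "what is the radius of the star"]}'
)
record = json.loads(record_json)


def split_segment_id(segment_id):
    """Split 'DOCID#N' into (DOCID, N). Assumes N is a segment index."""
    doc_id, _, segment = segment_id.rpartition("#")
    return doc_id, int(segment)


doc_id, segment_index = split_segment_id(record["id"])
print(doc_id, segment_index, len(record["predicted_queries"]))
```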

Loading the Dataset

An example of loading the dataset:

from datasets import load_dataset
dataset = load_dataset('castorini/msmarco_v1_doc_segmented_doc2query-t5_expansions')

Citation Information

@article{docTTTTTquery,
  title={From doc2query to {docTTTTTquery}},
  author={Nogueira, Rodrigo and Lin, Jimmy},
  year={2019}
}

@article{emdt5,
   author = "Ronak Pradeep and Rodrigo Nogueira and Jimmy Lin",
   title = "The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models",
   journal = "arXiv:2101.05667",
   year = 2021,
}