Dataset:

castorini/msmarco_v2_passage_doc2query-t5_expansions

English

Dataset Summary

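This repository provides queries generated for the MS MARCO v2 passage corpus with docTTTTTquery (sometimes written as docT5query or doc2query-T5), the latest model in the doc2query family of document expansion models. The basic idea is to train a model that, given an input document, generates questions the document might answer (or, more broadly, queries for which the document might be relevant). These predicted questions (or queries) are then appended to the original document, which is indexed as before. The docTTTTTquery model gets its name from its use of T5 as the expansion model.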
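The expansion step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper name `expand_document`, the space separator, and the sample passage text are all assumptions made for the example.

```python
def expand_document(passage: str, predicted_queries: list[str]) -> str:
    """Append the predicted queries to the passage text, separated by spaces."""
    return " ".join([passage.strip(), *predicted_queries])


# Hypothetical passage text; the real passages live in the MS MARCO v2 corpus.
passage = "Brawlhalla is a free-to-play fighting game."
queries = ["what is brawlhalla", "does brawlhalla have health bars"]

# The expanded string is what would be handed to the indexer in place of the passage.
expanded = expand_document(passage, queries)
print(expanded)
```

The expanded text would then be indexed in place of the original passage, so query terms missing from the passage itself can still match at retrieval time.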

Dataset Structure

All three folds (train, dev, and test) share the same corpus. The queries are generated from this corpus.

An example data entry looks as follows:

{   "id": "msmarco_passage_22_0", 
    "predicted_queries": ["in drug combat does a zombie take more damage or die", "is the health bar the same as smash bros", "is brawlhalla health bar", "icpri league brawlhalla", "what is a battle brawlhalla", "is smash bros minecraft brawlhalla zombies", "what are the health bars on brawlhalla", "does smash bros have health bars", "is brawlhalla a health bar", "what is brawlhalla", "what is brwlhalla", "how many health bars is in brawlhalla", "is there health bar in brawlhalla", "what is boiledhalla?", "what is a good health bar in brawlhalla", "what is skills brawlhalla", "how many gobs in a brawlhalla", "is smash bros. an nsb game", "how many health bars are there in the brawlhalla", "what is brawlhalla"]
}
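An entry like the one above can be parsed with the standard `json` module. The sketch below also removes repeated queries (the sample contains "what is brawlhalla" twice) while keeping their order. Whether to deduplicate is a choice left to the user, since repeated queries add repeated terms to the expanded document. The helper names `parse_entry` and `unique_queries` are hypothetical, and the sample line is shortened from the entry above.

```python
import json


def parse_entry(line: str) -> tuple[str, list[str]]:
    """Return the passage id and its predicted queries from one JSON line."""
    record = json.loads(line)
    return record["id"], list(record["predicted_queries"])


def unique_queries(queries: list[str]) -> list[str]:
    """Drop exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(queries))


# Shortened version of the sample entry, for illustration only.
line = (
    '{"id": "msmarco_passage_22_0", "predicted_queries": '
    '["what is brawlhalla", "does smash bros have health bars", "what is brawlhalla"]}'
)
doc_id, queries = parse_entry(line)
deduplicated = unique_queries(queries)
print(doc_id, len(queries), len(deduplicated))
```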

Load Dataset

An example to load the dataset:

from datasets import load_dataset
dataset = load_dataset('castorini/msmarco_v2_passage_doc2query-t5_expansions', data_files='d2q/d2q.jsonl???.gz')
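In the `data_files` pattern, each `?` is a glob wildcard that matches a single character, so `d2q.jsonl???.gz` selects shards with a three-character suffix. The sketch below applies the same pattern with the standard library, assuming the gzipped JSONL shards have already been downloaded to a local directory. The helper name `iter_expansions` is hypothetical, and the demo writes a tiny synthetic shard to a temporary directory rather than touching the real data.

```python
import glob
import gzip
import json
import os
import tempfile
from typing import Iterator


def iter_expansions(pattern: str) -> Iterator[dict]:
    """Yield one record per line from every gzipped JSONL shard matching pattern."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


# Demo on a synthetic shard; the record below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "d2q.jsonl000.gz")
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        sample = {"id": "example_0", "predicted_queries": ["what is brawlhalla"]}
        handle.write(json.dumps(sample) + "\n")
    records = list(iter_expansions(os.path.join(tmp, "d2q.jsonl???.gz")))

print(records[0]["id"], len(records[0]["predicted_queries"]))
```

Reading shards lazily with a generator avoids holding the whole corpus in memory, which may matter for a collection of this size.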

Citation Information

@article{docTTTTTquery,
  title={From doc2query to {docTTTTTquery}},
  author={Nogueira, Rodrigo and Lin, Jimmy},
  year={2019}
}

@article{emdt5,
   author={Ronak Pradeep and Rodrigo Nogueira and Jimmy Lin},
   title={The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models},
   journal={arXiv:2101.05667},
   year={2021},
}