Webis MS MARCO Anchor Text 2022

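The Webis MS MARCO Anchor Text 2022 dataset enriches Version 1 and Version 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The six snapshots cover the years 2016 to 2021 and contain between 1.7 and 3.4 billion documents each. For documents with more than 1,000 anchor texts, we sampled 1,000 anchor texts at random; for documents with fewer than 1,000 anchor texts, we kept all of them. With this sampling, the complete set of anchor texts is included for 94% of the documents in Version 1 and 97% of the documents in Version 2. Overall, the MS MARCO Anchor Text 2022 dataset provides anchor text for 1,703,834 documents in Version 1 and 4,821,244 documents in Version 2.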
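The per-document sampling rule described above can be sketched as follows. This is a minimal illustration, not the authors' extraction pipeline: the function name `sample_anchor_texts`, the `seed` parameter, and the choice to keep all anchor texts when a document has exactly 1,000 are assumptions made for the example.

```python
import random

# Cap stated in the dataset description: at most 1,000 anchor texts per document.
MAX_ANCHORS = 1000


def sample_anchor_texts(anchor_texts, cap=MAX_ANCHORS, seed=None):
    """Return every anchor text if there are at most `cap` of them,
    otherwise a uniform random sample of exactly `cap` anchor texts.

    `sample_anchor_texts` and `seed` are hypothetical names used only
    for this sketch; they are not part of the released dataset tooling.
    """
    anchors = list(anchor_texts)
    if len(anchors) <= cap:
        # Small documents keep all of their anchor texts.
        return anchors
    # Large documents are down-sampled without replacement.
    rng = random.Random(seed)
    return rng.sample(anchors, cap)


# Usage: a document with 2,500 anchor texts is reduced to 1,000,
# while a document with 3 anchor texts is left untouched.
many = [f"anchor {i}" for i in range(2500)]
few = ["home", "click here", "MS MARCO"]
sampled_many = sample_anchor_texts(many, seed=42)
sampled_few = sample_anchor_texts(few, seed=42)
```

Sampling without replacement means a capped document never contains more copies of an anchor text than were observed for it, and documents under the cap are unaffected.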

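Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, Zenodo, and Hugging Face. The raw dataset, which includes additional information and all metadata for the extracted anchor texts (roughly 100 GB), is available on Hugging Face and files.webis.de.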

The construction of the Webis MS MARCO Anchor Text 2022 dataset is described in detail in the associated paper. If you use this dataset, please cite:

@InProceedings{froebe:2022a,
  address =               {Berlin Heidelberg New York},
  author =                {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
  booktitle =             {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
  editor =                {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty},
  month =                 apr,
  publisher =             {Springer},
  series =                {Lecture Notes in Computer Science},
  site =                  {Stavanger, Norway},
  title =                 {{The Power of Anchor Text in the Neural Retrieval Era}},
  year =                  2022
}


Authors:

webis

Dataset size:

422.63 GB