Dataset:

unicamp-dl/mrobust

English

Dataset Summary

mRobust is a multilingual version of the TREC 2004 Robust passage ranking dataset. For more information, see our paper: https://arxiv.org/abs/2209.13738

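The current version covers 10 translated languages, alongside the original English: Chinese, French, German, Indonesian, Italian, Portuguese, Russian, Spanish, Dutch, and Vietnamese.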

Supported Languages

Language name    Language code
-------------    -------------
English          english
Chinese          chinese
French           french
German           german
Indonesian       indonesian
Italian          italian
Portuguese       portuguese
Russian          russian
Spanish          spanish
Dutch            dutch
Vietnamese       vietnamese

Dataset Structure

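You can load the mRobust dataset by selecting a specific language. We provide both the translated document collection and the translated queries.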
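The configuration names used in the loading examples in this section appear to follow a `<kind>-<language code>` pattern, such as `queries-spanish` and `collection-portuguese`. The sketch below builds such a name from the language codes listed in the table above. The helper names `config_name` and `load_mrobust` are hypothetical, and the assumption that every listed language has both a `queries` and a `collection` configuration is not confirmed by this card.

```python
# Language codes copied from the language table in this card.
LANGUAGE_CODES = {
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "dutch", "vietnamese",
}

# The two kinds of configuration that appear in the card's examples.
KINDS = {"queries", "collection"}


def config_name(kind: str, language: str) -> str:
    """Build a configuration name such as 'queries-spanish' (hypothetical helper)."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind!r}; expected one of {sorted(KINDS)}")
    if language not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code {language!r}")
    return f"{kind}-{language}"


def load_mrobust(kind: str, language: str):
    """Load one mRobust configuration (hypothetical wrapper, needs network access)."""
    # Imported lazily so config_name() can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the configuration exists on the Hub for this kind and language.
    return load_dataset("unicamp-dl/mrobust", config_name(kind, language))
```

Validating the language code locally gives a clearer error than a failed download when a code is misspelled.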

Queries

>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mrobust', 'queries-spanish')
>>> dataset['queries'][1]
{'id': '302', 'text': '¿Está controlada la enfermedad de la poliomielitis (polio) en el mundo?'}

Collection

>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mrobust', 'collection-portuguese')
>>> dataset['collection'][5]
{'id': 'FT931-16660', 'text': '930105 FT 05 JAN 93 / Cenelec: Correção O endereço do Cenelec, Comitê Europeu de Normalização Eletrotécnica, estava incorreto na edição de ontem. É Rue de Stassart 35, B-1050, Bruxelas, Tel (322) 519 6871. CEN, Comitê Europeu de Normalização, está localizado na Rue de Stassart 36, B-1050, Bruxelas, Tel 519 6811.'}
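
Both queries and documents are records with an `id` and a `text` field, so a common next step is to look up text by identifier. The following is a minimal sketch of one way to do that; `build_id_index` is a hypothetical helper, and the sample record is the Spanish query shown above rather than data loaded from the Hub.

```python
from typing import Dict, Iterable, Mapping


def build_id_index(records: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each record's 'id' to its 'text' (hypothetical helper)."""
    index: Dict[str, str] = {}
    for record in records:
        record_id = record["id"]
        # Fail loudly on duplicates instead of silently overwriting an entry.
        if record_id in index:
            raise ValueError(f"duplicate id: {record_id!r}")
        index[record_id] = record["text"]
    return index


# In-memory sample with the same shape as the records shown in this card.
sample_queries = [
    {
        "id": "302",
        "text": "¿Está controlada la enfermedad de la poliomielitis (polio) en el mundo?",
    },
]

query_text_by_id = build_id_index(sample_queries)
print(query_text_by_id["302"])
```

With the real data, the same function could presumably be called as `build_id_index(dataset['queries'])`, since iterating over a loaded split yields one dictionary per record. For the full document collection, a plain dictionary holds every text in memory, which may or may not be acceptable depending on the machine.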

Citation Information

@misc{https://doi.org/10.48550/arxiv.2209.13738,
  doi = {10.48550/ARXIV.2209.13738},
  url = {https://arxiv.org/abs/2209.13738},
  author = {Jeronymo, Vitor and Nascimento, Mauricio and Lotufo, Roberto and Nogueira, Rodrigo},
  title = {mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}