Dataset:

unicamp-dl/mmarco

English

Dataset Summary

mMARCO is a multilingual version of the MS MARCO passage ranking dataset. For more information, see our paper (arXiv:2108.13897).

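The first (now deprecated) version covered 8 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian and Spanish. The current version adds translations into Japanese, Dutch, Vietnamese, Hindi and Arabic, for a total of 14 languages (including the original English version).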

Supported Languages

| Language name | Language code |
|---------------|---------------|
| English       | english       |
| Chinese       | chinese       |
| French        | french        |
| German        | german        |
| Indonesian    | indonesian    |
| Italian       | italian       |
| Portuguese    | portuguese    |
| Russian       | russian       |
| Spanish       | spanish       |
| Arabic        | arabic        |
| Dutch         | dutch         |
| Hindi         | hindi         |
| Japanese      | japanese      |
| Vietnamese    | vietnamese    |
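
The language codes above double as configuration names when loading the dataset. The sketch below assumes that the naming pattern seen in this card's loading examples (`<code>` for training triples, `queries-<code>` and `collection-<code>` for the other two subsets) holds for every listed language. `SUPPORTED_LANGUAGES` and `config_name` are hypothetical names used for illustration, not part of the dataset's API.

```python
# Language codes copied from the table of supported languages.
SUPPORTED_LANGUAGES = (
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "arabic", "dutch", "hindi",
    "japanese", "vietnamese",
)


def config_name(language: str, subset: str = "triples") -> str:
    """Build a configuration name for one language and one subset.

    Assumption: the '<code>', 'queries-<code>' and 'collection-<code>'
    pattern applies to all supported languages.
    """
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if subset == "triples":
        return code
    if subset in ("queries", "collection"):
        return f"{subset}-{code}"
    raise ValueError(f"unknown subset: {subset!r}")
```

The returned string can then be passed as the second argument of `load_dataset`, for example `load_dataset('unicamp-dl/mmarco', config_name('spanish', 'queries'))`.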

Dataset Structure

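You can load the mMARCO dataset by choosing a specific language. We include the training triples (query, positive example and negative example) as well as the translated collections of documents and queries.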

Training triples
>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mmarco', 'english')
>>> dataset['train'][1]
{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.assiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.', 'negative': 'The kola nut is the fruit of the kola tree, a genus (Cola) of trees that are native to the tropical rainforests of Africa.'}
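
Each training record is a flat dictionary with `query`, `positive` and `negative` fields. One common way to consume such triples, sketched here as an assumption rather than as the authors' training procedure, is to expand each record into two labeled query-passage pairs. `triple_to_pairs` is a hypothetical helper, and the passages in the sample record are shortened.

```python
from typing import Dict, List, Tuple


def triple_to_pairs(triple: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Expand one triple into (query, passage, label) pairs.

    Label 1 marks the positive passage, label 0 the negative one.
    """
    return [
        (triple["query"], triple["positive"], 1),
        (triple["query"], triple["negative"], 0),
    ]


# A record with the same fields as dataset['train'][1]; texts are shortened.
sample = {
    "query": "what fruit is native to australia",
    "positive": "Passiflora herbertiana. A rare passion fruit native to Australia.",
    "negative": "The kola nut is the fruit of the kola tree.",
}

pairs = triple_to_pairs(sample)
for query, passage, label in pairs:
    print(label, query, "->", passage[:40])
```
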
Queries
>>> dataset = load_dataset('unicamp-dl/mmarco', 'queries-spanish')
>>> dataset['train'][1]
{'id': 634306, 'text': '¿Qué significa Chattel en el historial de crédito'}
Collection
>>> dataset = load_dataset('unicamp-dl/mmarco', 'collection-portuguese')
>>> dataset['collection'][100]
{'id': 100, 'text': 'Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro, mas ele não seguiu o negócio de seu pai. Enquanto ajudava seu pai a meio tempo, estudou música e se formou na Escola de Órgãos de Praga em 1859.'}
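
Query and collection records share the same `id`/`text` layout, so a plain dictionary is enough to look up text by identifier. The following is a minimal sketch on in-memory records shaped like the examples above. `build_id_index` is a hypothetical helper, and the passage text is shortened from the collection example.

```python
from typing import Any, Dict, Iterable, Mapping


def build_id_index(records: Iterable[Mapping[str, Any]]) -> Dict[int, str]:
    """Map each record's 'id' to its 'text'; later duplicates overwrite earlier ones."""
    return {int(record["id"]): str(record["text"]) for record in records}


# Records shaped like the query and collection examples; the passage is shortened.
queries = [
    {"id": 634306, "text": "¿Qué significa Chattel en el historial de crédito"},
]
passages = [
    {"id": 100, "text": "Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro"},
]

query_index = build_id_index(queries)
passage_index = build_id_index(passages)
print(query_index[634306])
print(passage_index[100])
```

The same function could be applied to a loaded split such as `dataset['collection']`, although holding a full collection in memory this way may be expensive.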

Citation Information

@misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset}, 
      author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}