Dataset:
unicamp-dl/mmarco
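mMARCO is a multilingual version of the MS MARCO passage ranking dataset. For more information, see our paper, "mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset" (arXiv:2108.13897).
The first (now deprecated) version covered 8 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian, and Spanish. The current version adds translations into Japanese, Dutch, Vietnamese, Hindi, and Arabic, for a total of 14 languages, including the original English version.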
| Language name | Language code |
|---|---|
| English | english |
| Chinese | chinese |
| French | french |
| German | german |
| Indonesian | indonesian |
| Italian | italian |
| Portuguese | portuguese |
| Russian | russian |
| Spanish | spanish |
| Arabic | arabic |
| Dutch | dutch |
| Hindi | hindi |
| Japanese | japanese |
| Vietnamese | vietnamese |
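You can load the mMARCO dataset by selecting a specific language. We include the training triples (query, positive passage, and negative passage), the translated passage collections, and the translated query sets.
Training triples:
>>> from datasets import load_dataset
>>> dataset = load_dataset('unicamp-dl/mmarco', 'english')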
>>> dataset['train'][1]
{'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.assiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.', 'negative': 'The kola nut is the fruit of the kola tree, a genus (Cola) of trees that are native to the tropical rainforests of Africa.'}
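One common way to consume such triples is to expand each into two labeled (query, passage) pairs. The following is a minimal sketch of that idea, not part of the dataset's own tooling: the helper name `triple_to_pairs` and the 1/0 label convention are assumptions made for illustration, and the sample record is an abbreviated stand-in that only mirrors the keys shown above.

```python
from typing import Dict, List, Tuple


def triple_to_pairs(example: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Expand one training triple into two (query, passage, label) pairs.

    Hypothetical helper: label 1 marks the positive passage and label 0
    the negative one. This convention is an assumption, not something
    the dataset itself defines.
    """
    return [
        (example["query"], example["positive"], 1),  # relevant passage
        (example["query"], example["negative"], 0),  # non-relevant passage
    ]


# Abbreviated stand-in with the same keys as the record printed above.
example = {
    "query": "what fruit is native to australia",
    "positive": "Passiflora herbertiana. A rare passion fruit native to Australia.",
    "negative": "The kola nut is the fruit of the kola tree.",
}

pairs = triple_to_pairs(example)
for query, passage, label in pairs:
    print(label, "|", query, "|", passage)
```

Because the helper only reads the `query`, `positive`, and `negative` keys, it should work unchanged on a record taken from `dataset['train']` for any language.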
Queries:
>>> dataset = load_dataset('unicamp-dl/mmarco', 'queries-spanish')
>>> dataset['train'][1]
{'id': 634306, 'text': '¿Qué significa Chattel en el historial de crédito'}
Collection:
>>> dataset = load_dataset('unicamp-dl/mmarco', 'collection-portuguese')
>>> dataset['collection'][100]
{'id': 100, 'text': 'Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro, mas ele não seguiu o negócio de seu pai. Enquanto ajudava seu pai a meio tempo, estudou música e se formou na Escola de Órgãos de Praga em 1859.'}
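The three examples above suggest a naming pattern for configurations: the bare language code for training triples, and `queries-<code>` or `collection-<code>` for the other two. Below is a minimal sketch that builds those names from the language codes in the table. The helpers `config_name` and `load_mmarco` are hypothetical, and the sketch assumes the pattern holds for every listed language, which the examples above only demonstrate for English, Spanish, and Portuguese.

```python
# Language codes copied from the table above.
LANGUAGES = (
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "arabic", "dutch", "hindi",
    "japanese", "vietnamese",
)


def config_name(language: str, kind: str = "triples") -> str:
    """Build a configuration name (hypothetical helper).

    kind is "triples", "queries", or "collection". The naming pattern is
    inferred from the examples above and is an assumption.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    if kind == "triples":
        return language
    if kind in ("queries", "collection"):
        return f"{kind}-{language}"
    raise ValueError(f"unknown kind: {kind!r}")


def load_mmarco(language: str, kind: str = "triples"):
    """Load one mMARCO configuration (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so config_name works without it

    return load_dataset("unicamp-dl/mmarco", config_name(language, kind))


print(config_name("english"))                   # english
print(config_name("spanish", "queries"))        # queries-spanish
print(config_name("portuguese", "collection"))  # collection-portuguese
```

Validating the language code up front turns a typo into an immediate `ValueError` instead of a failed download.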
@misc{bonifacio2021mmarco,
title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and Roberto Lotufo and Rodrigo Nogueira},
year={2021},
eprint={2108.13897},
archivePrefix={arXiv},
primaryClass={cs.CL}
}