Dataset: unicamp-dl/mmarco
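mMARCO is a multilingual version of the MS MARCO passage ranking dataset. For more information, see our paper, "mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset" (arXiv:2108.13897).
The first (deprecated) version comprised 8 languages: Chinese, French, German, Indonesian, Italian, Portuguese, Russian and Spanish. The current version adds translations into Japanese, Dutch, Vietnamese, Hindi and Arabic, for a total of 14 languages, including the original English version.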
Language name | Language code |
---|---|
English | english |
Chinese | chinese |
French | french |
German | german |
Indonesian | indonesian |
Italian | italian |
Portuguese | portuguese |
Russian | russian |
Spanish | spanish |
Arabic | arabic |
Dutch | dutch |
Hindi | hindi |
Japanese | japanese |
Vietnamese | vietnamese |
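You can load the mMARCO dataset by selecting a specific language. We include the training triples (query, positive example and negative example) and the translated collections of documents and queries.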
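As a small illustration of how a language choice maps to a configuration, the sketch below builds the three configuration names for one language code from the table above. The naming pattern (`<language>`, `queries-<language>`, `collection-<language>`) is inferred from the loading examples in this card and is assumed, not confirmed, to hold for every language. The `LANGUAGES` tuple and the `config_names` helper are hypothetical names introduced here for illustration.

```python
# Language codes copied from the table in this card.
LANGUAGES = (
    "english", "chinese", "french", "german", "indonesian", "italian",
    "portuguese", "russian", "spanish", "arabic", "dutch", "hindi",
    "japanese", "vietnamese",
)


def config_names(language: str) -> dict:
    """Return the assumed configuration names for one language code.

    The pattern is inferred from this card's examples ('english',
    'queries-spanish', 'collection-portuguese'); it is an assumption.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    return {
        "triples": language,                     # training triples
        "queries": f"queries-{language}",        # translated queries
        "collection": f"collection-{language}",  # translated passages
    }


names = config_names("spanish")
print(names)
```

Each returned name could then be passed as the second argument to `load_dataset('unicamp-dl/mmarco', ...)`.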
Training triples

    >>> from datasets import load_dataset
    >>> dataset = load_dataset('unicamp-dl/mmarco', 'english')
    >>> dataset['train'][1]
    {'query': 'what fruit is native to australia', 'positive': 'Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.assiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible.', 'negative': 'The kola nut is the fruit of the kola tree, a genus (Cola) of trees that are native to the tropical rainforests of Africa.'}

Queries

    >>> dataset = load_dataset('unicamp-dl/mmarco', 'queries-spanish')
    >>> dataset['train'][1]
    {'id': 634306, 'text': '¿Qué significa Chattel en el historial de crédito'}

Collection

    >>> dataset = load_dataset('unicamp-dl/mmarco', 'collection-portuguese')
    >>> dataset['collection'][100]
    {'id': 100, 'text': 'Antonín Dvorák (1841-1904) Antonin Dvorak era filho de açougueiro, mas ele não seguiu o negócio de seu pai. Enquanto ajudava seu pai a meio tempo, estudou música e se formou na Escola de Órgãos de Praga em 1859.'}
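One way the training triples might be consumed is by flattening each triple into two labelled (query, passage) pairs, for example as input to a pointwise reranker. The sketch below is not part of the dataset tooling: `triples_to_pairs` is a hypothetical helper, and `sample` is a stand-in record with abbreviated text that only mirrors the `query` / `positive` / `negative` keys of the training-triple example in this card.

```python
from typing import Dict, Iterable, Iterator, Tuple


def triples_to_pairs(
    triples: Iterable[Dict[str, str]],
) -> Iterator[Tuple[str, str, int]]:
    """Yield (query, passage, label) with label 1 for the positive
    passage and 0 for the negative passage of each triple."""
    for triple in triples:
        yield triple["query"], triple["positive"], 1
        yield triple["query"], triple["negative"], 0


# Stand-in record (abbreviated text) with the same keys as a training triple.
sample = [
    {
        "query": "what fruit is native to australia",
        "positive": "Passiflora herbertiana. A rare passion fruit native to Australia.",
        "negative": "The kola nut is the fruit of the kola tree.",
    }
]

pairs = list(triples_to_pairs(sample))
for query, passage, label in pairs:
    print(label, query, "|", passage)
```

Because the helper accepts any iterable of dictionaries with these three keys, the `train` split from the training-triples example could presumably be passed in place of `sample`.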
    @misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
      author={Luiz Henrique Bonifacio and Vitor Jeronymo and Hugo Queiroz Abonizio and Israel Campiotti and Marzieh Fadaee and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }