Dataset:

BeIR/arguana-qrels

Tasks:

text-retrieval

Sub-tasks:

entity-linking-retrieval fact-checking-retrieval

Languages:

English

Multilinguality:

monolingual

License:

cc-by-sa-4.0

Dataset Card for BEIR Benchmark

Dataset Summary

BEIR is a heterogeneous benchmark built from 18 diverse datasets representing 9 information retrieval tasks:

Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CQADupstack
Citation-Prediction: SCIDOCS
Tweet Retrieval: Signal-1M
Entity Retrieval: DBPedia

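All of these datasets have been preprocessed and can be used for your experiments.

Supported Tasks and Leaderboards

The dataset supports a leaderboard that evaluates models against task-specific metrics, such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia.

The current best-performing models can be found here.

Languages

All tasks are in English (en).

Dataset Structure

All BEIR datasets must contain a corpus, queries, and qrels (a relevance judgments file). They must be in the following format: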

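Corpus file: a .jsonl file (jsonlines) with one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage). For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
Queries file: a .jsonl file (jsonlines) with one dictionary per line, each with two fields: _id (unique query identifier) and text (query text). For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
Qrels file: a .tsv file (tab-separated) with three columns, query-id, corpus-id, and score, in that order. Keep the first row as a header. For example: q1 doc1 1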
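
As a rough illustration of the three file formats above, the following sketch reads them into plain dictionaries using only the Python standard library. The helper names (load_jsonl, load_qrels) and the file names are assumptions made for this example; they are not part of BEIR's own tooling.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            records[row.pop("_id")] = row
    return records


def load_qrels(path):
    """Read a tab-separated qrels file with a header row into {query_id: {doc_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return qrels


# Tiny sample files (hypothetical names) mirroring the examples above.
tmp = Path(tempfile.mkdtemp())
(tmp / "corpus.jsonl").write_text(
    json.dumps({"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}) + "\n",
    encoding="utf-8",
)
(tmp / "queries.jsonl").write_text(
    json.dumps({"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}) + "\n",
    encoding="utf-8",
)
(tmp / "qrels.tsv").write_text("query-id\tcorpus-id\tscore\nq1\tdoc1\t1\n", encoding="utf-8")

corpus = load_jsonl(tmp / "corpus.jsonl")
queries = load_jsonl(tmp / "queries.jsonl")
qrels = load_qrels(tmp / "qrels.tsv")
```

Keying every structure by id keeps lookups from a qrels row to its query and document cheap. For the largest corpora listed in this card, streaming the corpus file line by line may be preferable to holding it all in memory.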

Data Instances

A high-level example of any BEIR dataset:

corpus = {
    "doc1" : {
        "title": "Albert Einstein",
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
    },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
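
To show how these dictionaries fit together at evaluation time, here is a minimal sketch that scores a ranking against qrels with nDCG@k. The results dictionary, its scores, and the ndcg_at_k helper are hypothetical and exist only for this illustration. The sketch uses the linear-gain form of DCG, which may differ in detail from what a given evaluation toolkit computes.

```python
import math


def ndcg_at_k(qrels, results, k=10):
    """Mean nDCG@k over the queries in `qrels`, using linear gain: rel / log2(rank + 1)."""
    scores = []
    for query_id, relevant in qrels.items():
        # Rank the retrieved documents by descending retrieval score and keep the top k.
        ranked = sorted(results.get(query_id, {}).items(), key=lambda kv: kv[1], reverse=True)[:k]
        dcg = sum(
            relevant.get(doc_id, 0) / math.log2(rank + 1)
            for rank, (doc_id, _) in enumerate(ranked, start=1)
        )
        # The ideal ranking places the most relevant documents first.
        ideal = sorted(relevant.values(), reverse=True)[:k]
        idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# Hypothetical retrieval output: query id -> {document id: retrieval score}.
results = {
    "q1": {"doc1": 0.9, "doc2": 0.1},  # relevant document ranked first
    "q2": {"doc1": 0.8, "doc2": 0.3},  # relevant document ranked second
}

score = ndcg_at_k(qrels, results, k=10)
```

Because qrels is nested as query id, then document id, then score, looking up the judgment for any retrieved document is a single dictionary access; unjudged documents are treated as non-relevant here.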

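Data Fields

Examples from all configurations have the following features:

Corpus

corpus: a dict feature representing the document title and passage text, made up of:
- _id: a string feature representing the unique document id
  - title: a string feature denoting the title of the document.
  - text: a string feature denoting the text of the document.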

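Queries

queries: a dict feature representing the query, made up of:
- _id: a string feature representing the unique query id
- text: a string feature denoting the text of the query.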

Qrels

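qrels: a dict feature representing the query-document relevance judgments, made up of:
- _id: a string feature representing the query id
  - _id: a string feature denoting the document id.
  - score: an int32 feature denoting the relevance judgment between query and document.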
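
The field descriptions above imply a few consistency rules, for example that every id in qrels should refer to an existing query and document. A small check along those lines might look like the following. It assumes the in-memory layout shown under Data Instances, and validate_beir is a hypothetical helper written for this card, not a BEIR API.

```python
def validate_beir(corpus, queries, qrels):
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    for doc_id, doc in corpus.items():
        if not isinstance(doc.get("text"), str):
            problems.append(f"corpus[{doc_id!r}]: 'text' must be a string")
        if not isinstance(doc.get("title", ""), str):
            problems.append(f"corpus[{doc_id!r}]: 'title' must be a string")
    for query_id, text in queries.items():
        if not isinstance(text, str):
            problems.append(f"queries[{query_id!r}]: query text must be a string")
    for query_id, judgments in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels: unknown query id {query_id!r}")
        for doc_id, score in judgments.items():
            if doc_id not in corpus:
                problems.append(f"qrels[{query_id!r}]: unknown document id {doc_id!r}")
            if not isinstance(score, int):
                problems.append(f"qrels[{query_id!r}][{doc_id!r}]: score must be an int")
    return problems


corpus = {"doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born..."}}
queries = {"q1": "Who developed the mass-energy equivalence formula?"}

ok = validate_beir(corpus, queries, {"q1": {"doc1": 1}})
bad = validate_beir(corpus, queries, {"q9": {"doc7": 1}})
```

Returning a list of messages rather than raising on the first error makes it easier to see every inconsistency in one pass.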

Data Splits

Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
MSMARCO	[link]	msmarco	train dev test	6,980	8.84M	1.1	[link]	444067daf65d982533ea17ebd59501e4
TREC-COVID	[link]	trec-covid	test	50	171K	493.5	[link]	ce62140cb23feb9becf6270d0d1fe6d1
NFCorpus	[link]	nfcorpus	train dev test	323	3.6K	38.2	[link]	a89dba18a62ef92f7d323ec890a0d38d
BioASQ	[link]	bioasq	train test	500	14.91M	8.05	No	[link]
NQ	[link]	nq	train test	3,452	2.68M	1.2	[link]	d4d3d2e48787a744b6f6e691ff534307
HotpotQA	[link]	hotpotqa	train dev test	7,405	5.23M	2.0	[link]	f412724f78b0d91183a0e86805e16114
FiQA-2018	[link]	fiqa	train dev test	648	57K	2.6	[link]	17918ed23cd04fb15047f73e6c3bd9d9
Signal-1M(RT)	[link]	signal1m	test	97	2.86M	19.6	No	[link]
TREC-NEWS	[link]	trec-news	test	57	595K	19.6	No	[link]
ArguAna	[link]	arguana	test	1,406	8.67K	1.0	[link]	8ad3e3c2a5867cdced806d6503f29b99
Touche-2020	[link]	webis-touche2020	test	49	382K	19.0	[link]	46f650ba5a527fc69e0a6521c5a23563
CQADupstack	[link]	cqadupstack	test	13,145	457K	1.4	[link]	4e41456d7df8ee7760a7f866133bda78
Quora	[link]	quora	dev test	10,000	523K	1.6	[link]	18fb154900ba42a600f84b839c173167
DBPedia	[link]	dbpedia-entity	dev test	400	4.63M	38.2	[link]	c2a39eb420a3164af735795df012ac2c
SCIDOCS	[link]	scidocs	test	1,000	25K	4.9	[link]	38121350fc3a4d2f48850f6aff52e4a9
FEVER	[link]	fever	train dev test	6,666	5.42M	1.2	[link]	5a818580227bfb4b35bb6fa46d9b6c03
Climate-FEVER	[link]	climate-fever	test	1,535	5.42M	3.0	[link]	8b66f0a9126c521bae2bde127b4dc99d
SciFact	[link]	scifact	train test	300	5K	1.1	[link]	5f7d1de60b170fc8027bb7898e2efca1
Robust04	[link]	robust04	test	249	528K	69.9	No	[link]

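Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

Cite as: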

@inproceedings{
thakur2021beir,
title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}

Contributions

Thanks to @Nthakur20 for adding this dataset.

Author:

BeIR

Dataset size:

109.24 KB