Dataset:

clips/mqa

Tasks:

Question Answering

Sub-tasks:

multiple-choice-qa

Languages:

Multilinguality:

multilingual

Size:

size_categories:unknown

Language creators:

other

Annotation creators:

no-annotation

Source datasets:

original

License:

cc0-1.0

MQA

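MQA is a Multilingual corpus of Questions and Answers (MQA) parsed from Common Crawl. Questions fall into two types: Frequently Asked Questions (FAQ) and Community Question Answering (CQA).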

from datasets import load_dataset

# Load FAQ and CQA questions together, or each type separately
all_data = load_dataset("clips/mqa", language="en")
faq_data = load_dataset("clips/mqa", scope="faq", language="en")
cqa_data = load_dataset("clips/mqa", scope="cqa", language="en")

Each question has the following structure:

{
  "name": "the title of the question (if any)",
  "text": "the body of the question (if any)",
  "answers": [{
    "text": "the text of the answer",
    "is_accepted": "true|false"
  }]
}

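Languages

We collected around 234M question-answer pairs in 39 languages. To download the subset for a specific language, specify the language key in the configuration, as in the example below.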

load_dataset("clips/mqa", language="en") # replace "en" by any language listed below

Language	FAQ	CQA
en	174,696,414	14,082,180
de	17,796,992	1,094,606
es	14,967,582	845,836
fr	13,096,727	1,299,359
ru	12,435,022	1,715,131
it	6,850,573	455,027
ja	6,369,706	2,089,952
zh	5,940,796	579,596
pt	5,851,286	373,982
nl	4,882,511	503,376
tr	3,893,964	370,975
pl	3,766,531	70,559
vi	2,795,227	96,528
id	2,253,070	200,441
ar	2,211,795	805,661
uk	2,090,611	27,260
el	1,758,618	17,167
no	1,752,820	11,786
sv	1,733,582	20,024
fi	1,717,221	41,371
ro	1,689,471	93,222
th	1,685,463	73,204
da	1,554,581	16,398
he	1,422,449	88,435
ko	1,361,901	49,061
cs	1,224,312	143,863
hu	878,385	27,639
fa	787,420	118,805
sk	785,101	4,615
lt	672,105	301
et	547,208	441
hi	516,342	205,645
hr	458,958	11,677
is	437,748	37
lv	428,002	88
ms	230,568	7,460
bg	198,671	5,320
sr	110,270	3,980
ca	100,201	1,914
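
When choosing a language subset, it can help to compare its FAQ and CQA counts. The following minimal sketch parses a few rows copied from the table above. The helper name `parse_rows` is made up for illustration and is not part of the dataset's tooling.

```python
# Rows copied verbatim from the table above (tab-separated: language, FAQ, CQA).
ROWS = """\
en\t174,696,414\t14,082,180
de\t17,796,992\t1,094,606
is\t437,748\t37
"""


def parse_rows(raw: str) -> dict[str, tuple[int, int]]:
    """Map each language code to its (FAQ, CQA) pair counts."""
    counts = {}
    for line in raw.strip().splitlines():
        lang, faq, cqa = line.split("\t")
        # Counts use commas as thousands separators, so strip them first.
        counts[lang] = (int(faq.replace(",", "")), int(cqa.replace(",", "")))
    return counts


counts = parse_rows(ROWS)
for lang, (faq, cqa) in counts.items():
    total = faq + cqa
    print(f"{lang}: {total:,} pairs, {cqa / total:.2%} CQA")
```

For small languages the CQA share can be tiny, which is worth knowing before requesting `scope="cqa"` for them.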

FAQ vs. CQA

You can download the Frequently Asked Questions (FAQ) part or the Community Question Answering (CQA) part of the dataset.

faq = load_dataset("clips/mqa", scope="faq")
cqa = load_dataset("clips/mqa", scope="cqa")
all_data = load_dataset("clips/mqa", scope="all")  # named all_data to avoid shadowing the built-in all()

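FAQ and CQA questions share the same structure, but a CQA question can have several answers whereas an FAQ question has exactly one. FAQ questions typically have only a title (the name key), while CQA questions have both a title and a body (name and text).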
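
The structural difference described above can be handled with one helper that works for both kinds of question. This is a minimal sketch on hand-written records that follow the structure shown earlier. `question_text` is a hypothetical helper, and the sample values are invented for illustration.

```python
def question_text(question: dict) -> str:
    """Join the title and body of a question, skipping whichever is missing."""
    parts = [question.get("name"), question.get("text")]
    return "\n\n".join(p.strip() for p in parts if p and p.strip())


# Invented sample records that mimic the documented structure.
faq_question = {
    "name": "How do I reset my password?",
    "text": "",
    "answers": [
        {"text": "Use the reset link on the login page.", "is_accepted": False},
    ],
}
cqa_question = {
    "name": "Install fails on Linux",
    "text": "The installer exits with an error on startup.",
    "answers": [
        {"text": "Update your package index first.", "is_accepted": True},
        {"text": "Try running it as root.", "is_accepted": False},
    ],
}

print(question_text(faq_question))
print(question_text(cqa_question))
```

Treating a missing title or body as optional keeps the same code path usable whether the data was loaded with `scope="faq"`, `scope="cqa"`, or `scope="all"`.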

Nesting and Data Fields

You can specify three different nesting levels: question, page and domain.

Question

load_dataset("clips/mqa", level="question") # default

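The default level is the question object:

name: the title of the question (if any), in markdown format
text: the body of the question (if any), in markdown format
answers: a list of answers
- name: the title of the answer (if any), in markdown format
- text: the body of the answer, in markdown format
- is_accepted: true if the answer was selected as the accepted one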
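
A common use of these fields is to turn each question into (question, answer) pairs. The sketch below assumes that `is_accepted` is a boolean and that a question without any accepted answer should keep all of its answers. Both are assumptions rather than documented behaviour, and `to_pairs` is a hypothetical helper.

```python
def to_pairs(question: dict) -> list[tuple[str, str]]:
    """Return (title, answer text) pairs, preferring accepted answers."""
    # Ignore answers whose body is empty.
    answers = [a for a in question["answers"] if a.get("text")]
    accepted = [a for a in answers if a.get("is_accepted")]
    chosen = accepted or answers  # assumption: fall back to every answer
    title = question.get("name") or ""
    return [(title, a["text"]) for a in chosen]


# Invented sample record that mimics the documented structure.
sample = {
    "name": "Install fails on Linux",
    "text": "The installer exits with an error on startup.",
    "answers": [
        {"text": "Update your package index first.", "is_accepted": True},
        {"text": "Try running it as root.", "is_accepted": False},
    ],
}

pairs = to_pairs(sample)
print(pairs)
```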

Page

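This level returns the list of questions found on the same page. It is mostly useful for FAQs, since CQA pages already contain a single question each.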

load_dataset("clips/mqa", level="page")

Domain

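This level returns the list of pages found on the same web domain. It is a good way to cope with FAQ duplication: sample one page per domain at each epoch.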

load_dataset("clips/mqa", level="domain")
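
The per-epoch sampling mentioned above could look like the following. The record layout used here (a `domain` string and a `pages` list) is assumed purely for illustration, so check the actual field names returned by `level="domain"` before adapting it. `sample_one_page_per_domain` is a hypothetical helper.

```python
import random


def sample_one_page_per_domain(domains: list[dict], epoch: int, seed: int = 0) -> list:
    """Pick one page from every non-empty domain, reproducibly for a given epoch."""
    # A string seed keeps each (seed, epoch) combination distinct.
    rng = random.Random(f"{seed}-{epoch}")
    return [rng.choice(d["pages"]) for d in domains if d["pages"]]


# Invented domain records; "domain" and "pages" are assumed field names.
domains = [
    {"domain": "example.com", "pages": ["example.com/faq", "example.com/help"]},
    {"domain": "example.org", "pages": ["example.org/faq"]},
]

for epoch in range(3):
    print(epoch, sample_one_page_per_domain(domains, epoch))
```

Seeding the generator from the epoch number makes each epoch's selection repeatable while still letting different epochs see different pages of a large domain.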

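Source Data

This section was adapted from the source data description of OSCAR.

Common Crawl is a non-profit foundation that produces and maintains an open repository of web crawl data that is both accessible and analysable. Common Crawl's complete web archive consists of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.

To construct MQA, we used the WARC files of Common Crawl.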

People

This dataset was developed by Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.

Licensing Information

These data are released under this licensing scheme.
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@inproceedings{de-bruyn-etal-2021-mfaq,
    title = "{MFAQ}: a Multilingual {FAQ} Dataset",
    author = "De Bruyn, Maxime  and
      Lotfi, Ehsan  and
      Buhmann, Jeska  and
      Daelemans, Walter",
    booktitle = "Proceedings of the 3rd Workshop on Machine Reading for Question Answering",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrqa-1.1",
    pages = "1--13",
}

Author:

clips

Dataset size:

1009.32 MB