MFAQ

See MQA or MFAQ Light for an updated version of the dataset.

MFAQ is a multilingual corpus of Frequently Asked Questions (FAQs) parsed from Common Crawl.

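from datasets import load_dataset

# Load the English subset; the second argument is the language key.
load_dataset("clips/mfaq", "en")

# Abridged example of one record (one page of FAQs):
{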
  "qa_pairs": [
    {
      "question": "Do  I need a rental Car in Cork?",
      "answer": "If you plan on travelling outside of Cork City, for instance to  Kinsale [...]"
    },
    ...
  ]
}

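Languages

We collected around 6 million question-answer pairs in 21 different languages. To download a language-specific subset, pass the language key as the configuration name. An example follows.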

load_dataset("clips/mfaq", "en") # replace "en" by any language listed below

Language	Key	Pairs	Pages
All	all	6,346,693	1,035,649
English	en	3,719,484	608,796
German	de	829,098	111,618
Spanish	es	482,818	75,489
French	fr	351,458	56,317
Italian	it	155,296	24,562
Dutch	nl	150,819	32,574
Portuguese	pt	138,778	26,169
Turkish	tr	102,373	19,002
Russian	ru	91,771	22,643
Polish	pl	65,182	10,695
Indonesian	id	45,839	7,910
Norwegian	no	37,711	5,143
Swedish	sv	37,003	5,270
Danish	da	32,655	5,279
Vietnamese	vi	27,157	5,261
Finnish	fi	20,485	2,795
Romanian	ro	17,066	3,554
Czech	cs	16,675	2,568
Hebrew	he	11,212	1,921
Hungarian	hu	8,598	1,264
Croatian	hr	5,215	819

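Data Fields

Nested (per page, default)

The data is organized by page. Each page contains a list of questions and answers.

id
language
num_pairs: number of FAQs on the page
domain: source web domain of the FAQs
qa_pairs: list of questions and answers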
- question
- answer
- language
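
As a minimal sketch of how a nested record might be consumed, the snippet below walks the qa_pairs of a single page. The page shown is a hypothetical, hand-written record that only mirrors the field list above; its values are not taken from the dataset, and it assumes qa_pairs arrives as a list of dictionaries, as in the example record near the top of this card.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical page following the nested field list; all values are illustrative.
sample_page: Dict = {
    "id": "example-page-0",
    "language": "en",
    "num_pairs": 2,
    "domain": "example.com",
    "qa_pairs": [
        {"question": "Q1?", "answer": "A1.", "language": "en"},
        {"question": "Q2?", "answer": "A2.", "language": "en"},
    ],
}


def iter_pairs(page: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) tuples from one nested page record."""
    for pair in page["qa_pairs"]:
        yield pair["question"], pair["answer"]


pairs = list(iter_pairs(sample_page))
```

The same function could be applied to each record returned by load_dataset, provided the records follow the structure assumed here.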

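Flattened

The data is organized by pair (i.e. pages are flattened). You can access the flat version of any language by appending _flat to the configuration name (e.g. en_flat). The data is then returned pair by pair instead of page by page.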

domain_id
pair_id
language
domain: source web domain of the FAQs
question
answer
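
The naming rule for flat configurations can be sketched as a small helper. The function name config_name and the LANGUAGE_KEYS set are hypothetical conveniences; the keys are copied from the table in the Languages section, and whether the "all" key also has a _flat variant is an assumption here.

```python
# Language keys copied from the table in the Languages section.
LANGUAGE_KEYS = {
    "all", "en", "de", "es", "fr", "it", "nl", "pt", "tr", "ru", "pl",
    "id", "no", "sv", "da", "vi", "fi", "ro", "cs", "he", "hu", "hr",
}


def config_name(language_key: str, flat: bool = False) -> str:
    """Build a configuration name, appending "_flat" for the flattened view."""
    if language_key not in LANGUAGE_KEYS:
        raise ValueError(f"Unknown language key: {language_key!r}")
    # Assumption: every key, including "all", accepts the "_flat" suffix.
    return f"{language_key}_flat" if flat else language_key


flat_english = config_name("en", flat=True)
nested_german = config_name("de")

# Possible usage (requires network access and the datasets library):
# from datasets import load_dataset
# load_dataset("clips/mfaq", flat_english)
```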

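Source Data

This section was adapted from the source data description of OSCAR.

Common Crawl is a non-profit foundation that produces and maintains an open repository of web-crawled data that is both accessible and analyzable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The organization's crawlers have always respected nofollow and robots.txt policies.

To construct MFAQ, we used the WARC files of Common Crawl. We looked for FAQPage markup in the HTML and then parsed the FAQItem entries from each page.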
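
The extraction step can be illustrated with a rough sketch. It is not the pipeline used to build MFAQ: it assumes the FAQPage markup is embedded as schema.org JSON-LD, with questions under mainEntity and answers under acceptedAnswer, and it ignores other encodings such as microdata. The names JsonLdCollector and extract_faq_pairs, and the sample HTML, are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def extract_faq_pairs(html: str) -> List[Dict[str, str]]:
    """Return question/answer pairs from JSON-LD FAQPage blocks in a page."""
    collector = JsonLdCollector()
    collector.feed(html)
    pairs: List[Dict[str, str]] = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD
        if not isinstance(data, dict) or data.get("@type") != "FAQPage":
            continue
        entities = data.get("mainEntity", [])
        if isinstance(entities, dict):
            entities = [entities]  # a single question may not be wrapped in a list
        for item in entities:
            if not isinstance(item, dict):
                continue
            answer = item.get("acceptedAnswer")
            if isinstance(answer, dict) and "name" in item and "text" in answer:
                pairs.append({"question": item["name"], "answer": answer["text"]})
    return pairs


# Hypothetical page used only to exercise the sketch.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "Do I need a rental car?",
   "acceptedAnswer": {"@type": "Answer", "text": "Only outside the city."}}]}
</script>
</head><body></body></html>
"""
pairs = extract_faq_pairs(sample_html)
```

In practice the HTML would come from the response records of a WARC file rather than from an inline string.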

People

This dataset was developed by Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.

Licensing Information

These data are released under this licensing scheme.
We do not own any of the text from which these data have been extracted.
We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@misc{debruyn2021mfaq,
      title={MFAQ: a Multilingual FAQ Dataset}, 
      author={Maxime {De Bruyn} and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
      year={2021},
      eprint={2109.12870},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Authors:

CLiPS

Dataset size:

4.12 GB