Dataset: clips/mfaq
See MQA or MFAQ Light for an updated version of the dataset.
MFAQ is a multilingual corpus of Frequently Asked Questions parsed from Common Crawl.
    from datasets import load_dataset
    load_dataset("clips/mfaq", "en")

    {
      "qa_pairs": [
        {
          "question": "Do I need a rental Car in Cork?",
          "answer": "If you plan on travelling outside of Cork City, for instance to Kinsale [...]"
        },
        ...
      ]
    }
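We collected around 6M pairs of questions and answers in 21 different languages. To download a language-specific subset, specify the language key as the configuration, as in the following example.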
load_dataset("clips/mfaq", "en") # replace "en" by any language listed below
Language | Key | Pairs | Pages |
---|---|---|---|
All | all | 6,346,693 | 1,035,649 |
English | en | 3,719,484 | 608,796 |
German | de | 829,098 | 111,618 |
Spanish | es | 482,818 | 75,489 |
French | fr | 351,458 | 56,317 |
Italian | it | 155,296 | 24,562 |
Dutch | nl | 150,819 | 32,574 |
Portuguese | pt | 138,778 | 26,169 |
Turkish | tr | 102,373 | 19,002 |
Russian | ru | 91,771 | 22,643 |
Polish | pl | 65,182 | 10,695 |
Indonesian | id | 45,839 | 7,910 |
Norwegian | no | 37,711 | 5,143 |
Swedish | sv | 37,003 | 5,270 |
Danish | da | 32,655 | 5,279 |
Vietnamese | vi | 27,157 | 5,261 |
Finnish | fi | 20,485 | 2,795 |
Romanian | ro | 17,066 | 3,554 |
Czech | cs | 16,675 | 2,568 |
Hebrew | he | 11,212 | 1,921 |
Hungarian | hu | 8,598 | 1,264 |
Croatian | hr | 5,215 | 819 |
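A small guard around the loading call can catch a mistyped language key before any download starts. The following is a minimal sketch, not part of the dataset's own tooling: `LANGUAGE_KEYS` is copied from the table above, and `check_language_key` and `load_mfaq` are hypothetical helper names introduced here for illustration.

```python
from typing import Any

# Language keys copied from the table above ("all" plus the 21 languages).
LANGUAGE_KEYS = frozenset({
    "all", "en", "de", "es", "fr", "it", "nl", "pt", "tr", "ru", "pl",
    "id", "no", "sv", "da", "vi", "fi", "ro", "cs", "he", "hu", "hr",
})


def check_language_key(key: str) -> str:
    """Return the key unchanged if it is listed in the table, else raise."""
    if key not in LANGUAGE_KEYS:
        raise ValueError(f"unknown MFAQ language key: {key!r}")
    return key


def load_mfaq(key: str) -> Any:
    """Hypothetical wrapper: validate the key, then load that subset."""
    # Imported lazily so the validation helper works without `datasets`.
    from datasets import load_dataset

    return load_dataset("clips/mfaq", check_language_key(key))
```

Calling `load_mfaq("nl")` would then request the Dutch subset, while `load_mfaq("zz")` fails immediately with a `ValueError`.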
By default, the data is organized by page: each page contains a list of questions and answers.
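A flattened version, organized by pair (i.e. with pages flattened), is also available. You can access the flat version of any language by appending _flat to the configuration (e.g. en_flat). The data is then returned pair by pair instead of page by page.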
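To make the relationship between the two layouts concrete, the sketch below flattens page records in plain Python. It assumes each page is a dict with a `qa_pairs` list of `question`/`answer` entries, as in the sample record shown earlier. The exact fields of the hosted `_flat` records are not described here, so the output shape is an assumption, and `flat_config` and `flatten_pages` are hypothetical helper names.

```python
from typing import Dict, Iterable, Iterator, List


def flat_config(key: str) -> str:
    """Build the flat configuration name for a language key, e.g. "en_flat"."""
    return f"{key}_flat"


def flatten_pages(
    pages: Iterable[Dict[str, List[Dict[str, str]]]]
) -> Iterator[Dict[str, str]]:
    """Yield one question/answer dict per pair, page after page."""
    for page in pages:
        for pair in page["qa_pairs"]:
            # Assumed output shape: only the two text fields are kept.
            yield {"question": pair["question"], "answer": pair["answer"]}
```

For example, a list of two pages holding two and one pairs respectively yields three records, in page order.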
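This section was adapted from the source data description of OSCAR.
Common Crawl is a non-profit foundation that produces and maintains an open repository of web-crawled data that is both accessible and analysable. Its complete web archive consists of petabytes of data collected over 8 years of web crawling. The organisation's crawlers have always respected nofollow and robots.txt policies.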
To construct MFAQ, we used the WARC files of Common Crawl. We looked for FAQPage markup in the HTML and then parsed the FAQItem entries from the page.
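The authors' actual parser is not described here. As a rough illustration of the idea, the sketch below extracts question/answer pairs from a page that embeds schema.org `FAQPage` markup as JSON-LD. Treating the markup as JSON-LD with `mainEntity`, `name` and `acceptedAnswer.text` fields is an assumption (pages may also use microdata, which this sketch ignores), and `extract_faq_pairs` is a hypothetical name.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_faq_pairs(html: str) -> List[Dict[str, str]]:
    """Return question/answer pairs found in FAQPage JSON-LD blocks."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    pairs: List[Dict[str, str]] = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD instead of failing the page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            entities = item.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]
            for entity in entities:
                if not isinstance(entity, dict):
                    continue
                answer = entity.get("acceptedAnswer") or {}
                if isinstance(answer, list):
                    answer = answer[0] if answer else {}
                question = entity.get("name")
                text = answer.get("text") if isinstance(answer, dict) else None
                if question and text:
                    pairs.append({"question": question, "answer": text})
    return pairs
```

The returned list has the same `question`/`answer` shape as the entries of `qa_pairs` in the sample record, so one call per page gives one nested record.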
This dataset was curated by Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans.
These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
    @misc{debruyn2021mfaq,
      title={MFAQ: a Multilingual FAQ Dataset},
      author={Maxime {De Bruyn} and Ehsan Lotfi and Jeska Buhmann and Walter Daelemans},
      year={2021},
      eprint={2109.12870},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }