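Dataset:
flax-sentence-embeddings/stackexchange_title_best_voted_answer_jsonl
Tasks:
Question answering
Sub-tasks:
closed-domain-qa
Languages:
en
Multilinguality:
multilingual
Language creators:
found
Annotation creators:
found
Source datasets:
original
License:
cc-by-nc-sa-4.0

We automatically extracted question and answer (Q&A) pairs from the Stack Exchange network. Stack Exchange gathers many Q&A communities across 50 online platforms, including the well-known Stack Overflow and other technical sites. 100 million developers consult Stack Exchange every month. The dataset is a parallel corpus in which each question is mapped to its top-rated answer. It is split by community, and the communities cover a wide range of domains, such as 3D printing, economics, Raspberry Pi, and Emacs. An exhaustive list of all communities is available here.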
Stack Exchange content is mainly in English (en).
Each data sample is presented as follows:
{'title_body': "Is there a Stack Exchange icon available? StackAuth /sites route provides all the site's icons except for the one of the Stack Exchange master site.\nCould you please provide it in some way (a static SVG would be good)?", 'upvoted_answer': 'Here it is!\n\nDead link: SVG version here\nNote: the same restrictions on this trademarked icon that apply here, also apply to the icon above.', 'downvoted_answer': 'No, the /sites route is not the right place for that.\n\n/sites enumerates all websites that expose API end-points. StackExchange.com does not expose such an endpoint, so it does not (and will not) appear in the results.'}
This particular example corresponds to the following page.
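As a minimal sketch of how such a record might be consumed, the snippet below parses one line into a question, a positive answer, and an optional negative answer. It assumes each line of a community file is a JSON object using the keys shown in the sample (`title_body`, `upvoted_answer`, `downvoted_answer`); the JSON Lines layout is inferred from the dataset name, and `QAExample`, `parse_line`, and the record text are hypothetical names and data made up for illustration.

```python
import json
from typing import NamedTuple, Optional


class QAExample(NamedTuple):
    """Hypothetical container for one question/answer record."""
    question: str
    positive: str
    negative: Optional[str]


def parse_line(line: str) -> QAExample:
    """Parse one JSON Lines record (assumed layout, see the lead-in)."""
    record = json.loads(line)
    return QAExample(
        question=record["title_body"],
        positive=record["upvoted_answer"],
        # Treated as optional here; whether every record has it is an assumption.
        negative=record.get("downvoted_answer"),
    )


# Made-up records, not taken from the dataset.
full_line = json.dumps({
    "title_body": "How do I exit Vim? I opened a file and cannot leave the editor.",
    "upvoted_answer": "Press Esc, then type :q! and hit Enter.",
    "downvoted_answer": "Just close the terminal window.",
})
partial_line = json.dumps({
    "title_body": "How do I exit Vim? I opened a file and cannot leave the editor.",
    "upvoted_answer": "Press Esc, then type :q! and hit Enter.",
})

example = parse_line(full_line)
example_without_negative = parse_line(partial_line)
print(example.question)
print(example.positive)
```

Keeping the negative answer optional lets the same parser serve both pair-based and triplet-based training setups.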
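The dataset fields, as shown in the sample above, are title_body (the question title followed by its body), upvoted_answer, and downvoted_answer.
We provide multiple splits for this dataset, each corresponding to a specific community channel. The number of pairs in each split is detailed below: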
Community | Number of pairs |
---|---|
gaming | 82,887 |
dba | 71,449 |
codereview | 41,748 |
gis | 100,254 |
english | 100,640 |
mathoverflow | 85,289 |
askubuntu | 267,135 |
electronics | 129,494 |
apple | 92,487 |
diy | 52,896 |
magento | 79,241 |
gamedev | 40,154 |
mathematica | 59,895 |
ell | 77,892 |
judaism | 26,085 |
drupal | 67,817 |
blender | 54,153 |
biology | 19,277 |
android | 38,077 |
crypto | 19,404 |
christianity | 11,498 |
cs | 30,010 |
academia | 32,137 |
chemistry | 27,061 |
aviation | 18,755 |
history | 10,766 |
japanese | 20,948 |
cooking | 22,641 |
law | 16,133 |
hermeneutics | 9,516 |
hinduism | 8,999 |
graphicdesign | 28,083 |
dsp | 17,430 |
bicycles | 15,708 |
ethereum | 26,124 |
ja | 17,376 |
arduino | 16,281 |
bitcoin | 22,474 |
islam | 10,052 |
datascience | 20,503 |
german | 13,733 |
codegolf | 8,211 |
boardgames | 11,805 |
economics | 8,844 |
emacs | 16,830 |
buddhism | 6,787 |
gardening | 13,246 |
astronomy | 9,086 |
anime | 10,131 |
fitness | 8,297 |
cstheory | 7,742 |
engineering | 8,649 |
chinese | 8,646 |
linguistics | 6,843 |
cogsci | 5,101 |
french | 10,578 |
literature | 3,539 |
ai | 5,763 |
craftcms | 11,236 |
health | 4,494 |
chess | 6,392 |
interpersonal | 3,398 |
expressionengine | 10,742 |
earthscience | 4,396 |
civicrm | 10,648 |
joomla | 5,887 |
homebrew | 5,608 |
latin | 3,969 |
ham | 3,501 |
hsm | 2,517 |
avp | 6,450 |
expatriates | 4,913 |
matheducators | 2,706 |
genealogy | 2,895 |
3dprinting | 3,488 |
devops | 3,462 |
bioinformatics | 3,135 |
computergraphics | 2,306 |
elementaryos | 5,917 |
martialarts | 1,737 |
hardwarerecs | 2,050 |
lifehacks | 2,576 |
crafts | 1,659 |
italian | 3,101 |
freelancing | 1,663 |
materials | 1,101 |
bricks | 3,530 |
cseducators | 902 |
eosio | 1,940 |
iot | 1,359 |
languagelearning | 948 |
beer | 1,012 |
ebooks | 1,107 |
coffee | 1,188 |
esperanto | 1,466 |
korean | 1,406 |
cardano | 248 |
conlang | 334 |
drones | 496 |
iota | 775 |
salesforce | 87,272 |
wordpress | 83,621 |
rpg | 40,435 |
scifi | 54,805 |
stats | 115,679 |
serverfault | 238,507 |
physics | 141,230 |
sharepoint | 80,420 |
security | 51,355 |
worldbuilding | 26,210 |
softwareengineering | 51,326 |
superuser | 352,610 |
meta | 1,000 |
money | 29,404 |
travel | 36,533 |
photo | 23,204 |
webmasters | 30,370 |
workplace | 24,012 |
ux | 28,901 |
philosophy | 13,114 |
music | 19,936 |
politics | 11,047 |
movies | 18,243 |
space | 12,893 |
skeptics | 8,145 |
raspberrypi | 24,143 |
rus | 16,528 |
puzzling | 17,448 |
webapps | 24,867 |
mechanics | 18,613 |
writers | 9,867 |
networkengineering | 12,590 |
parenting | 5,998 |
softwarerecs | 11,761 |
quant | 12,933 |
spanish | 7,675 |
scicomp | 7,036 |
pets | 6,156 |
sqa | 9,256 |
sitecore | 7,838 |
vi | 9,000 |
outdoors | 5,278 |
sound | 8,303 |
pm | 5,435 |
reverseengineering | 5,817 |
retrocomputing | 3,907 |
tridion | 5,907 |
quantumcomputing | 4,320 |
sports | 4,707 |
robotics | 4,648 |
russian | 3,937 |
opensource | 3,221 |
woodworking | 2,955 |
ukrainian | 1,767 |
opendata | 3,842 |
patents | 3,573 |
mythology | 1,595 |
portuguese | 1,964 |
tor | 4,167 |
monero | 3,508 |
sustainability | 1,674 |
musicfans | 2,431 |
poker | 1,665 |
or | 1,490 |
windowsphone | 2,807 |
stackapps | 1,518 |
moderators | 504 |
vegetarianism | 585 |
tezos | 1,169 |
stellar | 1,078 |
pt | 103,277 |
unix | 155,414 |
tex | 171,628 |
ru | 253,289 |
total | 4,750,619 |
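We designed this dataset primarily for training sentence embeddings. Sentence embeddings can be trained in a contrastive learning setup, in which the model learns to associate each sentence with its matching counterpart among multiple candidates. Such models need many examples to be effective, so building a dataset by hand can be tedious. Community networks such as Stack Exchange allow us to build many examples semi-automatically.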
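The contrastive setup described above can be illustrated with an in-batch negatives loss: for each question, the matching answer is the positive and every other answer in the batch serves as a candidate to reject. This is a minimal pure-Python sketch under stated assumptions, not the training code used for this dataset; the function names, the cosine similarity, and the `scale` value of 20.0 are illustrative choices.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(
    questions: List[Sequence[float]],
    answers: List[Sequence[float]],
    scale: float = 20.0,  # assumed temperature-like scaling factor
) -> float:
    """Mean cross-entropy where answers[i] is the positive for questions[i]."""
    losses = []
    for i, question in enumerate(questions):
        logits = [scale * cosine(question, answer) for answer in answers]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings: aligned pairs should score a lower loss than swapped ones.
question_vectors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(question_vectors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_contrastive_loss(question_vectors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Larger batches supply more candidates per question, which is one reason a corpus with millions of pairs is useful for this kind of training.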
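The source data comes from Stack Exchange.
Initial data collection and normalization
We collected the data from the math community.
We filtered out questions whose title or body is shorter than 20 characters, as well as questions whose body is longer than 4096 characters.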
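The length filter can be sketched as a single predicate. This is an illustration only: the function and constant names are hypothetical, and it assumes lengths are counted on the raw strings with `len`, that a length of exactly 20 or exactly 4096 characters is kept, and that no markup is stripped beforehand.

```python
MIN_CHARS = 20         # lower bound for both title and body, per the text
MAX_BODY_CHARS = 4096  # upper bound for the body, per the text


def keep_question(title: str, body: str) -> bool:
    """Return True if the question passes the length filter described above."""
    if len(title) < MIN_CHARS or len(body) < MIN_CHARS:
        return False  # title or body too short
    if len(body) > MAX_BODY_CHARS:
        return False  # body too long
    return True


# Made-up questions for illustration.
candidates = [
    ("Short title", "A body that is comfortably longer than twenty characters."),
    ("How do I center a div horizontally?", "I tried margin: auto but nothing moved."),
    ("A perfectly reasonable question title", "x" * 5000),
]
kept = [(title, body) for title, body in candidates if keep_question(title, body)]
print(len(kept))
```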
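Who are the source language producers?
Questions and answers are written by the developers in the Stack Exchange communities.
For license information, see https://archive.org/details/stackexchange.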
@misc{StackExchangeDataset,
  author = {Flax Sentence Embeddings Team},
  title = {Stack Exchange question pairs},
  year = {2021},
  howpublished = {https://huggingface.co/datasets/flax-sentence-embeddings/},
}
Thanks to the Flax Sentence Embeddings team for adding this dataset.