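Dataset: flax-sentence-embeddings/stackexchange_titlebody_best_and_down_voted_answer_jsonl
Tasks: question-answering
Sub-tasks: closed-domain-qa
Languages: en
Multilinguality: multilingual
Language creators: found
Annotations creators: found
Source datasets: original
License: cc-by-nc-sa-4.0

We automatically extracted question and answer (Q&A) pairs from the Stack Exchange network. Stack Exchange gathers many Q&A communities across 50 online platforms, including the well-known Stack Overflow and other technical sites. 100 million developers consult Stack Exchange every month. The dataset is a parallel corpus in which each question is mapped to its top-rated answer. It is split by community, and the communities cover a wide range of domains such as 3D printing, economics, Raspberry Pi, and Emacs. An exhaustive list of all communities is available here.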
Stack Exchange content is mainly written in English (en).
Each data sample has the following format:
{'title_body': "Is there a Stack Exchange icon available? StackAuth /sites route provides all the site's icons except for the one of the Stack Exchange master site.\nCould you please provide it in some way (a static SVG would be good)?", 'upvoted_answer': 'Here it is!\n\nDead link: SVG version here\nNote: the same restrictions on this trademarked icon that apply here, also apply to the icon above.', 'downvoted_answer': 'No, the /sites route is not the right place for that.\n\n/sites enumerates all websites that expose API end-points. StackExchange.com does not expose such an endpoint, so it does not (and will not) appear in the results.'}
This particular example corresponds to the following page.
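The fields in the dataset contain the following information:
- title_body: the question title followed by the question body
- upvoted_answer: the body of the most upvoted answer
- downvoted_answer: the body of the most downvoted answer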
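As a rough illustration of how the three fields might be consumed, the sketch below parses JSON Lines records into a small typed container. It assumes each split is stored as one JSON object per line (suggested by the `_jsonl` suffix in the dataset name, but not confirmed here); the `QAExample` class, the `parse_jsonl` helper, and the toy record are hypothetical and are not part of the dataset.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class QAExample:
    """Hypothetical container mirroring the three fields of one record."""
    title_body: str        # question title followed by the question body
    upvoted_answer: str    # body of the most upvoted answer
    downvoted_answer: str  # body of the most downvoted answer


def parse_jsonl(lines):
    """Yield one QAExample per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        record = json.loads(line)
        yield QAExample(
            title_body=record["title_body"],
            upvoted_answer=record["upvoted_answer"],
            downvoted_answer=record["downvoted_answer"],
        )


# Toy stand-in for the contents of one split file (not a real dataset row).
sample_lines = [
    json.dumps({
        "title_body": "How do I exit the editor? I opened it by accident.",
        "upvoted_answer": "Press Esc, then type :q! and press Enter.",
        "downvoted_answer": "Just restart the computer.",
    }),
    "",
]

examples = list(parse_jsonl(sample_lines))
print(len(examples), examples[0].title_body)
```

With a real split file, `parse_jsonl` could be given an open file handle instead of `sample_lines`, since iterating a text file yields its lines.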
We provide multiple splits for this dataset, each corresponding to a given community channel. The number of pairs in each split is detailed below:
Split | Number of pairs |
---|---|
english | 13,003 |
academia | 2,465 |
christianity | 1,502 |
apple | 6,696 |
electronics | 4,014 |
gaming | 7,321 |
askubuntu | 9,975 |
ell | 4,438 |
hermeneutics | 1,719 |
judaism | 2,216 |
diy | 2,037 |
law | 1,297 |
history | 1,099 |
islam | 2,037 |
dba | 2,502 |
cooking | 2,064 |
gamedev | 1,598 |
drupal | 1,714 |
chemistry | 1,523 |
android | 2,830 |
mathoverflow | 1,109 |
magento | 1,849 |
buddhism | 770 |
gis | 1,843 |
graphicdesign | 1,565 |
codereview | 666 |
aviation | 903 |
bicycles | 984 |
japanese | 1,124 |
cs | 936 |
german | 1,047 |
interpersonal | 469 |
biology | 832 |
bitcoin | 1,068 |
blender | 1,312 |
crypto | 595 |
anime | 802 |
boardgames | 691 |
hinduism | 343 |
french | 632 |
fitness | 567 |
economics | 441 |
chinese | 611 |
codegolf | 333 |
linguistics | 442 |
astronomy | 371 |
arduino | 595 |
chess | 402 |
cstheory | 314 |
ja | 328 |
martialarts | 254 |
mathematica | 262 |
dsp | 387 |
ethereum | 479 |
health | 299 |
cogsci | 221 |
earthscience | 229 |
gardening | 210 |
datascience | 325 |
literature | 191 |
matheducators | 177 |
lifehacks | 316 |
engineering | 227 |
ham | 158 |
3dprinting | 109 |
italian | 181 |
emacs | 188 |
homebrew | 176 |
ai | 130 |
avp | 152 |
expatriates | 132 |
elementaryos | 224 |
cseducators | 67 |
hsm | 70 |
expressionengine | 91 |
joomla | 124 |
freelancing | 70 |
crafts | 72 |
genealogy | 86 |
latin | 55 |
hardwarerecs | 58 |
devops | 53 |
coffee | 47 |
beer | 57 |
languagelearning | 42 |
ebooks | 54 |
bricks | 79 |
civicrm | 85 |
bioinformatics | 39 |
esperanto | 56 |
computergraphics | 30 |
conlang | 8 |
korean | 28 |
iota | 31 |
eosio | 44 |
craftcms | 26 |
iot | 10 |
drones | 6 |
cardano | 7 |
materials | 1 |
ru | 6,305 |
softwareengineering | 4,238 |
scifi | 5,176 |
workplace | 4,317 |
serverfault | 7,969 |
rpg | 4,212 |
physics | 8,362 |
superuser | 17,425 |
worldbuilding | 2,087 |
security | 3,069 |
pt | 3,718 |
unix | 6,173 |
meta | 61 |
politics | 1,468 |
stats | 2,238 |
movies | 1,577 |
photo | 1,432 |
wordpress | 3,046 |
music | 1,228 |
philosophy | 1,184 |
skeptics | 670 |
money | 1,905 |
salesforce | 1,781 |
parenting | 624 |
raspberrypi | 1,011 |
travel | 1,317 |
mechanics | 842 |
tex | 1,095 |
ux | 1,107 |
sharepoint | 1,691 |
webapps | 1,906 |
puzzling | 784 |
networkengineering | 476 |
webmasters | 854 |
sports | 455 |
rus | 514 |
space | 405 |
writers | 407 |
pets | 322 |
pm | 241 |
russian | 353 |
spanish | 366 |
sound | 365 |
quant | 340 |
sqa | 353 |
outdoors | 221 |
softwarerecs | 348 |
retrocomputing | 135 |
mythology | 103 |
portuguese | 144 |
opensource | 123 |
scicomp | 127 |
ukrainian | 87 |
patents | 137 |
sustainability | 152 |
poker | 115 |
robotics | 110 |
woodworking | 93 |
reverseengineering | 97 |
sitecore | 122 |
tor | 137 |
vi | 95 |
windowsphone | 153 |
vegetarianism | 35 |
moderators | 23 |
quantumcomputing | 46 |
musicfans | 78 |
tridion | 68 |
opendata | 45 |
tezos | 11 |
stellar | 3 |
or | 13 |
monero | 26 |
stackapps | 15 |
total | 210,748 |
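We primarily designed this dataset for training sentence embeddings. Sentence embeddings can be trained in a contrastive learning setup, where the model learns to associate each sentence with its matching counterpart among several candidates. Such models need a large number of examples to be effective, so building a dataset can be tedious. Community networks such as Stack Exchange allow us to build many examples semi-automatically.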
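To make the contrastive setup more concrete, here is a minimal sketch of an in-batch softmax loss: question i must select answer i among all answers in the batch, and embeddings of downvoted answers can be appended as extra hard negatives. This is one common formulation, not necessarily the one used by the dataset authors. The function names, the `scale` value of 20.0, and the hand-made 2-D vectors are illustrative assumptions; a real setup would use encoder outputs and a tensor library.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(question_vecs, answer_vecs, hard_negative_vecs=(), scale=20.0):
    """Mean cross-entropy in which question i must pick answer i among all candidates.

    Candidates are every answer in the batch plus any optional hard negatives.
    `scale` is an arbitrary illustrative value that sharpens the softmax.
    """
    questions = [normalize(v) for v in question_vecs]
    candidates = [normalize(v) for v in list(answer_vecs) + list(hard_negative_vecs)]
    total = 0.0
    for i, q in enumerate(questions):
        logits = [scale * dot(q, c) for c in candidates]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]  # negative log-softmax at the correct index
    return total / len(questions)


# Hand-made 2-D vectors standing in for encoder outputs (hypothetical values).
q = [[1.0, 0.0], [0.0, 1.0]]
good = [[0.9, 0.1], [0.1, 0.9]]  # answer i points the same way as question i
bad = [[0.1, 0.9], [0.9, 0.1]]   # answers swapped between the two questions

loss_aligned = contrastive_loss(q, good)
loss_swapped = contrastive_loss(q, bad)
print(loss_aligned < loss_swapped)
```

In this dataset, `title_body` would play the role of the question, `upvoted_answer` the matching answer, and `downvoted_answer` a candidate hard negative.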
The source data are dumps from Stack Exchange.
Initial Data Collection and Normalization
We collected the data from the math community.
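We filtered out questions whose title or body is shorter than 20 characters, as well as questions whose body is longer than 4096 characters. When extracting the most upvoted answer, we kept only pairs for which there is a gap of at least 100 votes between the most upvoted and the most downvoted answer.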
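The filtering rules above can be sketched as follows. This is a hedged reconstruction, not the authors' actual pipeline: the function names, the `(text, score)` answer representation, the reading of "most downvoted" as the lowest-scoring answer, and joining title and body with a single space are all assumptions made for illustration.

```python
MIN_CHARS = 20         # minimum length for both title and body
MAX_BODY_CHARS = 4096  # maximum body length
MIN_VOTE_GAP = 100     # required score gap between best and worst answer


def keep_question(title, body):
    """Apply the length filters described in the text."""
    if len(title) < MIN_CHARS or len(body) < MIN_CHARS:
        return False
    return len(body) <= MAX_BODY_CHARS


def build_pair(title, body, answers):
    """Return a record dict, or None if the question is filtered out.

    `answers` is a list of (text, score) tuples; this shape is hypothetical.
    """
    if not keep_question(title, body) or len(answers) < 2:
        return None
    best = max(answers, key=lambda a: a[1])   # highest-scoring answer
    worst = min(answers, key=lambda a: a[1])  # lowest-scoring answer (assumed "most downvoted")
    if best[1] - worst[1] < MIN_VOTE_GAP:
        return None
    return {
        "title_body": f"{title} {body}",  # assumed concatenation with a space
        "upvoted_answer": best[0],
        "downvoted_answer": worst[0],
    }


# Made-up question and answers, used only to exercise the filters.
title = "How do I reset my router?"
body = "It stopped responding after the update."
answers = [
    ("Hold the reset button for ten seconds.", 120),
    ("Buy a new one.", -5),
    ("Try unplugging it.", 3),
]
pair = build_pair(title, body, answers)
print(pair is not None)
```

Returning `None` for rejected questions keeps the filter easy to compose, for example with a list comprehension that drops the `None` entries.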
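Who are the source language producers?
Questions and answers are written by the developer community of Stack Exchange.
Please see the license information at: https://archive.org/details/stackexchange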
@misc{StackExchangeDataset,
  author = {Flax Sentence Embeddings Team},
  title = {Stack Exchange question pairs},
  year = {2021},
  howpublished = {https://huggingface.co/datasets/flax-sentence-embeddings/},
}
Thanks to the Flax Sentence Embeddings team for adding this dataset.