Datasets:

flax-sentence-embeddings/stackexchange_title_best_voted_answer_jsonl

Tasks:

Question Answering

Languages:

en

Multilinguality:

multilingual

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

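Dataset Card Creation Guide

Dataset Summary

We automatically extracted question and answer (Q&A) pairs from the Stack Exchange network. Stack Exchange gathers many Q&A communities across 50 online platforms, including the well-known Stack Overflow and other technical sites. 100 million developers visit Stack Exchange every month. The dataset is a parallel corpus in which each question is mapped to its top-rated answer. It is split by community and covers a variety of domains, such as 3D printing, economics, Raspberry Pi, and Emacs. An exhaustive list of all communities is available here.

Languages

Stack Exchange content is mainly in English (en).

Dataset Structure

Data Instances

Each data sample is presented as follows: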

{'title_body': "Is there a Stack Exchange icon available? StackAuth /sites route provides all the site's icons except for the one of the Stack Exchange master site.\nCould you please provide it in some way (a static SVG would be good)?",
 'upvoted_answer': 'Here it is!\n\nDead link: SVG version here\nNote: the same restrictions on this trademarked icon that apply here, also apply to the icon above.',
 'downvoted_answer': 'No, the /sites route is not the right place for that.\n\n/sites enumerates all websites that expose API end-points. StackExchange.com does not expose such an endpoint, so it does not (and will not) appear in the results.'}

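This particular example corresponds to the following page.

Data Fields

The dataset contains the following fields:

  • title_body: the concatenation of the question title and body
  • upvoted_answer: the body of the answer with the most votes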
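As a minimal sketch of how these fields might be read, the snippet below parses JSON Lines records (one JSON object per line, as the `_jsonl` suffix of the dataset name suggests) into a small container. The `QAPair` class and `parse_jsonl` helper are hypothetical names introduced for illustration, and the sample record is shortened from the instance shown above. Any extra keys in a record are simply ignored.

```python
import json
from dataclasses import dataclass


@dataclass
class QAPair:
    # Hypothetical container mirroring the two documented fields.
    title_body: str
    upvoted_answer: str


def parse_jsonl(lines):
    """Yield a QAPair for every non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield QAPair(record["title_body"], record["upvoted_answer"])


# Shortened sample record; real records carry the full question and answer text.
sample = [
    '{"title_body": "Is there a Stack Exchange icon available?", "upvoted_answer": "Here it is!"}',
    "",
]
pairs = list(parse_jsonl(sample))
print(pairs[0].upvoted_answer)
```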

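Data Splits

We provide multiple splits for this dataset, each corresponding to a specific community channel. The number of pairs in each split is listed below: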

Split Number of pairs
gaming 82,887
dba 71,449
codereview 41,748
gis 100,254
english 100,640
mathoverflow 85,289
askubuntu 267,135
electronics 129,494
apple 92,487
diy 52,896
magento 79,241
gamedev 40,154
mathematica 59,895
ell 77,892
judaism 26,085
drupal 67,817
blender 54,153
biology 19,277
android 38,077
crypto 19,404
christianity 11,498
cs 30,010
academia 32,137
chemistry 27,061
aviation 18,755
history 10,766
japanese 20,948
cooking 22,641
law 16,133
hermeneutics 9,516
hinduism 8,999
graphicdesign 28,083
dsp 17,430
bicycles 15,708
ethereum 26,124
ja 17,376
arduino 16,281
bitcoin 22,474
islam 10,052
datascience 20,503
german 13,733
codegolf 8,211
boardgames 11,805
economics 8,844
emacs 16,830
buddhism 6,787
gardening 13,246
astronomy 9,086
anime 10,131
fitness 8,297
cstheory 7,742
engineering 8,649
chinese 8,646
linguistics 6,843
cogsci 5,101
french 10,578
literature 3,539
ai 5,763
craftcms 11,236
health 4,494
chess 6,392
interpersonal 3,398
expressionengine 10,742
earthscience 4,396
civicrm 10,648
joomla 5,887
homebrew 5,608
latin 3,969
ham 3,501
hsm 2,517
avp 6,450
expatriates 4,913
matheducators 2,706
genealogy 2,895
3dprinting 3,488
devops 3,462
bioinformatics 3,135
computergraphics 2,306
elementaryos 5,917
martialarts 1,737
hardwarerecs 2,050
lifehacks 2,576
crafts 1,659
italian 3,101
freelancing 1,663
materials 1,101
bricks 3,530
cseducators 902
eosio 1,940
iot 1,359
languagelearning 948
beer 1,012
ebooks 1,107
coffee 1,188
esperanto 1,466
korean 1,406
cardano 248
conlang 334
drones 496
iota 775
salesforce 87,272
wordpress 83,621
rpg 40,435
scifi 54,805
stats 115,679
serverfault 238,507
physics 141,230
sharepoint 80,420
security 51,355
worldbuilding 26,210
softwareengineering 51,326
superuser 352,610
meta 1,000
money 29,404
travel 36,533
photo 23,204
webmasters 30,370
workplace 24,012
ux 28,901
philosophy 13,114
music 19,936
politics 11,047
movies 18,243
space 12,893
skeptics 8,145
raspberrypi 24,143
rus 16,528
puzzling 17,448
webapps 24,867
mechanics 18,613
writers 9,867
networkengineering 12,590
parenting 5,998
softwarerecs 11,761
quant 12,933
spanish 7,675
scicomp 7,036
pets 6,156
sqa 9,256
sitecore 7,838
vi 9,000
outdoors 5,278
sound 8,303
pm 5,435
reverseengineering 5,817
retrocomputing 3,907
tridion 5,907
quantumcomputing 4,320
sports 4,707
robotics 4,648
russian 3,937
opensource 3,221
woodworking 2,955
ukrainian 1,767
opendata 3,842
patents 3,573
mythology 1,595
portuguese 1,964
tor 4,167
monero 3,508
sustainability 1,674
musicfans 2,431
poker 1,665
or 1,490
windowsphone 2,807
stackapps 1,518
moderators 504
vegetarianism 585
tezos 1,169
stellar 1,078
pt 103,277
unix 155,414
tex 171,628
ru 253,289
total 4,750,619
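Because the splits differ widely in size, it can be useful to turn the counts above into relative shares, for instance when sampling from several communities at once. The following is only an illustrative sketch: `parse_counts` is a hypothetical helper, and the three rows are copied from the listing above rather than covering every split.

```python
def parse_counts(lines):
    """Map each split name to its pair count, e.g. 'gaming 82,887' -> 82887."""
    counts = {}
    for line in lines:
        # The count is the last space-separated token; drop thousands separators.
        name, _, number = line.strip().rpartition(" ")
        counts[name] = int(number.replace(",", ""))
    return counts


rows = ["gaming 82,887", "dba 71,449", "codereview 41,748"]
counts = parse_counts(rows)
subtotal = sum(counts.values())

# Share of each split within this three-split subset only.
shares = {name: n / subtotal for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```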

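Dataset Creation

Curation Rationale

We designed this dataset primarily for training sentence embeddings. Sentence embeddings can be trained in a contrastive learning setup, in which the model learns to associate each sentence with its matching counterpart among multiple candidates. Such models need a large number of examples to be effective, so building a dataset by hand can be tedious. Community networks such as Stack Exchange make it possible to build many examples semi-automatically.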
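One common way to realize such a contrastive setup is a softmax cross-entropy over in-batch candidates: for each question, its own answer is the positive and the other answers in the batch serve as negatives. The sketch below is an assumption about how this could look, not the training code used for this dataset. The function name, the `scale` parameter, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(question_vecs, answer_vecs, scale=1.0):
    """Mean cross-entropy where answer i is the positive for question i
    and every other answer in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [scale * dot(q, a) for a in answer_vecs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


# Toy embeddings (assumed values): aligned pairs should give a lower loss.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = in_batch_contrastive_loss(questions, aligned, scale=5.0)
loss_swapped = in_batch_contrastive_loss(questions, swapped, scale=5.0)
print(loss_aligned < loss_swapped)
```

Using the rest of the batch as negatives means no explicit negative answers have to be stored, which fits a dataset that provides only question and top-answer pairs.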

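Source Data

The source data comes from Stack Exchange.

Initial Data Collection and Normalization

We collected the data from the math community.

We filtered out questions whose title or body is shorter than 20 characters, as well as questions whose body is longer than 4096 characters.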
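The length filter described above could be expressed roughly as follows. This is a sketch under stated assumptions: lengths are measured with Python's `len` on the raw strings, a body of exactly 4096 characters is kept, and the constant and function names are hypothetical.

```python
MIN_CHARS = 20         # minimum length for both title and body
MAX_BODY_CHARS = 4096  # maximum length for the body


def keep_question(title, body):
    """Return True if the question passes the length filter."""
    if len(title) < MIN_CHARS or len(body) < MIN_CHARS:
        return False  # title or body too short
    return len(body) <= MAX_BODY_CHARS  # assumed inclusive upper bound


title = "Is there a Stack Exchange icon available?"
print(keep_question(title, "x" * 100))    # body within limits
print(keep_question("Too short", "x" * 100))
print(keep_question(title, "x" * 4097))   # body too long
```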

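Who are the source language producers?

Questions and answers are written by the developers of the Stack Exchange community.

Additional Information

Licensing Information

Please see the license information at https://archive.org/details/stackexchange.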

Citation Information

@misc{StackExchangeDataset,
  author = {Flax Sentence Embeddings Team},
  title = {Stack Exchange question pairs},
  year = {2021},
  howpublished = {https://huggingface.co/datasets/flax-sentence-embeddings/},
}

Contributions

Thanks to the Flax Sentence Embeddings team for adding this dataset.