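Dataset: so_stacksample
Task: text-to-text generation
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
License: cc-by-sa-3.0

This dataset contains the text of 10% of the questions and answers from the Stack Overflow programming Q&A website.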
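It is organized as three tables:
- The Questions table contains the title, body, creation date, closed date (if applicable), score, and owner ID for every non-deleted Stack Overflow question whose Id is a multiple of 10.
- The Answers table contains the body, creation date, score, and owner ID for each answer to these questions. The ParentId column links back to the Questions table.
- The Tags table contains the tags on each of these questions.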
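The link between the Answers and Questions tables can be sketched in plain Python. This is a minimal illustration, not the dataset's loader: the rows below are hypothetical miniature records, and `group_answers_by_question` is a helper name invented for this example.

```python
from collections import defaultdict

# Hypothetical miniature rows; real rows carry more columns.
questions = [
    {"Id": 80, "Title": "How do I reverse a list?", "Score": 5},
    {"Id": 90, "Title": "What is a closure?", "Score": 2},
]
answers = [
    {"Id": 81, "ParentId": 80, "Score": 7, "Body": "Use reversed()."},
    {"Id": 85, "ParentId": 80, "Score": -1, "Body": "Loop backwards."},
    {"Id": 93, "ParentId": 90, "Score": 3, "Body": "A function plus its scope."},
]


def group_answers_by_question(answer_rows):
    """Map each question Id to the answers whose ParentId points at it."""
    grouped = defaultdict(list)
    for answer in answer_rows:
        grouped[answer["ParentId"]].append(answer)
    return dict(grouped)


answers_by_question = group_answers_by_question(answers)

# The sample keeps only questions whose Id is a multiple of 10.
sampled_ids_ok = all(q["Id"] % 10 == 0 for q in questions)

for q in questions:
    linked = answers_by_question.get(q["Id"], [])
    # Pick the highest-scoring answer, if the question has any.
    best = max(linked, key=lambda a: a["Score"], default=None)
    print(q["Title"], len(linked), best["Id"] if best else None)
```

Grouping by ParentId first keeps the lookup per question cheap, which matters more on the full tables than on this toy input.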
Example projects include:
Languages: English (en) and programming languages.
Answer data instance:
{
  "Id": {  # Unique ID given to the Answer post
    "feature_type": "Value",
    "dtype": "int32"
  },
  "OwnerUserId": {  # The UserID of the person who generated the Answer on StackOverflow. -1 means NA
    "feature_type": "Value",
    "dtype": "int32"
  },
  "CreationDate": {  # The date the Answer was generated. Follows standard datetime format.
    "feature_type": "Value",
    "dtype": "string"
  },
  "ParentId": {  # Refers to the `Id` of the Question the Answer belongs to.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Score": {  # The sum of up and down votes given to the Answer. Can be negative.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Body": {  # The body content of the Answer.
    "feature_type": "Value",
    "dtype": "string"
  }
}
Question data instance:
{
  "Id": {  # Unique ID given to the Question post
    "feature_type": "Value",
    "dtype": "int32"
  },
  "OwnerUserId": {  # The UserID of the person who generated the Question on StackOverflow. -1 means NA.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "CreationDate": {  # The date the Question was generated. Follows standard datetime format.
    "feature_type": "Value",
    "dtype": "string"
  },
  "ClosedDate": {  # The date the Question was closed. Follows standard datetime format. Can be NA.
    "feature_type": "Value",
    "dtype": "string"
  },
  "Score": {  # The sum of up and down votes given to the Question. Can be negative.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Title": {  # The title of the Question.
    "feature_type": "Value",
    "dtype": "string"
  },
  "Body": {  # The body content of the Question.
    "feature_type": "Value",
    "dtype": "string"
  }
}
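Because ClosedDate can be NA, a consumer usually has to decide how to detect that case. The sketch below is one hedged way to do it: the card does not say how NA is encoded, so the markers in `na_markers` (the literal string "NA", an empty string, and `None`) are assumptions, and both the `is_closed` helper and the sample rows are hypothetical.

```python
def is_closed(question, na_markers=("NA", "", None)):
    """Return True when the question carries a real ClosedDate value.

    The NA markers are an assumption; adjust them to match the actual files.
    """
    return question.get("ClosedDate") not in na_markers


# Hypothetical rows for illustration only.
sample_questions = [
    {"Id": 80, "ClosedDate": "NA", "Score": 5},
    {"Id": 90, "ClosedDate": "2012-12-26T03:45:49Z", "Score": -2},
    {"Id": 120, "ClosedDate": None, "Score": 0},
]

closed_ids = [q["Id"] for q in sample_questions if is_closed(q)]
print(closed_ids)
```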
Tag data instance:
{
  "Id": {  # ID of the Question the tag belongs to
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Tag": {  # The tag name
    "feature_type": "Value",
    "dtype": "string"
  }
}
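Since each Tags row pairs one question Id with one tag, a question with several tags appears on several rows. A minimal sketch of regrouping them follows, using hypothetical rows and an illustrative helper name (`tags_by_question`) that is not part of the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows for illustration; each row is one (question Id, tag) pair.
tag_rows = [
    {"Id": 80, "Tag": "python"},
    {"Id": 80, "Tag": "list"},
    {"Id": 90, "Tag": "javascript"},
    {"Id": 120, "Tag": "python"},
]


def tags_by_question(rows):
    """Collect all tags that share the same question Id, in row order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Id"]].append(row["Tag"])
    return dict(grouped)


question_tags = tags_by_question(tag_rows)

# Count how many questions carry each tag.
tag_counts = Counter(row["Tag"] for row in tag_rows)

print(question_tags)
print(tag_counts.most_common(1))
```

The per-question tag lists are the natural target for a tag-prediction task, while the counts give a quick view of how the sample spreads across programming languages.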
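Answer data fields:
- Id: Unique ID given to the Answer post.
- OwnerUserId: The user ID of the person who generated the Answer on Stack Overflow. -1 means NA.
- CreationDate: The date the Answer was generated. Follows standard datetime format.
- ParentId: Refers to the Id of the Question the Answer belongs to.
- Score: The sum of up and down votes given to the Answer. Can be negative.
- Body: The body content of the Answer.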
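Two of the answer fields need light normalization before analysis: OwnerUserId uses -1 for NA, and CreationDate is stored as a string. The sketch below shows one possible cleanup. The timestamp layout in `ASSUMED_DATE_FORMAT` is an assumption (the card only says "standard datetime format"), and `clean_answer` and the sample row are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: timestamps look like "2008-08-01T14:45:37Z". Verify against the files.
ASSUMED_DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def clean_answer(row):
    """Return a copy of the row with NA owners as None and a parsed UTC date."""
    owner = row["OwnerUserId"]
    created = datetime.strptime(row["CreationDate"], ASSUMED_DATE_FORMAT)
    return {
        **row,
        "OwnerUserId": None if owner == -1 else owner,
        "CreationDate": created.replace(tzinfo=timezone.utc),
    }


# Hypothetical row for illustration only.
raw_answer = {
    "Id": 92,
    "OwnerUserId": -1,
    "CreationDate": "2008-08-01T14:45:37Z",
    "ParentId": 90,
    "Score": 13,
    "Body": "<p>Example body</p>",
}

cleaned = clean_answer(raw_answer)
print(cleaned["OwnerUserId"], cleaned["CreationDate"].isoformat())
```

Mapping -1 to `None` keeps the sentinel from being mistaken for a real user ID when grouping answers by owner.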
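Question data fields:
- Id: Unique ID given to the Question post.
- OwnerUserId: The user ID of the person who generated the Question on Stack Overflow. -1 means NA.
- CreationDate: The date the Question was generated. Follows standard datetime format.
- ClosedDate: The date the Question was closed. Follows standard datetime format. Can be NA.
- Score: The sum of up and down votes given to the Question. Can be negative.
- Title: The title of the Question.
- Body: The body content of the Question.
Tag data fields:
- Id: ID of the Question the tag belongs to.
- Tag: The tag name.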
The dataset has 3 splits, one for each of the three tables (questions, answers, and tags).
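Kaggle also hosts datasets of all R questions and all Python questions, but this dataset is especially useful for analyses that span many languages.
[More Information Needed]
Who are the source language producers? Users of StackOverflow.
[More Information Needed]
Who are the annotators? [More Information Needed]
This data contains information that can identify individual users of StackOverflow. The information is self-reported.
[More Information Needed]
StackOverflow answers are not guaranteed to be safe, secure, or correct. Some answers are purposefully insecure, such as the answer https://stackoverflow.com/a/35571883/5768407 by user zys, which shows a solution intended to bypass Google Play store security checks. Such answers can bias models trained on this data and further propagate unsafe and insecure programming practices.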
[More Information Needed]
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
The content is from Stack Overflow.
Thanks to @ncoop57 for adding this dataset.