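Russian StackOverflow dataset

Description

Summary: A dataset of questions, answers, and comments from ru.stackoverflow.com.

Script: create_stackoverflow.py

Point of contact: Ilya Gusev

Languages: The dataset is in Russian and contains some programming code.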

Usage

Prerequisites:

pip install datasets zstandard jsonlines pysimdjson

Loading:

from datasets import load_dataset
dataset = load_dataset('IlyaGusev/ru_stackoverflow', split="train")
for example in dataset:
    print(example["text_markdown"])
    print()

Data instances

{
  "question_id": 11235,
  "answer_count": 1,
  "url": "https://ru.stackoverflow.com/questions/11235",
  "score": 2,
  "tags": ["c++", "сериализация"],
  "title": "Извлечение из файла, запись в файл",
  "views": 1309,
  "author": "...",
  "timestamp": 1303205289,
  "text_html": "...",
  "text_markdown": "...",
  "comments": {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678]
  },
  "answers": {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_html": ["...", "..."],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
    "comments": {
      "text": ["...", "..."],
      "author": ["...", "..."],
      "comment_id": [11246, 11249],
      "score": [0, 0],
      "timestamp": [1303207961, 1303207800]
    }
  }
}
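Because answers are stored column-wise (one list per field), picking out a single answer means indexing every list at the same position. The sketch below shows one way to pull the accepted answer from such a structure. It uses a hand-written stand-in for example["answers"] copied from the sample above, the helper name accepted_answer is hypothetical, and it assumes that is_accepted is 1 for the accepted answer and that timestamps are Unix seconds in UTC.

```python
from datetime import datetime, timezone

# Hand-written stand-in for example["answers"], copied from the sample above.
# The "..." placeholders are kept as-is; nested comments are left out.
answers = {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
}


def accepted_answer(answers):
    """Return the accepted answer as one dict, or None if no answer is accepted.

    Hypothetical helper: only list-valued fields are indexed, so any
    field that is not a plain list is skipped.
    """
    for i, flag in enumerate(answers["is_accepted"]):
        if flag:
            return {
                key: values[i]
                for key, values in answers.items()
                if isinstance(values, list)
            }
    return None


best = accepted_answer(answers)
# Assumption: timestamps are Unix seconds, interpreted here as UTC.
posted = datetime.fromtimestamp(best["timestamp"], tz=timezone.utc)
print(best["answer_id"], best["score"], posted.isoformat())
```

Returning None when nothing is accepted keeps the caller in charge of how to handle unanswered or unresolved questions.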

You can use this small helper to unflatten sequences, turning a dict of parallel lists back into a list of records:

def revert_flattening(records):
    # Turn a dict of parallel lists into a list of per-item dicts.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            # On the first key, allocate one empty dict per item.
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records

The original JSONL is already unflattened.
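As a usage sketch, the snippet below applies revert_flattening to the comments field of the sample instance. The comments dict is a hand-written stand-in for example["comments"] with the "..." placeholders kept as they appear in the sample; the helper is repeated so the snippet runs on its own.

```python
def revert_flattening(records):
    # Turn a dict of parallel lists into a list of per-item dicts.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Hand-written stand-in for example["comments"] from the sample instance.
comments = {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678],
}

rows = revert_flattening(comments)
for row in rows:
    # Each row is now a single comment with all of its fields.
    print(row["comment_id"], row["score"], row["timestamp"])
```

Each element of rows holds one comment, which is usually easier to filter or sort than five parallel lists.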

Source data

The data source is the Russian StackOverflow website.
Original XMLs: ru.stackoverflow.com.7z.
The processing script is create_stackoverflow.py.

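Personal and sensitive information

The dataset is not anonymized, so individuals' names can be found in it. Where possible, the dataset includes information about the original authors.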

Licensing information

In accordance with the license of the original data, this dataset is distributed under CC BY-SA 2.5.

Author:

IlyaGusev

Dataset size:

639.42 MB