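Summary: A dataset of questions, answers, and comments from ru.stackoverflow.com.
Point of contact: Ilya Gusev
Languages: The dataset is in Russian, with some programming code.
Prerequisites: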
pip install datasets zstandard jsonlines pysimdjson
Loading:
from datasets import load_dataset

# Load the train split and print each question's Markdown text
dataset = load_dataset('IlyaGusev/ru_stackoverflow', split="train")
for example in dataset:
    print(example["text_markdown"])
    print()
{
  "question_id": 11235,
  "answer_count": 1,
  "url": "https://ru.stackoverflow.com/questions/11235",
  "score": 2,
  "tags": ["c++", "сериализация"],
  "title": "Извлечение из файла, запись в файл",
  "views": 1309,
  "author": "...",
  "timestamp": 1303205289,
  "text_html": "...",
  "text_markdown": "...",
  "comments": {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678]
  },
  "answers": {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_html": ["...", "..."],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
    "comments": {
      "text": ["...", "..."],
      "author": ["...", "..."],
      "comment_id": [11246, 11249],
      "score": [0, 0],
      "timestamp": [1303207961, 1303207800]
    }
  }
}
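As a rough illustration of reading the fields shown in the instance above, the sketch below picks the accepted answer and converts the question timestamp to a datetime. It assumes that a truthy `is_accepted` value marks the accepted answer and that timestamps are Unix seconds in UTC; neither is stated explicitly by the dataset card. The function names and the `sample` record are hypothetical and contain only the fields used here.

```python
from datetime import datetime, timezone


def accepted_answer_text(example):
    """Return the Markdown text of the first accepted answer, or None."""
    answers = example["answers"]
    # "answers" is a dict of parallel lists, so walk two of them together.
    for flag, text in zip(answers["is_accepted"], answers["text_markdown"]):
        if flag:  # assumption: 1 means accepted, 0 means not accepted
            return text
    return None


def posted_at(example):
    """Interpret the question timestamp as Unix seconds in UTC (assumption)."""
    return datetime.fromtimestamp(example["timestamp"], tz=timezone.utc)


# Hypothetical record with placeholder texts, shaped like the instance above.
sample = {
    "timestamp": 1303205289,
    "answers": {
        "is_accepted": [1, 0],
        "text_markdown": ["accepted answer text", "other answer text"],
    },
}

print(accepted_answer_text(sample))
print(posted_at(sample).isoformat())
```

Because the answer fields are stored as parallel lists, `zip` keeps each flag aligned with its text without converting the whole structure first.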
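You can use this small helper to unflatten sequences, turning a dict of parallel lists back into a list of per-item dicts: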
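def revert_flattening(records):
    # records: dict mapping each field name to a list of per-item values
    fixed_records = []
    for key, values in records.items():
        # Create one empty dict per item on the first field encountered
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records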
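A short usage sketch for the helper follows. The helper is repeated so the snippet runs on its own, and the `comments` value is a hypothetical sample shaped like the "comments" field of a question, with made-up texts.

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Hypothetical flattened comments: one list per field, aligned by position.
comments = {
    "text": ["first comment", "second comment"],
    "comment_id": [11236, 11237],
    "score": [0, 0],
}

unflattened = revert_flattening(comments)
for comment in unflattened:
    print(comment["comment_id"], comment["text"])
```

Each resulting dict holds the fields of a single comment, which is usually easier to pass around than the parallel lists.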
The original JSONL is already unflattened.
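The dataset is not anonymized, so individuals' names can be found in it. Where possible, the dataset includes information about the original authors.
In accordance with the license of the original data, the dataset is distributed under CC BY-SA 2.5.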