Summary: A dataset of questions, answers, and comments from ru.stackoverflow.com.
Contact: Ilya Gusev
Languages: The dataset is in Russian and contains some programming code.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Loading:
from datasets import load_dataset

dataset = load_dataset('IlyaGusev/ru_stackoverflow', split="train")
for example in dataset:
    print(example["text_markdown"])
    print()
Example record:
{
  "question_id": 11235,
  "answer_count": 1,
  "url": "https://ru.stackoverflow.com/questions/11235",
  "score": 2,
  "tags": ["c++", "сериализация"],
  "title": "Извлечение из файла, запись в файл",
  "views": 1309,
  "author": "...",
  "timestamp": 1303205289,
  "text_html": "...",
  "text_markdown": "...",
  "comments": {
    "text": ["...", "..."],
    "author": ["...", "..."],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678]
  },
  "answers": {
    "answer_id": [11243, 11245],
    "timestamp": [1303207791, 1303207792],
    "is_accepted": [1, 0],
    "text_html": ["...", "..."],
    "text_markdown": ["...", "..."],
    "score": [3, 0],
    "author": ["...", "..."],
    "comments": {
      "text": ["...", "..."],
      "author": ["...", "..."],
      "comment_id": [11246, 11249],
      "score": [0, 0],
      "timestamp": [1303207961, 1303207800]
    }
  }
}
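In the sample record, the fields under "answers" are parallel lists: position i in each list describes the same answer. As a minimal sketch of reading them together, the snippet below picks out accepted answers. The `answers` dictionary is a hand-written fragment that mirrors the sample, and `accepted_answer_ids` is a hypothetical helper name, not part of the dataset or of the `datasets` library.

```python
# Hand-written fragment mirroring the "answers" field of the sample record.
answers = {
    "answer_id": [11243, 11245],
    "is_accepted": [1, 0],
    "score": [3, 0],
    "text_markdown": ["...", "..."],
}


def accepted_answer_ids(answers):
    """Return the ids of answers whose is_accepted flag is non-zero.

    Hypothetical helper: it walks the parallel lists in lockstep with zip.
    """
    return [
        answer_id
        for answer_id, accepted in zip(answers["answer_id"], answers["is_accepted"])
        if accepted
    ]


print(accepted_answer_ids(answers))
```

The same zip pattern should work for any pair of fields under "comments" or "answers", as long as the lists have equal length.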
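You can use this helper to unflatten sequences, turning a dictionary of parallel lists back into a list of per-item dictionaries:
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        # Create one empty dict per item the first time a list is seen.
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        # Distribute the i-th value of each field to the i-th record.
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records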
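The original JSONL is already unflattened, so the helper is only needed for records loaded through the datasets library.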
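As an illustration of what the helper does, here is a small self-contained sketch that applies it to a comments dictionary. The helper is repeated so the snippet runs on its own; the `comments` values are placeholders modeled on the sample record, not real data from the dataset.

```python
def revert_flattening(records):
    """Turn a dict of parallel lists into a list of per-item dicts."""
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Placeholder values shaped like the "comments" field of the sample record.
comments = {
    "text": ["first comment", "second comment"],
    "author": ["user_a", "user_b"],
    "comment_id": [11236, 11237],
    "score": [0, 0],
    "timestamp": [1303205411, 1303205678],
}

rows = revert_flattening(comments)
for row in rows:
    print(row["comment_id"], row["author"], row["text"])
```

Each element of `rows` is a plain dictionary for one comment, which is usually easier to filter or sort than the column-wise layout.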
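The dataset is not anonymized, so individuals' names can be found in it. Where possible, the dataset includes information about the original authors.
In accordance with the license of the original data, this dataset is distributed under CC BY-SA 2.5.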