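Summary: A dataset of posts and comments from habr.com, a Russian blog about IT, computer science, and Internet-related topics.
Script: create_habr.py
Point of contact: Ilya Gusev
Languages: Russian, English, some programming code.
Prerequisites: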
pip install datasets zstandard jsonlines pysimdjson
Dataset iteration:
from datasets import load_dataset
dataset = load_dataset('IlyaGusev/habr', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
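Because the dataset is streamed, it is often convenient to stop after a handful of matching records instead of iterating over everything. The following is a minimal sketch of that idea, not part of the dataset tooling: `take_by_language` is a hypothetical helper, and the stand-in records (with made-up ids) only mimic the `language` field so the example runs without a download. With the real data, the streaming dataset object could be passed in place of `sample_stream`.

```python
from itertools import islice


def take_by_language(examples, language, limit):
    """Return up to `limit` records whose "language" field equals `language`.

    `examples` can be any iterable of dicts; iteration stops as soon as
    enough matches have been collected.
    """
    matching = (ex for ex in examples if ex.get("language") == language)
    return list(islice(matching, limit))


# Stand-in records with hypothetical ids; not real dataset content.
sample_stream = [
    {"id": 1, "language": "ru", "text_markdown": "..."},
    {"id": 2, "language": "en", "text_markdown": "..."},
    {"id": 3, "language": "ru", "text_markdown": "..."},
    {"id": 4, "language": "ru", "text_markdown": "..."},
]

russian_posts = take_by_language(sample_stream, "ru", 2)
print([post["id"] for post in russian_posts])
```

Using a generator expression with `islice` keeps the filter lazy, which matters for a streamed source: no more records are pulled than needed to fill the limit.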
{
  "id": 12730,
  "language": "ru",
  "url": "https://habr.com/ru/post/12730/",
  "text_markdown": "...",
  "text_html": "...",
  "lead_markdown": "...",
  "lead_html": "...",
  "type": "article",
  "labels": [],
  "original_author": null,
  "original_url": null,
  "time_published": 1185962380,
  "author": "...",
  "title": "Хочешь в университет — сделай презентацию",
  "statistics": {
    "commentsCount": 23,
    "favoritesCount": 1,
    "readingCount": 1542,
    "score": 7,
    "votesCount": 15,
    "votesCountPlus": 11,
    "votesCountMinus": 4
  },
  "hubs": ["itcompanies"],
  "flows": ["popsci"],
  "tags": ["PowerPoint", "презентация", "абитуриенты"],
  "reading_time": 1,
  "format": null,
  "complexity": null,
  "comments": {
    "id": [11653537, 11653541],
    "parent_id": [null, 11653537],
    "level": [0, 1],
    "time_published": [1185963192, 1185967886],
    "score": [-1, 0],
    "votes": [1, 0],
    "message_html": ["...", "..."],
    "author": ["...", "..."],
    "children": [[11653541], []]
  }
}
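The numeric fields in a record like the one above can be turned into more readable values. The sketch below assumes that `time_published` is a Unix timestamp in seconds (the card does not state this; it is inferred from the magnitude of the value) and that `score` equals upvotes minus downvotes, which holds for this one sample but is not confirmed for the whole dataset. `summarize_post` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def summarize_post(post):
    """Derive a few readable values from a post record."""
    stats = post["statistics"]
    # Assumption: time_published is seconds since the Unix epoch.
    published = datetime.fromtimestamp(post["time_published"], tz=timezone.utc)
    plus = stats["votesCountPlus"]
    minus = stats["votesCountMinus"]
    total = stats["votesCount"]
    return {
        "published_utc": published.isoformat(),
        # Share of positive votes; None when nobody voted.
        "upvote_share": plus / total if total else None,
        # Assumption being checked: score == upvotes - downvotes.
        "score_matches_votes": stats["score"] == plus - minus,
    }


# Only the fields used here, copied from the sample record.
post = {
    "time_published": 1185962380,
    "statistics": {
        "score": 7,
        "votesCount": 15,
        "votesCountPlus": 11,
        "votesCountMinus": 4,
    },
}

summary = summarize_post(post)
print(summary["published_utc"], summary["upvote_share"])
```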
You can use this small helper to unflatten the sequences into a list of nested records:
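def revert_flattening(records):
    # Turn a dict of parallel lists into a list of dicts, one per item.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            # Allocate one empty record per item on the first key.
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records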
The original JSONL is already unflattened.
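To illustrate what the helper produces, here is a small sketch that applies it to the `comments` field of the sample record and then walks the comment thread in reply order. `revert_flattening` is repeated so the block runs on its own; `walk_thread` is a hypothetical addition, and it assumes that every id listed in `children` also appears in `id`, which is true for the sample but not verified for the full dataset.

```python
def revert_flattening(records):
    # Turn a dict of parallel lists into a list of dicts, one per item.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


def walk_thread(rows):
    """Return comments depth-first: each top-level comment, then its replies."""
    by_id = {row["id"]: row for row in rows}
    ordered = []

    def visit(comment_id):
        row = by_id[comment_id]
        ordered.append(row)
        for child_id in row["children"]:
            visit(child_id)

    for row in rows:
        if row["parent_id"] is None:  # top-level comments have no parent
            visit(row["id"])
    return ordered


# The "comments" field from the sample record (text and author fields omitted).
comments = {
    "id": [11653537, 11653541],
    "parent_id": [None, 11653537],
    "level": [0, 1],
    "score": [-1, 0],
    "children": [[11653541], []],
}

rows = revert_flattening(comments)
for row in walk_thread(rows):
    print("  " * row["level"] + str(row["id"]), "score:", row["score"])
```

Each element of `rows` is a plain dict with the same keys as the flattened field, so a single comment can be handled without indexing into parallel lists.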
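The dataset is not anonymized, so it may contain the names of individuals. Information about the original authors is included in the dataset where possible.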