Summary: A dataset of posts and comments from pikabu.ru, a Russian website similar to Reddit/9gag.
Contact: Ilya Gusev
Languages: Mostly Russian.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Dataset iteration:
from datasets import load_dataset

dataset = load_dataset('IlyaGusev/pikabu', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
{
  "id": 69911642,
  "title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
  "text_markdown": "...",
  "timestamp": 1571221527,
  "author_id": 2900955,
  "username": "chinatoday.ru",
  "rating": -4,
  "pluses": 9,
  "minuses": 13,
  "url": "...",
  "tags": ["Китай", "AliExpress", "Бизнес"],
  "blocks": {"data": ["...", "..."], "type": ["text", "text"]},
  "comments": {
    "id": [152116588, 152116426],
    "text_markdown": ["...", "..."],
    "text_html": ["...", "..."],
    "images": [[], []],
    "rating": [2, 0],
    "pluses": [2, 0],
    "minuses": [0, 0],
    "author_id": [2104711, 2900955],
    "username": ["FlyZombieFly", "chinatoday.ru"]
  }
}
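As a small illustration of reading the scalar fields of such a record, the sketch below converts `timestamp` to a date and compares `rating` with the vote counts. It assumes that `timestamp` is Unix time in seconds, which the source does not state explicitly. The relation between `rating`, `pluses`, and `minuses` is only checked against this one sample and may not hold in general. The names `record`, `posted_at`, and `vote_balance` are illustrative.

```python
from datetime import datetime, timezone

# Fields copied from the sample record above (other fields omitted).
record = {"timestamp": 1571221527, "rating": -4, "pluses": 9, "minuses": 13}

# Assumption: `timestamp` is Unix time in seconds.
posted_at = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
print(posted_at.isoformat())

# Observed on this sample only; not confirmed as a rule for the whole dataset.
vote_balance = record["pluses"] - record["minuses"]
print(vote_balance == record["rating"])
```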
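Nested sequences such as "comments" are stored flattened, as a dict of parallel lists. You can use this helper function to unflatten them into a list of per-item dicts: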
def revert_flattening(records):
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
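A minimal usage sketch for the helper, applied to a subset of the "comments" fields from the sample record. The function is repeated here only so that the snippet runs on its own. The variable name `comments` and the choice of fields are illustrative.

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet is self-contained.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# A subset of the flattened "comments" field from the sample record.
comments = {
    "id": [152116588, 152116426],
    "rating": [2, 0],
    "username": ["FlyZombieFly", "chinatoday.ru"],
}

# Each comment becomes its own dict with the same keys.
for comment in revert_flattening(comments):
    print(comment["id"], comment["username"], comment["rating"])
```

With a streamed example, the same call would presumably be `revert_flattening(example["comments"])`.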
The dataset is not anonymized, so individuals' names can be found in it. Information about the original authors is included where possible.