Datasets:

IlyaGusev/pikabu

Languages:

ru

Size:

1M<n<10M

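Pikabu dataset

Description

Summary: A dataset of posts and comments from pikabu.ru, a Russian website similar to Reddit/9gag.

Script: convert_pikabu.py

Point of Contact: Ilya Gusev

Languages: Mostly Russian.

Usage

Prerequisites: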

pip install datasets zstandard jsonlines pysimdjson

Dataset iteration:

from datasets import load_dataset
dataset = load_dataset('IlyaGusev/pikabu', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
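
The loop above walks the whole training split. For a quick look at a few posts, a bounded iteration may be more convenient. The sketch below is an illustration only, not part of the dataset's tooling: `preview_titles`, `limit`, and `sample_posts` are hypothetical names, and the `title` field is assumed to be present as in the data instance on this card.

```python
from itertools import islice


def preview_titles(examples, limit=3):
    """Return the titles of the first `limit` posts from any iterable of examples."""
    # islice stops after `limit` items, so a streamed dataset is not read to the end.
    return [example["title"] for example in islice(examples, limit)]


# Stand-in records for illustration only; a streamed dataset yields dicts the same way.
sample_posts = [{"title": "first"}, {"title": "second"}, {"title": "third"}, {"title": "fourth"}]
print(preview_titles(sample_posts, limit=2))
```

Passing the streaming `dataset` object from the snippet above in place of `sample_posts` should work the same way, since the helper only relies on iteration.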

Data Instances

{
  "id": 69911642,
  "title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
  "text_markdown": "...",
  "timestamp": 1571221527,
  "author_id": 2900955,
  "username": "chinatoday.ru",
  "rating": -4,
  "pluses": 9,
  "minuses": 13,
  "url": "...",
  "tags": ["Китай", "AliExpress", "Бизнес"],
  "blocks": {"data": ["...", "..."], "type": ["text", "text"]},
  "comments": {
    "id": [152116588, 152116426],
    "text_markdown": ["...", "..."],
    "text_html": ["...", "..."],
    "images": [[], []],
    "rating": [2, 0],
    "pluses": [2, 0],
    "minuses": [0, 0],
    "author_id": [2104711, 2900955],
    "username": ["FlyZombieFly", "chinatoday.ru"]
  }
}

You can use this helper to unflatten sequences:

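def revert_flattening(records):
    """Turn a dict of parallel lists into a list of per-item dicts."""
    fixed_records = []
    for key, values in records.items():
        # Allocate one empty dict per item the first time a column is seen.
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        # Copy the i-th value of this column into the i-th record.
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records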
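
As a rough illustration of what the helper does, the sketch below applies it to an abridged copy of the `comments` field from the data instance. The variable names `flat_comments` and `comments` are chosen here for illustration, and only three of the comment fields are kept for brevity.

```python
def revert_flattening(records):
    # Same helper as above, repeated so this snippet runs on its own.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Flattened comments, abridged from the data instance on this card.
flat_comments = {
    "id": [152116588, 152116426],
    "rating": [2, 0],
    "username": ["FlyZombieFly", "chinatoday.ru"],
}

# Each comment becomes its own dict with the same keys.
comments = revert_flattening(flat_comments)
for comment in comments:
    print(comment["id"], comment["username"], comment["rating"])
```

When iterating over the dataset, the same call would be made on `example["comments"]` for each post.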

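Source Data

Personal and Sensitive Information

The dataset is not anonymized, so individuals' names can be found in it. Information about the original authors is included where possible.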