Summary: pikabu.ru is a Russian website similar to Reddit/9gag. This dataset contains posts and comments from that site.
Contact: Ilya Gusev
Language: mostly Russian.
Prerequisites:
pip install datasets zstandard jsonlines pysimdjson
Iterating over the dataset:
from datasets import load_dataset
dataset = load_dataset('IlyaGusev/pikabu', split="train", streaming=True)
for example in dataset:
    print(example["text_markdown"])
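Because a streaming split is consumed as a plain iterable, the loop above can be capped to preview a few records without reading the whole dataset. The following is a minimal sketch under that assumption; `first_texts` and `fake_dataset` are hypothetical names introduced here, and the stand-in records only mimic the `text_markdown` field.

```python
from itertools import islice


def first_texts(dataset, limit=3):
    """Collect text_markdown from the first `limit` records of any iterable."""
    return [example["text_markdown"] for example in islice(dataset, limit)]


# Stand-in records so the sketch runs offline (hypothetical data).
# With the real data, pass the streaming `dataset` object instead.
fake_dataset = [{"text_markdown": f"post {i}"} for i in range(10)]

preview = first_texts(fake_dataset, limit=3)
print(preview)
```

Using `islice` keeps the helper independent of the `datasets` library, so it works equally on a list, a generator, or a streaming split.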
Example record:
{
    "id": 69911642,
    "title": "Что можно купить в Китае за цену нового iPhone 11 Pro",
    "text_markdown": "...",
    "timestamp": 1571221527,
    "author_id": 2900955,
    "username": "chinatoday.ru",
    "rating": -4,
    "pluses": 9,
    "minuses": 13,
    "url": "...",
    "tags": ["Китай", "AliExpress", "Бизнес"],
    "blocks": {"data": ["...", "..."], "type": ["text", "text"]},
    "comments": {
        "id": [152116588, 152116426],
        "text_markdown": ["...", "..."],
        "text_html": ["...", "..."],
        "images": [[], []],
        "rating": [2, 0],
        "pluses": [2, 0],
        "minuses": [0, 0],
        "author_id": [2104711, 2900955],
        "username": ["FlyZombieFly", "chinatoday.ru"]
    }
}
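The numeric fields of a record can be interpreted with the standard library alone. The sketch below assumes that `timestamp` is a Unix timestamp in seconds, and it checks whether `rating` equals `pluses` minus `minuses`. That relation holds for the sample values shown here, but the source does not state that it holds for every record.

```python
from datetime import datetime, timezone

# Numeric fields copied from the sample record.
post = {"timestamp": 1571221527, "rating": -4, "pluses": 9, "minuses": 13}

# Assumption: the timestamp counts seconds since the Unix epoch.
posted_at = datetime.fromtimestamp(post["timestamp"], tz=timezone.utc)

# Observation from the sample only: rating == pluses - minuses.
net_votes = post["pluses"] - post["minuses"]

print(posted_at.isoformat())
print(net_votes == post["rating"])
```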
You can use the following helper function to unflatten the flattened sequences:
def revert_flattening(records):
    # Turn a dict of parallel lists into a list of per-item dicts.
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records
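As a usage sketch, the helper can be applied to the `comments` field of the sample record, which stores one list per attribute. The function is repeated here only so that the snippet runs on its own; the `comments` dictionary is a trimmed copy of the sample values.

```python
def revert_flattening(records):
    """Turn a dict of parallel lists into a list of per-item dicts."""
    fixed_records = []
    for key, values in records.items():
        if not fixed_records:
            fixed_records = [{} for _ in range(len(values))]
        for i, value in enumerate(values):
            fixed_records[i][key] = value
    return fixed_records


# Subset of the flattened "comments" field from the sample record.
comments = {
    "id": [152116588, 152116426],
    "rating": [2, 0],
    "username": ["FlyZombieFly", "chinatoday.ru"],
}

for comment in revert_flattening(comments):
    print(comment["id"], comment["username"], comment["rating"])
```

Each resulting dict holds the fields of a single comment, which is usually easier to filter or sort than the column-wise layout.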
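The dataset is not anonymized, so individuals' names can be found in it. Where possible, information about the original authors is included in the dataset.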