Dataset:

IlyaGusev/rulm

Language:

ru

Size:

10M<n<100M

Dataset for training Russian language models

Overall size: 75 GB

Scripts: https://github.com/IlyaGusev/rulm/tree/master/data_processing

| Website | Char count (M) | Word count (M) |
|---|---|---|
| pikabu | 14938 | 2161 |
| lenta | 1008 | 135 |
| stihi | 2994 | 393 |
| stackoverflow | 1073 | 228 |
| habr | 5112 | 753 |
| taiga_fontanka | 419 | 55 |
| librusec | 10149 | 1573 |
| buriy | 2646 | 352 |
| ods_tass | 1908 | 255 |
| wiki | 3473 | 469 |
| math | 987 | 177 |
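
To get a feel for how the corpus is distributed across sources, the per-website figures can be summed and turned into shares. The snippet below is only a sketch built from the numbers in the table above; the names `SOURCES`, `totals`, and `char_share` are made up for illustration and are not part of the rulm scripts.

```python
# Figures copied from the table above, in millions: (characters, words).
SOURCES = {
    "pikabu": (14938, 2161),
    "lenta": (1008, 135),
    "stihi": (2994, 393),
    "stackoverflow": (1073, 228),
    "habr": (5112, 753),
    "taiga_fontanka": (419, 55),
    "librusec": (10149, 1573),
    "buriy": (2646, 352),
    "ods_tass": (1908, 255),
    "wiki": (3473, 469),
    "math": (987, 177),
}


def totals(sources):
    """Return (total characters, total words), both in millions."""
    chars = sum(c for c, _ in sources.values())
    words = sum(w for _, w in sources.values())
    return chars, words


def char_share(sources):
    """Return each source's fraction of the total character count."""
    total_chars, _ = totals(sources)
    return {name: c / total_chars for name, (c, _) in sources.items()}


if __name__ == "__main__":
    total_chars, total_words = totals(SOURCES)
    print(f"total: {total_chars} M chars, {total_words} M words")
    print(f"average word length: {total_chars / total_words:.2f} chars")
    # List sources from largest to smallest by character share.
    for name, share in sorted(char_share(SOURCES).items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {share:6.1%}")
```

Summing the table this way gives 44,707 M characters and 6,551 M words, with pikabu and librusec together accounting for more than half of the characters.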