Dataset:

sobir-hf/tajik-text-segmentation

This dataset contains Tajik-language texts with sentence annotations. It can be used to train and evaluate sentence-level text segmentation algorithms. The dataset contains more than 100 short and long texts and more than 3,000 annotated sentences. The texts were carefully selected from different categories, such as news, articles, novels, classical texts, poetry, and religious texts. The selection deliberately favors "hard" passages, where splitting on the period character "." would produce poor segmentation.
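To illustrate why splitting on periods alone is a weak baseline, here is a minimal sketch. The function name `naive_period_split` and the English sample sentence are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def naive_period_split(text: str) -> list[str]:
    """Cut the text after every '.' character (a deliberately naive baseline)."""
    parts = []
    current = ""
    for ch in text:
        current += ch
        if ch == ".":
            # Close the current segment at every period, with no context check.
            parts.append(current.strip())
            current = ""
    # Keep any trailing text that does not end with a period.
    if current.strip():
        parts.append(current.strip())
    return parts


# Hypothetical example: two real sentences, but four periods.
sample = "Dr. Smith paid 3.5 dollars. He left."
print(naive_period_split(sample))
# ['Dr.', 'Smith paid 3.', '5 dollars.', 'He left.']
```

The abbreviation and the decimal number each trigger a false boundary, so the baseline returns four segments where a human annotator would mark two.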

No preprocessing is applied other than collapsing consecutive whitespace characters and line breaks into single ones. Some texts are poorly formatted, because they were copied and pasted from the web as-is. This may help make trained models robust to noise.
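A normalization step of this kind could look roughly like the sketch below. The exact rules used to build the dataset are not documented here, so the function name `collapse_whitespace` and both regular expressions are assumptions, not the authors' code.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Reduce runs of spaces/tabs and runs of line breaks to single ones."""
    # Assumption: a run of spaces or tabs becomes one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Assumption: a run of line breaks (LF or CRLF) becomes one LF.
    text = re.sub(r"(?:\r?\n)+", "\n", text)
    return text


print(repr(collapse_whitespace("a  b\n\n\nc")))
# 'a b\nc'
```

Applying the same normalization to new input text before segmentation keeps it consistent with the format described above.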