sobir-hf/tajik-text-segmentation | ATYUN.COM Official Site - Comprehensive AI Tutorials and News Platform

Dataset:

sobir-hf/tajik-text-segmentation

Tasks:

Feature extraction

Language:

Size:

1K<n<10K

Other:

text_segmentaion nlp tg

License:

apache-2.0


This dataset contains Tajik-language texts with sentence-level annotations. It can be used to train and evaluate sentence segmentation algorithms. The dataset contains more than 100 texts, both short and long, and more than 3,000 annotated sentences. The texts were carefully selected from different categories such as news, articles, novels, classical texts, poetry, and religious texts. It deliberately includes a higher share of "hard" passages, where splitting on the period character "." would produce poor segmentation.
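To illustrate why splitting on "." alone is a weak baseline, here is a minimal Python sketch. The function name `naive_split` and the English sample sentence are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def naive_split(text: str) -> list[str]:
    """Naive baseline: treat every '.' as a sentence boundary."""
    pieces = []
    for chunk in text.split("."):
        chunk = chunk.strip()
        if chunk:
            # Re-attach the period that str.split removed.
            pieces.append(chunk + ".")
    return pieces


# Illustrative sample (not from the dataset): two real sentences,
# but an abbreviation and a decimal number also contain periods.
sample = "Dr. Smith paid 3.5 dollars. He left."
segments = naive_split(sample)
print(segments)
# → ['Dr.', 'Smith paid 3.', '5 dollars.', 'He left.']
```

The baseline returns four fragments for a text with two sentences. Passages with abbreviations, numbers, or missing punctuation are the kind of "hard" cases a learned segmenter has to handle.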

No preprocessing was done except collapsing consecutive whitespace characters and line breaks into single ones. Some texts are poorly formatted, since they were copied and pasted from the web as-is. This could make a trained model more robust to noise.
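The whitespace normalization described above might look roughly like the following Python sketch. The author's actual procedure is not given, so the function name `collapse_whitespace` and the exact rules (runs of spaces or tabs become one space, runs of line breaks become one line break) are assumptions.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Assumed normalization: reduce repeated spaces and line breaks to singles."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Runs of line breaks (Unix or Windows style) become a single "\n".
    text = re.sub(r"(?:\r?\n)+", "\n", text)
    return text


cleaned = collapse_whitespace("a  b\n\n\nc\t\td")
```

This sketch leaves a space next to a line break untouched, so other formatting noise from the web is preserved, in line with the description above.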

Author:

sobir-hf

Dataset size:

1.15 MB