Dataset:
mbshr/XSUMUrdu-DW_BBC
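An Urdu summarization dataset of 76,637 article and summary pairs scraped from the BBC Urdu and DW Urdu news websites.
- Preprocessed version: articles are capped at 512 tokens (roughly words); URLs, image captions, and similar residue have been removed.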
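Summarization: extractive and abstractive
- urT5 (monolingual vocabulary; 40k Urdu tokens), fine-tuned from mT5 using its own vocabulary
- ROUGE-1 F score: 40.03 overall, 46.35 on BBC Urdu data points, 36.91 on DW Urdu data points
- BERTScore: 75.1 overall, 77.0 on BBC Urdu data points, 74.16 on DW Urdu data points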
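For readers unfamiliar with the metric, the following is a minimal sketch of how a ROUGE-1 F score can be computed for one reference and candidate summary pair. It assumes plain whitespace tokenization and no stemming, which is a simplification and not necessarily the setup used to produce the numbers above. The function name `rouge1_f` is hypothetical, and it returns a value in [0, 1], whereas the scores above appear to be on a 0 to 100 scale.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: tokens are whitespace-separated words; no stemming or
    normalization is applied.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only, not taken from the dataset.
    reference = "یہ ایک خبر ہے"
    candidate = "یہ خبر ہے"
    print(round(100 * rouge1_f(reference, candidate), 2))
```

A corpus-level figure would typically be the mean of this value over all evaluated pairs; how the reported scores were aggregated is not stated here.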
Urdu.
[More Information Needed]
- url: URL of the article from which the record was scraped (BBC Urdu URLs contain the topic text in English with a number; DW Urdu URLs contain the topic text in Urdu). dtype: string
- Summary: Short summary of the article, written by the article's author in the style of highlights. dtype: string
- Text: Complete text of the article, intelligently truncated to 512 tokens. dtype: string
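Since the reported scores are broken down by source, it can be useful to tag each record as BBC Urdu or DW Urdu from its `url` field. The sketch below shows one way to do that and to check that a record carries the three string fields listed above. The helper names `source_of` and `check_record` are hypothetical, and the host names `bbc.com` and `dw.com` are assumptions about what the URLs look like, not something stated in this card.

```python
from urllib.parse import urlparse

# Field names as listed in this card.
REQUIRED_FIELDS = ("url", "Summary", "Text")


def source_of(url: str) -> str:
    """Guess the news source from the URL host.

    Assumption: BBC Urdu articles live under bbc.com and DW Urdu
    articles under dw.com; anything else is reported as "unknown".
    """
    host = (urlparse(url).hostname or "").lower()
    if host == "bbc.com" or host.endswith(".bbc.com"):
        return "BBC Urdu"
    if host == "dw.com" or host.endswith(".dw.com"):
        return "DW Urdu"
    return "unknown"


def check_record(record: dict) -> bool:
    """Return True if every listed field is present and is a string."""
    return all(isinstance(record.get(name), str) for name in REQUIRED_FIELDS)


if __name__ == "__main__":
    # Made-up record for illustration; not an actual dataset row.
    example = {
        "url": "https://www.bbc.com/urdu/example-12345",
        "Summary": "مختصر خلاصہ",
        "Text": "مضمون کا متن",
    }
    print(source_of(example["url"]), check_record(example))
```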
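[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]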