Urdu Summarization

Dataset Overview

The Urdu Summarization dataset contains 48,071 Urdu-language news articles collected from the BBC Urdu website, along with their summaries. Each article comes with its headline, summary, and full text.

Dataset Details

The dataset contains the following columns:

id (string): Unique identifier for each article
url (string): URL for the original article
title (string): Headline of the article
summary (string): Summary of the article
text (string): Full text of the article
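
The column list above can be mirrored as a small record type. The following is a minimal sketch, not the dataset's official loader: the class name Article and the helper parse_article are hypothetical, and it assumes each row arrives as a plain dictionary keyed by the column names listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Article:
    """One row of the dataset; every column is a string."""
    id: str
    url: str
    title: str
    summary: str
    text: str


def parse_article(row: dict) -> Article:
    """Validate a raw row (hypothetical dict format) and build an Article."""
    names = [f.name for f in fields(Article)]
    missing = [name for name in names if name not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for name in names:
        if not isinstance(row[name], str):
            raise TypeError(f"column {name!r} must be a string")
    # Extra keys in the row are ignored; only the five documented columns are kept.
    return Article(**{name: row[name] for name in names})
```

Checking rows this way catches missing or non-string columns before any training code runs.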

Data Collection

The data was collected from the BBC Urdu website using web scraping techniques. The articles were published between 2003 and 2020, covering a wide range of topics such as politics, sports, technology, and entertainment.

Data Preprocessing

The text data was preprocessed to remove HTML tags and non-Urdu characters. The summaries were created by human annotators, who read the full text of each article and summarized its main points. The dataset was split into training, validation, and test sets containing 80%, 10%, and 10% of the data, respectively.
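
The cleaning and splitting steps described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the script actually used to build the dataset: the function names, the regular expressions, the choice of the Unicode Arabic block (U+0600 to U+06FF) as the definition of "Urdu characters", and the shuffling seed are all assumptions.

```python
import random
import re

# Assumption: anything between angle brackets is an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "Urdu characters" means the Unicode Arabic block plus whitespace.
# This also drops characters outside that block, such as U+200C.
NON_URDU_RE = re.compile(r"[^\u0600-\u06FF\s]")
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags and non-Urdu characters, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    urdu_only = NON_URDU_RE.sub(" ", no_tags)
    return WS_RE.sub(" ", urdu_only).strip()


def split_dataset(items, seed=0):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10  # integer arithmetic avoids float rounding
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

With these rounding rules, 48,071 articles would yield 38,456 training, 4,807 validation, and 4,808 test examples; the actual split sizes in the distributed files may differ.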

Potential Use Cases

This dataset can be used for training and evaluating models for automatic summarization of Urdu text. It can also be used for research in natural language processing, machine learning, and information retrieval.
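
As one illustration of the evaluation use case, a generated summary can be compared against the reference summary by unigram overlap. The sketch below is a simplified, ROUGE-1-style F1 score that assumes whitespace tokenization, which is only a rough approximation for Urdu; the function name unigram_f1 is hypothetical, and an established ROUGE implementation should be preferred for reported results.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over whitespace-separated tokens shared by reference and candidate."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the two strings contain exactly the same tokens with the same counts, and 0.0 means they share none.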

Acknowledgements

We thank the BBC Urdu team for publishing the news articles on their website and making them publicly available. We also thank the human annotators who created the summaries for the articles.

Relevant Papers

No papers have been published yet using this dataset.

License

The dataset is distributed under the MIT License.

Author:

mwz

Dataset size:

318.61 MB