The Urdu Summarization dataset contains Urdu-language news articles along with their summaries. It comprises 48,071 news articles collected from the BBC Urdu website. Each article is accompanied by its headline, summary, and full text.
The dataset contains the following columns:
The data was collected from the BBC Urdu website using web scraping techniques. The articles were published between 2003 and 2020, covering a wide range of topics such as politics, sports, technology, and entertainment.
The text data was preprocessed to remove HTML tags and non-Urdu characters. The summaries were created by human annotators, who read the full text of each article and summarized its main points. The dataset was split into training, validation, and test sets containing 80%, 10%, and 10% of the data, respectively.
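The cleaning and splitting steps described above might look roughly like the following. This is a minimal sketch, not the code used to build the dataset: the function names `clean_text` and `split_dataset`, the regular expressions, the fixed shuffle seed, and the choice to treat "non-Urdu characters" as anything outside the Unicode Arabic block (U+0600 to U+06FF) are all assumptions made for illustration.

```python
import random
import re

# Assumption: HTML tags are anything between angle brackets.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "non-Urdu" means outside the Unicode Arabic block (U+0600-U+06FF),
# excluding whitespace. The original pipeline may have used a different rule.
NON_URDU_RE = re.compile(r"[^\u0600-\u06FF\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags and non-Urdu characters, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    urdu_only = NON_URDU_RE.sub(" ", no_tags)
    return SPACE_RE.sub(" ", urdu_only).strip()


def split_dataset(records, seed=0):
    """Shuffle records and split them 80/10/10 into train, validation, test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # hypothetical seed, for repeatability
    n_train = len(shuffled) * 8 // 10      # 80% for training
    n_val = len(shuffled) // 10            # 10% for validation
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder (about 10%) for testing
    return train, val, test
```

With integer arithmetic as above, 48,071 articles would yield 38,456 training, 4,807 validation, and 4,808 test examples. The actual split sizes in the released files may differ slightly depending on how rounding was handled.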
This dataset can be used for training and evaluating models for automatic summarization of Urdu text. It can also be used for research in natural language processing, machine learning, and information retrieval.
We thank the BBC Urdu team for publishing the news articles on their website and making them publicly available. We also thank the human annotators who created the summaries for the articles.
No papers using this dataset have been published yet.
The dataset is distributed under the MIT License.