Dataset:

SZTAKI-HLT/HunSum-1

Dataset Card for HunSum-1

Dataset Description

Dataset Summary

HunSum-1 is a Hungarian-language dataset of over 1.1M unique news articles, each with its lead and other metadata. The articles come from 9 major Hungarian news websites.

Supported Tasks and Leaderboards

  • 'summarization'
  • 'title generation'

Dataset Structure

Data Fields

  • uuid: a string containing the unique id
  • article: a string containing the body of the news article
  • lead: a string containing the lead of the article
  • title: a string containing the title of the article
  • url: a string containing the URL of the article
  • domain: a string containing the domain of the URL
  • date_of_creation: a timestamp containing the date when the article was created
  • tags: a sequence containing the tags of the article
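
As a rough illustration of the fields listed above, the sketch below models one record as a `TypedDict` and derives (input, target) pairs for the two supported tasks. It assumes that the lead serves as the reference summary and that `date_of_creation` is decoded to a `datetime`. The names `HunSumRecord`, `to_summarization_pair`, and `to_title_pair`, as well as the sample values, are hypothetical and not part of the dataset.

```python
from datetime import datetime
from typing import List, Tuple, TypedDict


class HunSumRecord(TypedDict):
    """Hypothetical type mirroring the fields listed in this card."""
    uuid: str
    article: str
    lead: str
    title: str
    url: str
    domain: str
    date_of_creation: datetime  # assumption: timestamp decoded to datetime
    tags: List[str]


def to_summarization_pair(record: HunSumRecord) -> Tuple[str, str]:
    # Assumption: the article body is the input, the lead is the target summary.
    return record["article"], record["lead"]


def to_title_pair(record: HunSumRecord) -> Tuple[str, str]:
    # Title generation: the article body is the input, the title is the target.
    return record["article"], record["title"]


# Made-up record for illustration only; not taken from the dataset.
example: HunSumRecord = {
    "uuid": "00000000-0000-0000-0000-000000000000",
    "article": "A cikk teljes szövege.",
    "lead": "A cikk rövid bevezetője.",
    "title": "A cikk címe",
    "url": "https://example.com/cikk/1",
    "domain": "example.com",
    "date_of_creation": datetime(2020, 1, 1),
    "tags": ["belföld"],
}

source, target = to_summarization_pair(example)
```

Keeping the task mapping in small functions makes it easy to swap the target field if a different setup is preferred.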

Data Splits

The HunSum-1 dataset has 3 splits: train, validation, and test.

Dataset Split    Number of Instances in Split
Train            1,144,255
Validation       1,996
Test             1,996
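
A minimal sketch of loading the splits with the Hugging Face `datasets` library and comparing their sizes with the table above. It assumes that the dataset is available on the Hub under the repository name `SZTAKI-HLT/HunSum-1` and that the `datasets` package is installed. `EXPECTED_SIZES`, `load_hunsum`, and `size_mismatches` are hypothetical helper names.

```python
from typing import Dict, Mapping, Sized, Tuple

# Split sizes as stated in the table above.
EXPECTED_SIZES: Dict[str, int] = {
    "train": 1_144_255,
    "validation": 1_996,
    "test": 1_996,
}


def load_hunsum():
    """Download all splits; needs network access and the `datasets` package."""
    # Imported lazily so the size check below can run without `datasets`.
    from datasets import load_dataset

    # Assumption: the Hub repository ID matches the name on this card.
    return load_dataset("SZTAKI-HLT/HunSum-1")


def size_mismatches(splits: Mapping[str, Sized]) -> Dict[str, Tuple[int, int]]:
    """Return {split: (expected, actual)} for every split whose size differs."""
    mismatches: Dict[str, Tuple[int, int]] = {}
    for name, expected in EXPECTED_SIZES.items():
        # A missing split is reported with an actual size of 0.
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches
```

Calling `size_mismatches(load_hunsum())` should return an empty dictionary if the hosted copy still matches the table; any entry in the result points to a split whose size has changed.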

Citation

If you use our dataset, please cite the following paper:

@inproceedings{HunSum-1,
    title = {{HunSum-1: an Abstractive Summarization Dataset for Hungarian}},
    booktitle = {XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023)},
    year = {2023},
    publisher = {Szegedi Tudományegyetem, Informatikai Intézet},
    address = {Szeged, Magyarország},
    author = {Barta, Botond and Lakatos, Dorina and Nagy, Attila and Nyist, Mil{\'{a}}n Konor and {\'{A}}cs, Judit},
    pages = {231--243}
}