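The corpus contains a total of 10,472 sentences, each belonging to one of five discourse-mode categories: argumentative, narrative, descriptive, dialogic, and informative.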
Hindi
{'Story_no': 15, 'Sentence': 'You can tell from this that it cost three rupees, and now it does not even make a sound! It is your fault! “How is the master at fault here?”', 'Discourse Mode': 'Dialogue'}
Sentence number, story number, sentence, and discourse mode
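A record with these fields can be handled as a plain dictionary. The following is a minimal sketch, not the dataset's official loader: the key names `Story_no`, `Sentence`, and `Discourse Mode` are taken from the example record, while the helper names `validate_record` and `count_modes`, the label strings, and the placeholder sentences are assumptions made for illustration.

```python
from collections import Counter

# Key names copied from the example record; everything else is illustrative.
REQUIRED_FIELDS = ("Story_no", "Sentence", "Discourse Mode")


def validate_record(record):
    """Return the record unchanged if it has the expected keys and types."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(record["Story_no"], int):
        raise TypeError("Story_no should be an integer")
    for name in ("Sentence", "Discourse Mode"):
        if not isinstance(record[name], str):
            raise TypeError(f"{name} should be a string")
    return record


def count_modes(records):
    """Tally how many validated records carry each discourse-mode label."""
    return Counter(validate_record(r)["Discourse Mode"] for r in records)


# Placeholder records (hypothetical text and labels, not real corpus entries).
records = [
    {"Story_no": 15, "Sentence": "<sentence text>", "Discourse Mode": "Dialogue"},
    {"Story_no": 15, "Sentence": "<another sentence>", "Discourse Mode": "Narrative"},
    {"Story_no": 16, "Sentence": "<a third sentence>", "Discourse Mode": "Dialogue"},
]

mode_counts = count_modes(records)
print(mode_counts.most_common())
```

Validating before counting keeps malformed rows from silently skewing the label distribution; the same two helpers would apply unchanged to records read from the files in the repository, provided they use the same keys.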
For details, see the paper: https://www.aclweb.org/anthology/2020.lrec-1.149/
See the repository: https://github.com/midas-research/hindi-discourse
If you use this dataset, please cite the following publication: https://aclanthology.org/2020.lrec-1.149/
@inproceedings{dhanwal-etal-2020-annotated,
    title = "An Annotated Dataset of Discourse Modes in {H}indi Stories",
    author = "Dhanwal, Swapnil and Dutta, Hritwik and Nankani, Hitesh and Shrivastava, Nilay and Kumar, Yaman and Li, Junyi Jessy and Mahata, Debanjan and Gosangi, Rakesh and Zhang, Haimin and Shah, Rajiv Ratn and Stent, Amanda",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.149",
    pages = "1191--1196",
    abstract = "In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
Thanks to @duttahritwik for adding this dataset.