Dataset:
DFKI-SLT/scidtb
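SciDTB is a domain-specific discourse treebank annotated on English scientific articles. Unlike the widely used RST-DT and PDTB, SciDTB represents discourse structure with dependency trees, which is flexible and somewhat simplified but does not sacrifice structural integrity. The treebank was also created as a benchmark for evaluating discourse dependency parsers. The dataset can be used for many downstream natural language processing tasks, such as machine translation and automatic summarization.
[More Information Needed]
English
A typical data point consists of a "root" field, which is a list of the nodes in a dependency tree. Each node in the list has four fields: "id", the id of the node; "parent", the id of the parent node; "text", the text span covered by the node; and "relation", the relation between the node and its parent.
Below is an example from the SciDTB training set: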
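Because a data point is a flat list of nodes rather than a nested structure, it can be useful to check that the parent links really form a single tree. The following is a minimal sketch, not part of the dataset's tooling: the `Node` type and the `check_nodes` helper are hypothetical names, and the convention that the top node has parent -1 is an assumption taken from the example shown in this card.

```python
from typing import List, TypedDict


class Node(TypedDict):
    id: int        # id of the node
    parent: int    # id of the parent node (assumed -1 for the top node)
    text: str      # text span covered by the node
    relation: str  # relation between the node and its parent


def check_nodes(nodes: List[Node]) -> List[str]:
    """Return the problems found; an empty list means all checks passed."""
    problems = []
    ids = [n["id"] for n in nodes]
    if len(ids) != len(set(ids)):
        problems.append("duplicate node ids")

    # Assumption: exactly one node has parent -1.
    tops = [n["id"] for n in nodes if n["parent"] == -1]
    if len(tops) != 1:
        problems.append(f"expected one node with parent -1, found {len(tops)}")

    known = set(ids)
    for n in nodes:
        if n["parent"] != -1 and n["parent"] not in known:
            problems.append(f"node {n['id']} has unknown parent {n['parent']}")

    # Follow parent links from every node; revisiting a node means a cycle.
    parent_of = {n["id"]: n["parent"] for n in nodes}
    for start in ids:
        seen = set()
        cur = start
        while cur in parent_of:
            if cur in seen:
                problems.append(f"cycle reachable from node {start}")
                break
            seen.add(cur)
            cur = parent_of[cur]
    return problems
```

Calling `check_nodes(example["root"])` on a loaded data point would return an empty list when the node list is a well-formed tree under these assumptions.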
{ "root": [ { "id": 0, "parent": -1, "text": "ROOT", "relation": "null" }, { "id": 1, "parent": 0, "text": "We propose a neural network approach ", "relation": "ROOT" }, { "id": 2, "parent": 1, "text": "to benefit from the non-linearity of corpus-wide statistics for part-of-speech ( POS ) tagging . <S>", "relation": "enablement" }, { "id": 3, "parent": 1, "text": "We investigated several types of corpus-wide information for the words , such as word embeddings and POS tag distributions . <S>", "relation": "elab-aspect" }, { "id": 4, "parent": 5, "text": "Since these statistics are encoded as dense continuous features , ", "relation": "cause" }, { "id": 5, "parent": 3, "text": "it is not trivial to combine these features ", "relation": "elab-addition" }, { "id": 6, "parent": 5, "text": "comparing with sparse discrete features . <S>", "relation": "comparison" }, { "id": 7, "parent": 1, "text": "Our tagger is designed as a combination of a linear model for discrete features and a feed-forward neural network ", "relation": "elab-aspect" }, { "id": 8, "parent": 7, "text": "that captures the non-linear interactions among the continuous features . <S>", "relation": "elab-addition" }, { "id": 9, "parent": 10, "text": "By using several recent advances in the activation functions for neural networks , ", "relation": "manner-means" }, { "id": 10, "parent": 1, "text": "the proposed method marks new state-of-the-art accuracies for English POS tagging tasks . <S>", "relation": "evaluation" } ] }
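To make the dependency structure in the node list above easier to read, the parent links can be turned into a child map and printed as an indented tree. This is a minimal sketch under stated assumptions: `build_children` and `render` are hypothetical helper names, and `sample` is a subset of the nodes from the example above (ids 0, 1, 3, 4, 5), copied verbatim. Note that node 4 appears before its parent, node 5, so the sketch does not rely on list order.

```python
from collections import defaultdict


def build_children(nodes):
    """Map each parent id to the ids of its children, in list order."""
    children = defaultdict(list)
    for node in nodes:
        children[node["parent"]].append(node["id"])
    return children


def render(nodes, root_parent=-1):
    """Return one indented line per node, in depth-first order."""
    by_id = {node["id"]: node for node in nodes}
    children = build_children(nodes)
    lines = []

    def visit(node_id, depth):
        node = by_id[node_id]
        indent = "  " * depth
        lines.append(f"{indent}{node_id} [{node['relation']}] {node['text'].strip()}")
        for child_id in children[node_id]:
            visit(child_id, depth + 1)

    # Start from every node whose parent is the assumed top marker (-1).
    for top_id in children[root_parent]:
        visit(top_id, 0)
    return lines


# Subset of the example's nodes, copied verbatim.
sample = [
    {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
    {"id": 1, "parent": 0, "text": "We propose a neural network approach ", "relation": "ROOT"},
    {"id": 3, "parent": 1, "text": "We investigated several types of corpus-wide information for the words , such as word embeddings and POS tag distributions . <S>", "relation": "elab-aspect"},
    {"id": 4, "parent": 5, "text": "Since these statistics are encoded as dense continuous features , ", "relation": "cause"},
    {"id": 5, "parent": 3, "text": "it is not trivial to combine these features ", "relation": "elab-addition"},
]

if __name__ == "__main__":
    print("\n".join(render(sample)))
```

Each output line shows the node id, its relation to the parent in brackets, and its text, with two spaces of indentation per level of depth.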
More raw data examples like this can be found here.
The dataset has three splits: training (train), development (dev), and test.
Train | Valid | Test |
---|---|---|
743 | 154 | 152 |
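[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
See here for more information.
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]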
@inproceedings{yang-li-2018-scidtb, title = "{S}ci{DTB}: Discourse Dependency {T}ree{B}ank for Scientific Abstracts", author = "Yang, An and Li, Sujian", booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)", month = jul, year = "2018", address = "Melbourne, Australia", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/P18-2071", doi = "10.18653/v1/P18-2071", pages = "444--449", abstract = "Annotation corpus for discourse relations benefits NLP tasks such as machine translation and question answering. In this paper, we present SciDTB, a domain-specific discourse treebank annotated on scientific articles. Different from widely-used RST-DT and PDTB, SciDTB uses dependency trees to represent discourse structure, which is flexible and simplified to some extent but do not sacrifice structural integrity. We discuss the labeling framework, annotation workflow and some statistics about SciDTB. Furthermore, our treebank is made as a benchmark for evaluating discourse dependency parsers, on which we provide several baselines as fundamental work.", }