Datasets:

DFKI-SLT/scidtb

Sub-tasks:

parsing

Languages:

en

Multilinguality:

monolingual

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

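Dataset Card for SciDTB

Dataset Summary

SciDTB is a domain-specific discourse treebank annotated on English-language scientific articles. Unlike the widely used RST-DT and PDTB, SciDTB represents discourse structure with dependency trees, a representation that is flexible and somewhat simplified but does not sacrifice structural integrity. The treebank was also created as a benchmark for evaluating discourse dependency parsers. The dataset can benefit many downstream natural language processing tasks, such as machine translation and automatic summarization.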
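Dependency parsers are commonly scored with unlabeled and labeled attachment scores (UAS and LAS). The sketch below illustrates that idea on node lists with the "id", "parent", and "relation" fields described under Data Instances. The function name `attachment_scores` and the toy trees are hypothetical, and the card does not confirm that the benchmark's official evaluation follows exactly this procedure.

```python
from typing import Dict, List, Tuple

Node = Dict[str, object]


def attachment_scores(gold: List[Node], pred: List[Node]) -> Tuple[float, float]:
    """Return (UAS, LAS) over all non-root nodes, matching nodes by id.

    Hypothetical helper: UAS counts nodes whose predicted parent is correct,
    LAS additionally requires the relation label to be correct.
    """
    pred_by_id = {node["id"]: node for node in pred}
    total = uas_hits = las_hits = 0
    for gold_node in gold:
        if gold_node["parent"] == -1:  # skip the artificial ROOT node
            continue
        total += 1
        pred_node = pred_by_id.get(gold_node["id"])
        if pred_node is None or pred_node["parent"] != gold_node["parent"]:
            continue
        uas_hits += 1
        if pred_node["relation"] == gold_node["relation"]:
            las_hits += 1
    if total == 0:
        return 0.0, 0.0
    return uas_hits / total, las_hits / total


# Toy gold and predicted trees (illustrative only, not taken from a parser).
gold = [
    {"id": 0, "parent": -1, "relation": "null"},
    {"id": 1, "parent": 0, "relation": "ROOT"},
    {"id": 2, "parent": 1, "relation": "enablement"},
    {"id": 3, "parent": 1, "relation": "elab-aspect"},
]
pred = [
    {"id": 0, "parent": -1, "relation": "null"},
    {"id": 1, "parent": 0, "relation": "ROOT"},
    {"id": 2, "parent": 1, "relation": "elab-addition"},  # right parent, wrong label
    {"id": 3, "parent": 2, "relation": "elab-aspect"},    # wrong parent
]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In this toy case two of the three non-root nodes have the correct parent and one of them also has the correct label, so UAS is 2/3 and LAS is 1/3.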

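Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Data Instances

A typical data point consists of a "root" field, which is a list of the nodes in a dependency tree. Each node in the list has four fields: "id", the node's identifier; "parent", the identifier of its parent node; "text", the text span covered by the node; and "relation", the discourse relation between the node and its parent.

An example from the SciDTB training set is shown below: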

{
    "root": [
        {
            "id": 0,
            "parent": -1,
            "text": "ROOT",
            "relation": "null"
        },
        {
            "id": 1,
            "parent": 0,
            "text": "We propose a neural network approach ",
            "relation": "ROOT"
        },
        {
            "id": 2,
            "parent": 1,
            "text": "to benefit from the non-linearity of corpus-wide statistics for part-of-speech ( POS ) tagging . <S>",
            "relation": "enablement"
        },
        {
            "id": 3,
            "parent": 1,
            "text": "We investigated several types of corpus-wide information for the words , such as word embeddings and POS tag distributions . <S>",
            "relation": "elab-aspect"
        },
        {
            "id": 4,
            "parent": 5,
            "text": "Since these statistics are encoded as dense continuous features , ",
            "relation": "cause"
        },
        {
            "id": 5,
            "parent": 3,
            "text": "it is not trivial to combine these features ",
            "relation": "elab-addition"
        },
        {
            "id": 6,
            "parent": 5,
            "text": "comparing with sparse discrete features . <S>",
            "relation": "comparison"
        },
        {
            "id": 7,
            "parent": 1,
            "text": "Our tagger is designed as a combination of a linear model for discrete features and a feed-forward neural network ",
            "relation": "elab-aspect"
        },
        {
            "id": 8,
            "parent": 7,
            "text": "that captures the non-linear interactions among the continuous features . <S>",
            "relation": "elab-addition"
        },
        {
            "id": 9,
            "parent": 10,
            "text": "By using several recent advances in the activation functions for neural networks , ",
            "relation": "manner-means"
        },
        {
            "id": 10,
            "parent": 1,
            "text": "the proposed method marks new state-of-the-art accuracies for English POS tagging tasks . <S>",
            "relation": "evaluation"
        }
    ]
}

More raw data instances like this one can be found here.
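Because each node stores only its parent's id, reading the tree top-down requires inverting those links first. The following is a minimal sketch of that step, assuming the node format shown in the example above. The helper names `build_children` and `render` are hypothetical, and the sample texts are shortened from the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_children(nodes: List[dict]) -> Dict[int, List[int]]:
    """Map each node id to the ids of its direct children."""
    children: Dict[int, List[int]] = defaultdict(list)
    for node in nodes:
        if node["parent"] != -1:  # -1 marks the artificial ROOT node
            children[node["parent"]].append(node["id"])
    return children


def render(nodes: List[dict], max_chars: int = 40) -> List[str]:
    """Return one indented line per node, visiting the tree depth-first."""
    by_id = {node["id"]: node for node in nodes}
    children = build_children(nodes)
    lines: List[str] = []

    def visit(node_id: int, depth: int) -> None:
        node = by_id[node_id]
        text = node["text"].strip()[:max_chars]
        lines.append(f"{'  ' * depth}[{node['relation']}] {text}")
        for child_id in sorted(children[node_id]):
            visit(child_id, depth + 1)

    for node in nodes:
        if node["parent"] == -1:
            visit(node["id"], 0)
    return lines


# First seven nodes of the example, with texts shortened for brevity.
sample = [
    {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
    {"id": 1, "parent": 0, "text": "We propose a neural network approach ", "relation": "ROOT"},
    {"id": 2, "parent": 1, "text": "to benefit from the non-linearity", "relation": "enablement"},
    {"id": 3, "parent": 1, "text": "We investigated several types", "relation": "elab-aspect"},
    {"id": 4, "parent": 5, "text": "Since these statistics are encoded", "relation": "cause"},
    {"id": 5, "parent": 3, "text": "it is not trivial to combine", "relation": "elab-addition"},
    {"id": 6, "parent": 5, "text": "comparing with sparse discrete features", "relation": "comparison"},
]

lines = render(sample)
print("\n".join(lines))
```

Note that node 4 appears in the list before its parent, node 5, so the child map has to be built from the whole list before any traversal starts.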

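Data Fields

  • id: integer identifier of the node
  • parent: integer identifier of the parent node
  • text: string containing the text of the current node
  • relation: string naming the discourse relation between the current node and its parent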
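The four fields listed above map naturally onto a small typed record. The sketch below is one possible representation, not part of the dataset's tooling: the `DiscourseNode` class and `parse_nodes` function are hypothetical names, and treating a trailing "<S>" as a sentence-end marker is an assumption inferred from the example instance.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class DiscourseNode:
    id: int        # integer identifier of the node
    parent: int    # integer identifier of the parent node (-1 for ROOT)
    text: str      # text of the current node
    relation: str  # discourse relation to the parent node

    @property
    def ends_sentence(self) -> bool:
        # Assumption: a trailing "<S>" marks the end of a sentence.
        return self.text.rstrip().endswith("<S>")


def parse_nodes(raw: List[Dict[str, Any]]) -> List[DiscourseNode]:
    """Convert raw dicts to DiscourseNode objects and check parent links."""
    nodes = [
        DiscourseNode(int(r["id"]), int(r["parent"]), str(r["text"]), str(r["relation"]))
        for r in raw
    ]
    ids = {node.id for node in nodes}
    if len(ids) != len(nodes):
        raise ValueError("duplicate node id")
    for node in nodes:
        if node.parent != -1 and node.parent not in ids:
            raise ValueError(f"node {node.id} points to unknown parent {node.parent}")
    return nodes


raw_nodes = [
    {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
    {"id": 1, "parent": 0, "text": "We propose a neural network approach ", "relation": "ROOT"},
    {"id": 2, "parent": 1, "text": "to benefit from corpus-wide statistics . <S>", "relation": "enablement"},
]
nodes = parse_nodes(raw_nodes)
print(nodes[2].relation, nodes[2].ends_sentence)
```

Validating parent links up front means later code can follow `parent` without guarding against missing nodes.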

Data Splits

The dataset consists of three splits: train, dev, and test.

Train  Valid  Test
743    154    152
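As a quick arithmetic check on the table above, the share of each split can be computed directly from the listed counts. The card does not state what unit the counts are in, so the sketch treats them simply as instance counts, and the dictionary keys are illustrative labels.

```python
# Split sizes as listed in the table above.
split_sizes = {"train": 743, "dev": 154, "test": 152}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name:<5} {size:>4}  {shares[name]:.1%}")
print(f"total {total:>4}")
```

The counts sum to 1049, giving roughly 70.8% train, 14.7% dev, and 14.5% test.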

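Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

More information can be found here.

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]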

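Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]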

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@inproceedings{yang-li-2018-scidtb,
    title = "{S}ci{DTB}: Discourse Dependency {T}ree{B}ank for Scientific Abstracts",
    author = "Yang, An  and
      Li, Sujian",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2018",
    address = "Melbourne, Australia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P18-2071",
    doi = "10.18653/v1/P18-2071",
    pages = "444--449",
    abstract = "Annotation corpus for discourse relations benefits NLP tasks such as machine translation and question answering. In this paper, we present SciDTB, a domain-specific discourse treebank annotated on scientific articles. Different from widely-used RST-DT and PDTB, SciDTB uses dependency trees to represent discourse structure, which is flexible and simplified to some extent but do not sacrifice structural integrity. We discuss the labeling framework, annotation workflow and some statistics about SciDTB. Furthermore, our treebank is made as a benchmark for evaluating discourse dependency parsers, on which we provide several baselines as fundamental work.",
}