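Dataset: qanastek/Biosses-BLUE
Tasks: Text Classification
Languages: en
Multilinguality: monolingual
Size: n<1K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
License: gpl-3.0

BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. It comprises 100 sentence pairs; each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset, which contains articles from the biomedical domain. The sentence pairs were drawn from citing sentences, i.e. sentences that contain a citation to a reference article.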
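Five different human experts rated the similarity of each sentence pair on a scale from 0 (no relation) to 4 (equivalent). In the original paper, the mean of the scores assigned by the five annotators is taken as the gold standard, and the Pearson correlation between the gold-standard scores and the scores estimated by a model is used as the evaluation metric. The strength of a correlation can be interpreted using the general guideline proposed by Evans (1996). The dataset is split as follows: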
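The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the original authors' code: `pearson_r` and `evans_strength` are hypothetical helper names, the gold and predicted scores are made-up values, and the band boundaries are the commonly cited Evans (1996) thresholds applied to the absolute value of r, which the text references but does not list.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def evans_strength(r):
    """Map |r| to a verbal label (assumed Evans 1996 bands)."""
    magnitude = abs(r)
    if magnitude >= 0.80:
        return "very strong"
    if magnitude >= 0.60:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    if magnitude >= 0.20:
        return "weak"
    return "very weak"


# Hypothetical gold-standard scores (mean of five annotators) and model outputs.
gold = [0.0, 1.2, 2.4, 3.0, 4.0]
pred = [0.3, 1.0, 2.0, 3.4, 3.8]

r = pearson_r(gold, pred)
print(f"r = {r:.3f} ({evans_strength(r)})")
```

A library routine such as `scipy.stats.pearsonr` would serve equally well; the hand-written version is used here only to keep the sketch free of dependencies.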
name | Train | Dev | Test |
---|---|---|---|
biosses | 64 | 16 | 20 |
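The dataset supports biomedical semantic similarity scoring.
The text is in English.
Each instance consists of two sentences (sentence1 and sentence2) and the corresponding similarity score (the mean of the scores assigned by the five human annotators).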
{ "id": "0", "sentence1": "Centrosomes increase both in size and in microtubule-nucleating capacity just before mitotic entry.", "sentence2": "Functional studies showed that, when introduced into cell lines, miR-146a was found to promote cell proliferation in cervical cancer cells, which suggests that miR-146a works as an oncogenic miRNA in these cancers.", "score": 0.0 }
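As a rough sketch of how such an instance might be consumed, the snippet below parses the record shown above and checks that its score lies in the 0 to 4 range. `SentencePair` and `parse_instance` are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One BIOSSES instance: two sentences and their mean similarity score."""
    id: str
    sentence1: str
    sentence2: str
    score: float


def parse_instance(raw):
    """Parse one JSON record and validate the score range (0 to 4)."""
    record = json.loads(raw)
    score = float(record["score"])
    if not 0.0 <= score <= 4.0:
        raise ValueError(f"score {score} is outside the 0-4 scale")
    return SentencePair(
        id=str(record["id"]),
        sentence1=record["sentence1"],
        sentence2=record["sentence2"],
        score=score,
    )


# The example record from the text, split over several lines for readability.
raw = (
    '{"id": "0", '
    '"sentence1": "Centrosomes increase both in size and in '
    'microtubule-nucleating capacity just before mitotic entry.", '
    '"sentence2": "Functional studies showed that, when introduced into cell '
    'lines, miR-146a was found to promote cell proliferation in cervical '
    'cancer cells, which suggests that miR-146a works as an oncogenic miRNA '
    'in these cancers.", '
    '"score": 0.0}'
)

pair = parse_instance(raw)
print(pair.id, pair.score)
```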
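The source data is the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset.
Five different human experts rated the similarity of each sentence pair on a scale from 0 (no relation) to 4 (equivalent). The score range follows the guidelines of SemEval 2012 Task 6 on STS (Agirre et al., 2012). In addition to the annotation instructions, the annotators were given example sentences from the biomedical literature for each similarity level.
The table below shows the Pearson correlation between each annotator's scores and the mean scores of the remaining four annotators. The annotators' scores are strongly correlated with one another. The lowest correlation is 0.902, which can be regarded as an upper bound for algorithmic measures evaluated on this dataset.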
Annotator | Correlation r |
---|---|
Annotator A | 0.952 |
Annotator B | 0.958 |
Annotator C | 0.917 |
Annotator D | 0.902 |
Annotator E | 0.941 |
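The leave-one-out comparison behind this table can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the helper names are hypothetical, and the five score lists are invented values for six sentence pairs, not the real BIOSSES annotations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_correlations(scores_by_annotator):
    """Correlate each annotator with the mean of all the other annotators."""
    result = {}
    for name, own_scores in scores_by_annotator.items():
        others = [s for other, s in scores_by_annotator.items() if other != name]
        # Per-item mean over the remaining annotators.
        rest_mean = [sum(column) / len(column) for column in zip(*others)]
        result[name] = pearson_r(own_scores, rest_mean)
    return result


# Invented scores on the 0-4 scale for six sentence pairs (not real data).
scores = {
    "A": [0, 1, 2, 3, 4, 2],
    "B": [0, 1, 3, 3, 4, 2],
    "C": [1, 1, 2, 4, 4, 1],
    "D": [0, 2, 2, 3, 3, 2],
    "E": [0, 1, 2, 3, 4, 3],
}

# Gold standard: per-item mean over all five annotators.
gold = [sum(column) / len(column) for column in zip(*scores.values())]

correlations = leave_one_out_correlations(scores)
for name, r in correlations.items():
    print(f"Annotator {name}: r = {r:.3f}")
```

Comparing each annotator against the mean of the others, rather than against the full gold standard, avoids inflating the correlation with the annotator's own contribution to the mean.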
BIOSSES is made available under the terms of the GNU General Public License v3.0.
@article{10.1093/bioinformatics/btx238,
  author = {Soğancıoğlu, Gizem and Öztürk, Hakime and Özgür, Arzucan},
  title = "{BIOSSES: a semantic sentence similarity estimation system for the biomedical domain}",
  journal = {Bioinformatics},
  volume = {33},
  number = {14},
  pages = {i49-i58},
  year = {2017},
  month = {07},
  abstract = "{The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6\% in terms of the Pearson correlation metric. A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.}",
  issn = {1367-4803},
  doi = {10.1093/bioinformatics/btx238},
  url = {https://doi.org/10.1093/bioinformatics/btx238},
  eprint = {https://academic.oup.com/bioinformatics/article-pdf/33/14/i49/25157316/btx238.pdf},
}
Thanks to @qanastek for adding this dataset.