Dataset:
Salesforce/rose
Language:
en
This repo contains the RoSE benchmark of our paper "Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation".
Please visit here for a demo page of this project.
The RoSE benchmark contains system outputs annotated with our ACU protocol. It contains four parts, one per dataset split, whose statistics we summarize below.
Dataset | Split | #Doc. | #Sys. | #Total Summ. | HF Name |
---|---|---|---|---|---|
CNNDM | Test | 500 | 12 | 6000 | cnndm_test |
CNNDM | Validation | 1000 | 8 | 8000 | cnndm_validation |
XSum | Test | 500 | 8 | 4000 | xsum |
SamSum | Test | 500 | 8 | 4000 | samsum |
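To make the table easier to reuse, the sketch below encodes it as data. It assumes that "#Total Summ." is simply "#Doc." × "#Sys." (one summary per system per document), which matches every row above. It also assumes that the "HF Name" column gives the configuration name accepted by `datasets.load_dataset("Salesforce/rose", name)`. `RoseSplit` and `load_rose_config` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoseSplit:
    """Hypothetical record for one row of the RoSE statistics table."""
    dataset: str
    split: str
    num_docs: int
    num_systems: int
    hf_name: str

    @property
    def total_summaries(self) -> int:
        # Assumption: each system contributes exactly one summary per document.
        return self.num_docs * self.num_systems


SPLITS = [
    RoseSplit("CNNDM", "Test", 500, 12, "cnndm_test"),
    RoseSplit("CNNDM", "Validation", 1000, 8, "cnndm_validation"),
    RoseSplit("XSum", "Test", 500, 8, "xsum"),
    RoseSplit("SamSum", "Test", 500, 8, "samsum"),
]


def load_rose_config(hf_name: str):
    """Hypothetical helper: assumes hf_name is a config of Salesforce/rose."""
    from datasets import load_dataset  # requires `pip install datasets`
    return load_dataset("Salesforce/rose", hf_name)


if __name__ == "__main__":
    for s in SPLITS:
        print(f"{s.hf_name}: {s.total_summaries} summaries")
    print("total:", sum(s.total_summaries for s in SPLITS))
```

Running the script only recomputes the table. `load_rose_config` imports `datasets` lazily, so the rest of the sketch works without that package installed.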
In total, we have system outputs annotated with four different human evaluation protocols. We summarize them below.
Protocol | w/ Input Document | w/ Reference Summary | Fine-grained |
---|---|---|---|
Prior | ✗ | ✗ | ✗ |
Ref-free | ✓ | ✗ | ✗ |
Ref-based | ✗ | ✓ | ✗ |
ACU | ✗ | ✓ | ✓ |
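The protocol table can be read as a small set of boolean features. The sketch below is illustrative only: `Protocol` and `protocols_where` are hypothetical names, not part of the dataset. It encodes each ✓/✗ as `True`/`False` so that you can query which protocols show annotators a given input.

```python
from typing import List, NamedTuple


class Protocol(NamedTuple):
    """Hypothetical record for one row of the protocol table."""
    name: str
    uses_input_document: bool
    uses_reference_summary: bool
    fine_grained: bool


# Direct transcription of the table: ✓ -> True, ✗ -> False.
PROTOCOLS = [
    Protocol("Prior", False, False, False),
    Protocol("Ref-free", True, False, False),
    Protocol("Ref-based", False, True, False),
    Protocol("ACU", False, True, True),
]


def protocols_where(flag: str) -> List[str]:
    """Return the names of protocols whose given boolean field is True."""
    return [p.name for p in PROTOCOLS if getattr(p, flag)]


if __name__ == "__main__":
    print(protocols_where("uses_reference_summary"))  # reference-based protocols
    print(protocols_where("fine_grained"))
```

For example, the query on `uses_reference_summary` separates the two reference-based protocols (Ref-based and ACU) from the two that do not use a reference (Prior and Ref-free).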
We annotated two sets of system summaries.