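Dataset: stsb_multi_mt
Tasks: Text Classification
Multilinguality: multilingual
Size: 10K<n<100K
Annotations creators: crowdsourced
Source datasets: extended|other-sts-b
Preprint: arxiv:1708.00055
License: other

The STS Benchmark comprises a selection of the English datasets used in the STS tasks organized as part of SemEval between 2012 and 2017. These datasets include text from image captions, news headlines, and user forums. (source)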
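This dataset provides machine translations into several languages together with the English original (STSbenchmark dataset). The translations were produced with deepl.com. The dataset can be used for tasks such as training sentence embeddings.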
Usage example
Load the German dev dataset:

    from datasets import load_dataset

    # name selects the language, split selects the partition
    dataset = load_dataset("stsb_multi_mt", name="de", split="dev")

Load the English training dataset:

    from datasets import load_dataset

    dataset = load_dataset("stsb_multi_mt", name="en", split="train")
Available languages: de, en, es, fr, it, nl, pl, pt, ru, zh
Each record in the dataset consists of a pair of sentences and their similarity score.
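The loading calls shown earlier can be wrapped in a small helper that rejects unsupported language codes before anything is downloaded. This is a minimal sketch, not part of the dataset itself: the name `load_stsb` is hypothetical, and the `test` split is an assumption (the card only shows `train` and `dev`).

```python
# Language codes listed on this card.
AVAILABLE_LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh")
# "train" and "dev" appear in the usage example; "test" is assumed here.
SPLITS = ("train", "dev", "test")


def load_stsb(language, split):
    """Hypothetical helper: validate the arguments, then load one language/split."""
    if language not in AVAILABLE_LANGUAGES:
        raise ValueError(f"unsupported language {language!r}; choose from {AVAILABLE_LANGUAGES}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; choose from {SPLITS}")
    # Imported lazily so argument validation works without the library installed.
    from datasets import load_dataset
    return load_dataset("stsb_multi_mt", name=language, split=split)
```

For example, `load_stsb("fr", "dev")` would load the French dev split, while `load_stsb("xx", "dev")` raises a `ValueError` without touching the network.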
score | 2 example sentences | explanation |
---|---|---|
5 | The bird is bathing in the sink. Birdie is washing itself in the water basin. | The two sentences are completely equivalent, as they mean the same thing. |
4 | Two boys on a couch are playing video games. Two boys are playing a video game. | The two sentences are mostly equivalent, but some unimportant details differ. |
3 | John said he is considered a witness but not a suspect. “He is not a suspect anymore.” John said. | The two sentences are roughly equivalent, but some important information differs/missing. |
2 | They flew out of the nest in groups. They flew into the nest together. | The two sentences are not equivalent, but share some details. |
1 | The woman is playing the violin. The young lady enjoys listening to the guitar. | The two sentences are not equivalent, but are on the same topic. |
0 | The black dog is running through the snow. A race car driver is driving his car through the mud. | The two sentences are completely dissimilar. |
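Scores in the data are continuous (for example 3.8), while the table above describes integer levels only. The sketch below maps a score to the nearest level for quick inspection. Rounding to the nearest level is an assumption made for illustration, and the names `RUBRIC` and `nearest_rubric_level` are hypothetical.

```python
import math

# Explanations copied from the table above, keyed by integer level.
RUBRIC = {
    5: "The two sentences are completely equivalent, as they mean the same thing.",
    4: "The two sentences are mostly equivalent, but some unimportant details differ.",
    3: "The two sentences are roughly equivalent, but some important information differs/missing.",
    2: "The two sentences are not equivalent, but share some details.",
    1: "The two sentences are not equivalent, but are on the same topic.",
    0: "The two sentences are completely dissimilar.",
}


def nearest_rubric_level(score):
    """Round a score in [0, 5] to the nearest integer level (halves round up)."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score {score} is outside the 0-5 range")
    return min(5, math.floor(score + 0.5))


def explain(score):
    """Return the table explanation for the level nearest to the score."""
    return RUBRIC[nearest_rubric_level(score)]
```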
Example:
{ "sentence1": "A man is playing a large flute.", "sentence2": "A man is playing a flute.", "similarity_score": 3.8 }
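When a record like this is used to train sentence embeddings, the 0 to 5 score is often rescaled to the range 0 to 1 so it can be compared with a cosine similarity. The card does not prescribe this step; the sketch below is one common convention, and `to_training_pair` is a hypothetical helper name.

```python
def to_training_pair(record, max_score=5.0):
    """Return (sentence1, sentence2, label) with the label scaled to [0, 1]."""
    score = float(record["similarity_score"])
    if not 0.0 <= score <= max_score:
        raise ValueError(f"similarity_score {score} is outside [0, {max_score}]")
    return record["sentence1"], record["sentence2"], score / max_score


# The example record shown above.
example = {
    "sentence1": "A man is playing a large flute.",
    "sentence2": "A man is playing a flute.",
    "similarity_score": 3.8,
}
sentence1, sentence2, label = to_training_pair(example)
```

Here `label` comes out at approximately 0.76, and the two sentences are passed through unchanged.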
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
See the LICENSE and the download at the original dataset.
@InProceedings{huggingface:dataset:stsb_multi_mt,
  title = {Machine translated multilingual STS benchmark dataset.},
  author = {Philip May},
  year = {2021},
  url = {https://github.com/PhilipMay/stsb-multi-mt}
}
Thanks to @PhilipMay for adding this dataset.