ParaDetox: Detoxification with Parallel Data (English). Toxicity Task Results

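This repository contains the Toxicity Task annotations from the English ParaDetox dataset collection pipeline. The original paper, "ParaDetox: Detoxification with Parallel Data", was presented at the ACL 2022 main conference.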

ParaDetox Collection Pipeline

The ParaDetox dataset was collected on the Yandex.Toloka crowdsourcing platform. Collection proceeded in three steps:

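Task 1: Generation of Paraphrases: The first crowdsourcing task asks users to remove toxicity from a given sentence while preserving its content.
Task 2: Content Preservation Check: We show users the generated paraphrases alongside their original variants and ask them to indicate whether the two have close meanings.
Task 3: Toxicity Check: Finally, we check whether the workers succeeded in removing the toxicity.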
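The three steps above act as a generate-then-filter chain: a paraphrase from Task 1 is useful as a parallel pair only if it passes both later checks. The sketch below illustrates that logic only. The `Candidate` class, its field names, and the `accepted_pairs` helper are hypothetical and are not taken from the authors' code or from the Toloka API.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One crowdsourced paraphrase with its check outcomes (hypothetical schema)."""
    source: str          # original toxic sentence shown in Task 1
    paraphrase: str      # worker's rewrite from Task 1
    same_meaning: bool   # assumed outcome of Task 2 (content preservation)
    still_toxic: bool    # assumed outcome of Task 3 (toxicity check)


def accepted_pairs(candidates):
    """Keep pairs that preserve meaning (Task 2) and are no longer toxic (Task 3)."""
    return [
        (c.source, c.paraphrase)
        for c in candidates
        if c.same_meaning and not c.still_toxic
    ]


# Toy data with placeholder strings instead of real sentences.
toy_candidates = [
    Candidate("<toxic A>", "<neutral A>", same_meaning=True, still_toxic=False),
    Candidate("<toxic B>", "<unrelated B>", same_meaning=False, still_toxic=False),
    Candidate("<toxic C>", "<toxic C'>", same_meaning=True, still_toxic=True),
]
pairs = accepted_pairs(toy_candidates)
print(pairs)
```

Only the first toy candidate survives both checks; the other two fail Task 2 and Task 3 respectively.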

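Specifically, this repository contains the results of Task 3: Toxicity Check. Only samples with an annotation confidence >= 90 are included. The input is a text, and the label indicates whether the text is toxic. The dataset contains 26,507 samples in total, of which a minority (4,009 pairs) are toxic examples.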
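Because toxic examples are a minority, it can help to check the label balance before training a classifier on this data. The following is a minimal sketch that assumes each record is a dictionary with a text field and a boolean toxicity label. The key names `text` and `toxic` are assumptions for illustration and may not match the actual column names in the released files.

```python
from collections import Counter


def label_distribution(samples, label_key="toxic"):
    """Return (label counts, share of toxic samples) for a list of records."""
    counts = Counter(bool(s[label_key]) for s in samples)
    total = sum(counts.values())
    toxic_share = counts[True] / total if total else 0.0
    return counts, toxic_share


# Toy records with placeholder text; the schema is hypothetical.
toy_samples = [
    {"text": "<neutral sentence 1>", "toxic": False},
    {"text": "<neutral sentence 2>", "toxic": False},
    {"text": "<neutral sentence 3>", "toxic": False},
    {"text": "<toxic sentence>", "toxic": True},
]
counts, share = label_distribution(toy_samples)
print(counts[True], counts[False], share)
```

Applied to the totals reported above, the same ratio is 4,009 / 26,507, or roughly 15% toxic examples.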

Citation

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}

Contacts

For any questions, please contact Daryna Dementieva (dardem96@gmail.com).

Author:

s-nlp

Dataset size:

1.38 MB