ParaDetox: Detoxification with Parallel Data (English)

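This repository contains information about the ParaDetox dataset, the first parallel corpus for the detoxification task, together with models and an evaluation methodology for detoxifying English text. The original paper was presented at the ACL 2022 main conference.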

ParaDetox Collection Pipeline

The ParaDetox dataset was collected on the Yandex.Toloka crowdsourcing platform. The collection ran in three steps:

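Task 1 (paraphrase generation): the first crowdsourcing task asks workers to remove the toxicity from a given sentence while keeping its content.
Task 2 (content preservation check): we show workers the generated paraphrase alongside the original sentence and ask whether the two have a similar meaning.
Task 3 (toxicity check): finally, we check whether the worker actually succeeded in removing the toxicity.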

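All of these steps serve to ensure high data quality and to make the collection process automatic. For more details, please refer to the original paper.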
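The way Tasks 2 and 3 act as filters on the Task 1 output can be sketched as follows. This is only an illustration, not the authors' pipeline code: in the real collection the two judgments come from crowd workers, whereas here they are passed in as callables. The names `Candidate`, `filter_candidates`, `toy_same_meaning`, and `toy_is_toxic` are hypothetical, and the toy judges (word overlap, a one-word blocklist) are stand-ins chosen only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    toxic: str       # original toxic sentence shown to the worker
    paraphrase: str  # Task 1 output written by the worker


def filter_candidates(
    candidates: Iterable[Candidate],
    same_meaning: Callable[[str, str], bool],
    is_toxic: Callable[[str], bool],
) -> List[Candidate]:
    """Keep only paraphrases that pass both checks (Task 2, then Task 3)."""
    kept = []
    for cand in candidates:
        if not same_meaning(cand.toxic, cand.paraphrase):  # Task 2 failed
            continue
        if is_toxic(cand.paraphrase):  # Task 3 failed
            continue
        kept.append(cand)
    return kept


# Toy stand-ins for the crowd judgments (assumptions, not the real criteria).
def toy_same_meaning(a: str, b: str) -> bool:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) >= 0.5  # Jaccard word overlap


def toy_is_toxic(text: str) -> bool:
    return "damn" in text.lower().split()


candidates = [
    Candidate("this is a damn bad idea", "this is a bad idea"),
    Candidate("this is a damn bad idea", "this is a damn bad plan"),
    Candidate("this is a damn bad idea", "I like cats"),
]
accepted = filter_candidates(candidates, toy_same_meaning, toy_is_toxic)
```

With these toy judges only the first candidate survives: the second is still toxic and the third no longer shares the original meaning.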

ParaDetox Dataset

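As a result, we obtained paraphrases for 11,939 toxic sentences (1.66 paraphrases per sentence on average), 19,766 paraphrases in total. The whole dataset can be found here. Examples from the ParaDetox dataset: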

In addition to the full ParaDetox dataset, we also release the samples that annotators marked as "cannot rewrite" in Task 1 of the crowdsourcing pipeline.

Detoxification Evaluation

The automatic evaluation of the models is based on three parameters:

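Style transfer accuracy (STA): the percentage of outputs that a style classifier identifies as non-toxic. We pretrained a toxicity classifier on Jigsaw data and released it on HuggingFace (repo).
Content preservation (SIM): the cosine similarity between the embeddings of the original text and the output, computed with the model of Wieting et al. (2019).
Fluency (FL): the percentage of sentences identified as fluent by a RoBERTa-based linguistic acceptability classifier trained on the CoLA dataset.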
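A minimal sketch of how the three numbers could be computed is given below. It is not the evaluation code from the repository: the toxicity classifier, the fluency classifier, and the sentence embedder are passed in as plain callables (`is_non_toxic`, `is_fluent`, and `embed` are hypothetical names), and averaging SIM over all sentence pairs is an assumption made here for illustration.

```python
import math
from typing import Callable, Dict, Sequence


def percentage(flags: Sequence[bool]) -> float:
    """Share of True values, expressed as a percentage."""
    return 100.0 * sum(1 for f in flags if f) / len(flags)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def evaluate(
    sources: Sequence[str],
    outputs: Sequence[str],
    is_non_toxic: Callable[[str], bool],    # stand-in for the style classifier
    is_fluent: Callable[[str], bool],       # stand-in for the CoLA classifier
    embed: Callable[[str], Sequence[float]],  # stand-in for the embedder
) -> Dict[str, float]:
    if len(sources) != len(outputs) or not outputs:
        raise ValueError("sources and outputs must be non-empty and aligned")
    sta = percentage([is_non_toxic(o) for o in outputs])
    fl = percentage([is_fluent(o) for o in outputs])
    # Assumption: SIM is reported as the mean similarity over all pairs.
    sims = [cosine_similarity(embed(s), embed(o))
            for s, o in zip(sources, outputs)]
    sim = sum(sims) / len(sims)
    return {"STA": sta, "SIM": sim, "FL": fl}
```

Keeping the models behind callables means the same function works with any classifier or embedder that is wrapped to return a boolean or a vector.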

All code used in our experiments can be run via a Colab notebook.

Detoxification Model

The new SOTA (base) model for the detoxification task, a BART model trained on the ParaDetox dataset, is released in a HuggingFace repository (here).

You can also check out our demo and Telegram bot.

Citation

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}

Contacts

If you find a problem, do not hesitate to report it in GitHub Issues.

For any questions, please contact: Daryna Dementieva (dardem96@gmail.com)

Author:

s-nlp

Dataset size:

1.95 MB