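XLM-RoBERTa (base) fine-tuned on HC3 for ChatGPT text detection

XLM-RoBERTa (base) fine-tuned on the HC3 corpus from Hello-SimpleAI for ChatGPT text detection.

Thanks to Hello-SimpleAI for their tremendous work!

F1 score on the test set: 0.9736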

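Model

The XLM-RoBERTa model was pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced by Conneau et al. in the paper "Unsupervised Cross-lingual Representation Learning at Scale" and first released in this repository.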

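Dataset

Human ChatGPT Comparison Corpus (HC3)

The first human-ChatGPT comparison corpus, named the HC3 dataset, created by Hello-SimpleAI.

The dataset is introduced in the following paper:

Paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection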

Metrics

metric	value
F1	0.9736
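
For readers unfamiliar with the metric, the following is a minimal sketch of how a binary F1 score is computed from true and predicted labels. The card does not say which averaging mode or positive class was used for the reported 0.9736, so the single positive class, the function name `f1_score`, and the toy labels below are all assumptions made for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for one positive class (illustrative helper, not from the card)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for this example (1 = ChatGPT, 0 = human is an assumption)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
toy_f1 = f1_score(y_true, y_pred)
print(round(toy_f1, 4))
```

On a real evaluation, `y_pred` would come from running the detector over the held-out HC3 test split.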

Usage

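from transformers import pipeline

# Model checkpoint on the Hugging Face Hub
ckpt = "mrm8488/xlm-roberta-base-finetuned-HC3-mix"

# Build a text-classification pipeline backed by the fine-tuned checkpoint
detector = pipeline('text-classification', model=ckpt)

text = "Your text here..."

# Returns a list with one dict per input, each holding a 'label' and a 'score'
result = detector(text)

print(result)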
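
A text-classification pipeline returns a list of dicts with `label` and `score` keys, one per input. The sketch below shows one possible way to separate confident predictions from uncertain ones. The helper name `flag_uncertain`, the 0.9 threshold, and the placeholder labels `LABEL_A` and `LABEL_B` are hypothetical. The card does not list the model's actual label names, so check the checkpoint's configuration before relying on them.

```python
def flag_uncertain(results, threshold=0.9):
    """Split pipeline outputs into confident and uncertain predictions.

    `results` is a list of {"label": str, "score": float} dicts, the shape
    produced by a text-classification pipeline. The threshold is an arbitrary
    illustrative choice, not a value recommended by the model card.
    """
    confident, uncertain = [], []
    for idx, item in enumerate(results):
        bucket = confident if item["score"] >= threshold else uncertain
        bucket.append((idx, item["label"], item["score"]))
    return confident, uncertain


# Hand-written sample in the pipeline's output shape (not real model output)
sample_results = [
    {"label": "LABEL_A", "score": 0.98},
    {"label": "LABEL_B", "score": 0.61},
]
confident, uncertain = flag_uncertain(sample_results)
print("confident:", confident)
print("uncertain:", uncertain)
```

In practice, `sample_results` would be replaced by the output of `detector([...])` called on a list of texts.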

Citation

@misc {manuel_romero_2023,
    author       = { {Manuel Romero} },
    title        = { xlm-roberta-base-finetuned-HC3-mix (Revision b18de48) },
    year         = 2023,
    url          = { https://huggingface.co/mrm8488/xlm-roberta-base-finetuned-HC3-mix },
    doi          = { 10.57967/hf/0306 },
    publisher    = { Hugging Face }
}

Author:

Manuel Romero

Dataset size:

2.09 GB