XLM-RoBERTa (base) fine-tuned on HC3 for ChatGPT text detection

XLM-RoBERTa (base) fine-tuned on the Hello-SimpleAI HC3 corpus to detect ChatGPT-generated text.

All credit to Hello-SimpleAI for their huge work!

F1 score on the test set: 0.9736

The model

XLM-RoBERTa is a multilingual model pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al. and first released in the fairseq repository.

The dataset

Human ChatGPT Comparison Corpus (HC3)

HC3 is the first human-ChatGPT comparison corpus, released by Hello-SimpleAI.

The dataset was introduced in the paper How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection.

Metrics

metric	value
F1	0.9736
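For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for one class from raw counts. The card does not say whether the reported value is a binary, macro, or weighted F1, so this only illustrates the formula and is not the evaluation script behind the number above. The function name and the counts in the usage line are hypothetical.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for a single class from true positives, false positives and false negatives."""
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only, not taken from the HC3 evaluation
example_f1 = f1_from_counts(tp=8, fp=2, fn=2)
```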

Usage

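from transformers import pipeline

# Checkpoint of this fine-tuned model on the Hugging Face Hub
ckpt = "mrm8488/xlm-roberta-base-finetuned-HC3-mix"

# Build a text-classification pipeline around the checkpoint
detector = pipeline("text-classification", model=ckpt)

text = "Your text here..."

# Returns a list with one dict per input, holding the predicted label and its score
result = detector(text)

print(result)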
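When screening several texts at once, it can help to wrap the detector in a small helper that truncates long inputs and keeps only confident predictions. The following is a minimal sketch under stated assumptions: `flag_texts` and its `threshold` default are hypothetical, and the label string for machine-written text is not stated in this card, so it is passed in explicitly (it can be read from `detector.model.config.id2label`).

```python
from typing import Any, Callable, Dict, List, Tuple


def flag_texts(
    detector: Callable[..., List[Dict[str, Any]]],
    texts: List[str],
    machine_label: str,
    threshold: float = 0.9,  # hypothetical cut-off, tune on your own data
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs predicted as `machine_label` with score >= threshold."""
    # Truncate to 512 tokens, the maximum sequence length XLM-RoBERTa accepts
    predictions = detector(texts, truncation=True, max_length=512)
    flagged = []
    for text, pred in zip(texts, predictions):
        if pred["label"] == machine_label and pred["score"] >= threshold:
            flagged.append((text, pred["score"]))
    return flagged
```

With the `detector` built above, a call would look like `flag_texts(detector, ["first text", "second text"], machine_label=...)`, where the label is whichever entry of `id2label` denotes ChatGPT-written text.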

Citation

@misc {manuel_romero_2023,
    author       = { {Manuel Romero} },
    title        = { xlm-roberta-base-finetuned-HC3-mix (Revision b18de48) },
    year         = 2023,
    url          = { https://huggingface.co/mrm8488/xlm-roberta-base-finetuned-HC3-mix },
    doi          = { 10.57967/hf/0306 },
    publisher    = { Hugging Face }
}

Author:

Manuel Romero

Dataset size:

2.09 GB