Dataset:

aadityaubhat/GPT-wiki-intro

GPT Wiki Intro

Overview

Dataset for training models to classify human-written versus GPT/ChatGPT-generated text. It contains Wikipedia introductions and GPT (Curie) generated introductions for 150k topics.

Prompt used for generating text

200 word wikipedia style introduction on '{title}'
{starter_text}

where title is the title of the Wikipedia page and starter_text is the first seven words of the Wikipedia introduction. Here is an example of the prompt used to generate the introduction paragraph for 'Secretory protein':

'200 word wikipedia style introduction on Secretory protein

A secretory protein is any protein, whether'
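The template can be filled in with a few lines of Python. This is a minimal sketch, not the author's code: the function name `build_prompt`, the whitespace-based word split, and the single newline between the two template lines are all assumptions made for illustration.

```python
def build_prompt(title: str, wiki_intro: str, n_starter_words: int = 7) -> str:
    """Fill the prompt template with a page title and the opening words of its intro."""
    # starter_text: the first seven whitespace-separated words of the Wikipedia intro.
    # Splitting on whitespace is an assumption about how "words" were counted.
    starter_text = " ".join(wiki_intro.split()[:n_starter_words])
    # Template as written in this card; the exact whitespace between the
    # two lines is an assumption.
    return f"200 word wikipedia style introduction on '{title}'\n{starter_text}"


# Placeholder intro: only the first seven words come from the card's example.
intro = "A secretory protein is any protein, whether <rest of the introduction>"
prompt = build_prompt("Secretory protein", intro)
print(prompt)
```

Only the first seven words of the introduction reach the model, so the generated text shares its opening with the human-written one and diverges afterwards.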

Configuration used for GPT model

model="text-curie-001",
prompt=prompt,
temperature=0.7,
max_tokens=300,
top_p=1,
frequency_penalty=0.4,
presence_penalty=0.1
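The settings above are keyword arguments to a text-completion request. The sketch below only collects them into a dictionary; the helper name `completion_kwargs` is hypothetical, and how the request was actually sent is not stated in this card. One plausible route, assumed here, is the legacy (pre-1.0) `openai` Python SDK, where such arguments were passed to `openai.Completion.create`.

```python
def completion_kwargs(prompt: str) -> dict:
    """Collect the generation settings listed in this card into one dict."""
    return {
        "model": "text-curie-001",   # GPT-3 Curie; may no longer be served
        "prompt": prompt,
        "temperature": 0.7,          # moderate sampling randomness
        "max_tokens": 300,           # upper bound on generated tokens
        "top_p": 1,                  # no nucleus truncation
        "frequency_penalty": 0.4,    # discourage repeating frequent tokens
        "presence_penalty": 0.1,     # mild push toward new tokens
    }


kwargs = completion_kwargs("200 word wikipedia style introduction on 'Example'\nStarter text")
# Assumption: with the legacy SDK this would be sent as
#   openai.Completion.create(**kwargs)
print(sorted(kwargs))
```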

Schema for the dataset

| Column | Datatype | Description |
|---|---|---|
| id | int64 | ID |
| url | string | Wikipedia URL |
| title | string | Title |
| wiki_intro | string | Introduction paragraph from Wikipedia |
| generated_intro | string | Introduction generated by GPT (Curie) model |
| title_len | int64 | Number of words in title |
| wiki_intro_len | int64 | Number of words in wiki_intro |
| generated_intro_len | int64 | Number of words in generated_intro |
| prompt | string | Prompt used to generate intro |
| generated_text | string | Text continued after the prompt |
| prompt_tokens | int64 | Number of tokens in the prompt |
| generated_text_tokens | int64 | Number of tokens in generated text |
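For the classification task described in the overview, each row yields two training examples: one human-written and one generated. The sketch below shows one way to unpack rows into labelled pairs. The function name `to_labeled_examples` and the label ids (0 for human, 1 for generated) are arbitrary choices for illustration, and the sample row is a placeholder rather than real data.

```python
from typing import Iterable, Iterator, Mapping, Tuple

HUMAN, GENERATED = 0, 1  # label ids are an arbitrary choice for this sketch


def to_labeled_examples(
    rows: Iterable[Mapping[str, str]],
) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs: two per dataset row."""
    for row in rows:
        yield row["wiki_intro"], HUMAN
        yield row["generated_intro"], GENERATED


# Placeholder row using the column names from the schema above.
rows = [
    {
        "wiki_intro": "placeholder human text",
        "generated_intro": "placeholder generated text",
    }
]
examples = list(to_labeled_examples(rows))
print(examples)
```

Real rows could be obtained with the Hugging Face `datasets` library, for example `load_dataset("aadityaubhat/GPT-wiki-intro")`, and passed to the same function.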

Credits

Code

The code used to create this dataset is available on GitHub.

Citation

@misc {aaditya_bhat_2023,
    author       = { {Aaditya Bhat} },
    title        = { GPT-wiki-intro (Revision 0e458f5) },
    year         = 2023,
    url          = { https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro },
    doi          = { 10.57967/hf/0326 },
    publisher    = { Hugging Face }
}