
reStructured Pre-training (RST)

Official repository, paper, easter eggs

RST is a new paradigm for language pre-training, which
  • unifies 26 different types of signals from 10 data sources (Rotten Tomatoes, DailyMail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured way, pre-training them into a single monolithic model,
  • surpasses strong competitors (e.g., T0) on 52 of 55 popular datasets spanning a variety of NLP tasks (classification, IE, retrieval, generation, etc.),
  • achieves superior performance on the National College Entrance Examination (Gaokao-English, 高考-英语): it scores 40 points higher than the average student and 15 points higher than GPT-3 while using 1/16 of the parameters. In particular, Qin achieves a high score of 138.5 (out of a full mark of 150) on the 2018 English exam.

In such a pre-training paradigm,

  • Data-centric Pre-training: the role of data is re-emphasized; model pre-training and fine-tuning on downstream tasks are viewed as a process of storing and accessing data
  • Pre-training over JSON instead of TEXT: a good storage mechanism should not only be able to cache a large amount of data but also make that data easy to access (see the sketch after this list)
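As a minimal illustration of the "pre-training over JSON" idea (a sketch only; the JSON field names below are hypothetical, not the exact schema from the paper), a naturally occurring signal such as a review-rating pair can be restructured into a JSON record and then flattened into a (prompt, target) pair for a seq2seq model, reusing the TEXT/QUERY template from the usage example below:

```python
import json

# Hypothetical restructured record: the field names here are illustrative,
# not the exact JSON schema used in the RST paper.
record = json.loads("""
{
    "signal": "rotten_tomatoes_sentiment",
    "text": "this is the best cast iron skillet you will ever buy.",
    "label": "positive"
}
""")

# Flatten the structured record into a (prompt, target) training pair,
# using the same TEXT/QUERY template as the inference example below.
prompt = f'TEXT: {record["text"]} QUERY: Is this review "positive" or "negative"'
target = record["label"]
print(prompt)
print(target)
```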

Model Description

We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.

| Model | Description | Recommended Application |
| --- | --- | --- |
| rst-all-11b | Trained with all the signals below except those used to train the Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
| rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge-intensive tasks, information extraction tasks, fact checking |
| rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
| rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
| rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction, and other general IE tasks in the news, scientific, or other domains |
| rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
| rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
| rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
| rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
| rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
| rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the T0pp model | General multiple-choice question answering |
| rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
| rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction, and other text generation tasks |

Have a try?

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the general-purpose rst-all-11b model (11B parameters)
tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")

# Prompts follow a "TEXT: ... QUERY: ..." template
inputs = tokenizer.encode("TEXT: this is the best cast iron skillet you will ever buy. QUERY: Is this review \"positive\" or \"negative\"", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```
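By default, generate() produces short greedy output, which suits classification-style queries like the one above. For open-ended tasks such as summarization, you will likely want longer outputs and beam search; a minimal sketch using standard transformers decoding arguments (the specific values are illustrative, not tuned):

```python
# Decoding parameters for longer, generation-style outputs (standard
# transformers generate() arguments; values are illustrative, not tuned).
outputs = model.generate(
    inputs,
    max_new_tokens=128,   # allow longer generations than the default
    num_beams=4,          # beam search for more fluent output
    early_stopping=True,  # stop when all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```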

Data for reStructured Pre-training

This dataset is a precious treasure, containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the college entrance exam mentioned in the RST paper) can benefit from being pre-trained on some of our provided signals. We spent several months collecting the following 29 signal types, accounting for a total of 46,926,447 data samples. We hope this dataset will be a valuable asset for everyone in natural language processing research.

We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples for each signal type. If you want all the samples we collected, please fill out this form. More specifically, we collected the following signals.

We would be happy :smiley: to hear if this resource is helpful for your work; please cite our work :blush:
| Source | Signal | #Sample | Use in DataLab | Some Applications |
| --- | --- | --- | --- | --- |
| Rotten Tomatoes | (review, rating) | 5,311,109 | `load_dataset("rst", "rotten_tomatoes_sentiment")` | Sentiment classification |
| Daily Mail | (text, category) | 899,904 | `load_dataset("rst", "daily_mail_category")` | Topic classification |
| Daily Mail | (title, text, summary) | 1,026,616 | `load_dataset("rst", "daily_mail_summary")` | Summarization; sentence expansion |
| Daily Mail | (text, events) | 1,006,412 | `load_dataset("rst", "daily_mail_temporal")` | Temporal reasoning |
| Wikidata | (entity, entity_type, text) | 2,214,274 | `load_dataset("rst", "wikidata_entity")` | Entity typing |
| Wikidata | (subject, object, relation, text) | 1,526,674 | `load_dataset("rst", "wikidata_relation")` | Relation extraction; fact retrieval |
| wikiHow | (text, category) | 112,109 | `load_dataset("rst", "wikihow_text_category")` | Topic classification |
| wikiHow | (low_category, high_category) | 4,868 | `load_dataset("rst", "wikihow_category_hierarchy")` | Relation extraction; commonsense reasoning |
| wikiHow | (goal, steps) | 47,956 | `load_dataset("rst", "wikihow_goal_step")` | Intent detection |
| wikiHow | (text, summary) | 703,278 | `load_dataset("rst", "wikihow_summary")` | Summarization; sentence expansion |
| wikiHow | (goal, first_step, second_step) | 47,787 | `load_dataset("rst", "wikihow_procedure")` | Temporal reasoning |
| wikiHow | (question, description, answer, related_questions) | 47,705 | `load_dataset("rst", "wikihow_question")` | Question generation |
| Wikipedia | (text, entities) | 22,231,011 | `load_dataset("rst", "wikipedia_entities")` | Entity recognition |
| Wikipedia | (texts, titles) | 3,296,225 | `load_dataset("rst", "wikipedia_sections")` | Summarization |
| WordNet | (word, sentence, pos) | 27,123 | `load_dataset("rst", "wordnet_pos")` | Part-of-speech tagging |
| WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | `load_dataset("rst", "wordnet_meaning")` | Word sense disambiguation |
| WordNet | (word, sentence, synonyms) | 17,804 | `load_dataset("rst", "wordnet_synonym")` | Paraphrasing |
| WordNet | (word, sentence, antonyms) | 6,408 | `load_dataset("rst", "wordnet_antonym")` | Negation |
| ConTRoL | (premise, hypothesis, label) | 8,323 | `load_dataset("rst", "qa_control")` | Natural language inference |
| DREAM | (context, question, options, answer) | 9,164 | `load_dataset("rst", "qa_dream")` | Reading comprehension |
| LogiQA | (context, question, options, answer) | 7,974 | `load_dataset("rst", "qa_logiqa")` | Reading comprehension |
| ReClor | (context, question, options, answer) | 5,138 | `load_dataset("rst", "qa_reclor")` | Reading comprehension |
| RACE | (context, question, options, answer) | 44,880 | `load_dataset("rst", "qa_race")` | Reading comprehension |
| RACE-C | (context, question, options, answer) | 5,093 | `load_dataset("rst", "qa_race_c")` | Reading comprehension |
| TriviaQA | (context, question, answer) | 46,636 | `load_dataset("rst", "qa_triviaqa")` | Reading comprehension |
| arXiv | (text, category) | 1,696,348 | `load_dataset("rst", "arxiv_category")` | Topic classification |
| arXiv | (text, summary) | 1,696,348 | `load_dataset("rst", "arxiv_summary")` | Summarization; sentence expansion |
| Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | `load_dataset("rst", "paperswithcode_entity")` | Entity recognition |
| Paperswithcode | (text, summary) | 120,924 | `load_dataset("rst", "paperswithcode_summary")` | Summarization; sentence expansion |
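Each signal type can be loaded programmatically with the calls in the table above; a minimal sketch, assuming the DataLab client library is installed as the `datalabs` package:

```python
# Load one signal type through DataLab (assumes `pip install datalabs`).
from datalabs import load_dataset

dataset = load_dataset("rst", "rotten_tomatoes_sentiment")
print(dataset)              # available splits and sample counts
print(dataset["train"][0])  # one (review, rating) example; split name is an assumption
```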

BibTeX for Citation Info

@article{yuan2022restructured,
  title={reStructured Pre-training},
  author={Yuan, Weizhe and Liu, Pengfei},
  journal={arXiv preprint arXiv:2206.11147},
  year={2022}
}