Models:
GAIR/rst-all-11b
Official repository, paper, easter eggs
RST is a new language pre-training paradigm that unifies 26 different types of signals from 10 data sources around the world (Rotten Tomatoes, Daily Mail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured way and pre-trains over them with a single unified model.

We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.
Model | Description | Recommended Application |
---|---|---|
rst-all-11b | Trained with all the signals below except signals that are used to train Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge intensive tasks, information extraction tasks,factual checker |
rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains |
rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the rst-natural-language-inference-11b model | General multiple-choice question answering |
rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction and other text generation tasks |
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")

inputs = tokenizer.encode(
    'TEXT: this is the best cast iron skillet you will ever buy. '
    'QUERY: Is this review "positive" or "negative"',
    return_tensors="pt",
)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```
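The prompt in the snippet above follows a simple TEXT/QUERY template: the passage goes after `TEXT:` and the question after `QUERY:`. A minimal sketch of building such prompts (the `make_rst_prompt` helper is ours, not part of the released code):

```python
def make_rst_prompt(text: str, query: str) -> str:
    """Concatenate a passage and a question into the TEXT/QUERY prompt format."""
    return f"TEXT: {text} QUERY: {query}"

prompt = make_rst_prompt(
    "this is the best cast iron skillet you will ever buy.",
    'Is this review "positive" or "negative"',
)
print(prompt)
```

The resulting string is what gets passed to `tokenizer.encode(...)` above; swapping the QUERY lets the same model serve the other application scenarios listed in the table.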
This dataset is a precious resource containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the Gaokao English exam mentioned in the RST paper) can benefit from pre-training on some of the signals we provide. We spent several months collecting the following 29 signal types, totaling 46,926,447 data samples. We hope this dataset can become a valuable asset for natural language processing research.

We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples per signal type. If you want all the samples we collected, please fill out this form. Specifically, we collected the following signals.
We would be very happy :smiley: to know if this resource helps your work; please cite our work :blush:

Source | Signal | #Sample | Use in DataLab | Some Applications |
---|---|---|---|---|
Rotten Tomatoes | (review, rating) | 5,311,109 | load_dataset("rst", "rotten_tomatoes_sentiment") | Sentiment classification |
Daily Mail | (text, category) | 899,904 | load_dataset("rst", "daily_mail_category") | Topic classification |
Daily Mail | (title, text, summary) | 1,026,616 | load_dataset("rst", "daily_mail_summary") | Summarization; Sentence expansion |
Daily Mail | (text, events) | 1,006,412 | load_dataset("rst", "daily_mail_temporal") | Temporal reasoning |
Wikidata | (entity, entity_type, text) | 2,214,274 | load_dataset("rst", "wikidata_entity") | Entity typing |
Wikidata | (subject, object, relation, text) | 1,526,674 | load_dataset("rst", "wikidata_relation") | Relation extraction; Fact retrieval |
wikiHow | (text, category) | 112,109 | load_dataset("rst", "wikihow_text_category") | Topic classification |
wikiHow | (low_category, high_category) | 4,868 | load_dataset("rst", "wikihow_category_hierarchy") | Relation extraction; Commonsense reasoning |
wikiHow | (goal, steps) | 47,956 | load_dataset("rst", "wikihow_goal_step") | Intent detection |
wikiHow | (text, summary) | 703,278 | load_dataset("rst", "wikihow_summary") | Summarization; Sentence expansion |
wikiHow | (goal, first_step, second_step) | 47,787 | load_dataset("rst", "wikihow_procedure") | Temporal reasoning |
wikiHow | (question, description, answer, related_questions) | 47,705 | load_dataset("rst", "wikihow_question") | Question generation |
Wikipedia | (text, entities) | 22,231,011 | load_dataset("rst", "wikipedia_entities") | Entity recognition |
Wikipedia | (texts, titles) | 3,296,225 | load_dataset("rst", "wikipedia_sections") | Summarization |
WordNet | (word, sentence, pos) | 27,123 | load_dataset("rst", "wordnet_pos") | Part-of-speech tagging |
WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | load_dataset("rst", "wordnet_meaning") | Word sense disambiguation |
WordNet | (word, sentence, synonyms) | 17,804 | load_dataset("rst", "wordnet_synonym") | Paraphrasing |
WordNet | (word, sentence, antonyms) | 6,408 | load_dataset("rst", "wordnet_antonym") | Negation |
ConTRoL | (premise, hypothesis, label) | 8,323 | load_dataset("rst", "qa_control") | Natural language inference |
DREAM | (context, question, options, answer) | 9,164 | load_dataset("rst", "qa_dream") | Reading comprehension |
LogiQA | (context, question, options, answer) | 7,974 | load_dataset("rst", "qa_logiqa") | Reading comprehension |
ReClor | (context, question, options, answer) | 5,138 | load_dataset("rst", "qa_reclor") | Reading comprehension |
RACE | (context, question, options, answer) | 44,880 | load_dataset("rst", "qa_race") | Reading comprehension |
RACE-C | (context, question, options, answer) | 5,093 | load_dataset("rst", "qa_race_c") | Reading comprehension |
TriviaQA | (context, question, answer) | 46,636 | load_dataset("rst", "qa_triviaqa") | Reading comprehension |
arXiv | (text, category) | 1,696,348 | load_dataset("rst", "arxiv_category") | Topic classification |
arXiv | (text, summary) | 1,696,348 | load_dataset("rst", "arxiv_summary") | Summarization; Sentence expansion |
Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | load_dataset("rst", "paperswithcode_entity") | Entity recognition |
Paperswithcode | (text, summary) | 120,924 | load_dataset("rst", "paperswithcode_summary") | Summarization; Sentence expansion |
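A minimal sketch of pulling one of the signals above through DataLab, assuming the `datalabs` package is installed (`pip install datalabs`) and network access is available; the `rst_signal` helper is ours:

```python
def rst_signal(config: str) -> tuple:
    """Return the (path, name) pair that load_dataset expects for an RST signal."""
    return ("rst", config)

pair = rst_signal("wikihow_goal_step")

if __name__ == "__main__":
    # Downloads up to 50,000 samples of the wikiHow goal-step signal.
    from datalabs import load_dataset

    dataset = load_dataset(*pair)
    print(dataset["train"][0])  # one (goal, steps) sample
```

Any other row of the table works the same way; only the config string (second table column's `load_dataset` argument) changes.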
```bibtex
@article{yuan2022restructured,
  title={reStructured Pre-training},
  author={Yuan, Weizhe and Liu, Pengfei},
  journal={arXiv preprint arXiv:2206.11147},
  year={2022}
}
```