
reStructured Pre-training (RST)

Official repository, paper, easter eggs

RST is a new language pre-training paradigm. It unifies 26 different types of signals from 10 data sources around the world (Rotten Tomatoes, DailyMail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured way and pre-trains a single monolithic model that
  • surpasses strong competitors (e.g., T0) on 52 of 55 popular datasets spanning a variety of NLP tasks (classification, information extraction, retrieval, generation, etc.)
  • scores 40 points higher than the average student and 15 points higher than GPT-3 on the Gaokao English exam, with only 1/16 as many parameters as GPT-3. In particular, Qin achieved 138.5 points (out of 150) on the 2018 English exam
  • In such a pre-training paradigm,

    • Data-centric pre-training: the role of data is re-emphasized, and model pre-training and fine-tuning on downstream tasks are viewed as a process of data storing and data accessing
    • Pre-training over JSON instead of TEXT: a good storage mechanism should not only be able to cache a large amount of data but also take ease of access into account
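The "pre-training over JSON instead of TEXT" idea above can be sketched as restructuring a structured signal record into an input/output text pair for seq2seq training. The sketch below is only an illustration of that idea, assuming hypothetical field names and prompt wording; it is not the paper's actual implementation.

```python
# Minimal sketch of restructuring a JSON-style signal into a seq2seq
# training pair. Field names and prompt templates are illustrative
# assumptions, not the exact format used by RST.

def restructure(signal: dict) -> tuple[str, str]:
    """Turn one structured signal record into an (input, target) text pair."""
    if signal["type"] == "sentiment":
        # e.g., a (review, rating) signal mined from Rotten Tomatoes
        source = f"TEXT: {signal['review']} QUERY: Is this review \"positive\" or \"negative\"?"
        target = signal["rating"]
    elif signal["type"] == "summary":
        # e.g., a (text, summary) signal mined from DailyMail or arXiv
        source = f"TEXT: {signal['text']} QUERY: Summarize the text."
        target = signal["summary"]
    else:
        raise ValueError(f"unknown signal type: {signal['type']}")
    return source, target

record = {"type": "sentiment", "review": "best skillet ever", "rating": "positive"}
src, tgt = restructure(record)
print(src)
print(tgt)
```

Because every signal is stored as structured JSON rather than raw text, the same record can be re-accessed with different prompts for different downstream tasks.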

Model Description

We release all the models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.

| Model | Description | Recommended Application |
| --- | --- | --- |
| rst-all-11b | Trained with all the signals below except the signals used to train the Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
| rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge-intensive tasks, information extraction tasks, factual checker |
| rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
| rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
| rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains |
| rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
| rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
| rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
| rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
| rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
| rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the rst-natural-language-inference-11b model | General multiple-choice question answering |
| rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
| rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction and other text generation tasks |
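Since the table above recommends a specialized checkpoint per application, one way to use it programmatically is a small routing helper. This is only a sketch: the mapping below covers a few rows of the table, and it assumes the specialized checkpoints live under the same `XLab/` namespace as the `XLab/rst-all-11b` checkpoint used in the example below.

```python
# Illustrative lookup from task type to the specialized RST checkpoint
# recommended in the table above. The mapping is simplified (see the table
# for the full list), and the "XLab/" repo namespace is an assumption
# carried over from the XLab/rst-all-11b example.

RECOMMENDED_MODEL = {
    "sentiment classification": "XLab/rst-sentiment-classification-11b",
    "summarization": "XLab/rst-summarization-11b",
    "named entity recognition": "XLab/rst-information-extraction-11b",
    "natural language inference": "XLab/rst-natural-language-inference-11b",
    "word sense disambiguation": "XLab/rst-word-sense-disambiguation-11b",
}

def pick_model(task: str) -> str:
    # Fall back to the general-purpose checkpoint when no specialist matches,
    # mirroring the table's advice to prefer specialized models when available.
    return RECOMMENDED_MODEL.get(task.lower(), "XLab/rst-all-11b")

print(pick_model("Summarization"))
print(pick_model("question generation"))
```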

Have a try?

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")

inputs = tokenizer.encode("TEXT: this is the best cast iron skillet you will ever buy. QUERY: Is this review \"positive\" or \"negative\"", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```

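The input in the snippet above follows a TEXT/QUERY pattern. A small helper for composing such prompts might look like the sketch below; note that RST's exact prompt templates for other tasks may differ, so treat the wording as an assumption.

```python
# Sketch of a helper for composing the TEXT/QUERY prompt format used in
# the example above. Prompt wording for tasks other than sentiment
# classification is an assumption, not a documented RST template.

def make_prompt(text: str, query: str) -> str:
    return f"TEXT: {text} QUERY: {query}"

prompt = make_prompt(
    "this is the best cast iron skillet you will ever buy.",
    'Is this review "positive" or "negative"',
)
print(prompt)
```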

Data for reStructured Pre-training

This dataset is a precious resource containing a variety of naturally occurring signals. Any downstream task you can imagine (e.g., the Gaokao English exam mentioned in the RST paper) can benefit from pre-training on some of the signals we provide. We spent several months collecting the following 29 signal types, totaling 46,926,447 data samples. We hope this dataset becomes a valuable asset for NLP research.

We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples per signal type. If you want all the samples we collected, please fill out this form. Specifically, we collected the following signals.

We would be very happy :smiley: to know if this resource is helpful for your work, and please cite our work :blush:
| Mine | Signal | #Sample | Use in DataLab | Some Applications |
| --- | --- | --- | --- | --- |
| Rotten Tomatoes | (review, rating) | 5,311,109 | load_dataset("rst", "rotten_tomatoes_sentiment") | Sentiment classification |
| DailyMail | (text, category) | 899,904 | load_dataset("rst", "daily_mail_category") | Topic classification |
| DailyMail | (title, text, summary) | 1,026,616 | load_dataset("rst", "daily_mail_summary") | Summarization; Sentence expansion |
| DailyMail | (text, events) | 1,006,412 | load_dataset("rst", "daily_mail_temporal") | Temporal reasoning |
| Wikidata | (entity, entity_type, text) | 2,214,274 | load_dataset("rst", "wikidata_entity") | Entity typing |
| Wikidata | (subject, object, relation, text) | 1,526,674 | load_dataset("rst", "wikidata_relation") | Relation extraction; Fact retrieval |
| wikiHow | (text, category) | 112,109 | load_dataset("rst", "wikihow_text_category") | Topic classification |
| wikiHow | (low_category, high_category) | 4,868 | load_dataset("rst", "wikihow_category_hierarchy") | Relation extraction; Commonsense reasoning |
| wikiHow | (goal, steps) | 47,956 | load_dataset("rst", "wikihow_goal_step") | Intent detection |
| wikiHow | (text, summary) | 703,278 | load_dataset("rst", "wikihow_summary") | Summarization; Sentence expansion |
| wikiHow | (goal, first_step, second_step) | 47,787 | load_dataset("rst", "wikihow_procedure") | Temporal reasoning |
| wikiHow | (question, description, answer, related_questions) | 47,705 | load_dataset("rst", "wikihow_question") | Question generation |
| Wikipedia | (text, entities) | 22,231,011 | load_dataset("rst", "wikipedia_entities") | Entity recognition |
| Wikipedia | (texts, titles) | 3,296,225 | load_dataset("rst", "wikipedia_sections") | Summarization |
| WordNet | (word, sentence, pos) | 27,123 | load_dataset("rst", "wordnet_pos") | Part-of-speech tagging |
| WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | load_dataset("rst", "wordnet_meaning") | Word sense disambiguation |
| WordNet | (word, sentence, synonyms) | 17,804 | load_dataset("rst", "wordnet_synonym") | Paraphrasing |
| WordNet | (word, sentence, antonyms) | 6,408 | load_dataset("rst", "wordnet_antonym") | Negation |
| ConTRoL | (premise, hypothesis, label) | 8,323 | load_dataset("rst", "qa_control") | Natural language inference |
| DREAM | (context, question, options, answer) | 9,164 | load_dataset("rst", "qa_dream") | Reading comprehension |
| LogiQA | (context, question, options, answer) | 7,974 | load_dataset("rst", "qa_logiqa") | Reading comprehension |
| ReClor | (context, question, options, answer) | 5,138 | load_dataset("rst", "qa_reclor") | Reading comprehension |
| RACE | (context, question, options, answer) | 44,880 | load_dataset("rst", "qa_race") | Reading comprehension |
| RACE-C | (context, question, options, answer) | 5,093 | load_dataset("rst", "qa_race_c") | Reading comprehension |
| TriviaQA | (context, question, answer) | 46,636 | load_dataset("rst", "qa_triviaqa") | Reading comprehension |
| arXiv | (text, category) | 1,696,348 | load_dataset("rst", "arxiv_category") | Topic classification |
| arXiv | (text, summary) | 1,696,348 | load_dataset("rst", "arxiv_summary") | Summarization; Sentence expansion |
| Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | load_dataset("rst", "paperswithcode_entity") | Entity recognition |
| Paperswithcode | (text, summary) | 120,924 | load_dataset("rst", "paperswithcode_summary") | Summarization; Sentence expansion |

BibTeX for citation information

    @article{yuan2022restructured,
      title={reStructured Pre-training},
      author={Yuan, Weizhe and Liu, Pengfei},
      journal={arXiv preprint arXiv:2206.11147},
      year={2022}
    }