Models:
GAIR/rst-all-11b
Official repository, paper, easter eggs
RST is a new language pre-training paradigm that unifies 26 different types of signals from 10 data sources around the world (Rotten Tomatoes, Daily Mail, Wikipedia, Wikidata, wikiHow, WordNet, arXiv, etc.) in a structured way and pre-trains over them with a single unified model.

We release all models introduced in our paper, covering 13 different application scenarios. Each model contains 11 billion parameters.
Model | Description | Recommended Application |
---|---|---|
rst-all-11b | Trained with all the signals below except signals that are used to train Gaokao models | All applications below (specialized models are recommended first if high performance is preferred) |
rst-fact-retrieval-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym, wikiHow category hierarchy, Wikidata relation, Wikidata entity typing, Paperswithcode entity typing | Knowledge intensive tasks, information extraction tasks,factual checker |
rst-summarization-11b | Trained with the following signals: DailyMail summary, Paperswithcode summary, arXiv summary, wikiHow summary | Summarization or other general generation tasks, meta-evaluation (e.g., BARTScore) |
rst-temporal-reasoning-11b | Trained with the following signals: DailyMail temporal information, wikiHow procedure | Temporal reasoning, relation extraction, event-based extraction |
rst-information-extraction-11b | Trained with the following signals: Paperswithcode entity, Paperswithcode entity typing, Wikidata entity typing, Wikidata relation, Wikipedia entity | Named entity recognition, relation extraction and other general IE tasks in the news, scientific or other domains |
rst-intent-detection-11b | Trained with the following signals: wikiHow goal-step relation | Intent prediction, event prediction |
rst-topic-classification-11b | Trained with the following signals: DailyMail category, arXiv category, wikiHow text category, Wikipedia section title | General text classification |
rst-word-sense-disambiguation-11b | Trained with the following signals: WordNet meaning, WordNet part-of-speech, WordNet synonym, WordNet antonym | Word sense disambiguation, part-of-speech tagging, general IE tasks, common sense reasoning |
rst-natural-language-inference-11b | Trained with the following signals: ConTRoL dataset, DREAM dataset, LogiQA dataset, RACE & RACE-C dataset, ReClor dataset, DailyMail temporal information | Natural language inference, multiple-choice question answering, reasoning |
rst-sentiment-classification-11b | Trained with the following signals: Rotten Tomatoes sentiment, Wikipedia sentiment | Sentiment classification, emotion classification |
rst-gaokao-rc-11b | Trained with multiple-choice QA datasets that are used to train the rst-natural-language-inference-11b model | General multiple-choice question answering |
rst-gaokao-cloze-11b | Trained with manually crafted cloze datasets | General cloze filling |
rst-gaokao-writing-11b | Trained with example essays from past Gaokao-English exams and grammar error correction signals | Essay writing, story generation, grammar error correction and other text generation tasks |
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("XLab/rst-all-11b")
model = AutoModelForSeq2SeqLM.from_pretrained("XLab/rst-all-11b")

inputs = tokenizer.encode(
    'TEXT: this is the best cast iron skillet you will ever buy. '
    'QUERY: Is this review "positive" or "negative"',
    return_tensors="pt",
)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
```
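The prompt in the snippet above follows a simple TEXT/QUERY template: the passage goes after `TEXT:` and the question after `QUERY:`. A minimal sketch of building such prompts (the `make_rst_prompt` helper is ours, not part of the released code):

```python
def make_rst_prompt(text: str, query: str) -> str:
    """Concatenate a passage and a question into the TEXT/QUERY prompt format."""
    return f"TEXT: {text} QUERY: {query}"

prompt = make_rst_prompt(
    "this is the best cast iron skillet you will ever buy.",
    'Is this review "positive" or "negative"',
)
print(prompt)
```

The resulting string is what gets passed to `tokenizer.encode(...)` above; swapping the QUERY lets the same model serve the other application scenarios listed in the table.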
This dataset is a precious resource containing a variety of naturally occurring signals. Any downstream task you can think of (e.g., the Gaokao English exam mentioned in the RST paper) can benefit from pre-training on some of the signals we provide. We spent several months collecting the following 29 signal types, totaling 46,926,447 data samples. We hope this dataset can become a valuable asset for natural language processing research.

We provide the collected signals through DataLab. For efficiency, we provide at most 50,000 samples per signal type. If you want all the samples we collected, please fill out this form. Specifically, we collected the following signals.
We would be very happy :smiley: to know if this resource helps your work; please cite our work :blush:

Source | Signal | #Sample | Use in DataLab | Some Applications |
---|---|---|---|---|
Rotten Tomatoes | (review, rating) | 5,311,109 | load_dataset("rst", "rotten_tomatoes_sentiment") | Sentiment classification |
Daily Mail | (text, category) | 899,904 | load_dataset("rst", "daily_mail_category") | Topic classification |
Daily Mail | (title, text, summary) | 1,026,616 | load_dataset("rst", "daily_mail_summary") | Summarization; Sentence expansion |
Daily Mail | (text, events) | 1,006,412 | load_dataset("rst", "daily_mail_temporal") | Temporal reasoning |
Wikidata | (entity, entity_type, text) | 2,214,274 | load_dataset("rst", "wikidata_entity") | Entity typing |
Wikidata | (subject, object, relation, text) | 1,526,674 | load_dataset("rst", "wikidata_relation") | Relation extraction; Fact retrieval |
wikiHow | (text, category) | 112,109 | load_dataset("rst", "wikihow_text_category") | Topic classification |
wikiHow | (low_category, high_category) | 4,868 | load_dataset("rst", "wikihow_category_hierarchy") | Relation extraction; Commonsense reasoning |
wikiHow | (goal, steps) | 47,956 | load_dataset("rst", "wikihow_goal_step") | Intent detection |
wikiHow | (text, summary) | 703,278 | load_dataset("rst", "wikihow_summary") | Summarization; Sentence expansion |
wikiHow | (goal, first_step, second_step) | 47,787 | load_dataset("rst", "wikihow_procedure") | Temporal reasoning |
wikiHow | (question, description, answer, related_questions) | 47,705 | load_dataset("rst", "wikihow_question") | Question generation |
Wikipedia | (text, entities) | 22,231,011 | load_dataset("rst", "wikipedia_entities") | Entity recognition |
Wikipedia | (texts, titles) | 3,296,225 | load_dataset("rst", "wikipedia_sections") | Summarization |
WordNet | (word, sentence, pos) | 27,123 | load_dataset("rst", "wordnet_pos") | Part-of-speech tagging |
WordNet | (word, sentence, meaning, possible_meanings) | 27,123 | load_dataset("rst", "wordnet_meaning") | Word sense disambiguation |
WordNet | (word, sentence, synonyms) | 17,804 | load_dataset("rst", "wordnet_synonym") | Paraphrasing |
WordNet | (word, sentence, antonyms) | 6,408 | load_dataset("rst", "wordnet_antonym") | Negation |
ConTRoL | (premise, hypothesis, label) | 8,323 | load_dataset("rst", "qa_control") | Natural language inference |
DREAM | (context, question, options, answer) | 9,164 | load_dataset("rst", "qa_dream") | Reading comprehension |
LogiQA | (context, question, options, answer) | 7,974 | load_dataset("rst", "qa_logiqa") | Reading comprehension |
ReClor | (context, question, options, answer) | 5,138 | load_dataset("rst", "qa_reclor") | Reading comprehension |
RACE | (context, question, options, answer) | 44,880 | load_dataset("rst", "qa_race") | Reading comprehension |
RACE-C | (context, question, options, answer) | 5,093 | load_dataset("rst", "qa_race_c") | Reading comprehension |
TriviaQA | (context, question, answer) | 46,636 | load_dataset("rst", "qa_triviaqa") | Reading comprehension |
arXiv | (text, category) | 1,696,348 | load_dataset("rst", "arxiv_category") | Topic classification |
arXiv | (text, summary) | 1,696,348 | load_dataset("rst", "arxiv_summary") | Summarization; Sentence expansion |
Paperswithcode | (text, entities, datasets, methods, tasks, metrics) | 4,731,233 | load_dataset("rst", "paperswithcode_entity") | Entity recognition |
Paperswithcode | (text, summary) | 120,924 | load_dataset("rst", "paperswithcode_summary") | Summarization; Sentence expansion |
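A minimal sketch of pulling one of the signals above through DataLab, assuming the `datalabs` package is installed (`pip install datalabs`) and network access is available; the `rst_signal` helper is ours:

```python
def rst_signal(config: str) -> tuple:
    """Return the (path, name) pair that load_dataset expects for an RST signal."""
    return ("rst", config)

pair = rst_signal("wikihow_goal_step")

if __name__ == "__main__":
    # Downloads up to 50,000 samples of the wikiHow goal-step signal.
    from datalabs import load_dataset

    dataset = load_dataset(*pair)
    print(dataset["train"][0])  # one (goal, steps) sample
```

Any other row of the table works the same way; only the config string (second table column's `load_dataset` argument) changes.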
```bibtex
@article{yuan2022restructured,
  title={reStructured Pre-training},
  author={Yuan, Weizhe and Liu, Pengfei},
  journal={arXiv preprint arXiv:2206.11147},
  year={2022}
}
```