T5-small fine-tuned on WikiSQL

Google's T5 small fine-tuned on WikiSQL for English to SQL translation.

Details of T5

The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Here is the abstract:

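Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.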

Details of the dataset 📚

Dataset ID: wikisql, from Huggingface/NLP

Dataset	Split	# samples
wikisql	train	56355
wikisql	valid	14436

How to load it from nlp

# Note: the `nlp` library has since been renamed to `datasets`
import nlp

train_dataset = nlp.load_dataset('wikisql', split=nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset('wikisql', split=nlp.Split.VALIDATION)

Learn more about this dataset and others in the NLP Viewer
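For fine-tuning a text-to-text model, each record has to become a pair of strings: a prefixed question as input and the SQL query as target. The following is a minimal sketch of that step, not the actual preprocessing used for this model. It assumes each WikiSQL record exposes a `question` field and an `sql["human_readable"]` field; `to_text_pair` is a hypothetical helper, and the sample record is hand-written for illustration.

```python
# Sketch: turn one WikiSQL record into an (input, target) text pair.
# Assumption: records expose "question" and sql["human_readable"] fields.

PREFIX = "translate English to SQL: "  # same prefix this card uses at inference time


def to_text_pair(example):
    """Return source and target strings for one example (hypothetical helper)."""
    return {
        "input_text": PREFIX + example["question"].strip(),
        "target_text": example["sql"]["human_readable"].strip(),
    }


# Hand-written record in the assumed shape, for illustration only
sample = {
    "question": "What is the capital of Spain?",
    "sql": {"human_readable": "SELECT Capital FROM table WHERE Country = Spain"},
}

pair = to_text_pair(sample)
print(pair["input_text"])
print(pair["target_text"])
```

If the field names match, a function like this could be applied to a whole split with the dataset's `map` method, for example `train_dataset.map(to_text_pair)`.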

Model fine-tuning 🏋️‍

The training script is a slightly modified version of this Colab Notebook created by Suraj Patil, so all credits to him!

Model in action 🚀

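# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the class for T5
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-small-finetuned-wikiSQL")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-small-finetuned-wikiSQL")

def get_sql(query):
  # Prefix the question with the task description the model was fine-tuned on
  input_text = "translate English to SQL: %s </s>" % query
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'],
                          attention_mask=features['attention_mask'])

  # Drop special tokens such as <pad> and </s> from the decoded text
  return tokenizer.decode(output[0], skip_special_tokens=True)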

query = "How many millions of params there are in HF-hub?"

get_sql(query)

# output: 'SELECT COUNT Params FROM table WHERE Location = HF-hub'
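To translate several questions at once, the same steps can be batched. The sketch below is one possible way to do it, not part of the original card: `get_sql_batch` is a hypothetical helper, `max_length=64` is an arbitrary illustrative limit, and the prompt format is copied from `get_sql` above. The tokenizer and model are passed in as arguments, so it works with the objects loaded earlier.

```python
def get_sql_batch(queries, tokenizer, model, max_length=64):
    """Translate several English questions in one generate() call (hypothetical helper)."""
    # Same prompt format as the single-query example in this card
    prompts = ["translate English to SQL: %s </s>" % q for q in queries]

    # padding=True pads every prompt to the longest one so they form one batch
    features = tokenizer(prompts, return_tensors="pt", padding=True)

    outputs = model.generate(
        input_ids=features["input_ids"],
        attention_mask=features["attention_mask"],
        max_length=max_length,  # illustrative cap on generated tokens (assumption)
    )

    # skip_special_tokens removes markers such as <pad> and </s> from the text
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

Usage would look like `get_sql_batch(["How many rows are there?", query], tokenizer, model)`, returning one SQL string per question in the same order.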

Created by Manuel Romero/@mrm8488 | LinkedIn

Made with ♥ in Spain

Author:

Manuel Romero

Dataset size:

838.95 MB