Model:
pszemraj/flan-t5-large-grammar-synthesis
A fine-tuned version of google/flan-t5-large for grammar correction, trained on an expanded dataset. A demo is available on HF Spaces.
Compare it with the original grammar-synthesis-large.
A Colab notebook implementing a basic version already exists (click the Open in Colab button).
Before running the code below, install transformers:
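For example:

pip install -U transformers
# any recent transformers release should work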
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)
raw_text = 'i can has cheezburger'
results = corrector(raw_text)
print(results)
For batch inference: see this discussion thread for details, but in short, the dataset was built from multiple sentences at a time, so it is recommended to run inference the same way: batches of 64-96 tokens (or 2-3 sentences, split with a regex), as sketched below.
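A minimal sketch of that chunking approach, assuming the pipeline from above; split_sentences and correct_long_text are illustrative helpers, not part of the model card:

import re
from transformers import pipeline

corrector = pipeline(
    'text2text-generation',
    'pszemraj/flan-t5-large-grammar-synthesis',
)

def split_sentences(text):
    # naive splitter: break on ., !, or ? followed by whitespace
    return re.split(r'(?<=[.!?])\s+', text.strip())

def correct_long_text(text, sents_per_chunk=3):
    # group 2-3 sentences per input, mirroring how the training data was built
    sents = split_sentences(text)
    chunks = [' '.join(sents[i:i + sents_per_chunk])
              for i in range(0, len(sents), sents_per_chunk)]
    outputs = corrector(chunks)  # the pipeline accepts a list of inputs
    return ' '.join(o['generated_text'] for o in outputs)

print(correct_long_text('i can has cheezburger. he go to store yesterday. they was happy.'))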
The goal is a text2text language model that can perform "single-shot grammar correction" on potentially grammatically incorrect text, without semantically changing text/information that is already grammatically correct.
Compare some of the heavier-error examples against other grammar correction models to see the difference :)
The model has been converted to ONNX and can be loaded/used with Hugging Face's optimum library.
First, install optimum.
pip install optimum[onnxruntime]
# ^ if you want to use a different runtime, read their docs
Load it with the optimum pipeline:
from optimum.pipelines import pipeline

# the checkpoint name was not defined in the original snippet
corrector_model_name = "pszemraj/flan-t5-large-grammar-synthesis"

corrector = pipeline(
    "text2text-generation",
    model=corrector_model_name,
    accelerator="ort",
)
# use as normal
If trading off some grammar-correction quality for faster inference works for your use case, check out the base and small checkpoints fine-tuned from the corresponding t5 checkpoints.
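Swapping in a smaller checkpoint is a one-line change. A sketch; the repo name below is an assumption based on the related grammar-synthesis checkpoints, so verify it on the Hub:

from transformers import pipeline

# assumed checkpoint name for the smaller variant; confirm on the Hub
corrector_small = pipeline(
    'text2text-generation',
    'pszemraj/grammar-synthesis-small',
)
print(corrector_small('i can has cheezburger'))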
Obviously, this section is fairly general, since "universal single-shot grammar correction" has many potential uses. Some ideas or use cases:
Original response:

ive heard it attributed to a bunch of different philosophical schools, including stoicism, pragmatism, existentialism and even some forms of post-structuralism. i think one of the most interesting (and most difficult) philosophical problems is trying to let dogs (or other animals) out of cages. the reason why this is a difficult problem is because it seems to go against our grain (so to

synthesizing took 306.12 seconds
Final response in 1294.857 s:

I've heard it attributed to a bunch of different philosophical schools, including solipsism, pragmatism, existentialism and even some forms of post-structuralism. i think one of the most interesting (and most difficult) philosophical problems is trying to let dogs (or other animals) out of cages. the reason why this is a difficult problem is because it seems to go against our grain (so to speak)
Note: in this chatbot setup I also have some extra logic that removes the period at the end of the final sentence, to avoid coming off as passive aggressive.
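That logic isn't shown here; a minimal sketch of what such post-processing could look like (soften_final_sentence is a hypothetical helper):

def soften_final_sentence(reply):
    # drop a single trailing period so the closing sentence reads less curt;
    # leave '...', '?', and '!' untouched
    reply = reply.rstrip()
    if reply.endswith('.') and not reply.endswith('..'):
        reply = reply[:-1]
    return reply

print(soften_final_sentence('Sounds good to me.'))  # -> 'Sounds good to me'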
If you find this fine-tuned model useful for your work, please consider citing it :)
@misc{peter_szemraj_2022,
    author    = { {Peter Szemraj} },
    title     = { flan-t5-large-grammar-synthesis (Revision d0b5ae2) },
    year      = 2022,
    url       = { https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis },
    doi       = { 10.57967/hf/0138 },
    publisher = { Hugging Face }
}