Model:

1-800-BAD-CODE/punctuation_fullstop_truecase_english

English

Model Overview

This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in one pass.

Unlike many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, and can predict arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

Widget note: the text-generation widget does not seem to respect line breaks. Instead, the model inserts a newline token \n wherever it predicts a sentence boundary (line break).

Usage

The easiest way to use this model is to install punctuators:

pip install punctuators

If you have any issues with this package, please let me know in the community tab (I update it for each model, and it is error-prone).

Let's punctuate my weekend recap, plus a few sentences with interesting acronyms and abbreviations that I made up or found on Wikipedia:

Example Usage
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    # Literally my weekend
    "i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
    "despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
    "when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars",
    # Real acronyms in sentences that I made up
    "george hw bush was the president of the us for 8 years",
    "i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
    # See how the model performs on made-up acronyms 
    "i went to the fgw store and bought a new tg optical scope",
    # First few sentences from today's featured article summary on wikipedia
    "it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality"
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

Actual outputs may vary between model versions; the following are the current outputs:

Expected Output
In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
    Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
    Out: We like to take morning adventures on the weekends.

In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
    Out: Despite being mid March, it snowed overnight and into the morning.
    Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.

In: when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars
    Out: When I got home, I trained this model on the Lambda Cloud.
    Out: On an A100 GPU with about 10 million lines of text, the total budget was less than 5 dollars.

In: george hw bush was the president of the us for 8 years
    Out: George H.W. Bush was the president of the U.S. for 8 years.

In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
    Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
    Out: I suggested he get one of those new battery operated ones.
    Out: They're so much quieter.

In: i went to the fgw store and bought a new tg optical scope
    Out: I went to the FGW store and bought a new TG optical scope.

In: it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality
    Out: It's that man again.
    Out: ITMA was a radio comedy programme that was broadcast by the BBC for Twelve Series from 1939 to 1949, featuring Tommy Handley.
    Out: In the central role, ITMA was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations.
    Out: Parts of the scripts were rewritten in the hours before the broadcast to ensure topicality.

Model Details

This model implements the graph shown below, with brief descriptions of each step following.

  • Encoding: The model begins by tokenizing the text with a subword tokenizer. The tokenizer used here is a SentencePiece model with a vocabulary size of 32k. Next, the input sequence is encoded with a base-sized Transformer: 6 layers with a model dimension of 512.

  • Punctuation: The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens. Punctuation is predicted once per subword, which allows acronyms to be punctuated correctly. An indirect benefit of per-subword prediction is that the model can run in a graph generalized for continuous-script languages such as Chinese.

  • Sentence boundary detection: For sentence boundary detection, we condition the model on punctuation via embeddings. Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation. The SBD head analyzes both the encoding of the unpunctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.

  • Shift and concat sentence boundaries: In English, the first character of each sentence should be upper-cased, so we should feed the sentence boundary information to the true-case classification network. Since the true-case network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence. Therefore, we shift the binary sentence boundary decisions to the right by one: if token N-1 is a sentence boundary, then token N is the first word of a sentence. After concatenating this with the encoded text, each time step contains whether it is the first word of a sentence, as predicted by the SBD head.

  • True-case prediction: Armed with punctuation and sentence boundaries, a classification network predicts true-casing. Since true-casing should be done on a per-character basis, the network makes N predictions per token, where N is the length of the subword (in practice, N is the longest possible subword, and extra predictions are ignored). This scheme captures acronyms such as "NATO" as well as bi-capitalized words such as "MacDonald". (A code sketch of this head wiring follows this list.)

  • The model's maximum length is 256 subwords, due to the trained embeddings. However, as noted above, the punctuators package transparently predicts on overlapping subsegments of long inputs and fuses the results before returning output, so inputs can be arbitrarily long.
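
To make the data flow concrete, here is a minimal PyTorch sketch of the head wiring described above. Everything here (the single-linear heads, the argmax decoding, the embedding and subword sizes) is an illustrative assumption, not the model's actual implementation:

import torch
import torch.nn as nn

B, T, D = 2, 8, 512       # batch, subword time steps, encoder dim (D=512 per above)
NUM_PUNCT = 5             # NULL, ACRONYM, ".", ",", "?"
PUNCT_EMB = 4             # punctuation embedding size (assumed)
MAX_SUBWORD = 16          # max characters per subword for true-casing (assumed)

encoded = torch.randn(B, T, D)  # stand-in for the Transformer encoder output

# 1. Punctuation: one prediction per subword token.
punct_head = nn.Linear(D, NUM_PUNCT)
punct_ids = punct_head(encoded).argmax(dim=-1)               # (B, T)

# 2. SBD: select an embedding for each punctuation prediction and
#    concatenate it to the encoded representation.
punct_emb = nn.Embedding(NUM_PUNCT, PUNCT_EMB)
sbd_in = torch.cat([encoded, punct_emb(punct_ids)], dim=-1)  # (B, T, D + PUNCT_EMB)
sbd_head = nn.Linear(D + PUNCT_EMB, 2)
boundaries = sbd_head(sbd_in).argmax(dim=-1)                 # (B, T); 1 = boundary

# 3. Shift boundaries right by one: if token N-1 ends a sentence,
#    token N begins one. The first token always begins a sentence.
first_word = torch.roll(boundaries, shifts=1, dims=1)
first_word[:, 0] = 1

# 4. True-casing: concatenate the shifted boundary flag and predict one
#    upper/lower logit per character slot; extra slots are ignored.
case_in = torch.cat([encoded, first_word.unsqueeze(-1).float()], dim=-1)
case_head = nn.Linear(D + 1, MAX_SUBWORD)
case_logits = case_head(case_in)                             # (B, T, MAX_SUBWORD)

print(punct_ids.shape, boundaries.shape, case_logits.shape)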

    Punctuation Tokens

    The model predicts the following set of punctuation tokens:

    Token      Description
    NULL       Predict no punctuation
    ACRONYM    Every character in this subword ends with a period
    .          Latin full stop
    ,          Latin comma
    ?          Latin question mark
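
    To illustrate how these tokens would be applied during decoding, a hypothetical helper might expand each subword as follows (the function name and plain-text token spellings are assumptions for this sketch):

    def apply_punct_token(subword: str, token: str) -> str:
        """Hypothetical decoding of one predicted punctuation token."""
        if token == "ACRONYM":
            # Every character in the subword ends with a period: "hw" -> "h.w."
            return "".join(ch + "." for ch in subword)
        if token == "NULL":
            return subword  # no punctuation
        return subword + token  # ".", ",", or "?"

    print(apply_punct_token("hw", "ACRONYM"))  # h.w. (true-casing is applied separately)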

    Training Details

    Training Framework

    This model was trained on a fork of the NeMo framework.

    Training Data

    This model was trained with News Crawl data from WMT.

    Approximately 10M lines were used, drawn from 2021 and 2012. The latter year was included in an attempt to reduce bias: a given year's news is typically dominated by a few topics, and 2021 was dominated by COVID discussion.

    Limitations

    Domain

    This model was trained on news data, and may not perform well on conversational or informal data.

    Noisy Training Data

    The training data was noisy, and no manual cleaning was applied.

    Acronyms and Abbreviations

    Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.

    Token    Count
    Mr       115232
    Mr.      108212
    U.S.     85324
    US       37332
    U.S      354
    U.s      108
    u.S.     65

    As a result, the model's predictions for acronyms and abbreviations may be somewhat unpredictable.

    Sentence Boundary Detection Targets

    An assumption of the sentence boundary detection targets is that each line of the input data is exactly one sentence. However, a non-trivial portion of the training data contains multiple sentences per line. The SBD head may therefore miss an obvious sentence boundary when it resembles the errors seen in the training data.

    Evaluation

    When reading these metrics, keep in mind that:

  • The data is noisy.
  • Sentence boundaries and true-casing are conditioned on the predicted punctuation, which is the most difficult task and is sometimes wrong. When conditioned on the reference punctuation instead, the true-casing and SBD metrics measured against the reference targets are higher.
  • Punctuation can be subjective, e.g., "Hello Frank, how's it going?" versus "Hello Frank. How's it going?" The longer the sentences, the more of this ambiguity arises in practice, affecting all 3 metrics.

    Test Data and Example Generation

    Each test example was generated using the following procedure (a sketch of the procedure follows below):

  • Concatenate 10 random sentences.
  • Lower-case the concatenated sentence.
  • Remove all punctuation.

    The data is a held-out portion of News Crawl that was deduplicated. 3,000 lines of data were used, generating 3,000 unique examples of 10 sentences each.
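
    A sketch of this procedure, assuming a hypothetical held-out file with one sentence per line and a simple regex that strips only the punctuation set this model predicts:

    import random
    import re

    # "news_crawl_heldout.txt" is a hypothetical file name: one sentence per line.
    with open("news_crawl_heldout.txt") as f:
        sentences = [line.strip() for line in f if line.strip()]

    examples = []
    for _ in range(3000):
        text = " ".join(random.sample(sentences, 10))  # concatenate 10 random sentences
        text = text.lower()                            # lower-case
        text = re.sub(r"[.,?]", "", text)              # strip the predicted punctuation set
        examples.append(text)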

    Results

    Punctuation Report
        label                                                precision    recall       f1           support   
        <NULL> (label_id: 0)                                    98.83      98.49      98.66     446496
        <ACRONYM> (label_id: 1)                                 74.15      94.26      83.01        697
        . (label_id: 2)                                         90.64      92.99      91.80      30002
        , (label_id: 3)                                         77.19      79.13      78.15      23321
        ? (label_id: 4)                                         76.58      74.56      75.56       1022
        -------------------
        micro avg                                               97.21      97.21      97.21     501538
        macro avg                                               83.48      87.89      85.44     501538
        weighted avg                                            97.25      97.21      97.23     501538
    
    True-casing Report
    # With predicted punctuation (not aligned with targets)
        label                                                precision    recall       f1           support   
        LOWER (label_id: 0)                                     99.76      99.72      99.74    2020678
        UPPER (label_id: 1)                                     93.32      94.20      93.76      83873
        -------------------
        micro avg                                               99.50      99.50      99.50    2104551
        macro avg                                               96.54      96.96      96.75    2104551
        weighted avg                                            99.50      99.50      99.50    2104551
    
    
    # With reference punctuation (punctuation matches targets)
        label                                                precision    recall       f1           support   
        LOWER (label_id: 0)                                     99.83      99.81      99.82    2020678
        UPPER (label_id: 1)                                     95.51      95.90      95.71      83873
        -------------------
        micro avg                                               99.66      99.66      99.66    2104551
        macro avg                                               97.67      97.86      97.76    2104551
        weighted avg                                            99.66      99.66      99.66    2104551
    
    Sentence Boundary Detection Report
    # With predicted punctuation (not aligned with targets)
        label                                                precision    recall       f1           support   
        NOSTOP (label_id: 0)                                    99.59      99.45      99.52     471608
        FULLSTOP (label_id: 1)                                  91.47      93.53      92.49      29930
        -------------------
        micro avg                                               99.09      99.09      99.09     501538
        macro avg                                               95.53      96.49      96.00     501538
        weighted avg                                            99.10      99.09      99.10     501538
    
    
    # With reference punctuation (punctuation matches targets)
        label                                                precision    recall       f1           support   
        NOSTOP (label_id: 0)                                   100.00      99.97      99.98     471608
        FULLSTOP (label_id: 1)                                  99.63      99.93      99.78      32923
        -------------------
        micro avg                                               99.97      99.97      99.97     504531
        macro avg                                               99.81      99.95      99.88     504531
        weighted avg                                            99.97      99.97      99.97     504531
        
    

    Fun Facts

    This section presents some fun facts.

    Embeddings

    Let's examine the embeddings (see the graph above) to see whether the model uses them in a meaningful way.

    Here we show the cosine similarity between the embedding of each token:

               NULL  ACRONYM     .      ,      ?
    NULL       1.00
    ACRONYM   -0.49     1.00
    .         -1.00     0.48   1.00
    ,          1.00    -0.48  -1.00   1.00
    ?         -1.00     0.49   1.00  -1.00   1.00
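
    Such a table can be computed from the learned embedding matrix. A sketch with a random stand-in matrix (extracting the real weights from the ONNX model is omitted here):

    import numpy as np

    tokens = ["NULL", "ACRONYM", ".", ",", "?"]
    emb = np.random.randn(len(tokens), 4)  # stand-in for the trained embedding matrix

    # Normalize the rows; pairwise dot products then give cosine similarities.
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = normed @ normed.T
    for tok, row in zip(tokens, cos):
        print(f"{tok:8s}", " ".join(f"{v:5.2f}" for v in row))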

    Recall that these embeddings are used to predict sentence boundaries... so we should expect the full stops to cluster.

    Indeed, we see that NULL and "," are exactly the same, since neither implies a sentence boundary.

    Next, we see that "." and "?" are exactly the same, because with respect to sentence boundary detection they are identical: a strong implication of a full stop. (Though we may have expected some difference between these tokens, given that "." is also predicted after abbreviations, e.g., "Mr.", that are not full stops.)

    Further, we see that "." and "?" are the exact opposite of NULL. This is expected, since these tokens typically imply sentence boundaries, whereas NULL and "," never do.

    Lastly, we see that ACRONYM is similar to, but not exactly the same as, "." and "?", and far from, but not the exact opposite of, NULL and ",". This is expected, since an acronym can be a full stop ("I live in the northern U.S.") or not ("It's 5 a.m. and I'm tired").