




  • Spanish
  • French
  • Portuguese
  • Catalan
  • Italian
  • Romanian




  • ¿
  • ACRONYM

Although acronyms are rarer in these languages than in English, the special ACRONYM token allows a token to be fully punctuated, e.g. "pm" → "p.m.".
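One way to read the ACRONYM token: when it is predicted for a token, a period goes after every character rather than only after the last one. A minimal sketch of that expansion follows; `expand_acronym` is a hypothetical helper written for illustration and is not part of the punctuators package.

```python
def expand_acronym(token: str) -> str:
    """Place a period after every character of the token."""
    return "".join(f"{char}." for char in token)


print(expand_acronym("pm"))  # p.m.
```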




A simple way to use this model is to install punctuators:

pip install punctuators


from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate, at least one per language
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
    "ciao amico come va oggi è stata una giornata piovosa",
    "olá amigo como tá indo estava chuvoso hoje",
    "salut l'ami comment ça va il pleuvait aujourd'hui",
    "salut prietene cum stă treaba azi a fost ploios",
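]
# Each input yields a list of sentences (punctuated, true-cased, and segmented)
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    for text in output_texts:
        print(f"    {text}")
    print()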

Input: este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt
    Este modelo fue entrenado en un GPU A100.
    En realidad, no se que dice esta frase lo traduje con NMT.

Input: hola amigo cómo estás es un día lluvioso hoy
    Hola, amigo.
    ¿Cómo estás?
    Es un día lluvioso hoy.

Input: hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat
    Hola, amic.
    Com va avui?
    Ha estat un dia plujós.
    El català prediu massa puntuació per com s'ha entrenat.

Input: ciao amico come va oggi è stata una giornata piovosa
    Ciao amico, come va?
    Oggi è stata una giornata piovosa.

Input: olá amigo como tá indo estava chuvoso hoje
    Olá, amigo, como tá indo?
    Estava chuvoso hoje.

Input: salut l'ami comment ça va il pleuvait aujourd'hui
    Salut l'ami.
    Comment ça va?
    Il pleuvait aujourd'hui.

Input: salut prietene cum stă treaba azi a fost ploios
    Salut prietene, cum stă treaba azi?
    A fost ploios.


input_texts: List[str] = [
    "hola amigo cómo estás es un día lluvioso hoy",
]
# With sentence boundary detection disabled, each input yields a single string
results: List[str] = m.infer(input_texts, apply_sbd=False)
print(results[0])


Hola, amigo. ¿Cómo estás? Es un día lluvioso hoy.



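Catalan is not included in StatMT's News Crawl. To cover the Romance language family fully, about 500k lines of OpenSubtitles text were used for Catalan. As a result, performance on Catalan may be worse, and the model may over-predict punctuation and sentence breaks, which is typical of OpenSubtitles.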




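The maximum length during training was 256 subtokens. The punctuators package can punctuate inputs of any length. It does this by splitting the input into overlapping 256-token sub-segments and merging the results.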
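The splitting and merging can be sketched as follows. This is an illustration only, not the punctuators implementation: the function names, the overlap of 32 tokens, and the rule of cutting each overlap at its midpoint are all assumptions.

```python
from typing import List, Sequence, Tuple


def split_overlapping(num_tokens: int, max_len: int = 256, overlap: int = 32) -> List[Tuple[int, int]]:
    """Return [start, stop) windows of at most max_len tokens; neighbours share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        stop = min(start + max_len, num_tokens)
        windows.append((start, stop))
        if stop >= num_tokens:
            break
        start += step
    return windows


def merge_predictions(window_preds: Sequence[Sequence], overlap: int = 32) -> list:
    """Stitch per-window predictions back together, cutting each overlap at its midpoint."""
    merged = []
    last = len(window_preds) - 1
    for i, preds in enumerate(window_preds):
        # Drop the first half of the overlap, except in the first window
        lo = 0 if i == 0 else overlap // 2
        # Drop the second half of the overlap, except in the last window
        hi = len(preds) if i == last else len(preds) - (overlap - overlap // 2)
        merged.extend(preds[lo:hi])
    return merged


windows = split_overlapping(600)
print(windows)  # [(0, 256), (224, 480), (448, 600)]

# Stand-in "predictions": each window simply reports its own token indices
window_preds = [list(range(start, stop)) for start, stop in windows]
assert merge_predictions(window_preds) == list(range(600))
```

Cutting at the midpoint means every token takes its prediction from the window in which it sits farthest from an edge, where the model has the most context on both sides.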





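Test sets were generated from 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all other languages). Each example was built by concatenating 10 sentences, removing all punctuation, and lower-casing all letters.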
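A minimal sketch of how such examples could be built. The helper names are hypothetical, and the set of characters stripped (only the marks the model predicts) is an assumption; the actual preprocessing may differ.

```python
import re
from typing import List, Tuple

# Assumption: only the punctuation marks the model predicts are removed
PUNCTUATION = re.compile(r"[.,?¿]")


def make_example(sentences: List[str]) -> Tuple[str, str]:
    """Return (model_input, reference) for one group of sentences."""
    reference = " ".join(sentences)
    stripped = PUNCTUATION.sub("", reference).lower()
    # Collapse any whitespace left behind by removed punctuation
    model_input = " ".join(stripped.split())
    return model_input, reference


def build_test_set(lines: List[str], sentences_per_example: int = 10) -> List[Tuple[str, str]]:
    """Group held-out lines into consecutive examples of `sentences_per_example` sentences."""
    return [
        make_example(lines[i:i + sentences_per_example])
        for i in range(0, len(lines), sentences_per_example)
    ]


model_input, reference = make_example(["Hola, amigo.", "¿Cómo estás?"])
print(model_input)  # hola amigo cómo estás
print(reference)    # Hola, amigo. ¿Cómo estás?
```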

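Because punctuation is subjective, punctuation metrics can be misleading. For instance, in the examples above, the "hello friend how's it going" greeting is split into two sentences in some languages and kept as one in others, and both are reasonable.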
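When reading the reports below, it also helps to keep in mind that nearly all positions carry the <NULL> label, so the micro average mostly reflects that label. The sketch below uses toy labels (not data from this model) and assumes the reports follow the usual definitions of precision, recall, and F1; `per_label_prf` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def per_label_prf(targets: List[str], preds: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall, and F1 (as percentages) for every label seen in targets or preds."""
    scores = {}
    pairs = list(zip(targets, preds))
    for label in sorted(set(targets) | set(preds)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = 100 * tp / (tp + fp) if tp + fp else 0.0
        recall = 100 * tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: 98 positions without punctuation, 2 question marks, one of which is missed
targets = ["<NULL>"] * 98 + ["?"] * 2
preds = ["<NULL>"] * 99 + ["?"]

accuracy = 100 * sum(t == p for t, p in zip(targets, preds)) / len(targets)
scores = per_label_prf(targets, preds)
print(accuracy)        # 99.0
print(scores["?"][1])  # 50.0 (recall for "?")
```

Here overall accuracy is 99% even though half of the question marks were missed, which is why the per-label rows are more informative than the micro average.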





Pre-punctuation report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.92      99.97      99.95     572069
    ¿ (label_id: 1)                                         81.93      60.46      69.57       1095
    micro avg                                               99.90      99.90      99.90     573164
    macro avg                                               90.93      80.22      84.76     573164
    weighted avg                                            99.89      99.90      99.89     573164
Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.70      98.44      98.57     517310
    <ACRONYM> (label_id: 1)                                 39.68      86.21      54.35         58
    . (label_id: 2)                                         87.72      90.41      89.04      29267
    , (label_id: 3)                                         73.17      74.68      73.92      25422
    ? (label_id: 4)                                         69.49      59.26      63.97       1107
    micro avg                                               96.90      96.90      96.90     573164
    macro avg                                               73.75      81.80      75.97     573164
    weighted avg                                            96.94      96.90      96.92     573164
Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.85      99.73      99.79    2164982
    UPPER (label_id: 1)                                     92.01      95.32      93.64      69437
    micro avg                                               99.60      99.60      99.60    2234419
    macro avg                                               95.93      97.53      96.71    2234419
    weighted avg                                            99.61      99.60      99.60    2234419

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.98      99.99     543228
    FULLSTOP (label_id: 1)                                  99.66      99.93      99.80      32931
    micro avg                                               99.98      99.98      99.98     576159
    macro avg                                               99.83      99.96      99.89     576159
    weighted avg                                            99.98      99.98      99.98     576159
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     539822
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    micro avg                                              100.00     100.00     100.00     539822
    macro avg                                              100.00     100.00     100.00     539822
    weighted avg                                           100.00     100.00     100.00     539822

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.77      98.27      98.52     481148
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00          0
    . (label_id: 2)                                         87.63      90.63      89.11      29090
    , (label_id: 3)                                         74.44      78.69      76.50      28549
    ? (label_id: 4)                                         66.30      52.27      58.45       1035
    micro avg                                               96.74      96.74      96.74     539822
    macro avg                                               81.79      79.96      80.65     539822
    weighted avg                                            96.82      96.74      96.77     539822

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.90      99.82      99.86    2082598
    UPPER (label_id: 1)                                     94.75      97.08      95.90      70555
    micro avg                                               99.73      99.73      99.73    2153153
    macro avg                                               97.32      98.45      97.88    2153153
    weighted avg                                            99.73      99.73      99.73    2153153

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.98      99.99     509905
    FULLSTOP (label_id: 1)                                  99.72      99.98      99.85      32909
    micro avg                                               99.98      99.98      99.98     542814
    macro avg                                               99.86      99.98      99.92     542814
    weighted avg                                            99.98      99.98      99.98     542814
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     580702
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    micro avg                                              100.00     100.00     100.00     580702
    macro avg                                              100.00     100.00     100.00     580702
    weighted avg                                           100.00     100.00     100.00     580702

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.56      98.47      98.51     520647
    <ACRONYM> (label_id: 1)                                 52.00      79.89      63.00        179
    . (label_id: 2)                                         87.29      89.37      88.32      29852
    , (label_id: 3)                                         75.26      74.69      74.97      29218
    ? (label_id: 4)                                         60.73      55.46      57.98        806
    micro avg                                               96.74      96.74      96.74     580702
    macro avg                                               74.77      79.57      76.56     580702
    weighted avg                                            96.74      96.74      96.74     580702

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.84      99.75      99.79    2047297
    UPPER (label_id: 1)                                     93.56      95.65      94.59      77424
    micro avg                                               99.60      99.60      99.60    2124721
    macro avg                                               96.70      97.70      97.19    2124721
    weighted avg                                            99.61      99.60      99.60    2124721

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                   100.00      99.96      99.98     550858
    FULLSTOP (label_id: 1)                                  99.26      99.94      99.60      32833
    micro avg                                               99.95      99.95      99.95     583691
    macro avg                                               99.63      99.95      99.79     583691
    weighted avg                                            99.96      99.95      99.96     583691
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     577636
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    micro avg                                              100.00     100.00     100.00     577636
    macro avg                                              100.00     100.00     100.00     577636
    weighted avg                                           100.00     100.00     100.00     577636

Punctuation report: 
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.10      97.73      97.91     522727
    <ACRONYM> (label_id: 1)                                 41.76      48.72      44.97         78
    . (label_id: 2)                                         81.71      86.70      84.13      28881
    , (label_id: 3)                                         61.72      63.24      62.47      24703
    ? (label_id: 4)                                         62.55      41.78      50.10       1247
    micro avg                                               95.58      95.58      95.58     577636
    macro avg                                               69.17      67.63      67.92     577636
    weighted avg                                            95.64      95.58      95.60     577636

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.76      99.70      99.73    2160781
    UPPER (label_id: 1)                                     91.18      92.76      91.96      72471
    micro avg                                               99.47      99.47      99.47    2233252
    macro avg                                               95.47      96.23      95.85    2233252
    weighted avg                                            99.48      99.47      99.48    2233252

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.98      99.99     547875
    FULLSTOP (label_id: 1)                                  99.72      99.91      99.82      32742
    micro avg                                               99.98      99.98      99.98     580617
    macro avg                                               99.86      99.95      99.90     580617
    weighted avg                                            99.98      99.98      99.98     580617
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                   100.00     100.00     100.00     614010
    ¿ (label_id: 1)                                          0.00       0.00       0.00          0
    micro avg                                              100.00     100.00     100.00     614010
    macro avg                                              100.00     100.00     100.00     614010
    weighted avg                                           100.00     100.00     100.00     614010

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    98.72      98.57      98.65     556366
    <ACRONYM> (label_id: 1)                                 38.46      71.43      50.00         49
    . (label_id: 2)                                         86.41      88.56      87.47      28969
    , (label_id: 3)                                         72.15      72.80      72.47      27183
    ? (label_id: 4)                                         75.81      67.78      71.57       1443
    micro avg                                               96.88      96.88      96.88     614010
    macro avg                                               74.31      79.83      76.03     614010
    weighted avg                                            96.91      96.88      96.89     614010

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.84      99.80      99.82    2127174
    UPPER (label_id: 1)                                     93.72      94.73      94.22      66496
    micro avg                                               99.65      99.65      99.65    2193670
    macro avg                                               96.78      97.27      97.02    2193670
    weighted avg                                            99.65      99.65      99.65    2193670

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.99      99.94      99.97     584331
    FULLSTOP (label_id: 1)                                  98.92      99.90      99.41      32661
    micro avg                                               99.94      99.94      99.94     616992
    macro avg                                               99.46      99.92      99.69     616992
    weighted avg                                            99.94      99.94      99.94     616992
Pre-punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    99.97     100.00      99.98     143817
    ¿ (label_id: 1)                                          0.00       0.00       0.00         50
    micro avg                                               99.97      99.97      99.97     143867
    macro avg                                               49.98      50.00      49.99     143867
    weighted avg                                            99.93      99.97      99.95     143867

Punctuation report:
    label                                                precision    recall       f1           support   
    <NULL> (label_id: 0)                                    97.61      97.73      97.67     119040
    <ACRONYM> (label_id: 1)                                  0.00       0.00       0.00         28
    . (label_id: 2)                                         74.02      79.46      76.65      15282
    , (label_id: 3)                                         60.88      50.75      55.36       5836
    ? (label_id: 4)                                         64.94      60.28      62.52       3681
    micro avg                                               92.90      92.90      92.90     143867
    macro avg                                               59.49      57.64      58.44     143867
    weighted avg                                            92.76      92.90      92.80     143867

Truecasing report:
    label                                                precision    recall       f1           support   
    LOWER (label_id: 0)                                     99.81      99.83      99.82     422395
    UPPER (label_id: 1)                                     97.09      96.81      96.95      24854
    micro avg                                               99.66      99.66      99.66     447249
    macro avg                                               98.45      98.32      98.39     447249
    weighted avg                                            99.66      99.66      99.66     447249

Fullstop report:
    label                                                precision    recall       f1           support   
    NOSTOP (label_id: 0)                                    99.93      99.63      99.78     123867
    FULLSTOP (label_id: 1)                                  97.97      99.59      98.77      22000
    micro avg                                               99.63      99.63      99.63     145867
    macro avg                                               98.95      99.61      99.28     145867
    weighted avg                                            99.63      99.63      99.63     145867