
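This model predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.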

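This multilingual model was trained on the Europarl Dataset; for Dutch, we also included the SoNaR Dataset. Note that the Europarl data consists of political speeches, so the model may perform differently on texts from other domains.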

The model restores the following punctuation markers: "." "," "?" "-" ":"

Sample Code

We provide a simple Python package that can process text of any length.
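One common way to process text longer than a model's input window is to split the words into overlapping chunks and predict each chunk separately. The sketch below illustrates that idea only; the function name `overlapping_chunks` and its parameters are hypothetical, and whether the package works exactly this way is an assumption.

```python
def overlapping_chunks(words, size, overlap):
    """Split a list of words into chunks of `size` words.

    Consecutive chunks share `overlap` words, so every word is seen
    with some context on at least one side.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once the current chunk reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks


words = "My name is Clara and I live in Berkeley California".split()
for chunk in overlapping_chunks(words, size=4, overlap=1):
    print(chunk)
```

With `size=4` and `overlap=1`, the last word of each chunk is repeated as the first word of the next one, so predictions for the repeated word can be taken from whichever chunk gives it more context.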

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

Output

[['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
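Each entry in the list above is a word, its predicted label, and a confidence score, where the label "0" appears to mean that no punctuation follows the word. As a minimal sketch under that assumption, the helpers below turn such a list back into text and flag words whose prediction is uncertain. The names `labels_to_text` and `uncertain_words` and the 0.9 threshold are illustrative and not part of the package.

```python
def labels_to_text(labeled_words):
    """Join [word, label, score] entries into punctuated text."""
    parts = []
    for word, label, _score in labeled_words:
        # Assumption: "0" means "no punctuation after this word".
        parts.append(word if label == "0" else word + label)
    return " ".join(parts)


def uncertain_words(labeled_words, threshold=0.9):
    """Return the words whose prediction score is below the threshold."""
    return [word for word, _label, score in labeled_words if score < threshold]


labeled_words = [
    ["My", "0", 0.99998856], ["name", "0", 0.9999708], ["is", "0", 0.99975926],
    ["Clara", "0", 0.6117834], ["and", "0", 0.9999014], ["I", "0", 0.9999808],
    ["live", "0", 0.9999666], ["in", "0", 0.99990165], ["Berkeley", ",", 0.9941764],
    ["California", ".", 0.9952892], ["Ist", "0", 0.9999577], ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224], ["Frage", ",", 0.9952265], ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]

print(labels_to_text(labeled_words))
print(uncertain_words(labeled_words))
```

Flagging low scores this way can help decide which positions to review by hand, for example the word "Clara" in the sample above.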

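Results

Performance differs across the individual punctuation markers: hyphens and colons are optional in many cases and can be replaced by a comma or a full stop. The model achieves the following F1 scores for the different languages: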

Label          English  German  French  Italian  Dutch
0              0.990    0.996   0.991   0.988    0.994
.              0.924    0.951   0.921   0.917    0.959
?              0.825    0.829   0.800   0.736    0.817
,              0.798    0.937   0.811   0.778    0.813
:              0.535    0.608   0.578   0.544    0.657
-              0.345    0.384   0.353   0.344    0.464
macro average  0.736    0.784   0.742   0.718    0.784
micro average  0.975    0.987   0.977   0.972    0.983
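The macro average row appears to be the unweighted mean of the six per-label F1 scores in each column. The following sketch checks that reading against the reported values; the dictionary layout and the function name `macro_f1` are illustrative only.

```python
from statistics import mean

# Per-label F1 scores from the table, in the order: 0 . ? , : -
f1_scores = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German":  [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French":  [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch":   [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}

# Macro averages as reported in the table.
reported_macro = {
    "English": 0.736, "German": 0.784, "French": 0.742,
    "Italian": 0.718, "Dutch": 0.784,
}


def macro_f1(per_label_f1):
    """Unweighted mean of per-label F1 scores."""
    return mean(per_label_f1)


for language, scores in f1_scores.items():
    print(language, round(macro_f1(scores), 3), reported_macro[language])
```

The micro average is much higher than the macro average, presumably because it weights every token equally and most tokens carry the label 0.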

Languages

Models

Languages Model
English, Italian, French and German 12310321
English, Italian, French, German and Dutch 12311321
Dutch 12312321

Community Models

Languages Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian 12313321
Catalan 12314321

You can use a different model by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")

How to Cite Us

@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7},
  title = {FullStop:Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}