oliverguhr/fullstop-punctuation-multilang-large

Model:

oliverguhr/fullstop-punctuation-multilang-large

Task:

Token Classification

Libraries:

PyTorch TensorFlow Safetensors Transformers

Datasets:

wmt/europarl

Languages:

English, Italian, French, German

Other:

xlm-roberta punctuation prediction punctuation AutoTrain Compatible punctuation+prediction

License:

mit

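This model predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.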

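This multilingual model was trained on the Europarl Dataset provided by the SEPP-NLG Shared Task. Please note that this dataset consists of political speeches, so the model might perform differently on texts from other domains.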

The model restores the following punctuation markers: "." "," "?" "-" ":"

Sample Code

We provide a simple Python package that lets you process text of any length.
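The package handles long inputs internally, and this page does not describe how. As a rough illustration only, one common way to feed arbitrarily long text to a model with a bounded input length is to split the word sequence into overlapping windows. The helper name `overlapping_chunks` and its parameters are hypothetical and are not part of the deepmultilingualpunctuation API.

```python
from typing import List, Sequence


def overlapping_chunks(words: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word sequence into windows that share `overlap` words.

    Hypothetical helper for illustration; not the package's implementation.
    """
    if overlap < 0 or chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than a non-negative overlap")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(list(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


words = "my name is clara and i live in berkeley california".split()
for chunk in overlapping_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

The shared words give each window some left context, so predictions near a window boundary can be taken from the window in which the word is not at the edge.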

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

Output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
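Each entry in this output is a word, its predicted label ('0' meaning no punctuation) and a score. A minimal sketch of how such a list could be turned back into text follows, with an optional score threshold below which a predicted mark is dropped. The function name `join_labeled_words` and the `min_score` parameter are assumptions made for this example and are not part of the package.

```python
from typing import List, Sequence


def join_labeled_words(labeled_words: Sequence[Sequence], min_score: float = 0.0) -> str:
    """Append each predicted mark to its word and join the words with spaces.

    Hypothetical helper: marks whose score is below `min_score` are skipped.
    """
    parts: List[str] = []
    for word, label, score in labeled_words:
        if label != "0" and score >= min_score:
            parts.append(word + label)
        else:
            parts.append(word)
    return " ".join(parts)


# Sample data copied from the prediction output above.
sample = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665], ["is", "0", 0.9998579],
    ["Clara", "0", 0.6752215], ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515], ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047], ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655], ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]

print(join_labeled_words(sample))
print(join_labeled_words(sample, min_score=0.997))
```

Working from the labels instead of the restored string is useful when low-scoring marks should be reviewed or filtered before the text is published.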

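Results

Performance differs between the individual punctuation markers, because hyphens and colons are in many cases optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores for the different languages: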

Label	EN	DE	FR	IT
0	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
macro average	0.775	0.814	0.782	0.762
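The macro average row is the unweighted mean of the six per-label F1 scores in each column, so the rare labels ":" and "-" weigh as much as the frequent "0". The short check below recomputes it from the table values. The variable and function names are chosen for this example only.

```python
from statistics import mean

# Per-label F1 scores copied from the table above.
f1_scores = {
    "EN": {"0": 0.991, ".": 0.948, "?": 0.890, ",": 0.819, ":": 0.575, "-": 0.425},
    "DE": {"0": 0.997, ".": 0.961, "?": 0.893, ",": 0.945, ":": 0.652, "-": 0.435},
    "FR": {"0": 0.992, ".": 0.945, "?": 0.871, ",": 0.831, ":": 0.620, "-": 0.431},
    "IT": {"0": 0.989, ".": 0.942, "?": 0.832, ",": 0.798, ":": 0.588, "-": 0.421},
}


def macro_average(per_label_f1: dict) -> float:
    """Unweighted mean over labels: every label counts equally."""
    return mean(list(per_label_f1.values()))


macro = {lang: round(macro_average(scores), 3) for lang, scores in f1_scores.items()}
print(macro)
```

The recomputed values match the macro average row in the table for all four languages.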

Languages

Models

Languages	Model
English, Italian, French and German	1238321
English, Italian, French, German and Dutch	1239321
Dutch	12310321

Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian	12311321
Catalan	12312321
Welsh	12313321

You can use a different model by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")

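Where do I find the code and can I train my own model?

Yes, you can! For the complete code of the research project, take a look at this repository.

There is also a guide on how to fine-tune this model for your data / language.

References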

@inproceedings{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

Author:

Oliver Guhr

Dataset size:

6.25 GB