Model:
oliverguhr/fullstop-punctuation-multilang-large
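This model predicts punctuation for English, Italian, French, and German text. We developed it to restore punctuation in transcribed spoken language.
This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG Shared Task. Note that this dataset consists of political speeches, so the model may perform differently on text from other domains.
The model restores the following punctuation markers: ".", ",", "?", "-", ":"
We provide a simple Python package that lets you process text of any length.
To get started, install the package from PyPI: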
pip install deepmultilingualpunctuation
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
Output
My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
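The package is described as handling text of any length, but how it does so is not stated here. One common approach for models with a fixed input size is to split the word sequence into overlapping windows and run the model on each window. The sketch below illustrates that general idea only; `split_into_windows`, `window_size`, and `overlap` are hypothetical names and values, not part of the package's API.

```python
def split_into_windows(words, window_size=4, overlap=1):
    """Split a list of words into overlapping windows.

    Hypothetical helper for illustration; the values are arbitrary
    and do not reflect what the package uses internally.
    """
    if window_size <= 0 or overlap < 0 or overlap >= window_size:
        raise ValueError("require 0 <= overlap < window_size")

    step = window_size - overlap
    windows = []
    start = 0
    while start < len(words):
        windows.append(words[start:start + window_size])
        # Stop once the current window reaches the end of the input.
        if start + window_size >= len(words):
            break
        start += step
    return windows


words = "My name is Clara and I live in Berkeley California".split()
for window in split_into_windows(words):
    print(window)
```

Each window shares `overlap` words with the previous one, so a word at a window boundary is also seen with context on both sides in a neighbouring window.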
from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
Output
[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
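Each entry in the prediction output is a `[word, label, confidence]` triple, where the label `0` means no punctuation follows the word. As a minimal sketch of how such triples could be turned back into text, the hypothetical helper below (`join_labeled_words` and its `min_confidence` parameter are illustrative names, not part of the package) appends a predicted marker only when its confidence reaches a chosen threshold.

```python
def join_labeled_words(labeled_words, min_confidence=0.0):
    """Rebuild text from [word, label, confidence] triples.

    Hypothetical helper for illustration. The label '0' means
    "no punctuation after this word".
    """
    pieces = []
    for word, label, confidence in labeled_words:
        if label != "0" and confidence >= min_confidence:
            pieces.append(word + label)
        else:
            pieces.append(word)
    return " ".join(pieces)


# Triples copied from the German part of the output above.
sample = [
    ["Ist", "0", 0.99998784],
    ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918],
    ["Frage", ",", 0.99622655],
    ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]

print(join_labeled_words(sample))
# Ist das eine Frage, Frau Müller?
```

Raising `min_confidence` trades recall for precision: markers the model is less sure about are left out. Attaching the marker directly to the word is a simplification that may not suit every marker, for example a hyphen.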
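Performance varies for individual punctuation markers such as hyphens and colons, because in many cases they are optional and can be replaced by a comma or a full stop. The model achieves the following F1 scores for the different languages: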
Label | EN | DE | FR | IT |
---|---|---|---|---|
0 | 0.991 | 0.997 | 0.992 | 0.989 |
. | 0.948 | 0.961 | 0.945 | 0.942 |
? | 0.890 | 0.893 | 0.871 | 0.832 |
, | 0.819 | 0.945 | 0.831 | 0.798 |
: | 0.575 | 0.652 | 0.620 | 0.588 |
- | 0.425 | 0.435 | 0.431 | 0.421 |
macro average | 0.775 | 0.814 | 0.782 | 0.762 |
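The macro average row is consistent with an unweighted mean of the six per-label F1 scores, so rare labels such as "-" count as much as the frequent "0" label. The short check below recomputes it from the values in the table; the variable and function names are illustrative.

```python
# Per-label F1 scores from the table, in the order: 0, ".", "?", ",", ":", "-"
f1_scores = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}


def macro_average(scores):
    """Unweighted mean: every label contributes equally."""
    return sum(scores) / len(scores)


for language, scores in f1_scores.items():
    print(f"{language}: {macro_average(scores):.3f}")
# EN: 0.775
# DE: 0.814
# FR: 0.782
# IT: 0.762
```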
Languages | Model |
---|---|
English, Italian, French and German | 1238321 |
English, Italian, French, German and Dutch | 1239321 |
Dutch | 12310321 |
Languages | Model |
---|---|
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian | 12311321 |
Catalan | 12312321 |
Welsh | 12313321 |
You can use a different model by setting the model parameter:
model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")
Yes, you can! The code for the full research project is available in this repository.
There is also a guide on how to fine-tune this model for your data / language.
@article{guhr-EtAl:2021:fullstop,
  title = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month = {June},
  year = {2021},
  address = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}