
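This model predicts the punctuation of Dutch text. We developed it to restore the punctuation of transcribed spoken language.

The model was trained on the SoNaR Dataset.

The model restores the following punctuation markers: "." "," "?" "-" ":"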

Sample Code

We provide a simple Python package that allows you to process text of any length.
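Transformer models accept only a limited number of tokens per call, so text of arbitrary length is usually processed in pieces. The sketch below illustrates one common approach, splitting the word sequence into overlapping windows. The function name `split_into_overlapping_chunks` and the `chunk_size` and `overlap` values are assumptions chosen for illustration; they are not taken from the package and may differ from what it does internally.

```python
def split_into_overlapping_chunks(words, chunk_size=10, overlap=2):
    """Yield lists of words; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only, not part of the package.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield words[start:start + chunk_size]
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(words):
            break


text = (
    "hervatting van de zitting ik verklaar de zitting van het europees "
    "parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
)
for chunk in split_into_overlapping_chunks(text.split()):
    print(" ".join(chunk))
```

The overlap gives the model some context on both sides of a chunk boundary. A real implementation would also have to decide which prediction to keep for the words that appear in two chunks.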

Install

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
result = model.restore_punctuation(text)
print(result)

Output

hervatting van de zitting. ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

Output

[['hervatting', '0', 0.99998724], ['van', '0', 0.9999784], ['de', '0', 0.99991274], ['zitting', '.', 0.6771242], ['ik', '0', 0.9999466], ['verklaar', '0', 0.9998566], ['de', '0', 0.9999783], ['zitting', '0', 0.9999809], ['van', '0', 0.99996245], ['het', '0', 0.99997795], ['europees', '0', 0.9999783], ['parlement', ',', 0.9908242], ['die', '0', 0.999985], ['op', '0', 0.99998224], ['vrijdag', '0', 0.9999831], ['17', '0', 0.99997985], ['december', '0', 0.9999827], ['werd', '0', 0.999982], ['onderbroken', ',', 0.9951485], ['te', '0', 0.9999677], ['zijn', '0', 0.99997723], ['hervat', '.', 0.9957053]]
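Each entry in this output is a `[word, label, confidence]` triple, where the label `0` means that no punctuation mark follows the word. The sketch below shows how such a list could be turned back into text and how low-confidence predictions could be flagged for review. The helpers `join_labeled_words` and `low_confidence` and the `0.9` threshold are hypothetical and not part of the package; the sample data is the first five entries of the output above.

```python
# First five entries of the prediction output shown above.
sample = [
    ["hervatting", "0", 0.99998724],
    ["van", "0", 0.9999784],
    ["de", "0", 0.99991274],
    ["zitting", ".", 0.6771242],
    ["ik", "0", 0.9999466],
]


def join_labeled_words(labeled_words):
    """Append each predicted mark to its word; label "0" adds nothing."""
    parts = []
    for word, label, _score in labeled_words:
        parts.append(word if label == "0" else word + label)
    return " ".join(parts)


def low_confidence(labeled_words, threshold=0.9):
    """Return the entries whose confidence is below `threshold`."""
    return [entry for entry in labeled_words if entry[2] < threshold]


print(join_labeled_words(sample))  # hervatting van de zitting. ik
print(low_confidence(sample))      # [['zitting', '.', 0.6771242]]
```

Working on the labels rather than on the restored string makes it possible to apply your own rules, for example keeping a predicted mark only when its confidence exceeds a threshold.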

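Results

Performance differs between the individual punctuation markers, because hyphens and colons are in many cases optional and can be replaced by a comma or a full stop. The model achieves the following F1 scores: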

| Label | F1 Score |
| --- | --- |
| 0 | 0.985816 |
| . | 0.854380 |
| ? | 0.684060 |
| , | 0.719308 |
| : | 0.696088 |
| - | 0.722000 |
| macro average | 0.776942 |
| micro average | 0.963427 |
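As a quick check of how the summary rows relate to the per-label scores, the macro average is the unweighted mean of the six label scores. The snippet below recomputes it from the reported values. The micro average cannot be recomputed this way because it depends on per-label counts that are not listed here; it is presumably much higher because the frequent `0` label dominates.

```python
# Per-label F1 scores as reported above.
f1_scores = {
    "0": 0.985816,
    ".": 0.854380,
    "?": 0.684060,
    ",": 0.719308,
    ":": 0.696088,
    "-": 0.722000,
}

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 6))  # 0.776942
```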

Languages

Models

Languages Model
English, Italian, French and German 1237321
English, Italian, French, German and Dutch 1238321
Dutch 1239321

Community Models

Languages Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian 12310321
Catalan 12311321

You can use a different model by setting the model parameter:

model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")

How to cite us

@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, I.2.7},
  title = {FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}