# SEC-BERT
 
SEC-BERT is a family of BERT models for the financial domain, intended to assist financial NLP research and FinTech applications. SEC-BERT consists of the following models:

- **SEC-BERT-BASE**: Same architecture as BERT-BASE, trained on financial documents.
- **SEC-BERT-NUM**: Same as SEC-BERT-BASE, but we replace every numeric token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and preventing their fragmentation.
- **SEC-BERT-SHAPE** (this model): Same as SEC-BERT-BASE, but we replace numbers with pseudo-tokens that represent the number's shape, so numeric expressions (of known shapes) are no longer fragmented; e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]' (see the snippet below).
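As a quick illustration of the shape mapping above (a minimal sketch; whether a given shape is kept or falls back to [NUM] depends on the tokenizer's special-token list, shown in the pre-processing example further down):

```python
import re

# Replace each digit with 'X' to obtain the shape pseudo-token;
# ',' and '.' are preserved as part of the shape.
for number in ["53.2", "40,200.5"]:
    print("[" + re.sub(r"\d", "X", number) + "]")
# [XX.X]
# [XX,XXX.X]
```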
  
## Pre-training corpus

The model was pre-trained on 260,773 10-K filings from 1993 to 2019, which are publicly available at the U.S. Securities and Exchange Commission (SEC).
## Pre-training details

## Load Pretrained Model
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
```
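A minimal usage sketch (assuming the standard Transformers API; the pre-processed input sentence is a hypothetical example):

```python
# Encode a pre-processed sentence and inspect the contextual embeddings.
inputs = tokenizer(
    "Total net sales decreased [X] % during [XXXX] .", return_tensors="pt"
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```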
## Pre-process Text

To use SEC-BERT-SHAPE, you have to pre-process your text, replacing every numeric token with the corresponding shape pseudo-token from a list of 214 predefined shape pseudo-tokens. If a numeric token does not correspond to any shape pseudo-token, replace it with the [NUM] pseudo-token instead. Below is an example of how to pre-process a simple sentence. This approach is quite simplistic; feel free to modify it as you see fit.
```python
import re
import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
spacy_tokenizer = spacy.load("en_core_web_sm")

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."

def sec_bert_shape_preprocess(text):
    # Word-level tokenization with spaCy (bug fix: tokenize the `text`
    # argument, not the global `sentence`).
    tokens = [t.text for t in spacy_tokenizer(text)]
    processed_text = []
    for token in tokens:
        # Numeric tokens: digits optionally mixed with ',' and '.'
        if re.fullmatch(r"(\d+[\d,.]*)|([,.]\d+)", token):
            shape = '[' + re.sub(r'\d', 'X', token) + ']'
            if shape in tokenizer.additional_special_tokens:
                processed_text.append(shape)
            else:
                # Fall back to [NUM] for shapes the tokenizer does not know.
                processed_text.append('[NUM]')
        else:
            processed_text.append(token)
    return ' '.join(processed_text)

tokenized_sentence = tokenizer.tokenize(sec_bert_shape_preprocess(sentence))
print(tokenized_sentence)
"""
['total', 'net', 'sales', 'decreased', '[X]', '%', 'or', '$', '[X.X]', 'billion', 'during', '[XXXX]', 'compared', 'to', '[XXXX]', '.']
"""
```
## Using SEC-BERT variants as Language Models
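The predictions in the tables below can be reproduced with the standard fill-mask pipeline; a minimal sketch for SEC-BERT-SHAPE (numbers in the input are first mapped to shape pseudo-tokens; BERT-BASE-UNCASED would take the raw sentence instead):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-shape")

# Input with numbers already mapped to shape pseudo-tokens,
# as done by sec_bert_shape_preprocess above.
masked = ("Total net sales [MASK] [X] % or $ [X.X] billion "
          "during [XXXX] compared to [XXXX] .")
for prediction in fill_mask(masked, top_k=5):
    print(f"{prediction['token_str']} ({prediction['score']:.3f})")
```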
 
  
   
| Sample | Masked Token |
| ------ | ------------ |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.753), decreased (0.211), grew (0.019), declined (0.010), rose (0.006) |
| SEC-BERT-SHAPE | increased (0.747), decreased (0.214), grew (0.021), declined (0.013), rose (0.002) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | billion (0.841), million (0.097), trillion (0.028), ##m (0.015), ##bn (0.006) |
| SEC-BERT-BASE | million (0.972), billion (0.028), millions (0.000), ##million (0.000), m (0.000) |
| SEC-BERT-NUM | million (0.974), billion (0.012), , (0.010), thousand (0.003), m (0.000) |
| SEC-BERT-SHAPE | million (0.978), billion (0.021), % (0.000), , (0.000), millions (0.000) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018. | 2 |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | 20 (0.031), 10 (0.030), 6 (0.029), 4 (0.027), 30 (0.027) |
| SEC-BERT-BASE | 13 (0.045), 12 (0.040), 11 (0.040), 14 (0.035), 10 (0.035) |
| SEC-BERT-NUM | [NUM] (1.000), one (0.000), five (0.000), three (0.000), seven (0.000) |
| SEC-BERT-SHAPE | [XX] (0.316), [XX.X] (0.253), [X.X] (0.237), [X] (0.188), [X.XX] (0.002) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased 2[MASK] or $5.4 billion during 2019 compared to 2018. | % |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | % (0.795), percent (0.174), ##fold (0.009), billion (0.004), times (0.004) |
| SEC-BERT-BASE | % (0.924), percent (0.076), points (0.000), , (0.000), times (0.000) |
| SEC-BERT-NUM | % (0.882), percent (0.118), million (0.000), units (0.000), bps (0.000) |
| SEC-BERT-SHAPE | % (0.961), percent (0.039), bps (0.000), , (0.000), bcf (0.000) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased 2% or $[MASK] billion during 2019 compared to 2018. | 5.4 |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | 1 (0.074), 4 (0.045), 3 (0.044), 2 (0.037), 5 (0.034) |
| SEC-BERT-BASE | 1 (0.218), 2 (0.136), 3 (0.078), 4 (0.066), 5 (0.048) |
| SEC-BERT-NUM | [NUM] (1.000), l (0.000), 1 (0.000), - (0.000), 30 (0.000) |
| SEC-BERT-SHAPE | [X.X] (0.787), [X.XX] (0.095), [XX.X] (0.049), [X.XXX] (0.046), [X] (0.013) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased 2% or $5.4 billion during [MASK] compared to 2018. | 2019 |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | 2017 (0.485), 2018 (0.169), 2016 (0.164), 2015 (0.070), 2014 (0.022) |
| SEC-BERT-BASE | 2019 (0.990), 2017 (0.007), 2018 (0.003), 2020 (0.000), 2015 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), as (0.000), fiscal (0.000), year (0.000), when (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), as (0.000), year (0.000), periods (0.000), , (0.000) |

| Sample | Masked Token |
| ------ | ------------ |
| Total net sales decreased 2% or $5.4 billion during 2019 compared to [MASK]. | 2018 |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | 2017 (0.100), 2016 (0.097), above (0.054), inflation (0.050), previously (0.037) |
| SEC-BERT-BASE | 2018 (0.999), 2019 (0.000), 2017 (0.000), 2016 (0.000), 2014 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), year (0.000), last (0.000), sales (0.000), fiscal (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), year (0.000), sales (0.000), prior (0.000), years (0.000) |

| Sample | Masked Token |
| ------ | ------------ |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | held (0.229), sold (0.192), acquired (0.172), owned (0.052), traded (0.033) |
| SEC-BERT-BASE | repurchased (0.913), issued (0.036), purchased (0.029), redeemed (0.010), sold (0.003) |
| SEC-BERT-NUM | repurchased (0.917), purchased (0.054), reacquired (0.013), issued (0.005), acquired (0.003) |
| SEC-BERT-SHAPE | repurchased (0.902), purchased (0.068), issued (0.010), reacquired (0.008), redeemed (0.006) |

| Sample | Masked Token |
| ------ | ------------ |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | stock (0.835), assets (0.039), equity (0.025), debt (0.021), bonds (0.017) |
| SEC-BERT-BASE | stock (0.857), shares (0.135), equity (0.004), units (0.002), securities (0.000) |
| SEC-BERT-NUM | stock (0.842), shares (0.157), equity (0.000), securities (0.000), units (0.000) |
| SEC-BERT-SHAPE | stock (0.888), shares (0.109), equity (0.001), securities (0.001), stocks (0.000) |

| Sample | Masked Token |
| ------ | ------------ |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | cash (0.276), net (0.128), annual (0.083), the (0.040), debt (0.027) |
| SEC-BERT-BASE | dividend (0.890), cash (0.018), dividends (0.016), share (0.013), tax (0.010) |
| SEC-BERT-NUM | dividend (0.735), cash (0.115), share (0.087), tax (0.025), stock (0.013) |
| SEC-BERT-SHAPE | dividend (0.655), cash (0.248), dividends (0.042), share (0.019), out (0.003) |

| Sample | Masked Token |
| ------ | ------------ |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid dividend [MASK] of $14.1 billion. | equivalents |

| Model | Predictions (Probability) |
| ----- | ------------------------- |
| BERT-BASE-UNCASED | revenue (0.085), earnings (0.078), rates (0.065), amounts (0.064), proceeds (0.062) |
| SEC-BERT-BASE | payments (0.790), distributions (0.087), equivalents (0.068), cash (0.013), amounts (0.004) |
| SEC-BERT-NUM | payments (0.845), equivalents (0.097), distributions (0.024), increases (0.005), dividends (0.004) |
| SEC-BERT-SHAPE | payments (0.784), equivalents (0.093), distributions (0.043), dividends (0.015), requirements (0.009) |
## Publication

If you use this model, please cite the following article:

[FiNER: Financial Numeric Entity Recognition for XBRL Tagging](https://arxiv.org/abs/2203.06482)
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras. In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) (Long Papers), Dublin, Republic of Ireland, May 22-27, 2022.
```bibtex
@inproceedings{loukas-etal-2022-finer,
    title = {FiNER: Financial Numeric Entity Recognition for XBRL Tagging},
    author = {Loukas, Lefteris and
      Fergadiotis, Manos and
      Chalkidis, Ilias and
      Spyropoulou, Eirini and
      Malakasiotis, Prodromos and
      Androutsopoulos, Ion and
      Paliouras, George},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)},
    publisher = {Association for Computational Linguistics},
    location = {Dublin, Republic of Ireland},
    year = {2022},
    url = {https://arxiv.org/abs/2203.06482}
}
```
## About Us

AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

- question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
- natural language generation from databases and ontologies, especially Semantic Web ontologies,
- text classification, including filtering spam and abusive content,
- information extraction and opinion mining, including legal text analytics and sentiment analysis,
- natural language processing tools for Greek, for example parsers and named-entity recognizers,
- machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Athens University of Economics and Business, Greece.

Manos Fergadiotis on behalf of AUEB's Natural Language Processing Group