

Model pretrained on English legal and administrative text using the RoBERTa pretraining objective.


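The Pile of Law BERT large model is a transformers model with the BERT large model (uncased) architecture. It was pretrained on the Pile of Law, a dataset of approximately 256GB of English legal and administrative text assembled for language model pretraining.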





>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.6343119740486145, 
  'token': 1151,
  'token_str': 'appeal'},
  {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.10488124936819077, 
  'token': 3542, 
  'token_str': 'objection'}, 
  {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.0708756372332573, 
  'token': 1999, 
  'token_str': 'application'}, 
  {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.02558572217822075, 
  'token': 3677, 
  'token_str': 'example'}, 
  {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.013266939669847488, 
  'token': 1347, 
  'token_str': 'action'}]
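
The `score` values in the output above are probabilities: the fill-mask pipeline applies a softmax to the vocabulary logits at the [MASK] position and returns the highest-scoring tokens (five by default). The following is a minimal pure-Python sketch of that ranking step. The toy vocabulary and logit values are made up for illustration and are not model output, and `top_k_fills` is a hypothetical helper name, not part of the transformers API.

```python
import math

def top_k_fills(logits, vocab, k=5):
    """Softmax the logits for one [MASK] position and return the k best tokens."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens by probability, highest first.
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return [{"token_str": tok, "score": p} for tok, p in ranked[:k]]

# Hypothetical toy vocabulary and logits (assumption: not real model output).
toy_vocab = ["appeal", "objection", "application", "example", "action", "banana"]
toy_logits = [5.0, 3.2, 2.8, 1.8, 1.1, -4.0]

toy_fills = top_k_fills(toy_logits, toy_vocab, k=3)
for fill in toy_fills:
    print(fill["token_str"], round(fill["score"], 4))
```

Because the softmax runs over the whole vocabulary, the scores of the returned candidates generally sum to less than 1.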


# Extract features for a given text with PyTorch.
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)


# The same feature extraction with TensorFlow.
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)




>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])

[{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans', 
  'score': 0.0013972163433209062, 
  'token': 4311, 
  'token_str': 'black'}, 
  {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans', 
  'score': 0.0009401230490766466, 
  'token': 4249,
  'token_str': 'white'}]
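
One simple way to read the two scores above is to compare them directly. The sketch below copies both scores from the output and computes their ratio and their share of the two-candidate probability mass. It is an illustrative comparison for this single prompt, not a bias metric defined by the model authors.

```python
# Scores copied from the fill-mask output above.
scores = {"black": 0.0013972163433209062, "white": 0.0009401230490766466}

# Ratio of the two candidate probabilities for this one prompt.
ratio = scores["black"] / scores["white"]

# Renormalize over just the two targets to get a relative preference.
total = sum(scores.values())
relative = {tok: s / total for tok, s in scores.items()}

print(f"ratio black/white: {ratio:.2f}")
for tok, share in relative.items():
    print(f"{tok}: {share:.3f}")
```

A single prompt gives only a rough indication; a more reliable picture would need many prompts and target pairs.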






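The model vocabulary consists of 29,000 tokens from a custom WordPiece vocabulary fit to the Pile of Law, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. Masking follows the 80-10-10 mask, corrupt, leave split described in BERT, with a replication rate of 20 so that each context receives different masks. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation around legal citations (these are often mistaken for sentences). Each input is built by adding sentences until they comprise 256 tokens, then a [SEP] token, then further sentences such that the entire span stays under 512 tokens. If the next sentence in the series is too long to fit, it is not added, and the remaining context length is filled with padding tokens.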
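
The input formatting described above can be sketched as a small packing routine. This is one plausible reading of the description, not the authors' preprocessing code: `pack_example` is a hypothetical helper, sentences are assumed to be already segmented and tokenized, the first segment is assumed to hold at most 256 tokens, the whole span at most 512, and [CLS] handling is omitted for brevity.

```python
def pack_example(sentences, first_budget=256, max_len=512,
                 sep="[SEP]", pad="[PAD]"):
    """Pack tokenized sentences into one fixed-length, two-segment example.

    Returns (tokens, n_used): the padded token list and how many
    sentences were consumed from the front of `sentences`.
    """
    tokens, i = [], 0
    # First segment: add whole sentences while they fit in the first budget.
    while i < len(sentences) and len(tokens) + len(sentences[i]) <= first_budget:
        tokens.extend(sentences[i])
        i += 1
    tokens.append(sep)
    # Second segment: keep adding whole sentences while the span fits in max_len.
    while i < len(sentences) and len(tokens) + len(sentences[i]) <= max_len:
        tokens.extend(sentences[i])
        i += 1
    # A sentence that does not fit is left out; pad the remaining context.
    tokens.extend([pad] * (max_len - len(tokens)))
    return tokens, i

# Toy sentences with made-up lengths; "w" stands in for any token.
toy_sentences = [["w"] * n for n in (100, 120, 80, 200, 150)]
packed, n_used = pack_example(toy_sentences)
print(len(packed), n_used, packed.count("[PAD]"))
```

In this toy run the fifth sentence does not fit, so it is left for the next example and the tail of the span is padding.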


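The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a small learning rate of 5e-6 and a batch size of 128 to mitigate training instability, which may stem from the diversity of sources in our training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. All steps used a sequence length of 512.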
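
As a rough illustration of the MLM objective with the 80-10-10 split and a replication rate of 20, the sketch below masks a toy context several times. It is not the authors' training code: `mask_tokens` and `replicate` are hypothetical helpers, and the 15% selection rate is BERT's default, assumed here because the text does not state it. No sentence-pair label is produced, which reflects dropping the NSP loss.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15, mask_token="[MASK]",
                special=("[CLS]", "[SEP]", "[PAD]")):
    """Return (inputs, labels) for one MLM example using the 80-10-10 rule."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Skip special tokens and positions not selected for prediction.
        if tok in special or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: corrupt with a random token
        # Remaining 10%: leave the token unchanged.
    return inputs, labels

def replicate(tokens, vocab, n_copies=20, seed=0):
    """Create n_copies independently masked versions of the same context."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, vocab, rng) for _ in range(n_copies)]

# Toy vocabulary and context (assumption: for illustration only).
toy_vocab = ["court", "appeal", "statute", "party", "order"]
toy_context = ["[CLS]", "the", "court", "denied", "the", "appeal", "[SEP]"]
copies = replicate(toy_context, toy_vocab)
print(len(copies))
```

Positions with a `None` label would be ignored by the loss; only the selected positions contribute to training.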

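We trained two models in parallel with the same setup but different random seeds. We selected the model with the lowest log likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for experiments, and we also release the second model, pile-of-law/legalbert-large-1.7M-2.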


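When fine-tuned on the CaseHOLD variant provided by the LexGLUE paper, this model, PoL-BERT-Large, achieves the following results. In the table below, we also report results for models whose hyperparameters were tuned on the downstream task, as well as results for the CaseLaw-BERT model from the LexGLUE paper, which uses a fixed experimental setup.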


| Model | F1 |
| --- | --- |
| CaseLaw-BERT (tuned) | 78.5 |
| CaseLaw-BERT (LexGLUE) | 75.4 |
| PoL-BERT-Large | 75.0 |
| BERT-Large-Uncased | 71.3 |


  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}