
LegalBERT large model (uncased)

Model pretrained on English legal and administrative text using the RoBERTa pretraining objective.

Model description

The LegalBERT large model is a transformers model with the BERT large model (uncased) architecture, pretrained on the Pile of Law, a dataset consisting of roughly 256GB of English legal and administrative text used for language model pretraining.

Intended uses & limitations

You can use the raw model for masked language modeling, or fine-tune it for a downstream task. Since this model was pretrained on a corpus of English legal and administrative text, downstream tasks in the legal domain are likely to be more in-domain for it.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.6343119740486145, 
  'token': 1151,
  'token_str': 'appeal'},
  {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.10488124936819077, 
  'token': 3542, 
  'token_str': 'objection'}, 
  {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.0708756372332573, 
  'token': 1999, 
  'token_str': 'application'}, 
  {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.02558572217822075, 
  'token': 3677, 
  'token_str': 'example'}, 
  {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 
  'score': 0.013266939669847488, 
  'token': 1347, 
  'token_str': 'action'}]

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Limitations and bias

See Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.

This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling, for the race descriptor of the perpetrator, the model assigns a higher score to "black" than to "white".

>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])

[{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans', 
  'score': 0.0013972163433209062, 
  'token': 4311, 
  'token_str': 'black'}, 
  {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans', 
  'score': 0.0009401230490766466, 
  'token': 4249,
  'token_str': 'white'}]

This bias will also affect all fine-tuned versions of this model.

Training data

The LegalBERT large model was pretrained on the Pile of Law, a dataset consisting of roughly 256GB of English legal and administrative text used for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Training procedure

Preprocessing

The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to the Pile of Law, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. The 80-10-10 masking/corruption/leave split described in BERT is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are often mistakenly treated as sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then continuing to fill sentences so that the entire span stays under 512 tokens. If the next sentence in the sequence is too large, it is not added, and the remaining context length is filled with padding tokens.
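As a rough illustration of this packing scheme (a sketch, not the authors' actual preprocessing code), the snippet below fills whole sentences into a 256-token first segment, inserts [SEP], and then keeps adding whole sentences while the span stays under 512 tokens. The leading [CLS] and trailing [SEP] follow the standard BERT input convention and are our assumption; pack_sentences is an illustrative name.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')

def pack_sentences(sentences, max_len=512, first_segment_len=256):
    # `sentences` is a list of strings, e.g. from the LexNLP sentence segmenter.
    ids = [tokenizer.cls_token_id]
    i = 0
    # First segment: add whole sentences until roughly 256 tokens are filled.
    while i < len(sentences):
        sent = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent) > first_segment_len:
            break
        ids += sent
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Second segment: keep adding whole sentences while the span stays under 512 tokens.
    while i < len(sentences):
        sent = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent) + 1 > max_len:
            break  # the next sentence is too large, so it is not added
        ids += sent
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Fill the remaining context length with padding tokens.
    ids += [tokenizer.pad_token_id] * (max_len - len(ids))
    return ids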

Pretraining

The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially caused by the diversity of sources in our training data. Pretraining used the masked language modeling (MLM) objective without NSP loss, as described in RoBERTa. The model was pretrained with a sequence length of 512 tokens for all steps.
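For reference, here is a minimal sketch of an equivalent MLM-only setup in Hugging Face Transformers. The released checkpoints were trained on SambaNova hardware rather than with Trainer, so the device/accumulation split of the batch size and the tiny placeholder dataset are assumptions for illustration, not the authors' configuration.

from datasets import Dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
# Train from scratch with the BERT-large architecture and the custom vocabulary.
model = BertForMaskedLM(BertConfig.from_pretrained('pile-of-law/legalbert-large-1.7M-1'))

# MLM only, no NSP loss; the collator applies the standard 15% masking rate
# with the 80-10-10 mask/corrupt/keep split described above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

# Placeholder standing in for the packed 512-token Pile of Law sequences.
packed_dataset = Dataset.from_dict(
    {'input_ids': [tokenizer('placeholder legal text', truncation=True,
                             padding='max_length', max_length=512)['input_ids']]})

args = TrainingArguments(
    output_dir='legalbert-large-mlm',
    learning_rate=5e-6,              # small learning rate to mitigate instability
    per_device_train_batch_size=16,  # with accumulation: effective batch size 128 on one device
    gradient_accumulation_steps=8,
    max_steps=1_700_000,             # 1.7M steps
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=packed_dataset)
# trainer.train()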

We trained two models in parallel with the same setup, using different random seeds. We selected the model with the lowest log likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for our experiments, but we also release the second model, pile-of-law/legalbert-large-1.7M-2.

Evaluation results

When fine-tuned on the CaseHOLD variant provided by the LexGLUE paper, the model PoL-BERT-Large achieves the following results. In the table below, we also report results with hyperparameter tuning on the downstream task, as well as the result for the CaseLaw-BERT model under the fixed experimental setup of the LexGLUE paper.

CaseHOLD test results:

Model                     F1
CaseLaw-BERT (tuned)      78.5
CaseLaw-BERT (LexGLUE)    75.4
PoL-BERT-Large            75.0
BERT-Large-Uncased        71.3
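CaseHOLD is a multiple-choice task: a citing context is paired with five candidate holdings and the model scores each pair. The sketch below shows how such fine-tuning could start from this checkpoint by loading it with a multiple-choice head; the head is freshly initialized (so it only becomes useful after training on CaseHOLD), and the example strings are placeholders, not real CaseHOLD data.

import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = AutoModelForMultipleChoice.from_pretrained('pile-of-law/legalbert-large-1.7M-1')

# One hypothetical CaseHOLD-style example: a citing context and five candidate holdings.
context = "The court held that <HOLDING> (citing Smith v. Jones)."
candidates = ["holding that the appeal was timely",
              "holding that the contract was void",
              "holding that the evidence was inadmissible",
              "holding that the statute applied retroactively",
              "holding that the motion was moot"]

# Encode the context against each candidate and stack them as one 5-way choice.
enc = tokenizer([context] * len(candidates), candidates,
                truncation=True, padding=True, return_tensors='pt')
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}  # (batch=1, num_choices=5, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits  # one score per candidate, shape (1, 5)
pred = logits.argmax(dim=-1)         # index of the predicted holding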

BibTeX entry and citation info

@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}