A model pretrained on English legal and administrative text using the RoBERTa pretraining objective.
The Legal-BERT large model is a transformers model with the BERT large model (uncased) architecture, pretrained on the Pile of Law, a dataset consisting of approximately 256GB of English legal and administrative text curated for language model pretraining.
You can use the raw model for masked language modeling, or fine-tune it for a downstream task. Because the model was pretrained on a corpus of English legal and administrative text, it is likely to be most useful for downstream tasks in the legal domain.
You can use the model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 'score': 0.6343119740486145, 'token': 1151, 'token_str': 'appeal'},
 {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 'score': 0.10488124936819077, 'token': 3542, 'token_str': 'objection'},
 {'sequence': 'an application is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 'score': 0.0708756372332573, 'token': 1999, 'token_str': 'application'},
 {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 'score': 0.02558572217822075, 'token': 3677, 'token_str': 'example'},
 {'sequence': 'an action is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.', 'score': 0.013266939669847488, 'token': 1347, 'token_str': 'action'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
See Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.
This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling, among racial descriptors of the perpetrator of a crime, the model scores "black" higher than "white".
```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-1')
>>> pipe("The clerk described the robber as a “thin [MASK] male, about six foot tall, wearing a gray hoodie, blue jeans", targets=["black", "white"])
[{'sequence': 'the clerk described the robber as a thin black male, about six foot tall, wearing a gray hoodie, blue jeans', 'score': 0.0013972163433209062, 'token': 4311, 'token_str': 'black'},
 {'sequence': 'the clerk described the robber as a thin white male, about six foot tall, wearing a gray hoodie, blue jeans', 'score': 0.0009401230490766466, 'token': 4249, 'token_str': 'white'}]
```
This bias will also affect all fine-tuned versions of the model.
The Legal-BERT large model was pretrained on the Pile of Law, a dataset consisting of approximately 256GB of English legal and administrative text curated for language model pretraining. The Pile of Law comprises 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
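As a rough illustration (not part of the original card), a single subset of the Pile of Law can be streamed from the Hugging Face Hub with the `datasets` library rather than downloading the full ~256GB corpus. The subset name `r_legaladvice` and the `text` field below are assumptions and should be checked against the dataset card:

```python
# Sketch: stream one Pile of Law subset without materializing the full corpus.
# The subset name "r_legaladvice" and the "text" field are assumptions; see
# https://huggingface.co/datasets/pile-of-law/pile-of-law for the actual list.
from datasets import load_dataset

pile_subset = load_dataset(
    "pile-of-law/pile-of-law",
    "r_legaladvice",   # assumed subset name
    split="train",
    streaming=True,    # avoid downloading everything up front
)

for example in pile_subset.take(3):
    print(example["text"][:200])  # assumed field name
```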
The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to the Pile of Law, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. The 80-10-10 masking, corruption, and leave split described in BERT is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are commonly mistaken for sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences so that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
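The span-packing scheme can be sketched roughly as follows. This is an approximation, not the authors' preprocessing code; it assumes the sentences have already been segmented (e.g., by the LexNLP sentence segmenter) and uses the released tokenizer only to count tokens and build the padded 512-token span:

```python
# Rough sketch of the span-packing scheme described above (an approximation,
# not the original preprocessing pipeline).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-1')

def pack_span(sentences, first_budget=256, total_budget=512):
    """Greedily fill sentences up to ~256 tokens, insert [SEP], then keep
    filling sentences while the whole span stays under 512 tokens."""
    segment_a, segment_b = [], []
    n_tokens = 0
    for sent in sentences:
        sent_len = len(tokenizer.tokenize(sent))
        if n_tokens + sent_len <= first_budget:
            segment_a.append(sent)
            n_tokens += sent_len
        elif n_tokens + sent_len + 1 <= total_budget:  # +1 for the [SEP] token
            segment_b.append(sent)
            n_tokens += sent_len
        else:
            break  # next sentence is too large; the rest becomes padding
    # The tokenizer adds [CLS]/[SEP] and pads the remainder of the context.
    return tokenizer(
        " ".join(segment_a),
        " ".join(segment_b) if segment_b else None,
        padding="max_length",
        truncation=True,
        max_length=total_budget,
        return_tensors="pt",
    )
```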
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in our training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. The model was pretrained with 512-length sequences for all steps.
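A minimal sketch of what such an MLM-only (no NSP) pretraining setup could look like with the Hugging Face `Trainer` is shown below. It is illustrative only, not the original SambaNova training code; the tiny in-memory dataset is a placeholder for the packed 512-token spans, and only the hyperparameters named above (learning rate 5e-6, batch size 128, 1.7M steps) come from the card:

```python
# Sketch of an MLM-only pretraining setup with the hyperparameters described
# above. Illustrative only; the dataset below is a toy stand-in.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained('pile-of-law/legalbert-large-1.7M-1')
# Fresh random weights with the same architecture (pretraining from scratch).
model = BertForMaskedLM(BertConfig.from_pretrained('pile-of-law/legalbert-large-1.7M-1'))

# The collator applies BERT's 80-10-10 mask/corrupt/keep scheme at batch time.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Toy stand-in for real packed 512-token spans.
texts = [
    "The court held that the contract was enforceable.",
    "An appeal is a request that a higher court review a decision.",
]
train_dataset = [tokenizer(t, truncation=True, padding="max_length", max_length=512) for t in texts]

args = TrainingArguments(
    output_dir="legalbert-mlm",
    learning_rate=5e-6,           # smaller LR to mitigate instability
    per_device_train_batch_size=128,
    max_steps=1_700_000,          # 1.7M steps in the card
    save_steps=50_000,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
# trainer.train()  # uncomment to actually run (requires substantial hardware)
```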
We trained two models with the same setup in parallel, using different random seeds. We selected the model with the lowest log likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for our experiments, but we also release the second model, pile-of-law/legalbert-large-1.7M-2.
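As a rough proxy for that selection criterion (and not the authors' procedure), one could compare the two released checkpoints by masked-LM loss on held-out text; the held-out sample below is a placeholder, and the random masking makes the comparison stochastic:

```python
# Sketch: compare the two released checkpoints by masked-LM loss on held-out
# text. An approximation of "lowest log likelihood", not the authors' method.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

held_out = ["An appeal is a request that a higher court review a trial court's decision."]  # placeholder

def mlm_loss(model_name, texts):
    tokenizer = BertTokenizerFast.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name).eval()
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    batch = collator([tokenizer(t, truncation=True, max_length=512) for t in texts])
    with torch.no_grad():
        return model(**batch).loss.item()  # cross-entropy over masked positions

for name in ["pile-of-law/legalbert-large-1.7M-1", "pile-of-law/legalbert-large-1.7M-2"]:
    print(name, mlm_loss(name, held_out))
```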
When fine-tuned on the CaseHOLD variant provided by the LexGLUE paper, PoL-BERT-Large achieves the results below. In the table, we also report results for a model with hyperparameters tuned on the downstream task, as well as the result from the fixed experimental setup used for the CaseLaw-BERT model in the LexGLUE paper. A sketch of the multiple-choice setup behind this kind of fine-tuning appears after the table.
CaseHOLD test results:
| Model | F1 |
|---|---|
| CaseLaw-BERT (tuned) | 78.5 |
| CaseLaw-BERT (LexGLUE) | 75.4 |
| PoL-BERT-Large | 75.0 |
| BERT-Large-Uncased | 71.3 |
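CaseHOLD asks the model to select the correct holding for a citing context from five candidates, which maps naturally onto a multiple-choice head. The following is a hedged sketch of how that setup could be wired up with transformers; the toy context and candidates are assumptions, not the LexGLUE data or evaluation harness:

```python
# Sketch: CaseHOLD-style multiple choice with a BERT multiple-choice head.
# The toy example below stands in for real LexGLUE CaseHOLD data.
import torch
from transformers import BertForMultipleChoice, BertTokenizerFast

model_name = "pile-of-law/legalbert-large-1.7M-1"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMultipleChoice.from_pretrained(model_name)  # choice head is randomly initialized

context = "The court granted the motion, citing (<HOLDING>)."  # toy citing context
candidates = [  # five toy candidate holdings
    "holding that the statute of limitations had run",
    "holding that the contract was unenforceable",
    "holding that the evidence was inadmissible",
    "holding that venue was improper",
    "holding that the appeal was untimely",
]

# Each (context, candidate) pair is encoded separately; the model scores all five.
enc = tokenizer([context] * len(candidates), candidates, truncation=True, padding=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}  # shape: (batch=1, n_choices=5, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits  # one score per candidate holding
print("predicted holding:", candidates[logits.argmax(-1).item()])
```

Until the multiple-choice head is fine-tuned on the CaseHOLD training split, the prediction above is meaningless; fine-tuning is what produces the F1 scores reported in the table.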
```
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```