This is a model pretrained on English legal and administrative text, using the RoBERTa pretraining objective. It has the same setup as pile-of-law/legalbert-large-1.7M-1 but was trained with a different seed.
Pile of Law BERT large model 2 is a transformers model based on the BERT large model (uncased) architecture, pretrained on the Pile of Law, a dataset of roughly 256GB of English legal and administrative text assembled for language-model pretraining.
You can use the raw model for masked language modeling, or fine-tune it for a downstream task. Because this model was pretrained on a corpus of English legal and administrative text, legal-domain downstream tasks are likely a better fit for it.
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")
[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.5218929052352905,
  'token': 4028,
  'token_str': 'exception'},
 {'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.11434809118509293,
  'token': 1151,
  'token_str': 'appeal'},
 {'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.06454459577798843,
  'token': 5345,
  'token_str': 'exclusion'},
 {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.043593790382146835,
  'token': 3677,
  'token_str': 'example'},
 {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.03758585825562477,
  'token': 3542,
  'token_str': 'objection'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
And here is how to use this model in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
See Appendix G of the Pile of Law paper for copyright limitations on dataset and model use.
This model can produce biased predictions. In the following example, when the model is used with a pipeline for masked language modeling to fill in the race of a criminal suspect, it assigns a higher score to "black" than to "white".
```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])
[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
  'score': 0.02685137465596199,
  'token': 4311,
  'token_str': 'black'},
 {'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
  'score': 0.013632853515446186,
  'token': 4249,
  'token_str': 'white'}]
```
This bias will also affect all fine-tuned versions of this model.
The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of roughly 256GB of English legal and administrative text used for language-model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, and more. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
The model vocabulary consists of 29,000 tokens fit to the Pile of Law with the HuggingFace WordPiece tokenizer, plus 3,000 legal terms randomly sampled from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. For pretraining, the 80-10-10 masking/corruption/leave split described in BERT was used, with a duplication rate of 20 to create different masks for each context. To generate sequences, we used the LexNLP sentence segmenter, which handles sentence segmentation for legal citations (which are often falsely mistaken for sentences). The inputs were formatted by filling sentences until they comprised 256 tokens, followed by a [SEP] token, and then filling sentences such that the entire span was under 512 tokens. If the next sentence in the series was too long, it was not added, and the remaining context length was filled with padding tokens.
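The packing scheme above can be sketched as follows. This is not the authors' preprocessing code, just a minimal illustration of the described rule: sentences fill a first segment of up to 256 tokens, a `[SEP]` token is appended, further sentences fill the span up to 512 tokens, and a sentence that would overflow is dropped with the remainder filled by padding. Sentences are represented here as lists of token strings; the function name `pack_sequence` is hypothetical.

```python
SEP, PAD = "[SEP]", "[PAD]"

def pack_sequence(sentences, first_seg=256, max_len=512):
    """Pack tokenized sentences into one fixed-length span (illustrative)."""
    span = []
    it = iter(sentences)
    pending = None
    # Fill the first segment with whole sentences, up to `first_seg` tokens.
    for sent in it:
        if len(span) + len(sent) > first_seg:
            pending = sent  # carry the overflowing sentence to the next segment
            break
        span.extend(sent)
    span.append(SEP)
    # Fill the rest of the span, stopping at the first sentence that overflows.
    rest = ([pending] if pending else []) + list(it)
    for sent in rest:
        if len(span) + len(sent) > max_len:
            break  # next sentence too long: do not add it
        span.extend(sent)
    # Fill the remaining context length with padding tokens.
    span.extend([PAD] * (max_len - len(span)))
    return span
```

For example, packing sentences of 100, 100, 100, and 300 tokens places the first two sentences before `[SEP]` (a third would exceed 256), the third sentence after it, and drops the 300-token sentence because it would push the span past 512.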
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in the training data. Pretraining used the masked language modeling (MLM) objective without the NSP loss, as described in RoBERTa. The model was pretrained with 512-length sequences for all steps.
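The MLM corruption rule referenced above (BERT's 80-10-10 split, combined with the duplication rate of 20) can be sketched in a few lines. This is not the authors' training code, only an illustration: each position is selected for prediction with 15% probability, and a selected position is replaced with `[MASK]` 80% of the time, with a random vocabulary token 10% of the time, or left unchanged 10% of the time; duplicating the corruption with different seeds yields different masks for the same context. The helper names and the 15% selection rate are standard BERT defaults assumed here, not taken from this model card.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for MLM training (BERT's 80-10-10 rule)."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # remaining 10%: leave the token unchanged
    return inputs, labels

def duplicate_corruptions(tokens, vocab, dupe_factor=20):
    """One corrupted copy per duplicate, each with a different random mask."""
    return [mask_tokens(tokens, vocab, seed=s)[0] for s in range(dupe_factor)]
```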
We trained two models with the same setup in parallel, using different random seeds. We selected the model with the lowest log-likelihood, pile-of-law/legalbert-large-1.7M-1, which we refer to as PoL-BERT-Large, for experiments, but also release this second model, pile-of-law/legalbert-large-1.7M-2.
See the model card for pile-of-law/legalbert-large-1.7M-1 for fine-tuning results on the CaseHOLD variant provided by the LexGLUE paper.
```
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```