Model:

joelito/legal-xlm-longformer-base

Task:

Fill-Mask

Libraries:

PyTorch Transformers

Datasets:

MultiLegalPile LEXTREME LEXGLUE

Languages:

multilingual

Other:

longformer AutoTrain Compatible

Preprint:

arxiv:2306.02069 arxiv:2301.13126 arxiv:2110.00976 arxiv:2306.09237

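Model Card for joelito/legal-xlm-longformer-base

This is a multilingual model pretrained on legal data. It is based on XLM-R (base and large). For pretraining we used Multi Legal Pile (Niklaus et al. 2023), a multilingual dataset drawn from various legal sources and covering 24 languages.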

Model Details

Model Description

Developed by: Joel Niklaus: huggingface; email
Model type: Transformer-based language model (Longformer)
Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
License: CC BY-SA

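Uses

Direct Use and Downstream Use

You can use the raw model for masked language modeling; no next sentence prediction objective was used during pretraining. However, its main purpose is to be fine-tuned for downstream tasks.

Note that this model is primarily aimed at being fine-tuned on tasks that rely on the entire sentence, potentially with masked elements, to make decisions. Examples of such tasks include sequence classification, token classification, and question answering. For text generation tasks, models like GPT-2 are more suitable.

Additionally, the model is trained specifically on legal data, with the aim of delivering strong performance in that domain. Its performance may vary when applied to non-legal data.

Out-of-Scope Use

For tasks such as text generation, you should look at models like GPT-2.

The model should not be used to intentionally create hostile or alienating environments for people. The model was not trained to be a factual or true representation of people or events, so using it to generate such content is out of scope for its abilities.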

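Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

See the huggingface tutorials. For masked word prediction, see this tutorial.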
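
As a minimal sketch of masked word prediction with this checkpoint, the helper below wraps the Transformers fill-mask pipeline. The function name top_predictions, the {mask} placeholder convention, and the example sentence are illustrative assumptions that do not come from the original card. The mask token is read from the tokenizer rather than hard-coded.

```python
from typing import List, Tuple

MODEL_ID = "joelito/legal-xlm-longformer-base"


def top_predictions(template: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (token, score) candidates for the single {mask} slot."""
    # Imported lazily so that defining the helper does not load Transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    # Read the mask token from the tokenizer instead of assuming its spelling.
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]


# Example call (downloads the checkpoint on first use; the sentence is hypothetical):
# for token, score in top_predictions("The court dismissed the {mask}."):
#     print(f"{token!r}: {score:.3f}")
```

The helper expects exactly one {mask} slot per input. For repeated calls it would be cheaper to build the pipeline once and reuse it.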

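Training Details

This model was pretrained on Multi Legal Pile (Niklaus et al. 2023).

Our pretraining procedure includes the following key steps:

(a) Warm-starting: We initialize our models from the original XLM-R checkpoints (base and large) to benefit from a well-trained base.

(b) Tokenization: We train a new tokenizer of 128K BPEs to cover legal language better. However, we reuse the original XLM-R embeddings for lexically overlapping tokens and use random embeddings for the rest.

(c) Pretraining: We continue pretraining on Multi Legal Pile with batches of 512 samples for an additional 1M/500K steps for the base/large model. We use warm-up steps, a linearly increasing learning rate, and cosine decay scheduling. During the warm-up phase, only the embeddings are updated. Compared to Devlin et al. (2019), we use a higher masking rate and a higher percentage of predictions based on masked tokens.

(d) Sentence Sampling: We employ a sentence sampler with exponential smoothing to handle disparate token proportions across cantons and languages, preserving per-canton and per-language capacity.

(e) Mixed Cased Models: Our models cover both upper- and lowercase letters, similar to recently developed large PLMs.

(f) Long-Context Training: To account for long contexts in legal documents, we train the base-size multilingual model on long contexts with windowed attention. This variant, named Legal-Swiss-LF-base, uses a 15% masking probability, an increased learning rate, and settings similar to those of the small-context models.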
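
Steps (a) and (b) can be illustrated with a small sketch of embedding reuse: tokens that also exist in the old vocabulary copy their old vector, and all other tokens get a random one. This is not the authors' code. The function name, the toy vocabularies, and the standard deviation of 0.02 for random initialization are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence, Tuple


def init_embeddings(
    new_vocab: Sequence[str],
    old_vocab: Dict[str, int],
    old_embeddings: Sequence[Sequence[float]],
    dim: int,
    std: float = 0.02,  # assumed value, not stated in the card
    seed: int = 0,
) -> Tuple[List[List[float]], int]:
    """Build one embedding row per new token and count how many rows were reused."""
    rng = random.Random(seed)
    rows: List[List[float]] = []
    reused = 0
    for token in new_vocab:
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Overlapping token: copy the pretrained vector.
            rows.append(list(old_embeddings[old_id]))
            reused += 1
        else:
            # Token unseen by the old tokenizer: draw a random vector.
            rows.append([rng.gauss(0.0, std) for _ in range(dim)])
    return rows, reused


# Toy data (hypothetical): a 3-token old vocabulary with 2-dimensional vectors.
old_vocab = {"court": 0, "law": 1, "cat": 2}
old_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_vocab = ["court", "plaintiff", "law"]

rows, reused = init_embeddings(new_vocab, old_vocab, old_embeddings, dim=2)
print(f"reused {reused} of {len(new_vocab)} rows")
```

In a real setup the rows would live in a tensor, and the token lookup would go through the two tokenizers' vocabularies.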

Training Data

This model was pretrained on Multi Legal Pile (Niklaus et al. 2023).
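
Step (d) above mentions exponential smoothing of the sampling proportions. The card does not give the formula or the exponent, so the sketch below assumes the common scheme in which each group's token share is raised to a power alpha below 1 and the results are renormalized. The value alpha=0.7 and the token counts are hypothetical.

```python
from typing import Dict


def smoothed_sampling_probs(token_counts: Dict[str, int], alpha: float = 0.7) -> Dict[str, float]:
    """Turn raw token counts into sampling probabilities proportional to share**alpha."""
    total = sum(token_counts.values())
    weights = {key: (count / total) ** alpha for key, count in token_counts.items()}
    norm = sum(weights.values())
    return {key: weight / norm for key, weight in weights.items()}


# Hypothetical token counts per language (not taken from Multi Legal Pile).
counts = {"de": 900, "fr": 90, "ga": 10}
probs = smoothed_sampling_probs(counts)
for lang, p in probs.items():
    print(f"{lang}: raw share {counts[lang] / sum(counts.values()):.3f}, sampled {p:.3f}")
```

With alpha below 1, small groups are sampled more often than their raw share and large groups less often, while the ranking of the groups is unchanged.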

Preprocessing

For further details, see Niklaus et al. 2023.

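Training Hyperparameters

Batch size: 512 samples
Number of steps: 1M/500K for the base/large model
Warm-up steps: the first 5% of the total training steps
Learning rate: (linearly increasing up to) 1e-4
Word masking: increased masking rate of 20%/30% for the base/large model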
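
The schedule implied by these hyperparameters can be sketched as a linear warm-up over the first 5% of steps to a peak of 1e-4, followed by cosine decay. The card does not state the final learning rate, so decaying to zero (final_lr=0.0) is an assumption, as is the function name.

```python
import math


def learning_rate(
    step: int,
    total_steps: int = 1_000_000,  # base model; the large model uses 500K steps
    peak_lr: float = 1e-4,
    warmup_fraction: float = 0.05,
    final_lr: float = 0.0,  # assumed; not stated in the card
) -> float:
    """Linear warm-up to peak_lr, then cosine decay towards final_lr."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


for step in (0, 25_000, 50_000, 525_000, 1_000_000):
    print(f"step {step:>9,}: lr = {learning_rate(step):.2e}")
```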

Evaluation

For performance on downstream tasks, such as LEXTREME (Niklaus et al. 2023) or LEXGLUE (Chalkidis et al. 2021), we refer to the results presented in Niklaus et al. (2023) 1, 2.

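Model Architecture and Objective

It is a RoBERTa-based model with Longformer windowed attention. Run the following code to view the architecture: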

from transformers import AutoModel
model = AutoModel.from_pretrained('joelito/legal-xlm-longformer-base')
print(model)

LongformerModel(
  (embeddings): LongformerEmbeddings(
    (word_embeddings): Embedding(128000, 768, padding_idx=0)
    (position_embeddings): Embedding(4098, 768, padding_idx=0)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): LongformerEncoder(
    (layer): ModuleList(
      (0-11): 12 x LongformerLayer(
        (attention): LongformerAttention(
          (self): LongformerSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (query_global): Linear(in_features=768, out_features=768, bias=True)
            (key_global): Linear(in_features=768, out_features=768, bias=True)
            (value_global): Linear(in_features=768, out_features=768, bias=True)
          )
          (output): LongformerSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): LongformerIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): LongformerOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): LongformerPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
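
As a rough cross-check, the parameter count can be derived from the layer shapes printed above. The sketch counts only what the printout shows (weights and biases of the listed modules), so the loaded model may report a different figure if it holds parameters that are not visible here.

```python
def linear(n_in: int, n_out: int) -> int:
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Parameters of a LayerNorm with elementwise affine (scale and shift)."""
    return 2 * dim


HIDDEN, FFN, LAYERS = 768, 3072, 12
VOCAB, POSITIONS, TOKEN_TYPES = 128_000, 4_098, 1

# Word, position and token-type tables plus the embedding LayerNorm.
embeddings = (VOCAB + POSITIONS + TOKEN_TYPES) * HIDDEN + layer_norm(HIDDEN)
# query/key/value and their *_global counterparts, then the output projection.
attention = 6 * linear(HIDDEN, HIDDEN) + linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm(HIDDEN)
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
total = embeddings + encoder + pooler

print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"pooler:     {pooler:,}")
print(f"total:      {total:,}")
print(f"embedding share: {embeddings / total:.1%}")
```

Under these shapes the embedding tables account for a little under half of the weights, a consequence of the 128K-token vocabulary.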

Compute Infrastructure

Google TPU

Hardware

Google TPU v3-8

Software

PyTorch, Transformers

Citation

@article{Niklaus2023MultiLegalPileA6,
  title={MultiLegalPile: A 689GB Multilingual Legal Corpus},
  author={Joel Niklaus and Veton Matoshi and Matthias St{\"u}rmer and Ilias Chalkidis and Daniel E. Ho},
  journal={ArXiv},
  year={2023},
  volume={abs/2306.02069}
}

Model Card Authors

Joel Niklaus: huggingface; email

Veton Matoshi: huggingface; email

Model Card Contact

Joel Niklaus: huggingface; email

Veton Matoshi: huggingface; email

Author:

Joel Niklaus

Dataset size:

801.36 MB