
Custom Legal-BERT

Model and tokenizer files for the Custom Legal-BERT model from When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset.

Training Data

The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present ( https://case.law/ ). This corpus is substantial in size (37GB), representing 3,446,187 legal decisions across all federal and state courts, and is larger than the BookCorpus/Wikipedia corpus (15GB) originally used to train BERT.

Training Objective

This model is pretrained from scratch on the MLM and NSP objectives, with tokenization and sentence segmentation adapted for legal text (see the paper).
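As a minimal sketch (not the authors' training code), the combined MLM and NSP loss can be computed with Hugging Face transformers' BertForPreTraining. The tokenizer path below is hypothetical (it stands in for the custom legal tokenizer described next), and real pretraining would mask tokens rather than copy the inputs as labels.

import torch
from transformers import BertConfig, BertForPreTraining, BertTokenizerFast

# Hypothetical path standing in for the custom legal tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("path/to/custom-legal-tokenizer")
model = BertForPreTraining(BertConfig(vocab_size=tokenizer.vocab_size))

# One toy sentence pair; next_sentence_label 0 means "sentence B follows A".
enc = tokenizer(
    "The court held that the statute applies.",
    "Therefore, the motion to dismiss is denied.",
    return_tensors="pt",
)
# MLM labels: in real pretraining ~15% of tokens are masked (e.g. via
# DataCollatorForLanguageModeling); here the labels are simply the inputs.
labels = enc["input_ids"].clone()
outputs = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
print(outputs.loss)  # sum of the MLM and NSP losses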

The model also uses a custom domain-specific legal vocabulary. The vocabulary set is constructed using SentencePiece on a subsample (approx. 13M) of our pretraining corpus, with the number of tokens fixed to 32,000.
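As an illustration of how such a vocabulary can be built, the sketch below trains a SentencePiece model with 32,000 tokens on a plain-text subsample; the input file name and the model_type setting are assumptions, not the authors' exact configuration.

import sentencepiece as spm

# corpus_subsample.txt is a hypothetical file with one sentence per line,
# drawn from the pretraining corpus.
spm.SentencePieceTrainer.train(
    input="corpus_subsample.txt",
    model_prefix="legal_vocab",
    vocab_size=32000,
    model_type="unigram",  # assumption; see the paper for the exact setup
)

# The resulting legal_vocab.model can then tokenize legal text.
sp = spm.SentencePieceProcessor(model_file="legal_vocab.model")
print(sp.encode("The court overruled the prior holding.", out_type=str))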

Usage

Please see the casehold repository for scripts that support computing the pretraining loss and finetuning Custom Legal-BERT on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, CaseHOLD.
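For quick inspection outside those scripts, the model and tokenizer can be loaded with Hugging Face transformers. The model ID casehold/custom-legalbert below is an assumption; substitute a local path to these files if loading them directly.

from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face model ID; replace with a local directory containing
# the model and tokenizer files if needed.
tokenizer = AutoTokenizer.from_pretrained("casehold/custom-legalbert")
model = AutoModel.from_pretrained("casehold/custom-legalbert")

inputs = tokenizer("The defendant moved to dismiss the complaint.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)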

Citation

@inproceedings{zhengguha2021,
    title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
    author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
    year={2021},
    eprint={2104.08671},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
    publisher={Association for Computing Machinery}
}

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL] .