Model and tokenizer files for the custom Legal-BERT model from When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset.
The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present ( https://case.law/ ). This corpus is substantial in size (37GB), comprising 3,446,187 legal decisions from all federal and state courts, and is larger than the BookCorpus/Wikipedia corpus (15GB) originally used to train BERT.
This model was pretrained from scratch on the MLM and NSP objectives, with tokenization and sentence segmentation adapted to legal text (see the paper).
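A minimal loading sketch with the Hugging Face transformers library is shown below. The Hub identifier `casehold/custom-legalbert` and the example sentence are assumptions; substitute the actual path to these model and tokenizer files.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical Hub identifier; replace with the actual location of these files.
MODEL_NAME = "casehold/custom-legalbert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Quick sanity check: fill a masked token in a legal sentence.
inputs = tokenizer("The court granted the motion for summary [MASK].", return_tensors="pt")
outputs = model(**inputs)

mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```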
The model also uses a custom, domain-specific legal vocabulary. The vocabulary set was constructed with SentencePiece on a subsample (approximately 13M) of our pretraining corpus, with the number of tokens fixed at 32,000.
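As an illustration of how such a vocabulary can be built, the sketch below trains a SentencePiece model with a 32,000-token vocabulary. The input file name, model prefix, and any unlisted SentencePiece options are assumptions, since the paper only specifies the vocabulary size and the subsampled corpus.

```python
import sentencepiece as spm

# Illustrative sketch only: the vocabulary size (32,000) comes from the paper,
# but "legal_corpus_sample.txt" (the subsample of the pretraining corpus) and
# the remaining SentencePiece settings are hypothetical.
spm.SentencePieceTrainer.train(
    input="legal_corpus_sample.txt",
    model_prefix="legal_vocab",
    vocab_size=32000,
)
```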
See the casehold repository for scripts that support computing pretraining loss and fine-tuning custom Legal-BERT on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, and CaseHOLD.
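The actual loss-computation scripts live in the casehold repository. As a rough standalone sketch, the snippet below estimates MLM loss on a few legal sentences using the standard transformers masking collator; the Hub identifier, the 15% masking probability, and the example sentences are assumptions, not the repository's exact procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

MODEL_NAME = "casehold/custom-legalbert"  # hypothetical Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# Randomly mask 15% of tokens, as in standard BERT-style MLM.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

sentences = [
    "The defendant moved to dismiss for lack of subject matter jurisdiction.",
    "The appellate court reviewed the lower court's ruling de novo.",
]
encodings = [tokenizer(s, truncation=True) for s in sentences]
batch = collator(encodings)

with torch.no_grad():
    loss = model(**batch).loss  # average cross-entropy over masked positions
print(f"MLM loss: {loss.item():.3f}")
```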
@inproceedings{zhengguha2021,
  title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
  author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
  year={2021},
  eprint={2104.08671},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
  publisher={Association for Computing Machinery}
}
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL] .