Model:

casehold/legalbert

English

Legal-BERT

Model and tokenizer files for the Legal-BERT model from When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings.

Training Data

The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present (https://case.law/). This corpus is substantial in size (37GB), comprising 3,446,187 legal decisions across all federal and state courts, and is larger than the BookCorpus/Wikipedia corpus (15GB) originally used to train BERT.

Training Objective

The model is initialized with the base BERT model (uncased, 110M parameters), bert-base-uncased, and trained for an additional 1M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted to legal text (see the paper for details).
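
As a quick illustration of the MLM objective, the snippet below queries the model's masked-language-modeling head through the Hugging Face transformers pipeline API. This is a minimal sketch, assuming the checkpoint is hosted on the Hugging Face Hub under the casehold/legalbert identifier listed above; the example sentence is illustrative only.

# Minimal sketch: probing the MLM head (assumes the checkpoint is available
# on the Hugging Face Hub as casehold/legalbert).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="casehold/legalbert")

# [MASK] is the standard BERT mask token; the legal-domain vocabulary of the
# tokenizer is applied automatically.
predictions = fill_mask("The defendant was convicted of [MASK] in the first degree.")
for p in predictions:
    print(f"{p['token_str']}\t{p['score']:.3f}")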

Usage

See the casehold repository for scripts that support computing the pretraining loss with Legal-BERT and fine-tuning it on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, and CaseHOLD. A minimal loading sketch is given below.
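
The following is a minimal sketch for loading the tokenizer and encoder with the transformers library, assuming the checkpoint is published on the Hugging Face Hub under casehold/legalbert; it does not reproduce the task-specific fine-tuning setup from the casehold repository.

# Minimal sketch: load tokenizer and encoder for downstream fine-tuning or
# feature extraction (assumes the Hub identifier casehold/legalbert).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("casehold/legalbert")
model = AutoModel.from_pretrained("casehold/legalbert")

# Example sentence is illustrative only.
inputs = tokenizer(
    "The court held that the statute was unconstitutional.",
    return_tensors="pt",
)
outputs = model(**inputs)
# outputs.last_hidden_state has shape (batch, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)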

Citation

@inproceedings{zhengguha2021,
        title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
        author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
        year={2021},
        eprint={2104.08671},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
        publisher={Association for Computing Machinery}
}

Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY, (in press). arXiv: 2104.08671 [cs.CL] .