Datasets:

eurlex

Languages:

en

Multilinguality:

monolingual

Size:

10K<n<100K

Language creators:

found

Annotations creators:

found

Source datasets:

original

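Dataset Card for the EUR-Lex dataset

Dataset Summary

EURLEX57K can be viewed as an improved version of the dataset released by Mencía and Fürnkranz (2007), which has been widely used in large-scale multi-label text classification (LMTC) research but is less than half the size of EURLEX57K (19.6k documents, 4k EUROVOC labels) and more than ten years old. EURLEX57K contains 57k English legislative documents from EUR-Lex ( https://eur-lex.europa.eu ) with an average length of 727 words. Each document contains four major zones: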

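  • the header, which includes the title and the name of the legal body enforcing the legal act
  • the recitals, which are legal background references
  • the main body, usually organized in articles
  • the attachments (e.g., appendices, annexes)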

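Labels/Annotations

All documents in the dataset have been annotated by the Publications Office of the EU ( https://publications.europa.eu/en ) with multiple concepts from EUROVOC ( http://eurovoc.europa.eu/ ). While EUROVOC includes approximately 7k concepts (labels), only 4,271 (59.31%) are present in EURLEX57K, and only 2,049 of those (47.97%) have been assigned to more than 10 documents. The 4,271 labels are also divided into frequent (746 labels), few-shot (3,362) and zero-shot (163), depending on whether they were assigned to more than 50, fewer than 50 but at least one, or no training documents.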
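The frequent / few-shot / zero-shot grouping described above can be sketched in a few lines of Python. This is only an illustration and not the curators' code: the function name `group_labels` and its parameters are hypothetical, and because the card does not say how a label with exactly 50 training documents is treated, the sketch assumes it counts as few-shot.

```python
from collections import Counter
from typing import Dict, Iterable, List


def group_labels(
    train_label_sets: Iterable[List[str]],
    all_labels: Iterable[str],
    threshold: int = 50,
) -> Dict[str, List[str]]:
    """Bucket labels by the number of training documents they are assigned to."""
    # Count each label at most once per document.
    counts = Counter(
        label for labels in train_label_sets for label in set(labels)
    )
    groups: Dict[str, List[str]] = {"frequent": [], "few": [], "zero": []}
    for label in sorted(set(all_labels)):
        n = counts[label]
        if n > threshold:
            groups["frequent"].append(label)
        elif n >= 1:
            # Assumption: exactly `threshold` documents still counts as few-shot.
            groups["few"].append(label)
        else:
            groups["zero"].append(label)
    return groups


# Toy data with a threshold of 2 instead of 50, so the buckets are easy to check.
train = [["192", "862"], ["192"], ["192", "863"]]
groups = group_labels(train, all_labels=["192", "862", "863", "2356"], threshold=2)
print(groups)
# {'frequent': ['192'], 'few': ['862', '863'], 'zero': ['2356']}
```

The full label inventory is passed in separately because zero-shot labels never occur in the training split and so cannot be discovered from it.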

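Supported Tasks and Leaderboards

The dataset supports the following tasks:

Multi-label text classification: given the text of a document, a model predicts the relevant EUROVOC concepts.

Few-shot and zero-shot learning: as noted above, the labels can be divided into three groups: frequent (746 labels), few-shot (3,362) and zero-shot (163), depending on whether they were assigned to more than 50, fewer than 50 but at least one, or no training documents.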
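For the multi-label classification task above, the list of EUROVOC concept IDs attached to each document is usually turned into a fixed-length multi-hot target vector before training. The following is a minimal sketch under that assumption; the helper names `build_label_index` and `to_multi_hot` are hypothetical, and the second toy document (with the made-up label "1000") exists only to show two different vectors.

```python
from typing import Dict, List, Sequence


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every label seen in `label_sets` to a stable column position."""
    labels = sorted({label for labels in label_sets for label in labels})
    return {label: position for position, label in enumerate(labels)}


def to_multi_hot(labels: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode one document's labels as a 0/1 vector over the label index."""
    vector = [0] * len(label_index)
    for label in labels:
        position = label_index.get(label)
        if position is not None:  # labels missing from the index are skipped
            vector[position] = 1
    return vector


# First document: the labels of the sample instance; second document: toy data.
documents = [["192", "2356", "2560", "862", "863"], ["192", "1000"]]
label_index = build_label_index(documents)
vectors = [to_multi_hot(doc, label_index) for doc in documents]
print(vectors)
# [[0, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0]]
```

Note that an index built only from the training split has no column for zero-shot labels. A zero-shot setup would need the index to be built from the full EUROVOC label inventory instead.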

Languages

All documents are written in English.

Dataset Structure

Data Instances

{
  "celex_id": "31979D0509", 
  "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain", 
  "text": "COUNCIL DECISION  of 24 May 1979  on financial aid from the Community for the eradication of African swine fever in Spain  (79/509/EEC)\nTHE COUNCIL OF THE EUROPEAN COMMUNITIES\nHaving regard to the Treaty establishing the European Economic Community, and in particular Article 43 thereof,\nHaving regard to the proposal from the Commission (1),\nHaving regard to the opinion of the European Parliament (2),\nWhereas the Community should take all appropriate measures to protect itself against the appearance of African swine fever on its territory;\nWhereas to this end the Community has undertaken, and continues to undertake, action designed to contain outbreaks of this type of disease far from its frontiers by helping countries affected to reinforce their preventive measures ; whereas for this purpose Community subsidies have already been granted to Spain;\nWhereas these measures have unquestionably made an effective contribution to the protection of Community livestock, especially through the creation and maintenance of a buffer zone north of the river Ebro;\nWhereas, however, in the opinion of the Spanish authorities themselves, the measures so far implemented must be reinforced if the fundamental objective of eradicating the disease from the entire country is to be achieved;\nWhereas the Spanish authorities have asked the Community to contribute to the expenses necessary for the efficient implementation of a total eradication programme;\nWhereas a favourable response should be given to this request by granting aid to Spain, having regard to the undertaking given by that country to protect the Community against African swine fever and to eliminate completely this disease by the end of a five-year eradication plan;\nWhereas this eradication plan must include certain measures which guarantee the effectiveness of the action taken, and it must be possible to adapt these measures to developments in the situation by means of a procedure establishing close cooperation between the Member States and the Commission;\nWhereas it is necessary to keep the Member States regularly informed as to the progress of the action undertaken,",
  "eurovoc_concepts": ["192", "2356", "2560", "862", "863"]
}

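Data Fields

The following data fields are provided for documents (train, dev, test):

celex_id: (str) The official ID of the document. The CELEX number is the unique identifier for all publications in both EUR-Lex and CELLAR.
title: (str) The title of the document.
text: (str) The full content of each document, which is represented by its header, recitals and main body.
eurovoc_concepts: (List[str]) The relevant EUROVOC concepts (labels).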
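As a quick sanity check of the schema listed above, a record can be tested for the four fields and their types. This is a hedged sketch rather than part of the dataset tooling: `EXPECTED_FIELDS` and `validate_record` are hypothetical names, and the sample record reuses the instance shown earlier with its text shortened.

```python
from typing import Any, Dict, List

# Field names and types as listed in this card.
EXPECTED_FIELDS = {
    "celex_id": str,
    "title": str,
    "text": str,
    "eurovoc_concepts": list,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record looks fine."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    concepts = record.get("eurovoc_concepts")
    if isinstance(concepts, list) and not all(isinstance(c, str) for c in concepts):
        problems.append("eurovoc_concepts: every label should be a str")
    return problems


record = {
    "celex_id": "31979D0509",
    "title": "79/509/EEC: Council Decision of 24 May 1979 on financial aid from "
             "the Community for the eradication of African swine fever in Spain",
    "text": "COUNCIL DECISION of 24 May 1979 (text shortened for this example)",
    "eurovoc_concepts": ["192", "2356", "2560", "862", "863"],
}
print(validate_record(record))
# []
print(validate_record({**record, "eurovoc_concepts": [192]}))
# ['eurovoc_concepts: every label should be a str']
```

The second call shows the most likely mistake: concept IDs are stored as strings, so integer IDs are flagged.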

If you want to use the descriptors of the EUROVOC concepts, as in Chalkidis et al. (2020), download and load: https://archive.org/download/EURLEX57K/eurovoc_concepts.jsonl

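import json
# Each line of the JSONL file is one JSON object describing a EUROVOC concept.
with open('./eurovoc_concepts.jsonl', encoding='utf-8') as jsonl_file:
    # Use a list: parsed objects are dicts, which are unhashable and cannot go in a set.
    eurovoc_concepts = [json.loads(line) for line in jsonl_file if line.strip()]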

Data Splits

Split         No. of documents   Avg. words   Avg. labels
Train         45,000             729          5
Development   6,000              714          5
Test          6,000              725          5
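The split sizes above can be cross-checked against the figures in the dataset summary with a little arithmetic. The sketch below uses only the numbers from the table, and the variable names are illustrative. The document-weighted average length comes out at 727 words, matching the average quoted in the summary.

```python
# (number of documents, average words) per split, copied from the table above.
splits = {
    "train": (45_000, 729),
    "development": (6_000, 714),
    "test": (6_000, 725),
}

total_docs = sum(n_docs for n_docs, _ in splits.values())

# Average length over the whole corpus, weighting each split by its size.
weighted_avg_words = sum(n_docs * avg for n_docs, avg in splits.values()) / total_docs

# Share of the corpus that falls in each split.
shares = {name: n_docs / total_docs for name, (n_docs, _) in splits.items()}

print(total_docs, weighted_avg_words)
# 57000 727.0
print({name: round(share, 3) for name, share in shares.items()})
# {'train': 0.789, 'development': 0.105, 'test': 0.105}
```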

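Dataset Creation

Curation Rationale

The dataset was curated by Chalkidis et al. (2019). The documents have been annotated by the Publications Office of the EU ( https://publications.europa.eu/en ).

Source Data

Initial Data Collection and Normalization

The original data are available at the EUR-Lex portal ( https://eur-lex.europa.eu ) in an unprocessed format. The documents were downloaded from the EUR-Lex portal in HTML format. The relevant metadata and EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the EU ( http://publications.europa.eu/webapi/rdf/sparql ).

Who are the source language producers?

[More Information Needed]

Annotations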

Annotation process and annotators

Publications Office of the EU ( https://publications.europa.eu/en )

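Personal and Sensitive Information

The dataset does not include personal or sensitive information.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Chalkidis et al. (2019)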

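Licensing Information

© European Union, 1998-2021

The Commission's document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can reuse the legal documents published in EUR-Lex for commercial or non-commercial purposes.

The copyright for the editorial content of the website, the summaries of EU legislation and the consolidated texts, which is owned by the EU, is licensed under the Creative Commons Attribution 4.0 International licence. This means that you can reuse the content provided you acknowledge the source and indicate any changes you have made.

Source: https://eur-lex.europa.eu/content/legal-notice/legal-notice.html
Read more: https://eur-lex.europa.eu/content/help/faq/reuse-contents-eurlex.html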

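Citation Information

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis and Ion Androutsopoulos. Large-Scale Multi-Label Text Classification on EU Legislation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). Florence, Italy. 2019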

@inproceedings{chalkidis-etal-2019-large,
    title = "Large-Scale Multi-Label Text Classification on {EU} Legislation",
    author = "Chalkidis, Ilias  and Fergadiotis, Manos  and Malakasiotis, Prodromos  and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1636",
    doi = "10.18653/v1/P19-1636",
    pages = "6314--6322"
}

Contributions

Thanks to @iliaschalkidis for adding this dataset.