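The EU has 24 official languages. When a new member joins the EU, the set of official languages usually expands, unless the new member's language is already included. MultiEURLEX covers 23 languages from seven language families (Germanic, Romance, Slavic, Uralic, Baltic, Semitic, Hellenic). EU laws are published in all official languages except Irish, for resource-related reasons (see https://europa.eu/european-union/about-eu/eu-languages_en ). This wide coverage makes MultiEURLEX a valuable testbed for cross-lingual transfer. All languages use the Latin script, except for Bulgarian (Cyrillic script) and Greek (Greek script). Several other languages are also used in EU member states: the EU has more than 60 additional indigenous regional or minority languages, e.g., Basque, Catalan, Frisian, Saami, and Yiddish, spoken by approximately 40 million people. These additional languages are not considered official by the EU, and EU laws are not translated into them.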
from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'all_languages')
{ "celex_id": "31979D0509", "text": {"en": "COUNCIL DECISION of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain (79/509/EEC)\nTHE COUNCIL OF THE EUROPEAN COMMUNITIES\nHaving regard to the Treaty establishing the European Economic Community, and in particular Article 43 thereof,\nHaving regard to the proposal from the Commission (1),\nHaving regard to the opinion of the European Parliament (2),\nWhereas the Community should take all appropriate measures to protect itself against the appearance of African swine fever on its territory;\nWhereas to this end the Community has undertaken, and continues to undertake, action designed to contain outbreaks of this type of disease far from its frontiers by helping countries affected to reinforce their preventive measures ; whereas for this purpose Community subsidies have already been granted to Spain;\nWhereas these measures have unquestionably made an effective contribution to the protection of Community livestock, especially through the creation and maintenance of a buffer zone north of the river Ebro;\nWhereas, however, in the opinion of the Spanish authorities themselves, the measures so far implemented must be reinforced if the fundamental objective of eradicating the disease from the entire country is to be achieved;\nWhereas the Spanish authorities have asked the Community to contribute to the expenses necessary for the efficient implementation of a total eradication programme;\nWhereas a favourable response should be given to this request by granting aid to Spain, having regard to the undertaking given by that country to protect the Community against African swine fever and to eliminate completely this disease by the end of a five-year eradication plan;\nWhereas this eradication plan must include certain measures which guarantee the effectiveness of the action taken, and it must be possible to adapt these measures to developments in the situation by means of a 
procedure establishing close cooperation between the Member States and the Commission;\nWhereas it is necessary to keep the Member States regularly informed as to the progress of the action undertaken,", "es": "DECISIÓN DEL CONSEJO de 24 de mayo de 1979 sobre ayuda financiera de la Comunidad para la erradicación de la peste porcina africana en España (79/509/CEE)\nEL CONSEJO DE LAS COMUNIDADES EUROPEAS\nVeniendo en cuenta el Tratado constitutivo de la Comunidad Económica Europea y, en particular, Su artículo 43,\n Vista la propuesta de la Comisión (1),\n Visto el dictamen del Parlamento Europeo (2),\nConsiderando que la Comunidad debe tomar todas las medidas adecuadas para protegerse contra la aparición de la peste porcina africana en su territorio;\nConsiderando a tal fin que la Comunidad ha emprendido y sigue llevando a cabo acciones destinadas a contener los brotes de este tipo de enfermedades lejos de sus fronteras, ayudando a los países afectados a reforzar sus medidas preventivas; que a tal efecto ya se han concedido a España subvenciones comunitarias;\nQue estas medidas han contribuido sin duda alguna a la protección de la ganadería comunitaria, especialmente mediante la creación y mantenimiento de una zona tampón al norte del río Ebro;\nConsiderando, no obstante, , a juicio de las propias autoridades españolas, las medidas implementadas hasta ahora deben reforzarse si se quiere alcanzar el objetivo fundamental de erradicar la enfermedad en todo el país;\nConsiderando que las autoridades españolas han pedido a la Comunidad que contribuya a los gastos necesarios para la ejecución eficaz de un programa de erradicación total;\nConsiderando que conviene dar una respuesta favorable a esta solicitud concediendo una ayuda a España, habida cuenta del compromiso asumido por dicho país de proteger a la Comunidad contra la peste porcina africana y de eliminar completamente esta enfermedad al final de un plan de erradicación de cinco años;\nMientras que este plan de 
erradicación debe incluir e determinadas medidas que garanticen la eficacia de las acciones emprendidas, debiendo ser posible adaptar estas medidas a la evolución de la situación mediante un procedimiento que establezca una estrecha cooperación entre los Estados miembros y la Comisión;\nConsiderando que es necesario mantener el Los Estados miembros informados periódicamente sobre el progreso de las acciones emprendidas.", "de": "...", "bg": "..." }, "labels": [ 1, 13, 47 ] }
from datasets import load_dataset
dataset = load_dataset('multi_eurlex', 'en')
{ "celex_id": "31979D0509", "text": "COUNCIL DECISION of 24 May 1979 on financial aid from the Community for the eradication of African swine fever in Spain (79/509/EEC)\nTHE COUNCIL OF THE EUROPEAN COMMUNITIES\nHaving regard to the Treaty establishing the European Economic Community, and in particular Article 43 thereof,\nHaving regard to the proposal from the Commission (1),\nHaving regard to the opinion of the European Parliament (2),\nWhereas the Community should take all appropriate measures to protect itself against the appearance of African swine fever on its territory;\nWhereas to this end the Community has undertaken, and continues to undertake, action designed to contain outbreaks of this type of disease far from its frontiers by helping countries affected to reinforce their preventive measures ; whereas for this purpose Community subsidies have already been granted to Spain;\nWhereas these measures have unquestionably made an effective contribution to the protection of Community livestock, especially through the creation and maintenance of a buffer zone north of the river Ebro;\nWhereas, however, in the opinion of the Spanish authorities themselves, the measures so far implemented must be reinforced if the fundamental objective of eradicating the disease from the entire country is to be achieved;\nWhereas the Spanish authorities have asked the Community to contribute to the expenses necessary for the efficient implementation of a total eradication programme;\nWhereas a favourable response should be given to this request by granting aid to Spain, having regard to the undertaking given by that country to protect the Community against African swine fever and to eliminate completely this disease by the end of a five-year eradication plan;\nWhereas this eradication plan must include certain measures which guarantee the effectiveness of the action taken, and it must be possible to adapt these measures to developments in the situation by means of a procedure 
establishing close cooperation between the Member States and the Commission;\nWhereas it is necessary to keep the Member States regularly informed as to the progress of the action undertaken,", "labels": [ 1, 13, 47 ] }
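Fields in the multilingual configuration (all_languages):
celex_id: (str) The official ID of the document. The CELEX number is the unique identifier for all publications in both Eur-Lex and CELLAR.
text: (dict[str]) A dictionary with the 23 languages as keys and the full content of the document in each language as values.
labels: (List[int]) The relevant EUROVOC concepts (labels).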
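Fields in a monolingual configuration (for example, en):
celex_id: (str) The official ID of the document. The CELEX number is the unique identifier for all publications in both Eur-Lex and CELLAR.
text: (str) The full content of the document in the selected language.
labels: (List[int]) The relevant EUROVOC concepts (labels).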
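The two field layouts differ only in the type of `text`. As a minimal sketch, a multilingual record can be reduced to the monolingual layout by selecting one language. The helper name `to_monolingual` and the abbreviated sample record are illustrative and not part of the dataset or the `datasets` library. The sketch also assumes that a missing translation shows up as an absent key or an empty value.

```python
from typing import Any, Dict


def to_monolingual(record: Dict[str, Any], lang: str) -> Dict[str, Any]:
    """Return a copy of a multilingual record whose `text` is a single string."""
    text = record["text"].get(lang)
    if not text:
        # Assumption: an absent key or empty value means "no translation available".
        raise KeyError(f"no {lang!r} text for document {record['celex_id']}")
    return {
        "celex_id": record["celex_id"],
        "text": text,
        "labels": list(record["labels"]),
    }


# Hand-written stand-in for a dataset record; the texts are shortened for illustration.
sample = {
    "celex_id": "31979D0509",
    "text": {
        "en": "COUNCIL DECISION of 24 May 1979 ...",
        "es": "DECISIÓN DEL CONSEJO de 24 de mayo de 1979 ...",
    },
    "labels": [1, 13, 47],
}

english_only = to_monolingual(sample, "en")
print(english_only["text"])
```

Raising an error for a missing language, rather than returning an empty string, keeps documents without a translation from silently entering a training set.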
If you want to use the descriptors of the EUROVOC concepts, similar to Chalkidis et al. (2020), download the relevant JSON file here. You can then load it and use it as follows:
import json
from datasets import load_dataset

# Load the English part of the dataset
dataset = load_dataset('multi_eurlex', 'en', split='train')

# Load the (label_id, descriptor) mapping
with open('./eurovoc_descriptors.json') as json_file:
    eurovoc_concepts = json.load(json_file)

# Get feature map info (maps integer label IDs to EUROVOC IDs)
classlabel = dataset.features["labels"].feature

# Retrieve IDs and descriptors from the dataset
for sample in dataset:
    print(f'DOCUMENT: {sample["celex_id"]}')
    # DOCUMENT: 32006D0213
    for label_id in sample['labels']:
        eurovoc_id = classlabel.int2str(label_id)
        print(f'LABEL: id:{label_id}, eurovoc_id: {eurovoc_id}, '
              f'eurovoc_desc:{eurovoc_concepts[eurovoc_id]}')
        # LABEL: id: 1, eurovoc_id: '100160', eurovoc_desc: 'industry'
Language | ISO code | Member Countries where official | EU Speakers [1] | Number of Documents [2] |
English | en | United Kingdom (1973-2020), Ireland (1973), Malta (2004) | 13/51% | 55,000 / 5,000 / 5,000 |
German | de | Germany (1958), Belgium (1958), Luxembourg (1958) | 16/32% | 55,000 / 5,000 / 5,000 |
French | fr | France (1958), Belgium (1958), Luxembourg (1958) | 12/26% | 55,000 / 5,000 / 5,000 |
Italian | it | Italy (1958) | 13/16% | 55,000 / 5,000 / 5,000 |
Spanish | es | Spain (1986) | 8/15% | 52,785 / 5,000 / 5,000 |
Polish | pl | Poland (2004) | 8/9% | 23,197 / 5,000 / 5,000 |
Romanian | ro | Romania (2007) | 5/5% | 15,921 / 5,000 / 5,000 |
Dutch | nl | Netherlands (1958), Belgium (1958) | 4/5% | 55,000 / 5,000 / 5,000 |
Greek | el | Greece (1981), Cyprus (2004) | 3/4% | 55,000 / 5,000 / 5,000 |
Hungarian | hu | Hungary (2004) | 3/3% | 22,664 / 5,000 / 5,000 |
Portuguese | pt | Portugal (1986) | 2/3% | 23,188 / 5,000 / 5,000 |
Czech | cs | Czech Republic (2004) | 2/3% | 23,187 / 5,000 / 5,000 |
Swedish | sv | Sweden (1995) | 2/3% | 42,490 / 5,000 / 5,000 |
Bulgarian | bg | Bulgaria (2007) | 2/2% | 15,986 / 5,000 / 5,000 |
Danish | da | Denmark (1973) | 1/1% | 55,000 / 5,000 / 5,000 |
Finnish | fi | Finland (1995) | 1/1% | 42,497 / 5,000 / 5,000 |
Slovak | sk | Slovakia (2004) | 1/1% | 15,986 / 5,000 / 5,000 |
Lithuanian | lt | Lithuania (2004) | 1/1% | 23,188 / 5,000 / 5,000 |
Croatian | hr | Croatia (2013) | 1/1% | 7,944 / 2,500 / 5,000 |
Slovene | sl | Slovenia (2004) | <1/<1% | 23,184 / 5,000 / 5,000 |
Estonian | et | Estonia (2004) | <1/<1% | 23,126 / 5,000 / 5,000 |
Latvian | lv | Latvia (2004) | <1/<1% | 23,188 / 5,000 / 5,000 |
Maltese | mt | Malta (2004) | <1/<1% | 17,521 / 5,000 / 5,000 |
[1] Native and total EU speakers, as a percentage (%) of the EU population. [2] Training / development / test splits.
The dataset was curated by Chalkidis et al. (2021). The documents have been annotated by the Publications Office of the EU ( https://publications.europa.eu/en ).
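The original data are available at the EUR-LEX portal ( https://eur-lex.europa.eu ) in unprocessed formats (HTML, XML, RDF). The documents were downloaded from the EUR-LEX portal in HTML. The relevant EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the EU ( http://publications.europa.eu/webapi/rdf/sparql ). We stripped the HTML mark-up to provide the documents in plain text format. We inferred the labels for EUROVOC levels 1-3 by backtracking the EUROVOC hierarchy branches, from the originally assigned labels to their ancestors in levels 1-3, respectively.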
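The mark-up stripping step can be illustrated with Python's standard `html.parser` module. This is only a sketch of the general idea, not the curators' actual preprocessing code, which is not described here. The names `strip_html` and `_TextExtractor`, the whitespace normalisation, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def strip_html(html: str) -> str:
    """Return the plain text of an HTML string (illustrative helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up HTML fragment standing in for a downloaded EUR-LEX page.
html_doc = (
    "<html><body><p>COUNCIL DECISION of 24 May 1979</p>\n"
    "<p>THE COUNCIL OF THE <b>EUROPEAN</b> COMMUNITIES</p></body></html>"
)
print(strip_html(html_doc))
```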
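Who are the source language producers? The EU has 24 official languages. When a new member joins the EU, the set of official languages usually expands, unless the new member's language is already included. MultiEURLEX covers 23 languages from seven language families (Germanic, Romance, Slavic, Uralic, Baltic, Semitic, Hellenic). EU laws are published in all official languages except Irish, for resource-related reasons (see https://europa.eu/european-union/about-eu/eu-languages_en ). This wide coverage makes MultiEURLEX a valuable testbed for cross-lingual transfer. All languages use the Latin script, except for Bulgarian (Cyrillic script) and Greek (Greek script). Several other languages are also used in EU member states: the EU has more than 60 additional indigenous regional or minority languages, e.g., Basque, Catalan, Frisian, Saami, and Yiddish, spoken by approximately 40 million people. These additional languages are not considered official by the EU, and EU laws are not translated into them.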
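All the documents of the dataset have been annotated by the Publications Office of the EU ( https://publications.europa.eu/en ) with multiple concepts from EUROVOC ( http://eurovoc.europa.eu/ ). EUROVOC has eight levels of concepts. Each document is assigned one or more concepts (labels). If a document is assigned a concept, the ancestors and descendants of that concept are typically not assigned to the same document. The documents were originally annotated with concepts from levels 3 to 8. We created three alternative sets of labels per document by replacing each assigned concept with its ancestor from level 1, 2, or 3, respectively. Thus, we provide four sets of gold labels per document: one for each of the first three levels of the hierarchy, plus the original sparse label assignment. Levels 4 to 8 cannot be used independently, because many documents have gold concepts from the third level; discarding level 3 would leave many documents mislabeled.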
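The ancestor replacement can be sketched as follows. The concept IDs and the `PARENT` map below form a made-up toy hierarchy, not real EUROVOC data, and the sketch assumes each concept has exactly one parent. The function names are hypothetical, and the code is not the curators' implementation.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: concept -> parent (None marks a level-1 concept).
PARENT: Dict[str, Optional[str]] = {
    "A": None,        # level 1
    "A1": "A",        # level 2
    "A1a": "A1",      # level 3
    "A1a-i": "A1a",   # level 4
    "B": None,        # level 1
    "B1": "B",        # level 2
    "B1a": "B1",      # level 3
}


def path_from_root(concept: str) -> List[str]:
    """Backtrack from a concept to its level-1 ancestor; return the path root-first."""
    path = [concept]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path[::-1]


def labels_at_level(concepts: List[str], level: int) -> List[str]:
    """Replace each concept by its ancestor at `level` (1-based), dropping duplicates.

    A concept that sits above `level` has no ancestor there and is skipped
    (a choice made for this sketch).
    """
    result: List[str] = []
    for concept in concepts:
        path = path_from_root(concept)
        if len(path) >= level and path[level - 1] not in result:
            result.append(path[level - 1])
    return result


# Original (sparse) assignment: one level-4 concept and one level-3 concept.
original = ["A1a-i", "B1a"]
for level in (1, 2, 3):
    print(level, labels_at_level(original, level))
```

Because the original annotations start at level 3, every assigned concept has an ancestor (or is itself a concept) at levels 1, 2, and 3, which is why those three label sets are well defined for every document.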
Who are the annotators? The Publications Office of the EU ( https://publications.europa.eu/en ).
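MultiEURLEX covers 23 languages from seven language families (Germanic, Romance, Slavic, Uralic, Baltic, Semitic, Hellenic). This does not imply that no other languages are spoken in EU countries, although EU laws are not translated into other languages ( https://europa.eu/european-union/about-eu/eu-languages_en ).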
Copyright European Union, 1998-2021
Source: https://eur-lex.europa.eu/content/legal-notice/legal-notice.html Read more: https://eur-lex.europa.eu/content/help/faq/reuse-contents-eurlex.html
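Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana, Dominican Republic. 2021.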
@InProceedings{chalkidis-etal-2021-multieurlex,
  author = {Chalkidis, Ilias and Fergadiotis, Manos and Androutsopoulos, Ion},
  title = {MultiEURLEX -- A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  year = {2021},
  publisher = {Association for Computational Linguistics},
  location = {Punta Cana, Dominican Republic},
  url = {https://arxiv.org/abs/2109.00904}
}
Thanks to @iliaschalkidis for adding this dataset.