Datasets:

qanastek/ANTILLES

Languages:

fr

Size:

100K<n<1M

Language creators:

found

Source datasets:

original

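ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus

Dataset Summary

ANTILLES is a part-of-speech tagging corpus based on UD_French-GSD, which was originally created in 2015 and is itself based on the Universal Dependency Treebank v2.0.

Originally, the corpus consisted of 400,399 words (16,341 sentences) and had 17 different classes. After applying our tag augmentation script transform.py, we obtain 60 different classes that carry linguistic information such as gender, number, mood, person, tense or verb form, all of which is given in the different CoNLL-U fields of the original corpus.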
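The enrichment idea can be illustrated with a minimal sketch that combines a coarse tag with the CoNLL-U FEATS column. This is not the actual transform.py: the function names and the prefix tables below are assumptions, inferred only from the tags visible in this card (for example NMS, ADJFS and VPPMS), and they cover just three coarse classes.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Gender=Masc|Number=Sing'."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical lookup tables, inferred from the tag table in this card.
PREFIX = {"NOUN": "N", "ADJ": "ADJ"}
GENDER = {"Masc": "M", "Fem": "F"}
NUMBER = {"Sing": "S", "Plur": "P"}


def enrich(upos: str, feats: str) -> str:
    """Return an enriched tag, or the coarse tag when features are missing."""
    f = parse_feats(feats)
    gender = GENDER.get(f.get("Gender"))
    number = NUMBER.get(f.get("Number"))
    if gender is None or number is None:
        return upos  # not enough information: keep the coarse class
    if upos == "VERB" and f.get("VerbForm") == "Part" and f.get("Tense") == "Past":
        return "VPP" + gender + number  # past participle
    if upos in PREFIX:
        return PREFIX[upos] + gender + number
    return upos


print(enrich("NOUN", "Gender=Masc|Number=Sing"))  # NMS
print(enrich("ADJ", "Gender=Fem|Number=Sing"))    # ADJFS
```

Falling back to the coarse tag when gender or number is absent mirrors the fact that classes such as NOUN, ADJ and VERB still appear in the tag set alongside their enriched variants.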

Our tags follow the level of detail of the LIA_TAGG statistical POS tagger, written by Frédéric Béchet in 2001.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

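Supported Tasks and Leaderboards

part-of-speech-tagging: The dataset can be used to train a model for part-of-speech tagging. Performance is measured by the F1 score (higher is better). A trained Flair sequence-tagging model that labels tokens from Wikipedia passages achieves a micro-averaged F1 score of 0.952.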
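As a rough illustration of the metric, the sketch below computes a micro-averaged F1 score over token-level tags. It assumes exactly one gold tag and one predicted tag per token, in which case every error counts as one false positive and one false negative and micro F1 equals accuracy. The helper name `micro_f1` is hypothetical and this is not necessarily how Flair computes its reported score.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label token tagging."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    errors = len(gold) - tp
    fp = fn = errors  # a wrong tag is a FP for the predicted class and a FN for the gold class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


gold = ["NMS", "PREP", "ADV", "ADJFS", "NFS"]
pred = ["NMS", "PREP", "ADV", "ADJMS", "NFS"]
print(micro_f1(gold, pred))  # 0.8 (4 of 5 tokens tagged correctly)
```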

Languages

The text in the dataset is in French, as written by Wikipedia users. The associated BCP-47 code is fr.

Load the dataset

HuggingFace

from datasets import load_dataset
dataset = load_dataset("qanastek/ANTILLES")
print(dataset)

FlairNLP

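from flair.data import Corpus
from flair.datasets import UniversalDependenciesCorpus

# Expects train.conllu, dev.conllu and test.conllu inside the local ANTILLES folder
corpus: Corpus = UniversalDependenciesCorpus(
    data_folder='ANTILLES',
    train_file="train.conllu",
    test_file="test.conllu",
    dev_file="dev.conllu"
)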

Load the model

Flair (model)

from flair.models import SequenceTagger
tagger = SequenceTagger.load("qanastek/pos-french")

HuggingFace Spaces

Dataset Structure

Data Instances

# sent_id = fr-ud-dev_00005
# text = Travail de trés grande qualité exécuté par un imprimeur artisan passionné.
1	Travail	travail	NMS	_	Gender=Masc|Number=Sing	0	root	_	wordform=travail
2	de	de	PREP	_	_	5	case	_	_
3	trés	trés	ADV	_	_	4	advmod	_	_
4	grande	grand	ADJFS	_	Gender=Fem|Number=Sing	5	amod	_	_
5	qualité	qualité	NFS	_	Gender=Fem|Number=Sing	1	nmod	_	_
6	exécuté	exécuter	VPPMS	_	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	1	acl	_	_
7	par	par	PREP	_	_	9	case	_	_
8	un	un	DINTMS	_	Definite=Ind|Gender=Masc|Number=Sing|PronType=Art	9	det	_	_
9	imprimeur	imprimeur	NMS	_	Gender=Masc|Number=Sing	6	obl:agent	_	_
10	artisan	artisan	NMS	_	Gender=Masc|Number=Sing	9	nmod	_	_
11	passionné	passionné	ADJMS	_	Gender=Masc|Number=Sing	9	amod	_	SpaceAfter=No
12	.	.	YPFOR	_	_	1	punct	_	_
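Each token line in the instance above has the ten tab-separated columns of the CoNLL-U format, with the enriched tag stored in the fourth column. The sketch below splits one such line into named fields using only the standard library; the `Token` class and `parse_token_line` helper are illustrative names, not part of the dataset tooling.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    id: str              # kept as text: CoNLL-U also allows ranges such as "1-2"
    form: str
    lemma: str
    upos: str            # holds the enriched tag in ANTILLES
    xpos: str
    feats: dict
    head: Optional[int]
    deprel: str
    deps: str
    misc: str


def parse_token_line(line: str) -> Token:
    """Split one CoNLL-U token line into its ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    feats = {} if cols[5] == "_" else dict(p.split("=", 1) for p in cols[5].split("|"))
    head = int(cols[6]) if cols[6].isdigit() else None
    return Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                 feats, head, cols[7], cols[8], cols[9])


sample = ("6\texécuté\texécuter\tVPPMS\t_\t"
          "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part\t1\tacl\t_\t_")
token = parse_token_line(sample)
print(token.form, token.upos, token.feats["Gender"], token.head)
```

Lines starting with `#` (such as `# sent_id` and `# text`) are sentence-level comments and would need to be skipped before calling the helper.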

Data Fields

| Abbreviation | Description | Examples | # tokens |
|---|---|---|---|
| PREP | Preposition | de | 63 738 |
| AUX | Auxiliary Verb | est | 12 886 |
| ADV | Adverb | toujours | 14 969 |
| COSUB | Subordinating Conjunction | que | 3 007 |
| COCO | Coordinating Conjunction | et | 10 102 |
| PART | Demonstrative Particle | -t | 93 |
| PRON | Pronoun | qui ce quoi | 667 |
| PDEMMS | Singular Masculine Demonstrative Pronoun | ce | 1 950 |
| PDEMMP | Plural Masculine Demonstrative Pronoun | ceux | 108 |
| PDEMFS | Singular Feminine Demonstrative Pronoun | cette | 1 004 |
| PDEMFP | Plural Feminine Demonstrative Pronoun | celles | 53 |
| PINDMS | Singular Masculine Indefinite Pronoun | tout | 961 |
| PINDMP | Plural Masculine Indefinite Pronoun | autres | 89 |
| PINDFS | Singular Feminine Indefinite Pronoun | chacune | 136 |
| PINDFP | Plural Feminine Indefinite Pronoun | certaines | 31 |
| PROPN | Proper Noun | houston | 22 135 |
| XFAMIL | Last Name | levy | 6 449 |
| NUM | Numerical Adjectives | trentaine vingtaine | 67 |
| DINTMS | Masculine Numerical Adjectives | un | 4 254 |
| DINTFS | Feminine Numerical Adjectives | une | 3 543 |
| PPOBJMS | Singular Masculine Pronoun Complements of Objects | le lui | 1 425 |
| PPOBJMP | Plural Masculine Pronoun Complements of Objects | eux y | 212 |
| PPOBJFS | Singular Feminine Pronoun Complements of Objects | moi la | 358 |
| PPOBJFP | Plural Feminine Pronoun Complements of Objects | en y | 70 |
| PPER1S | Personal Pronoun First Person Singular | je | 571 |
| PPER2S | Personal Pronoun Second Person Singular | tu | 19 |
| PPER3MS | Personal Pronoun Third Person Masculine Singular | il | 3 938 |
| PPER3MP | Personal Pronoun Third Person Masculine Plural | ils | 513 |
| PPER3FS | Personal Pronoun Third Person Feminine Singular | elle | 992 |
| PPER3FP | Personal Pronoun Third Person Feminine Plural | elles | 121 |
| PREFS | Reflexive Pronouns First Person Singular | me m' | 120 |
| PREF | Reflexive Pronouns Third Person Singular | se s' | 2 337 |
| PREFP | Reflexive Pronouns First / Second Person Plural | nous vous | 686 |
| VERB | Verb | obtient | 21 131 |
| VPPMS | Singular Masculine Past Participle Verb | formulé | 6 275 |
| VPPMP | Plural Masculine Past Participle Verb | classés | 1 352 |
| VPPFS | Singular Feminine Past Participle Verb | appelée | 2 434 |
| VPPFP | Plural Feminine Past Participle Verb | sanctionnées | 813 |
| VPPRE | Present Participle | étant | 2 |
| DET | Determiner | les l' | 25 206 |
| DETMS | Singular Masculine Determiner | les | 15 444 |
| DETFS | Singular Feminine Determiner | la | 10 978 |
| ADJ | Adjective | capable sérieux | 1 075 |
| ADJMS | Singular Masculine Adjective | grand important | 8 338 |
| ADJMP | Plural Masculine Adjective | grands petits | 3 274 |
| ADJFS | Singular Feminine Adjective | française petite | 8 004 |
| ADJFP | Plural Feminine Adjective | légères petites | 3 041 |
| NOUN | Noun | temps | 1 389 |
| NMS | Singular Masculine Noun | drapeau | 29 698 |
| NMP | Plural Masculine Noun | journalistes | 10 882 |
| NFS | Singular Feminine Noun | tête | 25 414 |
| NFP | Plural Feminine Noun | ondes | 7 448 |
| PREL | Relative Pronoun | qui dont | 2 976 |
| PRELMS | Singular Masculine Relative Pronoun | lequel | 94 |
| PRELMP | Plural Masculine Relative Pronoun | lesquels | 29 |
| PRELFS | Singular Feminine Relative Pronoun | laquelle | 70 |
| PRELFP | Plural Feminine Relative Pronoun | lesquelles | 25 |
| PINTFS | Singular Feminine Interrogative Pronoun | laquelle | 3 |
| INTJ | Interjection | merci bref | 75 |
| CHIF | Numbers | 1979 10 | 10 417 |
| SYM | Symbol | é % | 705 |
| YPFOR | Endpoint | . | 15 088 |
| PUNCT | Punctuation | : , | 28 918 |
| MOTINC | Unknown words | Technology Lady | 2 022 |
| X | Typos & others | sfeir 3D statu | 175 |

Data Splits

| | Train | Dev | Test |
|---|---|---|---|
| # Docs | 14 449 | 1 476 | 416 |
| Avg # Tokens / Doc | 24.54 | 24.19 | 24.08 |
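The split figures can be cross-checked against the corpus totals given in the summary (16,341 sentences and 400,399 words). The short sketch below does that arithmetic, assuming that "# Docs" counts sentences and that the averages are rounded to two decimals, so the token estimate is only approximate.

```python
# (number of docs, average tokens per doc) for each split, copied from the table
splits = {
    "train": (14449, 24.54),
    "dev": (1476, 24.19),
    "test": (416, 24.08),
}

total_docs = sum(n for n, _ in splits.values())
est_tokens = sum(n * avg for n, avg in splits.values())

print(total_docs)         # 16341, matching the sentence count in the summary
print(round(est_tokens))  # roughly 400300, close to the 400,399 words reported
```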

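Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

The corpus does not contain personal or sensitive information, since it was built from the content of Wikipedia articles.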

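Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

The nature of the corpus introduces various biases. Street names, for instance, are tied to a period in time and can therefore introduce named entities such as author names or event names. For example, street names such as Rue Victor-Hugo or Rue Pasteur did not exist in France before the 20th century.

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

ANTILLES: Labrak Yanis, Dufour Richard

UD_French-GSD: de Marneffe Marie-Catherine, Guillaume Bruno, McDonald Ryan, Suhr Alane, Nivre Joakim, Grioni Matias, Dickerson Carly, Perrier Guy

Universal Dependency: Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee

Licensing Information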

For the following languages

  German, Spanish, French, Indonesian, Italian, Japanese, Korean and Brazilian
  Portuguese

we will distinguish between two portions of the data.

1. The underlying text for sentences that were annotated. This data Google
   asserts no ownership over and no copyright over. Some or all of these
   sentences may be copyrighted in some jurisdictions.  Where copyrighted,
   Google collected these sentences under exceptions to copyright or implied
   license rights.  GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY
   WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED.

2. The annotations -- part-of-speech tags and dependency annotations. These are
   made available under a CC BY-SA 4.0. GOOGLE MAKES
   THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER
   EXPRESS OR IMPLIED. See attached LICENSE file for the text of CC BY-NC-SA.

Portions of the German data were sampled from the CoNLL 2006 Tiger Treebank
data. Hans Uszkoreit graciously gave permission to use the underlying
sentences in this data as part of this release.

Any use of the data should reference the above plus:

  Universal Dependency Annotation for Multilingual Parsing
  Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg,
  Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang,
  Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
  Proceedings of ACL 2013

Citation Information

Please cite the following papers when using this dataset.

ANTILLES extended corpus:

@inproceedings{labrak:hal-03696042,
  TITLE = {{ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus}},
  AUTHOR = {Labrak, Yanis and Dufour, Richard},
  URL = {https://hal.archives-ouvertes.fr/hal-03696042},
  BOOKTITLE = {{25th International Conference on Text, Speech and Dialogue (TSD)}},
  ADDRESS = {Brno, Czech Republic},
  PUBLISHER = {{Springer}},
  YEAR = {2022},
  MONTH = Sep,
  KEYWORDS = {Part-of-speech corpus ; POS tagging ; Open tools ; Word embeddings ; Bi-LSTM ; CRF ; Transformers},
  PDF = {https://hal.archives-ouvertes.fr/hal-03696042/file/ANTILLES_A_freNch_linguisTIcaLLy_Enriched_part_of_Speech_corpus.pdf},
  HAL_ID = {hal-03696042},
  HAL_VERSION = {v1},
}

UD_French-GSD corpus:

@misc{
    universaldependencies,
    title={UniversalDependencies/UD_French-GSD},
    url={https://github.com/UniversalDependencies/UD_French-GSD}, journal={GitHub},
    author={UniversalDependencies}
}

Universal Dependency Annotation for Multilingual Parsing:

@inproceedings{mcdonald-etal-2013-universal,
    title = "{U}niversal {D}ependency Annotation for Multilingual Parsing",
    author = {McDonald, Ryan  and
      Nivre, Joakim  and
      Quirmbach-Brundage, Yvonne  and
      Goldberg, Yoav  and
      Das, Dipanjan  and
      Ganchev, Kuzman  and
      Hall, Keith  and
      Petrov, Slav  and
      Zhang, Hao  and
      T{\"a}ckstr{\"o}m, Oscar  and
      Bedini, Claudia  and
      Bertomeu Castell{\'o}, N{\'u}ria  and
      Lee, Jungmee},
    booktitle = "Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = aug,
    year = "2013",
    address = "Sofia, Bulgaria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P13-2017",
    pages = "92--97",
}

LIA TAGG:

@techreport{LIA_TAGG,
  author = {Frédéric Béchet},
  title = {LIA_TAGG: a statistical POS tagger + syntactic bracketer},
  institution = {Aix-Marseille University \& CNRS},
  year = {2001}
}