Dataset:
qanastek/ANTILLES
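ANTILLES is a part-of-speech tagging corpus based on UD_French-GSD, which was originally created in 2015 and is itself based on the universal dependency treebank v2.0.
Originally, the corpus consisted of 400,399 words (16,341 sentences) annotated with 17 different classes. After applying our tag augmentation script transform.py, we obtain 60 different classes that add semantic information such as gender, number, mood, person, tense or verb form, which the original corpus provides in the different CoNLL-U fields.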
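
The idea of folding CoNLL-U features into a richer tag can be sketched as follows. This is only an illustration under stated assumptions: `parse_feats`, `enrich_tag` and the `PREFIX`, `GENDER` and `NUMBER` mappings are hypothetical names, they cover only nouns and adjectives, and they are not taken from transform.py.

```python
def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'Gender=Masc|Number=Sing'."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical mappings for illustration only (assumption, not transform.py).
PREFIX = {"NOUN": "N", "ADJ": "ADJ"}
GENDER = {"Masc": "M", "Fem": "F"}
NUMBER = {"Sing": "S", "Plur": "P"}


def enrich_tag(upos: str, feats: str) -> str:
    """Append gender and number to the coarse tag when both are known."""
    features = parse_feats(feats)
    gender = GENDER.get(features.get("Gender", ""))
    number = NUMBER.get(features.get("Number", ""))
    if upos in PREFIX and gender and number:
        return PREFIX[upos] + gender + number
    # Fall back to the coarse tag when the features are missing.
    return upos


print(enrich_tag("NOUN", "Gender=Masc|Number=Sing"))  # NMS
print(enrich_tag("ADJ", "Gender=Fem|Number=Plur"))    # ADJFP
print(enrich_tag("NOUN", "_"))                        # NOUN
```

The fallback mirrors the fact that the tag set keeps coarse classes such as NOUN and ADJ alongside the enriched ones.
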
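We based our tags on the level of detail provided by the LIA_TAGG statistical POS tagger, written by Frédéric Béchet in 2001.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Part-of-speech tagging: the dataset can be used to train a model for part-of-speech tagging. Model performance is measured with the F1 score. A Flair Sequence-To-Sequence model trained to tag tokens from Wikipedia passages achieves a micro-averaged F1 score of 0.952.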
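
To make the metric concrete, here is a minimal sketch of a micro-averaged F1 over token-level tags. It assumes exactly one gold tag and one predicted tag per token; `micro_f1` is a hypothetical helper, not the evaluation code used to produce the reported score. Under that assumption every error counts once as a false positive and once as a false negative, so micro-F1 equals token accuracy.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label token tagging."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p)
    # Each mismatch is a false positive for the predicted tag
    # and a false negative for the gold tag.
    false_pos = len(pred) - true_pos
    false_neg = len(gold) - true_pos
    precision = true_pos / (true_pos + false_pos) if pred else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["NMS", "PREP", "ADV", "ADJFS", "NFS"]
pred = ["NMS", "PREP", "ADV", "ADJMS", "NFS"]  # one gender error
print(round(micro_f1(gold, pred), 3))  # 0.8
```
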
The text in the dataset is in French, as spoken by Wikipedia users. The associated BCP-47 code is fr.
Load the dataset with the Hugging Face `datasets` library:

    from datasets import load_dataset

    dataset = load_dataset("qanastek/ANTILLES")
    print(dataset)
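Load the corpus with Flair:

    from flair.data import Corpus
    from flair.datasets import UniversalDependenciesCorpus

    # Paths are relative to the folder that holds the three CoNLL-U files.
    corpus: Corpus = UniversalDependenciesCorpus(
        data_folder="ANTILLES",
        train_file="train.conllu",
        test_file="test.conllu",
        dev_file="dev.conllu",
    )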
Load the trained tagger:

    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("qanastek/pos-french")
A sample sentence in CoNLL-U format:

    # sent_id = fr-ud-dev_00005
    # text = Travail de trés grande qualité exécuté par un imprimeur artisan passionné.
    1	Travail	travail	NMS	_	Gender=Masc|Number=Sing	0	root	_	wordform=travail
    2	de	de	PREP	_	_	5	case	_	_
    3	trés	trés	ADV	_	_	4	advmod	_	_
    4	grande	grand	ADJFS	_	Gender=Fem|Number=Sing	5	amod	_	_
    5	qualité	qualité	NFS	_	Gender=Fem|Number=Sing	1	nmod	_	_
    6	exécuté	exécuter	VPPMS	_	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	1	acl	_	_
    7	par	par	PREP	_	_	9	case	_	_
    8	un	un	DINTMS	_	Definite=Ind|Gender=Masc|Number=Sing|PronType=Art	9	det	_	_
    9	imprimeur	imprimeur	NMS	_	Gender=Masc|Number=Sing	6	obl:agent	_	_
    10	artisan	artisan	NMS	_	Gender=Masc|Number=Sing	9	nmod	_	_
    11	passionné	passionné	ADJMS	_	Gender=Masc|Number=Sing	9	amod	_	SpaceAfter=No
    12	.	.	YPFOR	_	_	1	punct	_	_
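
A sentence in this layout can be read with a few lines of plain Python. The sketch below assumes the standard ten tab-separated CoNLL-U columns and that, as in the sample, the enriched tag sits in the fourth column; `Token` and `parse_conllu_sentence` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """The ten CoNLL-U columns, kept as strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_conllu_sentence(block: str) -> List[Token]:
    """Parse one sentence block, skipping comment and blank lines."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        tokens.append(Token(*fields))
    return tokens


sample = (
    "# sent_id = fr-ud-dev_00005\n"
    "1\tTravail\ttravail\tNMS\t_\tGender=Masc|Number=Sing\t0\troot\t_\twordform=travail\n"
    "2\tde\tde\tPREP\t_\t_\t5\tcase\t_\t_\n"
)
for token in parse_conllu_sentence(sample):
    print(token.form, token.upos)
```
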
Abbreviation | Description | Examples | # tokens |
---|---|---|---|
PREP | Preposition | de | 63 738 |
AUX | Auxiliary Verb | est | 12 886 |
ADV | Adverb | toujours | 14 969 |
COSUB | Subordinating conjunction | que | 3 007 |
COCO | Coordinating Conjunction | et | 10 102 |
PART | Demonstrative particle | -t | 93 |
PRON | Pronoun | qui ce quoi | 667 |
PDEMMS | Singular Masculine Demonstrative Pronoun | ce | 1 950 |
PDEMMP | Plural Masculine Demonstrative Pronoun | ceux | 108 |
PDEMFS | Singular Feminine Demonstrative Pronoun | cette | 1 004 |
PDEMFP | Plural Feminine Demonstrative Pronoun | celles | 53 |
PINDMS | Singular Masculine Indefinite Pronoun | tout | 961 |
PINDMP | Plural Masculine Indefinite Pronoun | autres | 89 |
PINDFS | Singular Feminine Indefinite Pronoun | chacune | 136 |
PINDFP | Plural Feminine Indefinite Pronoun | certaines | 31 |
PROPN | Proper noun | houston | 22 135 |
XFAMIL | Last name | levy | 6 449 |
NUM | Numerical Adjectives | trentaine vingtaine | 67 |
DINTMS | Masculine Numerical Adjectives | un | 4 254 |
DINTFS | Feminine Numerical Adjectives | une | 3 543 |
PPOBJMS | Singular Masculine Pronoun complements of objects | le lui | 1 425 |
PPOBJMP | Plural Masculine Pronoun complements of objects | eux y | 212 |
PPOBJFS | Singular Feminine Pronoun complements of objects | moi la | 358 |
PPOBJFP | Plural Feminine Pronoun complements of objects | en y | 70 |
PPER1S | Personal Pronoun First Person Singular | je | 571 |
PPER2S | Personal Pronoun Second Person Singular | tu | 19 |
PPER3MS | Personal Pronoun Third Person Masculine Singular | il | 3 938 |
PPER3MP | Personal Pronoun Third Person Masculine Plural | ils | 513 |
PPER3FS | Personal Pronoun Third Person Feminine Singular | elle | 992 |
PPER3FP | Personal Pronoun Third Person Feminine Plural | elles | 121 |
PREFS | Reflexive Pronouns First Person of Singular | me m' | 120 |
PREF | Reflexive Pronouns Third Person of Singular | se s' | 2 337 |
PREFP | Reflexive Pronouns First / Second Person of Plural | nous vous | 686 |
VERB | Verb | obtient | 21 131 |
VPPMS | Singular Masculine Participle Past Verb | formulé | 6 275 |
VPPMP | Plural Masculine Participle Past Verb | classés | 1 352 |
VPPFS | Singular Feminine Participle Past Verb | appelée | 2 434 |
VPPFP | Plural Feminine Participle Past Verb | sanctionnées | 813 |
VPPRE | Present participle | étant | 2 |
DET | Determiner | les l' | 25 206 |
DETMS | Singular Masculine Determiner | les | 15 444 |
DETFS | Singular Feminine Determiner | la | 10 978 |
ADJ | Adjective | capable sérieux | 1 075 |
ADJMS | Singular Masculine Adjective | grand important | 8 338 |
ADJMP | Plural Masculine Adjective | grands petits | 3 274 |
ADJFS | Singular Feminine Adjective | française petite | 8 004 |
ADJFP | Plural Feminine Adjective | légères petites | 3 041 |
NOUN | Noun | temps | 1 389 |
NMS | Singular Masculine Noun | drapeau | 29 698 |
NMP | Plural Masculine Noun | journalistes | 10 882 |
NFS | Singular Feminine Noun | tête | 25 414 |
NFP | Plural Feminine Noun | ondes | 7 448 |
PREL | Relative Pronoun | qui dont | 2 976 |
PRELMS | Singular Masculine Relative Pronoun | lequel | 94 |
PRELMP | Plural Masculine Relative Pronoun | lesquels | 29 |
PRELFS | Singular Feminine Relative Pronoun | laquelle | 70 |
PRELFP | Plural Feminine Relative Pronoun | lesquelles | 25 |
PINTFS | Singular Feminine Interrogative Pronoun | laquelle | 3 |
INTJ | Interjection | merci bref | 75 |
CHIF | Numbers | 1979 10 | 10 417 |
SYM | Symbol | é % | 705 |
YPFOR | Full stop (end point) | . | 15 088 |
PUNCT | Punctuation | : , | 28 918 |
MOTINC | Unknown words | Technology Lady | 2 022 |
X | Typos & others | sfeir 3D statu | 175 |
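
The token counts in the table use a space as thousands separator, so they need a small cleanup before any arithmetic. The sketch below shows one way to parse rows in this layout and rank tags by frequency; `parse_row` is a hypothetical helper, and the three rows are copied from the table.

```python
def parse_row(row: str):
    """Split 'TAG | Description | examples | 12 345 |' into typed fields."""
    cells = [cell.strip() for cell in row.strip().rstrip("|").split("|")]
    tag, description, examples, count = cells
    # Remove the space used as thousands separator before converting.
    return tag, description, examples.split(), int("".join(count.split()))


rows = [
    "VPPRE | Present participle | étant | 2 |",
    "PREP | Preposition | de | 63 738 |",
    "PRON | Pronoun | qui ce quoi | 667 |",
]
parsed = sorted((parse_row(row) for row in rows), key=lambda r: r[3], reverse=True)
for tag, _, examples, count in parsed:
    print(tag, count, examples)
```
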
Statistic | Train | Dev | Test |
---|---|---|---|
# Docs | 14 449 | 1 476 | 416 |
Avg # Tokens / Doc | 24.54 | 24.19 | 24.08 |
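
As a quick consistency check, the split sizes can be compared with the corpus totals stated earlier in this card (16,341 sentences, 400,399 words). The sketch assumes that each "doc" in the table is one sentence; the token total is only approximate because the per-split averages are rounded to two decimals.

```python
# (number of docs, average tokens per doc) for each split, from the table.
splits = {
    "train": (14_449, 24.54),
    "dev": (1_476, 24.19),
    "test": (416, 24.08),
}

total_docs = sum(docs for docs, _ in splits.values())
approx_tokens = sum(docs * avg for docs, avg in splits.values())

print(total_docs)            # 16341, matching the stated sentence count
print(round(approx_tokens))  # 400300, close to the stated 400,399 words
```
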
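[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
The corpus contains no personal or sensitive information, since it was built from the content of Wikipedia articles.
[More Information Needed]
The nature of the corpus introduces various biases. Street names, for example, are tied to a period in time and can therefore introduce named entities such as author or event names: street names such as Rue Victor-Hugo or Rue Pasteur did not exist in France before the 20th century.
[More Information Needed]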
ANTILLES: Labrak Yanis, Dufour Richard
UD_FRENCH-GSD: de Marneffe Marie-Catherine, Guillaume Bruno, McDonald Ryan, Suhr Alane, Nivre Joakim, Grioni Matias, Dickerson Carly, Perrier Guy
Universal Dependency: Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
For the following languages German, Spanish, French, Indonesian, Italian, Japanese, Korean and Brazilian Portuguese we will distinguish between two portions of the data.

1. The underlying text for sentences that were annotated. This data Google asserts no ownership over and no copyright over. Some or all of these sentences may be copyrighted in some jurisdictions. Where copyrighted, Google collected these sentences under exceptions to copyright or implied license rights. GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED.
2. The annotations -- part-of-speech tags and dependency annotations. These are made available under a CC BY-SA 4.0. GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED.

See attached LICENSE file for the text of CC BY-NC-SA.

Portions of the German data were sampled from the CoNLL 2006 Tiger Treebank data. Hans Uszkoreit graciously gave permission to use the underlying sentences in this data as part of this release.

Any use of the data should reference the above plus:

Universal Dependency Annotation for Multilingual Parsing
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
Proceedings of ACL 2013
Please cite the following papers when using this dataset.
ANTILLES extended corpus:

    @inproceedings{labrak:hal-03696042,
      TITLE = {{ANTILLES: An Open French Linguistically Enriched Part-of-Speech Corpus}},
      AUTHOR = {Labrak, Yanis and Dufour, Richard},
      URL = {https://hal.archives-ouvertes.fr/hal-03696042},
      BOOKTITLE = {{25th International Conference on Text, Speech and Dialogue (TSD)}},
      ADDRESS = {Brno, Czech Republic},
      PUBLISHER = {{Springer}},
      YEAR = {2022},
      MONTH = Sep,
      KEYWORDS = {Part-of-speech corpus ; POS tagging ; Open tools ; Word embeddings ; Bi-LSTM ; CRF ; Transformers},
      PDF = {https://hal.archives-ouvertes.fr/hal-03696042/file/ANTILLES_A_freNch_linguisTIcaLLy_Enriched_part_of_Speech_corpus.pdf},
      HAL_ID = {hal-03696042},
      HAL_VERSION = {v1},
    }
UD_French-GSD corpus:

    @misc{universaldependencies,
      title = {UniversalDependencies/UD_French-GSD},
      url = {https://github.com/UniversalDependencies/UD_French-GSD},
      journal = {GitHub},
      author = {UniversalDependencies}
    }
Universal Dependency Annotation for Multilingual Parsing:

    @inproceedings{mcdonald-etal-2013-universal,
      title = "{U}niversal {D}ependency Annotation for Multilingual Parsing",
      author = {McDonald, Ryan and Nivre, Joakim and Quirmbach-Brundage, Yvonne and Goldberg, Yoav and Das, Dipanjan and Ganchev, Kuzman and Hall, Keith and Petrov, Slav and Zhang, Hao and T{\"a}ckstr{\"o}m, Oscar and Bedini, Claudia and Bertomeu Castell{\'o}, N{\'u}ria and Lee, Jungmee},
      booktitle = "Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
      month = aug,
      year = "2013",
      address = "Sofia, Bulgaria",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/P13-2017",
      pages = "92--97",
    }
LIA TAGG:

    @techreport{LIA_TAGG,
      author = {Frédéric Béchet},
      title = {LIA_TAGG: a statistical POS tagger + syntactic bracketer},
      institution = {Aix-Marseille University \& CNRS},
      year = {2001}
    }