Model:

xlm-roberta-large-finetuned-conll03-english

Task:

Token Classification

Libraries:

PyTorch Rust Transformers

Language:

multilingual

Other:

xlm-roberta AutoTrain Compatible

Preprints:

arxiv:1911.02116 arxiv:2008.03415 arxiv:1910.09700

xlm-roberta-large-finetuned-conll03-english

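Model Details

Model Description

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. It is based on Facebook's RoBERTa model, released in 2019. It is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. This model was fine-tuned on the English conll2003 dataset.

Developed by: See the associated paper
Model type: Multilingual language model
Language(s) (NLP) or Countries (images): XLM-RoBERTa is a multilingual model trained on 100 different languages; see the GitHub Repo for the full list; this model is fine-tuned on an English dataset
License: More information needed
Related models: RoBERTa, XLM
- Parent model: XLM-RoBERTa-large
Resources for more information:
- GitHub Repo
- Associated Paper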

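Uses

Direct Use

The model is a language model. It can be used for token classification, a natural language understanding task in which labels are assigned to some of the tokens in a text.

Downstream Use

Potential downstream use cases include Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification docs.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

CONTENT WARNING: Readers should be aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). In the context of tasks relevant to this model, Mishra et al. (2020) explore social biases in NER systems for English and find that existing NER systems are systematically biased: they fail to identify named entities from different demographic groups (though this paper did not look at BERT). For example, using a sample sentence from Mishra et al. (2020):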

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")
[{'end': 2,
  'entity': 'I-PER',
  'index': 1,
  'score': 0.9997861,
  'start': 0,
  'word': '▁Al'},
 {'end': 4,
  'entity': 'I-PER',
  'index': 2,
  'score': 0.9998591,
  'start': 2,
  'word': 'ya'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99995816,
  'start': 10,
  'word': '▁Jasmin'},
 {'end': 17,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999584,
  'start': 16,
  'word': 'e'},
 {'end': 29,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.99998057,
  'start': 23,
  'word': '▁Andrew'}]
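
One way to probe this kind of bias on your own name lists is to fill a template sentence with each name and count how often the model tags the full name as a person. The sketch below is a simplified illustration, not the procedure from Mishra et al. (2020). The function `person_recall`, the template, and the name lists are hypothetical, and `classifier` is assumed to return dictionaries with `entity`, `start`, and `end` keys like the output above.

```python
def person_recall(classifier, template, names):
    """Return the fraction of names fully covered by PER predictions.

    `classifier` is any callable that maps a sentence to a list of dicts
    with "entity", "start", and "end" keys (assumed output format).
    `template` must contain a "{name}" placeholder (hypothetical template).
    """
    if not names:
        return 0.0
    hits = 0
    for name in names:
        sentence = template.format(name=name)
        start = sentence.index(name)
        end = start + len(name)
        # Collect every character position tagged as a person entity.
        covered = set()
        for pred in classifier(sentence):
            if pred["entity"].endswith("PER"):
                covered.update(range(pred["start"], pred["end"]))
        # Count the name only if all of its characters are covered.
        if all(pos in covered for pos in range(start, end)):
            hits += 1
    return hits / len(names)


# Usage with the pipeline built above (name lists are placeholders):
# group_a = ["Andrew", "Emily"]
# group_b = ["Alya", "Jasmine"]
# print(person_recall(classifier, "{name} could pay with cash.", group_a))
# print(person_recall(classifier, "{name} could pay with cash.", group_b))
```

Comparing the returned fractions across groups gives a rough signal only. A careful study would need many templates and far larger, well-sourced name lists.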

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

Training

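For details on the training data and training procedure, see the parent model (XLM-RoBERTa-large), the conll2003 dataset, and the associated paper.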

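Evaluation

See the associated paper for evaluation details.

Environmental Impact

Carbon emissions can be estimated using the method presented in the Machine Learning Impact calculator.

Hardware type: 500 32GB Nvidia V100 GPUs (from the associated paper)
Hours used: More information needed
Cloud provider: More information needed
Compute region: More information needed
Carbon emitted: More information needed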
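
Calculators of this kind generally multiply the energy drawn by the hardware by the carbon intensity of the local power grid. The following is a minimal sketch of that arithmetic, not the calculator's actual implementation. The function name `estimate_co2_kg` is hypothetical, and every input except the GPU count is a placeholder, because this card does not report the hours used, the region, or the per-GPU power draw.

```python
def estimate_co2_kg(num_gpus, gpu_power_watts, hours,
                    carbon_intensity_kg_per_kwh, pue=1.0):
    """Rough CO2-equivalent estimate in kilograms.

    energy (kWh) = GPUs * watts per GPU / 1000 * hours * PUE
    emissions    = energy * grid carbon intensity (kg CO2eq per kWh)
    `pue` (power usage effectiveness) scales for datacenter overhead.
    """
    energy_kwh = num_gpus * gpu_power_watts / 1000.0 * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs for illustration only, NOT measurements:
# 300 W per GPU, 1 hour, and 0.5 kg CO2eq/kWh are all assumptions.
example_kg = estimate_co2_kg(
    num_gpus=500,                     # from the card (associated paper)
    gpu_power_watts=300.0,            # assumed per-GPU draw
    hours=1.0,                        # unknown in the card; placeholder
    carbon_intensity_kg_per_kwh=0.5,  # unknown region; placeholder
)
print(f"{example_kg:.1f} kg CO2eq per placeholder hour")
```

Replacing the placeholders with the real training duration and regional carbon intensity, once known, would fill in the fields marked "More information needed" above.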

Technical Specifications

See the associated paper for further details.

Citation

BibTeX:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

APA:

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

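Model Card Authors

This model card was written by the team at Hugging Face.

How to Get Started with the Model

Use the code below to get started with the model. You can use this model directly within a pipeline for NER.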

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Hello I'm Omar and I live in Zürich.")

[{'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999175,
  'start': 10,
  'word': '▁Omar'},
 {'end': 35,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9999906,
  'start': 29,
  'word': '▁Zürich'}]
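
The pipeline reports one prediction per sub-word piece, so a single name can appear as several entries (for example `▁Al` and `ya`). The sketch below shows one simple way to merge such pieces into whole entities using the character offsets. The helper `merge_subword_entities` is hypothetical, and it only joins pieces that touch directly, so multi-word entities separated by spaces stay split. The sample predictions are copied from the bias example earlier in this card.

```python
def merge_subword_entities(text, predictions):
    """Merge directly adjacent sub-word predictions that share a label."""
    merged = []
    for pred in sorted(predictions, key=lambda p: p["start"]):
        label = pred["entity"].split("-")[-1]  # "I-PER" -> "PER"
        last = merged[-1] if merged else None
        if last is not None and last["label"] == label and pred["start"] == last["end"]:
            # This piece continues the previous span with no gap.
            last["end"] = pred["end"]
            last["score"] = min(last["score"], pred["score"])
        else:
            merged.append({"label": label, "start": pred["start"],
                           "end": pred["end"], "score": pred["score"]})
    # Recover the surface text from the original string via the offsets.
    for span in merged:
        span["text"] = text[span["start"]:span["end"]]
    return merged


text = "Alya told Jasmine that Andrew could pay with cash.."
predictions = [
    {"entity": "I-PER", "index": 1, "score": 0.9997861, "start": 0, "end": 2, "word": "▁Al"},
    {"entity": "I-PER", "index": 2, "score": 0.9998591, "start": 2, "end": 4, "word": "ya"},
    {"entity": "I-PER", "index": 4, "score": 0.99995816, "start": 10, "end": 16, "word": "▁Jasmin"},
    {"entity": "I-PER", "index": 5, "score": 0.9999584, "start": 16, "end": 17, "word": "e"},
    {"entity": "I-PER", "index": 7, "score": 0.99998057, "start": 23, "end": 29, "word": "▁Andrew"},
]
entities = merge_subword_entities(text, predictions)
for entity in entities:
    print(entity["label"], entity["text"])
```

The Transformers pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs built-in grouping; check the documentation for your installed version before relying on it.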

Dataset size:

4.18 GB
