数据集:

tner/wikiann

英文

"tner/wikiann" 数据集卡片

数据集概述

WikiAnn NER 数据集格式化为 TNER 项目的一部分。

  • 实体类型:LOC,ORG,PER

数据集结构

数据实例

日语训练集示例如下。

{
  'tokens': ['#', '#', 'ユ', 'リ', 'ウ', 'ス', '・', 'ベ', 'ー', 'リ', 'ッ', 'ク', '#', '1', '9','9','9'],
  'tags': [6, 6, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6]
}

标签ID

标签到ID的映射字典可以在 here 处找到。

{
    "B-LOC": 0,
    "B-ORG": 1,
    "B-PER": 2,
    "I-LOC": 3,
    "I-ORG": 4,
    "I-PER": 5,
    "O": 6
}

数据拆分

language train validation test
ace 100 100 100
bg 20000 10000 10000
da 20000 10000 10000
fur 100 100 100
ilo 100 100 100
lij 100 100 100
mzn 100 100 100
qu 100 100 100
su 100 100 100
vi 20000 10000 10000
af 5000 1000 1000
bh 100 100 100
de 20000 10000 10000
fy 1000 1000 1000
io 100 100 100
lmo 100 100 100
nap 100 100 100
rm 100 100 100
sv 20000 10000 10000
vls 100 100 100
als 100 100 100
bn 10000 1000 1000
diq 100 100 100
ga 1000 1000 1000
is 1000 1000 1000
ln 100 100 100
nds 100 100 100
ro 20000 10000 10000
sw 1000 1000 1000
vo 100 100 100
am 100 100 100
bo 100 100 100
dv 100 100 100
gan 100 100 100
it 20000 10000 10000
lt 10000 10000 10000
ne 100 100 100
ru 20000 10000 10000
szl 100 100 100
wa 100 100 100
an 1000 1000 1000
br 1000 1000 1000
el 20000 10000 10000
gd 100 100 100
ja 20000 10000 10000
lv 10000 10000 10000
nl 20000 10000 10000
rw 100 100 100
ta 15000 1000 1000
war 100 100 100
ang 100 100 100
bs 15000 1000 1000
eml 100 100 100
gl 15000 10000 10000
jbo 100 100 100
map-bms 100 100 100
nn 20000 1000 1000
sa 100 100 100
te 1000 1000 1000
wuu 100 100 100
ar 20000 10000 10000
ca 20000 10000 10000
en 20000 10000 10000
gn 100 100 100
jv 100 100 100
mg 100 100 100
no 20000 10000 10000
sah 100 100 100
tg 100 100 100
xmf 100 100 100
arc 100 100 100
cbk-zam 100 100 100
eo 15000 10000 10000
gu 100 100 100
ka 10000 10000 10000
mhr 100 100 100
nov 100 100 100
scn 100 100 100
th 20000 10000 10000
yi 100 100 100
arz 100 100 100
cdo 100 100 100
es 20000 10000 10000
hak 100 100 100
kk 1000 1000 1000
mi 100 100 100
oc 100 100 100
sco 100 100 100
tk 100 100 100
yo 100 100 100
as 100 100 100
ce 100 100 100
et 15000 10000 10000
he 20000 10000 10000
km 100 100 100
min 100 100 100
or 100 100 100
sd 100 100 100
tl 10000 1000 1000
zea 100 100 100
ast 1000 1000 1000
ceb 100 100 100
eu 10000 10000 10000
hi 5000 1000 1000
kn 100 100 100
mk 10000 1000 1000
os 100 100 100
sh 20000 10000 10000
tr 20000 10000 10000
zh-classical 100 100 100
ay 100 100 100
ckb 1000 1000 1000
ext 100 100 100
hr 20000 10000 10000
ko 20000 10000 10000
ml 10000 1000 1000
pa 100 100 100
si 100 100 100
tt 1000 1000 1000
zh-min-nan 100 100 100
az 10000 1000 1000
co 100 100 100
fa 20000 10000 10000
hsb 100 100 100
ksh 100 100 100
mn 100 100 100
pdc 100 100 100
simple 20000 1000 1000
ug 100 100 100
zh-yue 20000 10000 10000
ba 100 100 100
crh 100 100 100
fi 20000 10000 10000
hu 20000 10000 10000
ku 100 100 100
mr 5000 1000 1000
pl 20000 10000 10000
sk 20000 10000 10000
uk 20000 10000 10000
zh 20000 10000 10000
bar 100 100 100
cs 20000 10000 10000
fiu-vro 100 100 100
hy 15000 1000 1000
ky 100 100 100
ms 20000 1000 1000
pms 100 100 100
sl 15000 10000 10000
ur 20000 1000 1000
bat-smg 100 100 100
csb 100 100 100
fo 100 100 100
ia 100 100 100
la 5000 1000 1000
mt 100 100 100
pnb 100 100 100
so 100 100 100
uz 1000 1000 1000
be-x-old 5000 1000 1000
cv 100 100 100
fr 20000 10000 10000
id 20000 10000 10000
lb 5000 1000 1000
mwl 100 100 100
ps 100 100 100
sq 5000 1000 1000
vec 100 100 100
be 15000 1000 1000
cy 10000 1000 1000
frr 100 100 100
ig 100 100 100
li 100 100 100
my 100 100 100
pt 20000 10000 10000
sr 20000 10000 10000
vep 100 100 100

引用信息

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and on-Wikipedia data.",
}