Dataset:

GroNLP/ik-nlp-22_pestyle

Tasks:

Translation

Languages:

en it

Multilinguality:

translation

Size Categories:

1K<n<10K

Language Creators:

found

Source Datasets:

original

License:

other

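Dataset Card for IK-NLP-22 Project 1: A Study in Post-Editing Stylometry

Dataset Summary

This dataset contains a sample of sentences taken from the FLORES-101 dataset that three human translators either translated from scratch or produced by post-editing an existing automatic translation. Translations were performed for the English-Italian language pair, and the translators' behavioral data (keystrokes, pauses, editing times) were collected using the PET platform.

The dataset is made available for final projects of the 2022 edition of the Natural Language Processing course at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.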

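Disclaimer: This repository does not provide direct access to the data, because it is tied to results that have not yet been published. For this reason, sharing or publishing any data associated with this repository is strictly forbidden. Students will receive a compressed folder containing the data when they choose a project based on this dataset. To load the dataset with 🤗 Datasets, download and unzip the provided folder, then pass its path to the load_dataset method, e.g. datasets.load_dataset('GroNLP/ik-nlp-22_pestyle', 'full', data_dir='path/to/unzipped/folder').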
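The loading step described above can be wrapped in a small helper. This is a minimal sketch: the function name load_pestyle and its checks are hypothetical and not part of the dataset or the library, and only the load_dataset call itself comes from the card. Depending on the installed version of the datasets library, script-based datasets may need extra arguments or may not be supported; that is not handled here.

```python
from pathlib import Path

# Configuration names listed in this card.
CONFIGS = ("full", "test_mask_subject", "test_mask_modality", "test_mask_time")


def load_pestyle(data_dir, config="full"):
    """Load one configuration from the unzipped course folder (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"Unknown config {config!r}; expected one of {CONFIGS}")
    folder = Path(data_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Unzipped data folder not found: {folder}")
    # Imported lazily so the argument checks above work without the library installed.
    from datasets import load_dataset
    return load_dataset("GroNLP/ik-nlp-22_pestyle", config, data_dir=str(folder))
```

Calling load_pestyle('path/to/unzipped/folder') would then be equivalent to the one-line call in the disclaimer, with an earlier and clearer error when the folder or configuration name is wrong.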

Languages

The language data is in English (BCP-47 en) and Italian (BCP-47 it).

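Dataset Structure

Data Instances

The dataset contains four configurations: full, test_mask_subject, test_mask_modality and test_mask_time. The full configuration contains the main training set, in which all fields are available. Each of the other three (test_mask_subject, test_mask_modality, test_mask_time) contains a test set from which a different group of fields has been removed, to avoid information leakage during evaluation. See the Data Splits section for details.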

Data Fields

The following fields are contained in the training set:

Field: Description
item_id: The sentence identifier. The leading digits identify the document containing the sentence, while the last digit gives the sentence position inside the document. Each document contains 3 to 5 semantically related sentences.
subject_id: The identifier of the translator performing the translation from scratch or the post-editing task. Values: t1, t2 or t3.
modality: The modality of the translation task. Values: ht (translation from scratch), pe1 (post-editing Google Translate translations), pe2 (post-editing translations from a second MT system).
src_text: The original source sentence extracted from Wikinews, Wikibooks or Wikivoyage.
mt_text: Missing if modality is ht. Otherwise, contains the automatically translated sentence before post-editing.
tgt_text: Final sentence produced by the translator (either by translating src_text from scratch or by post-editing mt_text).
edit_time: Total editing time for the translation, in seconds.
k_total: Total number of keystrokes for the translation.
k_letter: Total number of letter keystrokes for the translation.
k_digit: Total number of digit keystrokes for the translation.
k_white: Total number of whitespace keystrokes for the translation.
k_symbol: Total number of symbol (punctuation, etc.) keystrokes for the translation.
k_nav: Total number of navigation keystrokes (left-right arrows, mouse clicks) for the translation.
k_erase: Total number of erase keystrokes (backspace, cancel) for the translation.
k_copy: Total number of copy (Ctrl + C) actions during the translation.
k_cut: Total number of cut (Ctrl + X) actions during the translation.
k_paste: Total number of paste (Ctrl + V) actions during the translation.
n_pause_geq_300: Number of pauses of 300ms or more during the translation.
len_pause_geq_300: Total duration of pauses of 300ms or more, in milliseconds.
n_pause_geq_1000: Number of pauses of 1000ms or more during the translation.
len_pause_geq_1000: Total duration of pauses of 1000ms or more, in milliseconds.
num_annotations: Number of times the translator focused the textbox to translate the sentence during the translation session. E.g. 1 means the translation was performed once and never revised.
n_insert: Number of post-editing insertions (empty for modality ht), computed using an external library.
n_delete: Number of post-editing deletions (empty for modality ht), computed using an external library.
n_substitute: Number of post-editing substitutions (empty for modality ht), computed using an external library.
n_shift: Number of post-editing shifts (empty for modality ht), computed using an external library.
bleu: Sentence-level BLEU score between the MT and post-edited fields (empty for modality ht), computed using an external library with default parameters.
chrf: Sentence-level chrF score between the MT and post-edited fields (empty for modality ht), computed using an external library with default parameters.
ter: Sentence-level TER score between the MT and post-edited fields (empty for modality ht), computed using an external library.
aligned_edit: Aligned visual representation of REF (mt_text), HYP (tgt_text) and the edit operations (I = Insertion, D = Deletion, S = Substitution) performed on the field. Replace \\n with \n to show the three aligned rows.
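Two of the field descriptions above lend themselves to simple derived features. The sketch below is illustrative only: split_item_id and pause_share are hypothetical helper names, the record is a plain dictionary using the field names from the list above, and the share of editing time spent pausing is just one possible way to combine edit_time (seconds) with the pause durations (milliseconds).

```python
def split_item_id(item_id):
    """Split an item_id into (document id, sentence position).

    Follows the description above: the last digit is the sentence
    position, the leading digits identify the document.
    """
    return item_id // 10, item_id % 10


def pause_share(record, threshold_ms=300):
    """Fraction of the editing time spent in pauses of at least threshold_ms."""
    edit_time_ms = record["edit_time"] * 1000  # edit_time is given in seconds
    if edit_time_ms <= 0:
        return 0.0
    return record[f"len_pause_geq_{threshold_ms}"] / edit_time_ms


print(split_item_id(1072))  # (107, 2)

# Values taken from the train split example in this card.
record = {"edit_time": 45.687, "len_pause_geq_300": 40032, "len_pause_geq_1000": 38392}
print(round(pause_share(record), 3))  # 0.876
```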

Data Splits

config train test
main 1170 120
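
Train Split

The train split contains a total of 1170 triplets (or pairs, when translation from scratch is performed) annotated with the behavioral data produced during translation. Below is an example from the train split in which subject t3 post-edits a machine translation produced by system 2 (modality pe2). The aligned_edit field is shown over three lines to give a visual understanding of its contents.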

{
    "item_id": 1072,
    "subject_id": "t3",
    "modality": "pe2",
    "src_text": "At the beginning dress was heavily influenced by the Byzantine culture in the east.",
    "mt_text": "All'inizio il vestito era fortemente influenzato dalla cultura bizantina dell'est.",
    "tgt_text": "Inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.",
    "edit_time": 45.687,
    "k_total": 51,
    "k_letter": 31,
    "k_digit": 0,
    "k_white": 2,
    "k_symbol": 3,
    "k_nav": 7,
    "k_erase": 3,
    "k_copy": 0,
    "k_cut": 0,
    "k_paste": 0,
    "n_pause_geq_300": 9,
    "len_pause_geq_300": 40032,
    "n_pause_geq_1000": 5,
    "len_pause_geq_1000": 38392,
    "num_annotations": 1,
    "n_insert": 0.0,
    "n_delete": 1.0,
    "n_substitute": 3.0,
    "n_shift": 0.0,
    "bleu": 47.99,
    "chrf": 62.05,
    "ter": 40.0,
    "aligned_edit": "REF:  all'inizio il            vestito         era fortemente influenzato dalla cultura bizantina dell'est.\\n
                    HYP:  ********** inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.\\n
                    EVAL: D          S             S                                                                  S"
}

The text is provided as-is, without further preprocessing or tokenization.
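To inspect an aligned_edit value like the one in the example above, the escaped \\n separators can be turned into real line breaks and the EVAL row can be tallied. This is a minimal sketch: the string is a shortened, hand-made version of the example (an assumption for illustration, not a real record), and render_aligned_edit and count_edit_ops are hypothetical helper names.

```python
from collections import Counter

# Shortened stand-in for an aligned_edit value; "\\n" in this Python
# literal is a backslash followed by "n", i.e. the escaped separator.
aligned_edit = (
    "REF:  all'inizio il            vestito         era\\n"
    "HYP:  ********** inizialmente, l'abbigliamento era\\n"
    "EVAL: D          S             S                  "
)


def render_aligned_edit(value):
    """Return the three aligned rows (REF, HYP, EVAL) as separate strings."""
    return value.replace("\\n", "\n").split("\n")


def count_edit_ops(value):
    """Count the operations (I, D, S) marked in the EVAL row."""
    eval_row = render_aligned_edit(value)[2]
    # The first token is the "EVAL:" label; the rest are operation markers.
    return Counter(eval_row.split()[1:])


for row in render_aligned_edit(aligned_edit):
    print(row)

ops = count_edit_ops(aligned_edit)
print(ops["D"], ops["S"])  # 1 2
```

On the full example record, the same count should line up with the n_delete and n_substitute fields, which makes it a cheap consistency check.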

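Test Split

The three test splits (one per configuration) each contain the same 120 entries and follow the same structure as the train split. Each test split omits some fields to prevent information leakage:

  • In test_mask_subject, the subject_id field is absent; this split is used for the main task of post-editing stylometry.

  • In test_mask_modality, used for the additional task of modality prediction, the following fields are absent: modality, mt_text, n_insert, n_delete, n_substitute, n_shift, ter, bleu, chrf, aligned_edit.

  • In test_mask_time, used for the additional task of time and pause prediction, the following fields are absent: edit_time, n_pause_geq_300, len_pause_geq_300, n_pause_geq_1000 and len_pause_geq_1000.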
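The masking scheme above can be written down as a lookup table, which is handy for building a validation set from the train split that mimics a given test configuration. This is a sketch under assumptions: MASKED_FIELDS and mask_record are hypothetical names, records are treated as plain dictionaries, and the field lists are copied from the bullets above.

```python
# Fields absent from each test configuration, as listed above.
MASKED_FIELDS = {
    "test_mask_subject": ["subject_id"],
    "test_mask_modality": [
        "modality", "mt_text", "n_insert", "n_delete", "n_substitute",
        "n_shift", "ter", "bleu", "chrf", "aligned_edit",
    ],
    "test_mask_time": [
        "edit_time", "n_pause_geq_300", "len_pause_geq_300",
        "n_pause_geq_1000", "len_pause_geq_1000",
    ],
}


def mask_record(record, config):
    """Return a copy of a training record without the fields hidden in config."""
    hidden = set(MASKED_FIELDS[config])
    return {key: value for key, value in record.items() if key not in hidden}


# Toy record with only a few of the fields, for illustration.
toy = {"item_id": 1072, "subject_id": "t3", "modality": "pe2", "edit_time": 45.687}
print(sorted(mask_record(toy, "test_mask_subject")))  # ['edit_time', 'item_id', 'modality']
```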

Dataset Creation

The dataset was parsed from PET XML files into CSV format using a script adapted from the one by Antonio Toral, available at https://github.com/antot/postediting_novel_frontiers.

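Additional Information

Dataset Curators

For problems related to this 🤗 Datasets version, please contact us at ik-nlp-course@rug.nl.

Licensing Information

It is forbidden to share or publish the data associated with this 🤗 Datasets version.

Citation Information

No citation information is provided for this dataset.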