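This dataset contains a sample of sentences taken from the FLORES-101 dataset that were either translated from scratch or post-edited from an existing automatic translation by three human translators. Translation was performed for the English-Italian language pair, and the translators' behavioral data (keystrokes, pauses, editing times) were collected using the PET platform.
This dataset was made for the final projects of the 2022 Natural Language Processing course at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.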
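Disclaimer: This repository does not provide direct access to the data, because the data relate to results that are not yet published. Sharing or publishing any data associated with this repository is therefore strictly forbidden. Students will receive a compressed folder containing the data, for use if they choose a project based on this dataset. To load the dataset with 🤗 Datasets, download and unzip the provided folder and pass its path to the load_dataset method, for example: datasets.load_dataset('GroNLP/ik-nlp-22_pestyle', 'full', data_dir='path/to/unzipped/folder').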
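The loading step described above can be wrapped in a small helper that catches the two most likely mistakes (a misspelled configuration name or a wrong folder path) before calling the library. This is a minimal sketch: the helper name load_pestyle and its checks are illustrative assumptions, not part of the dataset's tooling. Only the load_dataset call itself comes from the text above.

```python
from pathlib import Path

# Configuration names as listed in this card.
CONFIGS = ("full", "test_mask_subject", "test_mask_modality", "test_mask_time")


def load_pestyle(data_dir, config="full"):
    """Hypothetical wrapper around datasets.load_dataset for this dataset."""
    if config not in CONFIGS:
        raise ValueError(f"Unknown config {config!r}; expected one of {CONFIGS}")
    folder = Path(data_dir)
    if not folder.is_dir():
        raise FileNotFoundError(f"Unzipped data folder not found: {folder}")
    # Imported lazily so the checks above work without the library installed.
    from datasets import load_dataset

    return load_dataset("GroNLP/ik-nlp-22_pestyle", config, data_dir=str(folder))
```

Calling load_pestyle('path/to/unzipped/folder') is then equivalent to the one-line call above, with clearer errors when the folder or configuration name is wrong.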
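The language data is in English (BCP-47 en) and Italian (BCP-47 it).
The dataset contains four configurations: full, test_mask_subject, test_mask_modality and test_mask_time. The full configuration contains the main training set, in which all fields are available. Each of the other three (test_mask_subject, test_mask_modality, test_mask_time) contains a test set from which different fields have been removed, to avoid leaking information during evaluation. See the Data Splits section for more details.
The training set contains the following fields: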
Field | Description |
---|---|
item_id | The sentence identifier. The first digits of the number represent the document containing the sentence, while the last digit of the number represents the sentence position inside the document. Documents can contain from 3 to 5 semantically-related sentences each. |
subject_id | The identifier for the translator performing the translation from scratch or post-editing task. Values: t1 , t2 or t3 . |
modality | The modality of the translation task. Values: ht (translation from scratch), pe1 (post-editing Google Translate translations), pe2 (post-editing 1237321 translations). |
src_text | The original source sentence extracted from Wikinews, Wikibooks or Wikivoyage. |
mt_text | Missing if modality is ht . Otherwise, contains the automatically-translated sentence before post-editing. |
tgt_text | Final sentence produced by the translator (either via translation from scratch of src_text or post-editing mt_text ). |
edit_time | Total editing time for the translation in seconds. |
k_total | Total number of keystrokes for the translation. |
k_letter | Total number of letter keystrokes for the translation. |
k_digit | Total number of digit keystrokes for the translation. |
k_white | Total number of whitespace keystrokes for the translation. |
k_symbol | Total number of symbol (punctuation, etc.) keystrokes for the translation. |
k_nav | Total number of navigation keystrokes (left-right arrows, mouse clicks) for the translation. |
k_erase | Total number of erase keystrokes (backspace, cancel) for the translation. |
k_copy | Total number of copy (Ctrl + C) actions during the translation. |
k_cut | Total number of cut (Ctrl + X) actions during the translation. |
k_paste | Total number of paste (Ctrl + V) actions during the translation. |
n_pause_geq_300 | Number of pauses of 300ms or more during the translation. |
len_pause_geq_300 | Total duration of pauses of 300ms or more, in milliseconds. |
n_pause_geq_1000 | Number of pauses of 1000ms or more during the translation. |
len_pause_geq_1000 | Total duration of pauses of 1000ms or more, in milliseconds. |
num_annotations | Number of times the translator focused the textbox to translate the sentence during the translation session. E.g. 1 means the translation was performed once and never revised. |
n_insert | Number of post-editing insertions (empty for modality ht ) computed using the 1238321 library. |
n_delete | Number of post-editing deletions (empty for modality ht ) computed using the 1238321 library. |
n_substitute | Number of post-editing substitutions (empty for modality ht ) computed using the 1238321 library. |
n_shift | Number of post-editing shifts (empty for modality ht ) computed using the 1238321 library. |
bleu | Sentence-level BLEU score between MT and post-edited fields (empty for modality ht ) computed using the 12312321 library with default parameters. |
chrf | Sentence-level chrF score between MT and post-edited fields (empty for modality ht ) computed using the 12312321 library with default parameters. |
ter | Sentence-level TER score between MT and post-edited fields (empty for modality ht ) computed using the 1238321 library. |
aligned_edit | Aligned visual representation of REF ( mt_text ), HYP ( tgt_text ) and edit operations (I = Insertion, D = Deletion, S = Substitution) performed on the field. Replace \\n with \n to show the three aligned rows. |
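
The item_id description above implies a simple decoding rule: the last digit is the sentence position and the remaining leading digits identify the document. The sketch below illustrates that rule, assuming item_id is stored as an integer and that positions are always a single digit (consistent with documents of 3 to 5 sentences). The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_item_id(item_id: int) -> Tuple[int, int]:
    """Return (document id, sentence position) for an item_id."""
    # Last digit = position in the document, leading digits = document.
    return divmod(item_id, 10)


def group_by_document(item_ids: Iterable[int]) -> Dict[int, List[int]]:
    """Map each document id to the sorted sentence positions seen for it."""
    docs = defaultdict(list)
    for item_id in item_ids:
        doc_id, position = split_item_id(item_id)
        docs[doc_id].append(position)
    return {doc_id: sorted(positions) for doc_id, positions in docs.items()}


print(split_item_id(1072))  # document 107, sentence position 2
```

Grouping by document in this way can help keep semantically related sentences together, for example when carving a validation set out of the training data.
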
config | train | test |
---|---|---|
main | 1170 | 120 |
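The training set contains a total of 1170 triplets (or pairs, when translation from scratch is performed), annotated with the behavioral data produced during translation. The following is an example taken from the training set, in which subject t3 post-edits a machine translation produced by system 2 (modality pe2). The aligned_edit field encodes three aligned rows (REF, HYP and EVAL) separated by \\n.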
{ "item_id": 1072, "subject_id": "t3", "modality": "pe2", "src_text": "At the beginning dress was heavily influenced by the Byzantine culture in the east.", "mt_text": "All'inizio il vestito era fortemente influenzato dalla cultura bizantina dell'est.", "tgt_text": "Inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.", "edit_time": 45.687, "k_total": 51, "k_letter": 31, "k_digit": 0, "k_white": 2, "k_symbol": 3, "k_nav": 7, "k_erase": 3, "k_copy": 0, "k_cut": 0, "k_paste": 0, "n_pause_geq_300": 9, "len_pause_geq_300": 40032, "n_pause_geq_1000": 5, "len_pause_geq_1000": 38392, "num_annotations": 1, "n_insert": 0.0, "n_delete": 1.0, "n_substitute": 3.0, "n_shift": 0.0, "bleu": 47.99, "chrf": 62.05, "ter": 40.0, "aligned_edit": "REF: all'inizio il vestito era fortemente influenzato dalla cultura bizantina dell'est.\\n HYP: ********** inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.\\n EVAL: D S S S" }
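To view the REF, HYP and EVAL rows of aligned_edit on separate lines, the escaped separator has to be turned into a real newline, as the field description suggests. The following is a minimal sketch, assuming the field arrives as a Python string containing a literal backslash followed by n. The function name and the shortened sample string are illustrative only.

```python
def render_aligned_edit(aligned_edit: str) -> str:
    """Turn the escaped two-character separator into real line breaks."""
    # "\\n" is the literal backslash + n stored in the field; "\n" is a newline.
    return aligned_edit.replace("\\n", "\n")


# Shortened, made-up sample in the same REF/HYP/EVAL layout.
sample = "REF: il vestito\\nHYP: l'abbigliamento\\nEVAL: S"
print(render_aligned_edit(sample))
```

Whitespace inside each row is left untouched, since the column alignment between the three rows depends on it.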
The texts are provided as-is, without further preprocessing or tokenization.
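The three test splits (one per configuration) each contain the same 120 entries and follow the same structure as the training set. Each test split omits some fields to prevent information leakage:
- In test_mask_subject, the subject_id field is absent. This split is used for the main task, post-editing stylometry.
- In test_mask_modality, used for the additional modality prediction task, the following fields are omitted: modality, mt_text, n_insert, n_delete, n_substitute, n_shift, ter, bleu, chrf and aligned_edit.
- In test_mask_time, used for the additional time and pause prediction task, the following fields are omitted: edit_time, n_pause_geq_300, len_pause_geq_300, n_pause_geq_1000 and len_pause_geq_1000.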
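The masking scheme above can be written down as a lookup table, which is handy for reproducing test conditions on a held-out portion of the training set. This is a sketch under the assumption that records are plain dictionaries keyed by the field names listed in this card. MASKED_FIELDS and mask_record are hypothetical names, not part of the dataset's loading script.

```python
from typing import Any, Dict

# Fields hidden in each configuration, as described in this card.
MASKED_FIELDS = {
    "full": [],
    "test_mask_subject": ["subject_id"],
    "test_mask_modality": [
        "modality", "mt_text", "n_insert", "n_delete", "n_substitute",
        "n_shift", "ter", "bleu", "chrf", "aligned_edit",
    ],
    "test_mask_time": [
        "edit_time", "n_pause_geq_300", "len_pause_geq_300",
        "n_pause_geq_1000", "len_pause_geq_1000",
    ],
}


def mask_record(record: Dict[str, Any], config: str) -> Dict[str, Any]:
    """Return a copy of record without the fields hidden in config."""
    hidden = set(MASKED_FIELDS[config])
    return {key: value for key, value in record.items() if key not in hidden}


example = {"item_id": 1072, "subject_id": "t3", "modality": "pe2", "edit_time": 45.687}
print(mask_record(example, "test_mask_subject"))
```

Applying mask_record to validation examples before inference ensures a model is never evaluated with access to a field that the corresponding test split withholds.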
The dataset was parsed from PET XML files into CSV format using a script adapted from the one by Antonio Toral, available at [this link](https://github.com/antot/postediting_novel_frontiers).
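For questions about this 🤗 Datasets version, please contact us at ik-nlp-course@rug.nl.
Sharing or publishing data associated with this 🤗 Datasets version is forbidden.
No citation information is provided for this dataset.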