Dataset:

NLPC-UOM/sentence_alignment_dataset-Sinhala-Tamil-English


Dataset summary

This is a gold-standard benchmark dataset for sentence alignment across the Sinhala, English, and Tamil languages. The data was crawled from the news websites listed below. Sentences were aligned within the document pairs already annotated as aligned in the NLPC-UOM/document_alignment_dataset-Sinhala-Tamil-English dataset.

News Source    URL
Army           https://www.army.lk/
Hiru           http://www.hirunews.lk
ITN            https://www.itnnews.lk
Newsfirst      https://www.newsfirst.lk

The aligned sentences have been manually annotated.

Dataset

The folder structure for each news source is as follows.

si-en
  |--army
      |--Sinhala
      |--English
      |--army.si-en
  |--hiru
      |--Sinhala 
      |--English 
      |--hiru.si-en
  |--itn 
      |--Sinhala 
      |--English 
      |--itn.si-en
  |--newsfirst
      |--Sinhala 
      |--English 
      |--newsfirst.si-en 
ta-en
si-ta

Sinhala/English/Tamil: these folders contain the aligned documents, in the two languages of the pair, for the given news source (army, hiru, itn, or newsfirst). Aligned documents share the same ID.
army.si-en: the gold-standard sentence alignment for that news source. Each sentence is referenced in the form languageprefix_fileid_sentenceId.
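
The layout and the sentence reference format described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`parse_sentence_id`, `alignment_file`), the root folder, and the sample reference `si_12_3` are hypothetical and not part of the dataset. It also assumes that the fields of a reference are separated by underscores, and that the ta-en and si-ta folders name their alignment files the same way as si-en.

```python
from pathlib import Path
from typing import NamedTuple


class SentenceRef(NamedTuple):
    lang: str         # language prefix, e.g. "si"
    file_id: str      # ID shared by the aligned documents
    sentence_id: str  # identifier of the sentence within the document


def parse_sentence_id(ref: str) -> SentenceRef:
    """Split a 'languageprefix_fileid_sentenceId' reference into its parts.

    Assumption: the first and the last underscore delimit the fields, so a
    file ID that itself contains underscores is kept intact.
    """
    lang, _, rest = ref.partition("_")
    file_id, _, sentence_id = rest.rpartition("_")
    if not (lang and file_id and sentence_id):
        raise ValueError(f"unexpected sentence reference: {ref!r}")
    return SentenceRef(lang, file_id, sentence_id)


def alignment_file(root: Path, pair: str, source: str) -> Path:
    """Build the path of an alignment file, e.g. si-en/army/army.si-en.

    Assumption: other language pairs follow the same naming pattern.
    """
    return root / pair / source / f"{source}.{pair}"


if __name__ == "__main__":
    # "si_12_3" is a made-up reference used only for illustration.
    print(parse_sentence_id("si_12_3"))
    print(alignment_file(Path("data"), "si-en", "army"))
```

Splitting on the first and last underscore, rather than on every underscore, is a defensive choice in case file IDs contain underscores; if the actual references use a different delimiter, the parser would need adjusting.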

Citation Information

@article{fernando2022exploiting,
  title={Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages},
  author={Fernando, Aloka and Ranathunga, Surangika and Sachintha, Dilan and Piyarathna, Lakmali and Rajitha, Charith},
  journal={Knowledge and Information Systems},
  pages={1--42},
  year={2022},
  publisher={Springer}
}