Dataset:

hrenwac_para

Tasks:

Translation

Languages:

en hr

Multilinguality:

translation

Size:

10K<n<100K

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

Dataset Card for hrenwac_para

Dataset Summary

The hrenWaC corpus, version 2.0, consists of parallel Croatian-English texts crawled from .hr, the top-level domain of Croatia. The corpus was built with Spidextor ( https://github.com/abumatran/spidextor ), a tool that glues together the output of SpiderLing, used for crawling, and Bitextor, used for bitext extraction. The accuracy of the extracted bitext is around 80% at the segment level and around 84% at the word level.
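
Since a segment-level accuracy of around 80% implies that roughly one in five extracted pairs may be misaligned, users may want a cheap sanity check before training on the data. The following is a minimal sketch, not part of the corpus tooling. It assumes each record follows the usual Hugging Face translation layout, `{"translation": {"en": ..., "hr": ...}}`, which this card does not confirm. The function name `plausible_pair`, the thresholds, and the toy sentences are all hypothetical and chosen only for illustration.

```python
def plausible_pair(record, max_ratio=2.5, min_chars=3):
    """Heuristic check: both sides non-trivial and of comparable length.

    The record layout and both thresholds are assumptions, not values
    taken from the hrenWaC documentation.
    """
    pair = record["translation"]
    en = pair["en"].strip()
    hr = pair["hr"].strip()

    # Reject pairs where either side is empty or nearly empty.
    if len(en) < min_chars or len(hr) < min_chars:
        return False

    # Reject pairs whose character lengths differ too much.
    longer = max(len(en), len(hr))
    shorter = min(len(en), len(hr))
    return longer / shorter <= max_ratio


# Toy records invented for this example; they are not drawn from the corpus.
records = [
    {"translation": {"en": "Good morning.", "hr": "Dobro jutro."}},
    {"translation": {"en": "Contact",
                     "hr": "Dobrodošli na službene stranice našeg grada i turističke zajednice."}},
    {"translation": {"en": "", "hr": "Naslovnica"}},
]

kept = [r for r in records if plausible_pair(r)]
```

A length-ratio test only catches gross misalignments; it cannot detect pairs of similar length that are not translations of each other. If the data is loaded with the `datasets` library, a predicate like this could be passed to `Dataset.filter`.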

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is bilingual, containing Croatian and English.

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is released under the CC-BY-SA 3.0 license.

Citation Information

@misc{11356/1058,
  title = {Croatian-English parallel corpus {hrenWaC} 2.0},
  author = {Ljube{\v s}i{\'c}, Nikola and Espl{\`a}-Gomis, Miquel and Ortiz Rojas, Sergio and Klubi{\v c}ka, Filip and Toral, Antonio},
  url = {http://hdl.handle.net/11356/1058},
  note = {Slovenian language resource repository {CLARIN}.{SI}},
  copyright = {{CLARIN}.{SI} User Licence for Internet Corpora},
  year = {2016}
}

Contributions

Thanks to @IvanZidov for adding this dataset.