数据集:
ccaligned_multilingual
CCAligned 数据集包含与英语对齐的 137 种语言的平行或可比的网络文档对。这些网络文档对是通过对原始网络文档进行语言识别,并确保网页文档的 URL 中的语言代码相对应来构建的。这种模式匹配方法产生了超过1亿个与英语对齐的文档对。由于每个英语文档通常与不同目标语言的多个文档对齐,我们可以通过对英语文档进行连接,以获取直接对齐两个非英语文档的对齐文档(例如,阿拉伯语-法语)。该语料库是从 68 个 Commoncrawl 快照中创建的。
要加载不在配置中的语言,您只需要指定语言代码。您可以在 http://www.statmt.org/cc-aligned/ 中找到有效的语言。例如:
dataset = load_dataset("ccaligned_multilingual", language_code="fr_XX", type="documents")
或者
dataset = load_dataset("ccaligned_multilingual", language_code="fr_XX", type="sentences")
[需要更多信息]
数据集中的文本是用(137)种与英语对齐的多种语言编写的。
语言为 ak_GH 的文档类型示例:
{'Domain': 'islamhouse.com', 'Source_URL': 'https://islamhouse.com/en/audios/373088/', 'Target_URL': 'https://islamhouse.com/ak/audios/373088/', 'translation': {'ak_GH': "Ntwatiaa / wɔabɔ no tɔfa wɔ mu no te ase ma Umrah - Arab kasa|Islamhouse.com|Follow us:|facebook|twitter|taepe|Titles All|Fie wibesite|kasa nyina|Buukuu edi adanse ma prente|Nhyehyɛmu|Nyim/sua Islam|Curriculums|Nyina ndeɛma|Nyina ndeɛma (295)|Buukuu/ nwoma (2)|sini / muuvi (31)|ɔdio (262)|Aɛn websideNew!|Kɔ wura kramosom mu seisei|Ebio|figa/kaasɛ|Farebae|AKAkan|Kratafa titriw|kasa interface( anyimu) : Akan|Kasa ma no mu-nsɛm : Arab kasa|ɔdio|Ntwatiaa / wɔabɔ no tɔfa wɔ mu no te ase ma Umrah|play|pause|stop|mute|unmute|max volume|Kasakyerɛ ni :|Farebae:|17 / 11 / 1432 , 15/10/2011|Nhyehyɛmu:|Jurisprudence/ Esum Nimdea|Som|Hajj na Umrah|Jurisprudence/ Esum Nimdea|Som|Hajj na Umrah|Mmira ma Hajj na Umrah|nkyerɛmu|kasamu /sɛntɛns ma te ase na Umrah wɔ ... mu no hann ma no Quran na Sunnah na te ase ma no nana na no kasamu /sɛntɛns ma bi ma no emerging yi adu obusuani|Akenkane we ye di ko kasa bi su (36)|Afar - Qafár afa|Akan|Amhari ne - አማርኛ|Arab kasa - عربي|Assamese - অসমীয়া|Bengali - বাংলা|Maldive - ދިވެހި|Greek - Ελληνικά|English ( brofo kasa) - English|Persian - فارسی|Fula - pulla|French - Français|Hausa - Hausa|Kurdish - كوردی سۆرانی|Uganda ne - Oluganda|Mandinka - Mandinko|Malayalam - മലയാളം|Nepali - नेपाली|Portuguese - Português|Russian - Русский|Sango - Sango|Sinhalese - සිංහල|Somali - Soomaali|Albania ne - Shqip|Swahili - Kiswahili|Telugu - తెలుగు ప్రజలు|Tajik - Тоҷикӣ|Thai - ไทย|Tagalog - Tagalog|Turkish - Türkçe|Uyghur - ئۇيغۇرچە|Urdu - اردو|Uzbeck ne - Ўзбек тили|Vietnamese - Việt Nam|Wolof - Wolof|Chine ne - 中文|Soma kɔ bi kyerɛ adwen kɔ wɛb ebusuapanin|Soma kɔ ne kɔ hom adamfo|Soma kɔ bi kyerɛ adwen kɔ wɛb ebusuapanin|Nsɔwso fael (1)|1|الموجز في فقه العمرة|MP3 14.7 MB|Enoumah ebatahu|Rituals/Esom ajomadie ewu Hajji mmire .. 1434 AH [01] no fapemso Enum|Fiidbak/ Ye hiya wu jun kyiri|Lenke de yɛe|kɔntakt yɛn|Aɛn webside|Qura'an Kro kronkrom|Balagh|wɔ mfinimfin Dowload faele|Yɛ atuu bra Islam mu afei|Tsin de yɛe ewu|Anaa bomu/combine hɛn melin liste|© Islamhouse Website/ Islam dan webi site|×|×|Yi mu kasa|", 'en_XX': 'SUMMARY in the jurisprudence of Umrah - Arabic - Abdul Aziz Bin Marzooq Al-Turaifi|Islamhouse.com|Follow us:|facebook|twitter|QuranEnc.com|HadeethEnc.com|Type|Titles All|Home Page|All Languages|Categories|Know about Islam|All items|All items (4057)|Books (701)|Articles (548)|Fatawa (370)|Videos (1853)|Audios (416)|Posters (98)|Greeting cards (22)|Favorites (25)|Applications (21)|Desktop Applications (3)|To convert to Islam now !|More|Figures|Sources|Curriculums|Our Services|QuranEnc.com|HadeethEnc.com|ENEnglish|Main Page|Interface Language : English|Language of the content : Arabic|Audios|تعريب عنوان المادة|SUMMARY in the jurisprudence of Umrah|play|pause|stop|mute|unmute|max volume|Lecturer : Abdul Aziz Bin Marzooq Al-Turaifi|Sources:|AlRaya Islamic Recoding in Riyadh|17 / 11 / 1432 , 15/10/2011|Categories:|Islamic Fiqh|Fiqh of Worship|Hajj and Umrah|Islamic Fiqh|Fiqh of Worship|Hajj and Umrah|Pilgrimage and Umrah|Description|SUMMARY in jurisprudence of Umrah: A statement of jurisprudence and Umrah in the light of the Quran and Sunnah and understanding of the Ancestors and the statement of some of the emerging issues related to them.|This page translated into (36)|Afar - Qafár afa|Akane - Akan|Amharic - አማርኛ|Arabic - عربي|Assamese - অসমীয়া|Bengali - বাংলা|Maldivi - ދިވެހި|Greek - Ελληνικά|English|Persian - فارسی|Fula - pulla|French - Français|Hausa - Hausa|kurdish - كوردی سۆرانی|Ugandan - Oluganda|Mandinka - Mandinko|Malayalam - മലയാളം|Nepali - नेपाली|Portuguese - Português|Russian - Русский|Sango - Yanga ti Sango|Sinhalese - සිංහල|Somali - Soomaali|Albanian - Shqip|Swahili - Kiswahili|Telugu - తెలుగు|Tajik - Тоҷикӣ|Thai - ไทย|Tagalog - Tagalog|Turkish - Türkçe|Uyghur - ئۇيغۇرچە|Urdu - اردو|Uzbek - Ўзбек тили|Vietnamese - Việt Nam|Wolof - Wolof|Chinese - 中文|Send a comment to Webmaster|Send to a friend?|Send a comment to Webmaster|Attachments (1)|1|الموجز في فقه العمرة|MP3 14.7 MB|The relevant Material|The rituals of the pilgrimage season .. 1434 AH [ 01] the fifth pillar|The Quality of the Accepted Hajj (Piligrimage) and Its Limitations|Easy Path to the Rules of the Rites of Hajj|A Call to the Pilgrims of the Scared House of Allah|More|feedback|Important links|Contact us|Privacy policy|Islam Q&A|Learning Arabic Language|About Us|Convert To Islam|Noble Quran encyclopedia|IslamHouse.com Reader|Encyclopedia of Translated Prophetic Hadiths|Our Services|The Quran|Balagh|Center for downloading files|To embrace Islam now...|Follow us through|Or join our mailing list.|© Islamhouse Website|×|×|Choose language|'}}
语言为 ak_GH 的句子类型示例:
{'LASER_similarity': 1.4549942016601562, 'translation': {'ak_GH': 'Salah (nyamefere) ye Mmerebeia', 'en_XX': 'What he dislikes when fasting (10)'}}
对于文档类型:
对于句子类型:
某些小配置的拆分大小:
name | train |
---|---|
documents-zz_TR | 41 |
sentences-zz_TR | 34 |
documents-tz_MA | 4 |
sentences-tz_MA | 33 |
documents-ak_GH | 249 |
sentences-ak_GH | 478 |
[需要更多信息]
[需要更多信息]
谁是源语言的生产者?[需要更多信息]
[需要更多信息]
谁是注释者?[需要更多信息]
数据集包含在线捐赠自己声音的人类。您同意不尝试确定数据集中发言者的身份。
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
[需要更多信息]
@inproceedings{elkishky_ccaligned_2020, author = {El-Kishky, Ahmed and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Koehn, Philipp}, booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)}, month = {November}, title = {{CCAligned}: A Massive Collection of Cross-lingual Web-Document Pairs}, year = {2020} address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.480", doi = "10.18653/v1/2020.emnlp-main.480", pages = "5960--5969" }
感谢 @gchhablani 添加了该数据集。