数据集:

ai4bharat/samanantar

中文

Dataset Card for Samanantar

Dataset Summary

Samanantar is the largest publicly available parallel corpora collection for Indic language: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.

The corpus has 49.6M sentence pairs between English to Indian Languages.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Samanantar contains parallel sentences between English ( en ) and 11 Indic language:

  • Assamese ( as ),
  • Bengali ( bn ),
  • Gujarati ( gu ),
  • Hindi ( hi ),
  • Kannada ( kn ),
  • Malayalam ( ml ),
  • Marathi ( mr ),
  • Odia ( or ),
  • Punjabi ( pa ),
  • Tamil ( ta ) and
  • Telugu ( te ).

Dataset Structure

Data Instances

{
  'idx': 0,
  'src': 'Prime Minister Narendra Modi met Her Majesty Queen Maxima of the Kingdom of the Netherlands today.',
  'tgt': 'নতুন দিল্লিতে সোমবার প্রধানমন্ত্রী শ্রী নরেন্দ্র মোদীর সঙ্গে নেদারন্যান্ডসের মহারানী ম্যাক্সিমা সাক্ষাৎ করেন।',
  'data_source': 'pmi'
}

Data Fields

  • idx (int): ID.
  • src (string): Sentence in source language (English).
  • tgt (string): Sentence in destination language (one of the 11 Indic languages).
  • data_source (string): Source of the data. For created data sources, depending on the destination language, it might be one of:
    • anuvaad_catchnews
    • anuvaad_DD_National
    • anuvaad_DD_sports
    • anuvaad_drivespark
    • anuvaad_dw
    • anuvaad_financialexpress
    • anuvaad-general_corpus
    • anuvaad_goodreturns
    • anuvaad_indianexpress
    • anuvaad_mykhel
    • anuvaad_nativeplanet
    • anuvaad_newsonair
    • anuvaad_nouns_dictionary
    • anuvaad_ocr
    • anuvaad_oneindia
    • anuvaad_pib
    • anuvaad_pib_archives
    • anuvaad_prothomalo
    • anuvaad_timesofindia
    • asianetnews
    • betterindia
    • bridge
    • business_standard
    • catchnews
    • coursera
    • dd_national
    • dd_sports
    • dwnews
    • drivespark
    • fin_express
    • goodreturns
    • gu_govt
    • jagran-business
    • jagran-education
    • jagran-sports
    • ie_business
    • ie_education
    • ie_entertainment
    • ie_general
    • ie_lifestyle
    • ie_news
    • ie_sports
    • ie_tech
    • indiccorp
    • jagran-entertainment
    • jagran-lifestyle
    • jagran-news
    • jagran-tech
    • khan_academy
    • Kurzgesagt
    • marketfeed
    • mykhel
    • nativeplanet
    • nptel
    • ocr
    • oneindia
    • pa_govt
    • pmi
    • pranabmukherjee
    • sakshi
    • sentinel
    • thewire
    • toi
    • tribune
    • vsauce
    • wikipedia
    • zeebiz

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Creative Commons Attribution-NonCommercial 4.0 International .

Citation Information

@misc{ramesh2021samanantar,
      title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
      author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2021},
      eprint={2104.05596},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @albertvillanova for adding this dataset.