Dataset:

Finnish-NLP/mc4_fi_cleaned

Dataset Card for mC4 Finnish Cleaned

Dataset Summary

mC4 Finnish Cleaned is a cleaned version of the original mC4 Finnish split.

Supported Tasks and Leaderboards

mC4 Finnish is mainly intended for pretraining Finnish language models and word representations.

Languages

Finnish

Dataset Structure

Data Instances

[Needs More Information]

Data Fields

Each record has the following fields:

  • url: URL of the source, as a string
  • text: text content, as a string
  • timestamp: timestamp, as a string
  • perplexity_kenlm_full: perplexity of the text, calculated by a KenLM model
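
As a rough illustration of how these fields might be used, the sketch below filters records by their perplexity_kenlm_full value. It assumes the perplexity is stored as a number and that lower values indicate text the KenLM model finds more predictable; neither is stated in this card. The function name, the threshold of 1000.0, and the sample records are hypothetical and are not taken from the dataset.

```python
from typing import Iterable, Iterator, TypedDict


class Record(TypedDict):
    # Field names follow the list above; the float type is an assumption.
    url: str
    text: str
    timestamp: str
    perplexity_kenlm_full: float


def filter_by_perplexity(
    records: Iterable[Record], max_perplexity: float
) -> Iterator[Record]:
    """Yield records whose KenLM perplexity is at or below max_perplexity."""
    for record in records:
        if record["perplexity_kenlm_full"] <= max_perplexity:
            yield record


# Made-up records for illustration only; not real rows from the dataset.
sample: list[Record] = [
    {
        "url": "https://example.com/a",
        "text": "Tämä on esimerkkilause.",
        "timestamp": "2020-01-01T00:00:00Z",
        "perplexity_kenlm_full": 250.0,
    },
    {
        "url": "https://example.com/b",
        "text": "asdf qwer zxcv",
        "timestamp": "2020-01-02T00:00:00Z",
        "perplexity_kenlm_full": 4200.0,
    },
]

# The threshold is a hypothetical value chosen for this example.
kept = list(filter_by_perplexity(sample, max_perplexity=1000.0))
```

The same function should work on rows streamed with the Hugging Face datasets library, for example load_dataset("Finnish-NLP/mc4_fi_cleaned", split="train", streaming=True), provided the rows expose the fields listed above. A suitable threshold depends on the downstream use and would need to be chosen empirically.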

Data Splits

The dataset has two splits: train and validation.

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

[Needs More Information]