Dataset:

Nan-Do/OpenSubtitlesJapanese

The dataset contains (almost) the entire OpenSubtitles database for Japanese:

  • Over 7000 TV shows and/or movies.
  • The subtitles are human-generated.
  • The dataset has been parsed, cleaned and converted to UTF-8.

File contents:

  • OpenSubtitles.parquet: The text and the time data.

  • OpenSubtitles_meta.parquet: The existing metadata for each title.

  • OpenSubtitles-OA.parquet: The dataset encoded with two columns, SOURCE (the name of the movie/TV show) and TEXT (the subtitles), following the Open Assistant rules.

    Both tables can be joined on the ID column (the value can be NULL in the meta table).
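
The join on ID could be sketched with pandas roughly as follows. This is a minimal sketch, not the author's method: it assumes the two tables meant are OpenSubtitles.parquet and OpenSubtitles_meta.parquet, and the `text` and `title` columns in the toy frames are hypothetical stand-ins (only `ID` is named above). The helper name `join_on_id` is likewise made up for illustration.

```python
import pandas as pd


def join_on_id(subs: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Left-join subtitle rows with metadata rows on the ID column."""
    # Metadata rows with a NULL ID cannot be matched; drop them first,
    # because pandas (unlike SQL) would pair NaN keys with each other.
    meta = meta.dropna(subset=["ID"]).copy()
    # A column containing NULLs often loads as float; align it with subs.
    meta["ID"] = meta["ID"].astype(subs["ID"].dtype)
    # how="left" keeps every subtitle row, even when no metadata exists.
    return subs.merge(meta, on="ID", how="left")


# Toy frames standing in for the real files (column names other than ID
# are hypothetical). With the real data one would presumably start from:
#   subs = pd.read_parquet("OpenSubtitles.parquet")
#   meta = pd.read_parquet("OpenSubtitles_meta.parquet")
subs = pd.DataFrame({"ID": [1, 1, 2, 3], "text": ["a", "b", "c", "d"]})
meta = pd.DataFrame(
    {"ID": [1.0, 2.0, None], "title": ["Movie A", "Show B", "Unknown"]}
)

joined = join_on_id(subs, meta)
print(joined)
```

A left join is used here so that subtitle rows without metadata are kept rather than silently dropped; an inner join would be the alternative if only titles with metadata are wanted.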