Dataset:

imvladikon/hebrew_speech_kan

Dataset Card for hebrew_speech_kan

Dataset Summary

A Hebrew speech dataset for automatic speech recognition (ASR).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Hebrew

Dataset Structure

Data Instances

{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav',
  'array': array([-0.00265503, -0.0018158 , -0.00149536, ..., -0.00135803,
         -0.00231934, -0.00190735]),
  'sampling_rate': 16000},
 'sentence': 'היא מבינה אותי יותר מכל אחד אחר'}
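
As a rough illustration of how such a record can be consumed, the sketch below computes a clip's duration from the length of `array` and the `sampling_rate`. The record used here is a synthetic stand-in (one second of silence stored as a plain list, whereas real records hold a NumPy array), and `clip_duration_seconds` is a hypothetical helper name, not part of the dataset or of any library. With the `datasets` library, real records would typically be obtained through `load_dataset("imvladikon/hebrew_speech_kan")`.

```python
def clip_duration_seconds(example: dict) -> float:
    """Return the audio duration in seconds for one dataset record."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a real record: one second of silence at 16 kHz.
example = {
    "audio": {
        "path": "example.wav",  # hypothetical path, for illustration only
        "array": [0.0] * 16000,
        "sampling_rate": 16000,
    },
    "sentence": "היא מבינה אותי יותר מכל אחד אחר",
}

print(clip_duration_seconds(example))
```

The same helper should work unchanged on a real record, since it relies only on `len()` of the waveform and on the `sampling_rate` field.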

Data Fields

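audio: a dictionary containing the path to the audio file (path), the decoded waveform as an array of floats (array), and the sampling rate in Hz (sampling_rate; 16000 in the example above).
sentence: the Hebrew transcription of the audio clip.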

Data Splits

|                   | train | validation |
|-------------------|-------|------------|
| number of samples | 8000  | 2000       |
| hours             | 6.92  | 1.73       |
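
The split figures above imply an average clip length, which can be useful when choosing batch sizes or maximum input lengths. The following is a small worked calculation under the assumption that the reported hours are the total audio duration of each split; `mean_clip_seconds` is a hypothetical helper name.

```python
# Figures taken from the split statistics listed above.
splits = {
    "train": {"samples": 8000, "hours": 6.92},
    "validation": {"samples": 2000, "hours": 1.73},
}


def mean_clip_seconds(samples: int, hours: float) -> float:
    """Average clip duration in seconds, given total hours and sample count."""
    return hours * 3600 / samples


for name, info in splits.items():
    print(f"{name}: {mean_clip_seconds(**info):.2f} s per clip on average")
```

Both splits come out at roughly 3.11 seconds per clip, which is consistent with an 80/20 split of a single pool of short utterances.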

Dataset Creation

Curation Rationale

The data was scraped from YouTube (channel כאן, Kan). Outliers were removed based on clip length and on the ratio between the length of the audio and the length of the sentence.
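
The card does not state the thresholds or the exact ratio used for outlier removal. The sketch below shows one plausible form of such a filter, assuming the ratio is measured in characters of text per second of audio. The function name `is_outlier` and every threshold value are illustrative assumptions, not the curator's actual settings.

```python
def is_outlier(duration_s, sentence, min_s=1.0, max_s=15.0,
               min_cps=2.0, max_cps=25.0):
    """Flag a clip whose length or text-to-audio ratio looks implausible.

    All thresholds are illustrative assumptions, not the curator's values.
    """
    # Length check: drop clips that are too short or too long.
    if not (min_s <= duration_s <= max_s):
        return True
    # Ratio check: characters of transcript per second of audio.
    chars_per_second = len(sentence.strip()) / duration_s
    return not (min_cps <= chars_per_second <= max_cps)


clips = [
    # Plausible clip: about 10 characters per second.
    (3.1, "היא מבינה אותי יותר מכל אחד אחר"),
    # Too short to be a usable utterance.
    (0.2, "כן"),
    # Far too little text for ten seconds of audio.
    (10.0, "כן"),
]
kept = [clip for clip in clips if not is_outlier(*clip)]
print(len(kept))
```

Checking the duration first also guarantees a positive denominator in the ratio, so the filter does not need a separate guard against zero-length audio.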

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@misc{imvladikon2022hebrew_speech_kan,
  author = {Gurevich, Vladimir},
  title = {Hebrew Speech Recognition Dataset: Kan},
  year = {2022},
  howpublished = {\url{https://huggingface.co/datasets/imvladikon/hebrew_speech_kan}},
}

Contributions

[More Information Needed]