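Dataset:

NbAiLab/NPSC

Tasks:

Automatic Speech Recognition

Audio Classification

Languages:

Multilinguality:

monolingual

Size:

size_categories:2G<n<1B

Language creators:

found

Annotations creators:

no-annotation

Source datasets:

original

Other:

speech-modeling

License:

cc0-1.0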


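Dataset Card for NbAiLab/NPSC

The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions in Norwegian Bokmål and Norwegian Nynorsk. All transcriptions were done manually by trained linguists or philologists, and the manual transcriptions were subsequently proofread to ensure consistency and accuracy. Entire parliamentary meetings are transcribed in the dataset.

This repository contains a version of the NPSC in the Hugging Face Datasets format. Note that the official release of the dataset, which can be found in the repository of the Norwegian Language Bank, contains more information than the version found here, including word-level metadata, metadata about the speakers, and detailed documentation.

How to Use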

# Load the 16K Bokmål corpus in streaming mode
from datasets import load_dataset
data = load_dataset("NbAiLab/NPSC", name="16K_mp3_bokmaal", streaming=True)
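
Because a streaming dataset is an iterable rather than an indexable table, a quick way to inspect it is to take only the first few records. The following is a minimal sketch, not part of the official card: `take` and `preview_npsc` are hypothetical helper names, and the split name "train" is an assumption that this page does not confirm.

```python
from itertools import islice


def take(records, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))


def preview_npsc(n=3):
    """Stream the 16K Bokmål configuration and return its first n records.

    Assumption: the split is called "train"; adjust if the corpus names it
    differently. Requires network access and the `datasets` package.
    """
    from datasets import load_dataset  # imported lazily so `take` works offline

    data = load_dataset("NbAiLab/NPSC", "16K_mp3_bokmaal", streaming=True)
    return take(data["train"], n)
```

Calling `preview_npsc()` downloads only as much data as is needed for the first records, which is usually preferable to materializing the full corpus just to look at its structure.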

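Dataset Summary

The NPSC dataset contains JSON lines with language training data. The data loader adds the audio data to this structure. Here is an example JSON object: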

{
"sentence_id": 49853,
"sentence_order": 0,
"speaker_id": 32,
"meeting_date": "20170110",
"speaker_name": "Olemic Thommessen",
"sentence_text": "Stortingets møte er lovlig satt",
"sentence_language_code": "nb-NO",
"text": "Stortingets møte er lovlig satt",
"start_time": 320246, 
"end_time": 323590,
"normsentence_text": "Stortingets møte er lovlig satt",
"transsentence_text": "Stortingets møte er lovleg sett",
"translated": 1,
"audio": {"path": "audio/20170110-095504_320246_323590.wav","array": [.......]}
}
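
The timing and date fields in a record like this can be turned into more convenient values. The sketch below copies four fields from the example record and relies on the field descriptions in this card: `start_time` and `end_time` are in milliseconds, and `meeting_date` uses the format yyyymmdd. The helper names `duration_seconds` and `meeting_day` are illustrative only.

```python
import json
from datetime import datetime

# Subset of the example record, without the audio payload.
record = json.loads("""
{
  "sentence_id": 49853,
  "meeting_date": "20170110",
  "start_time": 320246,
  "end_time": 323590
}
""")


def duration_seconds(rec):
    """Length of the sentence in seconds, from millisecond offsets."""
    return (rec["end_time"] - rec["start_time"]) / 1000.0


def meeting_day(rec):
    """Parse the yyyymmdd meeting date into a datetime.date."""
    return datetime.strptime(rec["meeting_date"], "%Y%m%d").date()


print(duration_seconds(record), meeting_day(record).isoformat())
```

For the example record, the two offsets differ by 3344 milliseconds, so the sentence lasts a little over three seconds.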

Data Fields

Key	Type	Description
sentence_id	Integer	Unique identifier of the sentence
sentence_order	Integer	A number indicating the order of the sentences in the meeting
speaker_id	Integer	The ID of the speaker. This can be linked to the original dataset containing thorough demographic and dialectal information about the speaker.
meeting_date	String	The date for the meeting in the format yyyymmdd
speaker_name	String	Name of the speaker. All speakers were members of the Norwegian Parliament or members of the Norwegian Government at the meeting date
sentence_text	String	The sentence text. The transcribed text string of the sentence in non-normalized form. This is the text of the manual transcriptions, without any postprocessing (apart from corrections of known errors). It may contain interrupted words, non-standard words and function words with a pronunciation deviating from the written form. Detailed metadata about the words in the sentence can be found in the word-tokenized version of the corpus in the official release of the dataset.
sentence_language_code	String	The language code of the sentence. The following alternatives exist in the file: ['nb-NO', 'nn-NO', 'en-US']
text	String	Sentence text. This is a copy of "sentence_text". It is included here to make it more convenient to interleave with other datasets.
start_time	Integer	The start time of the sentence in milliseconds. This time is relative to the start of the audio file of the entire meeting, which can be accessed in the official release
end_time	Integer	End time. See comment above.
normsentence_text	String	Normalized sentence text. In this version of the transcription, numbers and dates are written in digits on standardized formats, and common abbreviations are used. These modifications to the original transcriptions are produced automatically using normalization grammars
transsentence_text	String	Translated sentence text. Whenever the original transcription is in Bokmål (nb-NO), this field contains a machine-translated version in Nynorsk (nn-NO), and vice versa
translated	Integer	A flag indicating whether a machine-translated version has been produced or not. Sentences in en-US have not been translated
audio	Array	The dataloader will encode the associated audio files and provide them as an array containing 'path', 'sound array', 'sampling_rate'
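
The `sentence_language_code`, `transsentence_text`, and `translated` fields together make it possible to build Bokmål/Nynorsk sentence pairs. The following is a sketch under two assumptions that this card does not state explicitly: that `translated` equals 1 when a machine translation exists, and that the translation can be paired with `sentence_text` (it may instead have been produced from the normalized text). `parallel_pair` is a hypothetical helper name.

```python
# Maps each written standard to the one it is machine-translated into.
OTHER_STANDARD = {"nb-NO": "nn-NO", "nn-NO": "nb-NO"}


def parallel_pair(rec):
    """Return a dict keyed by language code, or None if no translation exists."""
    if rec.get("translated") != 1:
        return None
    source = rec["sentence_language_code"]
    target = OTHER_STANDARD.get(source)
    if target is None:  # e.g. en-US sentences are not translated
        return None
    return {source: rec["sentence_text"], target: rec["transsentence_text"]}


example = {
    "sentence_language_code": "nb-NO",
    "sentence_text": "Stortingets møte er lovlig satt",
    "transsentence_text": "Stortingets møte er lovleg sett",
    "translated": 1,
}
print(parallel_pair(example))
```

Keying the result by language code rather than by "source" and "target" means Bokmål-original and Nynorsk-original sentences end up in the same shape.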

Initial Data Collection

The procedure for creating the dataset is described in detail in our paper.

Statistics

Feature	Value
Duration, pauses included	140.3 hours
Duration, pauses not included	125.7 hours
Word count	1.2 million
Sentence count	64,531
Language distribution	Nynorsk: 12.8%, Bokmål: 87.2%
Gender distribution	Female: 38.3%, Male: 61.7%
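
A few per-sentence averages follow directly from these totals. The short calculation below is only back-of-the-envelope arithmetic, taking the figures in the table as 140.3 and 125.7 hours, 1.2 million words, and 64,531 sentences; because the word count is rounded, the results are approximate.

```python
total_hours = 140.3    # duration, pauses included
speech_hours = 125.7   # duration, pauses not included
words = 1.2e6          # rounded word count from the table
sentences = 64531

# Average speech time per sentence, in seconds.
avg_sentence_seconds = speech_hours * 3600 / sentences

# Average number of words per sentence.
avg_words_per_sentence = words / sentences

# Fraction of the total recording time that consists of pauses.
pause_share = (total_hours - speech_hours) / total_hours

print(f"{avg_sentence_seconds:.1f} s per sentence")
print(f"{avg_words_per_sentence:.1f} words per sentence")
print(f"{pause_share:.1%} pauses")
```

This works out to roughly 7 seconds and between 18 and 19 words per sentence, with pauses making up about a tenth of the recorded time.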

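Considerations for Using the Data

This corpus contains speech data. All recordings are of members of Parliament speaking in a public setting and can be distributed without restrictions.

Dataset Creators and Curators

The content of the dataset was created by the Norwegian Language Bank (Språkbanken) at the National Library of Norway. Javier de la Rosa, Freddy Wetjen, Per Egil Kummervold, and Andre Kaasen contributed to converting it into a HuggingFace Dataset. Thanks to the HuggingFace team for their assistance.

License

The audio and the transcriptions are released under the CC-ZERO-license. The curation of the HuggingFace Dataset is released under the CC-BY-SA-3-license.

Citation Information

If you use this dataset, please refer to the following article and to this page for detailed information: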

@inproceedings{solberg2022norwegian,
  title={The Norwegian Parliamentary Speech Corpus},
  author={Solberg, Per Erik and Ortiz, Pablo},
  booktitle={Proceedings of the 13th Language Resources and Evaluation Conference},
  url={http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.106.pdf},
  year={2022}
}

Author:

NbAiLab

Dataset size:

4.57 GB