
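Dataset Card for Hansard Speech

Dataset Summary

A dataset containing every speech made in the UK House of Commons from May 1979 to July 2020. Quoted from the dataset homepage.

Please contact me if you find any errors in the dataset. The integrity of the public Hansard record is questionable at times, and while I have improved it, the data is presented "as is".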

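Supported Tasks and Leaderboards

  • Text classification: the dataset can be used to classify texts (transcribed from speeches) by time period or by type.
  • Language modeling: the dataset can be used to train or evaluate language models on historical texts.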

Languages

en:GB

Dataset Structure

Data Instances

{
  'id': 'uk.org.publicwhip/debate/1979-05-17a.390.0', 
  'speech': "Since the Minister for Consumer Affairs said earlier that the bread price rise would be allowed, in view of developing unemployment in the baking industry, and since the Mother's Pride bakery in my constituency is about to close, will the right hon. Gentleman give us a firm assurance that there will be an early debate on the future of the industry, so that the Government may announce that, thanks to the price rise, those workers will not now be put out of work?", 
  'display_as': 'Eric Heffer', 
  'party': 'Labour', 
  'constituency': 'Liverpool, Walton', 
  'mnis_id': '725', 
  'date': '1979-05-17', 
  'time': '', 
  'colnum': '390', 
  'speech_class': 'Speech', 
  'major_heading': 'BUSINESS OF THE HOUSE', 
  'minor_heading': '', 
  'oral_heading': '', 
  'year': '1979', 
  'hansard_membership_id': '5612', 
  'speakerid': 'uk.org.publicwhip/member/11615', 
  'person_id': '', 
  'speakername': 'Mr. Heffer', 
  'url': '', 
  'government_posts': [], 
  'opposition_posts': [], 
  'parliamentary_posts': ['Member, Labour Party National Executive Committee']
}
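
Every scalar field in the instance above is a string, and missing values appear as empty strings. The sketch below shows one possible way to turn such a record into typed values before analysis. The helper name `normalise_record` is hypothetical, and treating `year` and `colnum` as digit strings is an assumption based only on this single example.

```python
from datetime import date


def normalise_record(record: dict) -> dict:
    """Return a copy with empty strings as None and date/year/colnum typed."""
    out = {key: (None if value == "" else value) for key, value in record.items()}
    if out.get("date") is not None:
        out["date"] = date.fromisoformat(out["date"])
    for key in ("year", "colnum"):
        value = out.get(key)
        # Assumption: these fields are digit strings; anything else is left untouched.
        if isinstance(value, str) and value.isdigit():
            out[key] = int(value)
    return out


# A subset of the example record shown above.
example = {
    "id": "uk.org.publicwhip/debate/1979-05-17a.390.0",
    "date": "1979-05-17",
    "time": "",
    "colnum": "390",
    "year": "1979",
    "government_posts": [],
}
clean = normalise_record(example)
```

List-valued fields such as `government_posts` pass through unchanged, since only empty strings are mapped to `None`.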

Data Fields

Variable                Description
id                      The ID as assigned by mySociety
speech                  The text of the speech
display_as              The standardised name of the MP
party                   The party the MP was a member of at the time of the speech
constituency            Constituency represented by the MP at the time of the speech
mnis_id                 The MP's Members Name Information Service number
date                    Date of speech
time                    Time of speech
colnum                  Column number in the Hansard record
speech_class            Type of speech
major_heading           Major debate heading
minor_heading           Minor debate heading
oral_heading            Oral debate heading
year                    Year of speech
hansard_membership_id   ID used by mySociety
speakerid               ID used by mySociety
person_id               ID used by mySociety
speakername             MP name as it appeared in the Hansard record for the speech
url                     Link to speech
government_posts        Government posts held by the MP (list)
opposition_posts        Opposition posts held by the MP (list)
parliamentary_posts     Parliamentary posts held by the MP (list)
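
For readers who want the field list in machine-checkable form, the following is a minimal sketch of the schema as a Python `TypedDict`. The class name `HansardSpeech` and the helper `missing_fields` are hypothetical; the field types are inferred from the example instance (strings for scalar fields, lists of strings for the three post fields) and are not confirmed by the dataset author.

```python
from typing import List, TypedDict


class HansardSpeech(TypedDict):
    """One speech record; types are inferred from the example instance."""
    id: str
    speech: str
    display_as: str
    party: str
    constituency: str
    mnis_id: str
    date: str
    time: str
    colnum: str
    speech_class: str
    major_heading: str
    minor_heading: str
    oral_heading: str
    year: str
    hansard_membership_id: str
    speakerid: str
    person_id: str
    speakername: str
    url: str
    government_posts: List[str]
    opposition_posts: List[str]
    parliamentary_posts: List[str]


def missing_fields(record: dict) -> set:
    """Return the names of documented fields that are absent from a record."""
    return set(HansardSpeech.__annotations__) - set(record)
```

Calling `missing_fields` on a loaded record gives a quick check that nothing was dropped during loading.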

Data Splits

Train: 2694375
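
Because only a train split is provided, any evaluation set has to be carved out by the user. One option, sketched here as an illustration and not as a recommendation from the dataset author, is a chronological hold-out on the `year` field. The function name `split_by_year` and the toy records are hypothetical.

```python
def split_by_year(records, first_held_out_year):
    """Split records into (train, held_out) by comparing the 'year' field."""
    train, held_out = [], []
    for rec in records:
        # 'year' is stored as a string in the dataset, so convert before comparing.
        target = held_out if int(rec["year"]) >= first_held_out_year else train
        target.append(rec)
    return train, held_out


# Toy records standing in for real speeches.
toy_records = [
    {"year": "1979", "speech": "first"},
    {"year": "2001", "speech": "second"},
    {"year": "2019", "speech": "third"},
]
train, held_out = split_by_year(toy_records, first_held_out_year=2015)
```

A split by time keeps later speeches out of training, which matters when the goal is to evaluate on language from a period the model has not seen.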

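Dataset Creation

Curation Rationale

This dataset contains all the speeches made in the UK House of Commons and can be used for a number of deep learning tasks, such as detecting how language and societal views have changed over more than 40 years. It also offers text that is close to the spoken language used in an elite British institution.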
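
As a simple, non-deep-learning illustration of tracking language change over time, the sketch below computes, for each year, the share of speeches that mention a given term. The function name `term_share_by_year` and the toy records are hypothetical; only the `year` and `speech` field names come from the dataset.

```python
import re
from collections import defaultdict


def term_share_by_year(records, term):
    """Map each year to the fraction of its speeches that contain the term."""
    # Whole-word, case-insensitive match on the literal term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        year = int(rec["year"])
        totals[year] += 1
        if pattern.search(rec["speech"]):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy records standing in for real speeches.
toy = [
    {"year": "1979", "speech": "The bread price rise would be allowed."},
    {"year": "1979", "speech": "An early debate on the future of the industry."},
    {"year": "2020", "speech": "The price of bread is not the issue."},
]
shares = term_share_by_year(toy, "bread")
```

Counting speeches that contain a term, instead of raw occurrences, keeps a single long speech from dominating the figure for a year.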

Source Data

Initial Data Collection and Normalization

The dataset was created by retrieving data from data.parliament.uk. No normalization was applied.

Who are the source language producers?

[N/A]

Annotations

Annotation process

Who are the annotators?

[N/A]

Personal and Sensitive Information

This is public information and should not contain any personal or sensitive information.

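Considerations for Using the Data

Social Impact of Dataset

The purpose of this dataset is to understand how language use and societal views have changed over time.

Discussion of Biases

Because this dataset spans a long period of time, it may contain language and views that are unacceptable in modern society.

Other Known Limitations

[More Information Needed]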

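Additional Information

Dataset Curators

This dataset was created by Evan Odell, building on parlparse.

Licensing Information

Creative Commons Attribution 4.0 International License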

Citation Information

@misc{odell_evan_2021,
title={Hansard Speeches 1979-2021: Version 3.1.0}, 
DOI={10.5281/zenodo.4843485}, 
abstractNote={<p>Full details are available at <a href="https://evanodell.com/projects/datasets/hansard-data">https://evanodell.com/projects/datasets/hansard-data</a></p> <p><strong>Version 3.1.0 contains the following changes:</strong></p> <p>- Coverage up to the end of April 2021</p>}, 
note={This release is an update of previously released datasets. See full documentation for details.}, 
publisher={Zenodo}, 
author={Odell, Evan}, 
year={2021}, 
month={May} }

Thanks to @shamikbose for adding this dataset.