Dataset: jamescalam/youtube-transcriptions
Language: en
Multilinguality: monolingual
Size: 10K<n<100K
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: afl-3.0

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents a roughly sentence-length chunk of text together with the corresponding video URL and timestamp.
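As a quick orientation, the sketch below loads the dataset and prints a single row. It only references the field names (`title`, `url`, `start`, `end`, `text`) that the merging snippet further down also uses:

```python
from datasets import load_dataset

# download the transcriptions (the dataset provides a 'train' split)
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

# each row holds one roughly sentence-length chunk plus video metadata
row = data[0]
print(row['title'], row['url'])
print(row['start'], row['end'], row['text'])
```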
Note that each item in the dataset contains only a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantial chunks of text; if you need to do that, this code snippet will help:
```python
from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data

window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    # create larger text chunk
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })
```
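If you prefer to keep working with the merged chunks through the `datasets` API rather than a plain Python list, one option (not part of the original snippet) is to wrap the list back into a `Dataset`:

```python
from datasets import Dataset

# convert the list of merged chunks back into a Dataset
merged = Dataset.from_list(new_data)

print(len(merged))        # fewer, longer chunks than the raw dataset
print(merged[0]['text'])  # a multi-sentence chunk rather than a single sentence
```

Note that with `window=6` and `stride=3`, consecutive chunks share three sentences of overlap, so text that falls on a chunk boundary still appears intact in a neighbouring chunk.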