Dataset:

jamescalam/youtube-transcriptions

English

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents a roughly sentence-length chunk of text paired with the video URL and timestamp.
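To make the row format concrete, here is a hypothetical example of a single record. Every value is invented for illustration; only the field names are taken from the merge snippet further down, and the dataset may contain additional fields not shown here:

```python
# Hypothetical example of one row; all values are invented for illustration.
record = {
    'title': 'Some tutorial video',         # video title
    'published': '2022-01-01',              # publish date
    'url': 'https://youtu.be/xxxxxxxxxxx',  # video URL (placeholder)
    'id': 'xxxxxxxxxxx-t0.0',               # row identifier (placeholder)
    'text': 'Hi, welcome to the video.',    # ~one sentence of transcript
    'start': 0.0,                           # chunk start time in seconds
    'end': 3.5,                             # chunk end time in seconds
}
```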

Note that each item in the dataset contains only a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantive chunks of text. If you need to do that, this code snippet will help:

from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data

window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    i_end = min(len(data), i+window)  # exclusive end of this chunk
    if data[i]['title'] != data[i_end-1]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    # create larger text chunk
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end-1]['end'],  # end timestamp of the last merged sentence
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })
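As a sanity check, the same window/stride merge can be run on a few synthetic rows without downloading the dataset. The field names mirror the snippet above, all values are invented, and the merged `end` is taken from the last sentence actually included in the chunk:

```python
# Synthetic rows standing in for the dataset; values are invented.
rows = [
    {'title': 'video A', 'text': f'sentence {n}.', 'start': float(n),
     'end': float(n + 1), 'id': f'A-t{n}', 'url': 'https://youtu.be/A',
     'published': '2022-01-01'}
    for n in range(8)
]

window = 3  # sentences per merged chunk
stride = 2  # step between chunk starts (window - stride sentences overlap)

merged = []
for i in range(0, len(rows), stride):
    i_end = min(len(rows), i + window)  # exclusive end of this chunk
    chunk = rows[i:i_end]
    if chunk[0]['title'] != chunk[-1]['title']:
        continue  # chunk spans two videos; skip it
    merged.append({
        'start': chunk[0]['start'],
        'end': chunk[-1]['end'],  # end timestamp of the last merged sentence
        'title': chunk[0]['title'],
        'text': ' '.join(r['text'] for r in chunk),
        'id': chunk[0]['id'],
        'url': chunk[0]['url'],
        'published': chunk[0]['published'],
    })

print(merged[0]['text'])  # 'sentence 0. sentence 1. sentence 2.'
print(merged[0]['end'])   # 3.0
```

With `window=3` and `stride=2`, consecutive chunks share one sentence of overlap, which helps avoid cutting a thought in half at a chunk boundary.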