Dataset: jamescalam/youtube-transcriptions

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents roughly a sentence-length chunk of text alongside the video URL and timestamp.

Note that each item in the dataset contains just a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantial chunks of text. The following code snippet shows one way to do that:

from datasets import load_dataset

# first download the dataset
data = load_dataset(
    'jamescalam/youtube-transcriptions',
    split='train'
)

new_data = []  # this will store adjusted data

window = 6  # number of sentences to combine
stride = 3  # number of sentences to 'stride' over, used to create overlap

for i in range(0, len(data), stride):
    i_end = min(len(data), i+window)  # exclusive end of the window
    last = i_end - 1  # index of the last row actually merged
    if data[i]['title'] != data[last]['title']:
        # skip this window because it spans the end of one video and the start of the next
        continue
    # create larger text chunk
    text = ' '.join(data[i:i_end]['text'])
    # add to adjusted data list
    new_data.append({
        'start': data[i]['start'],
        'end': data[last]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published']
    })