Dataset:

Antreas/TALI

Dataset Card for "TALI-large"

Dataset Description

Abstract

TALI is a large-scale, tetramodal dataset designed to facilitate a shift from unimodal and duomodal to tetramodal research in deep learning. It aligns text, video, images, and audio, providing a rich resource for innovative self-supervised learning tasks and multimodal research. TALI enables exploration of how different modalities and data/model scaling affect downstream performance, with the aim of inspiring diverse research ideas and enhancing understanding of model capabilities and robustness in deep learning.

Brief Description

TALI (Temporally and semantically Aligned Audio, Language and Images) is a dataset that uses the captions and article titles of the Wikipedia Image Text (WiT) dataset to search YouTube for matching videos. It then downloads the video, audio, and subtitles for each match. The result is a rich multimodal dataset with multiple caption types related to both the WiT images and the YouTube videos. This enables learning between temporally or semantically aligned text, images, audio, and video.

Dataset Information

Modalities

The TALI dataset consists of the following modalities:

  • Image:
      • Wikipedia caption image
      • Randomly sampled image from YouTube video
  • Text:
      • Wikipedia Caption Text
      • Wikipedia Title Text
      • Wikipedia Main Body Text
      • YouTube Subtitle Text
      • YouTube Description Text
      • YouTube Title Text
  • Audio:
      • YouTube Content Audio
  • Video:
      • YouTube Content Video

Dataset Variants

The TALI dataset comes in three variants that differ in the training set size:

  • TALI-small: Contains about 1.3 million 30-second video clips, aligned with 120K WiT entries.
  • TALI-base: Contains about 6.5 million 30-second video clips, aligned with 120K WiT entries.
  • TALI-big: Contains about 13 million 30-second video clips, aligned with 120K WiT entries.

The validation and test sets are the same across all three variants. Each contains about 80K video clips aligned to 8K Wikipedia entries (10 subclips per Wikipedia entry).

Dataset Statistics

TBA

Dataset Creation

The TALI dataset was created by starting from the WiT dataset and using either the context_page_description or page_title field as a source query to search YouTube for videos that were Creative Commons opted-in and not age-restricted. The top 100 result titles were returned and compared with the source query using the text embeddings of the largest CLIP model available. The video whose title ranked first under CLIP was chosen and downloaded. The video was broken into 30-second segments, and the top 10 segments for each video were chosen based on the distance between the CLIP image embedding of the first image of each segment and the video's title text. The image, audio, and subtitle frames were extracted from these segments. At sampling time, one of these 10 segments is randomly selected, and a 10-second segment is chosen out of the 30-second clip. The result is 200 video frames (spread throughout the 10-second segment) and 160,000 audio frames (10 seconds).
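The ranking and sampling steps described above can be sketched roughly as follows. This is an illustration only, not the curators' pipeline: the embeddings are assumed to be precomputed vectors, cosine similarity stands in for the unspecified CLIP "distance", the window start is assumed to be drawn uniformly, and all function and constant names are hypothetical.

```python
import math
import random

CLIP_SECONDS = 30        # length of each stored segment (from the card)
WINDOW_SECONDS = 10      # length of the window drawn at sampling time
VIDEO_FRAMES = 200       # video frames returned per window
AUDIO_FRAMES = 160_000   # audio frames returned per window


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidate_embeddings, top_k):
    """Return indices of the top_k candidates most similar to the query.

    With top_k=1 this mirrors choosing the best of the 100 result titles;
    with top_k=10 it mirrors keeping the 10 best segments of a video.
    """
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def sample_window(num_segments=10, rng=random):
    """Pick one stored segment and a window start (in seconds) inside it."""
    segment = rng.randrange(num_segments)
    # Assumption: the start offset is uniform over all valid positions.
    start = rng.uniform(0, CLIP_SECONDS - WINDOW_SECONDS)
    return segment, start


def implied_rates():
    """Frame rates implied by the per-window counts stated in the card."""
    return VIDEO_FRAMES / WINDOW_SECONDS, AUDIO_FRAMES / WINDOW_SECONDS
```

Under these assumptions, `implied_rates()` works out to 20 video frames per second and 16,000 audio frames per second, which follows directly from the frame counts quoted for a 10-second window.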

Dataset Use

TALI is designed for use in a wide range of multimodal research tasks, including but not limited to:

  • Multimodal understanding and reasoning
  • Self-supervised learning
  • Multimodal alignment and translation
  • Multimodal summarization
  • Multimodal question answering

Dataset Curators: Antreas Antoniou

Citation Information: TBA

Contributions: Thanks to all contributors including data curators, annotators, and software developers.