Dataset:

DISCOX/DISCO-10M

License:

cc-by-4.0

DOI:

10.57967/hf/0754

Other:

music

Preprint:

arxiv:2306.13512

Size:

10M<n<100M

Language:

en
Chinese

Getting Started

You can download the dataset using the Hugging Face datasets library:

# Requires the Hugging Face `datasets` package to be installed.
from datasets import load_dataset
ds = load_dataset("DISCOX/DISCO-10M")

Dataset Structure

The dataset contains the following features:

{
 'video_url_youtube',
 'video_title_youtube',
 'track_name_spotify',
 'video_duration_youtube_sec',
 'preview_url_spotify',
 'video_view_count_youtube',
 'video_thumbnail_url_youtube',
 'search_query_youtube',
 'video_description_youtube',
 'track_id_spotify',
 'album_id_spotify',
 'artist_id_spotify',
 'track_duration_spotify_ms',
 'primary_artist_name_spotify',
 'track_release_date_spotify',
 'explicit_content_spotify',
 'similarity_duration',
 'similarity_query_video_title',
 'similarity_query_description',
 'similarity_audio',
 'audio_embedding_spotify',
 'audio_embedding_youtube',
}

What is DISCO-10M?

DISCO-10M is a music dataset created to democratize research on large-scale machine learning models for music.

Due to copyright restrictions, the dataset contains no audio. Instead, it provides audio embedding features computed using Laion-CLAP, which can be used in place of the raw audio for many downstream tasks. If the raw audio is needed, it can be downloaded from the provided Spotify preview URL or via the YouTube link.

DISCO-10M was created by collecting a list of 400,000 artist IDs and 2.6M track IDs from Spotify, then collecting YouTube video links that match each track's duration, artist name, and track name. These matches were scored using the following three similarity metrics:

  • Duration similarity is calculated as 1 - abs(track_duration_spotify - video_duration_youtube) / max(track_duration_spotify, video_duration_youtube), with both durations expressed in the same unit.
  • Text similarity is calculated as the cosine similarity between the search query embedding and the video title embedding, and separately between the search query embedding and the video description embedding. This yields two scores, one for the title and one for the description. Embeddings are computed using Sentence-BERT.
  • Audio similarity is calculated using the cosine similarity between the Spotify preview snippet audio embedding and the YouTube audio embedding.
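The metrics above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the function names are hypothetical, the embeddings are stand-in lists of floats rather than real Laion-CLAP or Sentence-BERT outputs, and the durations are converted to seconds first because the feature names suggest Spotify durations are stored in milliseconds and YouTube durations in seconds.

```python
import math


def duration_similarity(track_duration_sec: float, video_duration_sec: float) -> float:
    """Return 1 - |a - b| / max(a, b) for two durations in the same unit."""
    longest = max(track_duration_sec, video_duration_sec)
    if longest == 0:
        # Assumption for this sketch: two zero-length durations count as identical.
        return 1.0
    return 1.0 - abs(track_duration_sec - video_duration_sec) / longest


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical values: a 200,000 ms Spotify track and a 250 s YouTube video.
track_duration_spotify_ms = 200_000
video_duration_youtube_sec = 250
dur_sim = duration_similarity(track_duration_spotify_ms / 1000, video_duration_youtube_sec)

# Toy 3-dimensional embeddings; real embeddings have many more dimensions.
audio_sim = cosine_similarity([0.1, 0.3, 0.5], [0.1, 0.3, 0.5])
```

The same cosine_similarity helper covers both the text and the audio metric, since they differ only in which embeddings are compared.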

For DISCO-10M, we only keep samples that satisfy: duration_similarity > 0.25 and (description_similarity > 0.65 or title_similarity > 0.65) and audio_similarity > 0.4
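As a rough sketch, the filtering rule can be written as a predicate over a single sample. The function name is hypothetical, and the mapping from the names in the rule to the dataset's feature names (similarity_duration, similarity_query_description, similarity_query_video_title, similarity_audio) is an assumption based on the feature list, not something the card states explicitly.

```python
def passes_disco_10m_filter(sample: dict) -> bool:
    """Return True if a sample meets the DISCO-10M similarity thresholds."""
    return (
        sample["similarity_duration"] > 0.25
        and (
            sample["similarity_query_description"] > 0.65
            or sample["similarity_query_video_title"] > 0.65
        )
        and sample["similarity_audio"] > 0.4
    )


# Hypothetical samples for illustration only.
kept = {
    "similarity_duration": 0.9,
    "similarity_query_description": 0.3,
    "similarity_query_video_title": 0.8,  # title match alone is enough
    "similarity_audio": 0.55,
}
dropped = {
    "similarity_duration": 0.9,
    "similarity_query_description": 0.7,
    "similarity_query_video_title": 0.8,
    "similarity_audio": 0.2,  # fails the audio threshold
}

print(passes_disco_10m_filter(kept), passes_disco_10m_filter(dropped))
```

Because the published dataset is already filtered with these thresholds, a predicate like this is mainly useful as a template for applying stricter cutoffs of your own.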

We offer three subsets based on DISCO-10M:

  • DISCO-10K-random: a small subset of random samples from the entire dataset.
  • DISCO-200K-random: a subset of random samples, useful for a lightweight and representative analysis of the entire dataset.
  • DISCO-200K-high-quality: a subset of samples that were filtered more strictly to ensure a higher-quality match between Spotify tracks and YouTube videos.

To cite our work, please refer to our paper (arXiv:2306.13512).