
The AudioSet dataset

The AudioSet dataset is a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. There are 2,084,320 YouTube videos containing 527 labels.

Dataset Statistics

AudioSet Sample

Reference paper


Dataset Usage

Info about Dataset

The AudioSet dataset is available for download in two formats:

Text (csv) files describing, for each segment, the YouTube video ID, start time, end time, and one or more labels.

128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
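Because the released features are quantized, they usually need to be mapped back to floats before use. The following is a minimal sketch, assuming 8-bit values and the linear scheme over the range [-2, 2] used by the YouTube-8M starter code; the function name, the range, and the bit width are assumptions, not something stated in this README.

```python
def dequantize(values, min_value=-2.0, max_value=2.0):
    """Map quantized bytes (0..255) back to approximate floats.

    Assumption: linear quantization over [min_value, max_value], with a
    small constant offset of range/512 added to every decoded value.
    """
    value_range = max_value - min_value
    scale = value_range / 255.0              # width of one quantization step
    bias = value_range / 512.0 + min_value   # offset applied to every value
    return [v * scale + bias for v in values]


# One 128-dimensional embedding would be a list of 128 byte values;
# three values are enough to show the mapping.
decoded = dequantize([0, 128, 255])
```

Under these assumptions the smallest byte decodes to slightly above -2 and the largest to slightly above 2.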

The labels are taken from the AudioSet ontology, which can be downloaded from our AudioSet GitHub repository.

The dataset is made available by Google Inc. under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, while the ontology is available under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Dataset split

The dataset is divided into three disjoint sets: a balanced evaluation set, a balanced training set, and an unbalanced training set. The balanced evaluation and training sets are constructed to give each class a comparable number of examples. The unbalanced training set contains the remainder of the annotated segments.

The balanced evaluation set contains 20,383 segments from distinct videos, providing at least 59 examples for each of the 527 sound classes that are used. Because of label co-occurrence, many classes have more examples.

The balanced training set contains 22,176 segments from distinct videos chosen with the same criteria: providing at least 59 examples per class with the fewest number of total segments.

The unbalanced training set contains 2,042,985 segments from distinct videos, representing the remainder of the dataset.
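The effect of label co-occurrence on per-class counts can be illustrated with a small sketch. The function names and the toy labels below are hypothetical; the only figure taken from the text is the minimum of 59 examples per class.

```python
from collections import Counter


def examples_per_class(segments):
    """Count how many segments carry each label.

    `segments` is an iterable of label collections, one per segment.
    """
    counts = Counter()
    for labels in segments:
        counts.update(set(labels))  # a segment counts once per distinct label
    return counts


def underrepresented(counts, classes, minimum=59):
    """Return the classes that have fewer than `minimum` examples."""
    return sorted(c for c in classes if counts[c] < minimum)


# Toy data: three segments, two of which carry more than one label.
toy_segments = [["speech", "music"], ["speech"], ["music", "dog"]]
toy_counts = examples_per_class(toy_segments)
```

Because one segment can carry several labels, the per-class counts add up to more than the number of segments, which is why many classes end up above the minimum.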

Each csv file has a three-line header with each line starting with “#”, and with the first two lines indicating the creation time and general statistics. Each subsequent line has columns defined by the third header line.
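A minimal sketch of reading such a file with Python's csv module follows. The column order (video ID, start time, end time, labels) follows the description of the csv format above; the column names in the sample header, the sample video ID, and the second label are made up for illustration, and the assumption that the labels appear as one quoted, comma-separated field is not confirmed by this README.

```python
import csv
import io

# Hypothetical sample in the described layout: three "#" header lines,
# then one segment per line.
SAMPLE_SEGMENTS = '''# hypothetical creation-time line
# hypothetical statistics line
# YTID, start_seconds, end_seconds, positive_labels
abcdefghijk, 30.000, 40.000, "/m/09x0r,/m/example"
'''


def read_segments(lines):
    """Parse segment rows, skipping every header line that starts with '#'."""
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.reader(data_lines, skipinitialspace=True)
    segments = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        video_id, start, end, labels = row[0], row[1], row[2], row[3]
        segments.append({
            "video_id": video_id,
            "start_seconds": float(start),
            "end_seconds": float(end),
            "labels": labels.split(","),
        })
    return segments


segments = read_segments(io.StringIO(SAMPLE_SEGMENTS))
```

For a real file, `read_segments(open(path))` should behave the same way, provided the file follows the layout assumed here.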

The total size of the features is 2.4 gigabytes. They are stored in 12,228 TensorFlow record files, sharded by the first two characters of the YouTube video ID, and packaged as a tar.gz file.

The labels are stored as integer indices. They are mapped to sound classes via class_labels_indices.csv. The first line defines the column names: index,mid,display_name. Subsequent lines describe the mapping for each class. For example: 0,/m/09x0r,“Speech”, which means that “labels” with value 0 indicate segments labeled with “Speech”.
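The mapping can be loaded into a dictionary keyed by the integer index. This is a sketch only: the function name is hypothetical, and the sample reuses the single example row given above, written with plain ASCII quotes.

```python
import csv
import io

# The header line and the one example row described above.
SAMPLE_CLASS_MAP = 'index,mid,display_name\n0,/m/09x0r,"Speech"\n'


def load_class_map(lines):
    """Return {index: (mid, display_name)} from class_labels_indices.csv lines."""
    reader = csv.DictReader(lines)  # column names come from the first line
    return {
        int(row["index"]): (row["mid"], row["display_name"])
        for row in reader
    }


class_map = load_class_map(io.StringIO(SAMPLE_CLASS_MAP))
mid, display_name = class_map[0]
```

With the real file, `load_class_map(open("class_labels_indices.csv"))` would give a lookup from any stored label index to its ontology ID and display name.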

Download Features

To download the features, you have the following options:

Manually download the tar.gz file from one of (depending on region):

Use gsutil rsync, with the command: gsutil rsync -d -r features gs://{region}_audioset/youtube_corpus/v1/features

Where {region} is one of “eu”, “us” or “asia”. For example: gsutil rsync -d -r features gs://us_audioset/youtube_corpus/v1/features

You can use the YouTube-8M starter code to train models on the released features from both AudioSet and YouTube-8M. The code can be found in the YouTube-8M GitHub repository.


Getting Datasets

Get the datasets as described above

Make sure you have the bleeding edge version of Theano, or run

pip install --upgrade --no-deps git+git://

If you would like to work with your existing working environment, it should satisfy the following requirements:

Python 3 and dependencies
  On Mac, can be installed with brew install python3
  On Ubuntu/Debian, can be installed with apt-get install python3
  Dependencies can be installed with pip install -r
    youtube-dl==2017.9.15
    pafy==
    multiprocessing-logging==0.2.4
    sox==1.3.0
    sk-video==1.1.8
    PySoundFile==0.9.0.post1

ffmpeg
  On Mac, can be installed with brew install ffmpeg
  On Ubuntu/Debian, can be installed with apt-get install ffmpeg

sox
  On Mac, can be installed with brew install sox
  On Ubuntu/Debian, can be installed with apt-get install sox

Clone audiosetdl, which provides modules and scripts for downloading Google’s AudioSet dataset, a dataset of ~2.1 million annotated segments from YouTube videos.

As a single script

This can be run as a batch of SLURM jobs



The initial AudioSet release included 128-dimensional embeddings of each AudioSet segment produced from a VGG-like audio classification model that was trained on a large YouTube dataset (a preliminary version of what later became YouTube-8M).

Google provides a TensorFlow definition of this model, which they call VGGish, as well as supporting code to extract input features for the model from audio waveforms and to post-process the model embedding output into the same format as the released embedding features.

Installation

VGGish depends on the following Python packages:

numpy
scipy
resampy
tensorflow
six

These are all easily installable via, e.g., pip install numpy (as in the example command sequence below).

Any reasonably recent version of these packages should work. TensorFlow should be at least version 1.0. We have tested with Python 2.7.6 and 3.4.3 on an Ubuntu-like system with NumPy v1.13.1, SciPy v0.19.1, resampy v0.1.5, TensorFlow v1.2.1, and Six v1.10.0.

VGGish also requires downloading two data files:

VGGish model checkpoint, in TensorFlow checkpoint format.

Embedding PCA parameters, in NumPy compressed archive format.

After downloading these files into the same directory as this README, the installation can be tested by running python which runs a known signal through the model and checks the output.


VGG16 pretrained model.

How to run the project:

Cases where videos cannot be downloaded