---
license: mit
inference: false
---

# Introduction

**Music2Vec** was accepted as a 2-page abstract in the Late Breaking Demo (LBD) session at ISMIR 2022.

It is a completely unsupervised model trained on 1,000 hours of music audio.

Our model achieves SOTA-comparable results on multiple MIR tasks even under probing settings, while remaining fine-tunable on a single 2080Ti.

Larger models trained with more data are on the way~

# Model Usage

## Huggingface Loading

```python
from transformers import Wav2Vec2Processor, Data2VecAudioModel
import torch
from torch import nn
from datasets import load_dataset

# load the demo audio and set up the processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")

# load our model weights
model = Data2VecAudioModel.from_pretrained("m-a-p/music2vec-v1")

# the audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape: there are 13 layers of representations,
# and each layer performs differently on different downstream tasks, so choose a layer empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13 layers, 292 timesteps, 768 feature_dim]

# for utterance-level classification tasks, you can simply average the representation over time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# you can even use a learnable weighted average over the 13 layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
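
Building on the snippet above, here is a minimal probing sketch: the Music2Vec backbone stays frozen (its outputs were computed under `torch.no_grad()`), and only a small head on top of the layer-wise features is trained. This is an illustration rather than the evaluation recipe from the paper; `NUM_CLASSES`, the dummy label, and the hyperparameters are placeholders for your own MIR dataset.

```python
import torch
from torch import nn

NUM_CLASSES = 10  # placeholder, e.g. the number of genre or tagging classes

# small trainable head: learnable layer weighting followed by a linear classifier
probe = nn.Sequential(
    nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1),  # weight the 13 layers
    nn.Flatten(start_dim=1),                                   # [batch, 1, 768] -> [batch, 768]
    nn.Linear(768, NUM_CLASSES),
)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# toy training step on the single demo clip: batch the [13, 768] features and use a dummy label
features = time_reduced_hidden_states.unsqueeze(0)  # [1, 13, 768], no gradient flows into the backbone
labels = torch.tensor([0])                          # placeholder label

logits = probe(features)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape)  # [1, NUM_CLASSES]
```

In a real probing run you would loop this step over batches of pre-extracted features and labels from your task's training split.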

Our model is based on the [data2vec audio model](https://huggingface.co/docs/transformers/model_doc/data2vec#transformers.Data2VecAudioModel).
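
If you want to extract features from your own music files rather than the LibriSpeech demo clip, the audio first has to match what the processor expects (16 kHz mono, since the processor above is configured for `facebook/data2vec-audio-base-960h`). The sketch below assumes `torchaudio` is installed and reuses the `processor` and `model` loaded earlier; `"my_song.wav"` is a placeholder path.

```python
import torch
import torchaudio

# load an arbitrary audio file (placeholder path) and downmix to mono
waveform, sr = torchaudio.load("my_song.wav")  # [channels, samples]
waveform = waveform.mean(dim=0)                # [samples]

# resample to the 16 kHz rate expected by the processor
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
```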

# Citation

The paper can be found at [ISMIR](https://ismir2022program.ismir.net/lbd_410.html).

```bibtex
@article{li2022map,
  title={MAP-Music2Vec: A Simple and Effective Baseline for Self-Supervised Music Audio Representation Learning},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
  journal={arXiv preprint arXiv:2212.02508},
  year={2022}
}
```