# CALM - Contrastive Alignment of Language and Music
CALM is a multimodal learning framework that learns a joint representation of language and music. It builds on the CLIP model and is trained on a large-scale dataset of music tracks and their tags. Find out more at https://github.com/ALM-LAB/CALM
## Why this project?
The project started as a hackathon project during the 1st Sound of AI Hackathon, where the jury nominated it as the most marketable solution.
The aim of the project is to learn a joint representation of language and music to enable multimodal learning tasks (e.g., text-to-music retrieval and zero-shot music classification).
## How does it work?
CALM aligns the language and music modalities with a contrastive objective, following the CLIP training recipe: matching (music, caption) pairs are pulled together in a shared embedding space, while mismatched pairs are pushed apart.
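The contrastive objective can be sketched as a CLIP-style symmetric cross-entropy over a batch of paired embeddings. This is an illustrative NumPy sketch, not code from the repository; the function name and `temperature` value are assumptions for the example.

```python
import numpy as np

def symmetric_contrastive_loss(music_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (illustrative sketch).

    music_emb, text_emb: (batch, dim) arrays where row i of each
    matrix forms a matching (music, caption) pair.
    """
    # L2-normalise so dot products are cosine similarities
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = m @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # matching pairs sit on the diagonal

    def cross_entropy(logits, labels):
        # numerically stable log-softmax over each row
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # average the music-to-text and text-to-music directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimising this loss increases the similarity of matching pairs relative to every mismatched pair in the batch, which is what makes the shared space usable for retrieval afterwards.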
The framework is composed of two main components:
- Music embedding model: a Vision Transformer (ViT) that encodes the mel-spectrogram of a music clip into a fixed-length vector.
- Language embedding model: a BERT model that encodes the captions (built from the tags) into a fixed-length vector.
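Once both encoders map into the same space, zero-shot music classification reduces to a nearest-neighbour lookup: embed one caption per candidate tag and pick the one most similar to the clip. A minimal sketch, assuming precomputed embeddings (the function name and inputs are hypothetical, not the repository's API):

```python
import numpy as np

def zero_shot_classify(music_emb, label_embs):
    """Return the index of the best-matching label for one music clip.

    music_emb: (dim,) embedding of a clip from the music encoder.
    label_embs: (n_labels, dim), one text embedding per candidate tag,
    e.g. captions like "a rock song", "a jazz song".
    """
    # cosine similarity = dot product of L2-normalised vectors
    m = music_emb / np.linalg.norm(music_emb)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = l @ m
    return int(np.argmax(sims))
```

Text-to-music retrieval works the same way with the roles swapped: embed the query caption once and rank all clip embeddings by cosine similarity.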