# CALM - Contrastive Alignment of Language and Music
CALM is a multimodal learning framework that learns a joint representation of language and music. It builds on the CLIP model and is trained on a large-scale dataset of music tracks and their tags. Find out more at https://github.com/ALM-LAB/CALM
## Why this project?
The project started as a hackathon project during the 1st Sound of AI Hackathon, where the jury nominated it as the most marketable solution.
The aim of the project is to learn a joint representation of language and music to enable multimodal learning tasks (e.g., text-to-music retrieval and zero-shot music classification).
## How does it work?
CALM aligns the language and music modalities with a contrastive objective, following the CLIP training recipe: matching (music, caption) pairs are pulled together in a shared embedding space, while mismatched pairs are pushed apart.
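The contrastive objective can be sketched as a CLIP-style symmetric cross-entropy over a batch of paired embeddings. This is an illustrative NumPy sketch, not code from the repository; the function name and `temperature` value are assumptions for the example.

```python
import numpy as np

def symmetric_contrastive_loss(music_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (illustrative sketch).

    music_emb, text_emb: (batch, dim) arrays where row i of each
    matrix forms a matching (music, caption) pair.
    """
    # L2-normalise so dot products are cosine similarities
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = m @ t.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(len(logits))       # matching pairs sit on the diagonal

    def cross_entropy(logits, labels):
        # numerically stable log-softmax over each row
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # average the music-to-text and text-to-music directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimising this loss increases the similarity of matching pairs relative to every mismatched pair in the batch, which is what makes the shared space usable for retrieval afterwards.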
The framework is composed of two main components:
- Music embedding model: a Vision Transformer (ViT) that encodes the mel-spectrogram of a music clip into a fixed-length vector.
- Language embedding model: a BERT model that encodes the captions (built from the tags) into a fixed-length vector.
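Once both encoders map into the same space, zero-shot music classification reduces to a nearest-neighbour lookup: embed one caption per candidate tag and pick the one most similar to the clip. A minimal sketch, assuming precomputed embeddings (the function name and inputs are hypothetical, not the repository's API):

```python
import numpy as np

def zero_shot_classify(music_emb, label_embs):
    """Return the index of the best-matching label for one music clip.

    music_emb: (dim,) embedding of a clip from the music encoder.
    label_embs: (n_labels, dim), one text embedding per candidate tag,
    e.g. captions like "a rock song", "a jazz song".
    """
    # cosine similarity = dot product of L2-normalised vectors
    m = music_emb / np.linalg.norm(music_emb)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = l @ m
    return int(np.argmax(sims))
```

Text-to-music retrieval works the same way with the roles swapped: embed the query caption once and rank all clip embeddings by cosine similarity.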