CALM is a multimodal learning framework that learns a joint representation of language and music. It builds on the CLIP model and is trained on a large-scale dataset of music tracks and associated tags. Find out more at https://github.com/ALM-LAB/CALM
The project started as an entry in the 1st Sound of AI Hackathon, where it was named the most marketable solution by the hackathon jury.
The aim of the project is to learn a joint representation of language and music that enables multimodal tasks such as text-to-music retrieval and zero-shot music classification.
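To make this concrete, the sketch below shows how a joint embedding space supports zero-shot classification: a track is assigned the tag whose text embedding is closest, by cosine similarity, to its audio embedding. The function and the random placeholder embeddings are illustrative assumptions, not the actual CALM API; the real encoders and checkpoints live in the repository.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_embedding: torch.Tensor,
                       tag_embeddings: torch.Tensor,
                       tags: list[str]) -> str:
    """Pick the tag whose text embedding is closest to the audio embedding."""
    audio = F.normalize(audio_embedding, dim=-1)   # (d,)
    text = F.normalize(tag_embeddings, dim=-1)     # (n_tags, d)
    similarity = text @ audio                      # cosine similarity per tag
    return tags[similarity.argmax().item()]

# Placeholder embeddings; in practice these come from the text and audio encoders.
tags = ["rock", "jazz", "classical"]
audio_emb = torch.randn(512)
tag_embs = torch.randn(len(tags), 512)
print(zero_shot_classify(audio_emb, tag_embs, tags))
```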
The framework leverages the contrastive learning paradigm, in the style of CLIP, to align the language and music modalities in a shared embedding space.
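As a rough illustration of this training objective, the snippet below implements a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired audio and text embeddings, where matching pairs sit on the diagonal of the similarity matrix. The function name, temperature value, and embedding size are assumptions for the sketch, not the exact loss code used in CALM.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching (audio, text) pairs are the positives."""
    audio = F.normalize(audio_emb, dim=-1)    # (batch, d)
    text = F.normalize(text_emb, dim=-1)      # (batch, d)
    logits = audio @ text.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(audio.size(0), device=audio.device)
    loss_a2t = F.cross_entropy(logits, targets)      # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)  # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Placeholder batch of 8 paired embeddings.
audio = torch.randn(8, 512)
text = torch.randn(8, 512)
print(clip_style_contrastive_loss(audio, text).item())
```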
The framework is composed of two main components: