Multimodal

Bridging Music & Text with Pre-trained Models for Music Captioning and QA

07/2023 – present · Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London
Developed the Music Instruct (MI) query–response dataset from captions and carefully designed prompts to GPT-4.
Achieved state-of-the-art question-answering performance on both the MusicQA and Music Instruct datasets.
Applied instruction fine-tuning on MI to attain state-of-the-art (SOTA) results in captioning.

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts.
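The core architectural idea described above — a single trainable projection layer mapping frozen audio-encoder features into the LLM's embedding space — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the dimensions and names are assumptions for demonstration only.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's exact values):
MERT_DIM = 1024    # hypothetical feature size of the frozen MERT encoder
LLAMA_DIM = 4096   # hypothetical embedding size of the frozen LLaMA model

rng = np.random.default_rng(0)

# The only trainable parameters in this sketch: one linear projection.
W = rng.standard_normal((MERT_DIM, LLAMA_DIM)) * 0.02
b = np.zeros(LLAMA_DIM)

def project(audio_features: np.ndarray) -> np.ndarray:
    """Map frozen audio-encoder features of shape (T, MERT_DIM)
    into the LLM embedding space, shape (T, LLAMA_DIM)."""
    return audio_features @ W + b

# A dummy sequence of 10 audio frames standing in for MERT outputs.
frames = rng.standard_normal((10, MERT_DIM))
tokens = project(frames)
print(tokens.shape)  # (10, 4096)
```

The projected vectors would then be prepended to the text-token embeddings so the frozen LLM can attend to the music context, with only the projection's parameters updated during training.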