
Bridging Music & Text with Pre-trained Models for Music Captioning and QA

07/2023 – present, supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London
Developed the Music Instruct (MI) query-response dataset from music captions and carefully designed prompts to GPT-4.
Achieved state-of-the-art question-answering performance on both the MusicQA and Music Instruct datasets.
Applied instruction fine-tuning on MI to reach state-of-the-art (SOTA) captioning results.
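As a rough illustration of the caption-to-QA idea, the sketch below prompts GPT-4 to write question-answer pairs grounded in a single music caption, assuming the OpenAI Python client. The prompt wording, the caption_to_qa helper, and the output format are hypothetical and not the actual Music Instruct pipeline.

```python
# Hypothetical sketch: turning one music caption into query-response pairs with GPT-4.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_to_qa(caption: str, n_pairs: int = 3) -> str:
    """Ask GPT-4 for question-answer pairs answerable from the caption alone."""
    prompt = (
        f'Here is a caption describing a music clip:\n"{caption}"\n'
        f"Write {n_pairs} question-answer pairs that can be answered "
        "from this caption alone. Format each pair as 'Q: ...' / 'A: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(caption_to_qa("A slow acoustic guitar ballad with soft male vocals."))
```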

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

01/2023 – 06/2023, supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London
Designed the downstream tasks, datasets, and evaluation metrics, and compiled state-of-the-art results.
Implemented the mir_eval metrics with torchmetrics and developed utilities for sequential tasks.
Established a fair, reproducible and universal music information retrieval benchmark for future work.
MARBLE website.
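To make the mir_eval-with-torchmetrics point concrete, here is a minimal sketch of a torchmetrics Metric that accumulates a mir_eval score across a dataset. The BeatFMeasure class and the choice of beat-tracking F-measure are assumptions for illustration, not the MARBLE implementation.

```python
import numpy as np
import torch
import mir_eval
from torchmetrics import Metric

class BeatFMeasure(Metric):
    """Accumulate mir_eval beat-tracking F-measure over many tracks."""

    def __init__(self):
        super().__init__()
        # running sum of per-track scores and a track counter
        self.add_state("f_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, est_beats: np.ndarray, ref_beats: np.ndarray) -> None:
        # mir_eval expects 1-D arrays of beat times in seconds
        f = mir_eval.beat.f_measure(ref_beats, est_beats)
        self.f_sum += torch.tensor(f)
        self.count += 1

    def compute(self) -> torch.Tensor:
        # mean F-measure across all tracks seen so far
        return self.f_sum / self.count

metric = BeatFMeasure()
metric.update(np.array([0.5, 1.0, 1.5]), np.array([0.48, 1.02, 1.51]))
print(metric.compute())
```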

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts.
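The abstract describes a single projection layer that maps frozen MERT features into the frozen LLaMA embedding space. The PyTorch sketch below shows what such an adapter could look like; the class name and the dimensions (1024 for the MERT encoder, 4096 for LLaMA) are illustrative assumptions rather than the exact MusiLingo configuration.

```python
import torch
import torch.nn as nn

class AudioToLLMProjector(nn.Module):
    """Minimal sketch: project frozen music-encoder features into an LLM's embedding space."""

    def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # the single trainable component: one linear projection layer
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from the frozen MERT encoder
        # returns:     (batch, time, llm_dim), ready to be prepended to the
        #              text token embeddings fed to the frozen LLaMA model
        return self.proj(audio_feats)

projector = AudioToLLMProjector()
music_tokens = projector(torch.randn(2, 96, 1024))  # dummy MERT output
print(music_tokens.shape)                            # torch.Size([2, 96, 4096])
```

Keeping both the audio encoder and the language model frozen means only this projection is trained, which is what lets the alignment be learned cheaply.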

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the “ear” by transcribing the audio, while GPT-4 serves as the “brain,” acting as an annotator with a strong performance for contextualized output selection and correction.
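The "ear plus brain" pipeline can be sketched as follows: Whisper produces candidate transcriptions and GPT-4 selects and corrects among them. The decoding settings, prompt wording, and function name below are assumptions for illustration, not the published LyricWhiz configuration.

```python
# Illustrative sketch: Whisper as the "ear", GPT-4 as the "brain".
import whisper
from openai import OpenAI

def transcribe_lyrics(audio_path: str, n_runs: int = 3) -> str:
    # "Ear": several Whisper passes at non-zero temperature yield candidate lyrics
    ear = whisper.load_model("large")
    candidates = [
        ear.transcribe(audio_path, temperature=0.4)["text"] for _ in range(n_runs)
    ]

    # "Brain": GPT-4 picks and corrects the most plausible transcription
    brain = OpenAI()
    prompt = (
        "Candidate lyric transcriptions of the same song:\n"
        + "\n---\n".join(candidates)
        + "\nReturn the single most plausible, corrected set of lyrics."
    )
    reply = brain.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(transcribe_lyrics("song.mp3"))
```

Because neither model is fine-tuned, the whole pipeline stays training-free and zero-shot, matching the claim in the abstract.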