07/2023 – present
Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London
Developed Music Instruct (MI) query-response dataset based on captions & well-designed prompts to GPT-4. Achieved cutting-edge performance in question answering on both MusicQA and Music Instruct datasets. Employed instruct fine-tuning techniques on MI to attain state-of-the-art (SOTA) results in captioning.
08/2022 – 05/2023
Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London
Built self-supervised learning systems, acquiring 50k+ downloading of checkpoints on Huggingface. Replaced the pseudo-tag from MFCCs to Chroma music features for harmonic information. Utilising deep features like Encodec instead of k-means for scaling up models to 1 B parameters.
Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts.
Abstract: Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges associated with modelling musical knowledge, particularly its tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaption of SSL with two distinctive speech-related models, data2vec1.
Abstract: Self-supervised learning technique is an under-explored topic for music audio due to the challenge of designing an appropriate training paradigm. We hence propose MAP-MERT, a large-scale music audio pre-trained model for general music understanding. We achieve performance that is comparable to the state-of-the-art pre-trained model Jukebox using less than 2% of parameters.
Paper link.
Presented by DMRN workshop 2022.
Abstract: The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it still remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller with less than 2% of parameters of the latter.
09/2021 – 07/2022
Research Assistant, Supervised by Prof. Richard Stern, Carnegie Mellon University
Constructed 2-layer learnable front ends in Temporal Modulation Neural Network (TMNN) that combines Mel-like data-driven front ends and temporal modulation filters. Examined the proposed front ends surpass state-of-the-art (SOTA) methods on the MagnaTagATune dataset in automatic music tagging, and they are also helpful for keyword spotting on speech commands. Analysis of the model performance among tags with different genres and instrument tags.