MIR | MIRer

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

08/2022 – 05/2023. Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London. Built self-supervised learning systems whose checkpoints have been downloaded 50k+ times on Hugging Face. Replaced the MFCC-based pseudo-labels with chroma features to capture harmonic information. Used deep features such as EnCodec instead of k-means to scale models up to 1B parameters.
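To illustrate why chroma features carry harmonic information, here is a minimal sketch of folding spectral energy into 12 pitch classes. It is not the MERT training code: the function names, the A4 = 440 Hz tuning, and the example peaks are assumptions made for illustration only.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch-class index (0 = C, 9 = A, 11 = B).

    Assumes equal temperament with A4 tuned to a4_hz.
    """
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_vector(peaks):
    """Fold (frequency_hz, energy) pairs into a normalised 12-bin chroma vector.

    Octave information is discarded, so notes an octave apart share a bin.
    """
    chroma = [0.0] * 12
    for freq_hz, energy in peaks:
        chroma[pitch_class(freq_hz)] += energy
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma


# Hypothetical spectral peaks for a C major triad (C4, E4, G4, C5).
peaks = [(261.63, 1.0), (329.63, 0.8), (392.00, 0.6), (523.25, 0.5)]
chroma = chroma_vector(peaks)
active = [PITCH_CLASSES[i] for i, c in enumerate(chroma) if c > 0]
```

Because C4 and C5 land in the same bin, the vector describes which notes sound together rather than how bright the timbre is, which is the kind of harmonic cue MFCCs tend to smooth away.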

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts.
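The single projection layer described above can be sketched as one affine map from the music encoder's embedding space to the language model's embedding space. This is a toy, pure-Python illustration: the dimensions, the random initialisation, and the names `make_projection` and `project` are assumptions, not MusiLingo's actual code, and the real MERT and LLaMA models are not loaded here.

```python
import random


def make_projection(d_in, d_out, seed=0):
    """Create a randomly initialised weight matrix (d_out x d_in) and zero bias.

    In the setup described in the abstract, this would be the only trainable
    part, since both the music encoder and the language model are frozen.
    """
    rng = random.Random(seed)
    scale = 1.0 / d_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(frames, weight, bias):
    """Apply y = W x + b to every frame embedding in a sequence."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


# Hypothetical toy sizes; real encoder and LLM widths are far larger.
D_MUSIC, D_LLM = 8, 16
music_frames = [[0.1 * (t + i) for i in range(D_MUSIC)] for t in range(5)]

weight, bias = make_projection(D_MUSIC, D_LLM)
music_tokens = project(music_frames, weight, bias)  # 5 vectors of width D_LLM
```

The projected vectors have the same width as the language model's token embeddings, so they could be placed in front of the text prompt's embeddings as a soft prefix.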

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Abstract: Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges of modelling musical knowledge, particularly the tonal and pitched characteristics of music. To address this research gap, we propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in masked language modelling (MLM) style acoustic pre-training.

On the effectiveness of speech self-supervised learning for music

Abstract: Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0 have shown promise in music modelling. Nevertheless, research exploring the effectiveness of applying speech SSL models to music recordings has been limited. We explore the music adaption of SSL with two distinctive speech-related models, data2vec1.

LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT

Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the “ear” by transcribing the audio, while GPT-4 serves as the “brain,” acting as an annotator with a strong performance for contextualized output selection and correction.

Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning

Abstract: Self-supervised learning is an under-explored topic for music audio because designing an appropriate training paradigm is challenging. We hence propose MAP-MERT, a large-scale music audio pre-trained model for general music understanding. We achieve performance comparable to the state-of-the-art pre-trained model Jukebox using less than 2% of its parameters. Presented at the DMRN workshop 2022.

Learnable Front Ends Based on Temporal Modulation for Music Tagging

Abstract: While end-to-end systems are becoming popular in auditory signal processing, including automatic music tagging, models that take raw audio as input and use no domain knowledge need a large amount of data and computational resources. Inspired by the fact that temporal modulation is regarded as an essential component in auditory perception, we introduce the Temporal Modulation Neural Network (TMNN), which combines Mel-like data-driven front ends and temporal modulation filters with a simple ResNet back end.

MAP-Music2Vec: A simple and effective baseline for self-supervised music audio representation learning

Abstract: The deep learning community has witnessed an exponentially growing interest in self-supervised learning (SSL). However, it still remains unexplored how to build a framework for learning useful representations of raw music waveforms in a self-supervised manner. In this work, we design Music2Vec, a framework exploring different SSL algorithmic components and tricks for music audio recordings. Our model achieves comparable results to the state-of-the-art (SOTA) music SSL model Jukebox, despite being significantly smaller with less than 2% of parameters of the latter.

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

09/2021 – 07/2022. Research Assistant, supervised by Prof. Richard Stern, Carnegie Mellon University. Constructed 2-layer learnable front ends for the Temporal Modulation Neural Network (TMNN), which combines Mel-like data-driven front ends and temporal modulation filters. Showed that the proposed front ends surpass state-of-the-art (SOTA) methods on the MagnaTagATune dataset in automatic music tagging and that they also help keyword spotting on speech commands. Analysed model performance across genre and instrument tags.