datasets

CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following

Abstract: Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional music information retrieval (MIR) annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of MIR tasks.
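
To illustrate what recasting an MIR annotation as an instruction-following example might look like, here is a minimal sketch; the field names, prompt wording, and the key-detection example are illustrative assumptions, not CMI-Bench's actual schema.

```python
# Minimal sketch: recasting a traditional MIR annotation (here, key detection)
# as an instruction-following example. Field names and prompt wording are
# illustrative assumptions, not the actual CMI-Bench schema.

def to_instruction_example(audio_path: str, task: str, label: str) -> dict:
    """Wrap a raw MIR annotation as an (audio, instruction, output) triple."""
    prompts = {
        "key_detection": "Listen to the music clip and identify its key.",
        "genre_classification": "Listen to the music clip and name its genre.",
    }
    return {
        "audio": audio_path,            # path to the audio clip
        "instruction": prompts[task],   # natural-language task description
        "output": label,                # the original MIR annotation as text
    }

example = to_instruction_example("clip_001.wav", "key_detection", "E minor")
print(example)
```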

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Abstract: We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across a massive set of multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends coverage to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech.

Audio-FLAN: A Preliminary Release

Abstract: Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation.

SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines

Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.

OmniBench: Towards the Future of Universal Omni-Language Models

Abstract: Recent advancements in multimodal large language models (MLLMs) have focused on integrating multiple modalities, yet their ability to simultaneously process and reason across different inputs remains underexplored. We introduce OmniBench, a novel benchmark designed to evaluate models’ ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define language models capable of such tri-modal processing as omni-language models (OLMs). OmniBench features high-quality human annotations that require integrated understanding across all modalities.

Foundation Models for Music: A Survey

Abstract: In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations remain underexplored in FM development.

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Abstract: Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models such as GPT, Gemini, and Claude are gated behind proprietary interfaces without disclosure of their training details. Recently, many institutions have open-sourced several strong LLMs, such as LLaMA-3, that are comparable to existing closed-source models. However, most of these releases provide only the model weights, while details such as intermediate checkpoints, the pre-training corpus, and training code remain undisclosed.

Bridging Music & Text with Pre-trained Models for Music Captioning and QA

07/2023 – present. Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London.
Developed the Music Instruct (MI) query-response dataset by prompting GPT-4 with music captions and carefully designed prompts. Achieved cutting-edge question-answering performance on both the MusicQA and Music Instruct datasets. Employed instruction fine-tuning on MI to attain state-of-the-art (SOTA) captioning results.
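
The caption-to-Q&A generation step can be sketched roughly as below; the prompt wording and JSON output format are illustrative assumptions rather than the prompts actually used for Music Instruct, and the snippet assumes the openai Python client (v1+) with an API key configured.

```python
# Rough sketch of turning a music caption into query-response pairs with GPT-4.
# The prompt wording and JSON output format are illustrative assumptions,
# not the actual Music Instruct prompts. Requires the openai>=1.0 client
# and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

def caption_to_qa(caption: str, n_pairs: int = 3) -> list[dict]:
    prompt = (
        f"Here is a caption describing a music clip:\n\"{caption}\"\n"
        f"Write {n_pairs} question-answer pairs about the clip that can be "
        "answered from the caption alone. Return a JSON list of objects with "
        "keys 'question' and 'answer'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

pairs = caption_to_qa("A slow acoustic guitar ballad with soft male vocals.")
```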

MARBLE: Music Audio Representation Benchmark for Universal Evaluation

01/2023 – 06/2023. Supervised by Dr Emmanouil Benetos, Centre for Digital Music, Queen Mary University of London.
Designed the downstream tasks, datasets, and evaluation metrics, and compiled state-of-the-art results. Implemented the mir_eval metrics with torchmetrics and developed utilities for sequential tasks. Established a fair, reproducible, and universal music information retrieval benchmark for future work. MARBLE website.
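
As an illustration of wrapping a mir_eval metric in the torchmetrics interface, here is a minimal sketch for the key-detection weighted score; the choice of metric and the string-label interface are illustrative assumptions, not MARBLE's actual implementation.

```python
# Minimal sketch: exposing a mir_eval metric through the torchmetrics Metric
# interface. The key-detection weighted score is chosen only as an example;
# this is not MARBLE's actual implementation.
import torch
import mir_eval
from torchmetrics import Metric

class KeyWeightedScore(Metric):
    """Averages mir_eval.key.weighted_score over a set of predictions."""

    def __init__(self):
        super().__init__()
        self.add_state("score_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("count", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, reference_keys: list[str], estimated_keys: list[str]) -> None:
        for ref, est in zip(reference_keys, estimated_keys):
            # mir_eval expects key strings such as "C major" or "A minor"
            score = mir_eval.key.weighted_score(ref, est)
            self.score_sum += torch.tensor(score)
            self.count += 1

    def compute(self) -> torch.Tensor:
        return self.score_sum / self.count

metric = KeyWeightedScore()
metric.update(["C major", "E minor"], ["C major", "G major"])
print(metric.compute())
```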

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains relatively unexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with the frozen LLaMA language model, bridging the gap between music audio and textual contexts.
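
The single-projection-layer alignment described above can be sketched roughly as follows; the hidden sizes (1024 for MERT, 4096 for LLaMA-7B) are assumptions for illustration, and the frozen audio and language encoders themselves are stubbed out.

```python
# Rough sketch of MusiLingo-style alignment: a single linear projection maps
# frozen MERT audio features into the frozen LLaMA embedding space. Hidden
# sizes (1024 for MERT, 4096 for LLaMA-7B) are assumptions for illustration,
# and the frozen encoders are not shown.
import torch
import torch.nn as nn

class MusicToTextProjection(nn.Module):
    def __init__(self, mert_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        # The only trainable component; both encoders stay frozen.
        self.proj = nn.Linear(mert_dim, llama_dim)

    def forward(self, mert_features: torch.Tensor) -> torch.Tensor:
        # mert_features: (batch, time, mert_dim) frame-level MERT embeddings
        # returns:       (batch, time, llama_dim) vectors prepended to the
        #                text embeddings fed into the frozen LLaMA model
        return self.proj(mert_features)

proj = MusicToTextProjection()
dummy = torch.randn(2, 100, 1024)   # stand-in for MERT outputs
print(proj(dummy).shape)            # torch.Size([2, 100, 4096])
```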