
Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data


Recent developments have shown that reinforcement learning (RL) can significantly improve the reasoning abilities of large language models (LLMs). Building on this progress, the study aims to improve Audio LLMs, models that process audio and text to perform tasks like question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models, featuring multiple-choice questions on sounds, speech, and music, some of which require external knowledge. A prior approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. Additionally, they introduced a method to automatically generate audio question-answering data, leading to even better results.

Compared to methods like SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors' approach is simpler, relying solely on RL without explicit reasoning steps. They also ran experiments with text-only inputs to investigate the role GRPO plays in the performance gains. Surprisingly, fine-tuning the models on text alone yielded nearly the same improvements as training with audio and text. This finding suggests that GRPO primarily enhances the model's reasoning ability through text, contributing significantly to its improved performance on audio QA tasks; the sketch below makes the ablation concrete.
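
The following is a minimal, hypothetical Python illustration of the text-only ablation, under stated assumptions: the prompt wording and input dictionaries are placeholders, not the authors' released code. The point is that only the presence of the audio clip differs between the two conditions, while the GRPO recipe itself stays unchanged.

# Hypothetical illustration of the two fine-tuning conditions compared in
# the ablation. Only the presence of the audio clip differs; the GRPO
# training loop, reward, and answer format are identical in both cases.

QUESTION = "Which instrument is playing?"
CHOICES = ["A. Violin", "B. Piano", "C. Drums", "D. Flute"]

def build_prompt(question: str, choices: list[str]) -> str:
    """Multiple-choice prompt in the direct answer-selection style."""
    return f"{question}\n" + "\n".join(choices) + "\nAnswer with the option's letter."

prompt = build_prompt(QUESTION, CHOICES)

# Audio + text condition: the prompt plus a reference to the waveform.
audio_text_example = {"audio": "clip.wav", "text": prompt}

# Text-only condition: the audio is simply dropped. The surprising result
# is that GRPO fine-tuning on examples like this recovers nearly all of
# the gains seen with audio.
text_only_example = {"text": prompt}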

Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduce Omni-R1, a fine-tuned version of the multimodal LLM Qwen2.5-Omni trained with the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input; fine-tuning with text-only data also led to notable performance gains. In addition, the team generated large-scale audio QA datasets using ChatGPT, further boosting accuracy. Their work highlights the significant impact of text reasoning on audio LLM performance and promises a public release of all resources.

Omni-R1 fine-tunes Qwen2.5-Omni using the GRPO reinforcement learning method with a simple prompt format that allows direct answer selection, making training memory-efficient enough for 48 GB GPUs. GRPO avoids a value function by comparing grouped outputs using a reward based solely on answer correctness. To expand the training data, the researchers used audio captions from Qwen2-Audio and prompted ChatGPT to generate new question-answer pairs. This process produced two datasets, AVQA-GPT and VGGS-GPT, covering 40k and 182k audios, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark.
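
The correctness-only reward and group-relative update at the heart of GRPO are easy to sketch. The following is a minimal Python illustration, not the authors' implementation; the function names and the answer-extraction rule are assumptions. Each prompt receives a group of sampled completions, each completion earns a reward of 1.0 if its chosen option matches the ground truth and 0.0 otherwise, and advantages come from normalizing rewards within the group, with no learned value function.

from statistics import mean, pstdev

def correctness_reward(completion: str, gold_choice: str) -> float:
    """Binary reward: 1.0 if the model's selected option matches the
    ground-truth answer, else 0.0. The single-letter extraction rule
    here is an assumption for illustration."""
    stripped = completion.strip()
    predicted = stripped.split()[0].rstrip(".:)") if stripped else ""
    return 1.0 if predicted.upper() == gold_choice.upper() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled completion's reward
    by the mean and std of its own group, avoiding a value function."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:  # all completions equally good or bad: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one multiple-choice question (gold = "B").
completions = ["B", "A", "B.", "C"]
rewards = [correctness_reward(c, "B") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # positive for correct answers, negative for incorrect

In a full training loop, these advantages would weight the policy-gradient update for each sampled completion, which is what lets GRPO skip the separate value network that PPO-style methods require.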

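The data-generation pipeline can be sketched in the same spirit. Below is a hedged illustration of the caption-to-QA step using the openai Python client; the model name and prompt wording are placeholders, since the paper's exact prompts are not reproduced here.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt: the paper's actual instruction text is not public here.
PROMPT_TEMPLATE = (
    "Here is a caption describing an audio clip:\n\"{caption}\"\n\n"
    "Write one multiple-choice question about the audio with four options "
    "(A-D) and mark the correct answer on the last line as 'Answer: <letter>'."
)

def caption_to_qa(caption: str) -> str:
    """Turn one audio caption (e.g., produced by Qwen2-Audio) into a
    machine-generated multiple-choice QA pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; the article only says "ChatGPT"
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(caption=caption)}],
    )
    return response.choices[0].message.content

# Applied over roughly 40k AVQA clips and 182k VGGSound clips, this step
# yields the AVQA-GPT and VGGS-GPT training sets described above.
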
The researchers fine-tuned Qwen2.5-Omni with GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. The results show notable performance gains, with the best average score of 71.3% on the MMAU Test-mini coming from VGGS-GPT. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting robust text-based understanding. GRPO fine-tuning improved Qwen2-Audio more substantially, owing to its weaker initial text reasoning. Surprisingly, fine-tuning without audio still boosted performance, and text-only datasets like ARC-Easy yielded comparable results. The improvements stem primarily from enhanced text reasoning, though audio-based fine-tuning remains slightly better for optimal performance.

In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni with the GRPO reinforcement learning method for improved audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further boosting model accuracy. Experiments show that GRPO mainly enhances text-based reasoning, contributing significantly to performance. Surprisingly, fine-tuning with text alone (no audio) improved audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
