Tongyi Qianwen Unveils Qwen2.5-Omni, Leading Multimodal AI Model
Tongyi Qianwen has introduced Qwen2.5-Omni, the latest model in the Qwen family and its end-to-end multimodal flagship. The model accepts text, image, audio, and video inputs and responds in a real-time streaming fashion, producing both text and natural synthesized speech. This positions Qwen2.5-Omni at the forefront of multimodal AI technology, offering a comprehensive solution for applications that require sophisticated interaction and response generation across multiple modalities.
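For readers who want to try the model, the sketch below shows a minimal multimodal call under assumed naming: the classes Qwen2_5OmniModel and Qwen2_5OmniProcessor and the helper process_mm_info follow the examples Qwen published at release time, but they are assumptions here and may differ in newer library versions.

```python
# Minimal usage sketch (assumed API): ask Qwen2.5-Omni about a video and save
# its spoken reply. Class and helper names mirror Qwen's release-time examples
# and may have changed in later versions of transformers / qwen-omni-utils.
import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": "example.mp4"},
        {"type": "text", "text": "Describe what is happening in this clip."},
    ]},
]

# Build mixed-modality inputs from the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(text=text, audios=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# The model returns text token ids and a synthesized speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```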
The model features a novel Thinker-Talker architecture, an end-to-end multimodal design built for cross-modal understanding of text, images, audio, and video. It also introduces a new position-encoding technique, TMRoPE (Time-aligned Multimodal RoPE), which synchronizes video and audio inputs precisely by aligning them along the time axis. The architecture supports fully real-time interaction, with chunked input and immediate output, and generates natural, stable speech that surpasses many existing streaming and non-streaming alternatives.
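To make the time-axis alignment concrete, here is a small, self-contained sketch of the underlying idea rather than the actual TMRoPE implementation: frames from different modalities that cover the same wall-clock interval map to the same temporal position id, so the rotary embeddings keep audio and video in step when their chunks are interleaved. The 40 ms granularity and the frame rates are illustrative assumptions.

```python
# Toy illustration of time-aligned temporal position ids (not TMRoPE itself):
# each frame's start time is quantized to a shared step, so audio and video
# frames covering the same moment receive the same temporal id.

def temporal_ids(num_frames, frame_ms, step_ms=40):
    """Map each frame's start time (in ms) to a temporal id of `step_ms` ms."""
    return [(i * frame_ms) // step_ms for i in range(num_frames)]

video_ids = temporal_ids(num_frames=4, frame_ms=500)   # 2 s of video at 2 fps
audio_ids = temporal_ids(num_frames=50, frame_ms=40)   # 2 s of audio at 25 frames/s

# Frames that share an id fall in the same time window, so interleaving the
# two streams in chunks keeps them synchronized for the attention layers.
print(video_ids)        # [0, 12, 25, 37]
print(audio_ids[:6])    # [0, 1, 2, 3, 4, 5]
```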
Qwen2.5-Omni outperforms similarly sized single-modal models in benchmark tests: it surpasses Qwen2-Audio on audio tasks and matches Qwen2.5-VL-7B on vision-language tasks. The model also excels at end-to-end voice instruction following, achieving results comparable to text input on benchmarks such as MMLU for general knowledge and GSM8K for mathematical reasoning.
The model's architecture is a dual-core Thinker-Talker design. The Thinker module acts as the brain: it processes text, audio, and video inputs and produces high-level semantic representations along with the corresponding text. The Talker module works like a vocal organ, receiving the Thinker's semantic representations and text in real time and smoothly synthesizing discrete speech units. The Thinker is built on a Transformer decoder with integrated audio and image encoders for feature extraction, while the Talker uses a dual-track autoregressive Transformer decoder. During both training and inference, the Talker receives high-dimensional representations directly from the Thinker and shares its full historical context, forming a unified end-to-end model.
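As a rough picture of how the two modules fit together, here is a toy PyTorch sketch, not the released architecture: the Thinker is a causally masked Transformer that yields text logits plus hidden states, and the Talker autoregressively predicts discrete speech units while attending to those hidden states, so both modules share the same context end to end. The module names, layer sizes, and the use of plain cross-attention in place of the dual-track design are simplifying assumptions.

```python
# Toy sketch of the Thinker-Talker split (illustrative only, not the released
# model): the Thinker produces text logits and high-level hidden states; the
# Talker predicts discrete speech units while attending to those states.
import torch
import torch.nn as nn

def causal_mask(sz: int) -> torch.Tensor:
    # Upper-triangular -inf mask so each position attends only to earlier ones.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

class Thinker(nn.Module):
    def __init__(self, vocab=1000, dim=512, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)  # causal mask makes it decoder-style
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)                                  # in the real model, audio/image
        h = self.backbone(x, mask=causal_mask(tokens.size(1)))  # encoder features are merged here
        return self.lm_head(h), h                               # text logits + semantic states

class Talker(nn.Module):
    def __init__(self, speech_units=256, dim=512, layers=2):
        super().__init__()
        self.unit_embed = nn.Embedding(speech_units, dim)
        block = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, layers)
        self.unit_head = nn.Linear(dim, speech_units)

    def forward(self, unit_tokens, thinker_states):
        # Cross-attention to the Thinker's hidden states: the Talker conditions
        # on the full shared context, not just on the decoded text.
        q = self.unit_embed(unit_tokens)
        h = self.decoder(q, thinker_states, tgt_mask=causal_mask(unit_tokens.size(1)))
        return self.unit_head(h)                                # logits over discrete speech units

# Toy forward pass: one 16-token prompt, 8 speech units generated so far.
thinker, talker = Thinker(), Talker()
text_logits, states = thinker(torch.randint(0, 1000, (1, 16)))
unit_logits = talker(torch.randint(0, 256, (1, 8)), states)
print(text_logits.shape, unit_logits.shape)   # (1, 16, 1000) and (1, 8, 256)
```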
Across image, audio, and audiovisual tasks, Qwen2.5-Omni outperforms similarly sized single-modal models such as Qwen2.5-VL-7B and Qwen2-Audio, as well as closed-source models such as Gemini-1.5-Pro. On the multimodal benchmark OmniBench, it achieves state-of-the-art results. It also excels in single-modal tasks, including speech recognition (Common Voice), translation (CoVoST2), audio understanding (MMAU), image reasoning (MMMU, MMStar), video understanding (MVBench), and speech generation (Seed-tts-eval and subjective naturalness evaluations).
Tongyi Qianwen is eager to receive feedback and see the innovative applications developed using Qwen2.5-Omni. In the near future, the company plans to enhance the model's ability to follow voice commands and improve its audiovisual collaborative understanding capabilities. The ultimate goal is to continuously expand the boundaries of multimodal capabilities to develop a comprehensive general-purpose model.
