
Alibaba's new open-source AI processes multimodal inputs in real time
What's the story
Alibaba has launched a new artificial intelligence (AI) model, Qwen3-Omni. The Chinese tech giant is positioning it as the first "natively end-to-end omni-modal AI," capable of processing text, image, audio, and video inputs in one go. The move is a direct challenge to US tech giants such as OpenAI and Google. Qwen3-Omni accepts inputs in all four formats but generates outputs only as text and audio.
Model features
Qwen3-Omni can process multiple input modalities simultaneously
Unlike models that bolt speech or vision onto a text-first system, Qwen3-Omni integrates all modalities from the beginning. This lets it accept mixed inputs and return outputs while staying responsive in real time. The model is available for download, modification, and deployment, including commercial use, under an enterprise-friendly Apache 2.0 license. That makes it more accessible than proprietary models such as OpenAI's GPT-4o and Google's Gemini 2.5 Pro, which charge for usage.
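For teams that want to self-host the open weights, the snippet below is a minimal sketch of pulling a checkpoint from Hugging Face. The repository id and local directory are assumptions for illustration; check the official Qwen listing for the exact released checkpoints.

```python
# Minimal sketch: download open Qwen3-Omni weights for local deployment.
# The repo id below is an assumption -- verify it against the official
# Qwen collection on Hugging Face before running.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Qwen/Qwen3-Omni-30B-A3B-Instruct",  # assumed checkpoint name
    local_dir="./qwen3-omni",                    # where to store the weights
)
print(f"Weights saved to {local_path}")
```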
Technical specs
The model supports 119 languages in text
Qwen3-Omni uses a 'Thinker-Talker' architecture: the 'Thinker' handles reasoning and multimodal understanding, while the 'Talker' produces natural-sounding speech audio. Both rely on Mixture-of-Experts (MoE) designs for high concurrency and fast inference. The model supports 119 languages for text, 19 for speech input, and 10 for speech output. Alibaba Cloud's hosted API also comes with a free quota of one million tokens across all modalities, valid for three months after activation.
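To make the input/output flow concrete, here is a rough sketch of calling a hosted Qwen3-Omni endpoint through an OpenAI-compatible client and asking for both text and spoken output. The base URL, model name, and the `modalities`/`audio`/streaming parameters are assumptions borrowed from common OpenAI-compatible conventions; consult Alibaba Cloud Model Studio's documentation for the exact values.

```python
# Sketch only: send an image plus a text question, request text and audio back.
# Endpoint, model name, voice options, and the streaming requirement are assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # assumed Alibaba Cloud Model Studio key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

stream = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Describe what this chart shows."},
        ],
    }],
    modalities=["text", "audio"],                # request spoken output as well
    audio={"voice": "Cherry", "format": "wav"},  # assumed voice/format options
    stream=True,                                 # omni endpoints are assumed to stream
)

for chunk in stream:
    delta = chunk.choices[0].delta if chunk.choices else None
    if delta and delta.content:
        # Text part of the reply; audio arrives in a provider-specific delta field.
        print(delta.content, end="")
```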
Benchmark results
Qwen3-Omni has outperformed peers in several benchmarks
Qwen3-Omni has outperformed its peers on 22 of 36 benchmarks. It excels at text and reasoning tasks, speech and audio processing, image and vision recognition, and video understanding. Alibaba Cloud has highlighted a wide range of potential use cases for Qwen3-Omni, including multilingual transcription and translation, audio captioning, OCR (optical character recognition), music tagging, and video understanding. Developers can also customize the model, from conversation style to persona, using system prompts.
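As a hedged illustration of that system-prompt customization, the sketch below sets a transcriptionist persona and sends an audio clip through the same assumed OpenAI-compatible endpoint; the `input_audio` content type, endpoint, and model name are likewise assumptions to verify against the official documentation.

```python
# Sketch: steer persona and conversation style with a system prompt, here for
# multilingual transcription/translation. Endpoint, model name, and the
# "input_audio" content type are assumptions, not confirmed API details.
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

with open("meeting_clip.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

stream = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model name
    messages=[
        {
            "role": "system",
            "content": "You are a meticulous multilingual transcriptionist. "
                       "Transcribe the audio verbatim, then give an English translation.",
        },
        {
            "role": "user",
            "content": [
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
                {"type": "text", "text": "Please transcribe and translate this clip."},
            ],
        },
    ],
    stream=True,  # assumed streaming requirement
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```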