Alibaba cutting NVIDIA GPU use for AI models by 82%
Alibaba has introduced the Aegaeon computing pooling solution

Oct 18, 2025, 04:10 pm

What's the story

Alibaba Group Holding has unveiled Aegaeon, a computing pooling solution that cuts the number of NVIDIA graphics processing units (GPUs) its artificial intelligence (AI) models require by 82%. The system was beta-tested in Alibaba Cloud's model marketplace for over three months, during which it reduced the number of NVIDIA GPUs needed from 1,192 to 213.
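
The 82% figure follows directly from the two reported GPU counts; the Python snippet below is just a sanity check of that arithmetic:

```python
# Reported GPU counts from Aegaeon's three-month beta in Alibaba Cloud's
# model marketplace.
gpus_before = 1192
gpus_after = 213

reduction = (gpus_before - gpus_after) / gpus_before
print(f"GPU reduction: {reduction:.1%}")  # GPU reduction: 82.1%
```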

Innovation details

Aegaeon can serve dozens of models simultaneously

The research paper on Aegaeon was presented at the 31st Symposium on Operating Systems Principles (SOSP) in Seoul, South Korea. The study highlights that Alibaba Cloud's system can serve dozens of models with up to 72 billion parameters each. This is a major leap forward in managing and optimizing resources for AI workloads, especially given the high demand for NVIDIA GPUs in this field.
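
A back-of-the-envelope estimate shows why models of this size make per-model GPU allocation so expensive. The calculation below is an illustration, not a figure from the paper; it assumes 16-bit weights and ignores KV-cache and activation memory:

```python
# Rough weight-memory estimate for a 72-billion-parameter model, assuming
# 16-bit (2-byte) weights; KV cache and activations are ignored.
params = 72e9
bytes_per_param = 2  # FP16/BF16

weight_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weight_gb:.0f} GB")  # Weights alone: ~144 GB

# Data-center GPUs typically carry 80-96 GB of memory, so one such model
# already spans several GPUs; keeping rarely used models resident on
# dedicated hardware is what drives up serving costs.
```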

Cost efficiency

System improves efficiency by pooling GPU power

The researchers from Peking University and Alibaba Cloud, including Alibaba Cloud's chief technology officer Zhou Jingren, highlight Aegaeon's role in tackling the high cost of serving concurrent large language model (LLM) workloads. The system improves efficiency by pooling GPU power: it lets one GPU serve multiple models at once, reducing the resource-allocation inefficiencies that plague cloud service providers such as Alibaba Cloud and ByteDance's Volcano Engine.
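
The article does not describe Aegaeon's internal scheduler, so the sketch below only illustrates the general pooling idea in a simplified form: a shared request queue feeds a small set of GPU workers, and each worker loads whichever model the next request needs instead of being dedicated to one model. All names (PooledGPUWorker, load_model) are hypothetical, and a toy function stands in for real inference:

```python
import queue
import threading

# Hypothetical sketch of GPU pooling (not Aegaeon's actual design): one
# worker per GPU drains a shared request queue and swaps in whichever
# model the next request needs, rather than pinning one model per GPU.

class PooledGPUWorker(threading.Thread):
    def __init__(self, gpu_id, requests, load_model):
        super().__init__(daemon=True)
        self.gpu_id = gpu_id
        self.requests = requests      # shared queue of (model_name, prompt, reply)
        self.load_model = load_model  # callable: model_name -> inference function
        self.resident_name = None     # model currently loaded on this "GPU"
        self.resident = None

    def run(self):
        while True:
            model_name, prompt, reply = self.requests.get()
            if model_name != self.resident_name:
                # Swap models only when the next request needs a different one;
                # minimizing these costly swaps is what a real scheduler optimizes.
                self.resident = self.load_model(model_name)
                self.resident_name = model_name
            reply.put(self.resident(prompt))

# Usage: many models share two pooled "GPUs" instead of one GPU per model.
requests = queue.Queue()
workers = [PooledGPUWorker(i, requests,
                           load_model=lambda name: (lambda p: f"{name}: {p}"))
           for i in range(2)]
for w in workers:
    w.start()

reply = queue.Queue()
requests.put(("qwen-7b", "hello", reply))
print(reply.get())  # qwen-7b: hello
```

In a real serving stack the expensive step is the model swap, so a production scheduler's job is to batch and order requests to keep swaps rare; the toy queue above makes no such effort.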

Model demand

Aegaeon addresses resource inefficiency issue

The researchers also noted that inference demand is highly skewed: a handful of models, such as Alibaba's Qwen and DeepSeek's models, attract most requests, while the long tail of less popular models ties up hardware. In Alibaba Cloud's marketplace, 17.7% of GPUs were allocated to serve only 1.35% of requests. Aegaeon addresses this inefficiency by optimizing GPU usage across different models and their respective workloads.
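
The two percentages make the imbalance easy to quantify: the long tail's share of GPUs is roughly 13 times its share of traffic, which is exactly the gap a pooling system aims to close. A quick check:

```python
# Skew reported for Alibaba Cloud's marketplace: cold models tie up far
# more GPU capacity than their traffic share justifies.
gpu_share = 0.177       # fraction of GPUs allocated to unpopular models
request_share = 0.0135  # fraction of requests those models actually serve

overallocation = gpu_share / request_share
print(f"GPU share is ~{overallocation:.0f}x the traffic share")  # ~13x
```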