
Google's new AI model runs on phones, even without an internet connection
What's the story
Google has unveiled its latest artificial intelligence (AI) model, Gemma 3n. The new addition to the Gemma family of open AI models was previewed last month during Google I/O. Unlike Gemini, which is a closed proprietary system, Gemma is designed for developers to download and modify to suit their needs. The latest version can natively handle image, audio, and video inputs to generate text outputs.
Model features
Runs on devices with as little as 2GB memory
Gemma 3n can run on devices with as little as 2GB of memory, making it highly accessible. It is said to be better at tasks like coding and reasoning than its predecessors. The model comes in two sizes based on effective parameters: E2B and E4B. While their raw parameter counts are 5B and 8B respectively, architectural innovations allow them to run with a memory footprint comparable to traditional 2B and 4B models.
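Since Gemma is open, developers can pull the weights and run them locally. Here is a minimal sketch of what that looks like, assuming the Hugging Face model id google/gemma-3n-E2B-it (the larger variant would be google/gemma-3n-E4B-it) and the standard transformers text-generation pipeline; the weights sit behind Google's Gemma license, which must be accepted on huggingface.co first:

    # Minimal sketch: load the smaller E2B variant via Hugging Face transformers.
    # Assumes the model id "google/gemma-3n-E2B-it"; swap in
    # "google/gemma-3n-E4B-it" for the larger model.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="google/gemma-3n-E2B-it",
        device_map="auto",  # place weights on GPU/CPU automatically
    )

    messages = [{"role": "user", "content": "Explain recursion in two sentences."}]
    out = generator(messages, max_new_tokens=64)
    print(out[0]["generated_text"])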
Tech advancements
Supports text understanding in 140 languages
At its core, Gemma 3n features novel components like the MatFormer architecture for compute flexibility, Per-Layer Embeddings (PLE) for memory efficiency, and new audio and MobileNet-V5-based vision encoders optimized for on-device use cases. The model supports text understanding in 140 languages and multimodal understanding of 35 languages, and it also shows improvements across math, coding, and reasoning tasks.
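To make the multimodal claim concrete, here is a rough sketch of image-plus-text input, assuming transformers' image-text-to-text pipeline task and its chat-style message format apply to Gemma 3n; both the task name and the placeholder image URL are assumptions, not details confirmed by the article:

    # Hypothetical multimodal sketch: image + text in, text out.
    # Assumes the "image-text-to-text" pipeline task supports Gemma 3n.
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it")

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/receipt.jpg"},  # placeholder URL
            {"type": "text", "text": "What is the total on this receipt?"},
        ],
    }]
    print(pipe(text=messages, max_new_tokens=64)[0]["generated_text"])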
Efficiency boost
Efficiency of Gemma 3n comes from a new architecture
The efficiency of Gemma 3n comes from a new architecture called MatFormer, short for Matryoshka Transformer. It nests a smaller, fully functional sub-model inside a larger one, allowing a single model to run at different sizes for different tasks. The larger E4B model is the first model with under 10B parameters to cross an LMArena score of 1,300, showcasing its advanced capabilities.
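Google's implementation is far more involved, but the Matryoshka idea itself is simple: the smaller model's weights are a prefix slice of the larger model's, so one set of weights serves both sizes. A toy PyTorch illustration of that idea (not Gemma 3n's actual code):

    import torch
    import torch.nn as nn

    class MatryoshkaFFN(nn.Module):
        """Toy MatFormer-style feed-forward layer: smaller sub-models
        use a prefix slice of the full layer's hidden units."""
        def __init__(self, d_model=512, d_hidden=2048):
            super().__init__()
            self.up = nn.Linear(d_model, d_hidden)
            self.down = nn.Linear(d_hidden, d_model)

        def forward(self, x, width=None):
            # width: how many hidden units to use (None = full model).
            w = width or self.up.out_features
            h = torch.relu(x @ self.up.weight[:w].T + self.up.bias[:w])
            return h @ self.down.weight[:, :w].T + self.down.bias

    ffn = MatryoshkaFFN()
    x = torch.randn(1, 512)
    full = ffn(x)               # "E4B-like": all 2048 hidden units
    small = ffn(x, width=512)   # "E2B-like": same weights, quarter width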
Advanced features
Audio and vision capabilities of the model
Gemma 3n's audio capabilities include on-device speech-to-text and translation, using an encoder that can process speech in fine detail. Vision is handled by a new encoder, MobileNet-V5, which is much faster and more efficient than its predecessor. It can process video at up to 60 frames per second on a Google Pixel device.
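As a hedged sketch of the speech-to-text path, the snippet below assumes the Gemma 3n processor in transformers accepts audio entries in chat messages; the message format and the file name meeting_clip.wav are assumptions for illustration:

    # Hedged sketch of on-model speech-to-text. Assumes the processor
    # accepts {"type": "audio", ...} entries in chat messages.
    from transformers import AutoProcessor, AutoModelForImageTextToText

    model_id = "google/gemma-3n-E4B-it"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    messages = [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": "meeting_clip.wav"},  # hypothetical local file
            {"type": "text", "text": "Transcribe this recording."},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                           skip_special_tokens=True))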