Google's latest AI model creates text like an image generator

By Mudit Dube

Jun 11, 2026

10:57 am

What's the story

Google DeepMind has launched a new member of its Gemma 4 open model family, called DiffusionGemma. The new model is different from the rest in the lineup as it doesn't generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel, making it faster and more efficient on local hardware such as NVIDIA DGX or a gaming GPU.

Innovative technique

How DiffusionGemma works

Unlike most AI models that are autoregressive and generate text one token at a time, DiffusionGemma uses an image generation-like approach.

It starts with random noise, akin to static, and progressively denoises it into the desired content.

The model makes multiple passes over a field of placeholder tokens, generating likely tokens that are then used to improve its estimates of the others.

Finally, it commits its token outputs in one large block, the "denoised" text canvas.
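The loop described above can be illustrated with a toy sketch. This is not Google's implementation: the `toy_model` function is a hypothetical stand-in that already knows the target text and only invents a confidence score, and the `MASK` symbol and `tokens_per_pass` setting are likewise assumptions made for illustration. What it shows is the shape of the process: every placeholder gets a proposal on each pass, the most confident proposals are committed, and committed tokens raise confidence in their neighbors.

```python
import random

MASK = "_"  # hypothetical placeholder symbol, not DiffusionGemma's real token


def toy_model(canvas, target, rng):
    """Stand-in for the real model (ASSUMPTION): it 'cheats' by reading the
    target and only makes up a confidence score for each masked slot."""
    proposals = {}
    for i, tok in enumerate(canvas):
        if tok != MASK:
            continue
        neighbours = canvas[max(0, i - 1):i] + canvas[i + 1:i + 2]
        known = sum(1 for n in neighbours if n != MASK)
        # Slots next to already-committed tokens score higher.
        proposals[i] = (target[i], known + rng.random())
    return proposals


def denoise(target, tokens_per_pass=3, seed=0):
    """Fill a fully masked canvas over several passes, committing the most
    confident proposals on each pass."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    passes = 0
    while MASK in canvas:
        proposals = toy_model(canvas, target, rng)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
        passes += 1
    return canvas, passes


words = "the quick brown fox jumps over the lazy dog".split()
final, passes = denoise(words)
print(" ".join(final), f"({passes} passes)")
```

A real diffusion language model would predict the tokens themselves rather than copy them, and could revise earlier guesses. The point here is only that nine tokens are settled in three passes instead of nine sequential steps.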

Speed boost

Only 3.8 billion parameters are activated during inference

With a total of 26 billion parameters, DiffusionGemma is fairly large among Google's open models. However, only 3.8 billion of them are activated during inference, allowing it to fit within the 18GB memory allotment of a high-end GPU.

In tests with an RTX 5090, DiffusionGemma produces around 700 tokens per second. On a single NVIDIA H100 AI accelerator, it can produce over 1,000 tokens per second, four times the rate of similarly sized autoregressive Gemma models.
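To put those figures in perspective, here is a back-of-the-envelope calculation. It treats "over 1,000" as exactly 1,000 and "around 700" as exactly 700, and the 1,000-token response length is a hypothetical value chosen for illustration, so the results are rough estimates rather than measurements.

```python
H100_TPS = 1000          # "over 1,000" tokens/s in the article, used as a floor
RTX_5090_TPS = 700       # "around 700" tokens/s in the article
SPEEDUP = 4              # stated advantage over autoregressive Gemma models
RESPONSE_TOKENS = 1000   # ASSUMPTION: hypothetical response length

# Autoregressive throughput implied by the 4x claim on the H100.
implied_ar_tps = H100_TPS / SPEEDUP


def seconds_to_generate(tokens, tokens_per_second):
    """Time needed to produce `tokens` at a steady rate."""
    return tokens / tokens_per_second


for label, tps in [
    ("DiffusionGemma on H100", H100_TPS),
    ("DiffusionGemma on RTX 5090", RTX_5090_TPS),
    ("Implied autoregressive baseline on H100", implied_ar_tps),
]:
    print(f"{label}: {seconds_to_generate(RESPONSE_TOKENS, tps):.2f} s")
```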

Task performance

The model's approach offers advantages in various tasks

By generating up to 256 tokens in parallel, DiffusionGemma's approach shifts the bottleneck from memory bandwidth to compute.
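A rough sketch of why parallel generation eases the pressure on memory bandwidth: an autoregressive model has to stream its active weights from memory once for every token, while a block-based model spreads each pass over many tokens. The weight precision and the number of denoising passes below are assumptions, not figures from Google, and the estimate deliberately ignores caches, activations, and which parameters each token activates.

```python
ACTIVE_PARAMS = 3_800_000_000  # active parameters per pass, from the article
BLOCK_TOKENS = 256             # tokens generated in parallel, from the article
BYTES_PER_PARAM = 1            # ASSUMPTION: 8-bit weights, not stated in the article
DENOISING_PASSES = 16          # ASSUMPTION: hypothetical passes per block


def weight_bytes_per_token(passes, tokens_produced):
    """Bytes of weights streamed from memory per generated token."""
    return passes * ACTIVE_PARAMS * BYTES_PER_PARAM / tokens_produced


autoregressive = weight_bytes_per_token(passes=1, tokens_produced=1)
diffusion = weight_bytes_per_token(DENOISING_PASSES, BLOCK_TOKENS)
reduction = autoregressive / diffusion

print(f"autoregressive: {autoregressive / 1e9:.2f} GB of weights per token")
print(f"block diffusion: {diffusion / 1e9:.2f} GB of weights per token")
print(f"reduction in weight traffic: {reduction:.0f}x")
```

Under these assumptions the weight traffic per token drops by the ratio of block size to pass count, which leaves the arithmetic itself as the limiting factor.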

Google says this gives a noticeable advantage in non-linear tasks such as in-line editing, molecular sequencing, and mathematical graphing.

The model was also tuned to solve Sudoku puzzles, a notoriously challenging task for standard autoregressive AI models due to token interdependence.
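The interdependence is easy to quantify. In a standard 9x9 Sudoku, every cell shares a row, column, or 3x3 box with other cells whose values it must not repeat. The sketch below counts them; the helper name `peers` is our own and has nothing to do with how DiffusionGemma was tuned.

```python
def peers(row, col):
    """Return the cells that share a row, column, or 3x3 box with (row, col)."""
    same_row = {(row, c) for c in range(9)}
    same_col = {(r, col) for r in range(9)}
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)
    same_box = {
        (r, c)
        for r in range(box_r, box_r + 3)
        for c in range(box_c, box_c + 3)
    }
    return (same_row | same_col | same_box) - {(row, col)}


# Every cell is constrained by the same number of other cells.
peer_counts = {len(peers(r, c)) for r in range(9) for c in range(9)}
print(peer_counts)
```

A model that writes cells strictly in reading order has to commit to a value before seeing many of the cells that constrain it, whereas a model that refines the whole grid at once can weigh those constraints together.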

Model restrictions

Google isn't using DiffusionGemma in Gemini models yet

Despite its speed, Google isn't using DiffusionGemma in big cloud-based Gemini models due to a few drawbacks.

These include a higher error rate and resource wastage when the desired output is only a few tokens long.
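The waste on short outputs can be illustrated with simple arithmetic. The sketch assumes the model always denoises a full 256-token block no matter how short the answer is, which is a simplification the article does not confirm; the function name `block_utilization` is ours.

```python
BLOCK_TOKENS = 256  # parallel block size cited in the article


def block_utilization(needed_tokens, block_tokens=BLOCK_TOKENS):
    """Fraction of a denoised block that ends up as useful output.

    ASSUMPTION: the full block is always computed, whatever the answer length.
    """
    return min(needed_tokens, block_tokens) / block_tokens


for needed in (4, 32, 256):
    print(f"{needed:>3} tokens needed: {block_utilization(needed):.1%} of the block used")
```

Under that assumption, a four-token answer uses under 2% of the block it paid for, while a full-length answer uses all of it.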

However, the efficiency gain for local processing makes this an appealing avenue of experimentation.

Google has also started implementing Multi-Token Prediction (MTP) drafters to use otherwise wasted compute cycles for predicting possible tokens and increasing speed.