Microsoft takes on OpenAI, Google with 3 new AI models
The models are available on Microsoft Foundry

Apr 03, 2026
10:26 am

What's the story

Microsoft has unveiled three new in-house AI models: a speech transcription system, a voice generation engine, and an image creator. The three models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, are now available on Microsoft Foundry and the newly launched MAI Playground. The move puts Microsoft in direct competition with OpenAI, Google, and other leading labs in both model development and distribution.

Model capabilities

MAI models address 3 key areas of enterprise AI

The newly launched AI models cover three major areas of enterprise AI: speech-to-text conversion, realistic human voice generation, and image creation. Mustafa Suleyman, who heads Microsoft's superintelligence team, expressed excitement over their launch. He claimed that MAI-Transcribe-1 is the best transcription model in the world today and can be delivered using half as many GPUs as state-of-the-art competitors.

Model performance

MAI-Transcribe-1 is the best speech-to-text model, claims Suleyman

MAI-Transcribe-1, Microsoft's speech-to-text model, has the lowest average Word Error Rate on the FLEURS benchmark across 25 languages. It beats OpenAI's Whisper-large-v3 in all 25 languages and Google's Gemini 3.1 Flash Lite in 22 of them. The model pairs a bidirectional audio encoder with a transformer-based text decoder and supports MP3, WAV, and FLAC files up to 200MB in size.
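
For readers unfamiliar with the metric, Word Error Rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition only. The function name and the sample sentences are made up for illustration, and it is not Microsoft's or the FLEURS benchmark's evaluation code, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference gives a WER of 2/6.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")
```

A lower value is better, and an "average" WER across 25 languages would simply be the mean of such per-language scores.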

Model features

MAI-Voice-1 can create custom voices from just a few seconds

MAI-Voice-1, Microsoft's text-to-speech model, can generate a minute of natural-sounding audio in just one second. It maintains speaker identity across long-form content and now offers custom voice creation from just a few seconds of audio through Microsoft Foundry. Meanwhile, MAI-Image-2 has debuted as one of the top three model families on the Arena.ai leaderboard and generates images at least twice as fast on Foundry and Copilot as its predecessor.

Market strategy

Models priced competitively to lower Microsoft's cost of goods sold

The launch of these models comes at a time when Microsoft is under pressure to show that its massive investments in AI infrastructure will translate into revenue. The models are priced competitively and are designed to lower Microsoft's own cost of goods sold. Suleyman sees this as his first response to the market pressure, which has been building as investors question when AI spending will start generating returns.

Competitive edge

MAI-Transcribe-1 beats Whisper and Gemini Flash Lite in accuracy

MAI-Transcribe-1 competes directly with OpenAI's open-source Whisper models, claiming higher accuracy on all 25 benchmarked languages. It also beats Google's Gemini 3.1 Flash Lite on 22 of 25 languages. Meanwhile, MAI-Voice-1's ability to clone voices from seconds of audio and generate 60 seconds of audio in one second puts it up against ElevenLabs, Resemble AI, and other voice AI start-ups.
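
To put the speed claim in perspective, 60 seconds of audio produced in one second works out to 60 times faster than real time. The short sketch below does that arithmetic. The function names are hypothetical, and the one-hour estimate assumes generation time scales linearly with audio length, which the announcement does not state.

```python
def speedup_over_real_time(audio_seconds: float, compute_seconds: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def estimated_generation_time(audio_seconds: float, speedup: float) -> float:
    """Compute time for a clip, ASSUMING throughput stays constant."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Figures from the claim: 60 seconds of audio in 1 second.
speedup = speedup_over_real_time(audio_seconds=60, compute_seconds=1)

# Under the linear-scaling assumption, a one-hour recording (3,600 s)
# would take about a minute to generate.
one_hour = estimated_generation_time(audio_seconds=3600, speedup=speedup)
print(speedup, one_hour)
```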
