Microsoft unveils three new in-house foundational models to challenge OpenAI and Google

Marijan Hassan - Tech Journalist
Apr 9

In a major move toward AI independence, Microsoft has launched three new proprietary foundational models designed to compete directly with offerings from Google and its own close partner, OpenAI. The new "MAI" (Microsoft AI) suite, comprising MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, was officially released on April 2, 2026, signaling a shift in the tech giant's strategy toward owning the entire AI stack.

Announced by Mustafa Suleyman, CEO of Microsoft AI, the models are now available through the Microsoft Foundry platform and the MAI Playground. The launch follows a strategic pivot in late 2025, when Microsoft began ramping up in-house development to reduce its reliance on external partnerships for core consumer and enterprise features.

The MAI foundational trio

The new models target three of the most commercially critical AI modalities:

MAI-Transcribe-1 (Speech-to-Text): Microsoft claims "enterprise-grade accuracy" for this model across the top 25 global languages. The company reports a word error rate (WER) of less than 4%, which it says outperforms both GPT-Transcribe (4.2%) and Gemini 3.1 Flash (4.9%). The model is designed for high-speed batch transcription and reportedly runs 2.5x faster than previous Azure offerings.
MAI-Voice-1 (Speech Generation): A highly expressive voice model capable of generating 60 seconds of natural, nuanced audio in just one second on a single GPU. It introduces a "Custom Voice" feature that allows developers to create secure voice clones using only a few seconds of reference audio.
MAI-Image-2 (Visual Generation): Microsoft’s second-generation image model focuses on "photorealistic" accuracy, emphasizing natural lighting, skin textures, and clear in-image text for diagrams. It debuted in the top three of the Arena.ai leaderboard, offering double the generation speed of its predecessor on Copilot.
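
For readers unfamiliar with the metric behind the transcription figures: word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name is hypothetical, and nothing here reflects how Microsoft, OpenAI, or Google actually ran their evaluations (text normalization, test sets, and languages are not described in the announcement).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper only; real benchmarks normally normalize
    casing and punctuation first, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 25%.
print(word_error_rate("the quick brown fox", "the quick red fox"))  # 0.25
```

By this definition, a WER under 4% means fewer than four word-level errors per 100 reference words.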

"Better, faster, cheaper"

Microsoft is positioning the MAI family as a high-performance, cost-effective alternative for developers. During the launch, Suleyman emphasized that these models were developed with "Humanist AI" principles, optimizing for practical, real-world communication rather than raw parameter count.

Pricing for the new models is aggressively competitive:

Transcription: $0.36 per hour.
Voice: $22 per 1 million characters.
Images: $5 per 1M input tokens and $33 per 1M output tokens.
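
To put the list prices in context, the following Python sketch turns them into rough cost estimates. It uses only the rates quoted above. The function names and the sample workloads are hypothetical, and the arithmetic assumes simple linear billing with no minimum charges, rounding rules, or volume tiers, none of which the announcement describes.

```python
# Rates as quoted in the article (US dollars).
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesizing speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of image generation from input and output token counts."""
    return (
        input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


# Hypothetical workloads, chosen only to illustrate the arithmetic.
print(f"100 hours of audio:     ${transcription_cost(100):.2f}")
print(f"500,000 characters:     ${voice_cost(500_000):.2f}")
print(f"1k in / 4k out tokens:  ${image_cost(1_000, 4_000):.3f}")
```

At these rates, 100 hours of transcription comes to about $36 and half a million characters of generated speech to about $11.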

Strategic independence

The launch of the MAI suite comes amid a complex "co-opetition" between Microsoft and OpenAI. While Microsoft recently participated in a $122 billion funding round for OpenAI, the release of these in-house models suggests Microsoft is building a "Plan B" to protect its margins and innovation cycles.

By integrating these models directly into Copilot, Bing, and PowerPoint, Microsoft is effectively replacing third-party technology with its own proprietary "agent factory." This move is expected to significantly reduce the multi-billion-dollar compute bill Microsoft pays to host partner models on its Azure infrastructure.