Technology

Mira Murati's Startup Achieves Faster AI Conversations: What It Means

Martin Holloway · Published 2d ago · 5 min read · Based on 3 sources

Thinking Machines Lab, a new AI company founded by Mira Murati (the former chief technology officer at OpenAI), announced its first conversational AI models on Monday. The headline: these models respond in 0.40 seconds, roughly three times faster than ChatGPT's voice mode and more than twice as fast as Google's competing system. The startup published technical details over the weekend and demonstrated performance that outpaces current offerings from the industry's largest players.

To put those numbers in perspective, Murati's team achieved 400 milliseconds, the fastest response time the conversational AI industry has seen to date. OpenAI's GPT-realtime-2.0 averages 1.18 seconds. Google's Gemini Live runs at 0.94 seconds. That gap may sound small, but it changes how natural a conversation feels.

How They Built It Faster

The company's TML-Interaction-Small model is built on a Mixture-of-Experts architecture — think of it as a large brain that selectively activates only the parts it needs at any moment. The full model contains 276 billion parameters, or individual learned weights, but uses only 12 billion parameters at a time. This selective activation is what lets it run in real time on today's hardware.
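To make the selective-activation idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind Mixture-of-Experts models. The expert count, dimensions, and gating scheme below are invented for illustration; the published sources do not describe TML's actual router.

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only,
# not Thinking Machines Lab's implementation). A gating network scores
# every expert for each token, but only the top-k experts actually run,
# which is why active parameters stay far below total parameters.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 64   # hypothetical expert count
TOP_K = 2        # experts activated per token
D_MODEL = 512    # hypothetical hidden dimension

# Each "expert" here is just a random linear map for illustration.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
           for _ in range(N_EXPERTS)]
gate = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(x):
    """Route one token vector through only TOP_K of N_EXPERTS experts."""
    scores = x @ gate                      # gating score for every expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the chosen experts
    # Only TOP_K matrix multiplies execute; the other experts stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
print(moe_forward(token).shape)  # (512,)
```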

The technical trick that unlocks speed is called full-duplex communication. In plain language: the system can listen and speak simultaneously. Most AI voice assistants today must finish one thing before they do the next, creating an awkward pause-and-respond rhythm. Thinking Machines Lab's approach lets the AI keep listening while it talks, which means you can interrupt it, ask clarifying questions, and have exchanges that actually feel like a conversation rather than turn-taking with a tool.
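The difference is easiest to see in code. The toy sketch below is an assumption about the general full-duplex pattern, not TML's system: listening and speaking run as concurrent tasks, so an interruption can cut speech off mid-sentence instead of waiting for the turn to end.

```python
# Toy full-duplex loop: listening and speaking run concurrently, so the
# user can interrupt mid-utterance. Purely illustrative; real systems
# stream audio frames, not strings.
import asyncio

async def listen(interrupt: asyncio.Event):
    """Keep 'hearing' input while the assistant talks."""
    for t in range(5):
        await asyncio.sleep(0.3)     # stand-in for incoming audio frames
        if t == 2:                   # pretend the user cuts in here
            print("[listener] user interrupted")
            interrupt.set()

async def speak(interrupt: asyncio.Event):
    """Stream a reply in chunks, checking for interrupts between them."""
    for chunk in ["Sure,", "here", "is", "a", "longer", "answer..."]:
        if interrupt.is_set():
            print("[speaker] stopping mid-sentence")
            return
        print("[speaker]", chunk)
        await asyncio.sleep(0.25)

async def main():
    interrupt = asyncio.Event()
    # Both coroutines run at once: that concurrency is the "full duplex".
    await asyncio.gather(listen(interrupt), speak(interrupt))

asyncio.run(main())
```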

The company also uses a two-model approach: one model handles real-time chat, while a second model runs in the background to handle heavier lifting like web searches or complex reasoning. Think of it like having a quick-witted friend in the foreground who keeps the conversation moving, with a research assistant quietly working behind the scenes.
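A rough sketch of that division of labor, under the assumption that it works like an ordinary foreground/background split (the sources do not publish TML's orchestration code):

```python
# Hypothetical fast/slow split: a quick model replies immediately while
# a slower "research" call runs in a background thread. Function names
# and timings are invented for illustration.
from concurrent.futures import ThreadPoolExecutor
import time

def fast_model(prompt: str) -> str:
    """Instant conversational acknowledgment."""
    return f"Good question about {prompt!r}; checking the details now."

def slow_model(prompt: str) -> str:
    """Stand-in for web search or heavy reasoning."""
    time.sleep(2.0)
    return f"Background result for {prompt!r}."

with ThreadPoolExecutor() as pool:
    future = pool.submit(slow_model, "latency benchmarks")  # kicked off early
    print(fast_model("latency benchmarks"))  # user hears this immediately
    print(future.result())                   # folded in once it arrives
```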

On published benchmarks, the model scored 77.8 on the FD-bench interaction quality test, substantially ahead of competitors, which scored between 46 and 54. It also outperformed OpenAI on the Audio MultiChallenge test.

Why This Matters

The latency problem has constrained voice AI adoption. Customer service, classroom collaboration, and real-time work sessions all require the ability to interrupt and riff back and forth. With current voice models forcing you to wait 1 to 1.2 seconds between turns, the experience feels stilted rather than natural. At 0.4 seconds, something shifts.

We have seen this pattern before, when smartphone makers competed on camera shutter speed and app launch times. Those seemed like small engineering victories, but they changed how people actually used their phones. Response latency in conversational AI sits in the same category: marginal improvements that reshape interaction patterns.

Thinking Machines Lab plans a limited research preview in the coming months, though the company has not announced a commercial timeline. The team also acknowledged a real constraint: their larger, more capable models still run too slowly for real-time use. Those bigger versions are planned for release later in 2026 as the company improves computational efficiency.

What Gets Traded Away

The focus on speed requires compromises worth examining. The 12-billion active parameter limit means these models have less raw reasoning power than the largest language models, which can activate hundreds of billions or even trillions of parameters to work through complex questions.

The dual-model architecture addresses this tradeoff by handling quick responses in real time while offloading deeper reasoning to the background system. In a sense, it mirrors what humans actually do in conversation: we acknowledge a complex question immediately and keep talking while processing the answer in parallel. The challenge is that this coordination between two models introduces complexity — the fast model must decide when to hand off to the slow one, integrate results when they return, and manage what the user expects to happen during pauses.
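A tiny sketch of that handoff decision, with the trigger heuristic entirely invented for illustration:

```python
# Invented heuristic for when the fast model should defer to the slow
# one. A real router would use learned signals, not keyword matching.
def needs_background_work(query: str) -> bool:
    heavy_markers = ("search", "compare", "latest", "calculate")
    return any(marker in query.lower() for marker in heavy_markers)

def respond(query: str) -> str:
    if needs_background_work(query):
        # Hand off, then keep the conversation moving during the wait.
        return "Let me look that up while we keep talking."
    return "Quick answer: yes."

print(respond("Compare the latest benchmarks for me"))  # hands off
print(respond("Can you repeat that?"))                  # answers directly
```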

What Comes Next

The broader shift happening in conversational AI is away from trying to make one general-purpose model do everything, and toward building specialized systems optimized for specific patterns of interaction. Thinking Machines Lab represents one approach: accept some limits on reasoning power in exchange for latency that enables new use cases.

Whether this approach wins in the market depends on whether speed actually translates into user preference, and whether the company can scale it to larger, more capable models without losing the speed advantage that defines it. The research preview will provide the first real-world data on both questions.

The technical achievements are clear and measurable. Market validation — whether customers actually prefer this — remains to be seen.
