OpenAI's Former CTO Unveils Prototype for an AI That's Always 'Present' | Hao's Deep Dive on Papers

Tencent's cutting-edge tech paper analysis column, seeking AI's certainty at the intersection of code and commerce.

Text by Bo Yang

Edited by Xu Qingyang

On May 11, Thinking Machines released a new model called the Interaction Model. This AI lab, founded by former OpenAI CTO Mira Murati, previously published the OPD distillation paradigm that profoundly influenced DeepSeek V4. This time, they claim their newly released model represents the next-generation paradigm for human-computer interaction.

Their argument starts from communication theory.

In their seminal 1991 paper "Grounding in Communication," Herbert Clark and Susan Brennan described the constraints that shape how people establish common ground in conversation. Thinking Machines adopted three of these as conditions for effective communication and used them as a diagnostic framework to systematically check the state of current AI interaction systems.

Copresence: Both parties share the same perceptual field. What you see, hear, and experience in your environment can also be perceived by the other party.

Contemporality: Reception is nearly synchronous with sending. As you speak, the other party is processing what you say simultaneously; there is no gap where they "wait for you to finish before starting to understand."

Simultaneity: Both parties can send and receive information at the same time. While you are speaking, the other party can simultaneously offer real-time feedback like micro-expressions, nods, and interjections.

These three conditions are naturally met in face-to-face conversation. When you chat with a friend in a café, you share the same physical space (copresence), they listen and understand as soon as you speak (contemporality), and they might frown or nod in real-time to signal they are "following along" or "disagree" (simultaneity).

Thinking Machines' diagnostic conclusion is that current AI systems fail the first two conditions completely, and the third has seen partial progress in recent full-duplex voice models but is still incomplete.

AI Has Never Truly Been "Present"

Thinking Machines argues that the biggest way current AI fails the definition of presence is that all dialogue systems are built on the concept of a turn.

The user finishes a segment of speech, the model processes it, and the model outputs a reply. One turn ends, the next begins. This structure fundamentally severs presence, in two ways.

First, it lacks copresence. The AI only perceives you when you actively provide input. When you are not speaking, your world does not exist for it. You furrow your brow, walk to a window, a bad news alert pops up on your screen—it knows none of this. Its perceptual field is limited to that narrow channel you "actively push" through a keyboard or microphone.

Second, it lacks contemporality. The model must wait for you to "finish speaking" before it begins processing. Voice Activity Detection (VAD) needs to detect a sufficiently long silence to determine your turn is over. In this gap of "waiting for you to finish," the model has no real-time understanding of what you are currently saying.

Thinking Machines used an analogy in their blog. Imagine you and a colleague are discussing a critical disagreement, but you can only communicate via email. You write and send, then wait for a reply. The other person writes and sends, then waits for your next message. No one thinks this method is suitable for solving complex collaborative problems.

But this is exactly how all current AI systems interact.

The third necessary condition, simultaneity, is where the fastest progress has been made in the last two years. Real-time voice AIs are already trying to allow systems to send and receive simultaneously. OpenAI released GPT-Realtime-2 on May 7, and ByteDance's Seeduplex has been fully rolled out on Doubao. But a closer look at the architectures reveals that each implementation's depth of simultaneity differs.

And they have all addressed only simultaneity, leaving the first two conditions untouched.

Full-Duplex at the Communication Layer, But the Model Layer Still Waits for You to Finish

GPT-Realtime-2 is the voice model OpenAI launched 4 days before Thinking Machines' release, and it's their strongest real-time interaction solution to date. Let's look at what it does.

It boasts GPT-5-level reasoning, a 128K context window, and, most importantly, improved parallel tool calling, which lets you control a system and call tools by voice. It also scores 15.2% higher than its predecessor on Big Bench Audio, making it a very strong voice model in its own right.

But here we only care about one question: how far does it go on the three conditions?

Let's look at the architecture first. The OpenAI Realtime API is built on WebSocket, a full-duplex communication protocol. Your audio stream is continuously sent to the server, and the AI's audio stream is continuously returned to you; both directions are open simultaneously. So simultaneity is solved at the communication layer—you can speak while the AI is talking, and the AI can continue outputting while you speak. The channel is bidirectional; there is no restriction that "one party must wait for the other to finish before speaking."

The problem lies with the model behind the channel.

Although WebSocket continuously receives your audio, the model is not "listening all the time." On the server side, a VAD module sits between you and the model, acting as a gatekeeper. VAD's job is to determine if "the user has finished speaking." Only when it detects a sufficiently long silence, concluding that your turn is over, is the model awakened to process what you just said.

To put it another way: the channel is like a two-way highway where cars can travel in both directions at any time. But the model is like a toll booth at the end of the highway; it doesn't open the gate when a car arrives, but waits until all cars are assembled (you've finished speaking) before processing them all at once.

What about interruptions? If you speak while the AI is talking, VAD detects new voice activity, the system cancels the AI's current output, waits for you to finish, and then triggers a new generation cycle.

Note this process: the interruption is triggered by VAD, not because the model itself realizes you've started speaking. The model is externally notified to "stop," and then waits for enough new input to accumulate before restarting.
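The gatekeeping behavior described above can be illustrated with a minimal sketch. It is not OpenAI's implementation: the 20ms frame length, the 500ms silence threshold, and the names `VadGate` and `run` are all assumptions made for illustration. The point it shows is that the model is woken only after the silence threshold is crossed, however early the speech arrived.

```python
from dataclasses import dataclass, field

FRAME_MS = 20      # assumed audio frame length
SILENCE_MS = 500   # assumed end-of-turn silence threshold


@dataclass
class VadGate:
    """Buffers speech frames and releases them only after a long enough silence."""
    buffered: list = field(default_factory=list)
    silence_ms: int = 0

    def push(self, frame, is_speech):
        """Feed one frame; return the completed turn, or None while still waiting."""
        if is_speech:
            self.buffered.append(frame)
            self.silence_ms = 0
            return None
        self.silence_ms += FRAME_MS
        if self.buffered and self.silence_ms >= SILENCE_MS:
            turn, self.buffered = self.buffered, []
            self.silence_ms = 0
            return turn
        return None


def run(frames):
    """Return (frame_index, turn) for each moment the model would be woken up."""
    gate = VadGate()
    wakeups = []
    for index, (frame, is_speech) in enumerate(frames):
        turn = gate.push(frame, is_speech)
        if turn is not None:
            wakeups.append((index, turn))
    return wakeups


# Two speech frames followed by 25 silent frames (500ms of silence).
frames = [("a", True), ("b", True)] + [("", False)] * 25
wakeups = run(frames)
```

In this toy run the model sees nothing until the 25th silent frame, even though both speech frames arrived at the very start.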

Despite having a foundational level of simultaneity, it hasn't solved the old turn-based problem; contemporality is completely unaddressed.

Model-Level Full-Duplex, But Still Doesn't Know What You Look Like

ByteDance's Seeduplex, launched in April 2025, goes a step further than OpenAI. It is a large voice model that achieves full-duplex at the model level.

GPT-Realtime-2's simultaneity relies on the communication layer—WebSocket allows bidirectional simultaneous transmission—but the model itself still "waits for you to finish before thinking." Seeduplex pushes simultaneity into the model's internals.

Its three-stream architecture (listening stream, speaking stream, control stream), coupled with R-PEC (Relative Position Encoding), allows the model to genuinely process input and output simultaneously. The listening stream continuously parses what you're saying, the speaking stream simultaneously generates a response, and the control stream handles real-time arbitration between the two.

The result is a 50% reduction in false interruption rates and a 40% drop in the proportion of talk-overs compared to half-duplex models.

This is a tangible step forward in simultaneity. GPT-Realtime-2's interruption mechanism is "cancel and restart"—the AI is stopped, waits for you to finish, and regenerates a turn. Seeduplex's interruption is continuous; the AI listens while speaking, and if it judges that you want to interject, it smoothly yields the floor without the disruptive "cancel-wait-restart" process. It's an upgrade from a walkie-talkie to a telephone.

This isn't the false concurrency of the communication layer, but simultaneous processing of input and output streams within the model itself. In terms of the three conditions, it fills in the missing piece of simultaneity.

But what about copresence and contemporality? Just as with GPT-Realtime-2, they remain untouched.

Both are purely voice models with no visual input. When you're not speaking, you still don't exist to them. R-PEC is relative temporal encoding; it knows a token in the listening stream is "before" or "after" a token in the speaking stream, but it doesn't have an absolute clock to anchor each position to a specific moment in the real world.

It knows the sequence, but it doesn't possess a continuous sense of presence. When there's no voice activity, the three streams have nothing to process, leaving the model in an idle state.

So here's an analogy: GPT-Realtime-2 is a walkie-talkie with an interrupt button. You press the button and it stops to listen. Seeduplex is a real telephone where two people can talk simultaneously without confusion.

But what Thinking Machines wants to create is a face-to-face encounter.

A face-to-face encounter means that even when no one is speaking, two people share the same space, the same span of time, the same silence.

Welding Interactivity into the Model

Walkie-talkies and telephones each only address one of the three conditions. Thinking Machines aims to remedy all three. How?

Let's start with the first condition: copresence.

Copresence: Exposing the AI to the Full Range of Modalities You Experience

The AI needs a perceptual bandwidth equal to your own. It needs to see what you can see and hear what you can hear.

So, they trained a multimodal model. But to satisfy contemporality, they didn't choose the mainstream path of adding encoder scaffolding to a voice model to achieve multimodal functionality. Instead, they retrained a unified model from scratch.

Contemporality requires processing across different modalities to be time-synchronized. If a system needs to align multiple modal streams—video frames, audio clips, text tokens—to the same representation space with temporal precision, any latency jitter from plug-in components will destroy that alignment.

In a plug-in design, for example, vision goes through one independent encoder (like ViT), audio through another (like Whisper), and text through a third. Each encoder has a different processing delay: vision might take 80ms, audio 40ms, while text is almost instantaneous.

These latency differences seem small but can have fatal effects in subsequent stages.
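A small worked example shows why the skew matters. It uses the illustrative delays from the text (80ms for vision, 40ms for audio) and assumes 0ms for text, since the article only calls it "almost instantaneous". The function names are hypothetical. The 200ms figure used for comparison is the micro-turn length the article attributes to the Interaction Model.

```python
# Illustrative per-modality encoder delays in milliseconds.
# The 0ms text delay is an assumption standing in for "almost instantaneous".
LATENCY_MS = {"vision": 80, "audio": 40, "text": 0}


def arrival_times(event_ms, latency_ms=LATENCY_MS):
    """When each modality's features reach the model for one real-world instant."""
    return {modality: event_ms + delay for modality, delay in latency_ms.items()}


def max_skew(latency_ms=LATENCY_MS):
    """Largest gap between the slowest and fastest modality."""
    return max(latency_ms.values()) - min(latency_ms.values())


arrivals = arrival_times(1000)   # something happens at t = 1000ms
skew = max_skew()                # 80ms under these assumptions
fraction_of_window = skew / 200  # share of a 200ms micro-turn
```

Under these assumed numbers, features describing the same instant arrive up to 80ms apart, which is 40% of a 200ms window, so a gesture and the word spoken with it can land in different time slices.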

This is the reason Thinking Machines emphasizes in their technical documentation that "interactivity must be part of the model itself, not assembled through external scaffolding."

Internalizing all time-sensitive functions into the model and joint training from scratch is not an aesthetic preference but an engineering necessity.

Their specific method: Audio input uses a lightweight dMel (Mel spectrogram) embedding layer for minimal preprocessing; video input cuts images into 40x40 patches and encodes them with hMLP (Hierarchical MLP); text uses standard embeddings. All components are jointly trained from scratch with the main Transformer using Encoder-free Early Fusion.
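To make the patching step concrete, here is a minimal sketch of cutting a frame into non-overlapping 40x40 patches. Only the patch size comes from the article. The 640x480 frame size, the choice to drop ragged edges, and the name `patch_grid` are assumptions; the hMLP encoding that follows is not modeled here.

```python
PATCH = 40  # patch side length in pixels, from the article


def patch_grid(width, height, patch=PATCH):
    """Return (left, top, right, bottom) boxes for non-overlapping patches.

    Edges that do not fill a whole patch are dropped (an assumption;
    the article does not say how ragged edges are handled).
    """
    return [
        (x, y, x + patch, y + patch)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]


boxes = patch_grid(640, 480)  # hypothetical frame size
patch_count = len(boxes)      # 16 columns x 12 rows
```

For the assumed 640x480 frame this yields 192 patches per frame, which gives a rough sense of how many video positions each time slice could contribute before any further compression.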

The result is that the path for all modalities from input to the Transformer is compressed to a minimum, with delays made as uniform as possible.

Here, unified representation is not an independent innovation but an enabling condition; it ensures modalities don't slow each other down, providing the precision basis for the next step: temporal anchoring.

Of course, aside from this, another reason for training a model from scratch is Thinking Machines' belief that interaction capability itself grows with model capability, but scaffolding does not.

Only a unified model can share in that growth, so only a unified model lets interaction capability scale.

Contemporality: Giving the Model a Continuous Internal Clock

Contemporality is the most crucial point in this architecture.

The model needs a continuous internal clock, rather than being event-awakened, for it to be continuously "present."

Current language models are passive in the temporal dimension. Their sense of time is event-driven. They wake up when something happens, and sleep when it doesn't.

Thinking Machines flipped this paradigm. Their Interaction Model operates on 200ms micro-turns. Every 200ms, the model processes a set of input tokens and generates a set of output tokens. Whether you are speaking or not, whether an "event" has occurred, this 200ms heartbeat never stops.

Why 200ms? Because this is the minimum meaningful feedback interval in human conversation. Research in conversation analysis shows that 200ms is roughly the shortest time in which a person can produce a backchannel response ("uh-huh," "right," "and then?"). Below this interval, feedback seems unnatural; above it, the other person feels you are "not listening."

In each 200ms micro-turn, the model first reads in all input tokens (from various modalities), then generates the tokens it should output. Input and output are interleaved into a continuous sequence.

Silence is not blank. If you say nothing in a particular 200ms slice, the model still processes that silence (silent mel features in the audio stream, and your current image in the video stream). Silence, overlap, and interruption are all preserved in the context.
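The micro-turn loop can be sketched as follows. This is a toy under stated assumptions, not Thinking Machines' code: strings stand in for tokens, `toy_policy` is a hypothetical stand-in for the model, and `<silence>` is an invented marker. What it shows is the structure described above: every 200ms slice contributes an input entry and an output entry to one interleaved sequence, including slices where nobody speaks.

```python
CHUNK_MS = 200  # micro-turn length from the article


def micro_turn_sequence(audio_chunks, respond):
    """Interleave per-chunk inputs and outputs into one continuous sequence.

    audio_chunks: list of strings; "" means the user was silent in that slice.
    respond: hypothetical model stand-in mapping the context to an output string.
    """
    context = []
    for index, chunk in enumerate(audio_chunks):
        t_ms = index * CHUNK_MS
        heard = chunk if chunk else "<silence>"
        context.append(("in", t_ms, heard))   # read first
        said = respond(context) or "<silence>"
        context.append(("out", t_ms, said))   # then write
    return context


def toy_policy(context):
    """Backchannel once, right after the user goes quiet following speech."""
    inputs = [entry for entry in context if entry[0] == "in"]
    if (len(inputs) >= 2
            and inputs[-1][2] == "<silence>"
            and inputs[-2][2] != "<silence>"):
        return "mm-hm"
    return ""


sequence = micro_turn_sequence(["so I was", "", ""], toy_policy)
```

Because the silent slices stay in the context, the policy can react to the pause itself, which a turn-based system never sees.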

This enables capabilities previously impossible.

In a TimeSpeak test requiring the user to say "remind me at the 30th second," the Interaction Model achieved a macro accuracy of 64.7%, while GPT-4o realtime scored only 4.3%. In a CueSpeak test of "say the answer when you see me raise my hand," the Interaction Model scored 81.7% compared to GPT-4o realtime's 2.9%. The difference is an order of magnitude because GPT-4o realtime has no internal clock; it doesn't know where "the 30th second" is.
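One way to read these numbers: with a fixed 200ms cadence, elapsed time is recoverable from position in the sequence alone. The arithmetic below is purely illustrative and says nothing about how the model represents time internally; the function names are hypothetical.

```python
CHUNK_MS = 200  # micro-turn length from the article


def chunk_index_for(seconds):
    """Index of the micro-turn that starts at the given elapsed time."""
    return round(seconds * 1000 / CHUNK_MS)


def elapsed_seconds(chunk_index):
    """Elapsed time at the start of a given micro-turn."""
    return chunk_index * CHUNK_MS / 1000


target = chunk_index_for(30)        # "remind me at the 30th second"
check = elapsed_seconds(target)     # back to seconds
```

Under this reading, "the 30th second" is simply micro-turn 150. An event-driven model that only wakes on speech has no comparable count to consult.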

To make the 200ms heartbeat work at an engineering level, Thinking Machines did two things.

Trainer-sampler alignment. This architecture mandates that the temporal resolution during training must be exactly the same as during inference. If the model sees variable-length chunks during training but must strictly output every 200ms during inference, the model's learned sense of time will be distorted. To solve this, they achieved bitwise alignment between training and inference. The additional overhead is less than 5%.

Streaming Sessions. Traditional LLM inference frameworks allocate memory for each request and perform a prefill step. A 200ms chunk means 5 small prefills per second; the overhead of a traditional framework would be amplified to an unacceptable level. So, they redesigned the inference architecture. The client sends a chunk every 200ms, and the inference server appends the chunk to a persistent sequence in GPU memory, avoiding repeated allocation. This compresses the memory access cost to make it truly runnable.
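A rough cost model shows why appending to a persistent sequence matters. This sketch counts tokens processed over one minute of conversation under two designs. The figure of 10 tokens per chunk is an assumption, and the stateless case is a deliberate worst case that re-prefills the full history on every request with no prefix caching. `StreamingSession` and `stateless_cost` are hypothetical names, not the actual server API.

```python
class StreamingSession:
    """Keeps one growing sequence; each chunk is processed once on arrival."""

    def __init__(self):
        self.sequence = []
        self.tokens_processed = 0

    def append_chunk(self, tokens):
        self.sequence.extend(tokens)
        self.tokens_processed += len(tokens)


def stateless_cost(n_chunks, tokens_per_chunk):
    """Tokens processed if every request re-prefills the whole history so far."""
    return sum(k * tokens_per_chunk for k in range(1, n_chunks + 1))


CHUNKS_PER_SECOND = 5   # 1000ms / 200ms
TOKENS_PER_CHUNK = 10   # assumed, for illustration only
n_chunks = CHUNKS_PER_SECOND * 60

session = StreamingSession()
for _ in range(n_chunks):
    session.append_chunk(["tok"] * TOKENS_PER_CHUNK)

persistent_total = session.tokens_processed
stateless_total = stateless_cost(n_chunks, TOKENS_PER_CHUNK)
```

With these assumptions, one minute costs 3,000 token-steps in the persistent design versus 451,500 in the worst-case stateless one. Real frameworks with prefix caching would sit somewhere in between, but the per-request setup overhead still recurs five times a second.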

Simultaneity: Making Input and Output Truly Parallel

Simultaneity requires the system to be able to receive and send at the same time.

A standard autoregressive Transformer does one thing at a time: either it reads in a prompt (prefill), or it generates a reply token by token (decode). The decode phase does not accept new input. The result is that if the AI is talking, it is not listening to you. The two are mutually exclusive in time.

By slicing time into discrete 200ms blocks, Thinking Machines' execution sequence within each block is "read first, then write." The model first processes all input tokens accumulated over these 200ms (what you said, how your expression changed), then generates the tokens it should output for these 200ms.

From inside the model, input and output are still sequential, consistent with traditional models.

But at the scale of human perception, 200ms is too short for you to perceive the "read-then-write" gap. You feel the AI is responding at the same time you are speaking: the delay from when you start speaking to the AI's response is at most 200ms, and human temporal resolution in conversational contexts is about 200-300ms, so you simply cannot feel the latency.

This is not true parallelism at the physical layer, but rapid alternation below the human perception threshold, which achieves an effect equivalent to concurrency.

This differs from Seeduplex's duplex architecture. Seeduplex's three-stream architecture performs true parallel processing inside the model, with listening and speaking streams running simultaneously. Thinking Machines slices time fine enough so that serial execution becomes experientially equivalent to parallel.

The latter has an additional benefit: because the model reads all inputs before generating outputs within each micro-turn, it possesses a complete perception of the "global state at that moment." In contrast, Seeduplex's parallel listen/speak streams require an extra control stream to arbitrate conflicts. The micro-turn structure naturally avoids this problem because each time block has only one decision point.

But simultaneity introduces an engineering tension. If the model needs to produce meaningful output every 200ms to maintain a sense of presence, it cannot spend too much time "thinking." You cannot ask someone to maintain constant eye contact while solving calculus. Maintaining presence and performing deep thinking are naturally in conflict over computational resources.

Therefore, Thinking Machines opted for a dual-model architecture.

The Interaction Model (TML-Interaction-Small) is a 276B parameter Mixture-of-Experts (MoE) model, activating only 12B parameters per inference. Its responsibility is to maintain the heartbeat, responding every 200ms, sustaining a continuous multimodal bidirectional flow, and handling dialogue management and immediate replies. It performs near GPT-4o levels on standard benchmarks but doesn't do deep reasoning that requires prolonged thought.

The Background Model is responsible for deep work. When the Interaction Model judges that "this question requires looking up information or reasoning," it asynchronously delegates the task to the Background Model. The Background Model produces a streaming return, and the Interaction Model naturally weaves the result into the conversation at an opportune moment. The two models share the complete conversation context.

In the Interaction Model, the heartbeat is rigid (must respond every 200ms), while thinking is elastic (can take 2 seconds or 20 seconds). Optimizing them separately is far more efficient than forcing a single model to satisfy both.
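The division of labor can be sketched with a tick-based simulation. This is a toy under stated assumptions: ticks stand in for 200ms micro-turns, `BackgroundJob` and `run_session` are hypothetical names, and a real system would run the two models concurrently rather than in one loop. It shows the key property: the foreground emits something on every tick while the slow work finishes whenever it finishes.

```python
class BackgroundJob:
    """Stand-in for a delegated task that needs several ticks to finish."""

    def __init__(self, question, ticks_needed):
        self.question = question
        self.remaining = ticks_needed

    def step(self):
        """Advance one tick; return True once the job is done."""
        self.remaining -= 1
        return self.remaining <= 0


def run_session(inputs, slow_questions):
    """inputs: what the user said on each tick ("" for silence).
    slow_questions: maps a question to the ticks the background model needs.
    """
    jobs, transcript = [], []
    for tick, heard in enumerate(inputs):
        finished, still_running = [], []
        for job in jobs:
            (finished if job.step() else still_running).append(job)
        jobs = still_running

        if heard in slow_questions:
            # Delegate, but still reply immediately to keep the heartbeat.
            jobs.append(BackgroundJob(heard, slow_questions[heard]))
            out = "let me check"
        elif finished:
            out = "answer to: " + finished[0].question
        else:
            out = "<silence>"
        transcript.append((tick, out))
    return transcript


transcript = run_session(["what is X?", "", "", ""], {"what is X?": 2})
```

The transcript has one entry per tick regardless of how long the delegated job takes, which is the rigid-heartbeat, elastic-thinking split in miniature.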

Bolstered by this dual architecture, the Interaction Model achieves higher conversation quality. On the FD-bench V1.5 conversational quality score, the Interaction Model scored 77.8, compared to GPT-4o realtime's 46.8, GPT-4's 48.3, and Gemini 3's 54.3.

Why Is "Copresence" the Next Generation of Interaction?

Now that we've covered the technical architecture, let's discuss the philosophy.

Beyond "good communication" as defined by communication theory, what else does "copresence" offer that makes Thinking Machines believe it represents the next generation of interaction?

First is the temporality and proactive potential inherent in "copresence."

An AI that is always present offers a better experience because it can remember things we've discussed earlier, respond at any time, and gather information from us 24/7 and respond accordingly.

The importance of these features has been fully demonstrated by the popularity of projects like OpenClaw. It created temporality and proactivity through a heartbeat, and gained the foundation for self-evolution and runtime accumulation through memory.

Yet, its channels for gathering information to be proactive are still very narrow, requiring you to initiate via a turn-based prompt. Paired with this broader, more temporally anchored interaction system, its convenience and capability spectrum would undoubtedly see a massive upgrade.

But besides this, what else makes it the next generation of interaction?

Thinking Machines has its own reflections on this, citing the works of two philosophers in its references.

Friedrich Hayek, in his 1945 work "The Use of Knowledge in Society," pointed out that the most important knowledge in society isn't explicit knowledge that can be written in textbooks, but the dispersed knowledge of the particular circumstances of time and place held by each individual. This knowledge is tacit and cannot be collected and aggregated by a central planner.

James Scott, in "Seeing Like a State," developed this concept into mētis (practical wisdom), a kind of knowledge that can only be acquired through personal presence and long-term immersion: a veteran farmer's understanding of his own plot of land, or the information a doctor reads from a patient's expression upon entering a ward.

This knowledge cannot be encoded into text; it can only be observed and absorbed in a copresent context.

Mapping this to human-computer interaction: the current model requires humans to encode their needs into language and "push" it to the AI, compressing vague intentions and complex situations into a text segment, sending it off, and waiting for a response.

However, a person's understanding of their own needs is inherently incomplete. You furrow your brow at the code on the screen, knowing "something is wrong here," but unable to articulate exactly what. Your hesitation, pauses, and head-shaking while staring at a whiteboard corner during a discussion with a colleague—these are all information.

A copresent AI can capture this information. Not because it can read minds, but because it continuously exists in your perceptual field. It doesn't need you to "speak it out" to sense your state. It gleans the kind of contextual knowledge Hayek described from your silence, gaze, and expression changes.

In the blog post for the May 7 release of GPT-Realtime-2, OpenAI admitted that the model's advantages are not obvious when users interact "interactively, synchronously, hands-on-keyboard," and that autonomous agent systems can better leverage its capabilities.

This reveals that in current AI interaction, humans are pushed out of the loop, not because humans are useless, but because the pipeline is too narrow for human tacit knowledge to fit through.

"Faster" just optimizes the pipeline's efficiency. "Copresence" opens up an entirely new information channel. What flows through this channel is information that cannot be actively encoded and sent, but can only be perceived by sharing the same space and time.

This is why Thinking Machines' path diverges from mainstream AI companies today.

While OpenAI, Anthropic, and Google are all pushing models toward a "background asynchronous execution" agent paradigm, Thinking Machines has gone in the opposite direction. Rather than removing people from the loop and handing tasks over to AI, it lets AI enter the human's loop and stay continuously present.

Of course, Thinking Machines' system isn't perfect. The cost of a 276B MoE model is not something everyone can afford. The 200ms micro-turn demands far more from inference infrastructure than current mainstream solutions. Training everything from scratch means there are no off-the-shelf pretrained encoders to reuse.

But if this thesis holds true, continuous presence isn't just an upgrade in interaction experience; it's an expansion of the boundaries of AI intelligence.

Once AI enters the complete human loop, it might, for the first time, gain access to a "genuine workflow."
