How Do You Evaluate the Interaction Model Recently Released by Thinking Machines? - wangleineo's Answer

TML should not be underestimated. The On-Policy Distillation they proposed last year has essentially become the industry consensus.

DeepSeek v4 is currently training with it, and many of the latest frontier models from major labs are also using this method to improve training efficiency and model performance.

The Interaction Model they just released is an even more systematic design:

It combines On-Policy Distillation with other techniques like multi-agent training and process reward models to create a complete system architecture.

Its main logic is to shift from simply relying on scaling up compute power to pursuing more efficient interaction mechanisms. This way, models can learn more complex and deeper reasoning abilities with limited computational resources.

This idea is very compelling. Many labs are racking their brains trying to achieve better results with less computation, and TML has clearly found a viable path with very sophisticated engineering.


This just-released Interaction Model is also well worth discussing; it potentially opens a new paradigm for AI models.

First, the biggest innovation of this model is "streaming interaction." The large language models we're used to, whether text models or end-to-end speech models, are "Turn-Based": we send a request, and the model replies. For text this goes without saying; end-to-end speech models generally rely on a voice-activity-detection (VAD) model to detect when the user stops speaking and segment the turns accordingly.

Diagram comparing traditional Turn-Based interaction with TML's Streaming interaction

TML's new model, however, is Streaming. The user's tokens flow continuously to the model, and the model's tokens also flow back without interruption:

GIF illustrating the continuous, bi-directional token flow in the Interaction Model

The interaction mode of typical LLMs is like traditional HTTP, where one Request corresponds to one Response. This Interaction Model, however, is like a WebSocket, with continuous, bi-directional data streams.

The advantage of this is that the model can react instantly to what happens during the interaction. For example, TML co-founder Lilian Weng demonstrated the model counting the animals in a story as it was being told:

Video: Count The Animals
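To make the contrast concrete, here is a toy sketch in Python. It is not TML's implementation: the `ANIMALS` vocabulary, the function names, and the word-level "stream" are all assumptions made for illustration. The turn-based function can only answer once the whole story has arrived, while the streaming one updates its answer as soon as each animal is heard.

```python
from typing import Iterable, Iterator

# Hypothetical vocabulary for the toy example (not from the article).
ANIMALS = {"cat", "dog", "fox", "owl"}


def turn_based_count(words: Iterable[str]) -> int:
    """Turn-based: consume the whole utterance, then answer once."""
    return sum(1 for w in words if w in ANIMALS)


def streaming_count(words: Iterable[str]) -> Iterator[int]:
    """Streaming: emit an updated count the moment an animal is heard."""
    count = 0
    for w in words:
        if w in ANIMALS:
            count += 1
            yield count


story = "the fox saw a dog chase a cat past the owl".split()

final_answer = turn_based_count(story)         # available only at the end
running_answers = list(streaming_count(story)) # one update per animal heard
```

The final count is the same either way; what differs is when the listener gets to hear it.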

You can imagine many similar application scenarios, such as:

"Please provide simultaneous interpretation as I speak."
"Give live commentary on a sports match."
"Interrupt me immediately when I make a mistake."

The method for achieving "streaming interaction" isn't the complete elimination of turns, but rather slicing turns into very fine granularity, with each Micro-Turn containing only 200ms of data:

Technical diagram showing a 200ms Micro-Turn window for audio streaming

However, such frequent Micro-Turns require a prefill phase for every turn, which demands very low latency. TML designed a "streaming session": the client sends each 200-millisecond data chunk as a separate request, and the inference server appends these chunks to a persistent sequence in GPU memory. This avoids frequent memory operations.
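The idea can be sketched roughly as follows. This is a minimal illustration, not TML's server code: the 16 kHz sample rate, the `StreamingSession` class, and its fields are assumptions, and a plain Python list stands in for the persistent sequence in GPU memory. The point is that each request only has to process the newly appended chunk, not the whole history.

```python
from dataclasses import dataclass, field
from typing import List

SAMPLE_RATE_HZ = 16_000  # assumed audio sample rate (not stated in the article)
CHUNK_MS = 200           # Micro-Turn length given in the article
CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000


def slice_micro_turns(samples: List[float]) -> List[List[float]]:
    """Client side: cut the audio into consecutive 200 ms chunks."""
    return [samples[i:i + CHUNK_SAMPLES]
            for i in range(0, len(samples), CHUNK_SAMPLES)]


@dataclass
class StreamingSession:
    """Server side: stands in for a persistent sequence kept in GPU memory."""
    sequence: List[List[float]] = field(default_factory=list)
    prefilled: int = 0  # number of chunks already processed

    def append(self, chunk: List[float]) -> int:
        """Append one chunk and prefill only what is new.

        Returns how many chunks had to be processed for this request.
        """
        self.sequence.append(chunk)
        newly_processed = len(self.sequence) - self.prefilled
        self.prefilled = len(self.sequence)
        return newly_processed


one_second = [0.0] * SAMPLE_RATE_HZ
chunks = slice_micro_turns(one_second)

session = StreamingSession()
work_per_request = [session.append(c) for c in chunks]  # one request per chunk
```

With these assumed numbers, one second of audio becomes five requests, and each request does one chunk's worth of prefill no matter how long the session has been running.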

The model's second characteristic is that it is multimodal and full-duplex. The input can be speech, video, or text. It uses Early-Fusion without a heavy encoder layer:

Architecture diagram showing multimodal Early-Fusion for speech, video, and text inputs

Compared to voice-only full-duplex models like GPT-realtime, adding vision allows it to handle a broader range of tasks.

The third feature is that it is not a single monolithic model, but a system with two cooperating models:

System diagram showing the front-end and back-end model architecture

This system has a front-end and a back-end: the front-end streaming model is fast and responsible for interaction and some simple real-time thinking; the back-end traditional large model handles complex problems and tool calling. This model coordination is much like the advisor mode in Claude Code:

Advisor Mode: Ask a higher-level model for help when stuck

Similar to the Advisor mode, the front-end model must also possess a certain level of intelligence, at least knowing when it needs to call the back-end model. Therefore, this interactive small model isn't exactly small:

The current TML-Interaction-Small is a 276B parameter MoE with 12B active.

This model is very suitable for voice assistants handling complex tasks: the front-end model maintains continuous communication with the user, making the user feel the AI assistant is always online, while the back-end model does the heavy lifting and then seamlessly delivers the result through the front-end model.
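A rough sketch of this division of labor, in Python. Everything here is hypothetical: the `FrontEnd` class, the keyword-based `needs_backend` check (in the real system that judgment would come from the model itself), and the `backend_solve` placeholder. It only illustrates the control flow: the front-end answers every Micro-Turn, hands hard tasks to a background worker, and relays the result once it is ready.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Optional


def backend_solve(task: str) -> str:
    """Placeholder for the slow, more capable back-end model (hypothetical)."""
    return f"result for: {task}"


class FrontEnd:
    """Toy front-end: always responds, delegates heavy work in the background."""

    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self.pool = pool
        self.pending: Optional[Future] = None

    def needs_backend(self, utterance: str) -> bool:
        # Toy heuristic, an assumption for this sketch only.
        return "search" in utterance or "calculate" in utterance

    def on_micro_turn(self, utterance: str) -> str:
        # Relay a finished back-end result as soon as it is available.
        if self.pending is not None and self.pending.done():
            result = self.pending.result()
            self.pending = None
            return result
        # Delegate a hard request without blocking the conversation.
        if utterance and self.pending is None and self.needs_backend(utterance):
            self.pending = self.pool.submit(backend_solve, utterance)
            return "Let me look into that."
        # Otherwise keep the interaction alive with a light response.
        return "Mm-hm." if utterance else ""


with ThreadPoolExecutor(max_workers=1) as pool:
    front = FrontEnd(pool)
    first = front.on_micro_turn("please search for flights to Tokyo")
    front.pending.result()  # demo only: wait until the back-end has finished
    second = front.on_micro_turn("")
```

In the demo, the first reply is an immediate acknowledgement and the second is the back-end's answer, delivered through the front-end.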

Last year, after experiencing the Sesame voice model, I felt the movie "HER" had been realized; now, this interactive model with vision and a front-end/back-end architecture seems like the completed version of HER.

This model far outperforms other similar models in interactivity:

Benchmark chart showing TML model outperforming GPT-realtime and Gemini in interactivity metrics

The official article concludes with a small demo showing several tasks requiring real-time reactions:

The user continuously changes hand gestures, and the model must accurately state how many fingers the user is holding up.
Real-time currency conversion, converting each phrase of "XX dollars" into euros.
Starting a stopwatch and reporting to the user how long an event took.

Only TML's model could complete these tasks; other models, including GPT-realtime and Gemini, all failed.

Previous full-duplex voice models often compromised on intelligence. For example, people online have tested the ChatGPT voice model with all sorts of questions, and it made simple mistakes that the text model typically would not. Will TML's dual-model architecture resolve this issue, with the front-end handling fast responses and the back-end handling intelligence?

Another thing I'm curious about is how TML will productize this model. Will it be packaged as a personal AI companion like Samantha in "HER"? Or will it provide voice Agent services for businesses? The answer should be revealed soon.

Judging by the style of the blog post, at least, the emphasis is on combining engineering techniques rather than on the superiority of any single algorithm. This includes but is not limited to:

Implementing dynamic wake words (similar to a reasoning version of Siri or Alexa wake-up).
Omni-level intent recognition.
Separating the front-end and back-end models to enable background reasoning sessions (including background contemplation and background timing).
Transforming the two-way user-model stream into standard sequential Transformer input through 200ms chunks.
Achieving efficient inference through queuing within fixed GPU memory (276B-A12B, i.e. about 4.3% of parameters active, on par with the largest Qwen3.5), ultimately reducing FD-bench latency to 0.4 seconds, a significant step forward in usability compared to previous results.
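Two of the points above can be illustrated with a few lines of Python. This is only a sketch under assumptions: the `interleave` function, the string placeholders for chunks, and the user-then-model ordering within each 200ms slot are not taken from the blog. It shows how two parallel streams can be flattened into one ordinary sequence, and where the 4.3% activation figure comes from.

```python
from typing import List, Tuple


def interleave(user_chunks: List[str],
               model_chunks: List[str]) -> List[Tuple[str, str]]:
    """Flatten two parallel streams into one sequence, slot by slot.

    Assumption for this sketch: within each 200 ms slot the user's chunk
    comes first, followed by the model's chunk for the same slot.
    """
    sequence: List[Tuple[str, str]] = []
    for user, model in zip(user_chunks, model_chunks):
        sequence.append(("user", user))
        sequence.append(("model", model))
    return sequence


# Placeholder chunk labels; real chunks would be audio/video/text tokens.
seq = interleave(["u0", "u1", "u2"], ["m0", "m1", "m2"])

# Activation ratio of a 276B-parameter MoE with 12B active parameters.
TOTAL_PARAMS_B = 276
ACTIVE_PARAMS_B = 12
activation_ratio = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
activation_percent = round(activation_ratio * 100, 1)
```

The resulting `seq` is a single left-to-right sequence that a standard Transformer could consume, and the ratio 12 / 276 rounds to the 4.3% quoted in the list.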

This shows that, in TML's view, these engineering implementations are a truer show of strength than pure model scores.

