Goodbye to walkie-talkie AI!
MianBi Intelligence's MiniCPM-o 4.5 full-modal model enables instant, free conversation with AI.
With only 9B parameters, MiniCPM-o 4.5 achieves full-duplex interaction on the device: it actively sees, listens, and speaks, so you can engage the AI at any time, interrupt it at any time, and get an instant response.
Until now, interacting with an AI assistant has meant finishing what you want to say, waiting while it processes and replies, and only then speaking again.
This fragmented interaction experience has become history with MiniCPM-o 4.5.
As a native full-duplex full-modal large model, it packs industry-leading visual and speech capabilities into a 9B parameter size and has learned to 'multi-task' like humans.
It can continue listening to your interruptions while speaking, and can proactively initiate commentary while observing video streams. This capability for instant, free conversation gives AI a true interactive soul.
Full-Duplex Reconstruction of Human-Computer Interaction Sensory Experience
MiniCPM-o 4.5 introduces a full-duplex multimodal real-time streaming mechanism, making the input and output of visual, audio, and text like three parallel highways that do not block each other.
Even while the model is holding forth on a complex physics concept for you, its 'eyes' are still watching the changes in the video stream, and its 'ears' are still catching the questions you throw in.
This experience is no longer issuing instructions to a machine, but communicating with a mentally agile partner.
To achieve this smoothness akin to human instinct, MiniCPM-o 4.5 employs an extremely sophisticated time-division multiplexing mechanism.
It synchronizes all input and output streams on a millisecond-level timeline, slicing the parallel full-modal streams into tiny periodic time slices.
The backbone of the language model rapidly switches processing tasks within these extremely short time slices, presenting a perfect fusion of 'seeing, listening, and speaking' simultaneously on a macro level.
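To make this concrete, here is a minimal, purely illustrative sketch (not the model's actual code; all names and numbers are hypothetical) of how parallel modality streams can be sliced into periodic time slices and consumed in order on a single timeline:

```python
# Purely illustrative sketch of time-division multiplexing over parallel
# modality streams. This is NOT the MiniCPM-o 4.5 implementation; all names
# and numbers here are hypothetical.
from dataclasses import dataclass
from typing import Dict, Iterator, List

@dataclass
class Chunk:
    t_ms: int        # start of the time slice, in milliseconds
    modality: str    # "video", "audio", or "text"
    payload: str     # stand-in for the real feature tensor

def time_slice(streams: Dict[str, object], slice_ms: int, horizon_ms: int) -> Iterator[List[Chunk]]:
    """Yield one list of chunks per time slice, merging all modalities."""
    for t in range(0, horizon_ms, slice_ms):
        yield [
            Chunk(t, name, f"{name}-features@{t}ms")  # every stream contributes to every slice
            for name in streams
        ]

if __name__ == "__main__":
    streams = {"video": None, "audio": None, "text": None}
    for chunks in time_slice(streams, slice_ms=100, horizon_ms=500):
        # The LLM backbone would consume these slices in order, switching between
        # modalities so fast that, at a macro level, they appear simultaneous.
        print([f"{c.modality}@{c.t_ms}ms" for c in chunks])
```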
Proactive interaction is the most captivating feature brought by this technological revolution.
Previous models were like puppets that would not move unless poked, relying on external tools such as VAD (Voice Activity Detection) to determine whether the user had stopped speaking before deciding whether to start replying.
MiniCPM-o 4.5 internalizes this judgment as an intuition of the model.
It continuously monitors the video and audio streams and, at a frequency of 1 Hz, its brain makes a decision every second: does the current scene require me to speak?
This high-frequency autonomous decision-making capability gives it the spirit of being 'all-seeing and all-hearing'.
When you are busy in the kitchen wearing smart glasses, holding a bottle of expired soy sauce and hesitating, without you needing to issue the instruction 'Hey, help me look at this', MiniCPM-o 4.5 captures this detail through the video stream and will proactively remind you: 'Pay attention to the expiration date, that bottle of soy sauce is no longer edible.'
This leap from passive response to proactive care elevates AI's presence from a tool to a partner.
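Conceptually, this proactive behavior boils down to a high-frequency "should I speak now?" loop over the fused video and audio context. The sketch below is only a rough illustration under that assumption; decide_to_speak is a hypothetical stand-in for the model's own learned judgment, not a real API:

```python
# Rough illustration of a 1 Hz proactive decision loop. This is not the actual
# MiniCPM-o 4.5 internals: decide_to_speak() is a hypothetical stand-in for the
# model's own learned judgment over the fused video/audio context.
import time
from typing import Callable, Optional

def decide_to_speak(video_frame, audio_chunk, context: str) -> Optional[str]:
    """Return a reply when the scene calls for one, otherwise None."""
    if "expired label in view" in context:  # toy condition standing in for the model's judgment
        return "Pay attention to the expiration date, that soy sauce is no longer edible."
    return None

def proactive_loop(get_video: Callable, get_audio: Callable, get_context: Callable, seconds: int = 10):
    for _ in range(seconds):
        frame, audio = get_video(), get_audio()               # streams are sampled continuously
        reply = decide_to_speak(frame, audio, get_context())  # one decision per second
        if reply is not None:
            print(reply)                                      # in practice: start speaking proactively
        time.sleep(1.0)                                       # 1 Hz decision frequency
```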
Vocal expressiveness is another piece of the puzzle for building a real sense of interaction.
Dry, mechanical electronic sounds cannot carry complex communication emotions.
MiniCPM-o 4.5 has undergone a comprehensive upgrade in speech generation through new data methods. It no longer just reads text aloud but can automatically select the most appropriate tone and voice based on context.
To tackle common problems in long-form speech synthesis, such as voice drift and tonal discontinuity, the model adopts an interleaved modeling scheme for text and speech tokens. The design also supports full-duplex real-time generation, so even for speech lasting over a minute, the voice remains stable, human-like, and expressive.
It also has voice cloning capability. From just a short reference audio clip, MiniCPM-o 4.5 can quickly capture the speaker's vocal characteristics and reproduce them faithfully in subsequent conversations.
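As a rough mental model of interleaving (the actual token schedule and run lengths are assumptions, not taken from the model), short runs of text tokens can alternate with the speech tokens that realize them, so a long utterance never drifts far from its textual anchor:

```python
# A rough mental model of interleaving text and speech tokens (the real token
# schedule and run lengths are assumptions, not taken from the model): short
# runs of text tokens alternate with the speech tokens that realize them.
from typing import List

def interleave(text_tokens: List[str], speech_tokens: List[str],
               text_run: int = 4, speech_run: int = 12) -> List[str]:
    """Merge the two token streams into one sequence of alternating runs."""
    merged, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        merged += text_tokens[ti:ti + text_run]       # a short run of text keeps the anchor
        ti += text_run
        merged += speech_tokens[si:si + speech_run]   # then the speech tokens that realize it
        si += speech_run
    return merged

print(interleave([f"t{i}" for i in range(8)], [f"s{i}" for i in range(24)]))
```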
Thus, you can specify it to use a specific cloned voice for role-playing. No complex fine-tuning is needed; a simple prompt during the inference stage can complete customization.
This makes it more capable and more flexible than many dedicated TTS (Text-to-Speech) tools on the market.
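For orientation, a voice-cloning call might look roughly like the following. The checkpoint name comes from the references below, but the chat() call and the way the reference clip is passed are assumptions; consult the MiniCPM-o repository for the actual interface:

```python
# Hedged sketch of prompt-based voice cloning at inference time. The checkpoint
# name is taken from the references below, but the chat() call and the way the
# reference clip is passed are assumptions -- consult the MiniCPM-o repository
# for the actual interface.
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("openbmb/MiniCPM-o-4_5", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-o-4_5", trust_remote_code=True)

# A short reference clip is enough to capture the target voice characteristics.
ref_audio, _ = librosa.load("reference_voice.wav", sr=16000, mono=True)

# Hypothetical prompt: ask the model to role-play in the cloned voice; no
# fine-tuning is involved, only an inference-time instruction.
msgs = [{"role": "user",
         "content": [ref_audio, "Use this voice to play a museum guide and greet me."]}]
response = model.chat(msgs=msgs, tokenizer=tokenizer)  # assumed signature
print(response)
```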
High-Density Explosion with Small Parameters
The power of MiniCPM-o 4.5 stems from an extremely efficient end-to-end architecture design.
Into a model with only 9B parameters in total, it integrates top components from several fields, including SigLip2 (vision), Whisper-medium (audio understanding), CosyVoice2 (speech generation), and Qwen3-8B (language).
It does not adopt a loose plug-in combination but tightly stitches the encoders and decoders of each modality with the large language model through dense features.
This end-to-end design philosophy solves the information loss problem in traditional multimodal systems.
In non-end-to-end systems, visual signals often need to be translated into text descriptions first before being handed over to the language model for processing. This 'translation' process inevitably loses details.
MiniCPM-o 4.5's visual and auditory signals directly enter the brain of the language model in the form of features, achieving lossless information flow.
This high-density capability is particularly astonishing in visual understanding.
MiniCPM-o 4.5 achieved an average score of 77.6 in OpenCompass, a comprehensive evaluation covering 8 mainstream benchmarks.
This score not only surpasses large proprietary models like GPT-4o and Gemini 2.0 Pro but also approaches the level of Gemini 2.5 Flash.
For a model that can run on-device, this is an almost unimaginable leap across weight classes.
The delicacy of visual processing directly determines the upper limit of the model in practical applications.
MiniCPM-o 4.5 supports processing high-resolution images up to 1.8 million pixels and can parse them at any aspect ratio.
Whether it's a long receipt, a wide panoramic view, or a document dense with details, it can handle them with ease.
On the OmniDocBench leaderboard, it achieved SOTA (State-of-the-Art) performance on end-to-end English document parsing, ahead of Gemini-3 Flash, GPT-5, and the dedicated OCR model DeepSeek-OCR 2.
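As a small illustration of what a 1.8-megapixel, any-aspect-ratio budget means in practice, the helper below (not part of the model's code) downscales an image only when it exceeds that budget, preserving its original ratio:

```python
# Small illustrative helper (not part of the model's code): scale an image of
# arbitrary aspect ratio down only if it exceeds a ~1.8-megapixel budget,
# preserving its original ratio.
from PIL import Image

MAX_PIXELS = 1_800_000

def fit_to_budget(path: str) -> Image.Image:
    img = Image.open(path)
    w, h = img.size
    if w * h <= MAX_PIXELS:
        return img                               # within budget: keep native resolution
    scale = (MAX_PIXELS / (w * h)) ** 0.5        # uniform scale that meets the pixel budget
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
```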
Video understanding capability is also a major highlight of this upgrade.
The model can efficiently process high-frame-rate video streams up to 10fps (Frames Per Second). What it sees is no longer a series of intermittent slides, but a smooth, continuous dynamic world.
This high-refresh visual understanding capability is the foundation for achieving proactive interaction. Only by seeing clearly and keeping up can it react at the most appropriate moment.
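A simple way to picture the 10fps input is a sampler that keeps roughly every n-th frame of a source video; the OpenCV-based sketch below only illustrates the data rate, it is not the model's own video loader:

```python
# Illustration of the 10fps input rate using OpenCV (this is not the model's
# own video loader): keep roughly every n-th frame so the stream arrives at
# about 10 frames per second regardless of the source frame rate.
import cv2

def sample_at_10fps(path: str):
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if fps metadata is missing
    step = max(int(round(src_fps / 10.0)), 1)     # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)                  # these frames would go to the visual encoder
        idx += 1
    cap.release()
    return frames
```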
The visual encoder and the audio encoder act like two sensitive antennae, continuously feeding what they capture of the outside world to the LLM (Large Language Model) in the middle.
Tests on MMHal-Bench show that its behavior is highly trustworthy, with a low hallucination rate on par with Gemini 2.5 Flash.
It supports multilingual capabilities in over 30 languages, enabling it to cross cultural barriers and play a role in broader global scenarios.
Text capabilities are also top-tier among models of the same parameter size.
This 'Almighty Small Cannon' character reflects MianBi Intelligence's relentless pursuit of the model's 'energy efficiency ratio': it maintains SOTA-level multimodal performance with lower VRAM usage and faster responses, delivering higher inference efficiency at lower inference cost.
At the golden size of 9B, it covers comprehensive capabilities including visual understanding, document parsing, voice conversation, and voice cloning, achieving true All-in-One.
Streaming Full-Modality for Inclusive Terminals
MiniCPM-o 4.5 has put serious work into ease of use and deployment flexibility, making it a practical tool ready to enter thousands of households.
For developers and geeks, MiniCPM-o 4.5 offers an extremely rich set of ways to run it.
It fully supports llama.cpp and Ollama, enabling efficient CPU inference on ordinary personal computers and even high-performance phones.
To accommodate the VRAM limits of different hardware, the team officially provides 16 quantized models of different sizes in int4 and GGUF formats.
Whether your device is a top-tier workstation or an old laptop from a few years ago, you can always find a suitable version to run.
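For example, with the llama-cpp-python bindings, a quantized GGUF file can be loaded for plain CPU inference; the file name below is a placeholder for whichever quantization level fits your hardware, and text-only usage is shown for simplicity:

```python
# Example using the llama-cpp-python bindings with one of the published GGUF
# quantizations. The file name is a placeholder; pick whichever quantization
# level fits your hardware. Text-only usage is shown for simplicity.
from llama_cpp import Llama

llm = Llama(model_path="MiniCPM-o-4_5-Q4_K_M.gguf", n_ctx=4096)  # runs on CPU by default
out = llm("Describe what full-duplex interaction means, in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```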
For high-throughput production scenarios, support for vLLM and SGLang lets the model serve at scale in a memory-efficient manner.
For users who want to deploy on domestic chips, support for FlagOS breaks down the barriers between multiple domestic chip platforms, delivering cross-platform, end-to-end gains in inference performance.
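An offline serving sketch with vLLM's Python API might look like this; whether this particular checkpoint is supported out of the box depends on your vLLM version, so treat the snippet as something to verify against the project documentation:

```python
# Offline serving sketch with vLLM's Python API. Whether this checkpoint is
# supported out of the box depends on your vLLM version, so verify against the
# project documentation before relying on it.
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/MiniCPM-o-4_5", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the benefits of on-device inference."], params)
print(outputs[0].outputs[0].text)
```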
To let developers experience the charm of full-duplex real-time streaming as soon as possible, MianBi Intelligence has also open-sourced a high-performance inference framework named llama.cpp-omni, along with a WebRTC Demo.
Localized deployment capabilities have irreplaceable advantages for privacy protection and response speed.
Your video streams and voice data do not need to be uploaded to the cloud; all processing is completed locally.
For users who want the model to better understand specific domain knowledge, support for LLaMA-Factory makes fine-tuning simple and direct.
Whether it's professional terminology in vertical industries or interaction logic in specific scenarios, it can be quickly adapted through low-cost fine-tuning.
MiniCPM-o 4.5 shows us a new direction for the evolution of AI hardware forms.
It can be the soul of smart glasses, telling you in real-time what you see; it can be the brain of a robot, autonomously navigating in complex environments and communicating with people; it can be the core of a car assistant, providing truly thoughtful proactive suggestions during driving.
Free Experience:
https://minicpm-omni.openbmb.cn/
https://huggingface.co/spaces/openbmb/MiniCPM-o-4_5-Demo
References:
https://github.com/OpenBMB/MiniCPM-o
https://huggingface.co/openbmb/MiniCPM-o-4_5