Unbelievable… FaceWall Had AI Write a Training Framework, and It Trained the Strongest 1B Model by Itself: MiniCPM5-1B

FaceWall has just released MiniCPM5-1B, now the most powerful on-device text LLM in the 1B-parameter class.

On the Artificial Analysis Intelligence Index (AA Intelligence Index), MiniCPM5-1B scored 17.9 points, ranking first among all small-scale models.

What makes this release particularly special is that the Base Model version of MiniCPM5-1B was trained using a training framework called ForgeTrain, which was entirely written by an AI.

Moreover, this framework runs 10% faster on an Nvidia H100 than Nvidia's own Megatron framework.

An AI-written framework trained the strongest model of its size, and it is even faster than Nvidia's own. This is a key direction FaceWall is actively pursuing:

Using AI to build AI.

MiniCPM5-1B

In terms of performance, MiniCPM5-1B outperforms models of similar size, such as Qwen3.5-0.8B, Qwen3-0.6B, and LFM2.5-1.2B-Thinking, across general knowledge, mathematical reasoning, coding, logical reasoning, and tool calling.

Moreover, MiniCPM5-1B even defeated Qwen3.5-2B (which scored 16.3 points), a model with double the parameters, to take first place among small models on the Artificial Analysis Intelligence Index, continuing the "Little Steel Cannon" series' tradition of punching above its weight.

The result is easier to see when the intelligence index and parameter count are plotted on the same chart: MiniCPM5-1B lands in the top-left "best quadrant," with the smallest size and the highest score.

MiniCPM5-1B has once again raised the ceiling on model intelligence density: at only 1B parameters, it has surpassed every model under 2B parameters on the internationally renowned AA-Index. Compared with Qwen3.5-2B, released three months earlier, MiniCPM5-1B scores higher with half the parameters.

This further validates the density law that FaceWall has been consistently observing:

The intelligence density of large models is continuously increasing, roughly doubling every 3.5 months. Smaller models are carrying higher intelligence density.
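As a rough illustration of what this doubling claim implies, the sketch below assumes density doubles exactly every 3.5 months, so the parameter count needed for a fixed capability halves on the same schedule. The function name and the clean exponential form are assumptions made for illustration, not FaceWall's published formula.

```python
DOUBLING_PERIOD_MONTHS = 3.5  # doubling period quoted in the article


def equivalent_params(params_billion: float, months_elapsed: float) -> float:
    """Parameters (in billions) needed to match an older model's capability,
    assuming density doubles every DOUBLING_PERIOD_MONTHS (idealized model)."""
    density_gain = 2 ** (months_elapsed / DOUBLING_PERIOD_MONTHS)
    return params_billion / density_gain


# A 2B model from three months ago, projected forward under this assumption.
projected = equivalent_params(2.0, 3.0)
print(f"{projected:.2f}B parameters")
```

Under this idealized assumption, a 2B model from three months ago corresponds to roughly 1.1B parameters today, which is in line with the Qwen3.5-2B comparison above.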

In terms of inference efficiency, at equivalent intelligence levels, MiniCPM5-1B consumes the fewest output tokens.

Detailed scores on other benchmarks are as follows, covering 13 evaluations including GDPval-AA, Terminal-Bench Hard, SciCode, IFBench, and GPQA Diamond:

Being on-device friendly is a tradition for the "Little Steel Cannon" series, and this time, the deployment threshold for MiniCPM5-1B is so low that it's practically non-existent:

FP16: Weights are about 2GB; suitable for GPUs or high-end laptops, with zero quantization loss.
INT8: About 1GB; runs on laptops and edge boxes, with nearly no loss.
INT4: About 0.5GB; works on phones, tablets, and vehicles, with nearly no loss.
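These sizes follow from simple arithmetic on bits per parameter. The sketch below assumes exactly 1e9 parameters (the "1B" label is nominal) and counts only raw weight storage; the helper name is hypothetical, and real quantized files also carry scales and metadata.

```python
BITS_PER_PARAM = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Ignores quantization scales, any layers kept at higher precision,
    and runtime buffers such as the KV cache.
    """
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


for precision in BITS_PER_PARAM:
    print(f"{precision}: ~{weight_size_gb(1e9, precision):.1f} GB")
```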

It can run on a CPU, and it can run in a browser too. Using FaceWall's previously released self-developed CPU inference framework, ArcLight, paired with the INT4 quantized version of MiniCPM5-1B, you can run inference offline in any browser.

A capable 1B model can also drive a range of applications, such as a "desktop pet" that anyone can keep. For this release, FaceWall also built a desktop pet project (adapted from clawd-on-desk), turning MiniCPM5-1B into a little AI pet that lives on your desktop.

With its 1B size, it can run on practically any hardware, making it a pet that "everyone can afford to keep."

In terms of deployment and fine-tuning, model inference supports mainstream frameworks like vLLM, SGLang, llama.cpp, Ollama, LM Studio, and MLX. Fine-tuning supports LLaMA-Factory, ms-swift, unsloth, xtuner, and TRL+PEFT.

FaceWall also provides Claude Code skills, so you can hand the job to `cc` and have it deploy and fine-tune FaceWall models in a single click.

Data Governance

A key reason why MiniCPM5-1B has achieved such outstanding results at the 1B scale lies in its data governance.

FaceWall designed a tiered data governance approach that classifies data quality into five levels, from L0 to L4, filtering and refining the data stage by stage. It is not simply a matter of "the more data, the better"; each level has its own strategy, at its own granularity, for cleaning, deduplication, and synthesis.

The core dataset, Ultra-FineWeb-L3, has been open-sourced alongside the model.

It takes Ultra-FineWeb, the trillion-token high-quality dataset from MiniCPM4 training (an L2-level, finely filtered dataset), as its seed. Synthesis and augmentation in a variety of styles and formats were then applied on top, producing the key training fuel for MiniCPM5-1B's annealing phase.
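The stage-by-stage refinement described above can be pictured as a chain of filters in which each tier is the output of one more stage. The sketch below is only an illustration: the three stage functions, their thresholds, and the name `run_tiers` are hypothetical, and the article does not say which operations FaceWall assigns to each of L0 through L4. The synthesis step that produces L3 is omitted.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


def clean(docs: List[str]) -> List[str]:
    """Hypothetical cleaning stage: trim whitespace and drop empty documents."""
    return [d.strip() for d in docs if d.strip()]


def deduplicate(docs: List[str]) -> List[str]:
    """Hypothetical exact-match deduplication that keeps first occurrences."""
    return list(dict.fromkeys(docs))


def quality_filter(docs: List[str]) -> List[str]:
    """Hypothetical quality filter: keep documents with at least five words."""
    return [d for d in docs if len(d.split()) >= 5]


def run_tiers(raw_docs: List[str], stages: List[Stage]) -> List[List[str]]:
    """Apply stages in order; tiers[i] holds the data after i stages (tiers[0] is raw)."""
    tiers = [list(raw_docs)]
    for stage in stages:
        tiers.append(stage(tiers[-1]))
    return tiers


# Made-up sample documents, for illustration only.
raw = [
    "  The quick brown fox jumps over the dog. ",
    "",
    "too short",
    "The quick brown fox jumps over the dog.",
]
tiers = run_tiers(raw, [clean, deduplicate, quality_filter])
for level, docs in enumerate(tiers):
    print(f"tier {level}: {len(docs)} documents")
```

Keeping every intermediate tier, rather than only the final output, mirrors the idea that a lower tier (such as the L2 data) can later serve as the seed for a higher one.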

Ultra-FineWeb-L3 data scale:
Total: more than 1T tokens.
English: 680B+ tokens.
Chinese: 410B+ tokens, the largest open-source synthetic pre-training dataset for Chinese.

For teams working on Chinese models, this dataset is immensely valuable. Moreover, the technical report on data governance has also been publicly released (see links at the end of the article).

ForgeTrain

So how was MiniCPM5-1B trained?

This is another major highlight: The Base Model version of the new model did not use Megatron or any other human-written framework during the pre-training stage. It used ForgeTrain, a training framework entirely written by an AI.

To put it in more familiar, and more precise, terms:

The world's first, fully AI-written, production-grade large model training framework.

The term "production-grade" is crucial.

Nvidia Labs previously released VibeTensor (with participation from Tianqi Chen, Yangqing Jia, and others), claiming to be "the first deep learning system entirely generated by AI."

However, it is 1.7 to 6.2 times slower than PyTorch, and Nvidia officially labeled it "do not use in production environments." Fundamentally, it remains a research prototype.

ForgeTrain, on the other hand, ran 10% faster than Megatron on Nvidia H100s with fully aligned accuracy (both human and automated evaluations match the original model), and it completed model training stably after running continuously for several days.

It is faster than Nvidia's own framework; it's not just "usable," it's "better to use."

No human intervened at any point in the coding process. After pressing start... the AI wrote the code on its own for a day or two. Acceptance criteria were defined in advance, and once the AI confirmed that they were met, a human could take the result and use it directly.

FaceWall has reportedly already used the same method internally to tackle the MindSpeed framework for an 8B model on Huawei Ascend, as well as MoE and other more complex architectures. The technique is highly reusable, and scaling it to larger models might take only a month or two.

Forge Engineering

Behind ForgeTrain's impressive results is a programming paradigm that FaceWall calls Forge Engineering: a paradigm of custom-built software.

It might sound unfamiliar, but the core idea is simple:

Traditional training frameworks like Megatron aim to support many architectures at once, such as Qwen, DeepSeek, and MoE, cramming everything into a single framework. Like a generic smartphone that tries to meet everyone's needs, such a framework inherently requires compromises.

But what if the cost of AI writing code approaches zero?

The model architectures of Qwen and DeepSeek differ significantly. There is no need to force a single universal solution; instead, you can write a separate framework from scratch for each and then optimize each one to the extreme. This is exactly what was done here: the framework for MiniCPM was written from scratch.

All code is custom-built on-site for specific needs.

To use an analogy, current generic frameworks are like an Apple iPhone: one product designed to serve everyone. The future of Forge Engineering is akin to having Steve Jobs sit next to you, crafting a unique phone that perfectly meets your personal needs.

OpenAI previously proposed a similar concept called Harness Engineering, which also automates the evaluation process. But Forge Engineering goes a step further: all code is left to the AI, built on demand, and discarded after use. With the same acceptance criteria, changing the scenario or the chip allows the AI to forge a brand new implementation.

Regarding ForgeTrain's development process, FaceWall has made public a three-step methodology:

STEP 1: Set the Exam Outline. Gather key data from existing frameworks like Megatron to define the acceptance standards.
STEP 2: Ensure a Passing Grade First. Have the AI write a framework under these standard constraints that produces training results perfectly identical to the original.
STEP 3: From Passing to Surpassing. Lift the restrictions and let the AI iterate and optimize freely until it outperforms Megatron.
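The three steps above amount to an automated acceptance gate: first match the reference framework's training results, then beat its throughput. The sketch below is one minimal way to express such a gate. All function names, the relative tolerance, and the sample numbers are assumptions for illustration; the article does not describe FaceWall's actual criteria beyond results matching the original and speed exceeding Megatron.

```python
from typing import Sequence


def losses_aligned(reference: Sequence[float], candidate: Sequence[float],
                   rel_tol: float = 1e-3) -> bool:
    """Step 2 style check: every candidate loss is within rel_tol of the reference.

    The tolerance is an assumption; a stricter gate could require exact equality.
    """
    if len(reference) != len(candidate):
        return False
    return all(abs(c - r) <= rel_tol * abs(r) for r, c in zip(reference, candidate))


def speedup(reference_tokens_per_s: float, candidate_tokens_per_s: float) -> float:
    """Step 3 style metric: relative throughput gain over the reference framework."""
    return candidate_tokens_per_s / reference_tokens_per_s - 1.0


def accept(reference_losses: Sequence[float], candidate_losses: Sequence[float],
           ref_tps: float, cand_tps: float, min_speedup: float = 0.0) -> bool:
    """Pass only if training results match and throughput meets the required gain."""
    return (losses_aligned(reference_losses, candidate_losses)
            and speedup(ref_tps, cand_tps) >= min_speedup)


# Made-up loss curves and throughputs, for illustration only.
ref_losses = [2.910, 2.455, 2.198]
cand_losses = [2.911, 2.454, 2.198]
result = accept(ref_losses, cand_losses, ref_tps=100.0, cand_tps=110.0, min_speedup=0.05)
print("accepted" if result else "rejected")
```

Separating the correctness check from the speed check matches the order of the steps: a candidate that is fast but does not reproduce the reference results is rejected before throughput is even considered.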

Before Claude Code and Codex introduced the /goal feature, I often worked this way too, though mainly for writing engineering code or training small models. It never occurred to me to use it to build a better training framework...

Using AI to Build AI

Behind ForgeTrain and Forge Engineering is a fundamentally new R&D paradigm: using AI to build AI.

In my view, this matter has reached its most critical juncture.

Regardless of whether the scaling law has hit a wall, or whether compute, data, or power is reaching its limit, one variable in the formula has been underappreciated: the R&D cycle.

AI writes code 10 to 100 times more efficiently than humans. Applying this efficiency to AI R&D itself could compress the development cycle from 18 months down to 6 months, or even 1 month or 1 day.

To this end, FaceWall has also proposed a tiered system from L1 to L5, benchmarking against OpenAI's five-level AGI classification:

L1: Suggesting Ideas (GitHub Copilot).
L2: Assisting R&D (Claude Code, Codex, Cursor).
L3: End-to-End Closed-Loop Delivery (ForgeTrain).
L4: Recursive Self-Improvement, where AI improves AI itself.
L5: Autonomous Exploration, where AI defines its own research direction.

Currently, general-purpose AI coding is roughly at L3, but "using AI to build AI" is a step behind: L2 has only just stabilized and L3 is about to take off. ForgeTrain is a concrete realization of L3.

The Big Three are also exploring this direction: Claude solved, in one hour, an open math problem that humans had not been able to crack; a three-person team at OpenAI wrote a million lines of production-grade software with AI assistance; DeepMind had AI independently write a doctoral-level math paper.

Anthropic CEO Dario Amodei has even stated bluntly: Automating AI research is the strongest accelerator on the AGI timeline.

For China, this direction could be especially important. High-end chips continue to be restricted, with the US-to-China ratio of accelerator cards at roughly 10:1. Relying solely on piling up computing power is simply not viable.

Since the number of chips cannot be changed, the focus must shift to increasing the R&D efficiency of each chip. Using AI to build AI might currently be the most realistic pathway.

Domestic Hardware Adaptation

ForgeTrain has already been adapted for Huawei Ascend and has successfully trained MiniCPM5-1B on Huawei Ascend.

We know that Nvidia's true moat is arguably the CUDA software ecosystem. Jensen Huang has repeatedly emphasized that "Nvidia is essentially a software company." The developer ecosystem, algorithm libraries, and training frameworks built up over more than a decade make it incredibly hard to leave Nvidia once you start using it.

While Huawei chips have progressed rapidly in hardware, the software ecosystem has always been their biggest shortcoming. Every lab and every business unit has its own set of tools, leaving users unsure which one to use. Trying to get something done on a Huawei card often means finding pieces missing here and there.

It's not that no one has previously thought about solving this problem. Compilation frameworks like TVM worked on it for five to ten years with the goal of "write code once, run on all chips." The reality, however, was that it only achieved "runnable," far from "runs well." After all, the combinations of chip types and algorithms are simply too numerous; getting a universal solution to optimize every combination adequately remains extremely challenging.

Now, large models offer a new idea: since AI-written code now costs almost nothing, there is no need to maintain a clunky, one-size-fits-all framework. Instead, simply custom-build a dedicated implementation for each chip and each model on-site, which can even achieve optimal performance.

FaceWall's plan is to use AI, within this year, to rewrite all the poorly functioning software at every stage of large model training (pre-training, fine-tuning, reinforcement learning, quantization deployment, inference).

Given a new model, simply tell the system what needs to be trained, and the system will generate a corresponding framework for you.

One could say that, for making good use of domestic chips, ForgeTrain is perhaps the first step.
