Newly Open-Sourced Small Model with Under 1B Active Parameters Outperforms GPT-5 High-End Version in Math

Today I want to talk about ZAYA1-8B, a capable small model that Zyphra has just open-sourced. It's an interesting little model, and it was trained on AMD chips.

When we see news about large models, we tend to assume a model isn't worth taking seriously unless its parameter count is in the tens of billions. ZAYA1-8B takes a different route: it uses a Mixture of Experts architecture, so despite the 8B in its name, it activates fewer than 1 billion parameters for any given token at inference time.

Open-source address: https://huggingface.co/Zyphra/ZAYA1-8B

Yet with such a small size, when it comes to solving math problems and writing code, it manages to outperform behemoths that have dozens of times more parameters.

For example, on HMMT, a notoriously difficult math competition benchmark, it scored 89.6, surpassing well-known closed-source models such as the high-end version of GPT-5 and Claude 4.5 Sonnet.

A model with so few active parameters performs this strongly mainly because Zyphra focused intensely on one goal in its design: squeezing as much intelligence as possible out of every unit of compute and every parameter.

They made three quite clever modifications. The first is the CCA (Compressed Convolutional Attention) mechanism. Simply put, it's like installing a "filter" on the model: attention runs on a compressed representation, screening out useless information and keeping only the essence.

Combined with its Mixture of Experts architecture, it's loosely like assembling an AI team of specialists. When faced with a math problem, it calls the math expert; for coding tasks, it summons the programmer expert. In practice the routing happens token by token rather than topic by topic, but the effect is the same: only a few experts run for each input, so inference stays efficient.
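To make the "only a few experts run" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration, not Zyphra's implementation: the `ToyMoELayer` class, its sizes, and the single-matrix "experts" are all assumptions made up for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical MoE layer: a linear router plus several tiny experts."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = random_matrix(num_experts, dim, rng)
        # Each "expert" is a single dim x dim matrix (an assumption for brevity).
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, x):
        probs = softmax(matvec(self.router, x))
        # Keep only the top_k highest-scoring experts for this token.
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = chosen[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        out = [0.0] * len(x)
        for i in chosen:
            weight = probs[i] / norm  # renormalize over the chosen experts
            expert_out = matvec(self.experts[i], x)
            out = [o + weight * v for o, v in zip(out, expert_out)]
        return out, chosen

    def param_counts(self):
        """Return (total parameters, parameters touched per token)."""
        per_expert = len(self.experts[0]) * len(self.experts[0][0])
        router = len(self.router) * len(self.router[0])
        total = router + per_expert * len(self.experts)
        active = router + per_expert * self.top_k
        return total, active


layer = ToyMoELayer(dim=8, num_experts=16, top_k=1)
output, chosen = layer.forward([0.5] * 8)
total, active = layer.param_counts()
print(f"experts used: {chosen}, active params: {active} of {total}")
```

With 16 experts and one chosen per token, this toy layer touches 192 of its 1,152 parameters on each call, which is the same kind of gap as the one between a model's total and active parameter counts.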

Next, they replaced the router that selects experts: instead of a simple linear scoring layer, it is now a small multi-layer perceptron network. This way, the model doesn't get flustered and make mistakes when deciding who should handle the task.
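The difference between the two router styles can be sketched in a few lines. This is only an illustration under assumed shapes: the hidden size, the `tanh` activation, and the function names are hypothetical, and the article does not say what Zyphra's router looks like internally.

```python
import math
import random


def linear_router(x, weights):
    """One score per expert from a single matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mlp_router(x, w_in, w_out):
    """Two layers with a nonlinearity in between, then one score per expert."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def pick_expert(scores):
    """Index of the highest-scoring expert."""
    return max(range(len(scores)), key=lambda i: scores[i])


rng = random.Random(0)
dim, hidden_dim, num_experts = 4, 8, 3  # toy sizes, chosen arbitrarily

w_lin = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
w_in = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w_out = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_experts)]

token = [0.2, -1.0, 0.7, 0.1]
linear_choice = pick_expert(linear_router(token, w_lin))
mlp_choice = pick_expert(mlp_router(token, w_in, w_out))
print("linear router picks expert", linear_choice)
print("MLP router picks expert", mlp_choice)
```

The linear router can only separate tokens with flat decision boundaries, while the extra layer and nonlinearity let the MLP router carve out curved ones, at the cost of a few more parameters.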

They also added a learnable residual scaling gate, which fixed the issue of numerical divergence caused by the model being too deep, at an extremely small cost. With these three key moves, the foundation of the entire model became particularly streamlined and capable.
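Why a scaling gate helps with depth can be seen in a toy experiment. The sketch below assumes the gate takes the form `x + gate * f(x)`, which is one common way to write such a gate; the article does not give Zyphra's exact formulation. In a real model the gate would be a learned parameter, while here it is fixed by hand, and `sublayer` is a made-up stand-in for an attention or feed-forward block.

```python
import math


def sublayer(x):
    """Stand-in for a real block; it deliberately amplifies its input."""
    return [2.0 * v for v in x]


def run_stack(x, depth, gate):
    """Apply `depth` residual layers of the assumed form x + gate * f(x)."""
    for _ in range(depth):
        update = sublayer(x)
        x = [xi + gate * ui for xi, ui in zip(x, update)]
    return x


def norm(x):
    return math.sqrt(sum(v * v for v in x))


x0 = [1.0, -1.0]
# With gate = 1.0 each layer multiplies the activations by 3.
ungated = norm(run_stack(x0, depth=40, gate=1.0))
# With gate = 0.01 each layer multiplies them by only 1.02.
gated = norm(run_stack(x0, depth=40, gate=0.01))
print(f"ungated norm: {ungated:.3e}")
print(f"gated norm:   {gated:.3f}")
```

A single extra scalar per layer is all the gate costs here, which matches the article's point that the fix is cheap relative to the stability it buys.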

Now for its training hardware. Readers who don't follow the hardware side closely may not realize it, but this is truly significant. In the past, training a model of this caliber almost always meant relying on NVIDIA.

After all, a handful of companies dominate the GPU market. But ZAYA1-8B is an outlier; it was trained entirely on AMD hardware.

Using 1,024 AMD MI300X GPUs, they managed to train this powerful model to completion. This also shows that AMD's AI ecosystem is now maturing nicely.

In the future, when we work on AI training, we will have another choice. For us users, this is definitely good news, since competition tends to drive better price-performance.

However, what truly transformed this model was the set of extremely meticulous but also extremely effective post-training procedures they implemented afterward.

There were five steps in total, each providing the model with specialized tutoring. Initially, they taught it the basics of chatting and following instructions. Then, they started feeding it logic problems, teaching it to synthesize multiple candidate answers on its own.

Steps three and four were somewhat like an athlete undergoing a grueling conditioning camp, using reinforcement learning to dynamically adjust the difficulty of problems and relentlessly drill the core domains of math and coding.
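The article doesn't say how the difficulty is adjusted, so the following is only one plausible pattern, sketched with made-up names and thresholds: track the recent solve rate, step up when the model finds the problems too easy, and step down when it finds them too hard. The `DifficultyCurriculum` class and the stand-in solver are hypothetical.

```python
class DifficultyCurriculum:
    """Hypothetical scheduler that keeps the solve rate in a target band."""

    def __init__(self, levels, target_low=0.3, target_high=0.7, window=20):
        self.levels = levels
        self.level = 0
        self.target_low = target_low
        self.target_high = target_high
        self.window = window
        self.history = []

    def record(self, solved):
        """Log one attempt and adjust the level once the window is full."""
        self.history.append(bool(solved))
        if len(self.history) < self.window:
            return
        rate = sum(self.history) / len(self.history)
        if rate > self.target_high and self.level < self.levels - 1:
            self.level += 1        # too easy: move to harder problems
            self.history.clear()
        elif rate < self.target_low and self.level > 0:
            self.level -= 1        # too hard: back off one level
            self.history.clear()
        else:
            self.history.pop(0)    # in band (or at a boundary): slide window


def toy_model_solves(level, skill=3):
    """Stand-in for a real rollout: solves anything below its skill level."""
    return level < skill


curriculum = DifficultyCurriculum(levels=5)
visited = []
for _ in range(200):
    visited.append(curriculum.level)
    curriculum.record(toy_model_solves(curriculum.level))

print("final level:", curriculum.level)
print("highest level reached:", max(visited))
```

In this toy run the scheduler climbs until the problems exceed the stand-in model's skill and then hovers around that boundary, which is where reinforcement learning gets the most useful signal: problems that are neither always solved nor never solved.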

After these steps were done, they finally used a little human feedback to polish its presentation: whether its responses read pleasantly, whether its writing had style, and so on.

The effect after these five steps was very clear: its math and coding abilities skyrocketed, and along with that, its scores for multiple-choice questions and essay writing also rose.

Honestly, this time ZAYA1-8B has truly done small-parameter models proud. A model's strength shouldn't be judged by its parameter count alone; it also depends on architecture and efficiency.

For those of us who want to run a high-performance model locally, or for friends who are more cost-sensitive, it's a new option worth trying out.

ZAYA1-8B is open-sourced and released under the Apache 2.0 license. This means we can use it directly for commercial purposes, such as building a mobile assistant app, with no problem at all.
