Zhipu's New Model Also Uses DeepSeek's MLA, Runs on Apple M5

By Meng Chen from Aofei Temple
QuantumBit | Official Account QbitAI

Fresh off its IPO, Zhipu AI has released another new model.

The open-source lightweight large language model GLM-4.7-Flash directly replaces the previous-generation GLM-4.5-Flash, and its API is free to call.


This is a Mixture of Experts (MoE) model with 30B total parameters and 3B activated parameters, officially positioned as a "local coding and agent assistant."

On the SWE-bench Verified code-repair benchmark, GLM-4.7-Flash scored 59.2, and it significantly outperforms similarly sized models such as Qwen3-30B and GPT-OSS-20B on evaluations including Humanity's Last Exam.


As a lightweight version of the flagship GLM-4.7 released last December, GLM-4.7-Flash inherits the GLM-4 series' core coding and reasoning capabilities while being specifically optimized for efficiency.

Beyond programming, the team also recommends the model for creative writing, translation, long-context tasks, and even role-playing scenarios.

30B Parameters with Only 3B Activated, MLA Architecture Makes Its Debut

GLM-4.7-Flash continues the design of the series' "hybrid thinking model."

Of its 30 billion total parameters, only about 3 billion are activated during inference, significantly reducing computational overhead while preserving capability.

The context window extends to 200K tokens, and the model supports both cloud API calls and local deployment.

The official technical report has not yet been released, so further details must be gleaned from the configuration files.


Developers have noticed an important detail: this is the first time the GLM team has adopted the MLA (Multi-head Latent Attention) architecture. MLA was first used and validated in DeepSeek-V2, and Zhipu has now followed suit.
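The core idea of MLA, as introduced by DeepSeek-V2, is to cache a single low-rank latent vector per token instead of full per-head K/V tensors, up-projecting it to K and V at attention time. Below is a minimal numpy sketch of that mechanism; all dimensions are illustrative and are not GLM-4.7-Flash's actual configuration.

```python
import numpy as np

# Illustrative sizes (assumptions, not the real model's dimensions).
d_model, n_heads, d_head, d_latent = 512, 8, 64, 128
rng = np.random.default_rng(0)

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection: this output is all the KV cache stores
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection to K
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection to V
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02

def mla_attention(x):
    """x: (seq, d_model) -> (attention output, cached latent)."""
    seq = x.shape[0]
    latent = x @ W_dkv                                   # (seq, d_latent): the compressed KV cache
    k = (latent @ W_uk).reshape(seq, n_heads, d_head)    # reconstruct per-head keys
    v = (latent @ W_uv).reshape(seq, n_heads, d_head)    # reconstruct per-head values
    q = (x @ W_q).reshape(seq, n_heads, d_head)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head)
    causal = np.triu(np.ones((seq, seq)), k=1).astype(bool)
    scores = np.where(causal, -1e9, scores)              # causal masking
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", w, v)
    return out.reshape(seq, n_heads * d_head), latent

x = rng.standard_normal((10, d_model))
out, latent = mla_attention(x)
# Cache per token: d_latent floats instead of 2 * n_heads * d_head.
print(latent.shape[1], 2 * n_heads * d_head)  # 128 vs 1024: an 8x smaller KV cache
```

With these toy sizes the cache shrinks 8x per token, which is the main reason MLA matters for long-context and memory-constrained local deployment.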

Structurally, GLM-4.7-Flash is similar in depth to GLM-4.5-Air and Qwen3-30B-A3B, but the expert count differs: it uses 64 experts rather than 128, with only 5 activated per token (including the shared expert).
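A toy sketch of that expert layout: 64 routed experts with top-k routing so that 5 experts fire per token. The split into 4 routed plus 1 always-on shared expert is an assumption for illustration, as are all dimensions.

```python
import numpy as np

n_experts, d = 64, 32                                     # 64 experts; d is a toy hidden size
rng = np.random.default_rng(0)
experts = rng.standard_normal((n_experts, d, d)) * 0.02   # one weight matrix per routed expert
router  = rng.standard_normal((d, n_experts)) * 0.02
shared  = rng.standard_normal((d, d)) * 0.02              # always-on shared expert (assumed layout)

def moe_forward(x, k=4):
    """Route each token to its top-k experts plus the shared expert (k + 1 = 5 active)."""
    logits = x @ router
    topk = np.argsort(logits, axis=-1)[:, -k:]            # indices of the k highest-scoring experts
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                 # softmax gate weights
    out = x @ shared                                      # shared expert always contributes
    for t in range(x.shape[0]):
        for e in topk[t]:
            out[t] += gates[t, e] * (x[t] @ experts[e])   # only selected experts run
    return out, topk

x = rng.standard_normal((4, d))
y, topk = moe_forward(x)
# Each token touches 5 of 65 expert matrices, which is why only
# ~3B of the 30B parameters are active on any forward pass.
```

This sparse activation is what lets a 30B-parameter model run with roughly the compute cost of a dense 3B model.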


Within 12 hours of release, mainstream platforms such as HuggingFace and vLLM had shipped day-zero support.


Official support for Huawei Ascend NPUs was also available from day one.


For local deployment, developers have measured about 43 tokens/s on a MacBook with an M5 chip and 32GB of unified memory.
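A quick back-of-envelope check of why a 30B-parameter model fits in 32GB of unified memory at all, assuming 4-bit weight quantization (typical for local MLX or llama.cpp builds; the article does not state the quantization used):

```python
# Assumed: 4-bit quantized weights (0.5 bytes per parameter).
total_params = 30e9
bytes_per_param = 0.5
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~15 GB, leaving headroom for the KV cache and the OS
```

MLA's compressed KV cache (see above) further stretches that headroom for long contexts.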


On the official API platform, the basic version of GLM-4.7-Flash is completely free (limited to one concurrent request), and the high-speed GLM-4.7-FlashX variant is also very affordable.


Compared with similar models, it has the edge in context length and output-token pricing, though latency and throughput still need optimization.


HuggingFace: https://huggingface.co/zai-org/GLM-4.7-Flash

Reference Links:
[1] https://x.com/Zai_org/status/2013261304060866758


