Zhipu's New Model Also Uses DeepSeek's MLA, Runs on Apple M5

By Meng Chen from Aofei Temple
QuantumBit | Official Account QbitAI

Fresh off its IPO, Zhipu AI has released another new model.

The open-source lightweight large language model GLM-4.7-Flash directly replaces the previous-generation GLM-4.5-Flash, and its API is free to call.


This is a Mixture of Experts (MoE) model with 30B total parameters and 3B activated parameters, officially positioned as a "local coding and agent assistant."

On the SWE-bench Verified code-repair benchmark, GLM-4.7-Flash scored 59.2, and it significantly outperforms similarly sized models such as Qwen3-30B and GPT-OSS-20B on evaluations including Humanity's Last Exam.


As a lightweight version of the flagship GLM-4.7 released last December, GLM-4.7-Flash inherits the GLM-4 series' core coding and reasoning capabilities while being specifically optimized for efficiency.

Beyond programming, the team also recommends the model for creative writing, translation, long-context tasks, and even role-playing scenarios.

30B Parameters with Only 3B Activated, MLA Architecture Makes Its Debut

GLM-4.7-Flash continues the design of the series' "hybrid thinking model."

Of its 30 billion total parameters, only about 3 billion are activated during inference, significantly reducing computational overhead while preserving capability.

The context window extends to 200K tokens, and the model supports both cloud API calls and local deployment.

The official technical report has not yet been released, so further details must be gleaned from the configuration files.


Developers have noticed an important detail: this is the first time the GLM team has adopted the MLA (Multi-head Latent Attention) architecture. MLA was first used and validated in DeepSeek-V2, and Zhipu has now followed suit.
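The core idea of MLA, as introduced by DeepSeek-V2, is to cache a single low-rank latent vector per token instead of full per-head K/V tensors, up-projecting it to K and V at attention time. Below is a minimal numpy sketch of that mechanism; all dimensions are illustrative and are not GLM-4.7-Flash's actual configuration.

```python
import numpy as np

# Illustrative sizes (assumptions, not the real model's dimensions).
d_model, n_heads, d_head, d_latent = 512, 8, 64, 128
rng = np.random.default_rng(0)

W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02           # down-projection: this output is all the KV cache stores
W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection to K
W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection to V
W_q   = rng.standard_normal((d_model, n_heads * d_head)) * 0.02

def mla_attention(x):
    """x: (seq, d_model) -> (attention output, cached latent)."""
    seq = x.shape[0]
    latent = x @ W_dkv                                   # (seq, d_latent): the compressed KV cache
    k = (latent @ W_uk).reshape(seq, n_heads, d_head)    # reconstruct per-head keys
    v = (latent @ W_uv).reshape(seq, n_heads, d_head)    # reconstruct per-head values
    q = (x @ W_q).reshape(seq, n_heads, d_head)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d_head)
    causal = np.triu(np.ones((seq, seq)), k=1).astype(bool)
    scores = np.where(causal, -1e9, scores)              # causal masking
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", w, v)
    return out.reshape(seq, n_heads * d_head), latent

x = rng.standard_normal((10, d_model))
out, latent = mla_attention(x)
# Cache per token: d_latent floats instead of 2 * n_heads * d_head.
print(latent.shape[1], 2 * n_heads * d_head)  # 128 vs 1024: an 8x smaller KV cache
```

With these toy sizes the cache shrinks 8x per token, which is the main reason MLA matters for long-context and memory-constrained local deployment.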

Structurally, GLM-4.7-Flash is similar in depth to GLM-4.5-Air and Qwen3-30B-A3B, but the expert count differs: it uses 64 experts rather than 128, with only 5 activated per token (including the shared expert).
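A toy sketch of that expert layout: 64 routed experts with top-k routing so that 5 experts fire per token. The split into 4 routed plus 1 always-on shared expert is an assumption for illustration, as are all dimensions.

```python
import numpy as np

n_experts, d = 64, 32                                     # 64 experts; d is a toy hidden size
rng = np.random.default_rng(0)
experts = rng.standard_normal((n_experts, d, d)) * 0.02   # one weight matrix per routed expert
router  = rng.standard_normal((d, n_experts)) * 0.02
shared  = rng.standard_normal((d, d)) * 0.02              # always-on shared expert (assumed layout)

def moe_forward(x, k=4):
    """Route each token to its top-k experts plus the shared expert (k + 1 = 5 active)."""
    logits = x @ router
    topk = np.argsort(logits, axis=-1)[:, -k:]            # indices of the k highest-scoring experts
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                 # softmax gate weights
    out = x @ shared                                      # shared expert always contributes
    for t in range(x.shape[0]):
        for e in topk[t]:
            out[t] += gates[t, e] * (x[t] @ experts[e])   # only selected experts run
    return out, topk

x = rng.standard_normal((4, d))
y, topk = moe_forward(x)
# Each token touches 5 of 65 expert matrices, which is why only
# ~3B of the 30B parameters are active on any forward pass.
```

This sparse activation is what lets a 30B-parameter model run with roughly the compute cost of a dense 3B model.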


Within 12 hours of release, mainstream platforms such as HuggingFace and vLLM had shipped day-zero support.


Official support for Huawei Ascend NPUs was also available from day one.


For local deployment, developers have measured about 43 tokens/s on a MacBook with an M5 chip and 32GB of unified memory.
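A quick back-of-envelope check of why a 30B-parameter model fits in 32GB of unified memory at all, assuming 4-bit weight quantization (typical for local MLX or llama.cpp builds; the article does not state the quantization used):

```python
# Assumed: 4-bit quantized weights (0.5 bytes per parameter).
total_params = 30e9
bytes_per_param = 0.5
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~15 GB, leaving headroom for the KV cache and the OS
```

MLA's compressed KV cache (see above) further stretches that headroom for long contexts.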


On the official API platform, the basic version of GLM-4.7-Flash is completely free (limited to one concurrent request), and the high-speed GLM-4.7-FlashX variant is also very affordable.


Compared with similar models, it has the edge in context length and output-token pricing, though latency and throughput still need optimization.


HuggingFace: https://huggingface.co/zai-org/GLM-4.7-Flash

Reference Links:
[1] https://x.com/Zai_org/status/2013261304060866758


