Qwen-Scope: Seeing Through the 'Hidden Thoughts' of Large Models

Interpretability analysis is a key way to probe the behavior of large models, and it offers new angles on performance optimization and controllable reasoning. Today, we are thrilled to introduce Qwen-Scope, an interpretability module trained on the Qwen3 and Qwen3.5 model series. Specifically, we inserted Sparse Autoencoders (SAEs) at the hidden layers of Qwen models and trained them. The sparsity constraint lets us automatically extract latent features that are highly decoupled, low in redundancy, and easier to interpret. Qwen-Scope can be used to analyze the internal mechanisms behind Qwen model behavior, and it also holds considerable potential for model optimization. Application scenarios include targeted control of reasoning results, data classification and synthesis, model training and optimization, and analysis and comparison of evaluation sample distributions.
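To make the idea of a sparse autoencoder concrete, here is a minimal NumPy sketch of one forward pass and its training objective. It assumes a ReLU encoder with an L1 sparsity penalty; the post does not state which SAE variant or objective Qwen-Scope uses, and the dimensions, weight names, and `l1_coeff` value below are hypothetical.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode hidden states into sparse features and reconstruct them."""
    # ReLU zeroes out negative pre-activations, so only some features fire.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty (assumed objective)."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return mse + l1_coeff * sparsity

# Toy sizes: hidden size 8, dictionary size 32 (both hypothetical).
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for 4 token hidden states
features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
```

The dictionary is wider than the hidden state so that each feature can specialize; the sparsity term is what pushes most of them to zero on any given token.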

Key Highlights of Qwen-Scope:

Reasoning: Achieve targeted control of reasoning results without explicitly providing natural language instructions.
Data: Identify features for data classification from only a small amount of seed data, significantly reducing the dependence on labeled data. In addition, use information about inactive features to construct targeted data that supplements long-tail capabilities.
Training: Analyze low-frequency errors such as language mixing and repetitive generation to locate abnormally activated features, then use those features during supervised fine-tuning and reinforcement learning to reduce the frequency of such responses.
Evaluation: Compare feature activation patterns across samples or evaluation benchmarks to assess evaluation redundancy. This guides the selection of evaluation sets, broadens capability coverage, and reduces evaluation costs.

Overview

The open-sourced Qwen-Scope weights cover 7 large models, spanning dense and Mixture of Experts models from the Qwen3 and Qwen3.5 series, for a total of 14 sets of Sparse Autoencoder weights. To obtain SAE features that are widely distributed and semantically meaningful, and to keep training stable and reliable, we sampled 0.5 billion tokens from the corresponding models' pre-training data for training.

Practice

You can leverage Qwen-Scope to analyze and develop Qwen series models. Below, we demonstrate the utility of Qwen-Scope from four perspectives: reasoning, evaluation, data, and training. Detailed information can be found in the technical report.

Reasoning: Analyzing Model Behavior and Controllable Outcomes

By controlling feature activations, we can steer the model's outputs in a targeted way, for example changing the language, entities, or style of a response, without giving any explicit natural language instruction.
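One common way to control a feature is to shift the hidden state along that feature's decoder direction. The sketch below assumes this approach; the post does not describe the exact mechanism Qwen-Scope uses, and `steer_hidden_state`, the feature index, and the strength value are hypothetical.

```python
import numpy as np

def steer_hidden_state(hidden, W_dec, feature_id, strength):
    """Shift hidden states along one feature's unit-norm decoder direction."""
    direction = W_dec[feature_id]
    direction = direction / np.linalg.norm(direction)
    return hidden + strength * direction

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(32, 8))   # toy decoder: 32 features, hidden size 8
hidden = rng.normal(size=(4, 8))   # stand-in for 4 token hidden states

steered = steer_hidden_state(hidden, W_dec, feature_id=5, strength=4.0)

# Each token moved by exactly `strength` along the chosen direction.
unit = W_dec[5] / np.linalg.norm(W_dec[5])
shift = (steered - hidden) @ unit
```

In a real model the steered states would be written back into the forward pass at the layer where the SAE was trained; a negative strength would suppress the feature instead of amplifying it.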

Data: Classification and Synthesis

Qwen-Scope decomposes model representations into distinct feature directions, which makes it a useful data processing tool and opens up new approaches to both data classification and data synthesis. For toxic data classification, we can use a small amount of seed data to analyze how toxic samples activate SAE features, then select the features most strongly correlated with toxicity and classify with them. The entire process requires no additional classifier training, significantly reducing annotation and training costs. Moreover, classification accuracy remains high even with limited seed data, which drastically lowers the dependence on large-scale labeled data.
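The seed-data workflow can be sketched as follows. This is an illustration under stated assumptions, not the method from the technical report: features are ranked by the gap in mean activation between toxic and benign seeds, and a sample is flagged when its selected features sum past a threshold. The activation values, `top_k`, and `threshold` are made up for the example.

```python
import numpy as np

def select_features(toxic_acts, benign_acts, top_k=2):
    """Rank features by the mean activation gap between toxic and benign seeds."""
    gap = toxic_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return np.argsort(gap)[::-1][:top_k]

def classify(acts, feature_ids, threshold):
    """Flag a sample as toxic when its selected features sum past a threshold."""
    return acts[:, feature_ids].sum(axis=1) > threshold

# Toy activations: rows are samples, columns are 6 hypothetical SAE features.
toxic_seed = np.array([[0.0, 2.1, 0.1, 1.8, 0.0, 0.2],
                       [0.1, 1.7, 0.0, 2.2, 0.0, 0.0]])
benign_seed = np.array([[0.3, 0.0, 0.2, 0.1, 0.0, 0.4],
                        [0.0, 0.1, 0.3, 0.0, 0.1, 0.2]])

feature_ids = select_features(toxic_seed, benign_seed, top_k=2)

new_samples = np.array([[0.0, 1.9, 0.0, 1.5, 0.0, 0.1],
                        [0.2, 0.0, 0.4, 0.1, 0.0, 0.3]])
predictions = classify(new_samples, feature_ids, threshold=1.0)
```

No classifier weights are fitted here: the only choices are which features to keep and where to put the threshold, which is why a handful of seed examples can be enough.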

In data synthesis scenarios, Qwen-Scope can also identify toxic text features that are rarely or never activated in existing data and synthesize supplementary samples that target them. Compared to traditional data synthesis approaches, this method is more controllable and better targeted, covering long-tail capabilities more efficiently and improving training data efficiency by approximately 15 times.
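Finding the coverage gaps to synthesize for could look like the sketch below. It assumes a simple definition of "rarely activated", namely a firing rate below a chosen cutoff across the existing dataset; the function name, the cutoff, and the toy activations are hypothetical.

```python
import numpy as np

def rare_features(acts, candidate_ids, min_rate=0.3, active_above=0.0):
    """Return candidate features that fire on fewer than min_rate of samples."""
    firing_rate = (acts[:, candidate_ids] > active_above).mean(axis=0)
    return [fid for fid, rate in zip(candidate_ids, firing_rate) if rate < min_rate]

# Toy dataset activations: 4 samples by 5 hypothetical SAE features.
acts = np.array([[0.0, 1.2, 0.0, 0.0, 0.0],
                 [0.0, 0.8, 0.0, 0.5, 0.0],
                 [0.0, 0.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0, 0.0]])

# Features 1, 3 and 4 are assumed to have been identified as toxicity-related.
gaps = rare_features(acts, candidate_ids=[1, 3, 4], min_rate=0.3)
```

Here feature 1 fires on three of four samples and is well covered, while features 3 and 4 fall under the cutoff and would be the targets for new synthetic samples.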

Training: Targeted Optimization

Qwen-Scope features can also be applied during the training stage. For instance, when we observe language mixing in the model (such as Chinese characters anomalously appearing in an English response), we can locate the abnormally activated features. During the supervised fine-tuning phase, we design a loss function that targets these abnormally activated features, guiding the model to produce such bad cases less often.
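The post does not give the form of this loss. One plausible shape, shown here purely as an assumption, is the usual fine-tuning loss plus a weighted penalty on the activations of the flagged features. The sketch only shows the arithmetic in NumPy; an actual training run would compute it inside an autograd framework, and `weight` and the feature indices are hypothetical.

```python
import numpy as np

def sft_loss_with_feature_penalty(ce_loss, features, bad_feature_ids, weight=0.1):
    """Add the mean activation of flagged features to the usual SFT loss."""
    penalty = features[:, bad_feature_ids].mean()
    return ce_loss + weight * penalty

# Toy SAE activations for 2 tokens and 3 features; feature 1 is the
# hypothetical "language mixing" feature we want to discourage.
features = np.array([[0.0, 2.0, 0.0],
                     [0.0, 4.0, 1.0]])

total = sft_loss_with_feature_penalty(ce_loss=1.5, features=features,
                                      bad_feature_ids=[1], weight=0.1)
```

With a mean flagged activation of 3.0, the penalty adds 0.3 to the base loss of 1.5. The weight trades off how strongly the unwanted feature is suppressed against the primary fine-tuning objective.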

Consider another example: repetitive generation. This is a low-frequency phenomenon, so it is rarely sampled during the reinforcement learning stage. To address this, we can control the corresponding features to make anomalous responses appear more often during sampling, which makes the reward signal for this behavior denser. The model can then optimize away the issue more thoroughly during reinforcement learning.

Evaluation: Identifying Gaps and Redundancy in Test Samples

Evaluation is central to large model development. As the number of capabilities and dimensions to evaluate grows and sample sizes become massive, a key question emerges: which evaluation sets are redundant, and which areas are insufficiently covered? With Qwen-Scope, we can analyze the feature coverage of test sets to determine how much evaluation redundancy exists between different benchmarks. As the accompanying figure shows, some commonly used evaluation sets cover one another in the features they activate, which makes certain benchmarks less informative because they assess the same capabilities again. We hope these analytical methods can help you select test samples and evaluation sets with higher coverage and lower evaluation costs.
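A simple way to express "mutual coverage" is the fraction of one benchmark's active features that another benchmark also activates. The sketch below assumes that definition and a 0.9 cutoff for calling a benchmark redundant; the metric, the cutoff, the benchmark names, and the feature IDs are all hypothetical and are not taken from the technical report.

```python
def coverage(features_a, features_b):
    """Fraction of benchmark A's active features that benchmark B also activates."""
    if not features_a:
        return 0.0
    return len(features_a & features_b) / len(features_a)

# Sets of SAE feature IDs that fire on each benchmark (toy values).
benchmarks = {
    "bench_a": {1, 2, 3, 4},
    "bench_b": {1, 2, 3, 4, 5, 6, 7, 8},
    "bench_c": {9, 10, 11},
}

# Pairs (a, b) where b covers at least 90% of a's features: a adds little.
redundant = [
    (a, b)
    for a in benchmarks
    for b in benchmarks
    if a != b and coverage(benchmarks[a], benchmarks[b]) >= 0.9
]
```

The measure is deliberately asymmetric: `bench_b` fully covers `bench_a`, but `bench_a` covers only half of `bench_b`, so only the smaller benchmark is a candidate for removal. Features that no benchmark activates point to the coverage gaps.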

Summary

Qwen-Scope not only analyzes model behavior but also reaches into the model's internals, turning complex parameter computations into concepts and patterns that humans can understand. It doesn't just "understand" the model; it can also "improve" it. In our practice, it has provided ideas and direction for model optimization across reasoning, evaluation, data processing, and training. Interpretability is not just a tool for post-hoc analysis; it can also serve as a core engine driving model evolution. We welcome community feedback and look forward to seeing the novel and interesting use cases the community comes up with!

Try It Out

You can experience Qwen-Scope on Hugging Face or ModelScope.

Links

Hugging Face: https://huggingface.co/spaces/Qwen/QwenScope?spm=a2ty_o06.30285417.0.0.65e5c921MGq3Tu

ModelScope: https://modelscope.cn/studios/Qwen/QwenScope?spm=a2ty_o06.30285417.0.0.65e5c921FZvQi4

Technical Report: https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf

