Qwen3.5: Towards Native Multimodal Agents

Image

We are excited to officially release Qwen3.5 and introduce the open-weights version of the first model in the Qwen3.5 series: Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B delivers strong results across comprehensive benchmark evaluations in reasoning, coding, agent capabilities, and multimodal understanding, helping developers and enterprises significantly improve productivity. The model adopts an innovative hybrid architecture that combines linear attention (Gated DeltaNet) with sparse Mixture-of-Experts (MoE), achieving outstanding inference efficiency: of 397 billion total parameters, only 17 billion are activated per forward pass, optimizing speed and cost while preserving capability. We have also expanded language and dialect support from 119 to 201, offering broader availability and better support for global users.
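
To make the efficiency claim concrete, here is a back-of-the-envelope sketch of why sparse activation reduces per-token decode cost, under the first-order assumption that compute scales with *activated* parameters (it ignores attention cost, routing overhead, and memory bandwidth):

```python
# Illustrative arithmetic only: per-token compute of a sparse MoE model
# scales roughly with activated parameters, not total parameters.
# The two headline figures below come from the announcement.

TOTAL_PARAMS = 397e9   # 397B total parameters
ACTIVE_PARAMS = 17e9   # 17B activated per forward pass

activation_ratio = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Activated fraction per token: {activation_ratio:.1%}")
# → roughly 4.3% of the weights are touched per forward pass

# Relative per-token FLOPs vs. a hypothetical dense model of the same
# total size (first-order approximation, ignoring attention/routing):
relative_cost = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Approximate per-token compute vs. an equal-size dense model: {relative_cost:.3f}x")
```

This ratio is why a 397B-parameter model can decode at a cost closer to that of a ~17B dense model.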

Image

(Qwen3.5-Plus Performance)

Welcome to Experience

Qwen Chat:

https://chat.qwen.ai/

Model Performance

Natural Language

Below we comprehensively compare Qwen3.5 with frontier models across various evaluation tasks and modalities.

Image

Vision Language

Image

Compared to the Qwen3 series, Qwen3.5's post-training gains come mainly from a comprehensive expansion of RL tasks and environments. We place greater emphasis on the difficulty and generalizability of RL environments rather than optimizing for specific metrics or narrow categories of queries. The figure below shows the gains in general agent capability as RL environments scale. Overall performance is computed from each model's average ranking on the following benchmarks: BFCL-V4, VITA-Bench, DeepPlanning, Tool-Decathlon, and MCP-Mark. More details on scaling effects across tasks will appear in our upcoming technical report.

Image

(Agent performance gains with RL Environment scaling)

Pre-training

Qwen3.5 advances pre-training across three dimensions: capability, efficiency, and versatility:

  • Capability: Trained on larger-scale vision-text corpora with enhanced Chinese, English, multilingual, STEM, and reasoning data, using stricter filtering to achieve cross-generational parity: Qwen3.5-397B-A17B performs comparably to Qwen3-Max-Base, which has over 1T parameters.

  • Efficiency: Built on the Qwen3-Next architecture—higher-sparsity MoE, a Gated DeltaNet + Gated Attention hybrid, stability optimizations, and multi-token prediction. At 32k/256k context lengths, Qwen3.5-397B-A17B's decoding throughput reaches 8.6x/19.0x that of Qwen3-Max (at comparable quality) and 3.5x/7.2x that of Qwen3-235B-A22B.

  • Versatility: Achieves native multimodality through early text-vision fusion and expanded vision/STEM/video data, outperforming Qwen3-VL at similar scale. Multilingual coverage expanded from 119 to 201 languages/dialects; 250k vocabulary (vs. 150k) brings approximately 10–60% encoding/decoding efficiency gains across most languages.
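
The encoding-efficiency claim in the last bullet can be made concrete with a simple metric. The token counts below are invented stand-ins, not measurements of the real Qwen tokenizers; the point is how "characters per token" quantifies a gain in the quoted 10–60% range:

```python
# Toy illustration of how vocabulary size affects encoding efficiency.
# A larger vocabulary tends to cover longer multilingual character
# sequences with a single token, so the same text needs fewer tokens.
# All numbers here are hypothetical, for illustration only.

def chars_per_token(text_len_chars: int, num_tokens: int) -> float:
    """Common proxy for tokenizer efficiency: higher is better."""
    return text_len_chars / num_tokens

# Hypothetical: the same 1,000-character passage tokenized under a
# 150k-entry vocabulary vs. a 250k-entry vocabulary.
old_tokens = 400   # assumed count under the smaller vocabulary
new_tokens = 320   # assumed count under the larger vocabulary

old_eff = chars_per_token(1000, old_tokens)
new_eff = chars_per_token(1000, new_tokens)
gain = new_eff / old_eff - 1
print(f"Encoding efficiency gain: {gain:.0%}")  # → 25%, within the quoted 10–60% range
```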

Image

(Qwen3.5 inference efficiency significantly improved)

Here is the base model performance:

Image

Infrastructure

Qwen3.5 achieves efficient native multimodal training on heterogeneous infrastructure: the parallelism strategies of the vision and language components are decoupled, avoiding the inefficiencies of a unified scheme. By exploiting sparse activation to overlap cross-module computation, training on mixed text-image-video data reaches nearly 100% of the text-only throughput baseline. On this foundation, a native FP8 pipeline applies low precision to activations, MoE routing, and GEMM operations, while keeping sensitive layers in BF16 via runtime monitoring—cutting activation memory by roughly 50% and accelerating training by over 10%, with stable scaling to trillions of tokens.
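
The precision-fallback policy can be sketched abstractly. The snippet below uses symmetric 8-bit integer quantization as a stand-in for FP8 (plain Python cannot represent FP8 or BF16 directly, and the real pipeline's formats and thresholds are not public); the runtime-monitoring idea—quantize, measure error, fall back on sensitive layers—is the same:

```python
import math

FALLBACK_THRESHOLD = 0.05  # assumed relative-error budget per layer

def quantize_8bit(values):
    """Symmetric 8-bit quantization with a per-tensor scale
    (a simplified stand-in for FP8 casting)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [x * scale for x in q]  # dequantized values

def relative_error(orig, deq):
    """L2 relative error between original and dequantized tensors."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(orig, deq)))
    den = math.sqrt(sum(a * a for a in orig)) or 1.0
    return num / den

def cast_with_monitoring(activations):
    """Quantize a layer's activations; keep high precision when the
    measured error exceeds the budget (the 'sensitive layer' path,
    which would stay in BF16 in the real system)."""
    deq = quantize_8bit(activations)
    if relative_error(activations, deq) > FALLBACK_THRESHOLD:
        return activations, "high-precision"
    return deq, "low-precision"

smooth = [0.1 * i for i in range(-10, 11)]  # well-behaved activations
_, mode = cast_with_monitoring(smooth)
print(mode)  # → low-precision (quantization error well under budget)
```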

To continuously unlock the potential of reinforcement learning, we built a scalable asynchronous RL framework that supports the full range of Qwen3.5 model sizes and covers text, multimodal, and multi-turn interaction scenarios. Its decoupled train-inference separation architecture significantly improves hardware utilization through dynamic load balancing and fine-grained fault recovery. Combined with FP8 training and inference, rollout route replay, speculative sampling, and multi-turn rollout locking, it further raises system throughput and train-inference consistency. Through system-algorithm co-design, the framework mitigates long-tail data issues while strictly bounding sample staleness, improving both training-curve stability and the performance ceiling. The framework is also designed for native agent workflows, enabling stable, seamless multi-turn environment interactions without framework-level scheduling interruptions. This decoupled design lets the system scale to millions of agent scaffolds and environments, significantly enhancing model generalization. Together, these optimizations deliver 3×–5× end-to-end acceleration with strong stability, efficiency, and scalability.
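
One concrete piece of this design—strictly bounding sample staleness in an asynchronous setup—can be sketched as a rollout buffer that rejects samples generated by policy versions that lag too far behind the trainer. The class, names, and staleness bound below are all illustrative, not the framework's actual API:

```python
from collections import deque

MAX_STALENESS = 2  # assumed bound: accept rollouts at most 2 policy versions old

class AsyncRolloutBuffer:
    """Toy buffer for an async RL trainer: actors push rollouts tagged
    with the policy version they were sampled under; the trainer drops
    rollouts whose version lags the current policy by more than
    MAX_STALENESS, bounding off-policyness of each training batch."""

    def __init__(self):
        self.buffer = deque()
        self.current_version = 0

    def push(self, rollout, policy_version):
        self.buffer.append((rollout, policy_version))

    def advance_policy(self):
        """Called after each trainer update; evicts now-stale samples."""
        self.current_version += 1
        self.buffer = deque(
            (r, v) for r, v in self.buffer
            if self.current_version - v <= MAX_STALENESS
        )

    def sample_batch(self, n):
        batch = []
        while self.buffer and len(batch) < n:
            rollout, version = self.buffer.popleft()
            if self.current_version - version <= MAX_STALENESS:
                batch.append(rollout)
        return batch

buf = AsyncRolloutBuffer()
buf.push("rollout-A", policy_version=0)
for _ in range(3):          # trainer updates the policy three times
    buf.advance_policy()
buf.push("rollout-B", policy_version=3)
print(buf.sample_batch(4))  # → ['rollout-B']  (rollout-A is 3 versions stale)
```

In the real system this gating would be combined with the load balancing and fault recovery described above; the staleness check is just the part that keeps the training curve stable.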

Image

Getting Started with Qwen3.5

Interacting with Qwen3.5

You can try Qwen3.5 at chat.qwen.ai. We provide three modes: auto, thinking, and fast. In "auto" mode, the model thinks adaptively and can call tools such as search and the code interpreter; in "thinking" mode, it reasons deeply through difficult problems; in "fast" mode, it answers directly without spending thinking tokens.

Alibaba Cloud Bailian

Users can experience our flagship model Qwen3.5-Plus through Alibaba Cloud Bailian. To enable advanced capabilities like reasoning, web search, and Code Interpreter, simply pass the following parameters:

  • enable_thinking: Enable reasoning mode (chain-of-thought)

  • enable_search: Enable web search and Code Interpreter

Example code:

"""Environment variables (per official docs):  DASHSCOPE_API_KEY: Your API Key from https://bailian.console.aliyun.com  DASHSCOPE_BASE_URL: (optional) Base URL for compatible-mode API.  DASHSCOPE_MODEL: (optional) Model name; override for different models.  DASHSCOPE_BASE_URL:    - Beijing: https://dashscope.aliyuncs.com/compatible-mode/v1    - Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1    - US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1"""from openai import OpenAIimport osapi_key = os.environ.get("DASHSCOPE_API_KEY")if not api_key:    raise ValueError(        "DASHSCOPE_API_KEY is required. "        "Set it via: export DASHSCOPE_API_KEY='your-api-key'"    )client = OpenAI(    api_key=api_key,    base_url=os.environ.get(        "DASHSCOPE_BASE_URL",        "https://dashscope.aliyuncs.com/compatible-mode/v1",    ),)messages = [{"role": "user", "content": "Introduce Qwen3.5."}]model = os.environ.get(    "DASHSCOPE_MODEL",    "qwen3.5-plus",)completion = client.chat.completions.create(    model=model,    messages=messages,    extra_body={        "enable_thinking": True,        "enable_search": False    },    stream=True)reasoning_content = ""  # Full reasoning traceanswer_content = ""  # Full responseis_answering = False  # Whether we have entered the answer phaseprint("\n" + "=" * 20 + "Reasoning" + "=" * 20 + "\n")for chunk in completion:    if not chunk.choices:        print("\nUsage:")        print(chunk.usage)    continue    delta = chunk.choices[0].delta    # Collect reasoning content only    if hasattr(delta, "reasoning_content") and delta.reasoning_content is not None:        if not is_answering:            print(delta.reasoning_content, end="", flush=True)        reasoning_content += delta.reasoning_content    # Received content, start answer phase    if hasattr(delta, "content") and delta.content:        if not is_answering:            print("\n" + "=" * 20 + "Answer" + "=" * 20 + "\n")            is_answering = 
True        print(delta.content, end="", flush=True)        answer_content += delta.content

You can also integrate the Bailian API with third-party programming tools such as Qwen Code, Claude Code, Cline, OpenClaw, and OpenCode for a smooth "vibe coding" experience.

Summary and Future Work

Qwen3.5, with its efficient hybrid architecture and native multimodal reasoning, lays a solid foundation for general digital agents. The next phase will shift focus from model scale to system integration: building agents with cross-session persistent memory, embodied interfaces for real-world interaction, and self-improvement mechanisms, with the goal of creating systems capable of long-term autonomous operation and logical consistency—upgrading current task-bound assistants into sustainable, trustworthy partners.

Demo

Qwen3.5's agent capabilities now combine multimodal understanding with thinking, searching, and tool calling in a single workflow.

Code Agent

1. Web Development

Qwen3.5 can assist with web development, particularly excelling in frontend tasks like building webpages and designing user interfaces. It can transform simple instructions into runnable code, making website creation easier and more efficient.

2. OpenClaw

Qwen3.5 can integrate with OpenClaw to drive programming tasks. By integrating OpenClaw as a third-party agent environment, Qwen3.5 can perform web searches, information gathering, and structured report generation—combining its reasoning and tool-calling capabilities with OpenClaw's interface to provide users with a smooth coding and research experience.

3. Qwen Code

Powered by Qwen3.5 as the underlying model, Qwen Code supports a "vibe coding" experience, transforming natural language instructions into code, iteratively developing projects in real-time, and supporting creative tasks like video generation. Qwen Code collaborates with Qwen3.5 to bring a smooth and efficient experience for daily programming and exploratory coding.

Visual Agent

1. GUI Agent

Qwen3.5 can serve as a visual agent, autonomously operating mobile phones and computers to complete daily tasks. On mobile, it has adapted to more mainstream applications, supporting natural language instruction-driven operations; on PC, it can handle complex tasks like cross-application data organization and multi-step workflow automation, effectively reducing repetitive manual intervention and improving work efficiency.

2. Visual Programming

Qwen3.5 supports image and video input with an extended context window of 1M tokens, capable of directly processing up to 2 hours of video content. Based on this, it can transform hand-drawn interface sketches into well-structured frontend code, perform logic restoration on simple game videos, or automatically distill long video content into structured webpages or visualized charts, lowering the barrier from creativity to implementation.

Prompt:

Create a homepage of OpenQwen, a virtual assistant personal agent that can help with coding, office works, shopping and so on. Generate high-quality images as the website's resources, including an avatar and demos of its use cases.

Image

3. Image-based Reasoning

Breaking through the limitations of traditional image cropping tools, Qwen3.5 natively supports code-level image processing: automatically cropping local areas to enlarge details, or enhancing key features through annotation and enhancement operations, enabling more refined visual reasoning and analysis.
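
The "crop a local area and enlarge it" operation can be sketched in plain Python on a toy grid image. A real agent would emit equivalent code against an imaging library such as Pillow; the list-of-rows representation here is only for a self-contained illustration:

```python
def crop(image, top, left, height, width):
    """Crop a rectangular region from a 2D pixel grid (list of rows)."""
    return [row[left:left + width] for row in image[top:top + height]]

def enlarge(image, factor):
    """Nearest-neighbor upscaling: repeat each pixel `factor` times in
    both dimensions so fine detail becomes easier to inspect."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out

# Toy 4x4 "image"; zoom into the 2x2 region of interest at (1, 1).
img = [
    [0, 0, 0, 0],
    [0, 7, 8, 0],
    [0, 9, 6, 0],
    [0, 0, 0, 0],
]
detail = enlarge(crop(img, top=1, left=1, height=2, width=2), factor=2)
for row in detail:
    print(row)
# → [7, 7, 8, 8] / [7, 7, 8, 8] / [9, 9, 6, 6] / [9, 9, 6, 6]
```

The model writes and runs this kind of transform itself, then reasons over the enlarged crop instead of the full-resolution original.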

4. Spatial Intelligence

Leveraging pixel-level position information modeling of images, Qwen3.5 performs more accurately in tasks like object counting, relative position judgment, and spatial relationship description. It effectively mitigates misjudgments caused by viewpoint changes or occlusion, demonstrating good spatial perception potential in embodied intelligence applications like autonomous driving scene understanding and robot navigation.

5. Visual Reasoning

Compared to Qwen3-VL, Qwen3.5 performs more robustly on academic problem-solving and other visual reasoning tasks. By combining image content with contextual understanding, it can perform multi-step logical reasoning, providing a more reliable foundation for multimodal Agent applications in education, research, and other fields.
