Qwen3.7-Plus: A Multimodal Agent That Sees, Codes, and Operates Screens—Edge Closer to Practical Use

Have you ever been in a situation at work where you have a prototype sketch or a UI screenshot but need to spend an entire afternoon manually converting it into frontend code? Or maybe you need to run repetitive functional tests on an app, clicking back and forth between different screens, recording findings, and verifying results, which drains a huge amount of energy. If past AI models could only "see" images and "answer" questions, starting today, a large model can "operate" interfaces and generate code for you, maintaining stable execution over tasks lasting several hours. This is starting to sound like a real productivity tool, right?

On June 1, 2026, the Qwen team released Qwen3.7-Plus, a significant step in this direction. It not only integrates visual understanding and language reasoning into the same base but also systematically strengthens capabilities for truly getting work done: screen perception, GUI operation, visual programming, and search-augmented visual question answering. By reading today's article, you will learn: what the core technical breakthroughs of this new model are, which benchmark tests it has outperformed the current state-of-the-art on, and what practical scenarios are available for developers.

The official promotional poster for Qwen3.7-Plus clearly showcases its four core positionings: Multimodal Interaction Hybrid Agent, Coding & Productivity Assistant, Visual Agent, and Cross-domain Generalization Capability.

From "Can See" to "Can Do": A Leap in Positioning

If last year's large models were still competing over "who can describe images more accurately," the release of Qwen3.7-Plus has clearly pushed the battleground a step further. A key phrase in its official positioning is the multimodal interaction hybrid agent. Breaking this term down, it means the model must not only handle image, text, and video inputs but also seamlessly integrate GUI and CLI operations within a single task, completing the entire process from understanding requirements to delivering results end-to-end.

In other words, previous models were more like a "staff advisor" that provided suggestions and then stepped out; Qwen3.7-Plus is designed to be an "executor" who can step into the field and do the work itself.

Core Technology: Intertwining "Seeing, Thinking, Writing, Doing, and Verifying"

The technical upgrade of Qwen3.7-Plus is not a single-point patch but a systematic capability reconstruction around a real-world task loop. Let's break down the most noteworthy technical directions one by one.

Multimodal Interaction Hybrid Agent: Capable of Working Continuously for Over 6 Hours

This is the most core breakthrough of Qwen3.7-Plus. In traditional agent development, vision and action are often separated: one model is responsible for looking at images, another for planning actions, with glue code needed to bridge them. The approach of Qwen3.7-Plus integrates these two things into the same model, letting it autonomously complete the full cycle from "seeing the screen" to "operating the interface" and then "verifying the result."

Showcase of Qwen3.7-Plus Agent's long-task capability

This infographic is very intuitive: An agent based on Qwen3.7-Plus completed 1,000+ tool calls, 300+ GUI operations, and ran stably for over 6 hours (reaching 11 hours in an actual case) during a single long-duration task, ultimately independently completing the development of an English vocabulary learning app.

The report disclosed more concrete real-world cases. In the task of developing an English vocabulary learning app, the agent achieved 1,000+ tool calls and 300+ GUI operations, running continuously and stably for over 11 hours, completing the full loop from requirement analysis to version iteration. In another test replicating a native macOS Stocks application, the Hybrid-Agent autonomously completed UI layout understanding, SwiftUI source code generation, integration of real-time market APIs, and 10 automated functional verifications, delivering a high-fidelity application.

This means the model is no longer just a "one-time Q&A" machine, but possesses the capability to maintain context and operational consistency over long-duration, multi-step tasks. Have you encountered the problem of "an agent getting lost mid-way" in your actual projects? The enhanced stability of Qwen3.7-Plus in this respect might change the automated development workflows for some teams.

Visual Agent: From Perception to Programmatic Solving

Faced with visual puzzles like "spot the difference," "solve Klotski," or "navigate a maze," humans typically observe the structure first and then deduce the steps mentally. Qwen3.7-Plus takes a similar approach: it can first convert the geometric structures and constraints in an image into a computable problem representation, then autonomously generate and execute Python code to solve it.

The significance of this capability leap is that the model no longer just provides label-like "descriptions" of an image, but can treat visual input as a problem requiring logical solving. This opens up a larger imaginative space for its application in fields like quality inspection, game testing, and educational assistance, which require a combination of "seeing" and "computing."

Visual Coding & GUI Agent: Screenshot to Code, Interface as Command

The capabilities of Qwen3.7-Plus in the two directions of Visual Coding and GUI Agent are mutually reinforcing.

Visual Coding enables the model to understand geometric structures, colors, layouts, and even dynamic changes in visual references and accurately reproduce them in the form of SVG, web pages, or interactive front-end code. For scenarios where front-end developers and designers interact, this means the barrier from design drafts to editable code is significantly lowered.
GUI Agent allows the model to understand interface layouts, locate controls, plan tasks, and perform multi-step interactions in both mobile and desktop environments. A browser intelligent assistant based on Qwen3.7-Plus can act as a Browser Agent, executing operations such as clicking, typing, navigating, configuring, and verifying in a real browser, and even completing full-chain automation from cloud server procurement to O&M upgrades.

Seeing this, does it remind you of those operation and maintenance processes in your company that require repeated "clicking around" in backend systems? If this direction continues to mature, it might liberate more than just programmers.

Search-Augmented Visual Q&A: Solving "I Don't Know This" in the Open World

For visual questions that depend on external knowledge, Qwen3.7-Plus can combine image input with web search. The model first extracts key entities and scene clues from the vision, then retrieves the latest knowledge from the web in real-time, and finally synthesizes the visual evidence and search results to give a reliable answer.

This greatly expands the model's Q&A boundaries in the open world. For example: you take a photo of an unknown plant and ask, "Is this plant suitable for planting in Yunnan?" The model can first identify the plant species, then search for corresponding planting conditions, and finally provide a comprehensive answer—all completed within a single reasoning loop.

Seamless Integration with Mainstream Frameworks: Lowering the Barrier for Developers

No matter how good the technology, if the integration cost for developers is too high, it's hard to land. Qwen3.7-Plus has made pragmatic arrangements in this regard: it supports seamless integration with Claude Code via the Anthropic API protocol; a simple configuration on the Alibaba Cloud Bailian platform allows connection to OpenClaw; the Qwen team also launched the Qwen Code tool deeply optimized for it, maximizing model performance. No matter which framework you are used to, theoretically, you can integrate it with a low learning curve.

Let the Data Speak: Proving Itself on Multiple Benchmarks

No matter how loud the promotion, what ultimately matters are the scores. Qwen3.7-Plus has undergone quite detailed benchmark comparisons in the two major directions of pure text & agent, and multimodal capabilities.

Pure Text and Agent Capabilities

Qwen3.7-Plus multi-benchmark SOTA comparison table

This table compares Qwen3.7-Plus with top models like Opus-4.6 Max and DeepSeek-V4-Pro Max on pure text and agent benchmarks. Qwen3.7-Plus ranks first on Terminal Bench 2.0 (terminal coding agent) with 70.3 points, significantly leads on Deep-Planning (deep planning) with 62.3 points, and also takes the top spot on MCP-Mark (MCP tool usage) with 58.7 points.

In the field of coding agents, Qwen3.7-Plus achieved a leading 70.3 points on Terminal Bench 2.0 (terminal coding agent) and a top-level 1588 points on QwenSVG (SVG code generation), surpassing DeepSeek-V4-Pro Max's 67.9 points on the former. In terms of general agent capabilities, it scored 62.3 points on Deep-Planning (deep planning) and 58.7 points on MCP-Mark (MCP tool usage), demonstrating its autonomous planning and execution stability in complex multi-step tasks. It should be noted that on SWE-Verified (software engineering verification), Qwen3.7-Plus's 77.7 points is slightly lower than Opus-4.6 Max's top score of 80.8; on GPQA Diamond (STEM reasoning), its 90.3 points also slightly trails Opus-4.6 Max's 91.3 points. However, overall, its competitiveness in agent-related metrics is already quite prominent.

Multimodal Capabilities: A Qualitative Leap

Multimodality is truly the "main battlefield" of Qwen3.7-Plus.

Qwen3.7-Plus multimodal SOTA comparison table

The multimodal comparison table shows that Qwen3.7-Plus significantly leads comparable products on multiple core benchmarks such as BabyVision (70.4), ScreenSpot Pro (79.0), and AndroidWorld (81.0), and compared to its predecessor Qwen3.6-Plus, it has achieved a visibly noticeable generational leap.

Compared to the previous generation Qwen3.6-Plus, Qwen3.7-Plus saw its score on BabyVision (early visual cognitive reasoning) jump from 37.4 points to 70.4 points, nearly doubling; on AndroidWorld (mobile task execution), it leaped from 67.2 points to 81.0 points. On ScreenSpot Pro (GUI element grounding), its score of 79.0 points significantly surpasses GPT-5.4's 67.4 points and Gemini-3.1 Pro's 67.5 points. The leaps on these few benchmarks are particularly noteworthy because they are more about the ability to "work on real interfaces" rather than simply answering questions about images. BabyVision tests capabilities akin to early human visual cognition and spatial reasoning, and the doubled score indicates a qualitative breakthrough in this foundational ability.

In this visual chart horizontally comparing seven mainstream models across 12 benchmarks, Qwen3.7-Plus ranks first in most items. Its advantages are particularly obvious in agent coding and visual understanding dimensions like Terminal-Bench 2.0, ScreenSpot Pro, and RealWorldQA, only slightly behind individual competitors in a few tests like NL2Repo and HLE.

Pricing and Ecosystem

Qwen3.7-Plus is currently available via API service through the Alibaba Cloud Bailian platform. The official release did not disclose specific input/output token pricing, but judging from its "service through Bailian" model, it is expected to follow the existing commercial system of Alibaba Cloud models. Developers can keep an eye on the Bailian platform for subsequently updated pricing details.

Where Can These Capabilities Actually Be Applied?

The technical direction of Qwen3.7-Plus points to several very specific application scenarios:

End-to-end Software Development: You provide a design draft or interface screenshot, and the model can directly generate runnable front-end code. For product managers or designers without a technical background, prototype verification no longer requires frequently asking for development resources.
Automated Testing and O&M: Let an agent perceive application interfaces, understand task steps, and autonomously execute verification. This can significantly reduce the proportion of manual repetitive labor in software regression testing, app data scraping, and batch management of cloud resources.
Multimodal Knowledge Assistant: Combined with search-augmented visual Q&A, the model can provide more reliable integrated answers in scenarios requiring a fusion of "seeing" and "looking up," such as competitive product analysis (screenshots + real-time information retrieval), travel inquiries (scenic spot photos + latest guides), and research report generation.

A noteworthy point is Qwen3.7-Plus's compatibility with mainstream frameworks like Claude Code, OpenClaw, and Qwen Code, which means you don't need to swap out your familiar development toolchain to utilize its capabilities. This pragmatic strategy might accelerate its popularity in the developer community.

Summary: A Key Step for Multimodal Agents Toward "Practicality"

At its core, the release of Qwen3.7-Plus pushes the competition focus of multimodal large models from "describing images" to the more complex domain of GUI/CLI hybrid agents. Its significant improvements on benchmarks like BabyVision, ScreenSpot Pro, and AndroidWorld prove that its capabilities in perception, planning, operation, and construction within real digital environments are moving beyond the stage of "demos are cool, but implementation is hard."

Two directions are worth continuous observation in the future: first, the stability performance of such long-duration agents over even longer periods and in more open environments; second, whether the developer ecosystem and application cases surrounding Qwen3.7-Plus can continue to enrich. After all, for a powerful model to truly change workflows, it ultimately depends on how many developers are willing to pay for it and use it to solve real problems in actual scenarios.

1. Qwen3.7-Plus Technical In-depth Report: https://qwen.ai/blog?id=qwen3.7-plus

2. Qwen3.7-Plus Multi-benchmark SOTA Comparison Table (Text/Agent)

3. Qwen3.7-Plus Multimodal SOTA Comparison Table

4. Qwen3.7-Plus Official Promotional Poster and Performance Comparison Chart