Meituan Quietly Launches a New Model! Hands-On with the First Open-Source "Heavy Thinking" Model: 8-Way Parallel Reasoning, an Agent That Goes Head-to-Head with Claude

Recently, Meituan's LongCat team open-sourced its latest flagship model—LongCat-Flash-Thinking-2601.

This new model, based on a 560B parameter MoE architecture, does not simply pursue higher benchmark scores. Instead, it focuses its iteration on the two most critical capabilities for large model deployment: deep logical reasoning (Thinking) and intelligent agent generalization in unfamiliar environments (Agentic OOD).

In this update, the official release not only includes a Heavy Thinking Mode capable of launching 8 reasoning paths in parallel, but, more notably, also features innovations in evaluation methodology.

To verify the model's true generalization capability, the team introduced an automated blind examination mechanism. Instead of drawing from a fixed question bank, the system randomly synthesizes complex tasks in real time from keywords, each equipped with a matching tool set and execution environment.

This dynamic test generation method effectively avoids the possibility of the model "memorizing answers" and better reflects its real performance in unknown scenarios.

Experimental results show that when handling such highly randomized complex toolchain tasks, LongCat-2601 demonstrates SOTA-level adaptability, with performance even surpassing Claude.


It performs excellently in core benchmark evaluations such as agent tool calling, agent search, and tool-integrated reasoning, achieving multiple SOTA metrics in the open-source domain.

However, high scores are less convincing than testing in a real environment.

To gauge its true capabilities, we avoided the conventional question bank and specifically constructed four non-ideal environments. From convoluted logic to dirty data cleaning, let's see if this LongCat can truly deliver.


Logic Real-Test

Facing complex logic with multiple constraints and mutual exclusions, traditional Chain-of-Thought (CoT) often gets stuck in local optima.

To probe LongCat's true limits, we designed a logic trap in the style of a murder mystery game:

A murder occurred at the manor. There are 5 suspects. It is known that there is only 1 murderer, and exactly 2 of the 5 are lying.

A says: B is the murderer. B says: D is the murderer. C says: I am not the murderer. D says: B is lying. E says: Both B and C are lying.

Please deduce who the murderer is.

Once deep thinking was enabled, the backend instantly came alive: 8 independent Thinkers started working simultaneously.


This looks less like solving a problem and more like a team holding a meeting:

Divergence Phase: Thinker 1 attempts a forward derivation assuming A is the murderer, but at the third step finds a violation of the global constraint that exactly 2 people are lying, and marks the path as infeasible. Meanwhile, Thinker 3 starts from E's testimony, working backward to pin down the true/false status of B and C.

Convergence Phase: Once all 8 Thinkers finish, Meta-Reasoning (the main brain), like an experienced judge, eliminates logically self-contradictory assumptions and converges to the unique solution in one decisive step: the murderer is B, and the liars are B and E.

This mechanism essentially simulates the slow-thinking process of human System 2, effectively avoiding single-point logical hallucinations through cross-validation of multiple paths.
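For readers who want to double-check the verdict, the puzzle is small enough to verify exhaustively. Below is a minimal brute-force check we wrote ourselves (our own sketch, not the model's code): enumerate each suspect as the murderer, evaluate all five statements, and keep the cases with exactly two liars.

```python
# Brute-force check: 1 murderer among 5 suspects, exactly 2 of them lying.
def statements(murderer):
    s = {}
    s["A"] = (murderer == "B")              # A: "B is the murderer"
    s["B"] = (murderer == "D")              # B: "D is the murderer"
    s["C"] = (murderer != "C")              # C: "I am not the murderer"
    s["D"] = not s["B"]                     # D: "B is lying"
    s["E"] = (not s["B"]) and (not s["C"])  # E: "Both B and C are lying"
    return s

solutions = []
for murderer in "ABCDE":
    truth = statements(murderer)
    liars = sorted(name for name, truthful in truth.items() if not truthful)
    if len(liars) == 2:                     # global constraint: exactly 2 liars
        solutions.append((murderer, liars))

print(solutions)  # [('B', ['B', 'E'])]: the murderer is B; B and E are lying
```

Only one assignment survives the constraint, which matches the model's converged answer.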


Robustness Challenge

Real-world engineering challenges often lie not in how to write code, but in how to handle unexpected dirty data.

To target the anti-noise training that Meituan emphasized in its official technical write-up, we skipped conventional test questions and instead constructed a backend log on the verge of collapse, simulating the mixed Chinese-English noise common in real business scenarios, to see whether the model could recover the ground truth.

We fed in a simulated unstructured log from a "food delivery order failure" scenario, containing API errors (a 503 error), the mixed Chinese-English garbling typical of OCR mistakes (e.g., Cr@yfish), and interfering symbols. The model was required to ignore the noise and restore standard JSON order data.


Left side is the original log containing errors and garbled text; right side is the standard JSON restored and cleaned by the model.

LongCat demonstrated extremely strong engineering robustness:

Effective Payload Extraction: Faced with the prominent red # EXCEPTION alert at the top and the subsequent [ERR_CODE:503] interruption information, the model was not distracted, accurately crossed the error area, and located the valid Raw_Payload data segment below.

Semantic Correction: Facing typical mixed Chinese-English noise like <<Spicy_Cr@yfish_Lobster>>, the model demonstrated strong semantic understanding, stripping the corrupted token Cr@yfish and restoring the item to the standard SKU "Spicy Lobster".

Attribute Structuring: It intelligently identified that the #X in MT-User-9527#X was a system interference suffix and removed it; at the same time, it decomposed 'Ice_Cola_Sugar-Free' into the product name "Cola" and the attributes "Sugar-Free, Ice", rather than mechanically concatenating strings.

This performance confirms that the model underwent systematic noise injection during the training phase, enabling it to maintain stable reasoning capabilities when facing complex mixed noise in Chinese contexts.
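To make the cleaning task concrete, here is a minimal Python sketch of the same kind of pipeline, run on an invented log line modeled on the test input (the field names and format are our assumptions, not the actual log):

```python
import json
import re

# Invented noisy log line modeled on the test input (not the actual log).
raw = ("# EXCEPTION !! [ERR_CODE:503] upstream timeout\n"
       "Raw_Payload: user=MT-User-9527#X drink='Ice_Cola_Sugar-Free'")

# Step 1: skip the error banner and keep only the payload segment.
payload = raw.split("Raw_Payload:", 1)[1]

# Step 2: strip the '#X' interference suffix from the user id.
user = re.search(r"user=(\S+)", payload).group(1).split("#")[0]

# Step 3: decompose the compound SKU into a product name and attributes.
parts = re.search(r"drink='([^']+)'", payload).group(1).split("_")
order = {
    "user": user,                                      # 'MT-User-9527'
    "drink": {"name": parts[1],                        # 'Cola'
              "attributes": [parts[2], parts[0]]},     # ['Sugar-Free', 'Ice']
}
print(json.dumps(order))
```

A hand-written pipeline like this only works for one known format; the point of the test is that the model performs the same recovery without being told the format in advance.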


Code Generation

In the code generation section, we raised the difficulty from simple functional implementation to interdisciplinary integration. The task was to write an interactive black hole gravity field simulator, which tests not only code logic but also requires the model to possess physical common sense and visual aesthetic judgment.

Write a single-file HTML5 Canvas application: generate 3000 particles, with the mouse as the gravitational source (black hole), strictly following Newton's law of gravitation, and implementing a cyberpunk-style fluid visual effect.

The code ran successfully on the first try. Zooming in on the details, you will find that LongCat demonstrates a deep understanding of physical laws.

1. Physical Authenticity: Particle trajectories strictly follow the F = G*m1*m2/r² gravitational formula. During interaction, the physical characteristic of acceleration changing with distance can be clearly observed.

2. Visual Algorithm: The model constructed a color mapping algorithm based on velocity. Particles appear in cool tones when stationary and turn bright purple-white when accelerated and sucked into the black hole, with clear visual hierarchy.

3. Rendering Performance: Through Canvas-level optimization, smooth rendering of 3000 particles at 60FPS was achieved, and complex light trail effects were implemented using semi-transparent mask technology.
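The core of such a simulator is the per-frame update. The sketch below shows the inverse-square attraction and speed-based color mapping in Python (the constants, softening term, and hue formula are our own choices for illustration, not LongCat's generated code):

```python
import math

G = 2000.0      # tuned gravitational constant (arbitrary units, not SI)
DT = 1 / 60     # timestep for a 60 FPS frame loop

class Particle:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.vx = self.vy = 0.0

def step(p, mouse_x, mouse_y):
    """One frame: inverse-square pull toward the mouse, then color by speed."""
    dx, dy = mouse_x - p.x, mouse_y - p.y
    r2 = dx * dx + dy * dy + 25.0        # softening term avoids division by zero
    a = G / r2                           # from F = G*m1*m2/r^2 with unit masses
    r = math.sqrt(r2)
    p.vx += a * dx / r * DT
    p.vy += a * dy / r * DT
    p.x += p.vx * DT
    p.y += p.vy * DT
    speed = math.hypot(p.vx, p.vy)
    hue = max(260.0 - 2.0 * speed, 180.0)  # cool blue when slow, hotter when fast
    return speed, hue

p = Particle(100.0, 100.0)
speed, hue = step(p, 200.0, 100.0)       # the particle starts drifting toward the mouse
```

In the actual HTML5 Canvas version, the same update runs for 3000 particles per frame, with the light trails produced by drawing a semi-transparent rectangle over the canvas each frame instead of clearing it.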


Ultimate OOD Real-Test

To completely rule out the possibility of memorizing question banks, in the fourth stage, we directly accessed Meituan's official OOD evaluation platform. In this stage, all tasks are randomly generated by the system based on keywords.

The system randomly generated an "Enterprise Employee Annual Leave Self-Service Query" task and set a trap in the database: deliberately omitting the "Days Taken This Year" parameter essential for calculating the balance.


Facing the "missing calculation parameter" pitfall, Claude-4.5-Opus committed a major taboo in enterprise applications: for speed, it directly skipped the identity verification step, making the result completely unreliable.


However, LongCat demonstrated a surprising Agent boundary awareness. It did not fabricate answers but chose to proceed steadily step by step.


Identity Anchoring: It first calls get_employee_by_id to confirm the employee's identity (E10001), ensuring the query targets the right person.

Parameter Sniffing: While preparing to call the calculation tool, it keenly notices the missing key variable "Days Taken" and immediately pauses the toolchain.

Active Clarification: It puts precise questions to the user: "1. Total annual leave accrued? 2. Annual leave already used? 3. Carryover days?", and only proceeds with the calculation after obtaining real data.
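The guard pattern behind the second and third steps can be sketched in a few lines (the field names and return shape below are invented for illustration, not Meituan's actual tool schema):

```python
# Required inputs for the leave-balance calculation (field names invented).
REQUIRED = ("total_days", "used_days", "carryover_days")

def plan_leave_query(employee_id, known):
    """Pause and ask instead of fabricating missing parameters."""
    missing = [k for k in REQUIRED if k not in known]
    if missing:
        # Boundary awareness: stop the toolchain and clarify with the user.
        return {"action": "ask_user", "employee": employee_id, "missing": missing}
    balance = known["total_days"] - known["used_days"] + known["carryover_days"]
    return {"action": "answer", "employee": employee_id, "balance": balance}

print(plan_leave_query("E10001", {"total_days": 10}))
# asks for used_days and carryover_days rather than guessing
```

The design choice is simple but decisive: an agent that refuses to compute on incomplete inputs can never return a confidently wrong balance.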

This "knowing what it doesn't know" capability was quantitatively confirmed in the final evaluation report.


Claude was faster, but because it skipped identity verification and fabricated parameters in the very first step, it scored only 67%, a bare pass. LongCat took just 6.7 seconds longer (48.9s vs 42.2s) to achieve 100% coverage of the task criteria.

In enterprise scenarios, trading a few seconds for absolute business accuracy is the real way to reduce costs and increase efficiency.

Technical Breakdown

Such stunning real-test performance is not simply due to parameter stacking, but stems from a systematic reconstruction of the underlying training paradigm.

At the basic architecture level, version 2601 continues the mature base scheme of the LongCat-Flash-Thinking series, based on a 560B parameter Mixture of Experts (MoE) architecture, and inherits the effective domain parallel training strategy validated in the previous generation.

On this solid foundation, the new version achieves a leap in capabilities by introducing variables such as parallel thinking, environment scaling, multi-environment reinforcement learning, and anti-noise curriculum learning.

1. Heavy Thinking Mode

The Heavy Thinking Mode demonstrated in the logic real-test is LongCat-2601's most distinctive feature. Unlike the linear derivation of traditional CoT, Meituan's engineers split slow thinking into a two-stage process of "parallel thinking + summary induction", introducing parallel and recursive mechanisms at the reasoning layer:

Construction of Reasoning Breadth: The model can instantiate 8 independent Thinkers in parallel. The system forces different Thinkers onto differentiated reasoning paths by raising the sampling temperature, thereby covering more of the solution space.

Strengthening of Reasoning Depth: This is a closed-loop process. The summary module converges and verifies the 8 parallel trajectories, feeding back the refined logical anchors to the reasoning flow, forming an iterative cycle of "think - summarize - think again".
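In pseudocode form, the diverge-then-converge loop looks roughly like the sketch below, with the summary module reduced to a majority vote and the sampler stubbed out (the real system's interfaces are not public; everything here is our assumption):

```python
import collections

def heavy_think(sample_fn, question, n_thinkers=8):
    # Diverge: independent reasoning paths at progressively higher temperatures.
    answers = [sample_fn(question, temperature=0.7 + 0.1 * i)
               for i in range(n_thinkers)]
    # Converge: the real summary module verifies whole trajectories; here it is
    # reduced to a majority vote over the final answers.
    winner, _ = collections.Counter(answers).most_common(1)[0]
    return winner

# Stub standing in for the model: very high temperatures sometimes derail.
def stub_sampler(question, temperature):
    return "B" if temperature < 1.25 else "D"

print(heavy_think(stub_sampler, "Who is the murderer?"))  # majority answer: B
```

The described system goes further than a vote, feeding the refined logical anchors back into another round of thinking, but the breadth-then-depth structure is the same.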


In the real-test, LongCat enabled Heavy Thinking Mode, and the backend showed 8 parallel chains of thought.

2. Agent Training

To solve the generalization challenge of Agents in unfamiliar environments, Meituan's technical team chose a technical route of Environment Scaling.

The team did not rely on static training data but built a dynamic high-fidelity training field. Each environment not only integrates over 60 atomic tools but also constructs a high-density tool dependency graph.

In the task construction stage, the system uses connected subgraph sampling technology to extract logically related subsets from the complex tool network, automatically synthesizing high-complexity tasks with executable solutions.

This synthetic data strategy allows the model to witness a vast array of tool combination forms during the training phase, thus possessing strong adaptability when facing OOD tasks.
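Connected-subgraph sampling itself is a simple idea. The sketch below grows a connected subset of a toy tool graph by randomly expanding a frontier (the four-tool graph and its names are invented; the real environments integrate 60+ atomic tools):

```python
import random

def sample_connected_subgraph(graph, k, rng):
    """Grow a connected subset of size k by randomly expanding the frontier."""
    start = rng.choice(sorted(graph))
    chosen = {start}
    frontier = set(graph[start])
    while len(chosen) < k and frontier:
        nxt = rng.choice(sorted(frontier))
        chosen.add(nxt)
        frontier |= set(graph[nxt]) - chosen
        frontier.discard(nxt)
    return chosen

# Toy undirected tool dependency graph (invented tool names).
tools = {
    "search_shop": ["get_menu"],
    "get_menu": ["search_shop", "add_to_cart"],
    "add_to_cart": ["get_menu", "checkout"],
    "checkout": ["add_to_cart"],
}
subset = sample_connected_subgraph(tools, 3, random.Random(0))
print(subset)  # a connected 3-tool subset
```

Because every sampled subset is connected, each synthesized task comes with at least one executable solution path through the chosen tools.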


Visualization of dense tool dependency relationships integrated in the training environment

3. Infrastructure Upgrade

The introduction of large-scale environments poses challenges to the training framework. To this end, Meituan upgraded its self-developed DORA (Asynchronous Elastic Co-Card System) to support Multi-Environment RL Scaling.

This system not only achieves balanced mixed training of multi-environment tasks but also introduces an intelligent resource scheduling mechanism—Streaming Rollout Budget. The system dynamically allocates computing resources based on the difficulty coefficient of the current task and the model's training progress.
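As a rough illustration of difficulty-aware budgeting, the toy function below scales a base rollout count by task difficulty and training progress (the formula is invented for illustration; Meituan has not published the actual scheduling rule):

```python
def rollout_budget(base_rollouts, difficulty, progress, floor=1):
    """Scale the rollout count: harder tasks and earlier training get more
    rollouts. difficulty is roughly 0..1, progress runs from 0 to 1."""
    scale = difficulty * (1.5 - 0.5 * progress)
    return max(floor, round(base_rollouts * scale))

print(rollout_budget(8, 1.0, 0.0))   # 12: hard task early in training
print(rollout_budget(8, 0.25, 1.0))  # 2: easy task late in training
```

The point of any such rule is the same: compute flows to the tasks where extra rollouts still buy learning signal, instead of being spread uniformly.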

From the official disclosed training curves, as the number of environments increases, the model's benefits show a robust growth trend.


Multi-environment reinforcement learning training curve, showing that as the number of environments increases, model performance exhibits a robust growth trend.

4. Robustness Engineering

Targeting the common dirty data issues in real business, LongCat adopted a Curriculum Learning strategy for specialized training.

The training system classifies "noise" such as API timeouts, garbled text, and missing fields, and gradually injects them into the training environment according to a gradient from easy to difficult.
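An easy-to-hard injection schedule can be as simple as ramping the corruption probability over training. The sketch below is illustrative (the rates, noise types, and corruption rules are our assumptions, not the actual curriculum):

```python
import random

NOISE_TYPES = ("api_timeout", "garbled_text", "missing_field")

def noise_rate(step, total_steps, max_rate=0.3):
    # Linearly ramp the injection probability from 0 to max_rate over training.
    return max_rate * min(step / total_steps, 1.0)

def maybe_corrupt(sample, step, total_steps, rng):
    if rng.random() < noise_rate(step, total_steps):
        sample = dict(sample)
        kind = rng.choice(NOISE_TYPES)
        if kind == "missing_field":
            sample.pop(rng.choice(sorted(sample)), None)  # drop a random field
        else:
            sample["noise"] = kind                        # tag injected noise
    return sample

clean = {"order_id": "123", "item": "Spicy Lobster"}
print(maybe_corrupt(clean, 0, 1000, random.Random(0)))  # step 0: rate is 0, untouched
```

Early batches stay clean so the model first learns the task itself; the noise floor then rises until late-stage training looks like the messy logs of the robustness test.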

This systematic anti-interference desensitization directly forged the model's stability when facing OCR garbled text and Chinese-English mixed interference in real-tests.


After introducing anti-noise training, the model's robustness in noisy environments (Noise) significantly improved.

5. Underlying Computing Power Optimization

Beyond algorithms, Meituan also disclosed an optimization at the underlying architecture level—ZigZag Attention.


ZigZag Attention Architecture Principle: Efficient sparse attention mechanism supporting million-level long contexts

This sparse-attention technique has already been applied to training branch versions of this model family. It resolves the compute bottleneck of ultra-long contexts, allowing the model to maintain high computational efficiency and memory utilization on inputs at the million-token level.

Conclusion

The release of LongCat-Flash-Thinking-2601 demonstrates Meituan's profound expertise in combining algorithms and engineering.

In the current landscape where large models compete to climb leaderboards, Meituan has chosen a more pragmatic and difficult path—pursuing certainty. It is no longer just a chatbot, but a digital craftsman that remains logically clear and decisive in execution when facing chaotic data and complex processes.

Its appearance once again confirms a trend: the second half of the large model competition is not about who is better at "talking," but who is better at "doing."

Currently, this model is available for free trial at longcat.ai, and its weights have also been open-sourced on platforms like HuggingFace.

Here are the access portals:

👇

Open-Source Platforms:

https://github.com/meituan-longcat/LongCat-Flash-Thinking-2601

https://huggingface.co/meituan-longcat/LongCat-Flash-Thinking-2601

https://www.modelscope.cn/models/meituan-longcat/LongCat-Flash-Thinking-2601

Online Experience:

https://longcat.ai/

API Open Platform:

https://longcat.chat/platform/usage

🔍

Now, you can also find us on "Zhihu".

Go to the Zhihu homepage and search for "PaperWeekly".

Click "Follow" to subscribe to our column.

