Reconstructing Native Multimodality! Meituan Releases a Purely Discrete Base Model, Truly Achieving 'Everything Is a Token'


Source | PaperWeekly

Can all physical world signals ultimately converge into homogeneous discrete tokens?

For a long time, the continuity of visual signals has been considered a difficult characteristic to handle in autoregressive modeling. To accommodate these irregular features, the current general approach is to introduce complex spatial encodings or heterogeneous modules into the model.

While this architectural compromise yields quick results, it blurs the logical unity of the model.

Just yesterday, Meituan's LongCat team open-sourced a new base model: LongCat-Next.


This model chooses to return to the most basic Next Token Prediction (NTP) paradigm. In its view, complex code, high-definition images, and recordings with environmental background noise are essentially no different from one another.

This architecture, named Discrete Native Autoregressive (DiNA), achieves unified modeling across all modalities at the underlying level.

LongCat-Next is built upon Meituan's self-developed LongCat-Flash-Lite MoE base, with only 3B activated parameters.

At the 3B activation scale, it demonstrates remarkable efficiency. On the OmniDocBench-EN and CharXivRQ benchmarks, which focus on document parsing and chart understanding capabilities, its performance comprehensively surpasses the multimodal model Qwen3-Omni-A3B of the same size.

Furthermore, its visual understanding capability is comparable to specialized models of the same size like QwenVL.

While acquiring multimodal capabilities, LongCat-Next successfully overcomes the pain point of catastrophic forgetting, retaining the original logical depth of language models.

Its SWE-Bench score remains stable at 43.0, indicating that it maintains extremely high usability in actual code engineering tasks.


〓 LongCat-Next Core Benchmark Performance

Alongside open-sourcing the model, Meituan also released the technical report for LongCat-Next.


Technical Report Address:

https://github.com/meituan-longcat/LongCat-Next/blob/main/tech_report.pdf

GitHub Address:

https://github.com/meituan-longcat/LongCat-Next

HuggingFace Address:

https://huggingface.co/meituan-longcat/LongCat-Next

Demo Experience:

https://longcat.chat/longcat-next

In this article, we break down the underlying logic behind it in detail.

A Single Autoregressive Logic for All Signals

To fit physical world signals into the same autoregressive framework, the primary challenge is unifying the representation of different modalities.


〓 DiNA Architecture: Unified Convergence of Cross-Modal Signals to Discrete Tokens

In the setting of LongCat-Next, discrete modeling of language already has a mature ecosystem. Following this line of thought, since speech can be viewed as the phonetic form of language, it lends itself naturally to discrete modeling as well.

The real challenge lies in vision. To enable images to be processed like text and speech, LongCat-Next converts all continuous visual signals into homogeneous discrete tokens.
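The "everything is a token" idea can be made concrete with a toy sketch. All vocabulary sizes, offsets, and model shapes below are illustrative assumptions, not LongCat-Next's actual layout: tokens from every modality are shifted into one flat id space, and the model is trained with ordinary next-token cross-entropy over the mixed sequence.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout -- the sizes are made up for illustration.
TEXT_VOCAB, AUDIO_VOCAB, IMAGE_VOCAB = 32000, 4096, 8192
VOCAB = TEXT_VOCAB + AUDIO_VOCAB + IMAGE_VOCAB

def to_unified(tokens, modality):
    """Shift modality-local token ids into one shared flat id space."""
    offset = {"text": 0, "audio": TEXT_VOCAB,
              "image": TEXT_VOCAB + AUDIO_VOCAB}[modality]
    return [t + offset for t in tokens]

# A mixed sequence: a bit of caption text, image tokens, then audio tokens.
seq = (to_unified([5, 17, 256], "text")
       + to_unified([3, 99], "image")
       + to_unified([40, 41], "audio"))

# Standard next-token-prediction loss over the whole mixed sequence:
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))
x = torch.tensor(seq)
logits = model(x[:-1])                       # predict token t+1 from token t
loss = nn.functional.cross_entropy(logits, x[1:])
```

Once every signal lives in the same id space, no modality-specific loss or attention pattern is needed; the backbone sees one homogeneous stream.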

This unification across modalities shows up intuitively in t-SNE visualizations, where representations of different modalities are highly interwoven and aligned in space.


〓 Fusion Distribution of Visual and Language Representation Spaces

This natural fusion lets "listening," "speaking," "seeing," and "drawing" emerge within the same logic, without introducing complex extra designs such as 3D-RoPE or bidirectional attention.

Tokenizing Images Like Text

The key to discretizing visual signals is LongCat-Next's pioneering discrete Native Vision Transformer (dNaViT).

It provides an extremely flexible, unified visual interface, truly giving images the ability to "tokenize and detokenize" like language: it extracts visual features into a visual vocabulary and converts them into hierarchical discrete tokens.

This mechanism supports input at arbitrary resolution, giving the model an overwhelming advantage in tasks sensitive to aspect ratios and fine details, such as complex chart reasoning.
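How an arbitrary-resolution image becomes a variable-length patch sequence can be sketched as follows. This is a generic NaViT-style illustration with an assumed patch size, not dNaViT's actual code: the image is never resized to a fixed square, so tall, wide, and square inputs simply yield different sequence lengths.

```python
import numpy as np

def patchify(img, patch=14):
    """Split an arbitrary-resolution image (H, W, C) into a variable-length
    sequence of flattened patches. Edges that don't divide evenly by the
    patch size are cropped here purely for simplicity."""
    h, w, c = img.shape
    h, w = h - h % patch, w - w % patch
    img = img[:h, :w]
    patches = (img.reshape(h // patch, patch, w // patch, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * c))
    return patches  # (num_patches, patch*patch*c); length varies with H, W

tall = patchify(np.zeros((280, 140, 3)))    # 20 x 10 = 200 patches
wide = patchify(np.zeros((140, 280, 3)))    # 10 x 20 = 200 patches
square = patchify(np.zeros((140, 140, 3)))  # 10 x 10 = 100 patches
```

Because aspect ratio is preserved end to end, thin axis labels or dense chart gridlines are not distorted before tokenization.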


〓 dNaViT Interface: Supporting Visual Discretization of Arbitrary Resolution

To lock in information during compression, dNaViT introduces the Residual Vector Quantization (RVQ) mechanism. By recursively fitting the residuals of the previous layer through the next layer's codebook, it constructs a vast representation space within a single autoregressive step, ultimately achieving 28x efficient compression.
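The RVQ idea itself is compact enough to sketch. Codebook sizes and dimensions here are arbitrary stand-ins, and the 28x compression figure comes from the report, not from this toy: each level quantizes the residual left over by the levels above it, so L codebooks of size K span up to K^L combinations within a single autoregressive step.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: each level quantizes what the previous levels missed.
    Returns one discrete index per level."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = ((cb - residual) ** 2).sum(axis=1)
        idx = int(dists.argmin())             # nearest code to the residual
        indices.append(idx)
        residual = residual - cb[idx]         # pass the leftover error down
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the chosen codes across levels."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

dim, levels, size = 8, 4, 256
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]
x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)          # 4 small integers per vector
x_hat = rvq_decode(codes, codebooks)
err = np.linalg.norm(x - x_hat)           # shrinks as levels are added
```

With four codebooks of 256 entries, a single position addresses 256^4 possible reconstructions while emitting only four bytes of indices, which is the source of the compression.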

In terms of architectural design, the front-end visual tokenization and the back-end generation decoding are strictly decoupled.

Multi-layer discrete tokens are simply added and fused when entering the large model; whereas in the generation phase, LongCat-Next independently introduces the Depth Transformer as a multimodal prediction head.

This design does not increase the front-end encoding burden and cleverly achieves efficient parallel decoding of multi-level tokens.
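The input-side fusion can be sketched in a few lines (the embedding sizes are assumed for illustration): each RVQ level gets its own embedding table, and the per-level embeddings at a position are summed into one vector before entering the backbone, so the sequence length does not grow with the number of levels.

```python
import torch
import torch.nn as nn

LEVELS, CODEBOOK, DMODEL = 4, 256, 64   # illustrative sizes

# One embedding table per RVQ level; summation fuses the levels.
level_emb = nn.ModuleList(nn.Embedding(CODEBOOK, DMODEL)
                          for _ in range(LEVELS))

def fuse(level_ids):
    """level_ids: (seq_len, LEVELS) discrete tokens -> (seq_len, DMODEL)."""
    return sum(emb(level_ids[:, l]) for l, emb in enumerate(level_emb))

ids = torch.randint(0, CODEBOOK, (10, LEVELS))
h = fuse(ids)   # (10, 64): one fused vector per visual position
```

On the output side, the Depth Transformer described in the report would then predict the LEVELS indices of each position sequentially, conditioned on the backbone's hidden state; only the cheap addition happens on the input path.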

Additionally, to avoid losing high-level semantics during discretization, LongCat-Next introduces the Semantic Alignment Encoder (SAE). Through global alignment and dense multi-task learning, the discrete tokens produced by the model gain intrinsic information-recovery properties.


〓 dNaViT Interface and Cascaded RVQ Discretization Process

High-Fidelity Restoration Under Decoupled Dual-Track Architecture

In the generation phase, relying solely on a frozen SAE encoder makes it difficult to capture high-frequency visual details. LongCat-Next therefore designs a decoupled Dual-Path Detokenization scheme.

The first track is a Structural Pixel Decoder based on ViT, responsible for generating low-resolution anchor images to preserve global layout, thereby greatly reducing generation variance.

The second track is the Diffusion Refiner, responsible for injecting ultra-fine high-frequency details back into the image, ensuring high-fidelity reconstruction.
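The division of labor between the two tracks can be caricatured as follows. Both functions are crude stand-ins, not the actual ViT decoder or diffusion model: the first produces a low-resolution anchor that pins down the global layout, the second upsamples it and adds high-frequency detail on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def structural_decode(tokens):
    """Track 1 (stand-in for the ViT Structural Pixel Decoder): map
    discrete tokens to a low-res anchor image fixing the global layout."""
    side = int(len(tokens) ** 0.5)
    return np.asarray(tokens, dtype=float).reshape(side, side) / 255.0

def diffusion_refine(anchor, scale=8):
    """Track 2 (stand-in for the Diffusion Refiner): upsample the anchor
    and inject high-frequency detail. A real refiner would instead run
    iterative denoising steps conditioned on the anchor."""
    hi = anchor.repeat(scale, axis=0).repeat(scale, axis=1)
    return np.clip(hi + 0.01 * rng.normal(size=hi.shape), 0.0, 1.0)

tokens = list(range(64))             # enough tokens for an 8x8 anchor
anchor = structural_decode(tokens)   # (8, 8) low-res layout
image = diffusion_refine(anchor)     # (64, 64) refined output
```

Because the anchor already commits to the layout, the refiner only has to fill in texture, which is what keeps generation variance low.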

In testing, when facing complex structures containing higher-order summations and nested fractions, with the Diffusion Refiner module engaged, the model is still able to achieve faithful replication with clear handwriting and accurate structure.


〓 Comparison of Reconstruction Effects of ViT Decoder and Refiner Module on Complex Mathematical Formulas

Furthermore, since these discrete tokens inherently encode the layout and structured elements of the image, LongCat-Next avoids the common text garbling problem in image-text generation from the underlying logic, demonstrating excellent text restoration in extreme OCR tasks.

Capability Testing

After the open-sourcing of LongCat-Next, we conducted a series of actual tests.

We first found a supermarket receipt containing correction records to test its information extraction and logical verification capabilities.


LongCat-Next avoided interference from numbers like "100g*3" in product names and directly output structured JSON data.


At the same time, it accurately worked out the settlement logic. It identified that single-item discounts are negative deductions while deleted discounts are positive add-backs, listed the complete formula (-9.00) + (-4.50) + (-4.50) + (+4.50), and verified it against the final discount total.
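That sign convention is easy to verify directly:

```python
# Reproducing the receipt check: single-item discounts are negative
# deductions, and a deleted discount is added back as a positive entry.
adjustments = [-9.00, -4.50, -4.50, +4.50]
total_discount = sum(adjustments)   # -13.50
```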


Next, we uploaded a perplexity (PPL) line chart from the YaRN paper and asked it to analyze the performance differences of different methods.


LongCat-Next accurately read the PPL trends under different sequence lengths and drew conclusions consistent with the original paper, showing no information omission or hallucination when processing such dense academic charts.


In terms of image generation, we attempted to have it generate a children's picture book cover, requiring the prompt to include a main title, subtitle, and author name, specifying layout positions and snowflake texture fonts.


From the generated image, the text spelling is completely accurate, capitalization follows the instructions, and the vertical layout of the title and author name shows no floating or occluded text.

In the test cases displayed by the official team, the input was a logical reasoning question recorded in Sichuan dialect.

LongCat-Next directly understood the dialect audio and provided an accurate logical deduction process.

Similarly, in the official speech synthesis case, the model was asked to synthesize a daily meeting notification mixing Chinese and English. When handling this mixed-language scenario, its pronunciation and rhythm switching were very natural, without any stiff machine-spliced feeling.

Moving Towards the Next-Generation Base

Returning to the question at the beginning of the article: Can all physical world signals ultimately converge into homogeneous discrete tokens?

LongCat-Next answers it clearly with its actual performance. At a time when multimodal models generally rely on parameter stacking and stitched-together heterogeneous modules, it shows there is still a large dividend to be earned by reconstructing the underlying architecture.

By converting continuous visual and auditory signals into homogeneous discrete tokens, it successfully pulls multimodal tasks back to the most mature track of language models: Next Token Prediction.

This not only allows a base model with only 3B activated parameters to demonstrate cross-level image-audio understanding and generation capabilities, but more importantly, it provides an extremely simple and efficient new route for system engineering.

Currently, the code, model weights, and full technical report for LongCat-Next are all open source.

For researchers and developers struggling with information loss in cross-modal fusion, this purely discrete architecture provides a new sample worth digging into and verifying.

It may be too early to say what the final form of modality fusion will look like.

But LongCat-Next at least lets us see that on the road to finding a unified representation of the physical world, besides constantly piling on external modules to do addition, we can also do subtraction through the unification of underlying logic.

