Remove the Vision Encoder, and Multimodal Models Actually Get Stronger?

In today's unified multimodal models, one assumption is deeply ingrained: to understand images, you need a pre-trained vision encoder (such as CLIP or SigLIP) to extract features; to generate images, you need a VAE to compress pixels into a latent space. But what happens if you strip these encoders away and let the model learn directly from raw pixels?

Tuna-2 delivers a counterintuitive answer: after sufficient pre-training, a completely encoder-free design consistently outperforms encoder-based designs on multimodal understanding tasks, with the clearest advantage on benchmarks that require fine-grained visual perception.

[Figure 1: Tuna-2 Architecture Evolution and Multimodal Performance Comparison] Tuna's visual encoding components are stripped away progressively: removing the VAE yields Tuna-R, which retains only the representation encoder, and removing the representation encoder as well yields the fully encoder-free Tuna-2. Using pixel embeddings, Tuna-2 surpasses both Tuna-R and Tuna on multiple multimodal benchmarks.

Peeling Back the Layers: From Tuna to Tuna-2

The core idea of the paper is "subtraction." Existing Unified Multimodal Models (UMMs) typically contain two visual encoders: one representation encoder for understanding (like SigLIP), and one VAE for generation. The paper removes them step by step.

In the first step, removing the VAE while retaining the representation encoder yields the intermediate model, Tuna-R. The understanding part of Tuna-R follows the classic encoder + LLM paradigm, while the generation part switches to pixel-space flow matching, adopting the x-prediction and v-loss paradigm proposed by JiT. Specifically, given a source image x₁ and sampled noise x₀, a linear schedule constructs noisy samples in pixel space. The model directly predicts the clean image, and that prediction is then converted into a velocity term, which is regressed against the target velocity.
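The x-prediction / v-loss recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the common linear interpolation x_t = t·x₁ + (1 − t)·x₀ with target velocity x₁ − x₀, and the names `x_prediction_v_loss`, `model`, and `t_eps` (a hypothetical cap that keeps 1 − t away from zero) are illustrative choices that do not come from the source.

```python
import numpy as np

def x_prediction_v_loss(model, x1, rng, t_eps=0.05):
    """Flow-matching loss with x-prediction and a velocity-space regression.

    x1:    clean images, shape (B, H, W, C)
    model: callable (x_t, t) -> predicted clean image, same shape as x1
    t_eps: assumed cap on t so that (1 - t) never reaches zero
    """
    b = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)                     # sampled noise
    t = rng.uniform(0.0, 1.0 - t_eps, size=(b, 1, 1, 1))   # one time per sample
    x_t = t * x1 + (1.0 - t) * x0                          # linear schedule in pixel space
    v_target = x1 - x0                                     # velocity of the linear path
    x1_hat = model(x_t, t)                                 # network predicts the clean image
    v_hat = (x1_hat - x_t) / (1.0 - t)                     # convert x-prediction to velocity
    return float(np.mean((v_hat - v_target) ** 2))         # regress in velocity space
```

A model that returns the true clean image drives this loss to zero, since (x₁ − x_t) / (1 − t) equals x₁ − x₀ under the linear schedule. That makes a convenient sanity check when wiring the loss into a training loop.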

In the second step, the representation encoder is removed and replaced with a simple patch embedding layer, resulting in the final Tuna-2. The entire model is simplified into a single transformer decoder that directly processes image and text tokens. This design avoids the inductive biases built into pre-trained encoders, such as fixed input resolution and limited access to low-level visual details.
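To make "a simple patch embedding layer" concrete, here is a minimal NumPy sketch of the usual construction: cut the image into non-overlapping patches, flatten each one, and apply a single linear projection. The patch size, hidden width, and the names `patchify` and `patch_embed` are assumptions for illustration; the source does not specify Tuna-2's exact layer.

```python
import numpy as np

def patchify(images, patch):
    """(B, H, W, C) -> (B, N, patch*patch*C) with N = (H // patch) * (W // patch)."""
    b, h, w, c = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = images.reshape(b, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (B, gh, gw, patch, patch, C)
    return x.reshape(b, gh * gw, patch * patch * c)

def patch_embed(images, weight, bias, patch):
    """Project flattened raw-pixel patches to the decoder width with one linear map."""
    return patchify(images, patch) @ weight + bias

# Hypothetical sizes: 32x32 RGB images, 16x16 patches, decoder width 8.
rng = np.random.default_rng(0)
images = rng.standard_normal((2, 32, 32, 3))
weight = rng.standard_normal((16 * 16 * 3, 8)) * 0.02
bias = np.zeros(8)
tokens = patch_embed(images, weight, bias, patch=16)   # shape (2, 4, 8)
```

Because the layer only requires the image size to be divisible by the patch size, the number of tokens simply grows with resolution, which is one way to read the article's point about avoiding a fixed input resolution.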

[Figure 3: Schematic of the Mask-based Feature Learning Scheme] During training, learnable mask tokens are used to regularize multimodal understanding and perform masked prediction for visual generation.

Challenges in Pixel Space and Masked Learning

Removing the VAE means visual modeling shifts from a compact latent space to a high-dimensional pixel space, significantly increasing redundant information. The model can easily rely on superficial shortcuts rather than learning truly useful visual cues. To address this, the paper introduces a mask-based visual feature learning scheme.

During training, a proportion of image patches is randomly selected according to a mask ratio, replaced with a learnable mask token, and then fed into the LLM decoder. The same masking operation plays different roles for generation and understanding samples. For generation samples, the model must predict the clean image for both masked and unmasked regions from a partially visible noisy image, which makes the denoising problem harder. For understanding samples, the model must complete multimodal reasoning under partial visual observation, so masking acts as a regularizer. Experiments show that Tuna-2 benefits more from mask training than Tuna-R; the paper speculates that this is because the SigLIP 2 encoder used by Tuna-R was already pre-trained with a similar masked-prediction objective.
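The masking step above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: it masks an exact count of patches per sample (the paper may instead sample per patch), and `apply_patch_mask`, `mask_token`, and `mask_ratio` are hypothetical names. In a real model the mask token would be a trainable parameter rather than a fixed array.

```python
import numpy as np

def apply_patch_mask(tokens, mask_token, mask_ratio, rng):
    """Replace a random subset of patch embeddings with a shared mask token.

    tokens:     (B, N, D) patch embeddings
    mask_token: (D,) vector standing in for the learnable mask token
    Returns the masked tokens and a boolean mask of shape (B, N).
    """
    b, n, _ = tokens.shape
    num_masked = int(round(mask_ratio * n))
    scores = rng.random((b, n))
    # Double argsort turns random scores into a random rank per patch,
    # so exactly num_masked patches per sample fall below the threshold.
    ranks = np.argsort(np.argsort(scores, axis=1), axis=1)
    mask = ranks < num_masked
    masked_tokens = np.where(mask[..., None], mask_token, tokens)
    return masked_tokens, mask

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 16, 8))
mask_token = np.full(8, -1.0)
masked_tokens, mask = apply_patch_mask(tokens, mask_token, mask_ratio=0.25, rng=rng)
```

The same function would serve both sample types: for generation samples the input tokens come from the noisy image, and for understanding samples they come from the clean image.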

Two-Stage End-to-End Training

The encoder-free design allows Tuna-2 to be trained fully end-to-end, without requiring separate training of a connector layer. Training occurs in two stages:

Stage One is full-model pre-training. It uses 550 million internal image-text pairs, of which 70% are image captioning data and 30% are text-to-image generation data, plus Nemotron plain-text data that makes up 20% of the total pre-training data. Training runs for 300,000 steps on 64 nodes with a learning rate of 1×10⁻⁴.

Stage Two is supervised fine-tuning (SFT). It uses 13 million FineVision conversation samples and approximately 2 million OmniEdit image editing samples, trained for 50,000 steps at a learning rate of 2×10⁻⁵.

Across all stages, the input sequence length per GPU is padded to 16k tokens.
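The Stage One percentages above are stated at two levels (a 70/30 split within the image-text data, and a 20% text share of the total), so it can help to flatten them into overall shares. The sketch below does that arithmetic under one reading: that the 70/30 split applies only to the image-text portion, and that all shares are measured in the same unit. The function name and dictionary keys are illustrative.

```python
def stage_one_mixture(text_share=0.20, caption_share=0.70, t2i_share=0.30):
    """Flatten the nested Stage One proportions into shares of the total mixture."""
    image_text_share = 1.0 - text_share   # what remains after text-only data
    return {
        "image_captioning": image_text_share * caption_share,
        "text_to_image": image_text_share * t2i_share,
        "text_only": text_share,
    }

mixture = stage_one_mixture()
```

Under these assumptions the overall mixture comes out to roughly 56% captioning, 24% text-to-image, and 20% text-only data.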

The LLM decoder is Qwen2.5-7B-Instruct throughout. Tuna-R additionally uses SigLIP 2 So400M as its representation encoder and adds a 3,000-step connector alignment phase before Stage One.

Understanding Capability: Encoder-Free Surpasses Encoder-Equipped

The paper evaluates image understanding on 9 VQA benchmarks: GQA, RealWorldQA, MMVet, MMMU, MMVP, SEED-Bench2+, AI2D, ChartQA, and OCRBench. Both Tuna-R and Tuna-2 surpass Tuna and achieve state-of-the-art results among 7B-scale native UMMs. Notably, Tuna-2, which replaces the representation encoder with a simple patchify layer, outperforms Tuna-R on understanding.

On benchmarks emphasizing fine-grained visual reasoning, such as V*, CountBench, and VisuLogic, both Tuna-R and Tuna-2 surpass latent-space UMMs (like Show-o2, Tuna), indicating the necessity of pixel-space visual representations for fine-grained visual reasoning.

[Figure 6: Accuracy Curves for Tuna-R and Tuna-2 as Training Data Scale Increases] On the three understanding benchmarks (OCRBench, MMVP, and V*), Tuna-R leads in early training, but Tuna-2 catches up and ultimately surpasses it. On the GenEval generation benchmark, Tuna-R maintains a slight edge throughout, but the gap narrows as training scale increases.

Analysis of training dynamics reveals an interesting phenomenon: early in training, Tuna-R, leveraging the semantic priors of the pre-trained encoder, leads in understanding tasks; but as training data increases, Tuna-2 gradually catches up and overtakes it. This indicates that the monolithic, encoder-free design is better suited to benefit from large-scale, unified multimodal pre-training.

Generation Capability: Pixel Space Holds Its Own

On the GenEval and DPG-Bench image generation benchmarks, both Tuna-R and Tuna-2 achieve top-tier results, competitive with methods like BAGEL and Mogao. Tuna-R stays slightly ahead of Tuna-2 on generation tasks, suggesting that the semantic priors introduced by the representation encoder help the model learn a stronger generator.

[Table 3: GPT-5.4 and Claude Opus 4.7 Evaluation Results] Tuna-R is slightly better on quality dimensions (35.7% vs. 32.1% under GPT-5.4), but Tuna-2 leads significantly in diversity (48.4% vs. 30.9% under GPT-5.4).

The paper also evaluates generation quality and diversity using LLM judges: 1,500 text prompts were sampled, with 4 images generated per model, judged by GPT-5.4 and Claude Opus 4.7. Tuna-2 is roughly on par with Tuna-R on quality and surpasses Tuna, while leading significantly in diversity.

[Table 5: Image Reconstruction Performance of Different Visual Tokenizers] Tuna-R and Tuna-2 rank first among unified tokenizers, approaching the level of dedicated tokenizers like FLUX.1[dev]-VAE.

On the image reconstruction task, Tuna-R and Tuna-2 rank first among unified tokenizers, achieving rFID scores of 0.12 and 0.15 respectively, and both hitting an SSIM of 0.93, approaching the level of the dedicated image tokenizer FLUX.1[dev]-VAE.

[Figure 7: Attention Map Visualization for Tuna-R, Tuna-2, and Other Baseline Models] Red areas indicate high attention scores, while blue areas indicate low ones. Tuna-2 exhibits more accurate vision-language alignment in both basic perception and counterintuitive scenarios.

Attention Visualization: More Precise Cross-Modal Alignment

The paper compares Tuna-2's attention maps with models like LLaVA-OneVision-1.5, Qwen2.5-VL, and Penguin-VL. In basic perception scenarios like "glowing windows," Tuna-2 consistently highlights semantically relevant areas, whereas other models often provide only coarse or incomplete localization. In counterintuitive scenarios, such as a "soccer match kicking over a glass," most models are misled by text priors or visual distractors, while Tuna-2 accurately pinpoints key objects consistent with the question's semantics.
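Attention maps like these are usually produced by taking the attention weights from text query tokens to image patch tokens and laying them back out on the patch grid. The sketch below shows that generic procedure in NumPy; the article does not describe how the paper aggregates layers, heads, or query tokens, so the averaging choice and the name `attention_heatmap` are assumptions.

```python
import numpy as np

def attention_heatmap(attn, grid_h, grid_w):
    """Turn text-to-image attention weights into a normalized 2D heatmap.

    attn: (num_query_tokens, grid_h * grid_w) attention from text tokens to
          image patch tokens, assumed already averaged over heads and layers.
    Returns an array of shape (grid_h, grid_w) with values in [0, 1].
    """
    assert attn.shape[1] == grid_h * grid_w, "one attention weight per patch is expected"
    per_patch = attn.mean(axis=0)                      # average over query tokens
    lo, hi = per_patch.min(), per_patch.max()
    scaled = (per_patch - lo) / (hi - lo + 1e-8)       # min-max normalize for display
    return scaled.reshape(grid_h, grid_w)              # row-major patch order

# Toy example: 3 query tokens attending only to patch 4 of a 2x3 grid.
attn = np.zeros((3, 6))
attn[:, 4] = 1.0
heat = attention_heatmap(attn, grid_h=2, grid_w=3)
```

The resulting grid can then be upsampled to the image size and rendered with a red-to-blue colormap, matching the convention in the figure.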

What Tiga Thinks

From Tuna to Tuna-R and then to Tuna-2, the paper completes a thorough "subtraction experiment." The final conclusion is clear and powerful: after sufficient visual pre-training, removing the pre-trained vision encoder is beneficial for learning stronger, fine-grained visual representations. Unified modeling in pixel space is not only viable but also demonstrates strong competitiveness and scalability across both understanding and generation. When the model is large enough and the data is sufficient, those meticulously designed encoder modules might just be the baggage that needs to be dropped.

🔗 Original Link

https://arxiv.org/abs/2604.24763