Category: Computer Vision
- Remove the Vision Encoder, and Multimodal Models Actually Get Stronger?
- What Did DeepSeek's New Paper, Deleted Overnight, Actually Say?
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
- NUS, Fudan, and Tsinghua: The First Systematic Survey on Large Model Latent Spaces
- Southeast University's Geng Xin Team: Models Fail Not from Inability, But from 'Crowded-Out Capacity' | CVPR 2026
- Meta Bets on Neural Computers: Is the Next-Gen Computer the Model Itself?
- OCR Domain Adaptation Without Retraining from Scratch? Decoupling Language Models Reduces Computation by 95%
- Xiaohongshu's "Everything is OCR": A 3B Small Model Outperforms Giants, Parsing Charts into Code
- Reconstructing Native Multimodality! Meituan Releases Purely Discrete Base Model, Truly Achieving 'Everything is Token'
- VideoSeek Long-Video Understanding Agent: The Secret to Boosting GPT-5's Long-Video Comprehension by 10 Points
- Multimodal Video Streaming Inference Efficiency Boosted by 56%: Unveiling TWW's Segment-Level Dynamic Memory Mechanism
- The More Reasoning, The More Hallucinations? The "Hallucination Paradox" of Multimodal Reasoning Models
- Breaking! Meta Open-Sources Its Latest World Model
- Fei-Fei Li's Latest Interview: World Models Are Coming
- OPA-DPO: An Efficient Solution for the Hallucination Problem in Multimodal Large Models
- Thinking with Images Only: Reinforcement Learning Forges a New Reasoning Model Paradigm, Maximizing Complex Scene Planning!
- Global Attention + Positional Attention Set a New SOTA! Nearly 100% Accuracy!