Category: Computer Vision

Unlimited OCR: One-Shot Parsing of Long Documents with Reference Sliding Window Attention (R-SWA)
Remove the Vision Encoder, and Multimodal Models Actually Get Stronger?
What Did DeepSeek's New Paper, Deleted Overnight, Actually Say?
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
NUS, Fudan, and Tsinghua: The First Systematic Survey on Large Model Latent Spaces
Southeast University's Geng Xin Team: Models Fail Not from Inability, But from 'Crowded-Out Capacity' | CVPR 2026
Meta Bets on Neural Computers: Is the Next-Gen Computer the Model Itself?
OCR Domain Adaptation Without Retraining from Scratch? Decoupling Language Models Reduces Computation by 95%
Xiaohongshu's "Everything is OCR": A 3B Small Model Outperforms Giants, Parsing Charts into Code
Reconstructing Native Multimodality! Meituan Releases Purely Discrete Base Model, Truly Achieving 'Everything is Token'
VideoSeek Long-Video Understanding Agent: The Secret to Boosting GPT-5's Long-Video Comprehension by 10 Points
Multimodal Video Streaming Inference Efficiency Boosted by 56%: Unveiling TWW's Segment-Level Dynamic Memory Mechanism
The More Reasoning, The More Hallucinations? The "Hallucination Paradox" of Multimodal Reasoning Models
Breaking! Meta Open-Sources Its Latest World Model
Fei-Fei Li's Latest Interview: World Models Are Coming
OPA-DPO: An Efficient Solution for the Hallucination Problem in Multimodal Large Models
Thinking with Images Only: Reinforcement Learning Forges a New Reasoning Model Paradigm, Maximizing Complex Scene Planning!
Global Attention + Positional Attention Set a New SOTA! Nearly 100% Accuracy!