Category: Deep Learning
- Demystifying the Sparse LLM Innovation by NVIDIA and Sakana AI
- How Can a Model Trained on 200M Real Tokens Match the Performance of One Trained on 360M Tokens of Data?
- AI Doesn't Need to Understand the World, But We Need to Understand AI
- Rotate Attention by 90 Degrees! Kimi's 'Attention Residuals' Take Off Today
- Nvidia's New Technique Cuts LLM Reasoning Costs by 8x Without Losing Accuracy
- Mining Activation Functions Like Crypto? DeepMind Builds a 'Compute Farm' to Brute-Force Search for the Next-Gen ReLU
- Stop Clipping Aggressively! Qwen Proposes GatedNorm, Unifying Perspectives on the Mysteries of Residual Flow
- Google's New Discovery: DeepSeek's Reasoning Splits into Multiple Personalities, with Left and Right Brains Competing for Intelligence
- Is the Transformer Dead? DeepMind Is Betting on Another AGI Path
- What to Do with Poor Pre-Training Data? Bengio's Team Introduces Explicit Bayesian Inference for Gradient-Free In-Context RL
- Optimization Is Geometry, Geometry Is Inference: Using Mathematics to End the Transformer Black-Box Era
- RLVR Training Costs Plummet by 98%! 12 PEFT Methods Go Head-to-Head, and the Results Are Surprising...
- Attention Is Not What You Need? Reframing Sequence Modeling with Geometric Aesthetics via Grassmann Manifolds
- Wenfeng Liang Co-Authors: DeepSeek Kicks Off the New Year with a New Macro-Architecture Chapter, Cracking Gradient Explosion and the Memory Wall
- [In-Depth] Ilya Sutskever's Selected Paper: The Platonic Representation Hypothesis
- SJTU PhD's Latest Insights: Clarifying Reinforcement Learning with Just Two Questions
- A New Perspective on NAS: Graph Neural Networks Drive a Universal Architecture Space, and Hybrid Convolution-Transformer Performance Leaps!
- Is Cancer Truly Close to Being Conquered by AI? Google Announces Two Breakthroughs in Two Days
- NTU and Others Propose A-MemGuard: Locking Down AI Memory and Cutting the Poisoning-Attack Success Rate by Over 95%
- The Mamba Architecture Heads to ICLR 2026: Can the Transformer, AI's Core Brain, Keep Its Throne?
- Recursive-Reasoning HRM Model Reimagined! TRM, a Two-Layer Network with 7M Parameters, Outperforms LLMs!
- In-Depth Dissection of Large Models: From DeepSeek-V3 to Kimi K2, Understanding Mainstream LLM Architectures
- Xiaohongshu Open-Sources Its First Multimodal Large Model, dots.vlm1, with Performance Rivaling SOTA!
- Google Open-Sources DeepPolisher, Halving Genome Assembly Error Rates; Jeff Dean: "Exciting!"
- Qwen Updates Overnight: Runs on an RTX 3090, with 3B Activated Parameters Rivaling GPT-4o