Editor's Note: Training and inference, tackled in tandem: today, two papers, from Princeton and a Meituan/Huawei-affiliated collaboration respectively, offer system-level answers for distillation training and long-context inference. TIP matches full-scale distillation performance using only 50% of tokens while cutting peak memory by 47%; AsyncTLS combines two-layer sparsity with asynchronous offloading to boost end-to-end throughput by up to 4.7×. Together, the two studies target the same core efficiency bottleneck in AI infrastructure, spanning both training and inference.
Training Side: TIP—More Tokens in Distillation Aren't Always Better
Source: arXiv:2604.14084, Princeton / Multi-institution collaboration, April 15, 2026
The mainstream approach to knowledge distillation for large models aligns every output token of the student with the teacher, on the implicit assumption that more tokens are always better. TIP's answer is a resounding no.
The Princeton team proposed a two-axis classification framework: Student Entropy (the model's degree of uncertainty) × Teacher-Student Divergence (the difference between teacher and student outputs). They discovered that token importance is not uniform:
- High-Entropy Tokens: The student is uncertain; these tokens carry dense exploratory signal.
- Low-Entropy + High-Divergence Tokens: The student is confidently wrong; these carry extremely dense correction signal—a category that entropy-only sampling misses entirely.
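The entropy × divergence bucketing above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's method: the thresholds, bucket names, and use of KL(teacher‖student) as the divergence measure are all my assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def classify_tokens(student_logits, teacher_logits,
                    entropy_thresh=1.0, divergence_thresh=1.0):
    """Bucket each token position by (student entropy, teacher-student KL).

    student_logits, teacher_logits: (seq_len, vocab) arrays.
    Thresholds are illustrative; the paper's exact criteria may differ.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    eps = 1e-12
    # Student uncertainty per position.
    entropy = -(p_s * np.log(p_s + eps)).sum(-1)
    # KL(teacher || student) per position: how far the student is from the teacher.
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(-1)

    labels = np.full(entropy.shape, "low_value", dtype=object)
    labels[entropy >= entropy_thresh] = "high_entropy"  # exploratory signal
    # Confident but wrong: low entropy, yet far from the teacher.
    labels[(entropy < entropy_thresh) & (kl >= divergence_thresh)] = "confident_wrong"
    return entropy, kl, labels
```

A distillation loop would then weight or subsample the loss using these labels, keeping the high-entropy and confident-but-wrong buckets and dropping the rest.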
Key Experimental Data:
- Training on 50% of tokens matches full-scale distillation performance, with peak memory reduced by 47%.
- Fewer than 10% of tokens, selected as precise correction tokens, approach the full-scale baseline.
- Fewer than 20% of tokens actually surpass full-scale training on long-horizon planning tasks (DeepPlanning): removing noise leaves a purer signal.
Testing covered three teacher-student pairs including Qwen3, Llama, and Qwen2.5, with comprehensive validation on the MATH-500 and AIME 2024/2025 mathematical reasoning benchmarks.
Inference Side: AsyncTLS—A Two-Layer Revolution in Sparse Attention
Source: arXiv:2604.07815, Multi-institution collaboration (including Meituan), April 9, 2026
Long-context LLM inference faces two major hurdles: O(n²) attention complexity and explosive KV Cache memory growth. Token-level sparsity is precise but pays high indexing overhead; block-level sparsity is fast but imprecise. AsyncTLS aims for the best of both worlds:
Two-Layer Sparse Attention Architecture:
- Coarse-Grained Block Filtering: Rapidly eliminates irrelevant blocks.
- Fine-Grained Token Selection: Retains key tokens for precise calculation.
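The coarse-then-fine pipeline can be sketched in a few lines. This is a toy single-query illustration, assuming mean-pooled keys as block representatives and dot-product scoring; the block size, budgets, and scoring function are my assumptions, not AsyncTLS's actual design.

```python
import numpy as np

def two_layer_select(query, keys, block_size=4, n_blocks_keep=2, n_tokens_keep=4):
    """Stage 1: score blocks via a mean-pooled key per block; keep the top blocks.
    Stage 2: score individual tokens only inside surviving blocks; keep the top tokens.
    Returns sorted indices of the keys to attend to exactly.
    query: (d,), keys: (n, d) with n divisible by block_size.
    """
    n, d = keys.shape
    blocks = keys.reshape(n // block_size, block_size, d)
    # Coarse-grained block filtering: one cheap representative per block.
    block_scores = blocks.mean(axis=1) @ query
    top_blocks = np.argsort(block_scores)[-n_blocks_keep:]
    # Fine-grained token selection: exact scores only for surviving candidates.
    cand = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                           for b in top_blocks])
    token_scores = keys[cand] @ query
    keep = cand[np.argsort(token_scores)[-n_tokens_keep:]]
    return np.sort(keep)
```

The point of the two layers is cost shaping: stage 1 touches one vector per block (n / block_size scores), so the expensive per-token scoring in stage 2 runs over only a small candidate set.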
In parallel, an Asynchronous Offloading Engine exploits temporal locality to overlap KV Cache transfers with computation, eliminating idle waits.
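The overlap idea can be shown with a bounded prefetch queue: a fetcher thread stands in for asynchronous host-to-device copies while the main thread consumes blocks as they arrive. This is a conceptual sketch only; real engines use device streams and pinned memory, and the streamed accumulation below omits the running-max trick of online softmax, so it is numerically naive.

```python
import threading
import queue
import numpy as np

def attend_stream(query, kv_blocks, prefetch_depth=2):
    """Compute single-query softmax attention over KV blocks streamed from
    'slow' memory, overlapping block transfer with per-block computation.
    kv_blocks: list of (keys, values) pairs; query: (d,) vector.
    """
    q_fetch = queue.Queue(maxsize=prefetch_depth)  # bounded: caps in-flight blocks

    def fetcher():
        for keys, values in kv_blocks:
            # Stand-in for an asynchronous host-to-device copy.
            q_fetch.put((np.ascontiguousarray(keys),
                         np.ascontiguousarray(values)))
        q_fetch.put(None)  # sentinel: no more blocks

    t = threading.Thread(target=fetcher, daemon=True)
    t.start()

    num, den = 0.0, 0.0
    while (item := q_fetch.get()) is not None:
        keys, values = item
        w = np.exp(keys @ query)   # block-local unnormalized attention weights
        num = num + w @ values     # running numerator
        den = den + w.sum()        # running normalizer
    t.join()
    return num / den
```

Because the queue is bounded, the fetcher runs at most `prefetch_depth` blocks ahead, which is the temporal-locality bet: the next blocks needed are fetched while the current one is being processed.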
Key Experimental Data (Qwen3 + GLM-4.7-Flash, 48K–96K context):
- Operator-level acceleration: 1.2×–10.0×
- End-to-end throughput improvement: 1.3×–4.7×
- Precision approaches full-attention levels, supporting both GQA and MLA architectures.
Why These Two Papers Deserve Side-by-Side Reading
| Dimension | TIP (Distillation Training) | AsyncTLS (Inference) |
|---|---|---|
| Root Cause of Problem | Blindly using all tokens is inefficient | Naive sparsity cannot balance speed and precision |
| Core Insight | Token importance is non-uniform and can be categorized | Sparsity granularity is non-uniform; combining coarse and fine is optimal |
| Key Data | 50% tokens, Memory -47% | Throughput up to +4.7× |
| Ease of Adoption | Works with standard OPD frameworks | Supports Qwen3/GLM, compatible with GQA+MLA |
Both papers point to the same underlying logic: the next wave of efficiency gains in large-model AI infrastructure will come not from stacking more compute, but from computing more precisely over less information. In training and inference alike, the assumption that "all tokens are equal" is being systematically overturned.
Source: arXiv:2604.14084 (TIP, Princeton) | arXiv:2604.07815 (AsyncTLS)