TIP × AsyncTLS: Distillation Training Cuts Tokens by Half, Sparse Attention Inference Surges 4.7x

Editor's Note: Two papers released today, from Princeton and from the Meituan/Huawei labs respectively, offer system-level answers on the two sides of the training/inference divide: distillation training and long-context inference. TIP reproduces full-scale distillation results using only 50% of tokens while cutting peak memory by 47%; AsyncTLS combines two-layer sparsity with asynchronous offloading to raise end-to-end throughput by up to 4.7x. Together, the two studies attack the same core efficiency bottleneck in AI infrastructure, one on the training side and one on the inference side.


Training Side: TIP—More Tokens in Distillation Isn't Always Better

Source: arXiv:2604.14084, Princeton / Multi-institution collaboration, April 15, 2026

The mainstream approach to large-model knowledge distillation aligns every output token of the student model with the teacher. Is more always better? TIP's answer is a resounding no.

The Princeton team proposed a two-axis classification framework: Student Entropy (the model's degree of uncertainty) × Teacher-Student Divergence (the difference between teacher and student outputs). They discovered that token importance is not uniform:

  • High-Entropy Tokens: The student is uncertain, containing dense exploratory signals.
  • Low-Entropy + High-Divergence Tokens: The student is overconfident yet incorrect, providing extremely dense correction signals—a category completely invisible to traditional entropy sampling.
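A minimal sketch of this two-axis classification, assuming logits for a single sequence. The entropy/divergence thresholds, the use of forward KL as the divergence measure, and the bucket names are illustrative choices for this sketch, not values taken from the paper.

```python
import numpy as np

def classify_tokens(student_logits, teacher_logits, ent_thresh=1.0, div_thresh=0.5):
    """Bucket each token by student entropy x teacher-student divergence.

    student_logits, teacher_logits: (seq_len, vocab) arrays.
    Thresholds and bucket names are illustrative, not from the paper.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # Student entropy: high = uncertain, dense exploratory signal
    entropy = -(p_s * np.log(p_s + 1e-9)).sum(axis=-1)
    # Forward KL(teacher || student) as the teacher-student divergence
    divergence = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(axis=-1)

    labels = np.full(entropy.shape, "low_ent_low_div", dtype=object)
    labels[entropy >= ent_thresh] = "high_entropy"
    # Overconfident-but-wrong: low entropy, high divergence -> correction signal
    labels[(entropy < ent_thresh) & (divergence >= div_thresh)] = "low_ent_high_div"
    return labels, entropy, divergence
```

Note that the second bucket cannot be found by entropy alone: a low-entropy token looks "easy" to entropy sampling, and only the divergence axis exposes that the student is confidently wrong.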

Key Experimental Data:

  • 50% of tokens match full-scale distillation performance, with peak memory reduced by 47%.
  • Fewer than 10% of tokens, selected as precise correction tokens, come close to the full-scale baseline.
  • Fewer than 20% of tokens actually surpass full-scale training on long-horizon planning tasks (DeepPlanning): removing noisy tokens leaves a purer signal.
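The numbers above come from training on only the selected fraction of tokens. A hedged sketch of what that looks like as a loss function, using per-token teacher-student KL as a stand-in importance score (the paper's actual criterion combines the entropy and divergence axes) and a hypothetical `keep_frac` parameter:

```python
import numpy as np

def selective_distill_loss(student_logits, teacher_logits, keep_frac=0.5):
    """Mean KL(teacher || student) over only the top keep_frac of tokens.

    Scoring by per-token KL is a simplified proxy for the paper's
    entropy-x-divergence selection. Shapes: (seq_len, vocab).
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    p_t = np.exp(log_p_t)
    per_token_kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)  # (seq_len,)
    k = max(1, int(keep_frac * per_token_kl.size))
    top = np.sort(per_token_kl)[-k:]  # keep the k highest-divergence tokens
    return top.mean()
```

Dropping the unselected tokens is also where the 47% memory saving would come from: logits and gradients only need to be materialized for the kept positions.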

Testing covered three teacher-student pairs including Qwen3, Llama, and Qwen2.5, with comprehensive validation on the MATH-500 and AIME 2024/2025 mathematical reasoning benchmarks.


Inference Side: AsyncTLS—A Two-Layer Revolution in Sparse Attention

Source: arXiv:2604.07815, Multi-institution collaboration (including Meituan), April 9, 2026

Long-context LLM inference faces two major hurdles: O(n²) attention complexity and explosive KV Cache memory consumption. Token-level sparsity offers good precision but high indexing overhead; block-level sparsity is fast but imprecise. AsyncTLS goes after the best of both worlds:

Two-Layer Sparse Attention Architecture:

  1. Coarse-Grained Block Filtering: Rapidly eliminates irrelevant blocks.
  2. Fine-Grained Token Selection: Retains key tokens for precise calculation.
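The two stages can be sketched for a single query vector as follows. Scoring blocks by their mean-pooled keys, and the `block_size`/`top_blocks`/`top_tokens` parameters, are illustrative assumptions for this sketch rather than AsyncTLS's exact formulation.

```python
import numpy as np

def two_layer_sparse_attention(q, K, V, block_size=4, top_blocks=2, top_tokens=4):
    """Coarse block filtering, then fine token selection, for one query.

    q: (d,); K, V: (n, d) with n divisible by block_size.
    Hyperparameters are illustrative.
    """
    n, d = K.shape
    # 1) Coarse: score each block by q . mean(keys in block); drop the rest
    block_keys = K.reshape(-1, block_size, d).mean(axis=1)   # (n_blocks, d)
    keep_blocks = np.argsort(block_keys @ q)[-top_blocks:]
    token_idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in keep_blocks])
    # 2) Fine: within surviving blocks, keep only the highest-scoring tokens
    scores = K[token_idx] @ q / np.sqrt(d)
    fine = token_idx[np.argsort(scores)[-top_tokens:]]
    # Exact softmax attention over the small surviving token set
    s = K[fine] @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[fine]
```

The coarse pass does O(n / block_size) work to prune candidates cheaply; the expensive per-token scoring then runs only inside the surviving blocks, which is why the combination keeps block-level speed without giving up token-level precision.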

Alongside the sparse attention, AsyncTLS runs an Asynchronous Offloading Engine: exploiting temporal locality, it overlaps KV Cache transfers with computation so that neither the copy engine nor the compute units sit idle.
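The overlap idea can be sketched with a background prefetch thread that stages the next KV block while attention runs on the current one. This is a CPU-only illustration of the scheduling pattern, with a plain array copy standing in for the host-to-device transfer; a real engine would use asynchronous device copies and pinned memory instead of Python threads.

```python
import threading
import queue
import numpy as np

def offloaded_attention(q, kv_blocks):
    """Streaming attention over offloaded KV blocks, with transfer
    of block i+1 overlapped against compute on block i."""
    fetched = queue.Queue(maxsize=1)  # double buffer: one block in flight

    def prefetch():
        for K, V in kv_blocks:
            fetched.put((K.copy(), V.copy()))  # stands in for an async H2D copy
        fetched.put(None)                      # end-of-stream sentinel

    threading.Thread(target=prefetch, daemon=True).start()

    num, den = 0.0, 0.0
    while (blk := fetched.get()) is not None:
        K, V = blk
        w = np.exp(K @ q)        # unnormalized attention weights for this block
        num = num + w @ V        # streaming numerator
        den = den + w.sum()      # streaming normalizer
    return num / den
```

Because the result is accumulated as a running numerator and normalizer, each KV block can be discarded as soon as it is consumed, which is exactly what makes offloading the cache off-device viable.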

Key Experimental Data (Qwen3 + GLM-4.7-Flash, 48K~96K context):

  • Operator-level acceleration: 1.2× ~ 10.0×
  • End-to-end throughput improvement: 1.3× ~ 4.7×
  • Precision approaches full-attention levels, supporting both GQA and MLA architectures.

Why These Two Papers Deserve Side-by-Side Reading

| Dimension | TIP (Distillation Training) | AsyncTLS (Inference) |
| --- | --- | --- |
| Root cause of the problem | Blindly using all tokens is inefficient | Naive sparsity cannot balance speed and precision |
| Core insight | Token importance is non-uniform and can be categorized | Sparsity granularity is non-uniform; combining coarse and fine is optimal |
| Key data | 50% of tokens, peak memory -47% | Throughput up to 4.7× |
| Barrier to adoption | Directly accessible via standard OPD frameworks | Supports Qwen3/GLM, compatible with GQA and MLA |

Both papers point to the same underlying logic: The next wave of efficiency dividends in large model AI infrastructure will not come from stacking more compute power, but from performing more precise calculations with less information. Whether in training or inference, the assumption that "all tokens are equivalent" is being systematically overturned.


Source: arXiv:2604.14084 (TIP, Princeton) | arXiv:2604.07815 (AsyncTLS)
