The belief in the Scaling Law not only drives us to make continuous breakthroughs in model parameters and data scale, but also constantly pushes the limits of infrastructure engineering. This process is accompanied by inevitable growing pains, which we call "Scaling Pain."
As large-model applications shift from simple dialogue to more complex, long-running Coding Agent tasks, our inference infrastructure has come under unprecedented pressure, handling hundreds of millions of Coding Agent calls daily. Over the past few weeks, some users encountered anomalies when using the GLM-5 series models for complex Coding Agent tasks: garbled output, repetition, and occasional rare characters. These issues do not occur in standard inference environments; they are triggered only in high-concurrency, long-context Coding Agent scenarios, which makes them very difficult to reproduce consistently.
After several weeks of deduction, troubleshooting, and stress testing, we finally located and fixed several independent low-level race condition bugs. We also performed targeted optimizations on the system bottlenecks they revealed, significantly improving the stability and efficiency of the inference system.
We are sharing the experience and lessons learned during this exploration to help the community overcome the Scaling Pain of Coding Agent inference together.
From Local Reproduction to Anomaly Identification
Since March, we have observed three types of anomalies in GLM-5's online monitoring and user feedback: garbled output, repetition, and rare characters. On the surface, these phenomena resemble the "intelligence degradation" commonly seen in long-context scenarios. However, because we had not deployed any optimizations that reduce model precision, a more fundamental question arose: did the anomalies originate from the model itself, or from the inference pipeline? If they originated from the model, they would appear as stable, repeatable behavior for specific inputs. Conversely, if they correlated with system pressure or runtime state, they would more likely point to pipeline or state-management issues in the inference infrastructure.
In the initial investigation, we replayed user-reported bad cases locally, running inference on the same batch of requests hundreds of times, but could never reproduce the anomalies, which suggested that the model itself was likely not the problem. To better simulate the pressure of the online environment, we anonymized online logs and ran a full local replay that preserved the original concurrency distribution and request timing as closely as possible. Initially, the anomalies still did not appear. Only after we adjusted the Prefill-Decode (PD) separation ratio and kept increasing the system load, simulating peak-hour Prefill queueing and KV Cache pressure on the Decode side, were we able to reproduce the anomalies consistently, at a rate of 3-5 per roughly 10,000 requests. Because the anomalies were unrelated to request content but correlated with system pressure, the problem likely stemmed from inference state management under high load. At the same time, the locally reproduced frequency was still lower than the frequency reported online, suggesting that our detection methods were missing some cases or that some triggering scenarios were not yet covered.
Reliably identifying anomalous output became a new challenge. Of the three types of anomalies, repetition is relatively easy to detect, while garbled output and rare characters are harder. We tried heuristic methods such as regular expressions and character-set matching, as well as model-based classifiers. The former produced many false negatives and false positives, while the latter were too slow for large-scale ablation experiments. These limitations made anomaly detection itself a bottleneck in localizing the problem.
Figure 1: Speculative decoding metrics can serve as an important reference for anomaly detection.
After repeatedly analyzing inference logs, we found an unexpected entry point: Speculative Decoding metrics. Speculative decoding is a performance optimization in which a draft model generates candidate tokens that the target model then verifies and accepts or rejects, improving decode efficiency without changing the final output distribution. As shown in Figure 1, we observed that two metrics, spec_accept_length (the length of the prefix of draft tokens consecutively accepted by the target model) and spec_accept_rate (the proportion of draft tokens accepted), exhibited stable patterns when anomalies occurred:
- Garbled output and rare characters: Typically accompanied by an extremely low spec_accept_length, meaning the candidate tokens generated by the draft model were almost entirely rejected by the target model, indicating a significant deviation between the KV Cache state seen by the target model and the draft model's expectations.
- Repetition: Typically accompanied by a high spec_accept_rate, suggesting that corrupted KV Cache might degrade attention patterns, pushing the generation process into a high-confidence repetitive loop.
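As an illustration of the two metrics, the following minimal sketch computes them from per-step verification results. It assumes greedy verification (a draft token is accepted only if it equals the token the target model would emit at that position) and follows the definitions given above literally; real engines may define the metrics differently, for example by also counting the token emitted by the target model at each verification step. The names `accepted_prefix_len` and `SpecStats` are hypothetical and do not come from any inference engine.

```python
from dataclasses import dataclass
from typing import Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Length of the longest draft prefix whose tokens match the target's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


@dataclass
class SpecStats:
    """Running speculative-decoding statistics for one request (illustrative only)."""
    steps: int = 0      # number of verification steps
    drafted: int = 0    # total draft tokens proposed
    accepted: int = 0   # total draft tokens accepted

    def update(self, draft: Sequence[int], target: Sequence[int]) -> None:
        self.steps += 1
        self.drafted += len(draft)
        self.accepted += accepted_prefix_len(draft, target)

    @property
    def spec_accept_length(self) -> float:
        # Average accepted prefix length per verification step.
        return self.accepted / self.steps if self.steps else 0.0

    @property
    def spec_accept_rate(self) -> float:
        # Fraction of all proposed draft tokens that were accepted.
        return self.accepted / self.drafted if self.drafted else 0.0
```

For example, two verification steps with four draft tokens each, of which two and then four are accepted, give a spec_accept_length of 3.0 and a spec_accept_rate of 0.75.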
Based on these observations, we further implemented an online anomaly monitoring strategy: When spec_accept_length remains below 1.4 and the generated length exceeds 128 tokens, or when spec_accept_rate exceeds 0.96, the system actively aborts the current generation and the load balancer retries the request. This strategy extends speculative decoding from a mere performance optimization technique to a real-time monitoring signal for output quality, becoming a key tool in subsequent ablation experiments.
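The abort rule can be sketched as a small predicate. This is a minimal illustration, not the production monitor: it reads the rule as "(low accept length AND more than 128 generated tokens) OR high accept rate", evaluates it on a single snapshot of the metrics rather than on a sustained window, and models the load balancer's retry with a hypothetical `generate_with_retry` helper whose `run_once` callback is assumed to return the text together with its metrics.

```python
from typing import Callable, Optional, Tuple

# Thresholds taken from the text above.
LOW_ACCEPT_LENGTH = 1.4
MIN_GENERATED_TOKENS = 128
HIGH_ACCEPT_RATE = 0.96


def should_abort(spec_accept_length: float,
                 spec_accept_rate: float,
                 generated_tokens: int) -> bool:
    """Return True if the generation looks anomalous and should be aborted."""
    # Pattern associated with garbled output and rare characters.
    garbled_suspected = (spec_accept_length < LOW_ACCEPT_LENGTH
                         and generated_tokens > MIN_GENERATED_TOKENS)
    # Pattern associated with repetition loops.
    repetition_suspected = spec_accept_rate > HIGH_ACCEPT_RATE
    return garbled_suspected or repetition_suspected


# (text, spec_accept_length, spec_accept_rate, generated_tokens)
Attempt = Tuple[str, float, float, int]


def generate_with_retry(run_once: Callable[[], Attempt],
                        max_retries: int = 2) -> Optional[str]:
    """Hypothetical retry loop: rerun the request while the monitor flags it."""
    for _ in range(max_retries + 1):
        text, accept_len, accept_rate, n_tokens = run_once()
        if not should_abort(accept_len, accept_rate, n_tokens):
            return text
    return None  # every attempt was flagged as anomalous
```

The length condition on the first rule keeps short generations, where the average accept length is still noisy, from being aborted prematurely.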
BugFix #1: KV Cache Race Condition in PD Separation Architecture
After observing a clear correlation between anomalous output and concurrency pressure, we investigated its cause. By tracing the request lifecycle and the execution timing of PD separation in the inference engine, we found that the request lifecycle was not synchronized with the reclamation and reuse of KV Cache, which led to a KV Cache reuse conflict.
1. Root Cause Analysis: KV Cache Reuse Race Condition Caused by Asynchronous Abort
To constrain tail latency, we introduced a timeout-based request termination mechanism in the inference engine: when the Prefill phase does not complete within the specified time, the Decode side aborts the request and reclaims its occupied KV Cache resources. However, this abort signal was not correctly propagated to the Prefill side, and the Decode side also lacked sufficient information to determine if the KV Cache was safe to reclaim and reuse. Consequently, after Decode aborted and allocated the corresponding KV Cache space to a new request, previously initiated RDMA writes and ongoing Prefill computations were not synchronously cancelled.
Figure 2: Schematic diagram of the KV Cache race condition in a PD separation scenario.
Figure 2 illustrates the timing relationship between two requests interacting between Prefill and Decode in a PD separation architecture, and the resulting KV Cache race condition.
Initially, Req1 is sent to Prefill-1 (P1) and Decode (D). Due to scheduling or queuing, Req1 experiences a waiting period on the P1 side before it starts Prefill Forward. Meanwhile, the Decode side does not receive the corresponding KV Cache data within a certain time, triggering a timeout mechanism and aborting Req1.
Subsequently, the Decode side reclaims the KV Cache slot occupied by Req1, but without properly notifying P1. Immediately after, a new request, Req2, arrives and is assigned to Prefill-2 (P2) and Decode. Due to the memory reuse strategy, Req2 is assigned the same KV Cache address as Req1. P2 starts Prefill Forward and performs KV Transfer, completing in a shorter time, allowing the Decode side to enter the generation phase.
Meanwhile, the KV Cache write initiated by P1 for Req1 is still in flight. Its data is written to the GPU memory region that has already been reused by Req2, overwriting part of Req2's KV Cache. Ultimately, Req2 reads the overwritten data during the Decode phase, which produces anomalous output.
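The sequence of events can be replayed with a toy, single-threaded model. This is only a sketch under simplifying assumptions: the `DecodeKVPool` class is hypothetical, a slot index stands in for a GPU memory address, a plain method call stands in for an RDMA write, and the pool reuses the most recently freed slot first.

```python
from typing import Dict, List, Optional, Tuple


class DecodeKVPool:
    """Toy Decode-side KV pool; a slot index stands in for a GPU memory address."""

    def __init__(self, num_slots: int) -> None:
        self.free: List[int] = list(range(num_slots))
        self.owner: Dict[int, str] = {}
        self.data: Dict[int, Optional[str]] = {}

    def allocate(self, req_id: str) -> int:
        slot = self.free.pop()      # most recently freed slot is reused first
        self.owner[slot] = req_id
        self.data[slot] = None
        return slot

    def release(self, slot: int) -> None:
        self.owner.pop(slot)
        self.free.append(slot)

    def remote_write(self, slot: int, payload: str) -> None:
        # Stand-in for an RDMA write: it targets an address and knows
        # nothing about which request currently owns that address.
        self.data[slot] = payload


def simulate_unsafe_abort() -> Tuple[int, int, Optional[str]]:
    pool = DecodeKVPool(num_slots=1)
    slot1 = pool.allocate("req1")         # Req1 is queued on P1
    pool.release(slot1)                   # Decode times out and reclaims; P1 is not told
    slot2 = pool.allocate("req2")         # Req2 is given the same slot
    pool.remote_write(slot2, "kv(req2)")  # P2 finishes its KV transfer
    pool.remote_write(slot1, "kv(req1)")  # P1's late write lands on Req2's slot
    return slot1, slot2, pool.data[slot2]
```

In this replay both requests receive the same slot, and the content Decode reads for Req2 is the payload written on behalf of Req1.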
2. Fix: Ensuring Timing Consistency for KV Cache Release
To eliminate the race condition described above, we introduced stricter timing constraints in the inference engine, establishing an explicit synchronization relationship between request termination and the completion of KV Cache writes.
Specifically, after Decode triggers an abort, it sends a notification to the Prefill side. Prefill returns a "release-ready" signal only when the following conditions are met: the relevant RDMA writes have not yet started, or all submitted writes have been completed. Decode is only allowed to reclaim and reuse the corresponding KV Cache slot after receiving this confirmation. This mechanism ensures that KV writes do not cross the memory reuse boundary, thus preventing cross-request KV Cache overwrites.
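A minimal sketch of this handshake follows. It is not the engine's implementation: the class and method names are hypothetical, write completion is modeled with simple counters, and it additionally assumes that Prefill refuses to submit new writes once it has been notified of the abort, which is what makes the "writes have not yet started" case safe.

```python
from typing import Dict, List, Optional


class PrefillTransfer:
    """Prefill-side bookkeeping for one request's KV writes (illustrative)."""

    def __init__(self) -> None:
        self.submitted = 0
        self.completed = 0
        self.cancelled = False

    def submit_write(self) -> bool:
        if self.cancelled:          # assumption: no new writes after an abort
            return False
        self.submitted += 1
        return True

    def complete_write(self) -> None:
        self.completed += 1

    def on_abort(self) -> None:
        self.cancelled = True

    def release_ready(self) -> bool:
        # Ready when no write was started, or every submitted write has completed.
        return self.cancelled and self.submitted == self.completed


class DecodeSlotManager:
    """Decode-side slot manager that defers reuse until Prefill confirms."""

    def __init__(self, num_slots: int) -> None:
        self.free: List[int] = list(range(num_slots))
        self.pending_release: Dict[int, PrefillTransfer] = {}

    def allocate(self) -> Optional[int]:
        return self.free.pop() if self.free else None

    def abort(self, slot: int, transfer: PrefillTransfer) -> None:
        transfer.on_abort()                     # notify the Prefill side
        self.pending_release[slot] = transfer   # do not reuse the slot yet
        self.poll()

    def poll(self) -> None:
        # Reclaim only the slots whose "release-ready" signal has arrived.
        for slot, transfer in list(self.pending_release.items()):
            if transfer.release_ready():
                del self.pending_release[slot]
                self.free.append(slot)
```

The cost of this design is that an aborted slot may stay unavailable until its in-flight writes drain, which trades a small amount of KV Cache capacity for correctness.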
Fix Effect: After this fix was deployed, the rate of anomalous output dropped from roughly a dozen per ten thousand requests to fewer than three per ten thousand. This result shows that, in a PD separation architecture, cross-node data transfer and GPU memory reuse need explicit consistency constraints to avoid this class of problem.
BugFix #2: Missing Timing Sequence in HiCache Loading
The Coding Agent scenario significantly increases input length (averaging over 70K tokens) and comes with a high prefix reuse rate. This type of workload makes HiCache (hierarchical KV Cache) a key optimization in online services. However, when KV Cache swap-in overlapped with computation, the previous implementation did not guarantee that data was fully loaded before use, so KV Cache that was not yet ready could be accessed.
1. Root Cause Analysis: Read-Before-Ready Caused by Missing Pipeline Synchronization
By analyzing the execution timing of HiCache, we localized the problem to the cache read path of the DSA HiCache. The system asynchronously swaps in historical prefix caches from CPU memory and improves throughput by overlapping the execution of the Load Stream and Forward Stream.
As shown in Figure 3(a), the Load Stream is responsible for loading KV Cache and Indexer Cache, while the Forward Stream sequentially executes the Indexer computation and the subsequent Sparse Attention. In principle, the Indexer computation in the Forward Stream should not start until the corresponding Indexer Cache has been loaded. However, in the original implementation, this dependency was not explicitly expressed.
Specifically, the Indexer operator did not wait for the Load Indexer Cache operation to complete before starting (indicated by the red dashed area in Figure 3). Therefore, the Forward Stream could begin executing before the Load Stream finished loading data, resulting in a Read-before-Ready access pattern, where data is read before it has been fully loaded.
As a result, the Indexer computation runs on incomplete or uninitialized data, which corrupts the results of the subsequent Sparse Attention computation and finally manifests as anomalous output.
Figure 3: Schematic diagram of the HiCache read pipeline timing anomaly and its fix.
2. Fix: Restructuring the Atomicity of the Operator Pipeline
To resolve the above problem, we modified the HiCache read pipeline (as shown in Figure 3(b)), introducing an explicit synchronization constraint between data loading and computation:
- Explicit Synchronization Constraint: Introduced a synchronization point with the Load Stream before the Indexer operator starts, ensuring that the corresponding layer's Indexer Cache has been loaded. The Forward Stream initiates computation only after the data is ready, thereby avoiding the Read-before-Ready access pattern.
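The effect of such a synchronization point can be illustrated with a CPU-only analogy. The sketch below is not the HiCache code: two Python threads stand in for the Load Stream and the Forward Stream, and a per-layer `threading.Event` stands in for a GPU event that is recorded on the load stream and waited on by the forward stream. The function name `run_pipeline` and the string payloads are hypothetical.

```python
import threading
import time
from typing import List, Optional


def run_pipeline(num_layers: int, synchronize: bool) -> List[Optional[str]]:
    """Return what the forward 'stream' observed for each layer's Indexer Cache."""
    indexer_cache: List[Optional[str]] = [None] * num_layers
    loaded = [threading.Event() for _ in range(num_layers)]
    seen: List[Optional[str]] = []

    def load_stream() -> None:
        for layer in range(num_layers):
            time.sleep(0.01)                    # simulated swap-in latency
            indexer_cache[layer] = f"indexer[{layer}]"
            loaded[layer].set()                 # signal: this layer is ready

    def forward_stream() -> None:
        for layer in range(num_layers):
            if synchronize:
                loaded[layer].wait()            # explicit synchronization point
            # Without the wait, this read may observe None (read-before-ready).
            seen.append(indexer_cache[layer])

    threads = [threading.Thread(target=load_stream),
               threading.Thread(target=forward_stream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

With `synchronize=True` the forward side always observes fully loaded entries, while loading of later layers still overlaps with computation on earlier ones. With `synchronize=False` the outcome depends on timing, which mirrors why the original bug appeared only under load.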
After this fix was deployed, under the same load conditions, anomalies caused by execution timing inconsistencies completely disappeared, and system behavior stabilized. This fix has been submitted to the SGLang community via Pull Request #22811.
Optimization: LayerSplit for Hierarchical KV Cache Storage
The two race condition issues mentioned above revealed a common system bottleneck: In long-context Coding Agent serving scenarios, the Prefill phase dominates system performance.
To bound the increase in Time-To-First-Token (TTFT) caused by Prefill queuing, we introduced the timeout abort; to alleviate KV Cache capacity pressure on the Prefill side, we introduced HiCache. After fixing these state consistency issues, we returned to the bottleneck itself: how to improve Prefill throughput and reduce KV Cache GPU memory pressure on the Prefill side. To this end, we designed and implemented a hierarchical KV Cache storage scheme called LayerSplit.
Coding Agent workloads typically combine long contexts with a high Prefix Cache hit rate. In this scenario, the Prefill phase often becomes the main performance bottleneck, which makes Context Parallelism (CP) the primary parallelism strategy for online Prefill nodes. However, the existing open-source implementation in SGLang stores KV Cache redundantly, so the limited KV Cache capacity constrains GPU compute utilization.
Figure 4: LayerSplit, a hierarchical KV Cache storage scheme.
LayerSplit addresses this problem: each GPU no longer stores the KV Cache for all layers and instead holds the KV Cache for only a subset of layers (as shown in Figure 4(a)), which significantly reduces GPU memory usage on a single card.
During computation, the CP ranks complete Prefill collaboratively, as shown in Figure 4(b): the rank that holds the KV Cache for a given layer broadcasts that layer's cache to the other relevant ranks before the Attention computation runs. To reduce communication overhead, we overlap the KV Cache broadcast with the Indexer computation so that each hides the other's cost. As a result, the only additional overhead is the Indexer Cache broadcast, which is about 1/8 the size of the KV Cache, so the overall communication cost is low and the impact on performance is negligible.
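The memory effect of splitting layers across CP ranks can be estimated with a short sketch. The text does not say how layers are assigned to ranks, so the round-robin policy here is an assumption, as are the function names and the example sizes. The sketch also ignores the temporary buffer a rank needs to receive a broadcast layer.

```python
from typing import Dict, List


def owner_of(layer: int, cp_size: int) -> int:
    """Rank that stores (and later broadcasts) this layer's KV Cache.

    Round-robin assignment is an assumed policy for illustration.
    """
    return layer % cp_size


def assign_layers(num_layers: int, cp_size: int) -> Dict[int, List[int]]:
    """Map each CP rank to the layers whose KV Cache it stores."""
    layers: Dict[int, List[int]] = {rank: [] for rank in range(cp_size)}
    for layer in range(num_layers):
        layers[owner_of(layer, cp_size)].append(layer)
    return layers


def per_rank_kv_bytes(num_layers: int, cp_size: int,
                      bytes_per_layer: int, layer_split: bool) -> int:
    """Worst-case persistent KV Cache footprint on a single rank."""
    if not layer_split:
        return num_layers * bytes_per_layer     # every rank stores all layers
    most_layers = max(len(v) for v in assign_layers(num_layers, cp_size).values())
    return most_layers * bytes_per_layer


def exposed_broadcast_bytes(kv_bytes_per_layer: int) -> float:
    """Communication that is not hidden by overlap, per the 1/8 ratio in the text."""
    return kv_bytes_per_layer / 8
```

For instance, with 8 layers, 4 CP ranks, and 100 bytes of KV Cache per layer (all illustrative numbers), the per-rank footprint drops from 800 bytes to 200 bytes under this assignment.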
Figure 5: Throughput improvement of GLM-5.1 + LayerSplit at various lengths.
Figure 5 shows the performance improvement from this optimization at a 90% cache hit rate, for request lengths ranging from 40K to 120K tokens. System throughput improves by 10% to 132%, and the benefit grows as the context length increases. Overall, this optimization significantly enhances the system's processing capability in Coding Agent scenarios.
Summary
When intelligence truly enters high-concurrency, long-context Coding Agent scenarios, the challenges for inference infrastructure go beyond throughput, latency, and availability; maintaining output quality becomes critically important. Every pursuit of the Scaling Law must be supported by an equally strong foundation of system engineering. We share these experiences in the hope of helping the community avoid some detours and jointly build an inference infrastructure capable of carrying the future of AGI.
Acknowledgments
This blog post introduces our research into a series of system-level problems in Coding Agent services, including the reproduction and analysis of these problems, as well as the corresponding optimization solutions. We thank Zhongke Jiahe (Beijing) Technology Co., Ltd. and the team from the State Key Laboratory of Processor Chips, Institute of Computing Technology, Chinese Academy of Sciences for their collaboration and support.
Technical blog link: https://z.ai/blog/scaling-pain