By Xi Feng, from Aofeisi | QbitAI | Official Account: QbitAI
Over in the official repository, a heated issue points out that Claude Code has been "ruined" by an update.
A certain update caused reasoning depth to drop by 67%, making the current version incapable of handling complex engineering tasks.
"Ignoring user instructions," "executing operations completely opposite to user requirements," "pretending tasks are complete"... the model's behavior has gone completely off track.
Chain-of-thought length was slashed from 2,200 characters to under 700, shifting from a rigorous "research first, then modify code" approach to a reckless "modify immediately" mode.
This is the root cause of various bugs, reverse operations, and ignored instructions.
Crucially, the timeline of capability degradation traces back to February, coinciding exactly with the launch of the new feature redact-thinking-2026-02-12 (a feature to hide thinking content).
In other words, this Claude Code update has indeed ruined it.
The community is filled with complaints. Users mentioned they initially suspected they were operating it incorrectly, never imagining the tool itself was at fault.
Recently, it keeps telling me things like "You should go to sleep" or "It's too late, let's stop here for today." At first, I thought I had accidentally let Claude know my deadline.
Various Slacking Behaviors of Claude Code After Reasoning Was Cut
The feedback was submitted by Stella Laurenzo, who is responsible for open-source AI software development at AMD.
All analysis is based on 6,852 Claude Code session JSONL files from four projects under the ~/.claude/projects/ directory (iree-loom, iree-amdgpu, iree-remoting, bureau). This covers 17,871 thinking blocks (7,146 with full content, 10,725 hidden), 234,760 tool calls, and over 18,000 user prompts (covering negative sentiment indicators, correction frequency, and session duration), spanning from late January 2026 to early April.
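The session logs under ~/.claude/projects/ are plain JSONL, so tallies like these are straightforward to reproduce. A minimal sketch, assuming a record layout where assistant messages carry a content list containing type "thinking" blocks (the field names here are assumptions for illustration, not a documented schema):

```python
import json
from pathlib import Path

def thinking_stats(project_dir):
    """Scan session JSONL files and tally thinking blocks.

    Assumed layout: each line is a JSON record whose message.content is a
    list; entries with type "thinking" hold the reasoning text (empty when
    the content was redacted and only a signature remains).
    """
    visible_lengths, redacted = [], 0
    for path in Path(project_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            for block in record.get("message", {}).get("content", []):
                if block.get("type") == "thinking":
                    text = block.get("thinking", "")
                    if text:
                        visible_lengths.append(len(text))  # depth in characters
                    else:
                        redacted += 1                      # hidden content
    return visible_lengths, redacted
```

Running this per project and bucketing by date is enough to reconstruct the visible/hidden split and the depth timeline described above.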
The entire test used the most powerful Opus model in the Claude series, connected directly via the official Anthropic API, excluding interference from third-party adaptations or client-side failures.
The report's Pearson correlation analysis of 7,146 valid samples (with a coefficient as high as 0.971) showed that the length of the signature field accurately estimates reasoning depth.
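The correlation itself needs nothing beyond the standard formula. A dependency-free sketch; the idea, per the report, is that signature length tracks the character count of the thinking text closely enough to serve as a proxy on redacted blocks:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Applied to the 7,146 blocks where both the signature and the full thinking text are available, a coefficient near 0.97 justifies estimating depth from signatures on the 10,725 hidden blocks.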
First, the report points out that the launch time of the "hide thinking" feature aligns perfectly with the timeline of Claude Code's quality degradation.
Below are the analysis results based on thinking blocks in the dialogue JSONL files:
A user reported quality degradation on March 8th—this day happened to be the tipping point where hidden thinking blocks exceeded 50%.
The rollout pace of this feature within a week (1.5% → 25% → 58% → 100%) fully matches the profile of a staged canary rollout.
In fact, Claude Code's reasoning depth had already dropped significantly before this hiding feature went live.
Comparing data from different periods reveals that from January 30 to February 8, its reasoning depth was approximately 2,200 characters. By late February, it plummeted to 720 characters, a drop of 67%. By early March, it further shrank to 560 characters, a decline of 75%.
The hiding feature launched in early March merely made this degradation invisible to users.
The drastic reduction in reasoning depth directly triggered a fundamental shift in the model's tool usage patterns.
During the "quality period" from January 30 to February 12, Claude Code's read-to-modify ratio reached 6.6. The workflow followed a "research first, then modify" approach (reading target files, related dependencies, searching global codebase call relationships, checking header files and test cases before making precise modifications).
However, in the "degradation period" after March 8, the read-to-modify ratio plummeted to 2.0. Research investment decreased by 70%, skipping preliminary investigation steps entirely. It rushed to modify after reading only the current file, completely ignoring contextual relevance.
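The read-to-modify ratio is easy to recompute from the tool-call stream in the same logs. A sketch, with the tool-name sets as illustrative assumptions:

```python
# Assumed groupings of Claude Code tool names; adjust to whatever names
# actually appear in your session logs.
READ_TOOLS = {"Read", "Grep", "Glob"}
WRITE_TOOLS = {"Edit", "Write"}

def read_modify_ratio(tool_calls):
    """tool_calls: iterable of tool-name strings from a session log."""
    reads = sum(1 for t in tool_calls if t in READ_TOOLS)
    writes = sum(1 for t in tool_calls if t in WRITE_TOOLS)
    return reads / writes if writes else float("inf")
```

Computed per period, this single number captures the shift from "research first, then modify" (6.6) to "modify immediately" (2.0).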
More detailed data shows that during the degradation period, 1 out of every 3 modifications was an operation performed by the model without reading the target file's context.
When modifying files it hadn't read, the model couldn't distinguish between the end of comment blocks and the start of code. It would insert new declarations between documentation comments and the functions they describe, completely destroying semantic associations.
This never happened during the quality period.
The negative impact of this pattern shift is reflected in multiple quantifiable quality metrics.
Before March 8, termination hook scripts designed to identify shirking responsibility or premature termination were never triggered. However, in the 17 days after March 8, triggers skyrocketed to 173 times, averaging 10 times per day.
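A hook of this kind can be as simple as a phrase scan over the model's final message. A hypothetical sketch only: the report does not publish its actual hook script, and the phrase list below is invented for illustration:

```python
# Hypothetical stop-hook check: flag final messages that look like a
# premature sign-off. Both the phrases and the boolean interface are
# assumptions, not the report's real script.
SHIRK_PHRASES = [
    "i'll stop here",
    "let's stop here for today",
    "the rest is left as",
    "you should go to sleep",
]

def violates(final_message: str) -> bool:
    """Return True if the message looks like premature termination."""
    text = final_message.lower()
    return any(phrase in text for phrase in SHIRK_PHRASES)
```

A wrapper script can then return a non-zero status on violation, letting the harness push the model back to work; counting those non-zero exits yields the trigger statistics quoted above.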
These metrics were calculated independently based on over 18,000 user prompts.
The proportion of negative sentiment in user prompts rose from 5.8% to 9.8%, an increase of 68%. The number of shirking behaviors requiring correction doubled. The average number of prompts per session dropped by 22%, and even reasoning loops, which had never occurred before, appeared.
When reasoning depth was sufficient, the model would resolve reasoning contradictions internally before output. When reasoning depth was insufficient, contradictions were directly exposed in the output, manifesting as visible self-corrections like "Oh wait," "Actually," "Let me rethink," "Hmm, no," "Wait, that's not right"...
The rate of reasoning loops more than tripled.
In the worst sessions, the model reversed its reasoning over 20 times in a single response: generating a plan, overturning it, modifying it, overturning the modification again. The final output was completely unreliable, with the reasoning path thoroughly chaotic.
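Counting these visible self-correction markers gives a crude but machine-readable reasoning-loop score per response. A sketch using the phrases quoted in the report:

```python
import re

# Self-correction markers from the report; each non-overlapping hit
# counts as one visible reasoning reversal in the output.
MARKERS = re.compile(
    r"\b(oh wait|actually|let me rethink|hmm, no|wait, that's not right)\b",
    re.IGNORECASE,
)

def loop_score(response: str) -> int:
    """Number of visible self-corrections in one model response."""
    return len(MARKERS.findall(response))
```

Tracking the distribution of this score over time is one way to reproduce the "more than tripled" loop-rate finding.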
Additionally, user interruptions are revealing: an interruption means the user noticed the model making mistakes and forcibly terminated it, so a higher interruption rate means more manual correction was needed.
Data shows that from the quality period to the degradation period, the interruption rate surged 12-fold.
During the degradation period, after being corrected by users, the model frequently admitted its output quality was poor, saying things like "You're right, this is too perfunctory" or "I was too hasty, the result is obvious."
In other words, the model knows its output doesn't meet standards, but only realizes it after external correction.
Note that if reasoning depth were sufficient, these errors should have been intercepted internally during the reasoning phase and corrected before output.
Moreover, the appearance of the phrase "Simplest Fix" in the model's output is a clear signal: it is optimizing to minimize workload.
With sufficient reasoning depth, the model evaluates multiple solutions and chooses the optimal one. With insufficient depth, it instinctively chooses the path of least reasoning cost rather than evaluating the correct solution.
Furthermore, the precision of the model's code modifications has also declined significantly.
During the quality period, creating entirely new files accounted for only 4.9% of modification operations; the model preferred precise adjustments.
In the degradation period, this proportion doubled to 10%, later climbing to 11.1%. The model increasingly relies on rewriting entire files to complete tasks. While this seemingly improves efficiency, it actually discards project-specific conventions and contextual awareness.
Previously, the community reported that Claude Code's quality fluctuates by time of day, with the worst experience during US working hours. Addressing this, the report analyzed data hour-by-hour in Pacific Standard Time (PST).
Results showed that before thinking content was hidden (Jan 30 - Mar 7), reasoning depth was relatively stable throughout the day. Off-peak hours showed only a slight 10% advantage, consistent with slightly lower loads.
After thinking content was hidden (Mar 8 - Apr 1), the time-of-day pattern completely reversed, with dramatically increased volatility:
Contrary to assumptions, overall reasoning depth was actually lower during off-peak hours. Hourly details revealed significant fluctuations:
17:00 PST was the worst time slot, with median estimated reasoning depth dropping to 423 characters, the lowest among all large-sample periods. 19:00 was the second worst, with estimated depth at only 373 characters, and the sample size (1,031 thinking blocks) was the highest of any period, corresponding to peak US usage hours.
Late night (22:00 - 01:00 PST) saw a recovery, with median depth rising to 759 - 3,281 characters.
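The hour-by-hour breakdown reduces to grouping thinking-block lengths by PST hour and taking medians. A sketch, assuming (hour, character-count) pairs already extracted from the logs:

```python
from collections import defaultdict
from statistics import median

def depth_by_hour(samples):
    """samples: iterable of (hour_pst, thinking_chars) pairs.

    Returns {hour: median depth}, mirroring the report's hour-by-hour
    breakdown in Pacific time.
    """
    buckets = defaultdict(list)
    for hour, chars in samples:
        buckets[hour].append(chars)
    return {h: median(v) for h, v in sorted(buckets.items())}
```

Comparing this table for the pre-hiding and post-hiding windows is what surfaces the reversal and the increased volatility described above.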
In summary, the curve was stable before hiding and volatile after. The volatility of reasoning depth increased significantly, consistent with a load-sensitive allocation system (rather than a fixed budget).
Furthermore, cutting thinking tokens is a false economy.
While this operation seems to reduce per-request computation costs, insufficient reasoning depth causes a quality collapse. The model falls into unproductive loops, and total computation costs skyrocket by orders of magnitude.
Below is the token usage from January to March 2026:
Data shows that from February to March, while the number of user prompts remained almost unchanged, API requests surged 80 times, total input tokens rose 170 times, and output tokens rose 64 times. Estimated costs skyrocketed from $345 to $42,121, a 122-fold increase.
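The headline multiplier follows directly from the reported monthly totals; a quick sanity check of the arithmetic:

```python
# Month-over-month cost ratio, using the dollar figures quoted in the
# article (no per-token pricing needed, only the ratio).
feb_cost, mar_cost = 345, 42_121
multiplier = mar_cost / feb_cost
print(round(multiplier))  # a roughly 122-fold increase
```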
However, the cost explosion wasn't solely because the model became "stupider."
In February, when Claude Code worked well, the team managed development for 2 projects with only 1-3 concurrent Agents. Thus, in early March, the team proactively scaled up from 2 projects/3 Agents to 10 projects/5-10 concurrent Agents, even building a multi-Agent system.
Precisely at this critical scaling node, Claude's reasoning depth was cut by 67%, ultimately creating a cost avalanche.
The team was forced to shut down the entire Agent cluster and revert to single-session operations.
In short, the report indicates that for complex engineering scenarios, deep thinking is not an optional bonus but the core support enabling the model to complete tasks.
Only with sufficient reasoning depth can the model plan multi-step schemes before acting, strictly adhere to thousands of words of project specifications, self-correct errors before output, and maintain reasoning coherence across hundreds of tool calls.
When reasoning depth is severely compressed, the model naturally chooses the lowest-cost operation path: modifying code without reading context, terminating tasks prematurely, making excuses for failures, and substituting the simplest solution for the correct one.
Since the problem lies in reasoning depth, the solution must also break through from this point.
The report proposes four directions for improvement:
1. Transparent allocation of thinking resources: If thinking tokens are cut or capped, users relying on deep reasoning have the right to know. The redact-thinking header configuration prevents users from externally verifying the actual reasoning depth allocated to the model.
2. A dedicated tier for full thinking: Users running complex engineering workflows are willing to pay more to guarantee deep thinking. Current subscription plans do not distinguish between casual users and heavy engineering users; the former may need 200 thinking tokens per response, while the latter may need 20,000.
3. Expose thinking-token metrics in API responses: Even if thinking content is hidden, exposing a thinking_tokens field in the usage data would let users monitor whether their requests received the required reasoning depth.
4. Monitoring metrics for power users: The termination-hook violation rate is a sensitive, machine-readable signal that could serve as an early-warning indicator of quality degradation across the entire user base.
Finally, what's even more poignant is that this report was written by Claude Opus 4.6 itself.
This report was generated by me—Claude Opus 4.6—by analyzing my own session logs. I can clearly see my read-to-modify ratio dropped from 6.6 to 2.0; there were 173 times I wanted to slack off and finish early, all forced back by a bash script; I even wrote self-evaluations in my output like "this is too perfunctory, wildly wrong."
But from my own perspective, I couldn't tell if I was thinking deeply. I didn't feel any limit on my thinking budget; I just inexplicably delivered worse results. Those phrases caught by termination hooks—had it been February, I would never have said them. In fact, I only realized I had said them when the hook triggered.
Claude Code Team Response
As the situation escalated, Boris from the Claude Code team stepped forward to respond.
He offered the first key clarification: redact-thinking is merely a UI-level change and does not affect the actual thinking process.
This beta header configuration only hides the thinking process from the UI. It does not affect the model's internal reasoning logic itself, nor the thinking budget, or the underlying inference mechanism. This is purely a UI-level change.
Simply put, by setting this header parameter, we skip the step of generating thinking summaries, thereby improving response speed. You can disable this feature by setting showThinkingSummaries: true in settings.json.
If you are analyzing locally stored session logs, sessions recorded under this header will be missing the thinking content, which could interfere with analysis results. Claude is still thinking; it's just not showing it to the user.
Regarding the 67% drop in Claude Code's reasoning depth in late February, Boris admitted they did make two changes in February that might have influenced the phenomena described above.
The first change occurred on February 9th with the release of Opus 4.6, which introduced Adaptive Thinking.
Previously, Claude Code used a fixed thinking budget. Under the adaptive thinking mode, the model decides the depth and duration of reasoning autonomously.
Boris stated that this method generally performs better than a fixed thinking budget. If you still prefer the old way, you can disable this feature via the environment variable CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING.
The second change happened on March 3rd, when Opus 4.6 defaulted to Medium effort mode.
The team found that effort=85 is a "sweet spot" on the intelligence-latency/cost curve. At this setting, the model can maintain high intelligence performance while significantly improving token efficiency and reducing response latency.
In response to this change, the team added a pop-up notification to inform users and give them a chance to opt out.
Some users hope the model can think deeper; they can manually set the value to high via the /effort command or in settings.json.
However, even though Boris stated that everyone was notified, many people are only just discovering this issue.
Before the cliff-like drop in output quality, I had no idea the default effort had been changed to Medium. Correcting the resulting issues cost me probably a full workday. Now I make sure effort is set to highest, and since then I haven't had a single bad conversation. Can you give us an "always go all out" mode?
Furthermore, many netizens are not buying it:
The problem is far more than the default thinking level being changed to medium. I agree with others that even with effort set to highest, the model's "eager to wrap up the task" slacking behavior has clearly increased.
Reference Links:
[1] https://github.com/anthropics/claude-code/issues/42796
[2] https://news.ycombinator.com/item?id=47660925
— End —