Agentic Software Engineering | Stop Measuring AI's Work with Human Rulers: Native Agent Work Estimation

"Agentic Software Engineering" is a series of articles documenting my transition from traditional software engineering to agentic software engineering.

A Veteran Programmer's Epiphany

I've been writing code for nearly twenty years. Estimating timelines has become muscle memory: break down the tasks, prioritize, multiply by experience coefficients, add integration and buffer time, and finally deliver a "two to three weeks" estimate. This methodology carried me through every paradigm shift, from Waterfall to Agile to DevOps, and it never failed me.

Until AI programming arrived.

Recently, I used an AI Agent to build a CLI tool. My immediate instinct was to map out the implementation process: parsing arguments, core conversion, validation logic, error handling, and testing. Following traditional methods, this would take at least two to three days. The result? The Agent went from start to delivery in under half an hour.

Did efficiency improve? Superficially, yes, of course.

But a deeper question emerged: I couldn't even correctly predict how long the AI would take.

My entire estimation framework is anchored to the "human developer" baseline. How many lines of code? How many debugging sessions? How many rounds of code review? When the executor shifts from human to agent, this ruler becomes completely useless.

Ironically, AI Agents make the same mistake. If you ask Claude or GPT, "How long will this feature take?" it will solemnly tell you "about 2-3 days." Why? Because its training data—Stack Overflow and technical blogs—says so. The Agent is using human experience to estimate human time, only to finish the job in ten minutes itself.

I began to realize that as software engineering moves toward Agentic Software Engineering, the first thing that must change is this mindset of "measuring AI's work with a human ruler."

So, I wrote a Skill called agent-estimation[1] specifically to solve this problem.

The Root Cause: Human-Time Anchoring

I call this phenomenon Human-Time Anchoring.

Here is how it works:

When generating estimates, AI Agents unconsciously invoke "collective experience" from their training data. A REST API project? Forums say one week. A full-stack dashboard with real-time charts? A tech lead estimated three sprints during planning. These figures are empirical values produced by human developers operating under human cognitive bandwidth, context-switching costs, and communication overhead. They hold true for humans but are completely inapplicable to an Agent capable of completing a "think → code → execute → verify → fix" cycle every 3 minutes.

This leads to a systematic bias: AI almost always overestimates its own timeline.

Meanwhile, we veteran programmers make the mirror-image mistake: we use our own past experience to anticipate AI output speeds, oscillating between "Wow, that was fast?" and "Wait, can it really do that?"

The issue isn't whether AI coding is fast or slow. The problem is that we lack an AI-native measurement system to gauge exactly how fast it is and where it might not be so fast.

A Solution: Let Agents Think in Their Own Units

The core philosophy of the Skill I designed (the Agent Work Estimation Skill) can be summarized in one sentence:

Estimate in "Rounds" first, then convert to human time only at the end.

Here, a "Round" is the atomic unit of an Agent's operation—a complete tool invocation cycle: Think → Write Code → Execute → View Output → Decide to Fix.

One Round takes approximately 2-4 minutes. This isn't a guess; it's an empirical value derived from observing the actual operational rhythm of AI Coding Agents in real-world projects.

Based on this, I defined four layers of estimation units:

| Unit | Definition | Scale |
| --- | --- | --- |
| Round | One tool invocation cycle | ~2-4 minutes |
| Module | A functional unit composed of multiple Rounds | 2-15 Rounds |
| Wave | A batch of modules with no inter-dependencies (parallelizable) | 1-N Modules |
| Project | Sequential execution of all Waves + integration + debugging | Sum of Waves |

Use "Rounds" for single-Agent sequential coding and "Waves" for multi-Agent concurrent coding. This is the AI-Native estimation method.
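To make the hierarchy concrete, here is a minimal Python sketch of the two middle layers. The type names and fields are my own illustration, not part of the skill's standard:

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """A functional unit that can be built and tested independently."""
    name: str
    base_rounds: int      # calibrated against the anchors in Step 2
    risk: float = 1.0     # risk coefficient from Step 3

    @property
    def effective_rounds(self) -> float:
        return self.base_rounds * self.risk

@dataclass
class Wave:
    """Modules with no inter-dependencies, run by concurrent Agents."""
    modules: list[Module] = field(default_factory=list)

    @property
    def duration_rounds(self) -> float:
        # A parallel wave lasts as long as its slowest module.
        return max(m.effective_rounds for m in self.modules)
```

A Project is then just a sequence of Waves; its total is the sum of their durations plus integration and coordination, as Step 4 spells out.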

The Five-Step Estimation Method

Step 1: Decompose into Modules

Break the task into functional modules that can be built and tested independently. The core question is: "If I could only do one thing at a time, in what order would I do it?"

Step 2: Estimate Rounds per Module

Here is a set of calibration anchors:

| Pattern | Typical Rounds | Example |
| --- | --- | --- |
| Templated / known patterns | 1-2 | CRUD endpoints, config files |
| Medium complexity | 3-5 | Custom UI, state management |
| Exploratory / poor documentation | 5-10 | Unfamiliar frameworks, platform-specific APIs |
| High uncertainty | 8-15 | Undocumented behavior, new algorithms |

The key calibration rule is intuitive: If the Agent generates code that runs on the first try, it's 1 Round. If it generates, errors, and fixes, it's 2-3 Rounds. If the library documentation is terrible and requires guessing, start at 5 Rounds.

Step 3: Add Risk Coefficients

| Risk Level | Coefficient | When to Use |
| --- | --- | --- |
| Low | 1.0 | Mature ecosystem, clear documentation |
| Medium | 1.3 | Slight unknowns, may require 1-2 extra debugging rounds |
| High | 1.5 | Scarce documentation, platform quirks |
| Extreme | 2.0 | Might hit a wall, requires changing approach |

Step 4: Calculate Totals

Sequential Mode:
Effective Module Rounds = Base Rounds × Risk Coefficient
Project Rounds = Σ(Effective Module Rounds) + Integration Rounds (10-20% of base total)
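The sequential formula fits in a few lines of Python. The function name and the 15% default are my own; the skill itself prescribes only the 10-20% integration range:

```python
def sequential_rounds(modules, integration_share=0.15):
    """Sequential-mode total. `modules` is a list of (base_rounds, risk) pairs.

    Integration rounds are 10-20% of the base total; 0.15 is an assumed midpoint.
    """
    base_total = sum(base for base, _ in modules)
    effective_total = sum(base * risk for base, risk in modules)
    return effective_total + base_total * integration_share
```

For example, `sequential_rounds([(3, 1.3), (2, 1.0)])` works out to about 6.65 Rounds, which you would round up in practice.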

Wave Mode (Multi-Agent Concurrency):
Wave Duration = Max Effective Rounds of modules within the wave
Project Rounds = Σ(Wave Durations) + Coordination Rounds + Integration Rounds
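Wave mode differs only in that modules inside a wave overlap, so each wave costs its maximum rather than its sum. A minimal sketch (function name mine):

```python
def wave_rounds(waves, coordination_rounds, integration_rounds):
    """Wave-mode total. `waves` is a list of lists of effective module rounds.

    Modules within a wave run in parallel, so a wave lasts as long as its
    slowest module; the waves themselves execute sequentially.
    """
    wave_durations = [max(wave) for wave in waves]
    return sum(wave_durations) + coordination_rounds + integration_rounds
```

This is why parallelism helps most when a wave's modules are similar in size: one oversized module dominates the whole wave.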

Step 5: Convert to Human Time Only at the End

Human Time = Project Rounds × Minutes per Round

The default is 3 minutes per round. However, this parameter is adjustable—use 2 minutes for rapid iteration or 5 minutes for scenarios requiring manual user testing.

The key point is: Human time is the output, not the input. There is no phrase like "a developer would probably need..." in the entire reasoning chain.
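The conversion itself is a single multiplication, with minutes-per-round as the only dial (3 by default, 2 for rapid iteration, 5 when manual testing is in the loop):

```python
def human_time_minutes(project_rounds, minutes_per_round=3):
    """Convert an AI-native estimate to human time. Output, never input."""
    return project_rounds * minutes_per_round
```

So a 36-Round project at the default pace converts to 108 minutes.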

Practical Calibration: Comparing Three Scales

Small Project Example: CLI JSON-to-YAML Converter (~8 Rounds)

| Module | Base Rounds | Risk | Effective Rounds |
| --- | --- | --- | --- |
| Argument parsing + I/O | 1 | 1.0 | 1 |
| JSON→YAML core | 1 | 1.0 | 1 |
| Schema validation | 3 | 1.3 | 4 |
| Error handling + UX | 2 | 1.0 | 2 |

Total: 8 Rounds, approximately 24 minutes. If estimated by human experience, you'd say "one day," or even "two days."
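The table above can be replayed as a quick sanity check. I assume effective rounds are rounded up per module, which is how 3 × 1.3 becomes 4:

```python
import math

modules = [
    ("argument parsing + I/O", 1, 1.0),
    ("JSON->YAML core",        1, 1.0),
    ("schema validation",      3, 1.3),
    ("error handling + UX",    2, 1.0),
]
# Round each module's effective rounds up to a whole Round.
effective = [math.ceil(base * risk) for _, base, risk in modules]
total_rounds = sum(effective)      # 1 + 1 + 4 + 2 = 8
minutes = total_rounds * 3         # 24 minutes at the default pace
```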

Medium Project Example: Desktop Keyboard Broadcaster (~36 Rounds)

One Mac keyboard controlling 27 devices, involving HTTP/WebSocket services, Makepad UI, macOS CGEvent keyboard capture, QR code generation, etc.

Sequential execution: ~108 minutes (1.5-2 hours).

If using Wave mode with 3 parallel Agents:

  • Wave 1: HTTP service, QR code, mobile client (no dependencies, parallel) → 3 Rounds
  • Wave 2: Main UI, client management, category filtering → 10 Rounds
  • Wave 3: Keyboard capture → 8 Rounds
  • Integration → 5 Rounds
  • Coordination overhead → 3 Rounds

At the default 3 minutes per Round, parallel time is approximately 87 minutes (29 Rounds), roughly 20% faster than sequential. By human experience? You'd likely estimate "two to three weeks."
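Plugging the wave list above into the Step 4 wave formula (each wave's duration is the maximum of its modules; Waves 2 and 3 are given as single figures, so they pass through unchanged):

```python
wave_durations = [3, 10, 8]        # Waves 1-3: slowest module per wave
integration_rounds = 5
coordination_rounds = 3
parallel_rounds = sum(wave_durations) + integration_rounds + coordination_rounds
sequential_total_rounds = 36
rounds_saved = sequential_total_rounds - parallel_rounds
```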

Large Project Example: Full-Stack Real-Time Dashboard (~63 Rounds)

React frontend + Rust backend + WebSocket streaming + Chart components + Authentication.

Sequential execution: ~189 minutes (3-3.5 hours). Three-Agent parallel: ~108 minutes.

Human estimate? "Three months" wouldn't be an exaggeration.

Six Anti-Patterns to Avoid

These are the fundamental reasons this Skill exists. Each is a pitfall I've encountered in practice:

1. Human-Time Anchoring. "A developer would probably need two weeks." No. Start deriving from Rounds.

2. Adding Buffers Based on Gut Feeling. "Add some slack just in case." No. Use risk coefficients; every minute of inflation must have a reason.

3. Confusing Complexity with Code Volume. 500 lines of template code is not hard. One line of macOS CGEvent API is not necessarily simple. Estimate based on uncertainty, not line count.

4. Forgetting Integration Costs. Modules work individually but explode when combined. Always add integration rounds.

5. Ignoring User-Side Bottlenecks. If the user needs to manually authorize, restart the app, or test on a real device—adjust the minutes-per-round, don't arbitrarily add rounds.

6. Assuming Parallelism is Free. Multi-Agent collaboration has coordination costs; contract definitions and conflict resolution require extra rounds.

More Than an Estimation Method: A Paradigm Shift

In the process of writing this Skill, I increasingly feel this is not just a technical issue.

We are in a transition period. The fundamental assumptions of software engineering—"humans write code, humans debug, humans communicate, humans make decisions"—are being redefined by Agents. Yet, our mental inertia remains stuck in the old paradigm. We use Story Points to estimate Agent workload, Sprints to plan Agent iterations, and "person-days" to measure Agent output. This is like using the speed of a horse-drawn carriage to plan a train schedule.

The efficiency gain of AI Coding isn't simply "doing the same things faster." It changes the granularity of work. The atomic operation for a human developer is "a day of coding"; for an Agent, it's "a round of tool invocation." When the atomic scale changes, all measurement systems built on the old atoms must be rebuilt.

At the same time, we must not swing to the other extreme and blindly assume AI can instantly solve everything. The strange behavior of macOS permission APIs, undocumented framework boundaries, and implicit dependencies in cross-system integration: these uncertainties do not disappear just because the executor is AI. Risk coefficients exist to respect this reality.

Final Thoughts

This Skill is open source, follows the Agent Skills open standard, and supports mainstream Agents like Claude Code, Cursor, and Codex CLI. You can install it directly:

npx skills add ZhangHanDong/agent-estimation

But more important than installing a Skill is: start practicing looking at problems from an Agent's perspective.

Next time you want to say "this feature will take about three days," stop. Ask yourself: How many Rounds does the Agent need for this? How high is the uncertainty per Round? Which modules can be parallelized?

When you start thinking this way, you have already stepped through the door of Agentic Software Engineering.

This isn't a story of AI replacing humans. It's a story of humans learning to use a new ruler to measure a new world.

References

[1] agent-estimation: https://github.com/ZhangHanDong/agent-estimation

