Google Makes It Official: AI Code Success Rate Soars from 28% to 96%! The Secret Weapon Is a Folder, Context Slashed by 90%

[Overview] Within a single week, Google has embedded the 'Agent Skills' specification into the Gemini API, ADK, and Android Studio product lines. Official data shows that after being equipped with skill packs, the code generation success rate of Gemini 3.1 Pro jumped from 28.2% to 96.6%, while baseline context usage dropped by 90%. This move to 'give AI plugins' is redefining how developers collaborate with Agents.

Your AI Assistant Might Be Using a Two-Year-Old SDK to Write Code for You

Imagine a scenario: you ask your AI assistant to write some code for calling the Gemini API. It confidently provides a version—syntactically perfect, logically clear, but using an interface that was deprecated six months ago.

This is not hypothetical; this happens every day.

A Google Developers Blog post from March 25 states the problem bluntly:

"Large language models (LLMs) have fixed knowledge, being trained at a specific point in time."


LLM knowledge has an expiration date, but software engineering practices change weekly. SDKs are updated, APIs iterate, best practices evolve—yet your AI assistant still lives in the time capsule of its training data.

This is known as the Knowledge Gap.

And Google's solution is not to retrain the model or implement RAG, but to give the AI a 'skill pack'.

This skill pack is called Agent Skills—the core idea is surprisingly simple: package domain knowledge into a folder for the AI to load on demand.

Each folder contains a SKILL.md file: a YAML frontmatter header for metadata and a Markdown body with the actual instructions. That's it.
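As a sketch, a minimal SKILL.md could look like the following. The `name` and `description` frontmatter fields follow the published Agent Skills format; the body content here is purely illustrative:

```markdown
---
name: gemini-api
description: Up-to-date guidance for calling the Gemini API. Use when
  generating code that calls Gemini.
---

# Gemini API usage

- Prefer the current SDK over any deprecated package your training data
  may remember.
- Working request snippets live in resources/examples.py (loaded only
  on demand).
```

The frontmatter is all an Agent sees at first (the L1 "menu"); the body below it is pulled in only when the skill is actually used.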

But what Google really emphasizes is the loading mechanism of this skill pack.

On April 1, the Google for Developers account summed up the value of Agent Skills in a single tweet:

"By using progressive disclosure, you can load domain expertise only when needed. This can reduce baseline context usage by 90%."


▲ Google for Developers official tweet: promoting the three-tier architecture of Agent Skills and 90% context reduction (170+ likes)

The architecture has three tiers, loaded progressively:

L1 Metadata (about 100 tokens per skill): tells the Agent only "what skills I have". Like a restaurant menu.

L2 Skill Content (< 5,000 tokens): loaded by the Agent when needed. Like placing an order.

L3 External Resources (fetched on demand): scripts, documentation, code examples. Like the dishes being served.

The traditional approach? Stuff all the knowledge into the system prompt, which is like reciting the entire menu before every meal.

Google's ADK guide does the math: with 10 skills, the traditional method loads about 10,000 tokens of context every time, while the Agent Skills L1 menu needs only about 1,000.
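That arithmetic can be sketched as a toy model. The token figures (100 per metadata entry, 1,000 per skill body) come from the article; the registry structure itself is my own illustration, not any real API:

```python
# Toy model of progressive disclosure: compare the context cost of
# preloading every skill's full content vs. loading only L1 metadata.
# Token counts follow the article's example; the dict layout is assumed.

skills = {
    f"skill-{i}": {"metadata_tokens": 100, "content_tokens": 1000}
    for i in range(10)
}

# Traditional approach: every skill's full content sits in the prompt.
baseline = sum(s["content_tokens"] for s in skills.values())    # 10,000 tokens

# Agent Skills L1: only the lightweight metadata "menu" is always loaded.
menu_only = sum(s["metadata_tokens"] for s in skills.values())  # 1,000 tokens

# L2 cost is paid only when a skill is needed, e.g. one skill per task:
one_task = menu_only + skills["skill-3"]["content_tokens"]      # 2,000 tokens

savings = 1 - menu_only / baseline                              # 0.9
```

Even the worst case here (menu plus one full skill) stays far below the preload-everything baseline, which is where the 90% figure comes from.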

Tokens are money. Saving 90% means saving 90% of cost.

28.2% → 96.6%: Data speaks

Architecture talk alone doesn't land. Google's benchmark numbers are the real headline.

Google ran 117 programming problems (covering Python and TypeScript) to test one thing: how much does Gemini's code generation accuracy improve once it is equipped with Agent Skills?

Result:

Model                     With Skill    Without Skill
Gemini 3.1 Pro Preview    96.6%         28.2%
Gemini 3 Flash Preview    87.2%         6.8%
Gemini 3.1 Flash Lite     84.6%         5.1%
Gemini 2.5 Flash          52.1%         0.0%


▲ @ai_for_success's summary tweet sparked discussion: nearly 900 likes, 120k views

Gemini 3.1 Pro from 28% to 96%, an increase of nearly 3.5 times.

Gemini 3 Flash from 6.8% to 87%, an increase of nearly 13 times.

Even more striking: the Agentic tasks category hit 100%, document processing also hit 100%, and SDK usage reached 94.6%.

These numbers spread fast on Twitter.

Skepticism and anticipation coexist

▲ @TeksEdge questions: internal benchmark looks good, but how on SkillsBench?

Some are excited; others stay calm. @TeksEdge asked the right question: with internal test scores this impressive, how will it perform on independent benchmarks?

This skepticism is healthy; self-run benchmarks are naturally less persuasive. But even heavily discounted, jumps from 0% to 52% (Gemini 2.5 Flash) and from 6.8% to 87% (Gemini 3 Flash) are hard to explain away as gaming the test.

Written into official documentation, IDE, SDK—Google is serious

If it were just a blog post or demo repo, that would be 'testing the waters'.

But Google's actions go far beyond that.

Step 1: Written into Android Studio.

The official Android Developer documentation now includes a page, 'Extend Agent Mode with skills', binding Agent Skills to the IDE's Agent Mode:

"Skills let you enhance Agent Mode's capabilities with specialized expertise and custom workflows. They are based on the Agent Skills open standard."


▲ @github_skydoves shares Android Studio Agent Mode skills documentation, 62 retweets

Note the wording: open standard. This is not Google's private format; it's an open specification written into official documentation.

Step 2: Baked into ADK (Agent Development Kit).

ADK ships three native APIs: list_skills (the L1 menu), load_skill (L2 content), and load_skill_resource (L3 resources), covering four skill patterns from simple to complex.
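The three calls map directly onto the L1/L2/L3 tiers. The article gives only the names, not the signatures, so the sketch below is a mock that re-creates the flow with an in-memory store; nothing about it is the real ADK API:

```python
# Mock of the three-tier lookup flow behind list_skills / load_skill /
# load_skill_resource. Not the real ADK: only the names are borrowed;
# the store, signatures, and return types are illustrative assumptions.

SKILLS = {
    "gemini-api": {
        "description": "Guidance for calling the Gemini API",    # L1 metadata
        "content": "Use the current SDK; see examples.py.",      # L2 body
        "resources": {"examples.py": "# working snippets here"}, # L3 files
    },
}

def list_skills():
    """L1: return only the lightweight menu of name -> description."""
    return {name: s["description"] for name, s in SKILLS.items()}

def load_skill(name):
    """L2: pull the full skill body into context on demand."""
    return SKILLS[name]["content"]

def load_skill_resource(name, path):
    """L3: fetch one bundled file only when the agent asks for it."""
    return SKILLS[name]["resources"][path]

# Typical flow: scan the menu, load one skill, then one of its resources.
menu = list_skills()
body = load_skill("gemini-api")
snippet = load_skill_resource("gemini-api", "examples.py")
```

The point of the split is that each call pays for context only at the moment it is needed; the menu alone is what every turn carries.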

Step 3: An open-source example repository. A single npx skills add adds a skill to an Agent, as naturally as installing an npm package.

Real reactions from developers

Discussions on Twitter are very interesting.

The loudest reaction is sheer shock at the numbers:


▲ @_techibee: "AI isn't dumb... It's just outdated. Google's Agent Skills fixes: Learns new tools instantly, Uses latest docs & SDKs, No retraining needed. Basically... AI that upgrades itself on the fly."


Some also view Agent Skills in combination with MCP (Model Context Protocol):


▲ @Anandzork combines Agent Skills with Docs MCP, provides installation guide and performance comparison chart

@Anandzork posted a very intuitive comparison: bare model 7.7%, with MCP 72.4%, with Skills 82.9%, and both together 96.3%. He also noted: "Token consumption dropped by 63%."

When AI Agents can refresh their knowledge in real time, what happens to developers who still look up documentation by hand? More and more people are taking that question seriously.

But hold the hype: Google is pouring some cold water of its own

Google's blog actually hides several significant 'buts':

Skills may not always be better than AGENTS.md—Google itself cites Vercel's research, admitting that in some scenarios, directly writing AGENTS.md might be more effective.

Update mechanism not yet implemented—after installing a skill, if the SDK upgrades, the skill won't automatically update; workspaces may pile up outdated skills, misleading the Agent.

Script execution not supported—ADK documentation clearly marks Experimental; L3 layer resources can currently only be viewed, not executed.

More tellingly, in a GitHub Discussion a developer asked whether ADK would support the Agent Skills standard, and the reply was: "Currently no clear plan, but the team is still evaluating."

Heavy promotion in official documentation on one side; core SDK support still 'under evaluation' on the other. Big companies have always pushed new standards this way: build the hype first, fill in the gaps later.

The real signal: from 'prompt engineering' to 'skill distribution'

Leaving aside data and product line details, what is the most noteworthy signal here?

Google is turning 'feeding knowledge to AI' from a craft into infrastructure.

Past: you are a seasoned prompt engineer who spends three days crafting the perfect system prompt so that GPT writes decent API-calling code. That craft is your competitive edge.

Now: the SDK maintainer ships a SKILL.md file, and every Agent automatically picks up the latest API knowledge. Your three days of craft are replaced by one line of npx skills add.


▲ @iRomin: "Native Agent Skills optimize DX by building models on authoritative context"

This is also why the core maintainer of the Agent Skills specification is not Google. It's Anthropic.

Yes, this specification originated from Anthropic's Claude Code, released as an open standard at the end of 2025. Google is one of the largest 'adopters', but Microsoft, OpenAI, GitHub Copilot, Cursor, and over 26 platforms are following suit.

When all major companies converge on the same file format, this is no longer a feature release.

This is a redefinition of ecosystem niches: whoever writes the SKILL.md controls the knowledge gateway to AI Agents.

SDK maintainers, framework authors, documentation teams—these roles that used to 'write for humans' now need to start 'writing for AI'.

And write according to standards.

— END —

AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.