Claude Quietly Updated the Skills Generator: This is Absolutely an Epic Upgrade

During last week's live stream, I discovered that Anthropic's skills repository had actually been updated.

Clicking into it, I found that a super essential Skill had received an update.

Image

This is it; it can even be said to be the cornerstone of the entire Skills ecosystem.

Skill-creator.

You could say that half of what makes Xiaolongxia (Claude) so capable today is Skills, and the ability to create those Skills is owed almost entirely to this mother of all Skills: Skill-creator.

I believe anyone who has read our past articles about Skills, or has played with Skills at all, will already be familiar with Skill-creator.

Simply put, this is the official Skills generator released by Anthropic.

You can describe your requirements verbally, and then directly use Skill-creator to turn them into a Skill for you.

If you don't understand, you can check out our previous article: One article to help you understand what the wildly popular Skills are all about. I consider it quite detailed.

This week, I finally had time to thoroughly review the documentation for this updated Skill-creator, and I found that this update can truly be called epic; it is so much stronger.

Image

So I feel it is worth writing an article to discuss the new features and functionalities of this Skill-creator update.

Really, all Skills are worth re-optimizing.

Very simply put, this time they added four brand new capabilities all at once:

  1. Evaluation System: Tells you directly after running whether this Skill works or not.
  2. Benchmark Testing: Quantifies pass rates, time consumption, and token usage.
  3. Multi-Agent Parallel Testing: Each test runs independently in a clean environment, supporting A/B blind testing, ensuring results do not contaminate each other.
  4. Description Optimization: Can automatically help you modify the Skill description so that it triggers when it should and doesn't trigger randomly when it shouldn't.
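For context on that fourth capability: a Skill's trigger behavior hangs mostly on the `description` field in its SKILL.md frontmatter, and that field is what gets optimized. A minimal hypothetical example (the name and wording here are mine, not Anthropic's):

```yaml
---
name: video-transcript
description: >
  Generates a text transcript from a video link. If the source language
  is not Chinese, also produces a Chinese translation alongside the
  original. Use when the user asks for a transcript, not for a download.
---
```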

The previous Skill-creator actually always had a pain point: the Skills you generated were a black box. You had no idea if the Skill was actually useful, what its quality was like, or if its triggering mechanism was reasonable.

In terms of the engineering systems we often talk about these days, it was missing a crucial component: an evaluation mechanism.

Evaluation matters enormously; a good evaluation genuinely sets the direction.

And now, the new version of Skill-creator has completely filled in the entire evaluation system.

I highly recommend that everyone must update to the latest version.

The update method is also extremely simple. You just send this sentence to your Agent, whether it's Claude Code, OpenClaw, OpenCode, etc.:

https://github.com/anthropics/skills/tree/main/skills/skill-creator, this skill has been updated; please help me update to the latest version.

Yes, just that one sentence.

Then your Agent will update itself.

Image

Very quickly, the update is complete.

Image

I will use a case study to demonstrate the capabilities of the new Skill-creator for everyone.

In a previous article, I turned yt-dlp from GitHub into a Skill, capable of downloading videos from YouTube, Bilibili, and various other video sites.

Image

But later we discovered that just being able to download videos wasn't enough.

I also hoped that after getting the video link, it could directly generate a text version of the transcript.

Moreover, if the video is in English or another language, it would be best if it could directly provide me with a transcript document in both the original language and Chinese.

So, taking this opportunity, I used skill-creator to craft a new Skill.

The prompt is very simple.

I want to create a Skill. I hope that when I provide a video link, it can send me a text version of the transcript. If it's in another language, preferably give me the transcript document in both the original language and Chinese.

It will first ask you a few questions to confirm the requirement details, and then start helping you design the entire Skill.

Image

In about 3 to 5 minutes, this Skill was designed.

I took a YouTube interview video of the OpenClaw founder to test it.

Image

I just provided a YouTube link.

Five minutes later, the Chinese version of the transcript came out.

Image

However, there is actually a problem...

The transcript was one huge block of text, the characters small and crowded.

Completely unreadable.

At this point, you can continue the conversation, asking it to optimize for you and help you improve this Skill.

Image

The new version of Skill-creator also has some improvements in its refinement capabilities.

The effect after improvement:

Image

Almost perfect.

The layout is clear, paragraphs are distinct; this is what a document should look like.

But that's not all.

At this point, a headache arises: I'm afraid my Skills will conflict with each other.

Because I now have two Skills related to video links.

One is yt-dlp, responsible for downloading videos to local storage.

The other is the transcript generation I just made, responsible for converting video to text.

The trigger condition for both Skills is providing a video link. I'm afraid they will fight, meaning the one that should trigger doesn't, and the one that shouldn't triggers randomly.

In that case, you can use the Skill-creator's evaluation system to let it help you optimize the Skill description.

It will first read your current Skill description, and then tell you it needs to do four things next:

Image

Automatically generate two sets of queries: 10 that should trigger and 10 that should not trigger.

It's designed very interestingly.

It deliberately includes boundary cases, forcing the model to make judgments in ambiguous areas.
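The eval-set shape described above can be sketched roughly like this. This is a hypothetical reconstruction, not skill-creator's internals: `build_eval_set` and all the query strings are mine, and in the real tool the queries are generated by the model.

```python
# Sketch of the eval set: 10 queries that SHOULD trigger the skill,
# 10 that should NOT, with boundary cases deliberately mixed in.

def build_eval_set(positive, negative):
    """Tag each query with its expected trigger outcome."""
    assert len(positive) == 10 and len(negative) == 10
    return (
        [{"query": q, "should_trigger": True} for q in positive]
        + [{"query": q, "should_trigger": False} for q in negative]
    )

positive = [
    "Get me a transcript of https://youtu.be/abc123",
    "Turn this talk into Chinese text: https://youtube.com/watch?v=x",
    "Summarize what's said in this video link",  # boundary: summary vs transcript
] + ["positive query %d" % i for i in range(7)]  # placeholders for the rest

negative = [
    "Download this video for offline viewing",   # belongs to the yt-dlp skill
    "Translate this paragraph into Chinese",     # no video link at all
] + ["negative query %d" % i for i in range(8)]  # placeholders for the rest

eval_set = build_eval_set(positive, negative)
print(len(eval_set))  # 20 labeled queries
```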

Image

Then, it directly generates a webpage for you to confirm, which is incredibly awesome.

Really, I was stunned when I used it.

Image

All queries are listed in the interface, with a toggle switch on the right of each one, marking whether it should trigger.

You can go through them one by one, and if you think a judgment is wrong, just turn it off.

For example, for the third situation, if I don't want it to trigger anymore, I can just turn it off.

Image

Then there are the 10 cases that should not trigger. I looked through them, and there were no issues.

Image

After confirming everything, you click export evaluation set, and that's it.

After confirming the samples, the optimization loop starts in the background, running up to 5 iterations.

Each iteration does three things to help you test and evaluate; the whole process takes about 10-20 minutes.
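The shape of that background loop is roughly a hill climb: propose a variant, score it, keep the best. Below is a toy sketch under heavy assumptions; `mock_score` and `rewrite` are stand-ins for the model calls skill-creator actually makes, not real APIs.

```python
# Toy sketch of the optimization loop: up to 5 iterations, each one
# rewriting the description, re-scoring it against the eval queries,
# and keeping the best scorer so far.

def mock_score(description: str, eval_set: list) -> float:
    # Toy scorer: pretend mentioning "transcript" helps triggering.
    return min(1.0, 0.5 + 0.05 * description.lower().count("transcript"))

def rewrite(description: str) -> str:
    # Stand-in for asking the model to propose a refined description.
    return description + " Produces a transcript document."

def optimize(description, eval_set, max_iters=5):
    best, best_score = description, mock_score(description, eval_set)
    for _ in range(max_iters):
        candidate = rewrite(best)
        score = mock_score(candidate, eval_set)
        if score > best_score:              # keep only improvements
            best, best_score = candidate, score
    return best, best_score

eval_set = [{"should_trigger": i < 10} for i in range(20)]
best, best_score = optimize("Converts video links to text.", eval_set)
```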

Image

It will also report progress regularly.

After running, you will see a huge table.

Image

Each column is a query sample, and each row is an iterative version of the description.

A green checkmark indicates successful triggering, and a red cross indicates failure to trigger.

Image

The blue column is the test set, and the rest are training sets.

Image

It splits the samples into 60% training set and 40% test set, iteratively optimizing on the training set, and finally selecting based on performance on the test set to prevent overfitting.
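The split-and-select scheme above is the same anti-overfitting trick used everywhere in machine learning, and it can be sketched in a few lines. `evaluate` here is a stand-in for actually running trigger tests against the model:

```python
# Sketch of the 60/40 split: iterate on the training set, but pick the
# final description by its score on held-out test queries.
import random

def split_eval_set(eval_set, train_frac=0.6, seed=0):
    shuffled = eval_set[:]
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]   # (train, test)

def pick_best(candidates, evaluate, test_set):
    # Final selection happens on held-out data, not the training set.
    return max(candidates, key=lambda d: evaluate(d, test_set))

eval_set = [{"query": f"q{i}", "should_trigger": i < 10} for i in range(20)]
train, test = split_eval_set(eval_set)
```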

After running, the optimal description is automatically written back to your SKILL.md, requiring no manual intervention from you.

Anthropic officially tested this on their 6 document-type Skills, and 5 of them saw an improvement in trigger rates.

Just optimizing with the new version of skill-creator is really awesome.

Image

This step can greatly improve the trigger accuracy of your Skills.

But triggering correctly doesn't mean everything is OK.

So, after your Skill is installed and can trigger stably, how it performs on actual tasks still needs to be evaluated.

I will continue to use this newly made Skill to run through it again, showing everyone the whole process.

Directly evaluate the Skill we just made.

Image

It will first read your Skill file completely to figure out what the core process of this Skill is.

Image

Then it will ask you: Which aspect do you want to test more?

I chose comprehensive evaluation.

Based on the Skill's functionality, it automatically designed three types of test scenarios and simultaneously designed quantitative acceptance criteria.

Image

After confirming the plan, it launched 4 independent sub-agents at once to run in parallel.

Image

This time, having 4 parallel Agents for testing is very appealing.

In the past, you could actually do some simple evaluations, but the biggest problem was that they would run sequentially, one after another.

But everyone knows how important context management is; the context accumulated from the previous task will contaminate the results of the next one.

You think it's the Skill's credit, but actually, it's completely the conversation history helping out.

This evaluation feels much more right.

Each agent runs independently in a completely clean environment, with its own token count and time metrics.

Zero crossover between them.

Faster results, cleaner data.

While waiting, it also prepared the quantitative scoring script.

After the test results come back, it directly checks automatically if the format meets the requirements; many small details are included.
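A scoring script of that kind might look like the sketch below. The specific checks are illustrative guesses at what "format meets the requirements" could mean for a transcript document; they are not skill-creator's real rubric.

```python
# Hypothetical format checker: mechanically verify the small details
# a human would otherwise have to eyeball in each test output.

def check_transcript(doc: str) -> dict:
    paragraphs = [p for p in doc.split("\n\n") if p.strip()]
    checks = {
        "has_title": doc.lstrip().startswith("#"),
        "multiple_paragraphs": len(paragraphs) >= 3,
        "no_wall_of_text": all(len(p) < 1200 for p in paragraphs),
        "has_chinese": any("\u4e00" <= ch <= "\u9fff" for ch in doc),
    }
    checks["pass"] = all(checks.values())
    return checks

sample = "# 访谈转录\n\n第一段。\n\n第二段。\n\n第三段。"
report = check_transcript(sample)
```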

Image

After the test runs, an evaluation viewing page pops up in the browser with two tabs.

The Output tab allows you to directly view the output of each test case.

There is also a feedback box below where you can directly annotate what is wrong and what needs improvement.

These feedbacks will be saved and used directly the next time you improve the Skill.

Image

The other is the Benchmark tab, where you can see Skill vs. No Skill.

Through quantitative comparison, it is clear at a glance.

Image

The data aspect is also extremely quantitative.

Image

With the Skill, the pass rate is 100%; the no-Skill baseline managed only 9%, a gap of 91.5 percentage points.

On cost, it's about 4,000 tokens per run with the Skill versus 1,750 without, roughly 2,250 more.

That extra consumption is the price the Skill charges, and measured against the output quality, it's worth it.

But the value of evaluation goes far beyond this.

Anthropic officially gave an example.

They had a PDF Skill that previously made mistakes when processing tables.

Claude needed to place text precisely at specific coordinates, but without clear fields to guide it, the text often ended up in the wrong place.

This problem was discovered during the evaluation process, and after repairing and improving the positioning logic, the problem was solved.

Image

This means that after finding the problem, you don't have to start over from scratch.

The evaluation results are stored locally. The next time you use skill-creator to improve this Skill, it will bring in the problems annotated last time and target those for changes.

Run the evaluation again after changing to see if there is an improvement.

Test, discover, fix, re-test; this cycle is complete.

Anthropic has introduced some rigorous practices from software development, such as testing, benchmarking, and iterative improvement, into the Skills creation process this time.

Really, it's so much better.

This is absolutely an epic enhancement for everyone.

You have to understand why Xiaolongxia (Claude) is so strong and can do so many things. It's really not that the model itself is that brilliant; it's that so many Skills hang off it, each one a packaged capability.

It can be said that Skills are the cornerstone of the entire Agent ecosystem's future great prosperity, and I myself have always been extremely optimistic about and strongly promoting various Skills.

Therefore, I highly suggest that everyone update Skill-creator to the latest version and then optimize and evaluate all your own Skills.

Of course, you must first distinguish what kind of Skills you are writing.

Because essentially, Skills are actually divided into two types.

The first type is Capability Enhancement.

This teaches Claude to do things it isn't originally good at.

For example, the official Frontend Design Skill and Document Creation Skill contain a large number of techniques that you simply cannot achieve with just Prompts.

Most of the Skills we craft ourselves basically fall into this category.

The second type is what the official side calls Coding Preference.

This tells Claude to follow your rules.

Claude can do every step itself, but your Skill strings these steps together according to your team's process.

For example, a Meeting Minutes Organization Skill that automatically converts recordings into documents with action items according to your company's fixed format.

Or a Weekly Report Generation Skill that pulls data from various platforms and formats it according to your requirements.

You can understand this type as a Workflow.

For these two types, the direction of evaluation will be slightly different.

For Capability Enhancement, we test whether this Skill is still necessary after the model update.

Use A/B testing to compare, running once with the Skill and once without.

If the results are about the same, this Skill can retire.
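That retirement decision can be reduced to one comparison. A minimal sketch, where the threshold and the pass rates are made-up numbers, not Anthropic's:

```python
# A/B retirement check for a capability-enhancement Skill: if the
# no-Skill baseline has nearly caught up, the Skill can retire.

def should_retire(pass_with: float, pass_without: float, margin=0.05):
    return (pass_with - pass_without) <= margin

# After a model upgrade, re-run the A/B benchmark:
print(should_retire(0.95, 0.92))  # → True: the baseline caught up
print(should_retire(1.00, 0.09))  # → False: the Skill still earns its keep
```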

Image

Coding Preference tests another thing: Did it follow your process faithfully?

Did it miss steps? Did it arbitrarily change the order? Did it forget a specific requirement you mentioned?

So there will be slight differences, which everyone should note when evaluating on their own.

Looking back: before, after creating a Skill, you'd just feel pleased with yourself.

But honestly, it was all a black box; I had no idea how to evaluate it.

Now it's much more comfortable.

Run the evaluation once, lay out the data, and whether it works well or not is immediately apparent.

All Skills are truly worth re-optimizing and re-evaluating.

The Skills ecosystem.

It feels like another wave of great prosperity is about to arrive.

Above, since you've read this far, if you think it's good, please give a like, a look, and a share. If you want to receive pushes immediately, you can also star me ⭐~ Thank you for reading my article, see you next time.

>/ Author: Kazuke, Keda

>/ For submission or tips, please contact email: wzglyay@virxact.com

