Karpathy Just Open-Sourced AutoResearch: I Used It to Optimize Lobster Skills, Boosting Success Rates from 56% to 92%

After raising lobsters for so long, you've likely realized just how critical skills are.

A poorly written skill isn't actually that scary.

If a skill fails completely after a couple of runs, you can just uninstall it on the spot and be done with it.

But many skills are incredibly frustrating. They don't fail completely; they might work 70% of the time, but that 30% failure rate is a disaster.

This is a very real problem. Recently, industry luminary Andrej Karpathy released an open-source project called AutoResearch, designed to let agents autonomously optimize large model training, and the results are phenomenal.

However, this methodology can actually be extended to other scenarios.

Today, I'm sharing my best practices for using this theory to optimize skills!


The essence of AutoResearch is simply: letting AI optimize AI.

Naturally, this means we can let AI automatically iterate and optimize skills. I tested a skill related to webpage copying, and its pass rate jumped from 56% to 92%.

Instability Is a Skill's Worst Enemy

When a skill is unstable, it generally means that following the skill's workflow leads the large model to produce unexpected results with some probability.

Often, you don't even know how these poor results are generated in the first place.

So what do you do? You might ask the AI to review itself, relying on intuition to make random tweaks.

In the end, the skill might turn into a patchwork monster.

What AutoResearch does is turn this entire process into a reproducible experiment.

The core logic is actually quite simple: it doesn't require the agent to rewrite the entire skill in one go.

Change one small thing at a time, then re-run and score it. If the result improves, keep the change; if it gets worse, revert it.

Therefore, anything that can be measured can be subjected to AutoResearch.

This includes various skills.

Defining What "Good" Means

First, why are skills unstable?

Because they often contain vague descriptions.

For example: "Avoid sounding like AI; make it more natural."

These instructions aren't wrong, but executing them feels like black magic. AutoResearch forces you to break this mysticism down into binary yes/no questions that can be judged objectively.

What is your definition of a good result?

For instance, for a creative writing skill, some definitions could be:

  • Is the full text kept under 1500 characters?

  • Do the first three sentences clearly state the main point?

  • Does the title contain [specific keyword]?

The goal is to create binary definitions so the AI can score them stably.
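To make this concrete, here is a minimal sketch of what those binary definitions might look like as code. All function names and the sentence-splitting heuristic are illustrative, not part of AutoResearch itself:

```python
# Hypothetical binary eval checks for a creative-writing skill.
# Each check returns True/False so scores stay stable across runs.

def under_char_limit(text: str, limit: int = 1500) -> bool:
    """Pass if the full text is kept under the character limit."""
    return len(text) < limit

def main_point_up_front(text: str, keyword: str) -> bool:
    """Pass if the main-point keyword appears in the first three sentences.
    Naive sentence split on punctuation; a real eval might use an LLM judge."""
    first_three = " ".join(text.replace("!", ".").replace("?", ".").split(".")[:3])
    return keyword.lower() in first_three.lower()

def title_has_keyword(title: str, keyword: str) -> bool:
    """Pass if the title contains the required keyword."""
    return keyword.lower() in title.lower()

def score(title: str, text: str, keyword: str) -> float:
    """Fraction of binary evals passed for one output."""
    checks = [
        under_char_limit(text),
        main_point_up_front(text, keyword),
        title_has_keyword(title, keyword),
    ]
    return sum(checks) / len(checks)
```

Because each check is a yes/no question, two different runs (or two different judging agents) will score the same output identically.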

Here is a sample prompt for establishing clear evaluation criteria:

### Eval Guide
How to write evals that truly improve your skill rather than giving you false confidence.
---
### The Golden Rule
Every eval must be a Yes / No question.
Not a scale.
Not a feeling-based judgment.
It must be binary.
Why: Scales accumulate variance. If you have 4 evals, each scored 1-7, the total score will have high variance between runs. Binary evals give you stable, reliable signals.
---
### Good vs. Bad Evals
#### Text / Copywriting Skills
Examples: Newsletters, tweets, emails, landing pages
**Bad Eval:**
- "Is this text written well?" (Too vague; what does "well" mean?)
- "Rate its appeal on a scale of 1-10" (Scale = Unreliable)
- "Does it sound human-written?" (Subjective, inconsistent scoring)
**Good Eval:**
- "Does the output completely avoid phrases from this banned list: [game-changer, here's the kicker, the best part, level up]?" (Binary, specific)
- "Does the first sentence mention a specific time, place, or sensory detail?" (Binary, checkable)
- "Is the output between 150-400 words?" (Binary, measurable)
- "Does the ending use a specific CTA clearly telling the reader what to do next?" (Binary, structural)
#### Visual / Design Skills
Examples: Diagrams, images, slides
**Bad Eval:**
- "Does it look professional?" (Subjective)
- "Rate the visual quality 1-5" (Scale)
- "Is the layout good?" (Vague)
**Good Eval:**
- "Is all text in the image clearly legible, with no truncation, overlap, or covering?" (Binary, specific)
- "Does the color scheme use only soft/pastel tones, with no neon, bright red, or high-saturation colors?" (Binary, checkable)
- "Is the layout linear, flowing left-to-right or top-to-bottom, with no scattered elements?" (Binary, structural)
- "Are there absolutely no numbered steps, ordinal numbers, or sequence markers in the image?" (Binary, specific)
#### Code / Technical Skills
Examples: Code generation, configuration, scripts
**Bad Eval:**
- "Is the code clean?" (Subjective)
- "Does it follow best practices?" (Vague; which best practices?)
**Good Eval:**
- "Does the code run without errors?" (Binary, testable; actually execute it)
- "Is the output completely free of TODOs or placeholder comments?" (Binary, grep-able)
- "Are all function and variable names descriptive (no single-letter naming except for loop counters)?" (Binary, checkable)
- "Does the code include error handling for all external calls (API, file I/O, network)?" (Binary, structural)
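Several of the code-skill evals above are "grep-able" and can be automated directly, no LLM judge needed. A minimal sketch (function names are illustrative; `exec` should be sandboxed in real use):

```python
# Hypothetical binary checks for a code-generation skill.
import re

def no_todos(source: str) -> bool:
    """Pass if the source contains no TODO/FIXME/placeholder markers."""
    return re.search(r"\b(TODO|FIXME|XXX)\b", source) is None

def no_single_letter_names(source: str) -> bool:
    """Pass if no single-letter assignments exist (loop counters i/j/k allowed)."""
    names = re.findall(r"^\s*([a-zA-Z])\s*=", source, flags=re.MULTILINE)
    return all(n in ("i", "j", "k") for n in names)

def runs_without_errors(source: str) -> bool:
    """Pass if the snippet executes cleanly. WARNING: exec runs arbitrary
    code; use a sandbox or subprocess with a timeout in practice."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return True
    except Exception:
        return False
```

Pure-text checks like these are the most reliable evals of all, since they never disagree between runs.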
#### Document Skills
Examples: Proposals, reports, decks
**Bad Eval:**
- "Is the content comprehensive enough?" (Comprehensive relative to what?)
- "Does it address client needs?" (Too open-ended)
**Good Eval:**
- "Does the document contain all required sections: [list sections]?" (Binary, structural)
- "Is every conclusion supported by specific numbers, dates, or sources?" (Binary, checkable)
- "Is the document kept within [X] pages / [X] words?" (Binary, measurable)
- "Can the executive summary be compressed into 1 paragraph of no more than 3 sentences?" (Binary, countable)
---
### Common Mistakes
#### 1. Too Many Evals
After 6 evals, the skill starts "gaming the system." It optimizes for passing the test rather than producing good results. It's like a student memorizing answers without understanding the content.
**Fix:** Select the 3-6 most critical checks. If these pass, the output is likely good.
#### 2. Too Narrow / Rigid
Rules like "Must contain exactly 3 bullet points" or "Must use 'because' at least twice" make the skill technically pass the test but produce weird, rigid outputs.
**Fix:** Evals should check quality attributes you truly care about, not arbitrary structural constraints.
#### 3. Overlapping Evals
If Eval 1 is "Is the grammar correct?" and Eval 4 is "Are there spelling mistakes?", these overlap. Grammar failures often include spelling errors. You are double-counting.
**Fix:** Each eval should test only one independent dimension.
#### 4. Impossible for the Agent to Measure
"Will humans find this engaging?" The agent cannot answer this stably. It will almost always say "Yes."
**Fix:** Translate subjective feelings into observable signals. For example, rephrase "engaging" to: "Does the first sentence contain a specific claim, story, or problem, rather than a generic statement?"
---
### Before Writing an Eval, Ask These 3 Questions
Before finalizing an eval, ask yourself:
1. **If two different agents receive the same output, will they give the same score?**
   If not, the eval is too subjective; rewrite it.
2. **Can a skill pass this eval by exploiting loopholes without actually improving?**
   If yes, the eval is too narrow; broaden it.
3. **Is this eval measuring something the user actually cares about?**
   If not, delete it. Every unimportant eval dilutes the signal of the truly important ones.

So, a complete cycle looks like this:


1. Select a skill you want to optimize.

2. Provide it with some test inputs.

For example, write an opening for a long article, or generate content based on Google's 5 skill principles...

3. Give it a checklist.

Around 3 to 6 items is ideal. Too few provides insufficient constraints; too many, and the agent starts "cramming for the test" just to satisfy the checklist.

4. Score -> Iterate -> Score.

You might run the initial score and find it's terrible. The agent can analyze the failure points, make a small modification, and re-test. If the score goes up, keep it. If it drops, revert it.

Then proceed to the next round, recording each change and its result in a changelog.

Continue until you achieve high scores consistently over multiple runs.
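The cycle above is essentially hill climbing on a pass rate. Here is a minimal sketch, assuming you supply `run_skill` (executes the skill on one input), a list of binary evals, and `propose_change` / `revert_change` hooks; all names are illustrative:

```python
# A sketch of the score -> iterate -> score loop, not AutoResearch's actual API.
from typing import Callable, List

def pass_rate(run_skill: Callable[[str], str],
              inputs: List[str],
              evals: List[Callable[[str], bool]],
              runs: int = 3) -> float:
    """Average fraction of binary evals passed over repeated runs.
    Unstable skills need several runs per input for a stable signal."""
    total, passed = 0, 0
    for _ in range(runs):
        for x in inputs:
            output = run_skill(x)
            for check in evals:
                total += 1
                passed += check(output)
    return passed / total

def optimize(run_skill, inputs, evals, propose_change, revert_change, rounds=10):
    """Change one small thing at a time; keep improvements, revert regressions."""
    best = pass_rate(run_skill, inputs, evals)
    for _ in range(rounds):
        change = propose_change()              # one small edit to the skill
        score = pass_rate(run_skill, inputs, evals)
        if score > best:
            best = score                       # improvement: keep the change
        else:
            revert_change(change)              # regression: roll it back
    return best
```

The single-change-per-round discipline matters: if you edit three things at once and the score moves, you don't know which edit did it.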

That Changelog is Invaluable

It contains the complete evolutionary history of the skill.

It records what was changed, why it was changed, whether the change led to improvement, and which reasonable modifications ultimately didn't work.
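One way to keep such a record is a JSON-lines file with one entry per iteration. The field names here are purely illustrative, not a format AutoResearch prescribes:

```python
# Hypothetical changelog entry: one JSON line per optimization round, so the
# skill's full evolution (change, rationale, score delta, kept or reverted)
# survives model upgrades and platform migrations.
import json

def log_change(path: str, change: str, reason: str,
               score_before: float, score_after: float) -> dict:
    entry = {
        "change": change,
        "reason": reason,
        "score_before": score_before,
        "score_after": score_after,
        "kept": score_after > score_before,  # only improvements are kept
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Even the reverted entries are valuable: they tell your future self which "reasonable" ideas were already tried and didn't work.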

This is extremely important.

Later, when models are upgraded or you want to migrate the skill to another platform, you won't have to start from zero; you'll hold a verified evolutionary path for the skill.

This is actually a scarce resource in the age of agents.

Final Thoughts

Actually, these methods aren't limited to optimizing skills.

I've even used them for code performance optimization.

For one page load, over 67 rounds, we went from 1100ms down to 67ms.

So, as long as you can define scoring rules, you can let agents autonomously iterate and optimize.

This is essentially a form of reinforcement learning, where the scoring rules act as the reward score.

Stop relying on intuition for optimization. AutoResearch has made it clear to everyone:

If something is going to be called repeatedly, it's worth testing repeatedly.

If something can be tested repeatedly, it's worth handing over to an agent for automatic optimization.

#openclaw #autoresearch #karpathy
