Study: Self-generated Agent Skills are useless

dcre

"Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge"

This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.

I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.

show comments

colonCapitalDee

I have a custom skill-creator skill that contains this:

> A common pitfall is for Claude to create skills and fill them up with generated information about how to complete a task. The problem with this is that the generated content is all content that's already inside Claude's probability space. Claude is effectively telling itself information that it already knows!

> Instead, Claude should strive to document in SKILL.md only information that:

> 1. Is outside of Claude's training data (information that Claude had to learn through research, experimentation, or experience) > 2. Is context specific (something that Claude knows now, but won't know later after its context window is cleared) > 3. Aligns future Claude with current Claude (information that will guide future Claude in acting how we want it to act)

> Claude should also avoid recording derived data. Lead a horse to water, don't teach it how to drink. If there's an easily available source that will tell Claude all it needs to know, point Claude at that source. If the information Claude needs can be trivially derived from information Claude already knows or has already been provided, don't provide the derived data.

For those interested the full skill is here: https://github.com/j-r-beckett/SpeedReader/blob/main/.claude...

show comments

secbear

The finding that self-generated skills provide negative benefit (-1.3pp) while curated skills give +16.2pp is the most interesting result here imo. Big discrepancy, but makes sense. Aligns with the thought that LLMs are better consumers of procedural knowledge than producers of it.

+4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.

show comments

embedding-shape

The general rule seems to be, the more layers you automate with LLMs, the worse each successive layer gets. Piping LLM output as input into new LLM calls, you're already starting to notice how things fall apart and get lost quickly.

If you have the idea, more or less the implementation plan, let the LLM do the coding, you can end up with something maintainable and nice, it's basically up to you.

Strip away one layer, so you have the idea, but let the LLM come up with the implementation plan, then also the implementation, and things end up a lot less than ideal.

Remove another layer, let the LLM do it all, and it's all a mess.

show comments

smcleod

There is almost no point in telling an agent to build a skill without augmenting it's knowledge on the thing it's writing about as you're just piping output to input without expanding the information in the system. If you get an agent to perform a bunch of research online, distil that down to information that the models tend not to get right or is newer than what is in their training data or simply better aligns with your desired workflow than what they generate out of the box - that's going to create a far more useful skill. I use a skill that gets activated when creating a skill to help guide this approach: https://github.com/sammcj/agentic-coding/blob/main/Skills/sk...

show comments

CharlieDigital

This has been my observation with self-generated docs as well.

I have seen some devs pull out absolutely bad guidance by introspecting the code with the LLM to define "best practices" and docs because it introduces its own encoded biases in there. The devs are so lazy that they can't be bothered to simply type the bullet points that define "good".

One example is that we had some extracted snippet for C#/.NET that was sprinkling in `ConfigureAwait(false)` which should not be in application code and generally not needed for ASP.NET. But the coding agent saw some code that looked like "library" code and decided to apply it and then someone ran the LLM against that and pulled out "best practices" and placed them into the repo and started to pollute the rest of the context.

I caught this when I found the code in a PR and then found the source and zeroed it out. We've also had to untangle some egregious use of `Task.Run` (again, not best practice in C# and you really want to know what you're doing with it).

At the end of it, we are building a new system that is meant to compose and serve curated, best practice guidance to coding agents to get better consistency and quality. The usage of self-generated skills and knowledge seems like those experiments where people feed in an image and ask the LLM to give back the image without changing it. After n cycles, it is invariably deeply mutated from the original.

Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions. The next abstraction is Markdown and I think that teams that invest their time in writing and curating markdown will create better guardrails within which agents can operate without sacrificing quality, security, performance, maintainability, and other non-functional aspects of software system.

show comments

rahimnathwani

This is unsurprising and irrelevant.

When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge. Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.

But that's what the paper's authors did!

When they say 'self-generated' they don't allow the model any tool access at all, not even web search.

It would be much more interesting if they had tested skills that were created in one of these ways:

A) The model interviews a human and then creates the skill, or

B) The model executes one or more deep research tasks in order to gather information, or

C) Some combo of the above.

show comments

bee_rider

In general terms, we get these kinds of results that seem to indicate that LLMs can’t really “create” new information using inference. LLM generated skills don’t help. Training on content that was generated by LLMs causes models to collapse or something. It seems like it is accepted as really intuitive.

But it seems pretty surprising to me. The training corpus contains so much information and the models operate at the level of… a bright novice. It seems like there obviously ought to be more insights to derive from looking harder at aspects of the corpus.

Why isn’t this considered astonishing?

show comments

jngiam1

The Skills I have for Claude are all based on personal preferences and reflects the setup I have going. It's a way to narrow the probability space to the specific set which works really well for me.

lmeyerov

We had a measurable shift when we started doing ai-coding loops driven by evals. By definition, the additions make the numbers go up-and-to-the-right. It's the epitomy of "you get what you measure" :)

Chaos Congress talk on this from a couple months ago, jump to the coding loops part: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . The talk focuses mostly on MCPs, but we now use the same flow for Skills.

This kind of experience makes me more hesitant to take on plugin and skill repos lacking evals or equivalent proving measurable quality over what the LLM knows and harness can handle. Generally a small number of things end up mattering majorly, but they end up being pivotal to get right, and the rest is a death by a thousand cuts.

pizza

The more general question of how to evaluate the quality of a given skill file is quite interesting to me. A skill may prime a model's responses in a way that a prompt alone may not. But also models aren't good at judging what they are or are not capable of.

Just asking a model "how good is this skill?" may or may not work, possibly the next laziest thing you could do - that's still "for cheap" - is asking the model to make a quiz for itself, and have it take the quiz with and without access to the skill, then see how the skill improved it. But there's still many problems with that approach. But would it be useful enough to work well enough much of the time for just heuristically estimating the quality of a skill?

rriley

The biggest gap in this paper is the condition they didn't test: Skills built through human-AI collaboration. They found fully self-generated Skills are useless (-1.3pp) and human-curated ones help a lot (+16.2pp), but that's a false dichotomy. In practice, especially in tools like OpenClaw, skills will emerge iteratively: the AI drafts procedural knowledge while solving a real problem, the human refines it with domain expertise. Neither produces the same artifact alone. The +16.2pp from curated Skills is likely the floor for this approach, not the ceiling. Would love to see a fourth condition.

andix

I think there are generally 3 kinds of skills:

1. only information and instructions on how to answer 2. some defined actions (run specific cli commands for specific tasks, use this api with those parameters) 3. skills including scripts

1 seems to be of limited use

2 and 3 can save the agent quite some time for finding a solution. And once the agent found a programmatic solution to a specific problem, they can store this information in a skill

lukev

This clarifies an important point for me.

The derivative of a LLM agent's capabilities (on its own) is negative. It's not that they can't do useful work -- it means that (for now) they require some level of input or steering.

If that were to change -- if an agent could consistently get better at what it does without intervention -- that would represent a true paradigm shift. An accelerating curve, rather than one trending back towards linearity.

This represents a necessary inflection point for any sort of AI "takeoff" scenario.

So this study is actually kind of important, even though it's a null result. Because the contra view would be immensely significant.

show comments

alexhans

Isn't the title editorialised? Probably for clicks?

I think that most of the adoption around Agent Skills would have a focus on ease of use, standarization and context management and not correctness.

My own thoughts on how to approach skill building target people who are adopting LLM development now more than ever although this was definitely possible (in a non standard way before) [1]

[1] https://alexhans.github.io/posts/series/evals/building-agent...

rapind

Breaking news: Developers who yak shave their vim configs also get carried away with their LLM setups.

rrvsh

Despite skills being just a new form of memory and context engineering for an agent, I think the framework is still great for agents to self-develop, given a good prompt to regularly review their own sessions and pick learning points to save as skills. In fact, I think the "craft" of prompt engineering has been lost somewhat - I still enjoy puzzling out and iterating over the best possible starting prompt for a conversation to get the best result I can for a one-shot

FWIW I didn't read the paper and am judging it based on its title, which I think is fair because "self-generated agent skills" is a pretty loose definition.

show comments

getoffit

"Small models" will always outperform as they are deterministic (or closer to it).

This was realized in 2023 already: https://newsletter.semianalysis.com/p/google-we-have-no-moat...

"Less is best" is not a new realization. The concept exists across contexts. Music described as "overplayed". Prose described as verbose.

We just went through an era of compute that chanted "break down your monoliths". NPM ecosystem being lots of small little packages to compose together. Unix philosophy of small composable utilities is another example.

So models will improve as they are compressed, skeletonized down to opcodes, geometric models to render, including geometry for text as the bytecode patterns for such will provide the simplest model for recreating the most outputs. Compressing out useless semantics from the state of the machines operations and leaving the user to apply labels at the presentation layer.

show comments

rcarmo

I only generate skills _after_ I've worked through a problem with the model - usually by asking it "what have you learned in this session?". I have no idea why people would think it can zero-shot a problem space without any guidance or actual experience...

show comments

ryanthedev

Love the article and happy to have a framework but I don’t think those are good SWE skills.

I imagine some more like. https://github.com/ryanthedev/code-foundations

Based of an actual software book.

ineedasername

The title should changed to reflect the actual title, as the user-provided one is incorrect and misstates a central conclusion.

daxfohl

I'm kinda surprised by this. Yeah it's just regurgitating something it already "knows", but I'd still expect that having the skill materialized there in the context would give it something concrete to reference, less likelihood of getting lost or hallucinating, and probably need less incremental context to do the job.

I mean, basically it's doing the same thing as reasoning IIUC, except up-front rather than inline and ad-hoc, so I'd almost expect it to work even better than reasoning alone.

show comments

sebastianconcpt

I am the only one surprised about anyone's need for a study to conclude that?

show comments

small_model

Skills seem to be a crutch until we get continual learning. Imagine you've been running an instance for 6 months and it still remembers when you told it was running on your linux server over ssh and not on your Mac.

show comments

turnsout

It seems intuitive that a naive self-generated Skill would be low-value, since the model already knows whatever it's telling itself.

However, I've found them to be useful for capturing instructions on how to use other tools (e.g. hints on how to use command-line tools or APIs). I treat them like mini CLAUDE.mds that are specific only to certain workflows.

When Claude isn't able to use a Skill well, I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.

With these Skills in place, the agent is able to do things it would really struggle with otherwise, having to consume a lot of tokens failing to use the tools and looking up documentation, etc.

show comments

verdverm

Anecdotal middle ground, I have used LLM automation to generate AGENTS.md files at scale across a repo

1. You MUST review and correct them

2. Embrace minimalism, they are spark notes and an index, not comprehensive

3. Force them into context

I imagine similar concepts hold for skills

scotty79

I think self-generation of skills might be useful if it's based on model doing websearches, experiments in a sandboxed environment and putting into skill what it found out.

Also generating skills using top of the line model to keep using them later in cheap open weights model seems like a good use of resources.

Online sharing of skills generated in such manner also seems like a wonderful idea.

j45

I am lucky to count friends who are academics engaged in research, and one topic of discussion I notice around AI is researchers with a non-tech background and/or a lack of implementation / operationalization / commercialization in applying technology to Business, which can also cloud these kidns of results.

I have systemized and automated businesses for a long time before LLMs came out, which generally wasn't very popular.

It is really weird to see everyone get excited about this kind of automation and then try to jump to the end points with something that's non-deterministic and wonder why it doesn't work like every other computer they've used (all or none).

Agents can self generate skills, maybe not effortlessly, or with psychic skills of reading between the lines (special exception for Claude), it's also about the framework and scaffolding in which to create skills that work, and what can be brought back to the "self-generation".

Without experience in creating computer skills in general, attempts for self-generating agent skills is kind of trying to use AI to autocomplete a sentence and then not like how it went. To a fair degree it can be lined up to improve considerably.

Right now there seems to be a 6-12 month lag between studies like these and it being shared/reported in the wild.

Too often, they are researching something reported in the wild and trying to study it, and it very well may work for some cases, but not all cases, and the research kind of entirely misses it.

With AI, it's incredibly important to follow show and not tell.

Sharing this from genuine curiousity if this resonates with anyone, and if so, how/where.