Skills are the newest hype commodity in the world of agentic AI. They’re text files that the agent can optionally staple onto the context window. You can have skills like “frontend design” or “design tokens”, and if the LLM “thinks” it needs more context about that topic, it can pull the contents of those files into the context to help generate a response.
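To make the mechanics a little more concrete, here’s a minimal sketch of that pattern in Python. Everything in it (the folder layout, the function, the keyword check) is hypothetical; real agents have their own skill formats and their own, far more opaque, relevance heuristics.

```python
from pathlib import Path

# Hypothetical layout: one markdown file per skill topic.
SKILLS_DIR = Path("skills")  # e.g. skills/frontend-design.md, skills/design-tokens.md

def load_relevant_skills(user_request: str) -> str:
    """Return extra context to staple onto the prompt, but only for
    skills whose topic looks relevant to the request. The keyword
    match here is a stand-in for whatever judgment call the model
    actually makes."""
    extra_context = []
    for skill_file in SKILLS_DIR.glob("*.md"):
        topic = skill_file.stem.replace("-", " ")  # "frontend-design" -> "frontend design"
        if topic in user_request.lower():
            extra_context.append(skill_file.read_text())
    return "\n\n".join(extra_context)

prompt = "Build a landing page that follows our frontend design guidelines."
skill_context = load_relevant_skills(prompt)  # may come back empty -- that's the "optionally" part
```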
Generally speaking, skills do an okay job at providing on-demand context. Assuming the AI model is always 12-to-18 months behind in its training data, a skill could potentially backfill any recent framework updates. A skill could potentially undo some training data biases. A skill could potentially apply some of your sensibilities to the output. I’ve seen some impressive results with design guidance skills… but I’ve also seen tons of mediocre results from the same skills. That’s why I deliberately use the word “potentially”. When skills are only optionally included, it’s hard to understand when and why they get applied.
In that way, skills remind me a bit of magic numbers.
In programming, “magic numbers” are a pattern you typically try to avoid. They’re a code smell, a sign that you haven’t actually solved the problem but found a workaround that only works in one particular context. They’re a flashing light that you have brittle logic somewhere in your system. “We don’t know why, but setting the value to 42 appears to have fixed the issue” is a phrase that should send shivers down the spine.
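If you haven’t run into the term, the smell looks something like this (a contrived Python example; the function names, the cache story, and the 42 are all made up):

```python
import time

def wait_for_upstream_before() -> None:
    # Magic number: why 42? Nobody remembers, but it "appears to have fixed the issue".
    time.sleep(42)

# Naming the constant documents the reason and exposes the brittleness.
UPSTREAM_CACHE_TTL_SECONDS = 42  # upstream cache expires after ~40 seconds; wait it out

def wait_for_upstream_after() -> None:
    time.sleep(UPSTREAM_CACHE_TTL_SECONDS)
```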
And so now we have these “magic words” in our codebases. Spells, essentially. Spells that work sometimes. Spells that we cast with no practical way to measure their effectiveness. They are prayers as much as they are instructions.
Were we to sit next to each other and cast the same spell from the same book with the same wand, one of us could have a graceful floating feather and the other could have avada kedavra’d their guts out onto the floor. That unstable magic is by design. That element of randomness, on which the models depend, still gives me apprehension.
There’s an opaqueness to it all. I understand how listing skills in an AGENTS.md gives the agent context on where to find more context. But how do you know if those words are the right words? If I cut the number of words (read: “tokens”) in a skill in half, does it still work? If I double the number of words, does it work better? Those questions matter when too little context is not enough context and too much context causes context rot. They also matter when you’re charged per token and more tokens means more time on the GPU. How do you determine the “Minimum Viable Context” needed to get quality out of the machines?
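The cost half of that question is at least measurable, even if the quality half isn’t. A rough sketch using OpenAI’s tiktoken tokenizer (the encoding name, file path, and per-token price below are placeholder assumptions; other models tokenize and bill differently) tells you what a skill costs every time it gets pulled in, but nothing about whether it works:

```python
import tiktoken  # pip install tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # OpenAI-style tokenizer; other models differ

skill_text = open("skills/frontend-design.md").read()  # hypothetical skill file
token_count = len(encoding.encode(skill_text))

PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # placeholder USD rate; check your provider's pricing
cost_per_injection = token_count / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"{token_count} tokens, ~${cost_per_injection:.4f} every time this skill gets stapled on")
```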
That sort of quality variance is uncomfortable for me from a tooling perspective. Tooling should be highly consistent, and this has a “works on my machine” vibe to it. I suppose all my discomfort goes away if I quit caring about the outputs. If I embrace the cognitive dissonance and switch to a “ZOMG the future is amazeballs” hype mode, my job becomes a lot easier. But my brain has been unsuccessful in doing that thus far. I like magic and mystery, but hope- or luck-based development has its challenges for me.
Looking ahead, I expect these types of errant conjurations will come under more scrutiny when the free-money subsidies run out and consumers inherit the full cost of the models’ mistakes. Supply-chain constraints around memory and GPUs are already making compute a scarce resource, but our Gas Towns plunder onward. When the cost of wrong answers goes up and more and more people spend all their monthly credits on hallucinations, there will be a lot of dissatisfied users.
Anyways, all this changes so much. Today it’s skills, before that MCP, before that PRDs, before that prompt engineering… what is it going to be next quarter? And aren’t those all flavors of the same managing-context puzzle? Churn, churn, churn, I suppose.
File under: non-determinism