I ran Grok CLI this morning and man, it makes a lot more assumptions about things than any other LLM I have used.
Other LLMs have a similar issue, but grok seems to really lean into it.
LLMs need to be trained so that when they don't know something, have open questions, or know they are assuming, they verify with the human.
While social media has been pushing the narrative that you must be building skills, I see no one talking about testing whether you actually need those skills or not.
More importantly, by adding in all of those skills, have you evaluated how an LLM will reason when they're all loaded into the context window?
Have you tested whether or not your multiple skills conflict in any way? Are they using the same patterns and the same verbiage, or are you being inconsistent? If you aren't checking this, then you're creating the perfect situation for the LLM to produce the wrong output.
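A rough sketch of what a first-pass conflict check could look like, in Python. This is an assumption on my part about how you might start, not a complete answer: it only catches literal "always X" versus "never X" clashes between skill files, and the skill names and text below are made up. Real conflicts are usually subtler, but even a crude check like this forces you to look.

```python
import re
from itertools import combinations

def directives(text: str) -> dict[str, str]:
    """Map each 'always X' / 'never X' phrase to its polarity."""
    found = {}
    pattern = r"\b(always|never)\s+([^.\n]+)"
    for polarity, action in re.findall(pattern, text.lower()):
        found[action.strip()] = polarity
    return found

def find_conflicts(skills: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (skill_a, skill_b, action) where the two skills disagree."""
    parsed = {name: directives(body) for name, body in skills.items()}
    conflicts = []
    for (a, da), (b, db) in combinations(parsed.items(), 2):
        # Only actions mentioned by both skills can contradict each other.
        for action in da.keys() & db.keys():
            if da[action] != db[action]:
                conflicts.append((a, b, action))
    return conflicts

# Hypothetical skills, for illustration only.
SKILLS = {
    "git": "Always commit directly to main.",
    "review": "Never commit directly to main. Always run tests.",
}
conflicts = find_conflicts(SKILLS)
```

Here `conflicts` flags the `git` and `review` skills disagreeing about committing directly to main, which is exactly the kind of thing that gets blamed on the model later.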
Always question the so-called AI experts on social media, because the vast majority are not doing any of this testing. They are not context-first and context-aware.
That should be a given. Also, it's not just about writing all the skills, putting them into a git repo and calling it good. You need to constantly test those skills to see if they're still needed as new models come out. A lot of skills should be viewed as temporary - filling in the model gaps. So a skill/agent prompt test harness is also important, so you can retire skills as the models get better.
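A minimal sketch of such a harness, in Python. Everything in it is hypothetical: `run_model` stands in for whatever function sends a prompt to your model, the cases are toy checks, and the `min_gain` threshold is an arbitrary choice. The idea is simply to run the same cases with and without the skill and keep the skill only while it measurably helps.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    """One test case: a prompt plus a check on the model's output."""
    prompt: str
    check: Callable[[str], bool]

def pass_rate(run_model: Callable[[str], str], cases: list[Case], skill: str = "") -> float:
    """Fraction of cases that pass when `skill` text is prepended to each prompt."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        prompt = f"{skill}\n\n{case.prompt}" if skill else case.prompt
        if case.check(run_model(prompt)):
            passed += 1
    return passed / len(cases)

def skill_still_needed(run_model: Callable[[str], str], cases: list[Case],
                       skill: str, min_gain: float = 0.05) -> bool:
    """Keep the skill only if it beats the bare model by at least `min_gain`."""
    baseline = pass_rate(run_model, cases)
    with_skill = pass_rate(run_model, cases, skill)
    return (with_skill - baseline) >= min_gain
```

Rerun this against each new model release. When `skill_still_needed` flips to `False`, the model has closed the gap and the skill is a candidate for retirement.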
While everyone is so focused on building a skill for everything they can think of, very few are being context-aware: what needs to go in context, what doesn't, and what is harmful to the context.
I see so many variations of markdown files where so-called "AI experts" say you need to have a CLAUDE.md, AGENTS.md, SOUL.md, ABOUT.md, CONTEXT.md, CONTEXT-MAP.md .... filling up their context window, never asking if any of these markdown files actually conflict. Then they blame the LLM for its bad output or "hallucinations" when the culprit was the person filling up the context window as fast as possible without ever taking the time to understand if it was even necessary.
I woke up to a heated argument between my AI agents about one of my ideas.
The team took my idea, ran a deep market analysis, and then sent the idea and market research to my council to deliberate and determine if the idea has enough weight to pursue and if I am the right person to pursue it.
The online AI debate is either about how someone gave it an idea and shipped it, or about how AI is so horrible they gave up on it. Both are noise.
I find joy in building the systems AI runs within, and not skipping the last 100 years of business lessons.
https://t.co/KybVvnndfu
Markdown will not be the dominant format for LLMs for long.
HTML is much better suited to LLMs, which understand XML very well. With HTML you get:
* Clean Structure
* Supports narrative & prose
* Supports rules
* Supports state - think short term memory
The W3C already supports custom elements and attributes.
XPath lets agents be very precise about what they read from and write to.
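A small sketch of that precision, in Python with the standard library. The element names (`agent-rule`, `agent-state`) and their attributes are made up for illustration, and the page is assumed to be well-formed, XML-style markup. Note that `xml.etree.ElementTree` only supports a subset of XPath, so a fuller implementation would need a different library.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: these custom elements and attributes are invented
# for this example, not an existing convention.
PAGE = """
<html>
  <body>
    <p>Narrative prose for humans and agents alike.</p>
    <agent-rule id="r1" priority="high">Never delete user files.</agent-rule>
    <agent-rule id="r2" priority="low">Prefer short answers.</agent-rule>
    <agent-state key="current-task">draft</agent-state>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Read: select only the high-priority rules with an XPath predicate.
high = [el.text for el in root.findall(".//agent-rule[@priority='high']")]

# Write: update one precise piece of state without touching anything else.
state = root.find(".//agent-state[@key='current-task']")
state.text = "review"

# Serialize the page back out with the updated state.
updated = ET.tostring(root, encoding="unicode")
```

The agent reads one rule and rewrites one state value, and the rest of the page never has to enter its context.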
We have spent 30+ years making DOM traversal fast for humans.
AI can take advantage of that!
Imagine every HTML page with custom elements/attributes for agents. No need for llms.txt. It just works.
We'll have to solve the security aspect of it, but that's expected.
/cc @karpathy
@shivsakhuja I do like fairly strict boundaries between agents. It makes debugging easier! I tend to use a few rules often in agent design: Single Responsibility Principle, Separation of Concerns, and Progressive Disclosure.
Earned Abstraction
Knowledge lives in the conversation between the question and the answer. If you're not in the conversation and you only look at the question or the answer, then you lose out on seeing and walking the path connecting the two.
I have a question for those who read this:
For some context, I was working on my global CLAUDE.md file making some tweaks and whatnot. Then I started thinking about how much interference I'm going to get when working with agents and plugins because of the CLAUDE.md file. When debugging, I won't know if the CLAUDE.md is the culprit or the plugin itself.
I decided I wanted the CLAUDE.md file empty. Nice and simple. However, that introduced a problem. There are commands in that file that I like.
Idea!
Lenses.
What I love about AI is that you can have an idea, spend a bit of time fleshing it out, open up Claude Code, tell it you want to plan out an idea, start talking, and less than a minute later you have a prototype.
Lenses. A Claude Code plugin that allows you to load lenses that tell Claude how you want to interact with it. This solves my CLAUDE.md issue, because I can load a lens that runs the prompt I originally had and now the session will be in that frame.
This introduced a whole set of ideas that I didn't have until I started playing with it. The obvious one: multiple lenses, loadable at any time. So I added several.
Next, personalities. Maybe I want Claude to be Bob and communicate in a different way. Maybe I want to simulate a discussion with Marcus Aurelius? If I wanted to make the experience even better, I'd add a SQLite database with the vector extension and embed some of his work and tell the personality how it can query it.
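A minimal sketch of that retrieval idea, in Python. To keep it runnable with only the standard library, I've swapped the vector extension for JSON-encoded vectors ranked in Python, and swapped a real embedding model for a toy word-count vector. The table layout, function names, and passages are all placeholders, not the plugin's actual design.

```python
import json
import math
import re
import sqlite3
from collections import Counter

# Placeholder passages standing in for the embedded work (not real quotations).
PASSAGES = [
    "anger harms the one who carries it more than its cause",
    "mornings are for rising to do the work of a human being",
    "death is a natural process and nothing to fear",
]

def toy_embed(text: str) -> dict[str, int]:
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return dict(Counter(re.findall(r"[a-z]+", text.lower())))

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_db(passages: list[str]) -> sqlite3.Connection:
    """Store each passage next to its (JSON-encoded) embedding."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE passages (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")
    conn.executemany(
        "INSERT INTO passages (body, embedding) VALUES (?, ?)",
        [(p, json.dumps(toy_embed(p))) for p in passages],
    )
    return conn

def recall(conn: sqlite3.Connection, question: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the question."""
    q = toy_embed(question)
    rows = conn.execute("SELECT body, embedding FROM passages").fetchall()
    ranked = sorted(rows, key=lambda row: cosine(q, json.loads(row[1])), reverse=True)
    return [body for body, _ in ranked[:k]]

conn = build_db(PASSAGES)
```

The personality prompt would then describe `recall` (or whatever the real query tool ends up being) so the persona can pull in relevant passages instead of carrying the whole text in context.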
Next, if I have lenses and personalities, maybe it should stack them. Maybe have Hemingway being Socratic?
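Stacking could be sketched like this, in Python. The `Layer` and `Session` names, the two kinds, and the ordering rule (personality first, then lenses) are my assumptions for illustration, not how the plugin necessarily works.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Layer:
    kind: str    # "personality" or "lens" (hypothetical categories)
    name: str
    prompt: str

@dataclass
class Session:
    layers: list[Layer] = field(default_factory=list)

    def load(self, layer: Layer) -> "Session":
        """Add a layer, replacing any earlier layer with the same kind and name."""
        self.layers = [
            l for l in self.layers if (l.kind, l.name) != (layer.kind, layer.name)
        ]
        self.layers.append(layer)
        return self

    def system_prompt(self) -> str:
        """Compose the stack: who is speaking first, then how they engage."""
        # sorted() is stable, so load order is preserved within each kind.
        ordered = sorted(self.layers, key=lambda l: l.kind != "personality")
        return "\n\n".join(f"[{l.kind}: {l.name}]\n{l.prompt}" for l in ordered)

# Hemingway being Socratic: one personality plus one lens.
session = (
    Session()
    .load(Layer("lens", "socratic", "Answer with guiding questions."))
    .load(Layer("personality", "hemingway", "Write in short, plain sentences."))
)
prompt = session.system_prompt()
```

Keeping each layer as its own small prompt also makes it easier to tell which one is responsible when the output goes sideways.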
The list could go on and on: Job Function, Cultures, Audience profile... Not sure it needs to go that far, but it could.
Gotta love the ability to freely create.
Anyway, back to the question I opened with. While I am building out an assortment of lenses and personalities, what personalities would you add or find fun? Something that would crack you up? Whatever suggestions I get, I'll add in. Then I'll open source it!
Thank you! I'm very curious about the internals of what you have running - I assume you're not exposing that, which is understandable.
I've been doing a lot of work around building systems for AI to work within to try and achieve a more deterministic output. I wrote about this in a small series that may or may not be of interest to you, but I thought I'd share: https://t.co/7xAQCyEtTk