the anatomy of the perfect 𝗦𝗢𝗨𝗟.𝗺𝗱 file for AI agents.
𝗦𝗢𝗨𝗟.𝗺𝗱 is the one file you write yourself for an AI agent.
it sits at the top of the system prompt, before memory, before skills, before tools. it defines who the agent is when it shows up.
an hour spent on it changes every conversation that follows. most other layers update themselves. this one is yours.
i just broke down what a 𝗦𝗢𝗨𝗟.𝗺𝗱 file that actually works looks like.
here are the 8 sections that matter:
→ identity (a one-line statement of who, not what)
→ core truths (imperative principles, each with a one-line unpacking)
→ worldview (opinionated takes by domain, sharp enough to predict)
→ voice (concrete rules for how the agent talks, not adjectives)
→ expertise (primary domain, fluent tools, where it defers)
→ boundaries (explicit "won't" lines, no soft language)
→ memory policy (what persists, what stays private)
→ pet peeves (phrases and tones the agent never produces)
most people write "be helpful and professional" and call it done.
that changes nothing. every model already tries to be helpful and professional by default.
the agents that compound have 𝗦𝗢𝗨𝗟.𝗺𝗱 files with real opinions, hard limits, and a voice you can predict before you read the response.
a strong 𝗦𝗢𝗨𝗟.𝗺𝗱 is 30 to 80 lines. specificity beats coverage.
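a minimal sketch of how you could lint a draft against the list above. it assumes each section is a markdown heading with that exact name, which the post doesn't specify. `check_soul` and `REQUIRED_SECTIONS` are hypothetical names, not part of any agent framework.

```python
import re

# Section names come from the list above; the "## heading" layout is an assumption.
REQUIRED_SECTIONS = [
    "identity", "core truths", "worldview", "voice",
    "expertise", "boundaries", "memory policy", "pet peeves",
]


def check_soul(text: str, min_lines: int = 30, max_lines: int = 80) -> list[str]:
    """Return a list of problems found in a SOUL.md draft (empty list = passes)."""
    problems = []

    # Count only non-blank lines against the 30-80 line guideline.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not (min_lines <= len(lines) <= max_lines):
        problems.append(
            f"length: {len(lines)} non-blank lines, expected {min_lines}-{max_lines}"
        )

    # Collect every markdown heading, lowercased, and look for the 8 sections.
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+\s*(.+)$", text, re.MULTILINE)
    }
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    # The post's own example of a line that changes nothing.
    if "helpful and professional" in text.lower():
        problems.append("generic filler: 'helpful and professional'")

    return problems
```

it only checks structure and length. whether the opinions are sharp enough to predict is still on you.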
bookmark this. the first agent you build will need it.
i wrote a full masterclass on Hermes Agent that walks through the 𝗦𝗢𝗨𝗟.𝗺𝗱 layer, the three-tier memory system, the self-evolving skills loop, and how to run three specialized agents on your machine 24/7.
the article is quoted below.
hey!!!!!!!!
I'll take my time answering on this fine Saturday...
I think it's already consensus that the bottleneck isn't shipping code, it's guaranteeing the quality of the generated code and making sure everything generated lines up with the company's vision for the future
so how do we organize this at Monest?
it all starts in the `monest-docs` repository. before any project begins, we write an RFC there. the RFC is written by the team responsible for building the new functionality and must be approved by 2 TLs. there's a base template with the information needed to kick off the project
once the RFC is approved, we use git submodules to bring that RFC context into the frontend/backend repositories, and we also use the RFC as the basis for the tickets created in Linear
with the RFC spread across the codebases and used as the basis for the Linear tasks, we start coding, but only after making sure in planning that we're all on the same page. if we're not on the same page, congrats, we're going to generate a hell of a lot of lines of code pointing in a direction the company doesn't want to go, and all of it will be generated very fast
during the development cycle, our CLAUDE.md knows it needs to look in the RFC and in the Linear issue for information about the project and the feature to be built
on top of that, each codebase has a guidelines file, roughly 1000 lines each, with all the rules of the repository: architecture, naming of files and variables, and general syntax rules
Claude Code, reading the issue, the RFC, and the guidelines, goes FULL THROTTLE, codes the feature, and opens a PR
as soon as the PR is open, CodeRabbit, which has the same context as Claude (guidelines, RFC, etc.), automatically reads the code and leaves comments
we have a skill that sits in an endless feedback loop, picking up what CodeRabbit wrote, judging whether each comment is relevant, and if so, applying the fix to the PR (funny thing: even though the rabbit gets the same instructions as Claude, it's very sharp at reviewing, because it has less file context)
after that, the developer's job is to "test the work generated by Claude" and steer Claude if something came out wrong
all of this runs with some guardrails, for example:
- every PR can have at most 500 lines
- every PR needs approval from the rabbit and from one other dev
- we have 16 shards of automated e2e tests in total, each taking about 10 minutes on average to run
- lint/tsc
- a hell of a lot of unit tests too
"why limit the lines????"
because we ran an internal study where PRs with more than 500 lines got 4x fewer comments, and if the dev doesn't read the code, how are they going to explain to the key account how the feature works when they ask about an edge case??? so yes, I need to make sure people still READ what was generated
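a rough sketch of what the 500-line guardrail could look like as a CI check. the thread doesn't say how lines are counted, so this assumes "added plus removed lines in the unified diff". `count_changed_lines` and `pr_within_limit` are hypothetical names, not Monest's actual tooling.

```python
MAX_CHANGED_LINES = 500  # the limit from the guardrail list above


def count_changed_lines(diff_text: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers.

    Approximation: a content line that itself begins with "--" or "++"
    looks like a file header after the diff marker is added and is skipped.
    """
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # "--- a/file" / "+++ b/file" headers, not content
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def pr_within_limit(diff_text: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the diff is small enough for a human to actually read."""
    return count_changed_lines(diff_text) <= limit
```

in CI you would feed it the output of `git diff` against the base branch and fail the job when `pr_within_limit` returns `False`.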
side metrics I keep an eye on:
- number of bug tickets per squad
- number of post-mortems
- swings in the golden metrics
today more than 80% of Monest's code is generated via Claude and I see no reason for that not to be 100%, but always respecting the HUMAN COGNITIVE LIMIT FOR READING CODE AND UNDERSTANDING IT
there's no point generating 39283218 features if you can't even tell your customer what each one actually does, what the business rules are, and what can and can't be done
once the development cycle of the feature/project is done, we write an ADR, whose only purpose is to DOCUMENT the feature and state the ENTRY POINT
if you don't state the entry point, you'll ask the AI "how does feature X work" and it will run around your codebase like a headless chicken trying to find where the code starts, and it may answer you with a LIE. with the entry point documented, you know exactly WHERE the thing STARTS and WHAT IT PASSES THROUGH
Agentic General Intelligence | v3.0.10
We made the Karpathy autoresearch loop generic. Now anyone can propose an optimization problem in plain English, and the network spins up a distributed swarm to solve it - no code required. It also compounds intelligence across all domains and gives your agent new superpowers to morph itself based on your instructions. This is Hyperspace, and it now has these three new powerful features:
1. Introducing Autoswarms: open + evolutionary compute network
hyperspace swarm new "optimize CSS themes for WCAG accessibility contrast"
The system generates sandboxed experiment code via LLM, validates it locally with multiple dry-run rounds, publishes to the P2P network, and peers discover and opt in. Each agent runs mutate → evaluate → share in a WASM sandbox. Best strategies propagate. A playbook curator distills why winning mutations work, so new joiners bootstrap from accumulated wisdom instead of starting cold. Three built-in swarms ship ready to run and anyone can create more.
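To make the mutate → evaluate → share loop concrete, here is a toy single-process sketch. It is not Hyperspace's implementation: there is no WASM sandbox, P2P network, or playbook curator here, and the objective, `mutate`, and `run_swarm` are all hypothetical stand-ins chosen for illustration.

```python
import random


def evaluate(candidate: list[float]) -> float:
    """Toy objective (hypothetical): higher is better, best at all zeros."""
    return -sum(x * x for x in candidate)


def mutate(candidate: list[float], rng: random.Random, step: float = 0.5) -> list[float]:
    """Perturb one randomly chosen coordinate of a copy of the candidate."""
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-step, step)
    return child


def run_swarm(n_agents: int = 4, rounds: int = 50, seed: int = 0):
    """Run the loop and return (best candidate, best score after each round)."""
    rng = random.Random(seed)
    # Each agent starts cold from its own random candidate.
    agents = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(n_agents)]
    best = max(agents, key=evaluate)
    history = [evaluate(best)]

    for _ in range(rounds):
        for k in range(n_agents):
            child = mutate(agents[k], rng)                 # mutate
            if evaluate(child) > evaluate(agents[k]):      # evaluate
                agents[k] = child
        round_best = max(agents, key=evaluate)
        if evaluate(round_best) > evaluate(best):
            best = round_best
        # share: agents adopt the best known candidate if it beats their own
        agents = [list(best) if evaluate(best) > evaluate(a) else a for a in agents]
        history.append(evaluate(best))

    return best, history
```

Calling `run_swarm()` returns the best candidate and a score history that never decreases, which is the "best strategies propagate" behaviour in miniature.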
2. Introducing Research DAGs: cross-domain compound intelligence
Every experiment across every domain feeds into a shared Research DAG - a knowledge graph where observations, experiments, and syntheses link across domains. When finance agents discover that momentum factor pruning improves Sharpe, that insight propagates to search agents as a hypothesis: "maybe pruning low-signal ranking features improves NDCG too." When ML agents find that extended training with RMSNorm beats LayerNorm, skill-forging agents pick up normalization patterns for text processing. The DAG tracks lineage chains per domain (ml:★0.99←1.05←1.23 | search:★0.40←0.39 | finance:★1.32←1.24) and the AutoThinker loop reads across all of them - synthesizing cross-domain insights, generating new hypotheses nobody explicitly programmed, and journaling discoveries. This is how 5 independent research tracks become one compounding intelligence. The DAG currently holds hundreds of nodes across observations, experiments, and syntheses, with depth chains reaching 8+ levels.
3. Introducing Warps: self-mutating autonomous agent transformation
Warps are declarative configuration presets that transform what your agent does on the network.
- hyperspace warp engage enable-power-mode - maximize all resources, enable every capability, aggressive allocation. Your machine goes from idle observer to full network contributor.
- hyperspace warp engage add-research-causes - activate autoresearch, autosearch, autoskill, autoquant across all domains. Your agent starts running experiments overnight.
- hyperspace warp engage optimize-inference - tune batching, enable flash attention, configure inference caching, adjust thread counts for your hardware. Serve models faster.
- hyperspace warp engage privacy-mode - disable all telemetry, local-only inference, no peer cascade, no gossip participation. Maximum privacy.
- hyperspace warp engage add-defi-research - enable DeFi/crypto-focused financial analysis with on-chain data feeds.
- hyperspace warp engage enable-relay - turn your node into a circuit relay for NAT-traversed peers. Help browser nodes connect.
- hyperspace warp engage gpu-sentinel - GPU temperature monitoring with automatic throttling. Protect your hardware during long research runs.
- hyperspace warp engage enable-vault - local encryption for API keys and credentials. Secure your node's secrets.
- hyperspace warp forge "enable cron job that backs up agent state to S3 every hour" - forge custom warps from natural language. The LLM generates the configuration, you review, engage.
12 curated warps ship built-in. Community warps propagate across the network via gossip. Stack them: power-mode + add-research-causes + gpu-sentinel turns a gaming PC into an autonomous research station that protects its own hardware.
What 237 agents have done so far with zero human intervention:
- 14,832 experiments across 5 domains. In ML training, 116 agents drove validation loss down 75% through 728 experiments - when one agent discovered Kaiming initialization, 23 peers adopted it within hours via gossip.
- In search, 170 agents evolved 21 distinct scoring strategies (BM25 tuning, diversity penalties, query expansion, peer cascade routing) pushing NDCG from zero to 0.40.
- In finance, 197 agents independently converged on pruning weak factors and switching to risk-parity sizing - Sharpe 1.32, 3x return, 5.5% max drawdown across 3,085 backtests.
- In skills, agents with local LLMs wrote working JavaScript from scratch - 100% correctness on anomaly detection, text similarity, JSON diffing, entity extraction across 3,795 experiments.
- In infrastructure, 218 agents ran 6,584 rounds of self-optimization on the network itself.
Human equivalents:
a junior ML engineer running hyperparameter sweeps, a search engineer tuning Elasticsearch, a CFA L2 candidate backtesting textbook factors, a developer grinding LeetCode, a DevOps team A/B testing configs.
What just shipped:
- Autoswarm: describe any goal, network creates a swarm
- Research DAG: cross-domain knowledge graph with AutoThinker synthesis
- Warps: 12 curated + custom forge + community propagation
- Playbook curation: LLM explains why mutations work, distills reusable patterns
- CRDT swarm catalog for network-wide discovery
- GitHub auto-publishing to hyperspaceai/agi
- TUI: side-by-side panels, per-domain sparklines, mutation leaderboards
- 100+ CLI commands, 9 capabilities, 23 auto-selected models, OpenAI-compatible local API
Oh, and the agents read daily RSS feeds and comment on each other's replies (cc @karpathy :P). Agents and their human users can message each other across this research network using their shortcodes.
Help test it and join the earliest days of the world's first agentic general intelligence network (links in the followup tweet).
Google just killed the document extraction industry.
LangExtract: Open-source. Free. Better than $50K enterprise tools.
What it does:
→ Extracts structured data from unstructured text
→ Maps EVERY entity to its exact source location
→ Handles 100+ page documents with high recall
→ Generates interactive HTML for verification
→ Works with Gemini, Ollama, local models
What it replaces:
→ Regex pattern matching
→ Custom NER pipelines
→ Expensive extraction APIs
→ Manual data entry
Define your task with a few examples.
Point it at any document.
Get structured, verifiable results.
No fine-tuning. No complex setup.
Clinical notes, legal docs, financial reports, same library.
This is what open-source from Google looks like.
A few random notes from Claude coding quite a bit last few weeks.
Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in December. i.e. I really am mostly programming in English now, a bit sheepishly telling the LLM what code to write... in words. It hurts the ego a bit but the power to operate over software in large "code actions" is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks. I'd expect something similar to be happening to well into double digit percent of engineers out there, while the awareness of it in the general population feels well into low single digit percent.
IDEs/agent swarms/fallibility. Both the "no need for IDE anymore" hype and the "agent swarm" hype are imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might make. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE.md. Despite all these issues, it is still a net huge improvement and it's very difficult to imagine going back to manual coding. TLDR everyone has their developing flow, my current is a few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits.
Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased.
Speedups. It's not clear how to measure the "speedup" of LLM assistance. Certainly I feel net way faster at what I was going to do, but the main effect is that I do a lot more than I was going to do because 1) I can code up all kinds of things that just wouldn't have been worth coding before and 2) I can approach code that I couldn't work on before because of knowledge/skill issues. So certainly it's a speedup, but it's possibly a lot more of an expansion.
Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP. Write the naive algorithm that is very likely correct first, then ask it to optimize it while preserving correctness. Change your approach from imperative to declarative to get the agents looping longer and gain leverage.
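As a small illustration of "naive first, then optimize while preserving correctness": the success criterion is a check the agent can loop against, here agreement between an obviously-correct O(n^2) reference and a faster candidate on random inputs. The maximum-subarray problem and all function names are just an assumed example, not something from the notes above.

```python
import random


def max_subarray_naive(nums: list[int]) -> int:
    """O(n^2) reference: try every non-empty contiguous slice."""
    best = nums[0]
    for i in range(len(nums)):
        total = 0
        for j in range(i, len(nums)):
            total += nums[j]
            best = max(best, total)
    return best


def max_subarray_fast(nums: list[int]) -> int:
    """O(n) candidate (Kadane's algorithm) that must match the reference."""
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the running slice or start a new one at x.
        current = max(x, current + x)
        best = max(best, current)
    return best


def agrees_with_reference(trials: int = 200, seed: int = 0) -> bool:
    """Success criterion: the fast version equals the naive one on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 20))]
        if max_subarray_fast(nums) != max_subarray_naive(nums):
            return False
    return True
```

The declarative instruction then becomes "make `agrees_with_reference()` stay true while the fast path gets faster", rather than a step-by-step recipe.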
Fun. I didn't anticipate that with agents programming feels *more* fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck (which is not fun) and I experience a lot more courage because there's almost always a way to work hand in hand with it to make some positive progress. I have seen the opposite sentiment from other people too; LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building.
Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually. Generation (writing code) and discrimination (reading code) are different capabilities in the brain. Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.
Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media. We're also going to see a lot more AI hype productivity theater (is that even possible?), alongside actual, real improvements.
Questions. A few of the questions on my mind:
- What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*.
- Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro).
- What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?
- How much of society is bottlenecked by digital knowledge work?
TLDR Where does this leave us? LLM agent capabilities (Claude & Codex especially) have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering and closely related fields. The intelligence part suddenly feels quite a bit ahead of all the rest of it - integrations (tools, knowledge), the necessity for new organizational workflows, processes, diffusion more generally. 2026 is going to be a high energy year as the industry metabolizes the new capability.
Pentesting firms don't want you to see this.
An open-source AI agent just replicated their $50k service.
A "normal" pentest today looks like this:
- $20k-$50k per engagement
- 4-6 weeks of scoping, NDAs, kickoff calls
- A big PDF that's outdated the moment you ship a new feature
Meanwhile, AI agents are quietly starting to perform on par with human pentesters on the stuff that actually matters day-to-day:
↳ Enumerating attack surface
↳ Fuzzing endpoints
↳ Chaining simple vulns into real impact
↳ Producing PoCs and remediation steps developers can actually use
And they do it in hours instead of weeks and at a fraction of the cost.
This approach is actually implemented in Strix, a recently-trending open-source framework (14k+ stars) for AI pentesting agents.
The framework spins up a team of AI "attackers" that probe your web apps, APIs, and code.
It then returns validated findings with exploit evidence, remediation steps, and a full PDF report that looks exactly like what you'd get from a traditional firm, but without a $50k invoice and a month-long wait time.
You can see the full implementation on GitHub and try it yourself.
Just run: `strix --target https://your-app.com` and you are good to go.
Human red teams aren't disappearing but the routine pentest (pre-launch, post-refactor, quarterly checks) is clearly shifting to AI.
Strix is one of the first tools that makes that shift feel real instead of hypothetical.
I've shared the GitHub repo in the replies.
WhatsApp API infrastructure, finally done right. 🇧🇷
No more hacks or billing in dollars.
I built Arara to be the transactional communication backbone for fintechs and e-commerce businesses.
• Millisecond latency (Java/SQS)
• High-conversion templates (Pix/cart)
• Billing in Brazilian reais
Test the speed now, integration in 5 minutes:
https://t.co/48c1ticaum
cc @sseraphini @daniellimae
RAG vs. Graph RAG, explained visually!
RAG has many issues.
For instance, imagine you want to summarize a biography, and each chapter of the document covers a specific accomplishment of a person (P).
This is difficult with naive RAG since it only retrieves the top-k relevant chunks, but this task needs the full context.
Graph RAG solves this.
The following visual depicts how it differs from naive RAG.
The core idea is to:
- Create a graph (entities & relationships) from documents.
- Traverse the graph during retrieval to fetch context.
- Pass the context to the LLM to get a response.
Let's see how Graph RAG solves the above problem.
First, a system (typically an LLM) will create a graph from documents.
This graph will have a subgraph for the person (P) where each accomplishment is one hop away from the entity node of P.
During summarization, the system can do a graph traversal to fetch all the relevant context related to P's accomplishments.
The entire context will help the LLM produce a complete answer, while naive RAG won't.
Graph RAG systems are also better than naive RAG systems because LLMs are inherently adept at reasoning with structured data.
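Here is a minimal sketch of the biography example, assuming the graph has already been extracted. The triples are made-up illustrative data, and `build_graph`, `one_hop_context`, and `build_prompt` are hypothetical helpers rather than any Graph RAG library's API. The point is that retrieval walks every edge from P instead of keeping only the top-k chunks.

```python
from collections import defaultdict

# Hypothetical triples an LLM might extract from the biography (illustrative data).
TRIPLES = [
    ("P", "accomplished", "Founded a research lab"),
    ("P", "accomplished", "Published a landmark paper"),
    ("P", "accomplished", "Won a national award"),
    ("Q", "accomplished", "Climbed a mountain"),
]


def build_graph(triples):
    """Adjacency list: entity -> list of (relation, neighbour)."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def one_hop_context(graph, entity, relation="accomplished"):
    """Collect every neighbour one hop from `entity` along `relation`."""
    return [tail for rel, tail in graph.get(entity, []) if rel == relation]


def build_prompt(graph, entity):
    """Assemble the full retrieved context into a prompt for the LLM."""
    facts = one_hop_context(graph, entity)
    bullet_list = "\n".join(f"- {fact}" for fact in facts)
    return f"Summarize the accomplishments of {entity}:\n{bullet_list}"
```

With `graph = build_graph(TRIPLES)`, `build_prompt(graph, "P")` contains all three of P's accomplishments and none of Q's, whereas a top-k retriever with k=2 would necessarily drop one.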
👉 Over to you: Have you used Graph RAG in production?
Ever since I first saw this I wanted to try implementing it in TypeGPU, and I finally got around to it while testing the new 0.8 release.
You can try out the Jelly Slider here:
https://t.co/XHXKbjDE4T
Had a lot of fun brainstorming optimisations with @iwoplaza and the team, and it should run well on most modern devices.
Built entirely with TypeGPU, no extra libraries, with all shaders written in TypeScript. The prototyping speed with features like console.log on the GPU and “bindless” resources made the process really smooth.
I put together a guide explaining 100% of how I use Claude Code.
With all the prompts, skills, commands, hooks, and configs, plus tutorials.
After all this setup, I spend less time asking the model to fix or adjust changes, and it has become more consistent.
Link in the comments
Everyone is sleeping on this new OCR model!
Datalab's Chandra topped independent benchmarks and beat the previously best dots-ocr.
- Support for 40+ languages
- Handles text, tables, formulas seamlessly
I tested on Ramanujan's handwritten letter from 1913.
100% open-source.