stop asking Claude one question and thinking you understand the topic. you don't.
Stanford proved a better way. it's called STORM. peer reviewed. 25% more organized output. open source.
the trick: don't ask one question. ask five. from five different experts.
>the practitioner: what do they know that academics miss?
>the skeptic: what's the strongest counterargument?
>the economist: who profits from the current narrative?
>the historian: what pattern has played out before?
>the academic: what does the evidence actually say?
4 prompts. 5 minutes. no software. no GitHub. just paste into Claude.
single prompts give you what everyone already knows.
STORM gives you what nobody else found.
this article has all 4 prompts ready to copy. pick your hardest topic. paste prompt 1. you'll know more in 5 minutes than people who spent days reading.
SpaceX has exercised the option to acquire @cursor_ai in an all-stock transaction with the goal of building the world’s most useful AI models.
For the past few months, SpaceXAI has been jointly training a model with Cursor, which will be released in Cursor and Grok Build soon.
We look forward to working closely with the Cursor team to advance our frontier AI capabilities
GOOGLE HA LIBERADO EN SILENCIO UNA IA QUE PREDICE PATRONES
Ventas. Precios de mercado. Tráfico web.
Demanda energética. Volatilidad cripto.
Se llama TimesFM:
→ Entrenada con 100B de datos reales
→ Forecasting zero-shot, sin fine-tuning
→ Corre en local.
100% Gratis y Open Source.
Enlace abajo👇
Someone built a free collection of production-grade engineering skills that teaches your AI coding agent to work exactly like a senior engineer.
It's called agent-skills. 60,800+ stars on GitHub.
You drop it into Claude Code, Codex, Cursor, or Gemini CLI.
Here's what it does:
→ `/spec` forces the agent to define what to build before touching code. Spec before code. Every time.
→ `/plan` breaks the spec into small, atomic tasks. No giant PRs. No mystery diffs.
→ `/build` implements one slice at a time. Each task is test-driven and committed individually.
→ `/build auto` generates the plan and runs every task in a single approved pass. You approve once. It executes autonomously. Pauses on failures or risky steps.
→ `/test` proves the code works. Tests are treated as proof, not afterthought.
→ `/review` enforces code health before merge. A real quality gate, not a vibe check.
→ `/code-simplify` rewrites for clarity over cleverness. Kills the clever nonsense your agent wrote at 2am.
→ `/ship` runs the full production checklist. Faster is safer only when nothing is skipped.
→ Skills activate automatically based on context. Building an API triggers `api-and-interface-design`. Building UI triggers `frontend-ui-engineering`. No manual configuration.
100% Open Source.
Github repo link: https://t.co/DUlNIoUr7u
Elon Musk explains his 5-step algorithm for solving any problem:
"The most common mistake of smart engineers is to optimize a thing that should not exist."
"I have this very basic first principles algorithm that I run as a mantra."
Elon breaks it down:
Step 1: Question the requirements.
"Make the requirements less dumb. The requirements are always dumb to some degree, no matter how smart the person who gave you those requirements. You have to start there, because otherwise you could get the perfect answer to the wrong question."
Step 2: Try to delete it.
"Try to delete the part or the process step entirely. If you're not forced to put back at least 10% of what you delete, you're not deleting enough. Most people feel like they've succeeded if they haven't been forced to put things back in. But actually they haven't, they've been overly conservative and left things in that shouldn't be there."
Step 3: Optimize or simplify.
"The most common mistake of smart engineers is to optimize a thing that should not exist. So you don't optimize until after you've tried to delete."
Step 4: Speed it up.
"Any given thing can be done faster than you think. But you shouldn't speed things up until you've tried to delete it and optimize it otherwise, you're speeding up something that shouldn't exist."
Step 5: Automate.
"And then the fifth thing is to automate it."
Elon explains why the order matters:
"I've gone backwards so many times where I've automated something, sped it up, simplified it, and then deleted it. I got tired of doing that. So that's why I have this mantra."
HarnessX: a harness that compiles itself.
every harness improvement so far has come from a human editing code by hand.
Anthropic strips planning steps out of Claude Code when a stronger model ships. Manus rebuilt its agent five times in six months, removing complexity each round.
the craft runs on human judgment about what to change and when. HarnessX is what happens when a system makes those edits itself.
the trick is to treat the harness as a first-class object, the way we already treat model weights.
once it's a typed, editable artifact, it can be optimized from its own execution traces.
the framing they use is an operational mirror. evolving a harness maps cleanly onto reinforcement learning.
the harness is the state. an edit is the action. the trace plus a score is the feedback. a new version is the update.
once you see it that way, the failure modes come for free. reward hacking, catastrophic forgetting, under-exploration.
the same problems that break model training show up when a system edits its own scaffolding.
so edits never ship blind. each round, a loop reads the traces, plans a change, writes the edit, then critiques it.
a gate keeps the new version only if it beats the current one on tasks it hasn't seen.
what makes this safe is the structure underneath. the harness is built from typed components the system can swap without breaking the rest.
that is what compiles really means here. every candidate harness is type-checked before it runs.
here is the result that matters. the weakest model improved the most. the strongest barely moved.
an evolved harness closes the gaps a weak model cannot fix on its own. the weights never changed. the environment around them got smarter.
this is the natural next phase of harness engineering. we moved from weights, to context, to hand-built harnesses.
the harness was the last piece we still tuned by hand.
i wrote a deep dive on agent harness engineering a while back, covering the orchestration loop, tools, memory, context management, and everything that turns a stateless LLM into a capable agent. the article is below.
paper: HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry: https://t.co/L0GeUKCgef
ANTHROPIC JUST QUIETLY SHIPPED A FEATURE THAT LETS CLAUDE SPAWN A WHOLE TEAM OF AGENTS THAT MESSAGE EACH OTHER AND REVIEW EACH OTHER'S WORK.
It's a Claude Code feature called agent teams. The team lead spawns multiple agents that share a task list and message each other directly, not subagents reporting back, actual peers. In the demo a QA agent caught three bugs, sent the work back to the front-end and back-end devs, they fixed it, app shipped in one pass.
How to run it:
1. Enable it. Needs Claude Code v2.1.32+. Add to settings.json: "env": { "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1" }. Or paste that to Claude and say "add this to my settings." Restart.
2. Prompt in plain English. Start with a goal (agents wake with zero context), then "create a team of 3 using Sonnet," describe each role, its deliverable, and who it messages when done.
3. The rules: each agent owns its own files, define exact outputs, name who talks to who, keep it to 3-5 agents.
Use it for complex work with separate parts running in parallel. Skip it for simple or sequential tasks, teams cost 3-4x the tokens.
Bookmark this.
Claude Code fully dissected!
Researchers from UCL reverse-engineered the leaked Claude source. What they found changes how you should think about agent design.
Only 1.6% of the codebase is AI decision logic.
The other 98.4% is operational infrastructure. Permission gates, tool routing, context compaction, recovery logic, session persistence. The model reasons. The harness does everything else.
This is the opposite of what most agent frameworks do today.
LangGraph routes model outputs through explicit state machines. Devin bolts heavy planners onto operational scaffolding. Claude Code gives the model maximum decision latitude inside a rich deterministic harness, and invests all its engineering effort in that harness.
The core loop is a simple while-true. Call model, run tools, repeat.
But the systems around that loop are where the real design lives:
A permission system with 7 modes and an ML classifier. Users approve 93% of prompts anyway, so the architecture compensates with automated layers instead of adding more warnings.
A 5-layer context compaction pipeline. Each layer runs only when cheaper ones fail. Budget reduction, snip, microcompact, context collapse, auto-compact.
Four extension mechanisms ordered by context cost. Hooks (zero), skills (low), plugins (medium), MCP (high). Each answers a different integration problem.
Subagents return only summary text to the parent. Their full transcripts live in sidechain files. Agent teams still cost roughly 7x the tokens of a standard session.
Resume does not restore session-scoped permissions. Trust is re-established every session. That friction is the point.
The bet behind all of this is simple. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model.
Paper: Dive into Claude Code (arXiv:2604.14228)
We've shared an article on Agent Harness and what every big company is building.
Read it below.
Karpathy said something you'll regret ignoring:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
Loop engineering is the exact thing that does that.
In a hand-run session, the operator handles two things:
- deciding what the agent runs next
- and checking its output before the next step
Both are manual, and both decide how far the agent gets on its own without the operator.
Loop engineering moves both steps into the system.
A core operating structure surrounds the loop, and the diagram below depicts it.
- A schedule decides what to run
- Loop is the maker that produces the work
- A separate checker agent grades the output
- A file on disk holds the state they both read.
The loop runs until either done, max iterations, or an exhausted budget.
Here are some practical engineering considerations:
1) A model grading its own output justifies what it already did instead of catching where it failed.
That's why a separate checker's findings return to the maker as the next instruction. And the cycle repeats until the checker finds nothing left to fix.
2) A loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs add up.
That's why the exit must be set before the loop runs, not while it is running.
A simple exit could be:
↳ fix only the major issues, run one final pass, and stop after two loops, with "all tests pass and lint clean" as the rule that ends it.
3) State has to live on disk, not in context.
The model forgets everything between runs, so an MD file or a knowledge graph holds what is done and what is still open.
Each run reads it and writes back to it, which lets a loop pick up again after days.
4) The lower the verification bar, the safer the loop.
Boring, repetitive checks like a stale version string or a missing test are trivial to verify, so a loop runs them with little risk while the operator is away.
Judgment-heavy work is loopable too, but only as far as the checker can confirm the result.
Let's look at how an unattended loop fails in two ways.
1) It reports done when nothing is actually verified.
The separate checker exists to prevent it, but it merges code faster than anyone reads it, so over weeks, the team stops understanding its own codebase while every check stays green.
Green tests say the code passed the tests, not that anyone knows what shipped. Someone still has to read what the loop merges.
2) The checker keeps a running loop honest, but it only catches failures inside a run.
The harness around the loop, like the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change.
That repair loop is usually run by hand based on observability traces.
My co-founder wrote a detailed walkthrough (with code) on making that harness repair itself, where a failing trace gets diagnosed, the fix is verified against the exact input that failed, and the failure is locked as a regression test so it cannot recur.
Read it below.
#NJU research team, led by Professor Wang Xinran and Associate Professor Qiu Hao from the College of Integrated Circuits, in collaboration with Suzhou National Laboratory and Huawei Technologies Co., Ltd., has successfully developed the Mengqi-1000: the world's first molybdenum disulfide-based multi-bit parallel microprocessor.
Mengqi-1000's transistor integration density sets a new record among emerging non-silicon digital circuits. This achievement marks that China's research on two-dimensional semiconductors has entered a new stage of integration with industrial production lines.
The achievement was published in Nature Electronics @NatureElectron on May 26, 2026: https://t.co/25c3vEL6ic
#NJUresearch
Anthropic just literally spoon-fed you how to use Fable properly.
99% of Claude users missed it.
The way you need to prompt Fable is fundamentally different from all other AI models.
I translated their entire new Fable prompting handbook:
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
Stanford + Meta just dropped the paper that flips everything about AI agents.
It's called "Code as Agent Harness."
Right now, we treat large language models as text generators. When they need to solve a complex problem, they rely on a "chain of thought."
But natural language is slippery. It's vague. It loses context. When an agent hallucinates in English, it just keeps talking.
So they introduced a framework that changes the entire architecture of autonomy: "Code as Agent Harness."
They stopped asking the AI to reason in words, and forced it to reason in code.
Code isn't just the final output anymore. It is the memory. It is the environment. It is the boundary.
Instead of writing a paragraph about how to solve a problem, the agent writes a script, executes it, and reads the output.
Tests become its senses. Execution logs become its memory. Sandboxes become its physics.
If an agent makes a mistake in English, it apologizes and hallucinates again.
If an agent makes a mistake in code, the compiler throws an error. The trace tells it exactly what broke. The system forces it to fix it.
This is where prompt engineering dies, and systems engineering takes over.
The paper proves that reliability doesn't come from a smarter base model. It comes from the "harness" wrapped around it:
- The model proposes.
- The harness executes.
- The environment returns feedback.
- The verifier checks.
what is agent looping
for the last two years we prompted agents one task at a time. that is starting to change
instead of asking an agent to build the landing page and then driving every step yourself, you set up a loop that handles discovery, planning, the work, checking, and iterating until the goal is met
looping is a setup you build. almost any agent harness can run it, it just depends on how you wire it up
at its simplest, looping is one agent working on itself:
> researches
> drafts
> checks the draft against a goal
> fixes what is weak
> runs that cycle again until the work clears the requirements
you are not prompting each step anymore. the agent repeats the cycle for you
the bigger version is a fleet looping. you give an orchestrator agent a goal, it breaks the goal into pieces, hands each piece to a specialist agent, and those specialists hand smaller jobs to their own subagents
the whole tree keeps looping through discovery, planning, execution, and verification until the goal is met
one agent looping is like a person redoing their own draft. a fleet looping is a whole team running a project end-to-end
you create a goal, and the system runs the loop until it finishes within the reqs you set
open and closed looping:
OPEN LOOPING is exploratory. it still has conditions and a goal, but you give the agent or the fleet a wide space to move in. it can try different paths, discover things, build something you did not fully spec out
this is the exciting end, it is what Peter and others are doing, and tbh it is where I want to spend more time
the catch is cost, an open loop with real room to explore burns an insane amount of tokens. for the 90 percent of people without an unlimited budget it is not runnable yet, and pointed at projects with a loose standard it turns into a slop machine
CLOSED LOOPING is bounded. a human designs the end-to-end path first:
> clear goal
> defined steps
> an eval at each step
> a point where it stops or hands back to you (and feeds back performance data)
the agents still loop, but inside framework you built. it gets better every run because each pass feeds the next, and it runs on a normal budget because the path is tight.
for most marketing work, closed is the one that pays off today.
> the orchestrator owns the goal
> the specialists own the steps
> the subagents do the narrow work
> an eval gate make sure its not slop
Andrej Karpathy spent 2h showing how he actually uses AI day to day
he's a co-founder of OpenAI and led AI at Tesla, so when he shows how he works, it’s worth watching
and the whole session is just him telling the machine what he wants in simple terms, like he's briefing a coworker
watch what's actually happening the entire time:
> he describes the task in normal words
> it goes off and does the work
> he glances at the result and nudges it with one more sentence
that's the whole skill, and you've had it since you learned to talk
the only gap between that and a worker that runs on its own is handing that sentence a schedule and the tools to act
check his work, then build the version that keeps working when you stop
This is the best site on the internet to learn harness engineering.
Free. Completely.
Most AI engineers have never heard the term.
https://t.co/bwDbTTYsjM
Bookmark this site.
Then read this setup ↓
Anthropic's in trouble, again!
They spent years building what's now fully open-source.
What made Claude feel different from a normal app is that the agent could act inside the interface instead of only talking in a chat box.
For instance, Claude Artifacts let an agent render real UI, charts, dashboards, and interactive components that assemble live inside the response.
Every major AI product tried to replicate it.
But the problem was that unlike reasoning, planning, tool-calling, etc., none of it shipped natively with LangGraph, CrewAI, or Google ADK.
So teams started building an owned version that required engineering the entire interface layer from scratch.
Most teams, however, just settled for shipping the agent as a backend API in a chat box since rendering the UI is only one piece of it.
To actually make it work, the interface layer also needed real-time streaming, state kept in sync between agent and UI, conversations that persist across sessions, and reconnection when a user refreshes mid-run.
@CopilotKit is now the only open-source framework that actually lets you build your own full-stack Claude-like apps.
It decouples the agent from the interface, talking over AG-UI (an open protocol for agent-to-user communication).
Being a standard protocol, the frontend never needs to know whether it is talking to a LangGraph or a CrewAI agent. You can change the backend anytime and the UI will never notice.
In practice, CopilotKit's interface layer gives several pre-implemented React building blocks that wire the agent directly into the app, like:
- generative UI, so the agent renders real components instead of text
- chat windows, sidebars, and popups, or a fully headless setup
- shared state, so the agent and app stay in sync
- human-in-the-loop approvals, where the agent waits before acting
- persistent threads that store the whole session, including the agent-user interactions and generated UI, not just text
And because that full history is captured, those interactions can feed a self-learning layer that also improves the agent from real usage over time.
The interface layer that Anthropic spent years engineering in-house is now literally available to any developer/team.
CopilotKit is open-source with 30k+ GitHub stars, and AG-UI, the protocol underneath, is already supported across every major agent framework: LangGraph, CrewAI, Mastra, Google ADK, and more.
CopilotKit GitHub repo → https://t.co/wkQ1taF0rM
(don't forget to star it ⭐ )
If you want to go deeper, I found a detailed breakdown by Shubham Saboo recently on the three Generative UI patterns, with implementation.
Read it below.