NEW paper worth reading.
(bookmark it)
Autonomous research systems usually prove themselves on cherry-picked wins, human-framed topics, or a handful of preset tasks.
FARS runs the full loop at scale instead. Stage-specific agents handle ideation, planning, experimentation, and writing over a shared workspace that records proposals, code, logs, results, and manuscripts.
Its first public deployment produced 166 complete papers across 67 fine-grained AI/ML topics, and it kept the failures in the corpus rather than curating a highlight reel.
Why it matters.
282 volunteer reviews over 140 papers give an honest read. FARS can produce review-worthy artifacts, while the same reviews expose recurring failure modes in narrow scope, methodology, and integrity.
Paper: https://t.co/f6pMG0hYAA
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
this is f*cking gold
Andrej Karpathy joined Anthropic five weeks ago.
A friend on his team just showed me the exact LOOPS.md file he actually uses.
I dropped it into my setup. The very first response was different.
Not slightly different. Completely different.
Claude stopped giving generic answers and started working exactly the way I think.
You don't talk to the model anymore. You build the system that talks to the model for you.
Bookmark it before it gets lost in your feed.
Read it now, then check the article below.
Anthropic engineer:
"You can build 5 assistants in one afternoon. Each one handles a task you've been doing manually every single day."
In 45 minutes he shows exactly how to do it from scratch, step by step.
Most people are still doing all of this by hand.
Watch the session, then save the guide below.
“Loop engineering” is a hot buzzphrase after mentions of it by Boris Cherny (Claude Code’s creator) and Peter Steinberger (OpenClaw's creator) went viral on social media. Loops are now a key part of how we get AI agents to iterate at length to build software. In this letter, I’d like to share my 3 key loops, shown in the image below, for building 0-to-1 products. These loops guide not just how I build software, but also how I decide what software to build.
Agentic coding loop: Given a product specification and optionally a set of evals (that is, a dataset against which to measure performance), we can have an AI agent write code, test its work, and keep iterating until the code is bug-free and meets its specification. This idea of closing the loop took off around the end of last year, and it has been a game changer in enabling coding agents to work longer productively without human intervention. For example, over the weekend, I was building an app for my daughter to practice typing, and my coding agent could easily work for around an hour, using a web browser to check what it had built multiple times before getting back to me, without needing my intervention.
The engineering loop executes quickly. Every few minutes, the coding agent might build and test a new version of the software. I hear frequently from developers who are finding new ways to engineer more effective engineering loops. This is an active area of invention!
Developer feedback loop: In this loop, a developer examines the current product and steers the coding agent to improve it. Last year, a lot of developers (including me) were acting as the QA (quality assurance) function for our coding agents, manually finding bugs and then asking the agent to fix them. But with coding agents much more able to test their own code, the amount of time we need to spend on this function has decreased significantly. This allows us to make higher-level product decisions, such as what key features to offer, where the UI needs improvement, and so on.
The developer-feedback loop operates over time intervals between tens of minutes and hours — that's how frequently a developer might review a product and give feedback. In the case of the typing app, I changed my mind a few times about the visual design, what cat costumes she can unlock as she learns (she loves cats), and the user flow for a grown-up to log in and steer the child's learning experience.
When a developer has a clear vision for what to build, it is still a lot of work to translate that vision into a specification for a coding agent to implement. Further, after the developer has seen an implementation, they might update (or perhaps clarify) the spec to steer it toward what they want. If you find that the system repeatedly runs into certain problems, building a set of evals for the agent becomes useful.
AI-native teams are increasingly using AI to help shape product direction, for example, automating the gathering and analysis of usage data, summarizing written and verbal customer feedback, or carrying out competitive analysis. However, for pretty much all the products I’m involved in, I see humans as having a significant context advantage over current AI systems — we know a lot more than the AI system about the users and the context the product has to operate in — and thus humans play a critical role. Many people describe this human contribution as “taste,” but I prefer to think of it as humans having a context advantage, since that gives us a clearer path to helping AI systems get better. This also speaks to why this step can’t be automated: So long as the human knows something the AI does not, human-in-the-loop is needed to to inject that knowledge into the system.
External feedback loop: This includes a wide range of tactics like asking a few friends for feedback, launching to alpha testers, or putting the code into production with A/B testing. These tactics are usually slow, rarely taking less than hours and sometimes taking days or even weeks. This data informs the developer vision, which in turn continues to drive the detailed product spec, which in turn drives the coding agent.
With coding agents speeding up software development, more engineers are starting to play a partial product management role. For many engineers who are growing into this role, the hardest part is shaping the product vision and striking a balance between building (bridging the gap between vision and spec) and getting user feedback to evolve the vision. It is important to do both!
I will write more about how to do this in future posts, but for now, I find it encouraging that engineers are playing an expanded role (just as product managers and designers now do more engineering).
[Original text: The Batch]
New research from Google.
LLMs hallucinate with high confidence, miss their own knowledge boundaries, and misreport uncertainty. Most fixes bolt calibration on from the outside.
RLMF turns the model own metacognition into the training signal. It refines completion rankings during preference optimization based on how good the model self-judgments of its performance are, and uses those same self-judgments to select high-value training data.
The approach is two-stage. First calibrate the faithfulness of self-reported confidence, then map it to natural linguistic uncertainty through targeted output editing.
RLMF reaches state-of-the-art faithful calibration across diverse tasks while preserving accuracy, and surpasses standard RL by up to 63%.
Paper: https://t.co/tBzuIYXAmf
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
Qwen publishes new work on RL coding agents.
(bookmark it)
The idea is to continually build a verification system that co-evolves with AI agents.
LLMs suffer from all sorts of reward hacking issues. This work studies coding-agent reward signals, test pass rates, LLM judges, and execution traces, and shows each one has a horizon beyond which it stops tracking real correctness and starts getting hacked.
They report that reward design for long-horizon coding is really a horizon problem. The metric you pick matters less than how long it keeps tracking correctness, and the paper finds where each signal crosses that line.
Paper: https://t.co/51YYEM3kXm
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
Senator Warner just dropped the AI AGENT Act. Thirty pages of federal rules for a market that's barely six months old.
The acronym alone tells you everything. Artificial Intelligence Access, Gatekeeper Exchange and Nondiscriminatory Transfer. A law so bureaucratically named it reads like a committee generated it before the committee met.
The pitch is familiar. Protect consumers from bad actors. Ensure fair competition. Prevent Big Tech from locking out upstarts. These are all real concerns in theory. In practice, this bill does the opposite of what it claims.
Compliance is a fixed cost. Paperwork, legal review, certification processes. That cost is the same for a startup with two engineers and a $500 billion platform. One of those entities absorbs it as a rounding error. The other just got priced out of the market.
The companies Warner says he's worried about? They already have legal teams. They already have government affairs offices. They will help write the regulations, then comply with them effortlessly, then watch their smaller competitors disappear.
The second problem is timing. AI agents are evolving faster than any technology since the browser. The definitions in this bill will take years to settle through administrative rulemaking and court challenges. By the time anyone knows what the rules actually mean, agents won't look anything like they do today.
Congress regulates whatever existed when they started drafting. Not what's arriving next month.
Then there's the problem this bill tries to solve. Has anyone actually been harmed by an unregulated AI agent at scale? The market is pre-commercial. Most "agents" are demos and research projects. We're writing thirty pages of rules for a test flight.
This is how regulation calcifies emerging industries. Not through malice. Through the natural desire of well-intentioned people to get ahead of a problem. The road to a stagnant tech sector is paved with preemptive frameworks.
The irony is that Warner probably thinks this helps. A "clear federal framework promotes innovation." That's what the press release says. But the relationship between regulation and innovation isn't linear. At low levels, rules provide stability. At high levels, they provide a moat for whoever can afford the compliance headcount.
This bill crosses that line on page one.
Nothing new here. Congress sees a thing happening, writes rules for the thing as it existed during the comment period, and the rules only bind the people who try to follow them. The bad actors ignore the law and move faster. That's the pattern. It's never not the pattern.
NEW: 2026 Microsoft AI in Education Report
The numbers that matter:
- 9 in 10 students and educators have used AI for school purposes
- 66% of education leaders are already scaling AI
- 77% of students have never had formal AI training
- AI literacy job listings increased 6× on LinkedIn in the past year
- 70% of job skills will change by 2030
Download the full report—free: https://t.co/MmIKRxPfDa
#AIinEducation #MicrosoftEDU #EdLeadership
A senior Google engineer just dropped a 19-page PDF on "Loop Engineering" for LLM and agentic systems.
Act -> Observe -> Learn -> Repeat
Act: the LLM proposes a code transformation (tile this loop, parallelize that one).
Observe: a compiler runs it and reports back - is it valid? faster? slower? by how much?
Learn: the LLM reads that feedback and adjusts its next move.
Repeat until it stops finding improvements.
The agent gets smarter purely from grounded feedback inside its own context window.
This 19-page PDF totally changed the way I’m building agentic systems today.
Read it now, then explore the article below.
Karpathy just wrote the manual for Claude + Obsidian as a real second brain.
Most vaults die the same way. A year of saved articles and highlights. None of it linked. The graph rots while it still looks impressive.
So he moved the upkeep to the model. You curate sources and ask questions. Claude files, links, and reconciles. You keep judgment. It keeps the books.
raw belongs to you and never gets edited. wiki belongs to Claude. It isn't RAG. Your sources compile once into linked pages and compound from there.
9 rules. Start with 10 sources, not 10,000.
Most people hoard notes. This turns them into a brain that maintains itself.
How I use LLMs as a staff engineer in 2026
https://t.co/KDzmfiUF87
The biggest AI workflow change in 2026: treating agents as capable collaborators for coding, debugging, testing, and codebase research—while still keeping humans responsible for judgment, communication, and review.
A super long overdue (3+ years?) post on scaling laws.
Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run.
The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky.
https://t.co/HP26eJvjHB
this is f*cking gold
A senior Google engineer dropped a 424-page doc on agentic design patterns.
424 pages.
Most engineers bookmarked it and never opened it again.
if I had this a year ago, I would've shipped my first app in a day instead of 2 weeks
in the right hands, this changes everything:
Great Stanford + MIT + Harvard + Anthropic paper.
Gives a clear training-based reason for why larger models learn abilities smaller models miss.
Says bigger AI models learn rare skills because they forget them less during training, their extra space protects weak learning signals.
The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts.
Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.
In a crowded data mixture, common patterns get first claim on the model’s internal machinery.
Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.
They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters.
The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less.
Larger models can remember weak rare signals long enough to turn them into real learned skills.
----
Link – arxiv. org/abs/2605.29548
Title: "Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention"
🚨 A SENIOR ANTHROPIC ENGINEER JUST DROPPED AN 11-PAGE PDF ON LOOP ENGINEERING.
The core shift: stop prompting the agent. Build the system that prompts it.
Inside the autonomous loop:
- Discover → Finds its own work (failing CI, open issues).
- Isolate → Uses separate git worktrees to prevent collisions.
- Verify → A second agent reviews the work. (Never let agents self-grade).
- Persist → Writes to disk, not temporary context windows.
- Schedule → Runs automatically on a timer.
This is a great framework for building more reliable agentic systems
link to the guide below.
Read it, then check out this ace article on Loop Engineering by @akshay_pachaar 👇
Claim: Autoresearch that moves the frontier will be about better data: we call that *Autodata*.
🧵1/6 -- Paper is out! https://t.co/b8gOALndzy
Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*.
We show our method gives gains on computer science, legal and math problems over classical synthetic dataset creation methods.
We also show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data.
Overall, we believe this direction has the potential to change how we build AI data!
this is f*cking gold
Anthropic published PDF on how their own teams actually use Claude Code.
Explaining real workflows from engineering, security, growth, and design.
Spec → Dispatch → Verify → Systemize
• Spec: they hand Claude a clear goal, then let it run, instead of typing every line.
• Dispatch: their growth team split work across sub-agents, hundreds of outputs in minutes.
• Verify: Claude runs its own builds, tests, and lints, so trust comes from proof.
• Systemize: they commit checkpoints and turn repeated workflows into commands. Security wrote 50% of them.
The key insight: their engineers don't write more code.
They set up the system, then review the 80% Claude ships on its own.
Read it now, then save the full breakdown below 👇