psa to saas founders:
convert everything into api / mcp services asap and charge by usage.
allow connectors to all agents and make your own agent with pre-built context
all ui/ux/dashboards will be vibed and dynamically generated
if you don’t have prop data, you’re fucked
Well well… ARC-AGI-2 (François Chollet’s “hardest” benchmark) is starting to smell like toast. 🍞🔥
@agenticasdk just set a new SOTA: 85.28% with an Agentica agent (~350 lines) that writes & runs code.
Best part: it’s not ARC-specialized—it's a general system that’s strong across other benchmarks too. Details at https://t.co/JmVuJiUp83 What benchmark should we throw at it next?
Why do most LLM agents hit a wall?
They don’t accumulate skills.
Introducing SkillRL📚 — recursive skill-augmented reinforcement learning that lets agents learn skills from failure and evolve over time.
🔥A 7B model:
• +41% over GPT-4o
• ~20% fewer training tokens
• 33% faster convergence
SkillRL bridges raw experience → policy improvement by distilling trajectories into structured, co-evolving skills during RL.
Most agents forget.
SkillRL evolves. 🔄
📄 Paper: https://t.co/6VoxpGoPR6
💻 Code: https://t.co/qVDnIaci2K
Great work @richardxp888, Jianwen Chen, Hanyang Wang, @JiaqiLiu835914, @lillianwei423, @AiYiyangZ, and nice collab. w/ @__YuWang__, @XujiangZhao, Haifeng Chen, Zeyu Zheng, @cihangxie.
The new engineering is building the agents that "take your job", but now do it at 100x the scale. Agents give developers horizontal scalability.
The simple version of this is Ghostty splits and tabs, 𝚝𝚖𝚞𝚡 sessions and the like, running CLI agents in parallel.
Skills and MCPs help you direct the behavior of these agents. Sandboxes give the ultimate leverage: ~infinite parallelism, run while you sleep, on PRs, when an incident is filed, a customer reports an issue…
Automating the full product development loop is now your job, and your edge.
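A minimal sketch of the "CLI agents in parallel" idea, assuming each agent is just a child process you can start from a script. The helper names `run_agent` and `run_in_parallel` are hypothetical, and the child process here is a Python one-liner standing in for whatever agent CLI you actually use:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_agent(task: str) -> str:
    """Run one stand-in "agent" process for a task and return its stdout."""
    # Assumption: a real setup would invoke an agent CLI here. This sketch
    # spawns a Python child process that simply echoes the task back.
    cmd = [sys.executable, "-c", "import sys; print('done: ' + sys.argv[1])", task]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_in_parallel(tasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Fan tasks out to worker threads, one child process per task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with tasks.
        outputs = list(pool.map(run_agent, tasks))
    return dict(zip(tasks, outputs))
```

Threads are enough here because each worker only waits on a subprocess. The same shape would apply if each task ran in a remote sandbox instead of a local process.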
the holy trinity of agentic UI:
- https://t.co/ymclHB0RDA from @elirousso
- https://t.co/DZLnezoft4 from @Ibelick
- https://t.co/xzdoVQzSd5 from @vercel
① Install the skill:
$ npx add-skill vercel-labs/agent-skills
② Paste this prompt:
Assess this repo against React best practices. Make a prioritized list of quick wins and top fixes.
③ Review and prompt to "make the fixes"
I'm happy to share that we (@AnthropicAI) are investing $1.5 million in support of the Python Software Foundation and open source security.
Python powers so much of the AI industry. Supporting the folks that make our work possible is an honor.
We're encapsulating all our knowledge of @reactjs & @nextjs frontend optimization into a set of reusable skills for agents. This is 10+ years of experience from the likes of @shuding, distilled for the benefit of every Ralph
There's an app I use regularly that's defective. I'm weighing in my mind whether I should send feedback and hope it gets attention from its creators, or re-build it with AI from scratch.
This is a tiny app so it's plausible for me to do it. But if you're in the business of selling software, this is how every one of your customers is thinking now, or how they'll be thinking soon.
Iteration velocity matters more than ever before. How quickly you fix, improve, and ship is your counter-signal.
so many ambitious startups making "the LLM OS" tried all these fancy UXes and failed
so many ambitious startups making "the AI browser" tried to book your flights for you and failed
meanwhile Claude Code started unpretentiously as a CLI and now can run your browser and operate your system.
classic disruption theory
Claude Code doesn't just resonate with developers anymore. Non-technical people are using it to build things. Technical people are using it for non-technical work. The line is blurring.
I'm far from the first to think about this. Multiple teams at Anthropic have been working on "agentic experiences" for months - Claude not just as a chat partner, but as something that helps you do real work. @bcherny nudged me: can we take what we've built internally and ship an early, scoped-down version in a few days? So we took a small team, set an aggressive deadline ("Monday sound good?"), and got to work.
@claudeai wrote Cowork. We humans meet in person to discuss foundational architectural and product decisions, but all of us devs manage anywhere between 3 and 8 Claude instances implementing features, fixing bugs, or researching potential solutions.
For native code, we use Git worktrees on our local machines. For smaller or web-code-only changes, we just tell Claude to go implement them. When someone reports a bug in Slack, we often just @-mention Claude and tell it to fix it. A human (and another Claude) reviews all code before it's merged, but we're now spending most of our time orchestrating a fleet of Claudes and making decisions rather than artisanally writing individual lines of code.
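For readers who haven't used worktrees this way, here is a rough sketch of how one checkout per agent task could be set up. It only builds the `git worktree add` command rather than running it. The `agent/` branch prefix, the sibling-directory layout, and the `worktree_command` helper are illustrative assumptions, not the team's actual setup:

```python
from pathlib import Path


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build a `git worktree add` command giving one task its own checkout."""
    # Turn "Fix login bug" into "fix-login-bug" for branch and folder names.
    slug = "-".join(task.lower().split())
    branch = f"agent/{slug}"  # hypothetical naming convention
    # Assumption: the worktree lives next to the main repo directory.
    path = repo.parent / f"{repo.name}-{slug}"
    # `git -C <repo>` runs the command as if started inside the repo;
    # `-b <branch>` creates a new branch checked out in the new worktree.
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
```

Passing the returned list to `subprocess.run` would create the worktree, after which each agent instance can work in its own directory without touching the others' files.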
We're releasing Cowork early. It has rough edges. But figuring out what to build is increasingly the hardest part of software engineering - and we think getting feedback early and hearing what users actually need is how we build something truly good.
Coding agents running in cloud sandboxes will be a big part of 2026.
Kick off a task, close your computer/phone, enjoy your life, and come back in a few hours to review the work.