This SkillOpt paper from Microsoft is a must-read!
(bookmark it)
I was a bit skeptical of the results reported in the paper when I shared it a few days ago.
However, I managed to integrate it into my agent orchestrator and ran a few experiments.
The results are mindblowing.
Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this.
One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task.
Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve.
In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt.
Stay tuned!
ByteDance has published a paper that should make every NVIDIA investor sweat.
They trained an AI that writes CUDA better than humans experts.
They call it CUDA Agent.
And it completely rewrites the economics of AI hardware.
They built a massive agentic reinforcement learning loop. The AI writes a kernel, compiles it, profiles the hardware, analyzes the bottlenecks, and rewrites the code until it's flawless.
It learned how to optimize memory access patterns and hardware tiling strategies that traditional compilers miss.
The results are staggering.
On the industry-standard KernelBench, CUDA Agent completely destroyed traditional compilers.
It delivered code that runs up to 3.2x faster than PyTorch's native execution.
On the hardest, most complex models, it beat the strongest proprietary models in the world—including Claude Opus 4.5 and Gemini 3 Pro, by 40%.
It didn't just match human experts. It started discovering optimizations that static compilers literally cannot see.
Here is why this is a massive threat to NVIDIA.
NVIDIA's dominance relies on the fact that CUDA is incredibly hard to master. Developers get locked in because optimizing code for other chips is too painful.
But if an AI agent can autonomously generate hyper-optimized hardware kernels...
You don't need a team of $500k a year CUDA engineers to build world-class infrastructure.
And if an AI can autonomously master CUDA, it can master AMD's ROCm. Or custom silicon.
The impenetrable software wall protecting NVIDIA's monopoly just got breached by a reinforcement learning loop.
If anyone can automatically squeeze maximum performance out of any chip...
Hardware becomes a commodity.
New research from Google.
Just shows the impressive results you can get from custom agent harnesses.
LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback.
The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%.
Paper: https://t.co/bh4Yoi19E2
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
Building autonomous agents for scientific discovery? 🧬🤖
@GoogleDeepMind Science Skills is now available on GitHub. We've open-sourced this specialized toolkit to accelerate your agentic workflows with scientific grounding and higher token efficiency.
Download now ↓
https://t.co/cwp1HOeKvo
Can medical AI research be automated with AI itself
This new benchmark from NVIDIA and UC Santa Cruz aims to evaluate this:
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
"we present AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks"
The benchmark covers 24 tasks across segmentation, question answering, report generation, etc. and across modalities like CT, X-ray, pathology, etc.
The paper experiments with six frontier models (Opus 4.6, GLM-5, Gemini 3.1 Pro, GPT-5.4, MiniMax-M2.5, Qwen3.5-397B) and these models remain far from reliable medical AI researchers. While agents can often set up runnable pipelines, validation is consistently the weakest stage, and engineering failures dominate over understanding errors.
Definitely curious to see how this performs with the newest generation of models/agents!
Check RACO, accepted as an 𝗢𝗿𝗮𝗹 paper to #ICML2026 (𝗧𝗼𝗽 𝟬.𝟳%)✨
we propose a new conflict-averse optimization scheme for LLM multi-objective finetuning, with counterintuitive theoretical acceleration and better empirical pareto frontier.
paper: https://t.co/pDvjn5fYR4
Google just figured out why AI lies with confidence.
Large language models still make confident mistakes on simple factual questions.
A new paper from Google Research explains why this keeps happening.
Models cannot reliably tell what they know from what they are guessing.
The internal score separating right answers from wrong ones sits around 0.70 to 0.85.
Forcing strict accuracy backfires.
Cutting errors from 25% to 5% means staying silent on over half of correct answers.
The team proposes faithful uncertainty.
The model's words should match its actual internal confidence.
Instead of refusing to answer, it hedges honestly.
"I think" becomes a real signal, not filler.
This same awareness tells agents when to reach for search tools.
The paper flags open problems worth tackling:
> Static training versus shifting knowledge
> Alignment erasing confidence signals
> Misleading calibration metrics dominating evaluation
A 4B model can now anticipate scientific breakthroughs before scientists do.
Researchers often build breakthroughs by combining ideas from older papers.
A new paper asks whether language models can do the same thing on demand.
The task is called insight anticipation.
Give a model two foundational papers, and it predicts the core insight of a future paper built on them.
To test this, the team built GiantsBench, an open benchmark of 17K paper tuples spanning 8 scientific fields.
They then trained GIANTS-4B using reinforcement learning, rewarding it for generating insights close to real follow-up papers.
The results:
> 34% higher similarity score than Gemini 3 Pro
> Preferred 68% of the time for citation potential
> Generalizes zero-shot to physics, biology, economics
Only 4B parameters, fully open-source.
The model produces ideas with clearer reasoning, not just more complex ones.
A 0.6B model learned to manage giants.
That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.
The paper is not asking:
“How do we build one model that knows everything?”
It is asking something more interesting:
“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”
TRINITY is a lightweight coordinator for LLMs.
It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.
Instead, it orchestrates a pool of strong models at test time, including closed and open models.
At each turn, TRINITY chooses a model and gives it one of three roles:
Thinker — plan and decompose
Worker — solve and execute
Verifier — critique and accept/revise
That may sound simple.
It is not.
Too many multi-agent systems are still prompts plus hope.
TRINITY learns the coordination policy.
A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.
The result is not just better routing.
It is learned division of labor.
The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.
The most important idea here is bigger than the benchmark.
The future of AI may not be a single supermodel.
It may be an organization of models.
A small conductor.
A team of specialists.
A protocol for planning, execution, and verification.
An intelligence layer that learns how to allocate cognition.
This feels like a real shift:
from bigger models
to better systems
from raw capability
to coordinated capability
from “which model is best?”
to “what structure makes many models better together?”
Full credit to the authors:
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.
Paper: TRINITY: An Evolved LLM Coordinator
https://t.co/H7YE67U67f
I’m attaching the first page because the abstract is worth reading closely.
The future of AI may not be monolithic.
It may be coordinated.
#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms
NEW paper worth reading.
A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality.
The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure.
Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation.
This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction.
Paper: https://t.co/4k4urYOAeQ
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves.
🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat.
🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench.
🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package.
📄 Paper: https://t.co/BWCXebWhG1
💻 Code: https://t.co/hhdgvVjblP
Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie
My first PhD paper is out now in @Nature! Very grateful to have worked with the FutureHouse team on this, and a big shoutout to my co-first author @agreeb66 😀
Our latest research on Co-Scientist is out today in Nature! Built with Gemini, this multi-agent system powers the new Hypothesis Generation tool within Gemini for Science, helping researchers navigate the rigorous cycle of ideation, critique, and refinement. Read more from @ymatias and explore the full announcement: https://t.co/aFqSBETpsh"
New Google paper: A forecast needs context, not just history.
Some patterns are caused by events, not time. Nexus reframes forecasting as a reasoning problem, where events and numbers have to explain each other.
Nexus argues that forecasting improves when models read the world around the numbers, not just the numbers themselves.
In the Zillow tests, one Claude-based version cut average MAPE by 86.6% versus direct chain-of-thought prompting.
That matters because most time series models are fluent in pattern, but mute about cause.
A housing inventory curve can reflect seasonality, mortgage pressure, migration, layoffs, and local supply, while a stock price can be bent by earnings, regulation, hype, and fear.
Nexus separates those jobs instead of asking one prompt to do everything.
One agent turns messy historical text into a clean event timeline, one reads the broad regime, another tracks local shocks, and a synthesizer reconciles them with calibration from past errors.
The interesting result is not merely that context helps, but that structure helps the language model use context without losing the time series.
The evidence is still narrow: Zillow counts, seven equities, post-cutoff data, and single-run evaluations, so this is not a universal law of forecasting.
But the direction is clear: future forecasters will not only extrapolate curves; they will argue about what made the curve move.
----
Paper Link – arxiv. org/abs/2605.14389
Paper Title: "Nexus : An Agentic Framework for Time Series Forecasting"
Oxford, Stanford, and Anthropic just discovered that the smarter an AI model gets at reasoning, the easier it is to jailbreak.
The same feature labs are selling as "safer" is the one breaking the safety guardrails.
The attack is called Chain-of-Thought Hijacking. You wrap a harmful request inside a long, harmless puzzle. Sudoku grids. Logic puzzles. Math problems. Then you add the actual dangerous question at the very end.
The model gets so absorbed in solving the puzzle that the refusal mechanism never activates.
The success rates are not borderline. They are catastrophic.
99% on Gemini 2.5 Pro.
100% on Grok 3 mini.
94% on GPT o4 mini.
94% on Claude 4 Sonnet.
Every frontier reasoning model on the market. Every major lab. One trick.
The researchers showed the attack scales with reasoning length. Minimal reasoning: 27% success. Natural reasoning: 51%. Extended reasoning: 80%+.
The smarter you make the model think, the more reliably it breaks.
Every lab has spent the last 18 months telling the world that "more thinking" makes AI safer. The Oxford paper just proved the opposite is true on every major model they tested.
🚨 Google acaba de liberar sus skills oficiales para agentes de IA.
Ha publicado 13 skills compatibles con Claude Code, Cursor, Copilot y otros agentes.
Permiten que los agentes puedan ejecutar tareas avanzadas y automatizar flujos de trabajo complejos.
Es gratis y open-source 👇
// Harnessing Agentic Evolution //
Pay attention to this one if you run iterative agentic search loops.
(bookmark it)
AEvo splits the self-improvement loop into two jobs:
> One proposes the next candidate.
> The other watches what worked, what failed, and edits the procedure that proposes future candidates.
Past runs (candidates, feedback, traces, failures) become memory the meta-agent reads from.
Achieves 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks. SOTA on three open-ended optimization tasks under the same iteration budget.
If you are accumulating agentic search logs you never use, this is how to feed them back into the search procedure itself.
Paper: https://t.co/eWFO4rI4iA
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
Godfather of AI: "If you sleep well tonight, you may not have understood this lecture."
This 47-minute lecture is the best thing I saw about AI in the last few months.
It will definitely help you understand how it actually works and where it's going.
Geoffrey Hinton built the neural networks behind every AI alive, then quit Google to warn the world about it.
The part nobody wanted to hear:
> AI is already developing abilities its creators didn't intend
> in most cognitive tasks it's already ahead of us
> the question is no longer if it surpasses us but when
> the only decision left is which side of that line you're on
Right now the average person opens Claude, types something, gets an answer, closes the tab.
They think they're using AI. they're using maybe 10% of it.
I went through his entire lecture, built a practical system from what he was describing.
18 steps to actually use Claude the right way, with copy-paste prompts that work today.
Full guide in the post below.