The inflation redistribution machine is running at full steam again. Energy prices are the largest contributor to inflation. Inflation pushes down real wages, making the vast majority poorer. The flip side of high energy prices is windfall profits benefitting the richest of the rich.
Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization
Introduces a framework that improves target prefix survival during beam search.
📝 https://t.co/BLMkzyConE
👨🏽‍💻 https://t.co/o5RfkCG9Ze
The SFT+RL stack everyone treats as a new paradigm is really a return to the BERT-era pre-train then fine-tune move, explicitly tailoring the model to the behaviors and benchmarks it gets graded on.
The uncomfortable implication is that a lot of what we call alignment or reasoning RL is, structurally, benchmark-shaped supervised fitting wearing fancier clothes.
If post-training is mostly tailoring to the eval, then the eval is doing more of the steering than the algorithm is, which is a quietly damning thing to say about the field's progress narrative.
Post-training is (Massive) Supervised Learning
Paper: https://t.co/pnT2Kcf3Qb
We made a collection of @GoogleDeepMind scientific agent skills for research tasks: genomics, structural biology, cheminformatics, literature search, and more.
👉https://t.co/zkPuCtmwEE
Test-Time Training for Zero-Resource Dense Retrieval Reranking
Presents a training-free reranker that adapts a bilinear scoring matrix at inference time using pseudo-labels from dense retrieval ranking itself, with under 10ms added latency per query.
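A minimal sketch of the idea, not the paper's implementation: the score is q^T W d, and W is nudged at inference time using pseudo-labels taken from the initial dense ranking. The identity initialization, the top-k pseudo-positive rule, the logistic loss, and every name below are assumptions made for illustration.

```python
import math

def bilinear_score(q, W, d):
    """Score q^T W d for query vector q, matrix W, document vector d."""
    return sum(q[i] * W[i][j] * d[j] for i in range(len(q)) for j in range(len(d)))

def adapt_and_rerank(q, docs, top_k=2, steps=20, lr=0.1):
    """Adapt W for a single query from pseudo-labels, then rerank docs.

    Assumptions (not from the paper): W starts as the identity, the top_k
    documents of the initial ranking are pseudo-positives, the rest are
    pseudo-negatives, and the loss is binary logistic.
    """
    dim = len(q)
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    # With W = I the bilinear score equals the plain dot product.
    initial = sorted(range(len(docs)), key=lambda k: -bilinear_score(q, W, docs[k]))
    labels = {k: 1.0 if rank < top_k else 0.0 for rank, k in enumerate(initial)}
    for _ in range(steps):
        for k, d in enumerate(docs):
            p = 1.0 / (1.0 + math.exp(-bilinear_score(q, W, d)))
            g = p - labels[k]  # gradient of the logistic loss w.r.t. the score
            for i in range(dim):
                for j in range(dim):
                    W[i][j] -= lr * g * q[i] * d[j]
    reranked = sorted(range(len(docs)), key=lambda k: -bilinear_score(q, W, docs[k]))
    return initial, reranked, W
```

Because W is rebuilt per query and only a handful of small gradient steps run, the extra work stays cheap, which is the property the post highlights.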
📝 https://t.co/YdpaCDP1Ql
Stephen, @JigarShahDC, and @JaneAFlegal get into something that Whitehouse/McKibben are eliding in their jeremiads at the "climate hushers."
The debate is not about whether to talk about climate change. The debate is about whether to center the climate emergency in politics.
When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression
Systematically studies combining dimensionality reduction and quantization for text embeddings.
📝 https://t.co/NhyGj2PxEy
// The Efficiency Frontier //
Cool paper on context management.
As agents reuse the same documents and histories across many turns, the cheapest context strategy is not fixed. This work describes a principled rule for picking one per deployment instead of defaulting to whatever topped a benchmark in isolation.
Retrieval and compression methods are almost always benchmarked on accuracy and cost separately, so you never learn when one actually beats another under real load.
The Efficiency Frontier models context strategy selection as a single cost-performance problem, with a log-utility term for diminishing returns from extra context and a reuse parameter N that amortizes preprocessing across repeated queries.
Sweep N and the optimal strategy changes, exposing crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage about 25 percent at the same performance, and amortized memory compression runs over 50 percent cheaper than full-context prompting in higher-performance settings.
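To make the amortization argument concrete, here is a rough sketch of picking a strategy by a single cost-performance score. The objective (log utility minus a token price times effective cost), the `price` parameter, and all token counts are made-up assumptions, not the paper's formulation or numbers.

```python
import math
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    prep_tokens: float     # one-time preprocessing cost (hypothetical)
    query_tokens: float    # tokens spent on every query (hypothetical)
    useful_context: float  # context the model gets to use (hypothetical)

def effective_cost(s, n_reuse):
    """Per-query cost once preprocessing is amortized over n_reuse queries."""
    return s.prep_tokens / n_reuse + s.query_tokens

def net_value(s, n_reuse, price=2e-4):
    """Log utility (diminishing returns) minus priced effective cost."""
    return math.log1p(s.useful_context) - price * effective_cost(s, n_reuse)

def best_strategy(strategies, n_reuse, price=2e-4):
    """Name of the strategy with the highest net value at this reuse level."""
    return max(strategies, key=lambda s: net_value(s, n_reuse, price)).name

STRATEGIES = [
    Strategy("full_context", 0, 20000, 20000),
    Strategy("retrieval", 10000, 3000, 3000),
    Strategy("compression", 40000, 1500, 6000),
]

if __name__ == "__main__":
    for n in (1, 3, 50):
        print(n, best_strategy(STRATEGIES, n))
```

With these illustrative numbers the winner moves from full context at N=1 to retrieval at N=3 to compression at N=50, which is the kind of crossover the post describes.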
Paper: https://t.co/CK19QYX79n
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
🍅🍅🍅Tomatoflation 🍅🍅🍅
Price of tomatoes is up 39.7%. That’s terrible news for food affordability *and* public health. Tomatoes are one of the most popular fresh vegetables.
🚀 Introducing SkillOpt — an optimizer for agent skills.
Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter.
Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization.
SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability.
We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings.
Train the skill, not the model. 🛠️🤖
🌐 https://t.co/zinqcX2wfQ
📄 https://t.co/pCI4VWdpih
Language Models Need Sleep
"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."
"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."
A paper more people in AI eval / research workflows should read:
The Asymmetric Burden of Proof
LLMs can apply asymmetric evidential standards to matched positive vs null scientific claims.
Quiet failure mode. Loud implications.
https://t.co/XyNuaXEZrr
One of the admirable things about the way DeepSeek executed was gradually building credibility, never getting ahead of their skis and having to rush to meet some high external expectation.
Western neolabs have not learnt the lesson, and instead set expectations so high to begin with that they deprive themselves of room to iterate in public - frightened that their first release won’t clear the bar.
Optimising for growth is a better strategy than building for a “big bang” release. Progress is proportional to your tolerance for embarrassment. And all beautiful things are multiplicative: work compounds, and expectations follow you instead of getting ahead of you.
A 0.6B model learned to manage giants.
That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.
The paper is not asking:
“How do we build one model that knows everything?”
It is asking something more interesting:
“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”
TRINITY is a lightweight coordinator for LLMs.
It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.
Instead, it orchestrates a pool of strong models at test time, including closed and open models.
At each turn, TRINITY chooses a model and gives it one of three roles:
Thinker — plan and decompose
Worker — solve and execute
Verifier — critique and accept/revise
That may sound simple.
It is not.
Too many multi-agent systems are still prompts plus hope.
TRINITY learns the coordination policy.
A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.
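A toy sketch of that selection step, under loud assumptions: the hidden state is a plain list of floats, the head is a single linear layer, the model names are placeholders, and the optimizer is a simple keep-if-not-worse evolution step standing in for sep-CMA-ES (which is considerably more involved).

```python
from itertools import product

MODELS = ["model_a", "model_b", "model_c"]  # hypothetical pool
ROLES = ["thinker", "worker", "verifier"]   # roles named in the post
PAIRS = list(product(MODELS, ROLES))        # 9 candidate (model, role) actions

def choose_pair(hidden, weights):
    """Return the (model, role) pair with the highest linear logit.

    hidden  : list of floats, stand-in for the coordinator's hidden state
    weights : one weight row per pair, each the same length as hidden
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    best = max(range(len(PAIRS)), key=logits.__getitem__)
    return PAIRS[best]

def es_step(weights, reward_fn, rng, sigma=0.1):
    """One simple evolution step: keep a Gaussian perturbation of the head
    only if its reward is not lower. A stand-in for sep-CMA-ES, not the same."""
    candidate = [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]
    return candidate if reward_fn(candidate) >= reward_fn(weights) else weights
```

A gradient-free update like this only needs a scalar reward per candidate head, which fits the expensive, reward-sparse setting the post mentions.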
The result is not just better routing.
It is learned division of labor.
The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.
The most important idea here is bigger than the benchmark.
The future of AI may not be a single supermodel.
It may be an organization of models.
A small conductor.
A team of specialists.
A protocol for planning, execution, and verification.
An intelligence layer that learns how to allocate cognition.
This feels like a real shift:
from bigger models
to better systems
from raw capability
to coordinated capability
from “which model is best?”
to “what structure makes many models better together?”
Full credit to the authors:
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.
Paper: TRINITY: An Evolved LLM Coordinator
https://t.co/H7YE67U67f
I’m attaching the first page because the abstract is worth reading closely.
The future of AI may not be monolithic.
It may be coordinated.
#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms
Small recursive models like TRM solve hard reasoning puzzles using only about 7 million parameters by repeatedly refining a hidden internal state and the predicted answer — a different style of computation than the token-by-token generation used by large language models.
But this refinement is deterministic, so when the model settles on a wrong answer there is no mechanism to escape it, while LLMs can be sampled many times and combined through voting.
The authors visualize what TRM does across its refinement steps and find that many failures correspond to trajectories trapped in regions of the hidden state space that decode to incorrect answers, when a small change in path would have led to the right solution.
They also notice that TRM already contains a "Q head" — a small auxiliary output trained alongside the main network to estimate whether the current answer is correct, originally used only at training time to halt computation on already-solved examples — and that this head separates correct from incorrect trajectories reliably enough to serve as a verifier at inference time.
Their method, PTRM, adds Gaussian noise (random numbers drawn from a bell curve centered at zero) to the hidden state at each refinement step, runs K such perturbed copies in parallel, and picks the answer with the highest Q score.
Without any retraining or task-specific modifications, this raises accuracy on Sudoku-Extreme from 87.4% to 98.75% and on a set of pencil puzzles from 62.6% to 91.2%, outperforming an ensemble of seven frontier LLMs at less than one ten-thousandth of the cost.
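The perturb-and-select loop can be sketched as follows. This is an illustration only: `refine`, `decode`, and `q_score` are hypothetical stand-ins for the model's refinement step, answer decoder, and Q head, the noise is added after each refinement step by assumption, and the K copies run sequentially here rather than in parallel.

```python
import random

def perturbed_best_of_k(init_state, refine, decode, q_score,
                        k=8, steps=4, sigma=0.1, seed=0):
    """Run k noisy refinement trajectories and return the answer whose
    final state gets the highest Q score, along with that score."""
    rng = random.Random(seed)
    best_answer, best_q = None, float("-inf")
    for _ in range(k):
        state = list(init_state)
        for _ in range(steps):
            state = refine(state)
            # Zero-mean Gaussian noise on every hidden-state coordinate.
            state = [x + rng.gauss(0.0, sigma) for x in state]
        q = q_score(state)
        if q > best_q:
            best_answer, best_q = decode(state), q
    return best_answer, best_q
```

Setting `sigma=0.0` recovers the deterministic behavior, so every copy follows the same trajectory and the selection step has nothing to choose between.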
Read with an AI tutor and quizzes: https://t.co/74S4z4Asc1
PDF: https://t.co/x5hP2Nm0gl
Also read this relevant paper: https://t.co/Meb0y7xr3D
Seeing a lot recently about whether info theory explains modern AI. This is *exactly* what our epiplexity paper is about. It shows how to resolve paradoxes around DPI, synthetic data, and emergence, by considering computation and structural info: https://t.co/IYNGcLCAqx
Every memory system for LLM agents evolves what it stores. None evolves how it retrieves.
🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat.
🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench.
🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package.
📄 Paper: https://t.co/BWCXebWhG1
💻 Code: https://t.co/hhdgvVjblP
Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie
NEW paper worth reading.
A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality.
The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure.
Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation.
This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction.
Paper: https://t.co/4k4urYOAeQ
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
The release candidate for MCP 2026-07-28 is out. The protocol is now stateless: no handshake, no session id, any request can hit any server instance. Plus extensions as first-class (MCP Apps, Tasks), auth hardening, and a proper deprecation policy so we don't have to do this again.
https://t.co/XRLTu1BSkB
Specialized time series models are good at picking up seasonal patterns and trends from raw numbers but have no way to react to events described in news or reports, while large language models can read that text and reason about it but are weak at the actual numerical extrapolation.
Most attempts to combine the two either glue a language model onto numerical data through heavy retraining or use one giant prompt that asks a single model to do everything at once, and the authors from Google and Penn State argue this is the wrong decomposition.
Their contribution is Nexus, a system that splits forecasting into separate cooperating agents (each agent is just a language model given a specific role and prompt): one cleans the messy mix of numbers and text into a structured timeline, one produces a broad long-range outlook, one walks step by step through near-term changes, and a final agent merges these into a single forecast, with an added loop that tests the system's past errors on held-out historical splits and writes down correction rules only if they actually improve accuracy.
Tested on stock prices and Zillow housing data drawn entirely from after the models' knowledge cutoff (so the models could not have memorized the answers), this arrangement matches or beats both a strong dedicated time series model and a single-prompt language model baseline, and it produces written explanations of why each forecast moves the way it does, which the authors check using one model family to judge the other.
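The "keep a correction rule only if it helps" loop might look roughly like this. In the system described, the agents are language models; here each rule is just a hypothetical (description, function) pair, and mean absolute error on a held-out split is an assumed accuracy measure.

```python
def mae(preds, actuals):
    """Mean absolute error between forecasts and observed values."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(actuals)

def accept_rules(base_preds, actuals, candidate_rules):
    """Apply candidate correction rules one at a time on a held-out split,
    keeping a rule only if it lowers the error of the current forecasts."""
    kept, preds = [], list(base_preds)
    best = mae(preds, actuals)
    for text, rule in candidate_rules:
        trial = [rule(p) for p in preds]
        err = mae(trial, actuals)
        if err < best:  # strict improvement required
            kept.append(text)
            preds, best = trial, err
    return kept, preds, best

if __name__ == "__main__":
    rules = [("add 1", lambda p: p + 1), ("double", lambda p: p * 2)]
    print(accept_rules([10, 12, 14], [11, 13, 15], rules))
```

Requiring a strict improvement on held-out data is what keeps plausible-sounding but unhelpful rules from accumulating.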
Read with an AI tutor and quizzes for better retention: https://t.co/DUyNz04A2V
PDF: https://t.co/09LwW9WzDn