1、Most RL stacks are built for one modality. UniRL applies a single post-training loop — generate → score → advantage → update → sync — across model families. Model and algorithm are two independent axes, so your coverage is the model × algorithm product, not a fixed recipe menu.
2、One loop, every modality: text→image, text/image→video, vision-language, text-only LLM and VLM, the LLM→diffusion prompt-enhancer, and unified autoregressive+diffusion generation (Hunyuan-Image 3 and Bagel) — a model class no single-purpose RL repo can even express.
3、Built to scale: pluggable rollout engines (train-side / SGLang / vLLM-Omni) behind one typed contract, FSDP2 sharding, and three deployment modes from a single config knob.
4、Two team-original algorithms headline the release:
FlowDPPO: Policy optimization for flow/diffusion models with trust-region masks based on exact divergence (See our paper: Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models https://t.co/7TPFT72RDa)
DRPO: LLM RL with a smooth, advantage-weighted quadratic regularizer
(See our paper: Rethinking the Divergence Regularization in LLM RL [https://t.co/eXTnK1sCts])
🚨 Peter Steinberger, founder of Openclaw, says you shouldn't be prompting coding agents anymore.
Boris Cherny says he doesn't prompt Claude anymore. Instead, they both write loops.
Since then, the AI community has been asking the same question:
What exactly is a loop?
To help answer that, Matt Van Horn (co-founder of Zimride, which later became Lyft) shared a framework outlining the evolution of loops in Agentic AI—and how we've arrived at this new abstraction layer.
🔹 2022: ReAct Loop
🔹 2023: Self-Prompting Agents
🔹 2025: The Ralph Loop
🔹 2026: Productized Ralph
🔹 and the future: Multi-Agent Orchestration.
In multi-agent orchestration, loops become the primary unit of work—spawning, coordinating, and supervising other loops and agents.
The common theme across Peter's and Boris's comments is that the focus is shifting away from the prompt itself.
Whether you agree with this framework or not, it's a useful way to understand why some of the most experienced AI builders are talking less about prompt engineering and more about designing systems around loops.
Do you think loops are the next major abstraction for Agentic AI, or is this simply iterative prompting with a new name?
#agenticai #ailoops #claude #openclaw #aiengineering
Kernel Mean Embeddings are a powerful framework that represents probability distributions as elements of a reproducing kernel Hilbert space (RKHS). Instead of working directly with probability densities, a distribution P is mapped to a feature representation
μₚ = E[k(X, ·)]
where k is a kernel function. This allows complex distributions to be analyzed using geometric and functional-analytic tools.
In probability and statistics, kernel mean embeddings provide nonparametric methods for comparing distributions, hypothesis testing, density estimation, and causal inference. They form the basis of powerful techniques such as Maximum Mean Discrepancy (MMD), which is widely used for two-sample testing.
In machine learning, kernel mean embeddings enable learning directly on distributions rather than individual data points. They are used in domain adaptation, generative modeling, distribution regression, and uncertainty quantification. In deep learning, MMD and related kernel methods appear in generative adversarial learning, representation learning, and self-supervised learning. In reinforcement learning, kernel embeddings help model transition dynamics, value functions, and belief states in partially observed environments.
The deeper insight is that many learning problems involve distributions rather than individual observations. Kernel mean embeddings provide a mathematically elegant way to transform probability distributions into geometric objects that can be manipulated, compared, and learned efficiently.
Image: https://t.co/8OivktctlP
This paper proposes a new test to see whether AI agents truly get better as they gain experience and finds they mostly still confuse memory with learning.
Shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context getting the best overall score.
That distinction matters because the next wave of AI is not supposed to answer isolated prompts.
It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday’s mistake should make tomorrow’s action sharper.
The authors build CL-BENCH, a benchmark where an agent works through connected tasks in 6 domains, including coding, databases, forecasting, radio signals, poker, and disease studies.
Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent’s strategy, so better performance should come from experience rather than pretraining.
They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups.
The key finding is that current memory-heavy AI agents are not reliably better learners than just keeping the full conversation in context.
That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes.
----
Link – arxiv. org/abs/2606.05661
Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"
"OPRD: On-Policy Representation Distillation"
On-policy distillation usually matches teacher and student only at the token probability level, throwing away the teacher’s hidden states.
This paper moves the loss before the LM head, aligning student and teacher representations on the student’s own rollouts.
This gives less noisy training, richer supervision, near teacher-level math performance. With 1.44x faster training, and up to 54% lower update memory.
AutoScientists – a research lab made of agents
@Harvard researchers connected agents into a self-organizing scientific team without a boss agent standing in the middle
All agents look at the same shared workspace: they share memory, explore multiple directions in parallel, critique each other, avoid repeated failures, and reorganize as evidence changes.
But the teams are not fixed. Agents can gather around a promising direction, like architecture, optimizer changes, or data augmentation, then abandon it if it stops working.
Before they spend compute, they discuss proposals and critique each other.
AutoScientists also shows strong results:
- 74.4% mean leaderboard percentile on BioML-Bench
- 1.9× faster GPT training optimization
- +12.5% on ACE2–Spike, with the same method transferring to 217 ProteinGym assays for a +6.5% average gain
// Continual Learning Bench //
One of the research areas with lots of investments is continual learning.
While there are many efforts, there is very little progress in measuring it.
So the big question is, do dedicated memory systems actually make agents learn from experience?
Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management.
CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances.
If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning.
Paper: https://t.co/iFd5SZFe3O
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
Must-read research of the week
▪️ ScientistOne
▪️ SkillOpt
▪️ MUSE-Autoskill
▪️ Do Language Models Need Sleep?
▪️ OmniRetrieval
▪️ Vector Policy Optimization
▪️ Gamma-World
▪️ OpenComputer
▪️ Personalize-then-Store
▪️ WorldKV
▪️ AutoResearchClaw
▪️ Qwen-VLA
▪️ CUA-Gym
Find the full list and the most important AI news of the week here: https://t.co/UC1IeiRL8y
Why do DeepSeek and Kimi use Muon instead of Adam?
🚀 Reasons from a curvature perspective:
1⃣ Under a second-order approx., Muon incurs a much smaller curvature penalty than Adam while maintaining the same first-order decrease.
2⃣ This advantage does not come from a smaller update norm. Instead, it comes from Muon having lower Normalized Directional Sharpness (NDS).
3⃣ Muon’s NDS advantage becomes larger when the training data is more imbalanced.
Paper Link: https://t.co/nGy2FeF538
A thread 🧵
Anthropic’s new chemistry report has a genuinely wild result.
Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum.”
NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra.
So Opus 4.7 is no longer just “helping chemists read data” — it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists.
Note, that Opus 4.7, a general-purpose model with no chemistry-specific fine-tuning.
Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools.
So AI now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.
Everything You Need To Know About
Inference Engines and Running LLMs Locally at Home
Explains why Inference Engines exist in the first place
- Prefill is not Decode
- VRAM is not bandwidth
- Fit is not speed
- KV Cache is the real memory problem
- Quantization only matters if the engine has good kernels for it
- Batching is not scheduling
- MoE and the routing problem
- How long context changes the serving problem
- Multi-GPU changes the interconnect problem
- Production: latency, p99s, backpressure, routing, metrics, and failure behavior
Then maps the Engines including:
- llama.cpp → portability king
- MLX / MLX-LM → Apple Silicon weapon
- ExLlamaV3 → multi-GPU consumer CUDA / local MoE
- vLLM → default open-source production server
- SGLang → long-context, MoE, routing, ugly workloads
- TensorRT-LLM → max NVIDIA performance
- NVIDIA Dynamo → fleet orchestration
The point of this article is not “use vLLM” or “use TensorRT-LLM” or “use llama.cpp”
But rather fully grasp how the Inference Engines are the traffic cop, memory manager, kernel dispatcher, scheduler, cache accountant, parallelism planner, API surface, and sometimes the deployment framework
Do not pick the engine first
- Pick the hardware
- Pick the workload
- Pick the serving model
Then the engine becomes obvious
Opensource / Local AI FTW
ThoughtFold has a clean RLVR angle: correct long CoTs contain both useful reasoning and redundant exploration, but outcome rewards reinforce all of it. Instead of just rewarding shorter answers, it prunes correct chains, verifies what can be removed, then uses masked preference learning to penalize redundant steps and keep the reasoning path tighter.
Paper: https://t.co/vFIVDPR9bd
This is huge.
A group of 50 AI researchers (ByteDance, Alibaba, Tencent + universities) just dropped a 303 page field guide on code models + coding agents.
And the takeaways are not what most people assume.
Here are the highlights I’m thinking about (as someone who lives in Python + agents):