🧩 How can we assign fine-grained credit over long tool-use trajectories and let agents learn from past attempts in agentic reinforcement learning when rewards are no longer verifiable?
Excited to share RubricEM, an RL framework for long-form deep research agents that plan, search, use tools, and write reports without exact answer checks.
📖 Paper: https://t.co/t1tksq5g30
(1/n)
Excited to share more of our technical notes in the spirit of open research. Blog 2: How Muon Lost its Geometry
As Muon spread from speedruns to LLM scale, it lost the µP scaling rule that keeps the optimal LR stable as width changes. The bug ships in PyTorch and Optax 😱. (1/n)
Excited to share that our work 📝 "PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf" has been accepted to #ACL2026 Demo!
Most AI writing tools either fix grammar or simulate peer review with a score. Neither gives drafting-stage, text-anchored feedback on narrative, structure and presentation.
PaperMentor comments rather than rewrites: It is a human-centered, multi-agent writing tutor that delivers expert-level, actionable feedback as native inline comments right inside Overleaf, while leaving every revision to you.
It pairs a curated library of 40+ expert skill files (distilled from senior researchers' writing advice) with 12 specialized agents covering methods, results, formatting, terminology, venue norms and more.
In a user study, 90.6% of comments were rated actionable and PaperMentor significantly outperformed a GPT-5.2 baseline without the skill library on both validity and actionability.
Anyone can extend or contribute to the skill library with simple text edits!
📝 Arxiv link: https://t.co/0STd6kasBw
🔗 Live demo: https://t.co/4MNrmxyE0U
💻 Code with skill library: https://t.co/iKKQ1orecq
🧵 How it works below 👇
We spend a lot of time anthropomorphizing AI.
I think the more interesting question is the reverse: what do humans look like when viewed through the lens of AI?
I tried to unpack this idea in my new blog post: Un-Anthropomorphizing Humans
What do humans look like if we describe them as compute, model, learning loop, and an energy problem?
Here's the link - https://t.co/RUBJiAvvjP
A real scientist doesn't look up how the world works — they intervene, observe, and revise until a theory holds for a case they've never seen.
CausaLab drops an LLM agent into a lab where memorized facts are useless ("Quantum Crystals on Planet X") and asks for the same.
https://t.co/uAIafaVXbW
Call For Papers for Lifelong Agent @ COLM 2026!
🚀After our first Lifelong Agent Workshop at ICLR 2026, we’re excited to announce that the Lifelong Agent Workshop is back for its 2nd edition!
🎉This time, we’ll be at COLM 2026 this October in San Francisco!
We will continue the conversation around lifelong agents: long-term learning, continual alignment, self-evolution, and stable growth, with the goal of exploring more sustainable paths for future agent systems.
We’re also thrilled to have an amazing lineup of speakers, organizers, and advisory board members from Stanford, UC Berkeley, CMU, Microsoft Research, Google DeepMind, McGill, Oxford, UIUC, and many other institutions.
We warmly welcome submissions to the workshop. Submissions are non-archival!
Topics include, but are not limited to:
agent post-training, agent RL, user-agent alignment, self-evolving agents, embodied agents, Agent4Science, applications, benchmarks, and more.
📌 Submission deadline: July 3, 2026 AoE
🌐 Workshop homepage: https://t.co/CSSUQ5Flx8
📖 OpenReview: https://t.co/OZxc9lO9nF
📮 Contact: [email protected]
Looking forward to your submissions and to seeing everyone at COLM 2026 in San Francisco! 🚀
#COLM2026 #AIAgents #AgenticAI #LLM #LifelongLearning #AIAlignment
✨ Creativity is not just recognizing what an object is — it is imagining what it could become.
🔧 A key edge can cut tape.
🛡️ A rubber pad can protect a wall.
🪮 A comb guard can clear a sink slot.
But can multimodal AI agents discover these hidden physical affordances from images?
🚀We introduce MM-CreativityBench, a benchmark designed to test whether LMMs can creatively repurpose everyday objects by interactively inspecting scenes, entities, and object parts.
🔍 Our findings show that today’s LMMs often identify the right object, but fail to ground their reasoning in the right part. They hallucinate properties, overlook physical constraints, or propose solutions that are not mechanically valid.
🧠 To move beyond plausible guesses, we propose affordance-grounded alignment: training models to explore visual evidence, reject hallucinated affordances, and reason from geometry, material, and mechanics.
📄 Paper: https://t.co/DW6J06yPHK
🌐 Project: https://t.co/KMDTLKaa0r
💻 Code: https://t.co/4L3LYObPZX
🤗 Hugging Face: https://t.co/XPmrfP0Gie
Updated our paper on the foundations of memory in sequence models (with fresh insights, clearer writing and ablations).
Our paper contrasts two distinct ways in which language models memorize and formulates the questions that arise from this.
Will be presented at #ICML.
What is the key bottleneck to scaling looped transformers (LT)? A major challenge is speed: each loop iteration is coupled w/ full quadratic attention. More loops make the model more powerful, but much slower.
Introducing LT2: linear-time looped transformers that loop over linear attention and sparse attention. Linear and sparse attention make the loop fast. The loop, in turn, gives linear attention iterative control over its recurrent memory and recursively enlarges the receptive field for sparse attention. Fast attention accelerates the loop and the loop enriches attention, which puts LT2 on the Pareto frontier relative to standard looped transformers.
This is a large paper. We did careful ablations in pretraining to find the best architecture, and we used this architecture to distill a hybrid looped transformer, Ouro-hybrid-1.4B, to deliver both industry-level performance and fast inference speed. To read more:
Paper: https://t.co/q1YeQnXjoS
Code: https://t.co/Dk7C4hBrXk
Project: https://t.co/jrM1IquIA4
Model: https://t.co/LQmcCsLP1T
Thanks for sharing our survey! We are also maintaining an Awesome Code as Agent Harness Papers repo for recent work on code-centric agentic systems and harness engineering: https://t.co/hBX3wv1Pzo
@maximelabonne Feel free to take a look at our paper, where we discuss why losses like DFT can work and when they don't: https://t.co/QiALghsXOf
Excited to share our new paper 🧵 MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
Supervised fine-tuning is the common way to teach LLMs new knowledge, but it often causes catastrophic forgetting of existing capabilities. We introduce MixSD: a simple, external-teacher-free method to inject knowledge with far less forgetting.
📄https://t.co/qRpaTiI9EU
Why does SFT forget? Targets written by humans or external systems diverge from the model's own autoregressive distribution, forcing the optimizer to imitate low-probability tokens. That's what drags pretrained capabilities down.
MixSD: We hypothesize that keeping supervision close to the model's own distribution is key to avoiding forgetting. Instead of training on fixed, externally authored targets, at every token we mix between two conditionals of the base model itself: an expert conditional that sees the injected fact in context, and a naive conditional reflecting the model's prior. The result is supervision the model already finds high-probability, while still carrying the new factual signal. A Bernoulli rate λ controls the balance between memorization and retention.
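A minimal sketch of the per-token mixing idea described above, not the paper's implementation. It assumes that λ is the probability of drawing the next target token from the expert conditional (the direction of λ is an assumption) and that the mixed sequence is then used as the training target. The names `mixed_target`, `toy_expert`, and `toy_naive` are hypothetical, and the two conditionals are toy stand-ins for the same base model run with and without the fact in context.

```python
import random
from typing import Callable, Dict, List

Dist = Dict[str, float]                 # next-token distribution: token -> probability
Conditional = Callable[[List[str]], Dist]


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token from a next-token distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def mixed_target(expert: Conditional, naive: Conditional, lam: float,
                 max_len: int, rng: random.Random,
                 eos: str = "<eos>") -> List[str]:
    """Build a target sequence token by token.

    At each position a Bernoulli(lam) draw picks which conditional
    supplies the next token (assumption: lam = P(use expert)).
    Both conditionals see the same shared prefix.
    """
    prefix: List[str] = []
    for _ in range(max_len):
        use_expert = rng.random() < lam
        dist = expert(prefix) if use_expert else naive(prefix)
        token = sample(dist, rng)
        if token == eos:
            break
        prefix.append(token)
    return prefix


# Toy stand-ins: in practice both would be the base model itself,
# conditioned with (expert) or without (naive) the injected fact.
def toy_expert(prefix: List[str]) -> Dist:
    return {"A": 1.0} if len(prefix) < 2 else {"<eos>": 1.0}


def toy_naive(prefix: List[str]) -> Dist:
    return {"B": 1.0} if len(prefix) < 2 else {"<eos>": 1.0}


if __name__ == "__main__":
    rng = random.Random(0)
    print(mixed_target(toy_expert, toy_naive, lam=0.5, max_len=5, rng=rng))
```

With `lam=1.0` every token comes from the expert conditional, and with `lam=0.0` every token comes from the naive one. Values in between interpolate token by token.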
Findings: SFT retains as little as 1% of held-out capability. MixSD retains far more, up to ~100% on larger models, with near-perfect training accuracy. It also beats on-policy self-distillation at a fraction of the compute, and the results hold across Qwen3 1.7B, 4B, 8B and Llama-3.2.
🚀Code as Agent Harness: A survey work from UIUC, Stanford, and Meta.
📄https://t.co/YReL1BMIoN
Code is no longer just the output of AI.
It is becoming the executable, inspectable, and stateful substrate through which AI agents reason, act, verify, remember, and self-correct over long horizons.
In our new survey, we examine this shift through the lens of Code as Agent Harness, focusing on how code serves as:
• 🧠 Harness Interface: coding for reasoning, acting, and environment modeling
• ⚙️ Harness Mechanisms: planning, memory, tool use, feedback, and optimization
• 🤝 Multi-Agent Harnesses: collaboration through shared code, tests, and execution traces
We review applications spanning:
💻 Coding Agents
🖥️ GUI/OS Agents
🤖 Embodied Agents
🔬 Scientific Discovery
🏢 Enterprise Workflows
If you find this survey helpful, feel free to explore our resource collection below.
🤗 Hugging Face Daily: https://t.co/cfuoQfzcj3
💻 GitHub: https://t.co/ZD156rWPbJ
🌍 Website: https://t.co/OfibqKL1en
Feedback, suggestions, and community contributions are warmly welcome!
#AI #Agents #LLM #Coding #AgenticAI #SoftwareEngineering
Takeaway: Agent and judge co-evolve through rubrics during RL.
RubricEM is not just using rubrics to score answers.
It treats rubrics as the interface for a coupled evolution loop:
Stronger agents generate better self-directed rubrics and expose more informative on-policy failures.
The judge distills these rollouts into sharper stagewise criteria.
Accepted reflections return to the agent as reusable memory.
The agent evolves through policy updates and rubric-bank updates.
The judge evolves through an on-policy rubric buffer.
For open-ended RL, rubrics become the scaffold for acting, the language for judging, and the memory for evolving.
📖 Paper: https://t.co/t1tksq5g30
(13/n)