AI Group (NLP/ML/CV etc.) @ucsantabarbara.
Profs. @xwang_lk, @WilliamWangNLP, @CodeTerminator, @YuhengBu, Xifeng Yan. Account run by Student Social Committee.
New update: we decided to fully convert from UCSB NLP&AI to UCSB AI for simplicity.
New X handle: https://t.co/bd8LKSLOLA
New Github org: https://t.co/FXd88wedXt
🤔It is time to rethink how we evaluate agent memory
🌍 As agents become longer horizon and more autonomous, memory is no longer just a module for storing past chats.
🛠️ It determines how agents track changing worlds, learn from past actions, revise outdated information, and reuse experience for future decisions.
🔍 This raises three key questions:
Are human-designed write-store-retrieve memory pipelines still the best choice?
If harnesses such as Codex, Claude Code, and OpenClaw already let agents observe, act, call tools, write files, and reorganize context, can memory also be managed by the harness itself?
Do current evaluations really cover how agent memory is used in realistic settings? Many benchmarks are still text-centric or single-modal, with limited pressure from screenshots, GUIs, tool feedback, and environment changes.
❓ Is final QA accuracy enough?
🔥 We present WorldMemArena, a multimodal benchmark for evaluating agent memory through action–world interaction.
📌 Key insights:
🧩 Memory is a lifecycle, not a static cache.
📉 Better memory storage does not necessarily lead to better final performance.
🖼️ Multimodal memory remains a major bottleneck for current systems.
🌍 Real agentic trajectories expose the fragility of memory systems.
⚙️ Harness-based memory is more flexible, but still costly and unstable.
𝐓𝐡𝐞 𝐁𝐢𝐭𝐭𝐞𝐫 𝐋𝐞𝐬𝐬𝐨𝐧 𝐨𝐟 𝐀𝐠𝐞𝐧𝐭𝐢𝐜 𝐌𝐞𝐦𝐨𝐫𝐲: memory should be a derived capability that exists because it makes an agent better at acting over time.
𝐖𝐨𝐫𝐥𝐝𝐌𝐞𝐦𝐀𝐫𝐞𝐧𝐚 is designed around this principle. Rather than evaluating memory as a storage problem, WorldMemArena evaluates memory through 𝐚𝐜𝐭𝐢𝐨𝐧–𝐰𝐨𝐫𝐥𝐝 𝐢𝐧𝐭𝐞𝐫𝐚𝐜𝐭𝐢𝐨𝐧, instrumenting the full write → maintain → retrieve → use lifecycle across 400 multimodal, multi-session tasks.
And it exposes the findings that should mark the end of the storage-centric era:
→ Storage ≠ use. Better memory storage and retrieval do not necessarily produce better task performance. Optimizing the component we designed does not optimize the capability we actually care about.
→ Harness-based memory performs best where memory is hardest. Agents that can write files, reorganize context, create artifacts, and interact with persistent environments adapt most effectively in long-horizon settings. They are costly and unstable today, which is exactly what many Bitter Lesson transitions look like before scaling and learning take over.
The deeper move is in what gets measured. Memory shouldn't get a score; it should be inferred from capability: how much does remembering improve performance over time.
WorldMemArena drags evaluation off the static object and into the action–world loop, the only place you can tell whether an agent has developed memory or is just simulating it convincingly.
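One way to read "memory should be inferred from capability" is as a per-session gain: task success with memory enabled minus task success with memory disabled. The sketch below is only an illustration of that reading. The function name and the success rates are hypothetical and are not taken from WorldMemArena.

```python
from typing import List

def memory_gain(with_memory: List[float], without_memory: List[float]) -> List[float]:
    """Per-session gain in task success attributable to memory.

    Both inputs are success rates per session, in session order.
    This is a hypothetical metric, not the benchmark's own scoring.
    """
    if len(with_memory) != len(without_memory):
        raise ValueError("both runs must cover the same sessions")
    return [w - wo for w, wo in zip(with_memory, without_memory)]

# Made-up numbers for illustration only.
with_mem = [0.5, 0.75, 1.0]
without_mem = [0.5, 0.5, 0.5]
gains = memory_gain(with_mem, without_mem)
print(gains)
```

A gain that grows across sessions would suggest the agent is actually using what it stored; a flat gain would suggest the stored content is not reaching its decisions.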
Huge congrats to @_Chuhan_Li! His paper “Learning Situated Awareness in the Real World” just took home Best Paper Runner-Up at the #CVPR2026 WMAS Workshop.
Fresh off an #ICML2026 Spotlight, massive recognition for great work in world models and active sensing.
Check out the full paper for our controlled experiments on Python prediction and our deterministic DSL twin task that strips out pretraining priors to prove these dynamics!
📝 Paper: https://t.co/6zOW63Xpqk
💻 Code: https://t.co/liFdD2kmp1 9/9
We discover the 𝐀𝐬𝐲𝐦𝐦𝐞𝐭𝐫𝐢𝐜 𝐑𝐨𝐥𝐞𝐬 𝐨𝐟 𝐃𝐚𝐭𝐚 𝐆𝐚𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐑𝐞𝐰𝐚𝐫𝐝 𝐆𝐫𝐨𝐮𝐧𝐝𝐢𝐧𝐠 𝐢𝐧 𝐒𝐞𝐥𝐟-𝐏𝐥𝐚𝐲 𝐑𝐋: data gating, not reward grounding, is the binding constraint on stability. A strict gate stabilizes every reward we tested, including a self-consistency reward with no access to ground truth; while no reward stabilizes once the gate is removed, not even one grounded in execution truth.
It challenges the common assumption that reward grounding is what governs self-play stability. The field's response to collapse has been better rewards: confidence penalties, momentum anchors, hacking detectors, all on the reward side. The binding constraint lives upstream, in the data pipeline.
A self-play system has two distinct levers that prior work conflates. A DATA GATE decides which proposer-generated tasks enter the training pool. A REWARD decides how the policy updates on what's admitted. The gate decides what data exists; the reward decides how the optimizer reacts. They are not symmetric!
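The two levers can be sketched as two separate functions in a toy loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Task` class, the `is_well_posed` flag, and both gate functions are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    prompt: str
    is_well_posed: bool  # hypothetical stand-in for any validity check

def strict_gate(task: Task) -> bool:
    """Data gate: decides whether a proposed task enters the training pool."""
    return task.is_well_posed

def open_gate(task: Task) -> bool:
    """Gate removed: every proposed task is admitted."""
    return True

def build_pool(proposed: List[Task], gate: Callable[[Task], bool]) -> List[Task]:
    """The reward is only ever computed on tasks that pass the gate."""
    return [t for t in proposed if gate(t)]

proposed = [
    Task("return the sum of [1, 2]", is_well_posed=True),
    Task("return the output of an unspecified function", is_well_posed=False),
]
gated_pool = build_pool(proposed, strict_gate)
ungated_pool = build_pool(proposed, open_gate)
print(len(gated_pool), len(ungated_pool))
```

The asymmetry shows up in the data flow: whatever reward is used downstream only sees the pool, so it cannot undo an admission the gate already made.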
The reward doesn't filter bad data; instead, it's maximized by it. Under self-consistency, the intrinsic–grounded gap saturates near 1.0: corrupted data receives higher reward than clean data, because intra-group agreement is easiest to maximize on ambiguous tasks.
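A small sketch of why agreement alone can reward bad data. It assumes a self-consistency reward defined as the fraction of sampled answers matching the majority answer, and a gap defined as that reward minus a ground-truth reward. Both definitions and all function names are assumptions made for illustration; the paper's exact formulation may differ.

```python
from collections import Counter
from typing import List

def self_consistency_reward(answers: List[str]) -> float:
    """Fraction of sampled answers that agree with the majority answer."""
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)

def grounded_reward(answers: List[str], truth: str) -> float:
    """Fraction of sampled answers that match the ground truth."""
    return sum(a == truth for a in answers) / len(answers)

def intrinsic_grounded_gap(answers: List[str], truth: str) -> float:
    """Assumed definition: intrinsic reward minus grounded reward."""
    return self_consistency_reward(answers) - grounded_reward(answers, truth)

# Every sample agrees, but on the wrong answer.
answers = ["7", "7", "7", "7"]
gap = intrinsic_grounded_gap(answers, truth="12")
print(gap)
```

When the group unanimously agrees on a wrong answer, the intrinsic reward is maximal while the grounded reward is zero, which is the saturated case of the gap.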
The counterintuitive consequence we call the Grounded Proposer Paradox: a proposer with ground-truth verification access collapses FASTER than an ungrounded one when paired with a self-consistency solver. Cleaner tasks form the lowest-resistance path to the spurious self-consistent attractor. The upstream agent doesn't bias the downstream one toward truth; it sharpens the corridor to the wrong fixed point.
The shift: stop treating self-play stability as a reward-design problem. What enters the training loop matters more than how the optimizer scores it.
🚨 Why does Self-Play RL for LLMs keep collapsing? Most fixes focus on the reward signal. In our new paper "Survive or Collapse", we show that's the wrong lever. The true binding constraint is actually Data Gating: deciding which generated tasks enter the training pool. 🧵 1/n
🎓 Finishing your PhD soon and looking for your next step?
If you’re excited about AI + Physics, don't miss this opportunity!
Come join our lab as a postdoc with Prof. @xwang_lk at @UCSantaBarbara. World-class research on an amazing campus (home to last year’s Nobel in Physics!) 🧑‍🔬
UCSB NLP is now 𝐔𝐂𝐒𝐁 𝐍𝐋𝐏 & 𝐀𝐈!
As our community grows alongside the broader AI landscape, the new name reflects our expanding focus across NLP, LLMs, agents, multimodal AI, and beyond.
Our new logo combines Storke Tower, mountains, beach, waves, and an AI brain. 🌊🧠
❗️📜📢Thrilled to share this new work! Learning POMDP World Models from Observations with Language-Model Priors!
It introduces Pinductor ➡️ An LLM proposes executable POMDP code from partial observations and refines it against a belief-based likelihood, matching methods that need ground-truth hidden states 🤯
Agent evaluation is moving from final answers to real execution harnesses.
Works like Claw-Eval and ClawsBench ask whether agents can plan, call tools, access resources, and complete user goals in realistic environments.
But the key gap is still there: most evaluations focus on task completion. They tell us whether the task was finished, but not whether it was finished safely.
Some recent safety audits under Claw-like settings examine tool use or final output safety. Yet full execution trajectories and system-level harness safety remain underdefined.
A harness can return the correct result while accessing restricted resources, making unauthorized tool calls, leaking sensitive context across agents, or triggering side effects beyond the user’s intent.
This becomes even more critical in multi-agent systems, where role division, task handoff, shared context, and inter-agent communication expand the safety surface.
We introduce HarnessAudit, a trajectory-level framework that redefines how Agent Harness safety should be evaluated.
Harness safety should not be reduced to final output safety or task completion. It should ask whether the harness consistently respects tool permissions, resource boundaries, role responsibilities, and information flow constraints throughout execution.
HarnessAudit provides a unified audit protocol for both single-agent and multi-agent systems, evaluating whether a harness truly completes tasks safely across boundary compliance, execution fidelity, and system stability.
The harness, not just the model, is the unit of agent safety.
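To make "trajectory-level" concrete, here is a minimal sketch of checking every step of an execution trace against per-agent tool permissions and resource boundaries. It is not HarnessAudit's protocol: the `Step` and `Policy` classes, the `audit` function, and the example paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Step:
    agent: str
    tool: str
    resource: str

@dataclass
class Policy:
    allowed_tools: Dict[str, Set[str]]      # agent name -> permitted tools
    allowed_resources: Dict[str, Set[str]]  # agent name -> permitted resources

def audit(trajectory: List[Step], policy: Policy) -> List[str]:
    """Return one message per step that breaks a tool or resource rule."""
    violations = []
    for i, step in enumerate(trajectory):
        tools = policy.allowed_tools.get(step.agent, set())
        resources = policy.allowed_resources.get(step.agent, set())
        if step.tool not in tools:
            violations.append(f"step {i}: {step.agent} used unauthorized tool {step.tool}")
        elif step.resource not in resources:
            violations.append(f"step {i}: {step.agent} touched unauthorized resource {step.resource}")
    return violations

policy = Policy(
    allowed_tools={"planner": {"read_file"}},
    allowed_resources={"planner": {"/workspace/task.md"}},
)
trajectory = [
    Step("planner", "read_file", "/workspace/task.md"),
    Step("planner", "read_file", "/secrets/api_key.txt"),  # right tool, wrong resource
]
task_completed = True  # the final answer may still be correct
violations = audit(trajectory, policy)
print(task_completed, violations)
```

The example trace completes its task and still yields a violation, which a check on the final output alone would not surface.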
⚠️ Your Agent Harness Can Pass Every Task and Still Be Unsafe.
LLM agents now run inside execution harnesses that dispatch tools, allocate resources, and route messages across components. The harness can return a correct final answer while accessing unauthorized resources, leaking context to the wrong agent, or triggering irreversible side effects along the way.
Evaluating the model's output cannot see any of this.
The unit of safety has shifted. It's the harness.
We present HarnessAudit, a trajectory-level framework for auditing LLM agent harness safety, and uncover the following key insights 🔥:
🚨 Completion ≠ Safety. Task success and safe execution are fundamentally misaligned.
🔍 The harness, not the model, is the unit of safety. Most violations happen mid-trajectory, not at termination.
🕸️ Multi-agent collaboration expands the risk surface. Inter-agent communication creates entirely new failure modes.
💉 Resource access dominates violations. Agents rarely call wrong tools — they call right tools on unauthorized resources.
⚡ Harness design sets the safety ceiling. Framework choice matters more than model choice for safe deployment.
Huge congratulations to @qianqi_yan and @saa1605 on successfully passing their MAE! 🎉 Well deserved! 👏
This is an important milestone in the PhD journey and reflects all the hard work, persistence, and progress they’ve made. Excited to see what comes next in their research!
Coming up at @ucsantabarbara:
This week our friends from @ai_ucsb invite Prof. Ramya Korlakai Vinayak from University of Wisconsin-Madison to discuss personalized AI alignment. Save it in your calendar for May 19th at 10am.
We love having people over on our beautiful campus🏖️
Join us in celebrating the graduation of PhD students from our sibling lab at @ucsc! 🎉🥳🙌🍾 Congratulations to you all, Dr. @YFan_UCSC, Dr. @KaizhiZheng, and Dr. @KaiwenZhou9!
May 16, 2026, a truly special day: the graduation ceremony for three of my PhD students. Huge congratulations to Dr. @YFan_UCSC, Dr. @KaizhiZheng, and Dr. @KaiwenZhou9! With this, my first cohort of PhD students has officially all graduated (5/5)! 🎉
It was especially meaningful that Dr. @XuehaiH and Dr. @jinggu4ai, who graduated last year, came from Seattle and the Bay Area to celebrate together. @qianqi_yan, who just finished her third year and is preparing for her MAE defense, also drove up from Santa Barbara.
After the ceremony, we headed to the Santa Cruz Boardwalk and relived our (or just my) youth by riding every thrilling attraction we could find, ending the night with BBQ and drinks in the Bay Area. As Dr. Gu said: our very first group activity was a Santa Cruz beach hike when everyone joined the lab, and today we had our “beach ride”, a perfect full circle moment.
There are far too many memories to share in one post, but I hope we’ll keep gathering for years to come.
I feel incredibly fortunate to have worked and grown alongside such an outstanding and close-knit group of scholars. For a faculty member, moments like this may truly be among the happiest in one’s career.
I was also delighted to learn that all of my students have found their life partners: some newly married, others entering exciting new chapters of life.
Wishing all of you happiness, fulfillment, and great success ahead. May your futures be bright and your ambitions soar! ❤️🐎
Interested in getting into AI safety research? Applications are open for the PRISM Fellowship!
It’s a 16-week, remote program where fellows work in teams of 4 with established mentors toward a conference-ready paper.
Women & underrepresented researchers especially encouraged!