A humble proposal for the AI labs: before you use automated systems to ban paying users, maybe use those same automated systems to say "hey, you're doing this thing. We can tell. Stop it." Then ban them if it happens again.
Super important paper from Univ of Texas.
AI agents can slowly become less reliable after deployment, even when the model itself does not change.
The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance.
An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house.
Every one of those steps can quietly rot.
A medication dose can become "a daily medication," two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass.
The uncomfortable finding is that the agent may still sound competent while becoming less exact.
The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point.
It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory.
The deeper lesson is that "give it more memory" is often the wrong repair.
If the fact was never written, retrieval cannot save it.
If the fact was written but crowded out, better summarization will not fix it.
If the fact is present but unused, the problem is not storage but the agent's decision to trust or ignore what it retrieved.
This paper reframes deployed agents less like static models and more like aging infrastructure.
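The three cases above (never written, crowded out, present but unused) amount to asking where in the pipeline a fact went missing. Here is a rough sketch of that triage, not anything taken from the paper or from AgingBench itself. The `Trace` record, the `Failure` names, and the `diagnose` helper are all hypothetical, and plain substring matching stands in for whatever real fact-checking a benchmark would use.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    NONE = "fact survived to the answer"
    NOT_IN_STORE = "never written (or lost before this session)"
    NOT_RETRIEVED = "written but crowded out at retrieval"
    NOT_USED = "retrieved but ignored in the answer"


@dataclass
class Trace:
    """Hypothetical record of one probe question put to an agent."""
    stored: list[str]      # everything currently in the memory store
    retrieved: list[str]   # what the retriever handed to the model
    answer: str            # what the agent finally said


def diagnose(fact: str, trace: Trace) -> Failure:
    """Return the first pipeline stage at which `fact` is missing.

    Assumption: a case-insensitive substring match is enough to say
    a fact is "present". A real check would need something stricter.
    """
    needle = fact.lower()

    def present(texts: list[str]) -> bool:
        return any(needle in t.lower() for t in texts)

    if not present(trace.stored):
        return Failure.NOT_IN_STORE      # retrieval cannot save it
    if not present(trace.retrieved):
        return Failure.NOT_RETRIEVED     # more summarization will not help
    if needle not in trace.answer.lower():
        return Failure.NOT_USED          # a trust problem, not a storage one
    return Failure.NONE


# The medication example: the dose was summarized away before storage.
trace = Trace(
    stored=["User takes a daily medication."],
    retrieved=["User takes a daily medication."],
    answer="You take a daily medication.",
)
print(diagnose("20 mg", trace).name)
```

The point of splitting it this way is that each outcome suggests a different repair: fix the write path, fix retrieval, or fix how the agent weighs what it retrieved.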
----
Link: arxiv.org/abs/2605.26302
Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"
Today we're releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
The most important thing the Gemini models do is create pressure on everyone else, keep it from being a pure two-horse race, and make it clear Google isn't going to have their lunch eaten easily. Otherwise they are kinda meh.
We found and fixed two issues that could explain the degradation in GPT-5.5's capability in Codex over the last ~48 hours.
We are monitoring over the coming hours to fully confirm and I will reset usage limits this evening.
Apologies and now is the time for /fast maxxing.
These posts that 60% of code is written by AI at X company blow my mind. Who tf is still opening up VIM and doing this shit one key press at a time? If it's less than 99% you have a performance problem.
I had gpt 5.5 xhigh organize all of the released UFO files, review all videos and images, and categorize them into buckets ranging from completely explained to not explained, etc.
Then asked it for a skeptical but curious analysis. Pretty interesting.