Amir Elaguizy @amirpc - Twitter Profile

A humble proposal for the AI labs: before you use automated systems to ban paying users maybe use those same automated systems to say "hey you're doing this thing. We can tell. Stop it." Then ban them if it happens again.

0

25

amirpc retweeted

Rohan Paul

@rohanpaul_ai

9 days ago

Super important paper from Univ of Texas.

AI agents can slowly become less reliable after deployment, even when the model itself does not change.

The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance.

An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house.

Every one of those steps can quietly rot.

A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass.

The uncomfortable finding is that the agent may still sound competent while becoming less exact.

The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point.

It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory.

The deeper lesson is that “give it more memory” is often the wrong repair.

If the fact was never written, retrieval cannot save it.

If the fact was written but crowded out, better summarization will not fix it.

If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved.

This paper reframes deployed agents less like static models and more like aging infrastructure.

----

Link – arxiv.org/abs/2605.26302

Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

29

170

51

100

10K

amirpc retweeted

Serena Ge (Datacurve)

@serenaa_ge

10 days ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

https://t.co/HCDcjNuTFK

512

6K

742

3K

2M

amirpc retweeted

Simon Last

@simonlast

14 days ago

1/ Some things I've learned recently running coding agents on large-scale projects. Most of this contradicts advice from 6 months ago!

95

3K

210

6K

569K

Amir Elaguizy

@amirpc

16 days ago

Codex: Yes, it's fully implemented and tested and ready for review. Small note: The Docker container is out of disk space so nothing was tested.

0

1

0

40

Amir Elaguizy

@amirpc

16 days ago

The most important thing the Gemini models do is create pressure on everyone else, keep it from being a pure two-horse race, and make it clear Google isn't going to have their lunch eaten easily. Otherwise they are kinda meh.

0

1

0

44

amirpc retweeted

Naval

@naval

17 days ago

The latest IQ test involves data centers and water.

566

14K

1K

873

2M

Amir Elaguizy

@amirpc

17 days ago

History will look back on "detecting AI generated content" as hilarious. It will be like detecting books typed vs written by hand on paper.

0

1

0

24

Amir Elaguizy

@amirpc

20 days ago

Spent all day working with 140 IQ models that can't tell left from right and let me tell you...AGI is not here.

0

2

0

35

Amir Elaguizy

@amirpc

20 days ago

@thsottiaux @reach_vb Thank you for treating your users with respect

0

2

0

218

amirpc retweeted

Tibo

@thsottiaux

21 days ago

We found and fixed two issues that could explain this degradation of the capability of GPT-5.5 in Codex over the last ~ 48 hours. We are monitoring over the coming hours to fully confirm and I will reset usage limits this evening. Apologies and now is the time for /fast maxxing.

828

8K

511

679

2M

Amir Elaguizy

@amirpc

23 days ago

Well, it's officially easier to migrate off of some SaaS than deal with their support.

1

2

0

80

Amir Elaguizy

@amirpc

27 days ago

@nwenzel 🔥

0

18

Amir Elaguizy

@amirpc

27 days ago

These posts that 60% of code is written by AI at X company blow my mind. Who tf is still opening up VIM and doing this shit one key press at a time? If it's less than 99% you have a performance problem.

1

0

99

Amir Elaguizy

@amirpc

27 days ago

I had gpt 5.5 xhigh organize all of the released UFO files, review all videos and images, categorize them into buckets of completely explained to not explained, etc. Then asked it for a skeptical but curious analysis. Pretty interesting.

https://t.co/O8brjRWAe6

0

2

0

95
