I work at a much smaller company and have basically been rewriting our entire company from scratch: a 10+ app ecosystem, trying to keep the whole system in memory and verifying everything (lots of screenshots). And I rarely trust Sonnet for subtasks etc. I guess maybe I need to start trying out the cheaper models?
I use up my 200/month limit every single week, usually with a day or two left, so I have to switch to my Codex for those few days. Running round the clock on 1/4 of the usage limits doesn’t sound realistic. I could definitely optimize my usage a bit. But you’re talking about 25% of my limit being more than enough.
@shade_engine @steph_palazzolo Yeah the model that found 10,000 zero-day exploits in nation-critical infrastructure within weeks of being released is probably ass
Just an example of Opus 4.8 outclassing GPT 5.5 in ways that are invisible to benchmarks. (When I post these, this is NOT an attack on OpenAI in any way, quite the opposite, I just want things to improve...)
Left both working on a generic goal: "optimize this file".
After 8 hours:
→ Opus 4.8 landed a solid +17%
→ GPT 5.5 landed +30%
I then checked file sizes.
→ GPT 5.5 *doubled* the file
→ Opus 4.8 grew it by 0.1% (!!!!!)
For most benches, 5.5 would have beaten 4.8 here, but clearly Opus did a much better job. GPT produced a short-term win that would stall further progress if I merged it. Opus delivered a no-tradeoff, long-term win. And if I had asked GPT 5.5 to "keep the file size the same", it would just start hacking that, minifying, removing docs, etc. - something Opus 4.8 just doesn't do. Its file is as clear as it was when I set the goal.
---
About this: it is an "HVM5 v2" that is even simpler. Now the whole file is at <14k tokens, and consistently outperforming HVM4 by 5-fold. And this version does not have native constructors, only Unit (`()`), Either (`inl(x)/inr(x)`) and Pair (`tup(x,y)`) as primitives, which is very GPU friendly and means we might actually manage to run SupGen on it!
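To illustrate why three primitives can be enough, here is a minimal sketch of the standard sum-of-products encoding, assuming Python tuples as stand-ins for the primitives. This is not HVM5's actual representation or syntax; the helper names (`nil`, `cons`, `to_python_list`) and the tagging scheme are hypothetical and exist only for illustration.

```python
# Hypothetical stand-ins for the three primitives: Unit, Either, Pair.
UNIT = ()

def inl(x):
    return ("inl", x)

def inr(x):
    return ("inr", x)

def tup(x, y):
    return ("tup", x, y)

# A list without native constructors:
#   Nil        is encoded as inl(())
#   Cons(h, t) is encoded as inr(tup(h, t))
def nil():
    return inl(UNIT)

def cons(head, tail):
    return inr(tup(head, tail))

def to_python_list(xs):
    """Walk the encoded list and collect its elements."""
    out = []
    while xs[0] == "inr":
        _, (_, head, tail) = xs
        out.append(head)
        xs = tail
    return out

encoded = cons(1, cons(2, cons(3, nil())))
print(to_python_list(encoded))
```

Any constructor with several fields can be nested the same way: one `inl`/`inr` choice per alternative, and one `tup` per extra field.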
Opus's overnight progress:
@Gabriel78470020 @elliotarledge Wrong. This benchmark will be the only one that really matters moving forward, as we’ve already heard confirmed from basically every lab
@Not_A_DoctorOk @fiago7 @GlynnErnesto Well I live near an airport. There are airports everywhere in the U.S. I'm having a hard time imagining someone who doesn’t see passenger planes in the air daily.
@Not_A_DoctorOk @fiago7 @GlynnErnesto Yeah seeing a passenger plane fly over (the kind that we see 100 of every day) is just as exciting as B-52 Stratofortress bombers
@AnthonyGalli @ericernerstedt @TheCriticalDri2 House is really good honestly but it’s not the same genre as Breaking Bad. It is comfort / background watching like Seinfeld or something
@LostMyHats @AmericaOnlycast I was raised LDS and went to seminary and studied it all through high school etc. Even if I don’t consider myself a member anymore, basically none of what you said is true.