Sunny @sunnypause - Twitter Profile

Sunny

@sunnypause

3 days ago

@OnlyTerp Thanks a lot

0

1

sunnypause retweeted

Antidepressant Content

@depressionlesss

4 days ago

When things get weird, but you don't want it to stop

80

8K

654

1K

355K

sunnypause retweeted

Ibragim

@ibragim_bad

3 days ago

📊 More insights on GPT-5.5 vs Opus 4.8 based on SWE-rebench runs

TL;DR: Opus 4.8 became much more token-efficient than 4.6, but GPT-5.5 is still the most efficient: more solved tasks, fewer tokens, fewer steps.

🏆 SWE-rebench is a live benchmark with fresh SWE tasks (issue+PR) from GitHub.

Detailed table of the results and the leaderboard link are in the thread.

Findings:
> GPT-5.5 medium looks noticeably more efficient than Opus 4.8 high, if we compare the default reasoning-effort modes for both models.

> Opus really became much more optimized from 4.6 → 4.8 on high: more solved tasks, 45% fewer tokens per task, and around 39% lower cost/problem.

> Opus 4.8 high is almost not better than Opus 4.7 high by score, but it is much cheaper in compute. Tokens/task went down 1.53M → 1.01M, and steps went down 43.7 → 34.2.

> GPT-5.5 medium also became more token-efficient than GPT-5.4 medium, but more expensive because the base pricing increased.
Tokens per task went down by 15%, score increased, but the cost of solving a task increased by 63% while base pricing increased 2x.
Another useful metric, when you have several runs, is pass^5. Here we count a task only if it was solved in all 5 runs.
For GPT-5.5 medium, pass@5 almost did not change compared to GPT-5.4 medium: 77 vs 78.

> But pass^5 increased a lot: 51 vs 39! This means GPT-5.5 medium solves tasks “randomly once” less often, and much more often solves the same task consistently in all 5 runs.
For Opus, this number is almost the same between model versions, but it changes a lot depending on reasoning mode: high → xhigh.

> Many people ask why GPT-5.5 xhigh gets a higher score than medium, or why one model beats another on these tasks. On the surface, it looks like one model solved the task and another did not. But usually it is not a full failure. Very often the model gets to an almost correct solution, but misses some edge cases or corner cases covered by tests.

In xhigh reasoning, GPT makes many more steps to explore the repository and more actively tests its own solutions, including writing additional tests. This helps to catch these corner cases, but the price is high.
GPT-5.5 medium: 58.9% → 62.7% pass@1, $0.98 → $2.25

> GLM 5.1 looks competitive by pass@5, but it has a very heavy trajectory. So I believe that it could be RL’ed to get even better results in terms of pass@1 and more efficient token count, like Composer 2.5, for example.

P.S.
Please write if you have questions, or what hypotheses we should check on trajectories. We are working on releasing all the trajectories, so you can do some analysis on them as well
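
The difference between pass@5 and pass^5 described in the tweet above can be illustrated with a small sketch. It assumes that pass@5 counts a task if at least one of the 5 runs solved it (the tweet only defines pass^5 explicitly), and that pass@1 is the average per-run solve rate. The task names and outcomes are made-up toy data, not SWE-rebench results, and the function names are hypothetical.

```python
from typing import Dict, List

# Toy data (hypothetical): per-task outcomes over 5 independent runs.
results: Dict[str, List[bool]] = {
    "task-a": [True, True, True, True, True],       # solved every time
    "task-b": [True, False, False, False, False],   # solved "randomly once"
    "task-c": [False, False, False, False, False],  # never solved
    "task-d": [True, True, True, True, False],      # solved in 4 of 5 runs
}


def pass_at_k(runs_by_task: Dict[str, List[bool]]) -> float:
    """Percent of tasks solved in at least one run (assumed meaning of pass@k)."""
    solved = sum(any(runs) for runs in runs_by_task.values())
    return 100.0 * solved / len(runs_by_task)


def pass_hat_k(runs_by_task: Dict[str, List[bool]]) -> float:
    """Percent of tasks solved in all runs (pass^k as defined in the tweet)."""
    solved = sum(all(runs) for runs in runs_by_task.values())
    return 100.0 * solved / len(runs_by_task)


def pass_at_1(runs_by_task: Dict[str, List[bool]]) -> float:
    """Average single-run solve rate in percent (assumed meaning of pass@1)."""
    total_runs = sum(len(runs) for runs in runs_by_task.values())
    total_solved = sum(sum(runs) for runs in runs_by_task.values())
    return 100.0 * total_solved / total_runs


if __name__ == "__main__":
    print(f"pass@5 = {pass_at_k(results):.1f}")   # 75.0
    print(f"pass^5 = {pass_hat_k(results):.1f}")  # 25.0
    print(f"pass@1 = {pass_at_1(results):.1f}")   # 50.0
```

In this toy set, task-b and task-d both count toward pass@5 but not toward pass^5, which is why a model can keep the same pass@5 while its pass^5 rises as it becomes more consistent.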

2

19

1

3

1K

sunnypause retweeted

Brillyluxe

@OvwomaB

4 days ago

Wait... CeraVe products cause cancer too? 🤔

138

2K

427

1K

892K

sunnypause retweeted

@iruletheworldmo

3 days ago

deepswe bench is the best benchmark in the world right now. and openai crush it. and 5.6 is leaps ahead of 5.5. opus is a cute chatbot companion though, such depth. https://t.co/IqWRjaiEM3

19

183

10

15

8K

sunnypause retweeted

FLAT OUT TRUTH

@TheFlatEartherr

4 days ago

An entire city exorcised all of its demons by building this giant bell and ringing it every day 🔔

23

1K

238

126

24K

Sunny

@sunnypause

4 days ago

@YashHustle_22 GPT-5.5 xhigh is better

0

21

sunnypause retweeted

Financial Times

@FT

4 days ago

SoftBank’s commitment marks the largest AI investment by Masayoshi Son’s group outside the US and delivers a boost to Emmanuel Macron ahead of the French president’s Choose France event next week, an annual gathering of dealmakers and executives. https://t.co/9GWbpHUCbu https://t.co/3IWvkRbnO1

8

178

54

23

27K

Sunny

@sunnypause

4 days ago

@KaiXCreator Yep

0

85

sunnypause retweeted

낋이 @dakggi

5 days ago

Cups that form a bamboo stalk when you stack them?! 🎋🎋🎋

215

136K

12K

2M

sunnypause retweeted

Teknium 🪽

@Teknium

4 days ago

Found a way to save everyone 14% on input tokens on average during read file operations in Hermes Agent! This is now on main. `hermes update` to access now. https://t.co/2sjabsjNVY

110

2K

94

594

136K

sunnypause retweeted

Collin Burdick

@CollinBurdick

4 days ago

https://t.co/B9z8yTsmmO

0

1

0

204

sunnypause retweeted

Collin Burdick

@CollinBurdick

4 days ago

Who said you can't have cheap, fast, and good at the same time??

GPT-5.5 smashes Opus 4.8 on DeepSWE across all 3 at highest max reasoning.

>> Higher score: 70% vs. 58%
>> 2x faster
>> 2x cheaper
>> 3x fewer output tokens

5.5 high still beats 4.8 max 62% vs. 58% while being 3x faster and 3x cheaper

That matters beyond software engineering. In life sciences, better models can help teams use scarce researcher time, budget, and experimental capacity more efficiently, find results sooner, and make more patient impact, faster.

And we are just getting started.

6

68

13

10

6K

Sunny

@sunnypause

4 days ago

@usr_bin_roygbiv @wolfiesch vs openai compute, they can't win for now

1

0

344

sunnypause retweeted

Peter Steinberger 🦞

@steipete

4 days ago

@iruletheworldmo very much depends on the skillset of the person driving the AI.

8

68

2

9

6K

sunnypause retweeted

CHOI

@arrakis_ai

4 days ago

Claude Opus 4.8 has landed on DeepSWE Bench, posting a 58% Pass@1 and taking #2 overall behind GPT-5.5. It continues a broader trend: slightly behind on raw score, but among the most reliable and efficient coding models across recent benchmarks.