Vladimir Belomorski (Bello) @bellodox - Twitter Profile

Pinned Tweet

Vladimir Belomorski (Bello) @bellodox

about 1 month ago

Three branches stripped bare. Just the wind whispers to them; Moon among the clouds.

0

1

0

43

Vladimir Belomorski (Bello) @bellodox

about 1 month ago

It shows exactly what the current problem with AI is: doing a perfect job on the tests while performing subpar in real-life tests. @claudeai #antropic

Artificial Analysis

@ArtificialAnlys

about 1 month ago

Claude Opus 4.8 takes the lead on the Artificial Analysis Intelligence Index at 61.4, with Anthropic retaking the #1 spot on GDPval-AA and advancing in terminal use and scientific reasoning

To reach the leading position on the Intelligence Index, @Anthropic made large improvements in both real-world agentic work and frontier academic reasoning tasks.

Key takeaways:
➤ Claude Opus 4.8 is the new leader on the Artificial Analysis Intelligence Index. Opus 4.8 scores 61.4, up +4.1 points from Opus 4.7 and +1.2 points ahead of GPT-5.5 (xhigh), the previous Index leader

➤ The new release is slightly more efficient than its predecessor on agentic tasks, but token efficiency varied by task type. We saw Opus 4.8 use fewer turns and output tokens on GDPval-AA, but approximately the same number of output tokens for the overall Intelligence Index to achieve significantly higher performance.

➤ Anthropic retakes the lead on GDPval-AA, our primary evaluation for agentic performance on knowledge work tasks. Opus 4.8 scored an 1,890 Elo, reflecting an implied win rate of approximately 67% against GPT-5.5

➤ Claude is now among the top models for scientific reasoning. Previous releases have trailed peers on complex academic reasoning tasks, but with Opus 4.8, Claude sits slightly ahead of OpenAI and Google as the leader on Humanity’s Last Exam. It also scores higher than Gemini 3.1 Pro on CritPt, a frontier physics benchmark, but remains behind GPT-5.4 and GPT-5.5

➤ Claude Opus 4.8 reaches #2 on AA-Omniscience, slightly ahead of Opus 4.7. Opus 4.8 scores 27.4 on the AA-Omniscience Index behind only Gemini 3.1 Pro (32.9). Accuracy ticked up slightly to 46.6% and hallucination rate held roughly flat at 35.9% - Anthropic continues to demonstrate substantially lower hallucination rates than peer models from Google and OpenAI

➤ Compared with Opus 4.7, Opus 4.8 also makes material gains on Terminal-Bench Hard (+6.8 points), τ²-Bench Telecom (+5.9 points), and IFBench (+3.6 points), with relatively flat scores across AA-LCR, GPQA, and SciCode.

Other key model details remain the same as Opus 4.7:
Context window of 1 million tokens (equivalent to Opus 4.7)
Pricing of $5/$25 per million tokens of input/output; cache pricing remains at a 25% premium for cache writes ($6.25 per million tokens) with 5-minute time to live, and 90% discount for cache hits ($0.5 per million tokens)
Effort remains the recommended way of configuring model performance and latency, with the same options as Opus 4.7 - we measured the model at its ‘max’ effort setting to test peak performance
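As a rough illustration of the pricing quoted in the post, the arithmetic can be sketched in Python. This is only a sketch: the function name `request_cost` and its parameters are hypothetical, and it assumes that uncached input, cache writes, cache hits, and output tokens are billed separately and simply added together, which the post does not spell out.

```python
# Prices as quoted in the post, in USD per million tokens.
INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00
CACHE_WRITE_PER_MTOK = INPUT_PER_MTOK * 1.25  # 25% premium over input -> 6.25
CACHE_HIT_PER_MTOK = INPUT_PER_MTOK * 0.10    # 90% discount on input -> 0.50


def request_cost(input_tokens, output_tokens,
                 cache_write_tokens=0, cache_hit_tokens=0):
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: the four token categories are disjoint and their
    costs are additive.
    """
    total = (
        input_tokens * INPUT_PER_MTOK
        + output_tokens * OUTPUT_PER_MTOK
        + cache_write_tokens * CACHE_WRITE_PER_MTOK
        + cache_hit_tokens * CACHE_HIT_PER_MTOK
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 200K uncached input tokens, 800K cache hits, 10K output tokens.
    cost = request_cost(200_000, 10_000, cache_hit_tokens=800_000)
    print(f"${cost:.2f}")
```

Under these assumptions, the example request comes to about $1.65, against $5.25 if all one million input tokens were billed at the uncached rate.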

15

690

71

95

53K

0

67

Vladimir Belomorski (Bello) @bellodox

about 1 month ago

Woo, the new Opus 4.8 behavior is quite erratic and inconsistent, @claudeai #aislop #antropic

0

39

Vladimir Belomorski (Bello) @bellodox

7 months ago

@godofprompt AI

0

3

Vladimir Belomorski (Bello) @bellodox

about 3 years ago

3 rules to express your thoughts so that everyone will understand you https://t.co/V1HG88ZMAb via @instapaper

0

71

bellodox retweeted

₵ryptospace ₮oday - All the news, analyzed daily @cryptospace2day

about 3 years ago

Sneak Peek into the Future: A Tech Demo of SpaceXpanse Multiverse Features: https://t.co/NdAbEfFELi

2

8

5

0

752

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

Building the metaverse of the future https://t.co/F7Wy2gEh7y via @instapaper

0

44

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

Compact nuclear fusion 1 million times more effective than other types, claims Israeli startup https://t.co/fF9dt1mglU via @instapaper

0

24

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

How freelancers in Austria can get a €100 health bonus in 2023 https://t.co/Q1pGRIWk88 via @instapaper

0

22

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

Is Vinegar an Acid or Base? And Does It Matter? https://t.co/XZoUuV0IVc via @instapaper

0

17

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

What's changing for tenants and homeowners in Austria in 2023? https://t.co/QUOgDvCWTN via @instapaper

0

29

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

5 Best Methods to Force Massive Muscle Growth | BOXROX https://t.co/hM099y0ipB via @instapaper

0

24

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

The best days and times to post on social media in 2023 (Infographic) https://t.co/DWKpo3ziQF via @instapaper

0

17

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

EXPLAINED: How to pay Austria's TV and radio fee, or (legally) avoid it https://t.co/DhWuvyIfWD via @instapaper

0

15

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

European Union to Put a 10,000-Euro Limit on Cash Payments; Transactions Over €1,000 in Crypto Will Be Scrutinized –… https://t.co/TFl6WI92LH

0

21

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

High-intensity physical activity reduces the risk of cancer and metastases https://t.co/j7SklI1DP8 via @instapaper

0

17

bellodox retweeted

Bittrex Global @BittrexGlobal

over 3 years ago

Day 12: It's the final day of #12DaysOfCrypto. We're giving away $3,000 worth of $MATIC to our lucky winners. Enter now at: https://t.co/Tlat1xNJzg 🎁 Provide your KYC verified Bittrex Global email 🎁 Follow @BittrexGlobal, @OliverLinch and @MouradianMike on Twitter 🎁 Retweet!

85

165

248

1

34K

bellodox retweeted

Bittrex Global @BittrexGlobal

over 3 years ago

We're on day 11 of #12DaysOfCrypto! We're giving away $2,000 worth of $XRP to our lucky winners. Enter now at: https://t.co/Tlat1xNJzg 🎁 Provide your KYC verified Bittrex Global email 🎁 Follow @BittrexGlobal, @OliverLinch and @MouradianMike on Twitter 🎁 Retweet!

76

151

254

2

28K

Vladimir Belomorski (Bello) @bellodox

over 3 years ago

I just got my free Web3 Domain that matches my Twitter handle, bellodox.nft, - and you can, too! Follow the steps below. 1. Go here: https://t.co/RIezOx7lG4 2. Click "Verify Twitter" 3. Claim your domain #web3 #ownyourname #UDfam

0

33
