MG @mic_gta - Twitter Profile

mic_gta retweeted

about 1 year ago

A few more observations after replicating the Tower of Hanoi game with their exact prompts: - You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff. - Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and o3-mini 100k tokens. This includes the reasoning tokens they use before outputting their final answer! - all models will have 0 accuracy with more than 13 disks simply because they can not output that much! - the max solvable sizes WITHOUT ANY ROOM FOR REASONING (floor(log2(output_limit/10))) DeepSeek: 12 disks Sonnet 3.7 and o3-mini: 13 disks - If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large: "Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually" - At least for Sonnet it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem and the algorithm to solve it and then output its solution without even thinking about individual steps. - it's also interesting to look at the models as having a X% chance of picking the correct token at each move - even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size

scaling01's tweet photo. A few more observations after replicating the Tower of Hanoi game with their exact prompts:

- You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff.
- Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and o3-mini 100k tokens. This includes the reasoning tokens they use before outputting their final answer!

- all models will have 0 accuracy with more than 13 disks simply because they can not output that much!

- the max solvable sizes WITHOUT ANY ROOM FOR REASONING (floor(log2(output_limit/10)))
DeepSeek: 12 disks
Sonnet 3.7 and o3-mini: 13 disks

- If you actually look at the output of the models you will see that they don't even reason about the problem if it gets too large:
"Due to the large number of moves, I'll explain the solution approach rather than listing all 32,767 moves individually"

- At least for Sonnet it doesn't try to reason through the problem once it's above ~7 disks. It will state what the problem and the algorithm to solve it and then output its solution without even thinking about individual steps.

- it's also interesting to look at the models as having a X% chance of picking the correct token at each move
- even with a 99.99% probability the models will eventually make an error simply because of the exponentially growing problem size

82

2K

241

930

616K

mic_gta retweeted

Legit @legit_api

over 1 year ago

Gemini 2.0 Pro Exp has just been added behind the scenes ✨

18

608

47

85

173K

mic_gta retweeted

Chubby♨️

@kimmonismus

over 1 year ago

O3 failure rate on ARC AGI correlates with grid size rather than difficulty >via/r/singularity

28

961

54

222

113K

mic_gta retweeted

Dwarkesh Patel

@dwarkesh_sp

over 1 year ago

We’re way more patient in training human employees than AI employees. We will spend weeks onboarding a human employee and giving slow detailed feedback. But we won’t spend just a couple of hours playing around with the prompt that might enable the LLM to do the exact same job, but more reliably and quickly than any human.

114

2K

91

471

168K

Who to follow

itsAdam888

@714aac

͏ ͏͏͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ᅟᅟᅟᅟᅟᅟ𖾗𖽖𖽧𖾗𖽖𖽧𖾗𖽖𖽧𖾗𖽖𖽧

@saml_exe

͏ ͏ ͫͣ͛ͫͣͫͣ͛ͫͫͣ͛ͫͣ͛ͫͣ͛ͫͣ͛ ͏ ᅟᅟᅟProbably exploiting some weird site. ͏ ͏

Dr_RageQuit

@dr_ragequit

General Surgery resident, crypto enthusiast, NFT collector Rogue Society Bot #3240 #3241 #FFTB #JRNYclub #Voyagers #9936

mic_gta retweeted

simp 4 satoshi

@iamgingertrash

over 1 year ago

The first innings of AGI development was Mathematics (linalg + calc) The second was Computer Science (parallelization, tensor ops) The final third may turn out to be Philosophy

34

501

29

126

33K

mic_gta retweeted

Jim Fan

@DrJimFan

over 1 year ago

Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about *relaxing a single-point RL super intelligence* to cover more points in the space of useful problems. The world of AI is no stranger to RL achieving god-level stunts. AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players. AlphaStar was a super intelligence. It bests some of the greatest e-sport champion teams on StarCraft. Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs. Similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains. The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0. Learning reward functions for sophisticated math and software engineering are much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for single-point task, but an RL specialist for a bigger set of useful tasks. Yet o3's reward engineering could not cover ALL distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games. Huge milestone. Clear roadmap. More to do.

DrJimFan's tweet photo. Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about *relaxing a single-point RL super intelligence* to cover more points in the space of useful problems.

The world of AI is no stranger to RL achieving god-level stunts.

AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players.
AlphaStar was a super intelligence. It bests some of the greatest e-sport champion teams on StarCraft.
Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs.

Similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains.

The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0. Learning reward functions for sophisticated math and software engineering are much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for single-point task, but an RL specialist for a bigger set of useful tasks.

Yet o3's reward engineering could not cover ALL distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games.

Huge milestone. Clear roadmap. More to do.

98

2K

322

956

272K

mic_gta retweeted

Ryan Pream @AIMachineDream

over 1 year ago

@airkatakana There is a universe where Lecun was right, we just happen to not live in that one.

0

21

1

0

1K

mic_gta retweeted

Rhythm Garg 🚂

@rhythmrg

over 1 year ago

also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint

47

945

147

178

513K

mic_gta retweeted

Sebastien Bubeck

@SebastienBubeck

over 1 year ago

o3 and o3-mini are my favorite models ever. o3 essentially solves AIME (>90%), GPQA (~90%), ARC-AGI (~90%), and it gets 1/4th of the Frontier Maths. To understand how insane 25% on Frontier Maths is, see this quote by Tim Gowers. The sparks are intensifying ...

SebastienBubeck's tweet photo. o3 and o3-mini are my favorite models ever.

o3 essentially solves AIME (>90%), GPQA (~90%), ARC-AGI (~90%), and it gets 1/4th of the Frontier Maths.

To understand how insane 25% on Frontier Maths is, see this quote by Tim Gowers.

The sparks are intensifying ... https://t.co/sEXNIJIaNo

48

1K

162

383

282K

mic_gta retweeted

claude shannon

@catpoopburglar

over 1 year ago

@PP9401 @tejalpatwardhan o3 received a 2727 in code forces elo, which would be in the top 150 software engineers in the world

0

3

1

0

558

mic_gta retweeted

Greg Brockman

@gdb

over 1 year ago

o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now.

126

4K

352

369

380K

mic_gta retweeted

François Chollet

@fchollet

over 1 year ago

So, is this AGI? While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI -- there's still a fair number of very easy ARC-AGI-1 tasks that o3 can't solve, and we have early indications that ARC-AGI-2 will remain extremely challenging for o3. This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI -- without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.

35

2K

189

350

331K

mic_gta retweeted

Zhou Xian

@zhou_xian_

over 1 year ago

Everything you love about generative models — now powered by real physics! Announcing the Genesis project — after a 24-month large-scale research collaboration involving over 20 research labs — a generative physics engine able to generate 4D dynamical worlds powered by a physics simulation platform designed for general-purpose robotics and physical AI applications. Genesis's physics engine is developed in pure Python, while being 10-80x faster than existing GPU-accelerated stacks like Isaac Gym and MJX. It delivers a simulation speed ~430,000 faster than in real-time, and takes only 26 seconds to train a robotic locomotion policy transferrable to the real world on a single RTX4090 (see tutorial: https://t.co/bEkIlCKqdf). The Genesis physics engine and simulation platform is fully open source at https://t.co/DhBv7NdyqH. We'll gradually roll out access to our generative framework in the near future. Genesis implements a unified simulation framework all from scratch, integrating a wide spectrum of state-of-the-art physics solvers, allowing simulation of the whole physical world in a virtual realm with the highest realism. We aim to build a universal data engine that leverages an upper-level generative framework to autonomously create physical worlds, together with various modes of data, including environments, camera motions, robotic task proposals, reward functions, robot policies, character motions, fully interactive 3D scenes, open-world articulated assets, and more, aiming towards fully automated data generation for robotics, physical AI and other applications. Open Source Code: https://t.co/DhBv7NdyqH Project webpage: https://t.co/SBNyhFB0yn Documentation: https://t.co/3yuBoaealV 1/n

561

16K

3K

12K

4M

mic_gta retweeted

BURKOV

@burkov

over 1 year ago

I just cannot make myself code anymore. It feels too slow and requires too much energy. My brain refuses to concentrate and demands using an LLM shortcut. Do you feel something like this?

352

3K

90

630

3M

mic_gta retweeted

Tsarathustra @tsarnick

over 1 year ago

OpenAI CFO Sarah Friar says the company is leaving the door open to a $2000/month subscription to its AI product which could serve as a "replacement" to hiring humans due to its PhD-level intelligence

tsarnick's tweet photo. OpenAI CFO Sarah Friar says the company is leaving the door open to a $2000/month subscription to its AI product which could serve as a "replacement" to hiring humans due to its PhD-level intelligence https://t.co/WT1wFVzlGi

191

1K

172

469

794K

mic_gta retweeted

Bindu Reddy

@bindureddy

over 1 year ago

OpenAI has released ZERO features for developers in the last 7 days They went from open source to closed source to no API access AND give us $200 to access our top LLM!! EXTREMELY ALARMING!

126

1K

65

136

215K

mic_gta retweeted

Pika

@pika_labs

over 1 year ago

Our holiday gift to you: Pika 2.0 is here. Not just for pros. For actual people. (Even Europeans!) Now available at https://t.co/nqzjGy8AB5

94

1K

195

495

198K

mic_gta retweeted

Anthropic

@AnthropicAI

over 1 year ago

New research collaboration: “Best-of-N Jailbreaking”. We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

79

2K

241

1K

390K

mic_gta retweeted

DeepSeek

@deepseek_ai

over 1 year ago

🎉 DeepSeek-VL2 is here! Our next-gen vision-language model enters the MoE era. 🤖 DeepSeek-MoE arch + dynamic image tilling ⚡ 3B/16B/27B sizes for flexible use 🏆 Outstanding performance across all benchmarks 🧵 1/n

deepseek_ai's tweet photo. 🎉 DeepSeek-VL2 is here! Our next-gen vision-language model enters the MoE era.

🤖 DeepSeek-MoE arch + dynamic image tilling
⚡ 3B/16B/27B sizes for flexible use
🏆 Outstanding performance across all benchmarks

🧵 1/n https://t.co/C7FvQ7p0ym

92

2K

265

461

482K

mic_gta retweeted

NotebookLM

@NotebookLM

over 1 year ago

📢 NEW LAUNCHES📢 1. ✋Rolling out: "Join" an audio overview+engage directly with the AI hosts 2. 😎New UI optimized for managing+generating new content based on your sources 3. 💪NotebookLM Plus: a premium version for businesses, teams & enterprises Learn more ⬇️ https://t.co/1depPUsvpS

87

2K

208

418

612K

MG

@mic_gta

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users