nano

@nanulled

longtermism

United States

Joined October 2019

57 Following

2.9K Followers

3.4K Posts

nanulled retweeted

about 1 month ago

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

595

17K

2K

9K

2M

nanulled retweeted

about 2 months ago

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at https://t.co/GCdiMzk1Dl via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: https://t.co/drlDrxkYtp 🤗 Open Weights: https://t.co/T13Y8i7SDM 1/n

2K

46K

8K

10K

10M

nanulled retweeted

METR @METR_Evals

about 2 months ago

We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.

37

773

59

178

269K

3 months ago

I seriously think that openai started purposely hurting ml research capabilities with this model, it's literally worse at taste than 5.2 high. I understand the competitive advantage of withholding capabilities but still they should just admit it and not waste anyone's time.

0

4

0

2

376

3 months ago

5.4 xhigh is worse than 5.3 codex at ml research, running experiments, patching gated features and debugging inference and evals. It's maybe better at moonshooting proposals just like 5.2 high but it does not have a robust experimentation hygiene. Same with 5.4 pro vs 5.2 pro.

3

12

0

5

986

nanulled retweeted

Google DeepMind @GoogleDeepMind

4 months ago

Step inside Project Genie: our experimental research prototype that lets you create, edit, and explore virtual worlds. 🌎

975

34K

4K

17K

13M

nanulled retweeted

Google DeepMind @GoogleDeepMind

9 months ago

We’re announcing a major advance in the study of fluid dynamics with AI 💧 in a joint paper with researchers from @BrownUniversity, @nyuniversity and @Stanford.

178

5K

716

1K

1M

nanulled retweeted

Google DeepMind @GoogleDeepMind

10 months ago

What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt. From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵

812

13K

3K

4K

4M

10 months ago

Gemini 2.5 Deep Think Model Card:
it's not superhuman but similar to gold IMO model & “approaches human level” on stealth evals
more interested in learning “novel rl techniques that can leverage more multi-step reasoning,” (candidate: MARL with verification/voting for each step)

0

18

0

6

2K

10 months ago

The new stealth model, the Horizon Alpha, has the ability to think in cot, but you really have to try to get it to do so. Here is the COT it generated. It's very terse, and I see some O3 in its writing. I think it's safe to say it's an OpenAI open-source model.

1

35

1

2

2K

11 months ago

Good to see HLE's uselessness being confirmed ~3 months after this thread. Actually, it's not just useless, it's a harmful signal that somewhat slowed down progress, imo. https://t.co/LjqjzPLVBY

Andrew White 🐦‍⬛

11 months ago

HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7

19

598

86

174

128K

0

3

0

1

424

about 1 year ago

Benchmarks like Humanity’s Last Exam and Codeforces nerd-sniped researchers and could prevent AI labs from developing genuine AGI capable of performing real-world tasks.

1

21

0

7

2K

about 1 year ago

Yes, there would be differences in taste and preferences, but a horrible game or piece of software can be recognized as such by the majority. Vague objectives would be set for the AI to complete, and most humans would be able to verify whether those objectives were achieved fully or partially.

1

4

0

1

671

11 months ago

GDM achieved the same score at IMO as OpenAI but it will be accessible to ultra subscribers in a trusted program.

Google DeepMind @GoogleDeepMind

11 months ago

An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇 It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵

152

4K

681

665

1M

1

16

0

1

852

11 months ago

@sama @doomslide

1

5

0

0

5K

11 months ago

@brutalmog @polynoamial @OpenAI isn't that a polymarket prediction? I was talking about manifold one

1

1

0

0

33

11 months ago

@bennetkrause they've said it's a reasoning model, so a scratchpad with some form of memory maintenance mechanism that they've probably RL'd using a general reasoning breakthrough. I bet that it's not just a verifier; it would be much more bullish if it were a model itself, so I bet on that.

0

1

0

0

61

11 months ago

OpenAI’s reasoning model thinks for hours and got gold medal-level performance on the IMO. Progress is faster than most thought.

11 months ago

Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵

140

5K

510

2K

1M

6

120

3

12

7K

11 months ago

@kalomaze and embodied agi is: can it do it using mouse and keyboard or controller like a human would

0

5

0

0

244

11 months ago

@polynoamial Amazing, Noam. Q: was this IMO solved by an LLM, or a system of agents with MARL, or something even better?

2

3

0

0

1K

11 months ago

@zephyr_z9 @epicarism If I didn't want to disclose something, I would not say anything related to the multi-agent system. Perhaps he really wanted to share it but couldn't do it directly.

0

3

0

0

93

11 months ago

I think it's pretty safe to say that some form of RSI cycle has begun

11 months ago

8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.

43

2K

191

226

523K

2

28

1

4

3K

11 months ago

@epicarism @zephyr_z9 I know that. Pay attention to the wording: "models", not a model or reasoning LLM. They should've said that the IMO was solved by agents or a system, not by a "model".

1

2

0

0

129

11 months ago

@zephyr_z9 interesting, from what I've seen they claimed that's a reasoning model, not a system of agents

1

3

0

0

265
