New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at https://t.co/GCdiMzk1Dl via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: https://t.co/drlDrxkYtp
🤗 Open Weights: https://t.co/T13Y8i7SDM
1/n
We ran GPT-5.4 (xhigh) on our tasks. Its time-horizon depends greatly on our treatment of reward hacks: the point estimate would be 5.7hrs (95% CI of 3hrs to 13.5hrs) under our standard methodology, but 13hrs (95% CI of 5hrs to 74hrs) if we allow reward hacks.
I seriously think that openai started purposely hurting ml research capabilities with this model, it's literally worse at taste than 5.2 high. I understand the competitive advantage of withholding capabilities but still they should just admit it and not waste anyone's time.
5.4 xhigh is worse than 5.3 codex at ml research, running experiments, patching gated features and debugging inference and evals. It's maybe better at moonshooting proposals just like 5.2 high but it does not have a robust experimentation hygiene. Same with 5.4 pro vs 5.2 pro.
We’re announcing a major advance in the study of fluid dynamics with AI 💧 in a joint paper with researchers from @BrownUniversity, @nyuniversity and @Stanford.
What if you could not only watch a generated video, but explore it too? 🌐
Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt.
From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵
Gemini 2.5 Deep Think Model Card:
it's not superhuman but similar to gold IMO model & “approaches human level” on stealth evals
more interested in learning “novel rl techniques that can leverage more multi-step reasoning,” (candidate: MARL with verification/voting for each step)
The new stealth model, the Horizon Alpha, has the ability to think in cot, but you really have to try to get it to do so.
Here is the COT it generated. It's very terse, and I see some O3 in its writing.
I think it's safe to say it's an OpenAI open-source model.
good to see hle uselessness being confirmed after ~3 months of this thread
actually, it's not just useless it's harmfull signal that somewhat slowed down the progress imo
https://t.co/LjqjzPLVBY
HLE has recently become the benchmark to beat for frontier agents. We @FutureHouseSF took a closer look at the chem and bio questions and found about 30% of them are likely invalid based on our analysis and third-party PhD evaluations. 1/7
Benchmarks like Humanity’s Last Exam, codeforces nerd-sniped researchers and could prevent AI labs from developing genuine AGI capable of performing real-world tasks.
Yes there would be differences in taste and preferences but a horrible game/software can be seen and recognized by the majority. Vague Objectives would be set for ai to complete and most humans would be able to verify if said objectives were achieved fully or partially
An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇
It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵
@bennetkrause they've said it's a reasoning model so scratchpad with some form of memory maintenance mechanism that they've probably rled using a general reasoning breakthrough
i bet that it's not just a verifier, it would be much more bullish if it were a model itself so I bet on that
Today, we at @OpenAI achieved a milestone that many considered years away: gold medal-level performance on the 2025 IMO with a general reasoning LLM—under the same time limits as humans, without tools. As remarkable as that sounds, it’s even more significant than the headline 🧵
@zephyr_z9@epicarism If I didn't want to disclose something I would not say anything related to the multi-agent system
Perhaps he really wanted to share it but couldn't do it directly
8/N Btw, we are releasing GPT-5 soon, and we’re excited for you to try it. But just to be clear: the IMO gold LLM is an experimental research model. We don’t plan to release anything with this level of math capability for several months.
@epicarism@zephyr_z9 i know that
pay attention to the wording
"models" not a model or reasoning LLM
they should've said that imo was solved by agents or system not by a "model"