A pacifist interested in wars. RTs on technological, political & military topics such as ML, AI safety, OSINT, MENA and space research 🚀, also Internet memes
@tikgiau September 2025 is quite recent and it's difficult to get an accurate measurement from such a short period, could you please benchmark much older models, preferably into 2024? Maybe you could also do a separate open-weights frontier
Introducing EdgeBench, a benchmark designed to study how agents learn from environments over at least 12~72-hour runs. We find that performance follows a log-sigmoid function of environment interaction time with high precision.
EdgeBench is built with three ingredients:
- 🌍 Real & Diverse: 134 real-world tasks across 6 task categories, spanning scientific problems, professional knowledge work, software engineering, optimization, formal math, and games.
- ⏳ Ultra-Long-Horizon: Each task supports 12–72 hours of agent work. Recorded human effort averages 57.2 hours.
- 🔁 Informative Feedback: Agents receive real-world feedback for continuous improvement.
After 38,000 hours of agent runs on EdgeBench, a scaling law for learning from environments emerges:
- 📈 As agents interact with task environments over time, their aggregate performance is precisely fit by a log-sigmoid function.
- 🧠 This phenomenon can be explained by an elegant theory of graph exploration.
We are releasing an initial 51 of the 134 tasks, together with the full evaluation framework, to help advance long-horizon agent research. Check our blog & paper for more findings!
Blog https://t.co/nMOzFsOhbT
Paper https://t.co/rZb3eWuvik
GitHub https://t.co/oemXd4UrFw
Dataset https://t.co/P4SQMrM47o
Details below 👇🧵
Six offline RL distillation losses. One base model. The exact same math rollouts.
Do they actually learn the same thing?
Turns out most "new" losses — RFT, DFT, offline GRPO — write nearly the same direction in weight space as plain SFT. Only DPO learns something genuinely different: near-orthogonal, its own loss basin, rewires what the network computes.
Reward-weighting changes the step size. DPO changes the direction.
Accepted at @icmlconf (MechInterp workshop)
paper + interactive companion 👇
https://t.co/lWy2GRaVPs
@OrdoSeclorum@Aviation_Intel Engineering is bottlenecked by the need to actually manufacture the designed thing IRL, control the quality of manufacturing (how closely it conforms to the drawings and calculations) and measure its actual performance. That's a slow process, and no amount of AI can change that
@OrdoSeclorum@Aviation_Intel Sample efficiency is a very old research direction, and if it was easy, the problem would have been solved already. The things you allude to might not come for 10+ years (there have not been fundamental architecture breakthroughs since transformer)
@OrdoSeclorum I have read the letter and his point is entirely different: don't invest in productivity of your noncompetitive US factory if Chinese sweatshops with their low labor costs will win anyway
@ekat2468@amazonmilkfrog@north0fnorth Which kid of meat substitute have you tried in burgers?
I needed to grab a quick bite in a mall food court yesterday, and I specifically went to Wendy's because they offer Beyond Meat which is tasty and tender
friendly reminder that of the corn grown in the US, roughly 38% of it goes to animal feed and 36% to ethanol fuel. only like 10% goes to human consumption, half of which is sweeteners like high fructose corn syrup
@Rebel44CZ@Aviation_Intel But also, economists have studied the so-called superstar effect, when imperfect substitutability & economy of scale lead to a winner-takes-all equilibrium: https://t.co/JVymVG0MH3 https://t.co/46wXq3S7f5
This is arguably the case with AI tech, which implies a natural oligopoly
@Rebel44CZ@Aviation_Intel For one, this's wrong because the frontier labs have controlled the Pareto frontier (GLM-5.2 & Kimi K2.7 aren't worth their money in the most economically important uses), & will probably continue doing so. The largest models are the most cost-effective when set to low reasoning
@jacobrgoering@Aviation_Intel Most modern applications of smartphones didn't exist when they were PDAs, and even in the first years of the iPhone.
Jevons' paradox describes how those applications only really appear when the tech becomes cheap and good enough, and thus widely adopted
@OrdoSeclorum@Aviation_Intel Absolutely, but the competitive areas you list have large margins allowing the users to pay the frontier labs back. Some other competitive areas don't enjoy that privilege
@_ueaj@paxaral@voidshapes@teortaxesTex This section is last present in Haiku 4.5 model card (October 2025) and disappears from Opus 4.5 in November. In the same month this doc page states the current state of affairs: https://t.co/XgjAn9Wt6o
@Simon__Grimm@pietergaricano Where does that chart citing "NIST Figure 1" come from? It looks very dubious, as the open-weights gap is mostly stable across all benchmarks. The stability of this gap becomes especially notable if one includes non-top-3 proprietary models such as Grok, Qwen Max and Muse Spark
My dad gave me a piece of advice that has served me well: never make a comment about, or a play on someone's name. You will never impress them. You will only sink in their esteem.
I met an old guy named John Lennon once, shook his hand and said "good to meet you Mr. Lennon." You could see the respect and joy at not hearing yet another damned comment about his name.
Thanks dad.
@RalphieRaccoon@ObcnSuperMatt@TylerM@sollidnuclear Long-range HVDC lines financed by other countries could have allowed to (re)export excess electricity further north and east, if France allowed them.
No matter the cause for AC aversion, the future increase in summer electricity demand is inevitable
@RalphieRaccoon@ObcnSuperMatt@TylerM@sollidnuclear If France doesn't plan to solve their pension problem by just letting thousands of elder people die in heat waves, they will have to install a lot of ACs and summer electricity demand will skyrocket