Well, I agree re: MSFT, but if a comparatively broke open lab (DeepSeek) serves tokens 3-15x cheaper (83x cheaper if you count cache hits) at the same quality as an Anthropic model (a lab we're supposed to assume has the best engineers in the West, along with OAI), then we're also supposed to assume Anthropic's margins are even higher than the factors [3, 15, 83] suggest, and that's even with credible people claiming DS still makes a profit.
Not arguing re: bubble, not interested in that, but it's a big stretch to think all the geniuses at the big labs haven't managed to collect massive profits from their current offerings, seems to me.
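To make the arithmetic behind the [3, 15, 83] factors concrete, here is a rough sketch. It rests on two assumptions that are not established facts: DeepSeek at least breaks even at its own price, and the pricier lab's per-token serving cost is no higher than DeepSeek's. Under those assumptions, a lab charging k times more has a gross margin of at least 1 - 1/k. The function name is made up for illustration, and this covers serving cost only (no training, R&D, or salaries).

```python
def implied_margin_floor(price_ratio: float) -> float:
    """Lower bound on gross margin for a lab charging `price_ratio` times
    more than a competitor that breaks even at its own price.

    Assumption (hypothetical, not from any lab's financials): the pricier
    lab's serving cost per token is no higher than the cheaper lab's price.
    """
    if price_ratio < 1:
        raise ValueError("price_ratio must be >= 1")
    # cost <= competitor price = own price / ratio
    # so margin = 1 - cost / price >= 1 - 1 / ratio
    return 1.0 - 1.0 / price_ratio


for k in (3, 15, 83):
    print(f"{k}x pricier -> gross margin of at least {implied_margin_floor(k):.1%}")
# 3x pricier -> gross margin of at least 66.7%
# 15x pricier -> gross margin of at least 93.3%
# 83x pricier -> gross margin of at least 98.8%
```

If the cheaper lab is actually profitable rather than just breaking even, the real floor is higher still, which is the point of the post above.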
If they ran all the tools used when solving GDPval, it would represent real-life use, right?
That’s how I would do it at least. Maybe they’re setting a new metric and that’s the important part now, which Artificialanalysis should pick up to turn into a realistic workload.
Maybe this is all about KV weights sitting in GPU memory and taking up HBM. CPUs have to optimize for latency because the main cost driver of inference is GPU usage.
The new trend is that nobody ends up investing as much into inference efficiency as the people who developed a given model (also due to RL), meaning they end up having the best inference code, meaning they end up having the cheapest API and best ability to serve a coding plan/sub.
Assuming a large enough gap, open source and small model development becomes an advertising strategy. Best exemplified by DeepSeek at the moment.
@Laz4rz think I liked Polar more because they don't require a subscription (everyone else does) but not sure that's still true
if Amazfit turns out to be good that's a game changer though
https://t.co/vaWjpozKsr
- X API for access to posts
- AirTable for recording already-annotated posts
- Openrouter for LLM inference
As you can see, the values used for automated judgement are nicely shown in the prompts.
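A minimal sketch of how the three components listed above might fit together. This is not the author's actual harness: every function and parameter name is hypothetical, and the external services (X API, AirTable, Openrouter) are passed in as plain callables so that no real client signatures are assumed.

```python
from typing import Callable, Iterable


def run_note_pipeline(
    fetch_posts: Callable[[], Iterable[dict]],   # stand-in for the X API
    already_annotated: Callable[[str], bool],    # stand-in for an AirTable lookup
    draft_note: Callable[[str], str],            # stand-in for an LLM call via Openrouter
    record: Callable[[str, str], None],          # stand-in for an AirTable write
) -> list[str]:
    """Draft a note for every post not yet annotated; return the ids handled."""
    handled = []
    for post in fetch_posts():
        post_id, text = post["id"], post["text"]
        if already_annotated(post_id):
            continue  # skip posts that were already recorded
        record(post_id, draft_note(text))
        handled.append(post_id)
    return handled


# In-memory stand-ins, only to show the control flow.
seen = {"1": "old note"}
posts = [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}]
handled = run_note_pipeline(
    fetch_posts=lambda: posts,
    already_annotated=lambda pid: pid in seen,
    draft_note=lambda text: f"context for: {text}",
    record=seen.__setitem__,
)
print(handled)  # ['2']
```

Keeping the services behind callables means the dedup-then-draft loop can be tested without network access.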
Much to be learned from effective truth-seeking scaffolds: an agent continuously learning via RL from this process might serve as a reward-hacking-resistant verifier for models currently being trained.
World-class forecaster (and fellow Nathan) develops a harness that auto-researches context for X posts, where the optimization target is one of the best: perceived helpfulness (link below)
Once an LLM is "helpful to the public", it can receive real-time human signal and thereby receive "RLHF for free". Worth studying.
AI Note-writer Progress
Our note-writer (wholesome-raspberry-stilt) has written community notes on X with 47M views.
The cost per helpful note is about $7. You can look at our notes here:
https://t.co/zRcusOyYry