🚀 Introducing ChronoBERT: A Chronologically Consistent Language Model ⏳📜
Excited to share ChronoBERT, joint work with @LinyingLyu, @AsafManela, and @jimmywucm! Our new pre-trained language model is free of lookahead bias while delivering strong language understanding! 🧠✨
🌘 Meet Kimi K2.7 Code HighSpeed!
A high-speed mode of our latest open-source multimodal coding model, Kimi K2.7 Code.
⚡️ Up to 6× faster: Around 180 tok/s on coding tasks with median-length inputs, and up to 260 tok/s on shorter-context tasks.
🔷 Rolling out to Kimi Code Beta Program members, Kimi API developers, and Kimi Business users. (Access will remain limited for now due to capacity constraints.)
🔷 No invite needed. Anyone who joins the Beta Program has a chance to get access 👉 https://t.co/eKogsFGJt6
Open intelligence should be instant, affordable, and borderless. We'll continue improving the model and expanding access as more capacity becomes available!
🔗 Kimi Code: https://t.co/uvoSJKyGCY
🔗 API: https://t.co/mzWxjgGO1h
LLMs are no longer created w/ human data alone. They rely on other models to generate & filter data, evaluate outputs, & guide dev work.
So what is a modern LLM built on? Olmo 3 → 89 model + 183 dataset dependencies; Nemotron 3 → 273 + 560
We made ModSleuth to trace this. 🧵
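The thread doesn't say how ModSleuth works internally. As a hypothetical illustration only, counting a model's transitive model and dataset dependencies comes down to a graph traversal. The graph format, node names, and the `count_transitive_deps` helper below are invented for this sketch.

```python
from collections import deque


def count_transitive_deps(graph, root):
    """Count distinct model and dataset nodes reachable from root.

    graph maps a node name to a list of (dep_name, kind) pairs, where kind
    is "model" or "dataset". This format is hypothetical, not ModSleuth's.
    """
    seen = {root}  # never count the root itself, even if a cycle reaches it
    counts = {"model": 0, "dataset": 0}
    queue = deque([root])
    while queue:  # breadth-first walk over the dependency graph
        node = queue.popleft()
        for dep, kind in graph.get(node, []):
            if dep in seen:
                continue
            seen.add(dep)
            counts[kind] += 1
            queue.append(dep)
    return counts


# Hypothetical toy graph: an LLM trained on a corpus that another model filtered,
# plus a reward model that was itself trained on preference data.
toy = {
    "my-llm": [("web-corpus", "dataset"), ("reward-model", "model")],
    "web-corpus": [("quality-filter", "model")],
    "reward-model": [("preference-data", "dataset")],
    "quality-filter": [("web-corpus-seed", "dataset")],
}
print(count_transitive_deps(toy, "my-llm"))
```

Even this toy model has only two direct dependencies but five transitive ones, which is the kind of gap the Olmo 3 and Nemotron 3 counts point at.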
Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: https://t.co/Lh6PWae178
Today, the Stanford @DigEconLab launches the AI Economic Indicators, a new platform for tracking how AI is reshaping work, productivity, adoption, and the economy.
1/6
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
Its capabilities exceed those of any model we’ve ever made generally available.
We've known about LLM test-time compute scaling since @OpenAI o1.
Yet 2 years later, labs still report single-number evals for their models; safety orgs are still surprised when a scaffold does better by spending 100x more inference compute; and RSPs still ignore inference budget when deciding critical thresholds.
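One way to report a curve instead of a scalar is pass@k across sampling budgets. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples of which c are correct; the sample counts are hypothetical, and this is an illustration rather than a protocol the author proposes.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c were correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Report a curve over inference budgets instead of one scalar.
n, c = 100, 7  # hypothetical: 100 samples per task, 7 of them correct
for k in (1, 5, 10, 50, 100):
    print(k, round(pass_at_k(n, c, k), 3))
```

In practice the per-task estimates are averaged over a benchmark; plotting that average against k (or against tokens spent) shows how much headroom extra inference buys.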
Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.
🧩 US investors pay active asset managers about 17 times more in fees than informed traders could possibly earn by forecasting prices.
In a new paper with Ohad Kadan, we develop a simple, model-free measure of the total value of information in stock markets — an upper bound on what any active manager could capture by trading on superior information about future prices — and bring it to high-frequency US equity data.
The idea. Under mild assumptions in Kyle–Back models with competitive market making, the total value of information equals one observable object: the expected covariation between price changes and order flow, E∫dP·dY. Informed traders smooth their orders to disguise them, while noise trading is jittery. At high frequency, cov(dP, dY) therefore recovers exactly the losses of noise traders (and thus the gains of the informed), without needing to identify which trades were informed.
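As a minimal sketch, the sample analogue of E∫dP·dY for one stock over one window is the sum of products of consecutive price changes and changes in cumulative signed order flow. The input series below are hypothetical; trade signing, the choice of sampling grid, and how the paper aggregates across days and stocks are not modeled here.

```python
def realized_covariation(prices, order_flow):
    """Sum of dP * dY over consecutive sampling intervals.

    prices: price at each sampling time (e.g. dollars per share).
    order_flow: cumulative signed order flow (buys minus sells, in shares)
        at the same sampling times.
    """
    if len(prices) != len(order_flow):
        raise ValueError("prices and order_flow must be aligned")
    total = 0.0
    for i in range(1, len(prices)):
        d_p = prices[i] - prices[i - 1]          # price change over the interval
        d_y = order_flow[i] - order_flow[i - 1]  # net signed volume over the interval
        total += d_p * d_y
    return total


# Hypothetical intraday snapshots for one stock.
print(realized_covariation([10.0, 10.1, 10.05, 10.2], [0, 100, 50, 150]))
```

With prices in dollars and order flow in shares, the result is in dollars, which is what lets the measure be compared directly with fees.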
The numbers. Using intraday TAQ data on US equities, September 2003 to December 2024:
- Average value of information per stock: about $3.5 million per year
- Aggregate value of information: only 0.04% of market capitalization
- Investors pay active managers: about 0.67% of market cap each year (French 2008)
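The "about 17 times" in the opening line follows directly from the two market-cap shares above:

```latex
\frac{\text{fees paid to active managers}}{\text{aggregate value of information}}
  \approx \frac{0.67\%}{0.04\%} = 16.75 \approx 17
```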
The attached figure shows the value of information over time.
Validation. The measure behaves the way a value of information should. It rises in turbulent times, is higher for large, growth, and momentum stocks, jumps more than 3-fold on earnings announcement days, and is robust to trade-signing, sampling frequency, and the inclusion of penny stocks.
The puzzle. In any reasonable model (e.g. Garleanu-Pedersen 2018), the value of information equals management fees plus the cost of searching for good managers — so fees can be at most as large as the value of information. Our estimate of the value of information is about an order of magnitude smaller than the fees investors actually pay. We work through the candidate explanations — non-competitive market makers, risk aversion, partial information, behavioral mistakes, non-informational services — and most either deepen the puzzle or leave it unresolved.
One reading: US equity markets may be so informationally efficient that there is little left for the informed to extract — and most of what investors pay for active management is not paying for information at all.
The wedge between the value of information and the cost of seeking it is a target for future theory.
This works really well btw: at the end of your query, ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.
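A minimal Python sketch of the viewing half of that workflow, assuming you have already copied the model's HTML reply into a string: save it to disk and open it with the standard-library `webbrowser` module. The helper names `save_html` and `open_in_browser` are hypothetical.

```python
import webbrowser
from pathlib import Path


def save_html(html: str, path: str = "response.html") -> Path:
    """Write the model's HTML reply to disk and return its absolute path."""
    out = Path(path).resolve()  # as_uri() below requires an absolute path
    out.write_text(html, encoding="utf-8")
    return out


def open_in_browser(path: Path) -> bool:
    """Open a local HTML file in the default browser; True if one was launched."""
    return webbrowser.open(path.as_uri())
```

Usage: `open_in_browser(save_html(reply))`, where `reply` holds the HTML the model returned.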
More generally, imo audio is the human-preferred input to AIs, but vision (images/animations/video) is the preferred output from them. Roughly a third of our brain is a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage of this:
1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations
Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive video generated directly by a diffusion neural net. There are many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral https://t.co/z21CP5iQfu
There are also improvements necessary and pending on the input side. Neither audio, text, nor video alone is enough; e.g. I feel a need to point/gesture at things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.
TLDR The input/output mind meld between humans and AIs is ongoing, and there is a lot of work to do and significant progress to be made well before jumping all the way to neuralink-esque BCIs and all that. As for what's worth exploring at the current stage, hot tip: try asking for HTML.
One of the best posts on AI compute I’ve read in a long time. It explains from a technical perspective why semis stocks have become 40% of the market index weight, and lays out a roadmap for the tech evolution ahead.
One incredible (and unrelated) fact I learned a couple years ago that stuck with me was that some 90% (could be lower now) of the energy consumed by AI isn’t spent on compute at all but on shuttling the model weights back and forth btwn the GPU and memory. This post explains it using the analogy of airport shuttles btwn gates and airplanes, and covers the strong complementarity btwn GPU throughput and HBM capacity and bandwidth.
The AI models have become powerful and genuinely useful for most everyday use incl enterprise production that requires high accuracy. We are now witnessing an explosion of “inference” or the actual usage or deployment of the models via AI agents or AI calling AI. The artificial intelligence is finally good enough in most contexts that we are going to see an explosive growth of the consumption of “intelligence”.
As long as we remain in the current architecture of LLM transformers running inference on discrete GPUs + off-chip HBM, AI compute will remain structurally “memory bound”. In fact, barring big architectural changes, we can anticipate demand persistently outstripping supply, turning a historically strongly cyclical industry into one that’s, well, not cyclical anymore, an assertion that has drawn ridicule.
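Why decode ends up memory bound can be seen from arithmetic intensity, the FLOPs performed per byte moved to or from memory. A matrix multiply whose intensity falls below the accelerator's ratio of peak FLOPs to memory bandwidth (on the order of hundreds of FLOPs per byte on current datacenter GPUs) is limited by memory rather than compute. The layer sizes below are hypothetical, and the byte count assumes each operand crosses memory exactly once.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes 2 FLOPs per multiply-add and that the two inputs and the output
    each cross memory exactly once (no cache reuse modeled).
    """
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


# Decode, batch size 1: one token's activations against a full weight matrix.
decode = gemm_arithmetic_intensity(m=1, k=4096, n=4096)
# Prefill: thousands of prompt tokens reuse the same weights per memory read.
prefill = gemm_arithmetic_intensity(m=4096, k=4096, n=4096)
print(f"decode ~{decode:.2f} FLOP/byte, prefill ~{prefill:.0f} FLOP/byte")
```

At batch size 1, decode does about one FLOP per byte of weights read, so the GPU mostly waits on memory. Batching more concurrent requests raises `m` and the intensity with it, which is why the HBM left over for each request's KV cache ends up capping throughput.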
Now, “nature always finds a way”. And shortage is always the mother of innovation. We are now seeing a wide range of attempts to overcome the structural memory bound, on both the hardware and the software fronts, notably on-chip static RAM (SRAM), such as NVIDIA’s integration of Groq, Amazon + Cerebras, and Google’s TPU (and TurboQuant).
Beyond that, efforts are being made to adopt optical interconnects to further disaggregate memory, allowing for more efficient compute, as well as potentially to do photonic compute directly in memory. These are ofc developments and possibilities that are further out. For now, the dynamic of heavily memory-bound compute that @fi56622380 lays out still strictly dominates.
All of the above relates to the supply side ofc and its implicit key assumptions. Importantly, there’s also the demand side. There’s a presumption among AI hardware enthusiasts that we are in a paradigm of persistent “supply shortage”. That may well be the case, as demand and adoption for machine intelligence grow exponentially while supply is constrained by supply chain bottlenecks.
However, it’s worth considering the ways in which demand may indeed fluctuate. After all, oil demand grew secularly but oil prices also fluctuated significantly, notwithstanding the big differences between oil discovery/refining and GPU/CPU/memory supply. Demand for AI could slow, or grow more slowly than expected, if enterprise adoption ran into frictions or if the ROI proved initially elusive, esp if the prices of AI inference keep rising to reflect their true economic costs, let alone the economic value they create. The demand curve slopes downward, and that logic hasn’t been tested yet. That’ll have to be for another Ackman-esque long post on a different sleepless morning. I hate pollen.
April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4
All are now added to the LLM Architecture Gallery.
More details once I am fully back in May!
DeepSeek V4 hits it out of the park and addresses the HBM shortage:
DeepSeek proves why it is such a fundamental research lab.
In addition to exceeding Opus 4.6 on Terminal Bench and virtually matching it on other performance metrics, the most notable advancement is captured in this statement:
"In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2"
To understand the significance of this point, consider the diagram below, which shows the memory layout for Prefill and Decode nodes.
If you implement Decode with data and expert parallelism (DEP16) across 16 GPUs on a GB200 or GB300 NVL72 rack with DeepSeek V3.2, you are left with 104 GB or 176 GB of HBM per GPU, respectively. Here we assume the MoE parameters are in NVFP4.
The remaining HBM per GPU dictates how large a batch size you can use for inference, which determines how many concurrent requests you can serve.
Consider GB300 with 176GB left:
1. For 128K context, you need 4.45 GB of HBM for KV cache per request, and you can serve only 36 concurrent requests.
2. For 256K context, you need 8.90 GB of HBM for KV cache per request, and you can serve only 18 concurrent requests.
3. For 512K context, you need 17.80 GB of HBM for KV cache per request, and you can serve only 9 concurrent requests.
4. For 1M context, you need 35.60 GB of HBM for KV cache per request, and you can serve only 4 concurrent requests.
You see the point. Now imagine you actually needed 10 times less KV cache at 1M!
That basically lets you serve 10 times more requests with the same resources. Recall that Decode is memory bound, not compute bound, unlike Prefill.
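A back-of-the-envelope sketch of this arithmetic, assuming KV cache grows linearly with context (4.45 GB per request at 128K, as in the post) and that all 176 GB of free HBM on a GB300 goes to KV cache. Dividing directly gives 39 and 19 requests at 128K and 256K, slightly more than the post's 36 and 18, so the post likely keeps some headroom that this sketch ignores. The V4 line applies the quoted 10% KV-cache figure at 1M only; the helper names are hypothetical.

```python
import math

KV_GB_AT_128K = 4.45  # per-request KV cache for DeepSeek V3.2 at 128K, from the post
FREE_HBM_GB = 176     # GB300 HBM left per GPU after weights, from the post


def kv_cache_gb(context_tokens, gb_at_128k=KV_GB_AT_128K, kv_fraction=1.0):
    """Per-request KV cache, assuming linear growth with context length."""
    return gb_at_128k * (context_tokens / (128 * 1024)) * kv_fraction


def max_concurrent_requests(free_hbm_gb, kv_gb_per_request):
    """Requests whose KV caches fit in free HBM (no activation headroom modeled)."""
    return math.floor(free_hbm_gb / kv_gb_per_request)


contexts = [("128K", 128 * 1024), ("256K", 256 * 1024),
            ("512K", 512 * 1024), ("1M", 1024 * 1024)]
for label, tokens in contexts:
    print("V3.2", label, max_concurrent_requests(FREE_HBM_GB, kv_cache_gb(tokens)))

# V4 at 1M: the tech report quotes 10% of V3.2's KV cache.
v4_1m = kv_cache_gb(1024 * 1024, kv_fraction=0.10)
print("V4 1M", max_concurrent_requests(FREE_HBM_GB, v4_1m))
```

Under these assumptions the 1M-context count goes from 4 to 49 requests per GPU, in line with the roughly tenfold gain described above (rounding down to whole requests makes the ratio look slightly larger).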
This is probably the most important contribution of DeepSeek V4.
@teortaxesTex @jukan05 @zephyr_z9
Big congratulations to DeepSeek V4 Pro for reaching Opus 4.6 level performance.
1M context, stronger agentic coding, and frontier reasoning in an open model package.
The open source race just got a lot more intense.
https://t.co/bl3kEcO3BH
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
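For a mixture-of-experts model, per-token compute roughly tracks active parameters, while the weights that must be held in memory track total parameters. A quick sketch of the active share implied by the announced counts (reading 1.6T as 1,600B):

```python
# Parameter counts from the announcement, in billions (total, active).
models = {
    "DeepSeek-V4-Pro": (1600, 49),
    "DeepSeek-V4-Flash": (284, 13),
}


def active_fraction(total_b, active_b):
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} of weights active per token")
```

Both models touch only a few percent of their weights per token, which is where the cost-effectiveness comes from on the compute side. The full weight set still has to be resident across the serving GPUs, which is why memory stays the binding constraint.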
Try it now at https://t.co/GCdiMzk1Dl via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: https://t.co/drlDrxkYtp
🤗 Open Weights: https://t.co/T13Y8i7SDM
1/n