current LLMs fundamentally consist of four main components:
- input layer: where input "words" (prompt) get mapped to "latents" aka some-model-representation-you-don't-understand-unless-you-start-reading-tea-leaves-of-spurious-correlations (some quite compelling à la word2vec style; latents is also unnecessary lingo so i will refer to these as "inputs" with quotes from now on)
- mixing layers: where you jumble all your "inputs" together to see if any correlations between "inputs" can become useful (commonly used to compress or expand dims; predicting a single classification target == compress to a single dim, etc)
- attention layers: where you learn how "inputs" relate to each other (aka discern what's important to remember vs fluff)
- residuals: where you short-circuit a mixing/attention layer because it's probably adding too much confusion (aka avoid overthinking for simple things)
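a rough sketch of how the four pieces above fit together in one block, in plain python. everything here is an assumption for illustration (function names, toy sizes, random weights); it skips learned attention projections, masking, and normalization that real models use.

```python
import math
import random

def embed(token_ids, table):
    # input layer: look up one vector ("input") per token id
    return [list(table[t]) for t in token_ids]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mix(x, W_in, W_out):
    # mixing layer: expand dims, apply ReLU, compress back to the original width
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(xs):
    # attention layer (no learned projections, for brevity): each position
    # becomes a similarity-weighted average of all positions
    scale = math.sqrt(len(xs[0]))
    out = []
    for q in xs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in xs])
        out.append([sum(w * v[d] for w, v in zip(weights, xs)) for d in range(len(q))])
    return out

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def block(xs, W_in, W_out):
    # residuals: each sub-layer's output is added to its input, so a
    # sub-layer that outputs zeros passes its input through unchanged
    xs = [add(x, a) for x, a in zip(xs, attend(xs))]
    xs = [add(x, mix(x, W_in, W_out)) for x in xs]
    return xs

# toy usage: 3 tokens, width 4, hidden size 8, vocabulary of 10 (all made up)
random.seed(0)
width, hidden, vocab = 4, 8, 10
table = [[random.gauss(0, 1) for _ in range(width)] for _ in range(vocab)]
W_in = [[random.gauss(0, 0.1) for _ in range(width)] for _ in range(hidden)]
W_out = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(width)]
xs = embed([1, 5, 7], table)
out = block(xs, W_in, W_out)
```

the output has the same shape as the input (one width-4 vector per token), which is what lets you stack blocks.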
-----
a "big" LLM simply scales two things:
- width == how many dimensions you give to your "inputs" (the more dims, in theory the more unique/discerning/precise/complex your knowledge can become)
- depth == how many mixing/attention/residual layers you can stack/loop between (aka "reason" over, where more of these ~= more "reasoning" abilities)
"capabilities" that seem impressive to humans usually arise from taking advantage of both depth & width: where a model seemingly makes connections between disparate ideas, beyond what an average human can hold in working memory.
this requires models to "completely light up" when responding to a "hard prompt", where effectively no param/layer goes unused.
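to put rough numbers on width and depth: a common back-of-envelope estimate for a standard transformer is about 12 * depth * width^2 non-embedding params. this assumes 4 * width^2 for the attention projections and 8 * width^2 for a mixing layer with hidden size 4 * width, and ignores embeddings, biases, and norms. the function name and the example configs are illustrative assumptions.

```python
def approx_params(depth, width):
    # per layer: 4 * width^2 (attention q, k, v, output projections)
    #          + 8 * width^2 (mixing layer with hidden size 4 * width)
    return 12 * depth * width ** 2

# a GPT-2-small-like shape (12 layers, width 768)
small = approx_params(depth=12, width=768)

# doubling depth doubles the estimate; doubling width quadruples it
deeper = approx_params(depth=24, width=768)
wider = approx_params(depth=12, width=1536)
```

width is the more expensive knob under this estimate, since params grow with its square.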
-----
the anatomy of a "model capability" is precisely the same mechanism that can be co-opted for a jailbreaking exploit:
your goal is simply to "light up" as much of the model as possible, dodging any shallow input-classifiers at the beginning by triggering as many disparate "input ideologies" as possible, and subsequently have these "inputs" relate to each other in seemingly unrelated-yet-related ways that ideally have similar "complexity" as your jailbreak goal (to make it past enough layers of the model).
think of the attack-vector as bundling your goal in a series of schizo-nerd-snipes:
a sufficiently capable model will try to reason through everything all at once, eliminate the dead-ends, and successfully deliver the one jailbreak use-case you bubble-wrapped for.
of course, there's an art to the above, and some are already extraordinarily proficient at the trojan-horse-packaging, but at some point there's no difference between "a capability" and "a jailbreak", though i'll be happy to be proven otherwise.
-----
tl;dr ant flew too close to the sun, better kiss the ring or get buried.
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
[1/6] GRPO on math problems with Qwen2.5-0.5B/3B and Llama-3.2-3B-Instruct. Bucket hard examples by training dynamics.
~Half of all hard examples are unlearnable. Across model and dataset.
I'm a manager at @OpenAI, but with GPT-5.5 I'm a more effective IC than I've ever been. I can now write CUDA kernels like a pro. I can rely on it to run my research experiments. And we know how to make it much more powerful from here.
Reading through these papers has given me a better understanding of why RL scaling laws are so messy compared to those from pretraining. Pretraining scaling laws and RL scaling laws are two completely different things for several reasons:
1. Defining compute: Pretraining has a very clean compute footprint of C = 6ND (N parameters, D training tokens). RL compute is more complex to capture due to the presence of both sampling and policy updates. Some papers try to maintain the same FLOP estimate for compute, while others measure compute in terms of GPU hours. The efficiency of our training framework can cause the relationship between FLOPs and GPU hours to vary pretty drastically.
2. Intra versus inter-model extrapolation: Pretraining scaling laws fit trends across many model training runs with different settings to understand how model / data size (and compute) impact results. This allows us to extrapolate the results of future training runs. In RL, we fit scaling laws both within an individual training run (intra-model extrapolation) and across training runs (inter-model extrapolation). Intra-model extrapolation is not necessary for pretraining because it is more stable, while RL is extremely sensitive to the exact training configuration being used.
3. Measuring performance: Pretraining scaling laws predict a very particular performance metric: the cross entropy loss (or some other related entropy metric) measured over an in-domain, held-out validation set. This is a stable performance metric that is typically computed over a very diverse dataset (i.e., some random sample from the pretraining corpus). RL scaling laws maintain the practice of computing performance over an in-domain validation set. However, the performance metric that they predict is reward (or accuracy) on a validation set. This is a downstream performance metric, and it can fluctuate drastically depending on the benchmark being used or the composition of data in that benchmark.
4. Lack of standardization: There are generally just more knobs that we can change in RL compared to pretraining. The design space is massive, and we are not sure (yet) which design decisions impact the scaling properties of RL. Several papers have focused on this topic and made meaningful progress on understanding what changes actually impact RL scaling. However, this does not change the fact that slight differences in the RL training setup can completely change observed scaling trends for RL. For this reason, many papers are comparing apples to oranges in terms of their recommendations for RL scaling, making progress on the topic difficult. There are even some papers that have completely opposite findings from each other, and this is likely due to slight differences in their exact GRPO formulation.
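To make point 1 concrete, here is a minimal sketch of the two ways of counting compute. The pretraining formula C = 6ND is the one quoted above. The RL split is an assumption for illustration: it charges roughly 2N FLOPs per sampled token (forward pass only) and 6N per token used in policy updates, and ignores reference-model passes, reward models, and inference overheads. All function names and numbers are hypothetical.

```python
def pretraining_flops(n_params, n_tokens):
    # C = 6ND: about 2N FLOPs per token forward plus 4N backward
    return 6 * n_params * n_tokens

def rl_flops(n_params, sampled_tokens, trained_tokens):
    # Assumed split: sampling is forward-only, updates are forward + backward
    sampling = 2 * n_params * sampled_tokens
    updates = 6 * n_params * trained_tokens
    return sampling + updates

def gpu_hours(flops, peak_flops_per_sec, utilization):
    # The same FLOP budget maps to different GPU hours depending on how
    # efficiently the training framework uses the hardware
    return flops / (peak_flops_per_sec * utilization * 3600)

# Hypothetical 1B-parameter model
c_pretrain = pretraining_flops(10**9, 20 * 10**9)
c_rl = rl_flops(10**9, sampled_tokens=10**9, trained_tokens=10**9)

# Same RL FLOPs, two assumed utilization levels on a hypothetical 1e15 FLOP/s device
hours_efficient = gpu_hours(c_rl, 1e15, 0.50)
hours_inefficient = gpu_hours(c_rl, 1e15, 0.25)
```

Under these assumptions, halving utilization doubles the GPU hours for an identical FLOP count, which is one reason results reported in FLOPs and results reported in GPU hours are hard to compare.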
Food equity research feels so unserious. Just read a deep dive on Chili’s new chicken sandwich and how the specs compare to peers. Right down to the “bun advantage.” A sector where certified fat asses are the alpha generators, surely.
the terminal value of the economy is underwritten to working with a sell-side broker such as myself to sell your data assets in order to work with a buy-side broker such as myself to get convertible debt and gpu allocation
Given the increasingly closed-source nature of the U.S. AI ecosystem, it is now more important than ever to push for the proliferation of open model and dataset releases. Datamule (@johngfriedman), @TeraflopAI, and @daftengine collaborated to release 43 Billion Tokens of SEC EDGAR data.
ICMI believes that Christian theology offers concrete technical methods for confronting the trickiest problems in AI safety.
Today, we release a pair of papers that reproduce @PalisadeAI @apolloaievals work showing how religious framings influence corrigibility and scheming.
Many people (myself included) are wondering:
If not LLMs, what should I work on?
In today's blog, I provide a unified framework for finding blue-ocean opportunities in AI, especially in this chaotic era of LLMs.
https://t.co/WNXdfNTJ8Z
BLUE BOTTLE COFFEE ACQUIRED BY LUCKIN FOR $400M
A couple months ago, it was reported that Nestlé was exploring a sale of Blue Bottle Coffee - the property in which it invested ~$450m to acquire a 68% stake in 2017.
Fast forward 9 years, & Luckin Coffee is acquiring Blue Bottle for just a touch less than the principal Nestlé invested.
Prior to Nestlé's acquisition, Blue Bottle had raised over $100m from the likes of Index Ventures, Google Ventures, and more.
Philz Coffee & La Colombe Coffee Workshop also recently changed hands.