There is no pre-training, post-training, or test-time training. There are only priors, updates, constraints, and compute budgets.
There is only TRAINING.
For the last several years, we have shipped the org chart into fundamental optimization science.
This feels like a new tension for the field: what capabilities should stay internal to the model, given that capacity is expensive, and what should be delegated to external modules.
The older version of this tension was about inductive bias, and it helped lead us toward Transformers with less built-in structural bias.
I want to add two points:
* We were developing this thread of work as my PhD topic under the terms external modularity (modules outside the model that collaborate with the model) and internal modularity (circuits responsible for built-in capabilities).
* The second point is that most people think of external modularity mostly in terms of tools (programmed modules), but it can involve neural modules too. A simple example is PEFT modules. Hence one of my first works was developing a module card so they could be included in such discussions: PEFT-Ref, https://t.co/rpn2aLxXIq
I started this around two months ago, and I’d love to collaborate with the community on pushing it further.
My broader view is that, if we want more robust synthetic data pipelines, we should rely more on explicit rubrics and correctness-by-construction checks.
That is the motivation behind FORTGEN: a coverage-driven synthetic benchmark for Fortran modernization with executable semantic oracles.
GitHub: https://t.co/Uddc6q61aL
Slides: https://t.co/i3pWpjcAOQ
@AliesTaha @gaoj0017 that depends on whether the methods are inherently hardware-specific. If TurboQuant is designed to be GPU-native then that's part of its value...and if the paper doesn't attribute the gains to the algo alone then no problem with this comparison either.
imo, it doesn’t seem like a big problem, the paper shows that larger D helps and makes the bottleneck less harmful. also, LLMs still train very well in practice, which may suggest that D is already large enough for language data (maybe because language is very low rank, so the effect isn’t as destructive as the norm numbers make it sound).
But first I would like to see proxies for gradient information loss other than the norm evaluated...the norm does not necessarily correspond to useful signal.
In an era where many frontier labs are converging towards conservative incremental approaches, it's heartening to see resources being directed into ambitious efforts and fresh ideas!
Continual learning has always been important, but people have not found effective ways to do it. Tweaking loss objectives produced tons of ML papers a few years back but cannot really achieve infinitely long context. Maybe we need more arch innovation like efficient attn and learnable memory.
most ppl don’t really think. they rearrange cached thoughts until they feel smart. true thinking is rare because it’s metabolically expensive (just like thinking models are computationally expensive).
Continual learning as a discipline seems to have catastrophically forgotten that it has been focused on catastrophic forgetting for a decade with virtually no progress. Time for some radically new ideas in that area.
New paper studies when spectral gradient methods (e.g., Muon) help in deep learning:
1. We identify a pervasive form of ill-conditioning in DL: post-activation matrices have low stable rank.
2. We then explain why spectral methods can perform well despite this.
Long thread
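For readers who have not met the term: a minimal sketch of stable rank, assuming the standard definition ‖A‖_F² / ‖A‖₂² (the thread does not spell out its definition, so treat this as an assumption). The function name `stable_rank`, the power-iteration estimate of the spectral norm, and the toy matrices are all illustrative and are not taken from the paper.

```python
import math
import random

def stable_rank(matrix, iters=200, seed=0):
    """Estimate ||A||_F^2 / ||A||_2^2 for a matrix given as a list of rows.

    The squared spectral norm is estimated by power iteration on A^T A,
    so the result is an approximation (slightly high if not converged).
    """
    rows, cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0.0:
        return 0.0  # convention chosen here for the all-zero matrix

    # Random unit start vector; seeded so the sketch is reproducible.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    v_norm = math.sqrt(sum(x * x for x in v))
    v = [x / v_norm for x in v]

    top_sq = 0.0
    for _ in range(iters):
        av = [sum(row[j] * v[j] for j in range(cols)) for row in matrix]           # A v
        w = [sum(matrix[i][j] * av[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0.0:
            break
        top_sq = w_norm  # for unit v, ||A^T A v|| approaches sigma_max^2
        v = [x / w_norm for x in w]
    return frob_sq / top_sq

# Identity: all singular values are equal, so stable rank is close to the dimension.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(stable_rank(identity))

# Toy stand-in for a post-activation matrix: positive entries with a large
# shared mean, which makes one direction dominate and stable rank fall near 1.
rng = random.Random(1)
toy = [[1.0 + 0.1 * rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20)]
print(stable_rank(toy))
```

The toy matrix only illustrates how a dominant mean direction drives stable rank down; whether real post-activations behave this way is the paper's claim, not something this sketch shows.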
Regardless of what direction the field takes, cutting-edge deep learning research will continue to require huge compute. The work is still fundamentally empirical: every promising idea demands careful ablations, stress-tests, and re-runs — even if experience, intuition, and taste help prune away many of the unnecessary experiments.
Yes... big labs love scaling laws because they offer predictability, which makes strategic and financial planning much easier. Once an org builds entire structures around the “scaling stack” (pre-training teams, post-training teams, eval teams, infra teams), it becomes harder to pivot toward radically different paradigms unless there’s a significant external shock.
The @ilyasut episode
0:00:00 – Explaining model jaggedness
0:09:39 – Emotions and value functions
0:18:49 – What are we scaling?
0:25:13 – Why humans generalize better than models
0:35:45 – Straight-shotting superintelligence
0:46:47 – SSI’s model will learn from deployment
0:55:07 – Alignment
1:18:13 – “We are squarely an age of research company”
1:29:23 – Self-play and multi-agent
1:32:42 – Research taste
Look up Dwarkesh Podcast on YouTube, Apple Podcasts, or Spotify. Enjoy!
some hypotheses for what “better pretraining” could mean
- integration with other training stages: i’m guessing they’re finally at a point where post-training perf (eg SWE-Bench) can be used as signal for pretraining eng decisions
- filtering: scaling approaches like influence functions for getting rid of datapoints that don’t help eval perf
- synthetic data: using rephrasing to upsample certain useful documents and make them more amenable to reasoning
- mixing: more principled & scalable approaches for determining mixing coefficients
- new data: purchasing and scanning more books, transcribing YouTube, buying private token collections like news articles
- smart packing: there are various ways to group documents into batches that work better, especially for long-context stuff
- systems: more data, more flops
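To make the "smart packing" hypothesis concrete, here is a minimal sketch of one possible approach: first-fit-decreasing bin packing of documents into fixed-length sequences. This is an assumption for illustration only; the list above does not say which packing method is meant, and the function name `pack_first_fit_decreasing` and its parameters are hypothetical.

```python
def pack_first_fit_decreasing(doc_lengths, max_len):
    """Group document indices into bins whose total token count is <= max_len.

    Documents are placed longest first, each into the first bin with room.
    This is a generic heuristic, not any lab's actual pipeline.
    """
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    bins = []       # each bin is a list of document indices
    remaining = []  # free token budget left in each bin
    for i in order:
        n = doc_lengths[i]
        if n > max_len:
            raise ValueError(f"document {i} has {n} tokens, more than max_len={max_len}")
        for b, free in enumerate(remaining):
            if n <= free:
                bins[b].append(i)
                remaining[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            remaining.append(max_len - n)
    return bins

# Five documents of 5, 3, 4, 2, and 6 tokens packed into sequences of 10 tokens.
print(pack_first_fit_decreasing([5, 3, 4, 2, 6], max_len=10))
# → [[4, 2], [0, 1, 3]]
```

Here both sequences are filled exactly, so no padding tokens are wasted. A real pipeline would also need to decide how attention is masked across document boundaries inside a packed sequence, which this sketch leaves out.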
The ladder of intelligence is the ladder of abstraction.
L1: Memorizing answers (no generalization)
L2: Interpolative retrieval of answers, pattern matching, memorizing answer-generating rules (local generalization)
L3: Synthesizing causal rules on the fly (strong generalization)
L4: Discovering general principles, metacognition (extreme generalization)
To achieve compounding AI you need to reach L4.