patterns I’ve noticed in great engineers:
• they think in systems, not parts
• they reduce problems before trying to solve them
• they obsess over fundamentals more than tools
• they can explain complex things without sounding clever
• they build small, test early, break things on purpose
• they respect physics, constraints, and reality
• they document their thinking, not just results
• they care about edge cases because that’s where truth hides
• they borrow ideas across disciplines without ego
• they have taste, and they know when something feels wrong
• they value reliability over novelty
• they keep learning long after they’re “qualified”
none of this is flashy.
none of this trends well online.
but this is what quietly compounds
into real mastery over time.
People with short timelines sometimes shrug off models’ inability to perform basic, economically useful tasks end-to-end by saying, "Oh but we haven't trained models to specifically do those things."
But this misses the point. Human workers are valuable precisely because we don’t need to build bespoke schleppy training loops for every small part of their job.
Every day, you have to do a hundred things that require judgment, situational awareness, and skills & context learned on the job. These tasks differ not just across different people, but from one day to the next even for the same person.
It is not possible to automate even a single job by just baking in some predefined set of skills, let alone all the jobs.
People will sometimes debate, "How much progress have we made so far between village idiot and AGI?" And I'm just thinking, what the fuck are you guys talking about? The models are currently so much dumber than the village idiot. Village idiots generate trillions of dollars in wages a year. These models generate $30B in revenue a year.
In fact, I think people are really underestimating how big a deal actual AGI will be because they're just imagining more of this current regime.
They're not thinking about billions of human-like intelligences on a server which can copy and merge all their learnings.
And to be clear, I expect this (aka actual AGI) in the next decade or two. That's fucking crazy!
The theoretical physics approach to neural nets was launched by @HopfieldJohn in this classic 1982 paper that introduced the "energy function" to associative memory models. https://t.co/HekdYdvvJc
There are three parts.
1. Fitting as large of a network and as large of a batch-size as possible onto the 10k/100k/1m H100s -- parallelizing and using memory-saving tricks.
2. Communicating state between these GPUs as quickly as possible
3. Recovering from failures (hardware, software, etc.) as quickly as possible
1. Fitting as large of a network and as large of a batch-size as possible onto the 10k H100s.
Parallelizing:
1. parallelize over batches
2. parallelize within layers (i.e. split a single layer across GPUs)
3. parallelize across layers (i.e. layers 1 to N are on GPU1, layers N+1 to N+10 are on GPU2)
Keep parallelizing until you are able to use all GPUs well, with maximum utilization.
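A minimal sketch of how the three kinds of parallelism can compose, assuming a hypothetical cluster laid out as a data x pipeline x tensor grid. The names `rank_to_coords` and `layers_for_stage` and the grid sizes are made up for illustration; real frameworks choose their own layouts.

```python
def rank_to_coords(rank, dp, pp, tp):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates.

    Assumed (hypothetical) layout: tensor-parallel ranks are adjacent,
    then pipeline stages, then data-parallel replicas.
    """
    assert 0 <= rank < dp * pp * tp
    tensor = rank % tp               # which slice of a layer this GPU holds
    pipeline = (rank // tp) % pp     # which block of layers this GPU holds
    data = rank // (tp * pp)         # which replica (batch shard) this GPU serves
    return data, pipeline, tensor


def layers_for_stage(stage, num_layers, pp):
    """Contiguous block of layer indices owned by one pipeline stage."""
    per_stage = num_layers // pp
    start = stage * per_stage
    # The last stage absorbs any remainder.
    end = num_layers if stage == pp - 1 else start + per_stage
    return range(start, end)


if __name__ == "__main__":
    dp, pp, tp = 4, 2, 8  # 64 GPUs in total
    for rank in (0, 7, 8, 63):
        print(rank, rank_to_coords(rank, dp, pp, tp))
    print(list(layers_for_stage(1, num_layers=10, pp=2)))
```

Putting tensor-parallel ranks next to each other is a common choice because that dimension communicates the most, but that is a design assumption here, not something the list above prescribes.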
Checkpointing / Compute vs memorize:
* You need to save certain terms from forward to compute the backprop (save_for_backward). However, if the network is sufficiently large, it is more profitable to free these terms in order to fit a larger batch-size, and recompute them again when you need them to compute the backprop.
* Tricks like FSDP discard parts of weights that are held in one GPU (to save memory), and ask for the shards of weights from other GPUs right before they need them.
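The compute-vs-memorize trade-off can be sketched on a toy scalar network. This is an illustration under assumptions, not how any framework implements it: each "layer" is `tanh(w * x)`, and `grad_checkpointed` and its `segment` parameter are hypothetical names. It keeps only one saved input per segment and recomputes the rest during backward.

```python
import math


def forward_layer(w, x):
    return math.tanh(w * x)


def backward_layer(w, x, grad_out):
    # d/dx tanh(w*x) = w * (1 - tanh(w*x)^2); needs the layer input x.
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)


def grad_full_save(weights, x):
    """Save every layer input during forward: most memory, no recompute."""
    saved = []
    for w in weights:
        saved.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, inp in zip(reversed(weights), reversed(saved)):
        grad = backward_layer(w, inp, grad)
    return grad, len(saved)


def grad_checkpointed(weights, x, segment):
    """Save only each segment's input; recompute the rest in backward."""
    checkpoints = []
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    peak_saved = len(checkpoints)
    for seg in reversed(range(len(checkpoints))):
        ws = weights[seg * segment:(seg + 1) * segment]
        # Recompute this segment's layer inputs from its checkpoint.
        inputs, h = [], checkpoints[seg]
        for w in ws:
            inputs.append(h)
            h = forward_layer(w, h)
        peak_saved = max(peak_saved, len(checkpoints) + len(inputs))
        for w, inp in zip(reversed(ws), reversed(inputs)):
            grad = backward_layer(w, inp, grad)
    return grad, peak_saved


if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    print(grad_full_save(weights, 0.3))
    print(grad_checkpointed(weights, 0.3, segment=4))
```

Both functions return the same gradient. The checkpointed version holds fewer values at its peak and pays for it with one extra forward pass per segment.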
2. Communicating state between these GPUs as quickly as possible
Communication overlap:
When you need to communicate among GPUs, try to start communication as soon as you can:
* Example: when the Nth layer is done with backward, all GPUs holding an Nth layer can all-reduce its gradients while the (N-1)th layer is still computing backward.
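A rough sketch of the overlap idea using plain Python threads. The `compute_backward` and `all_reduce` functions are stand-ins that just sleep, and all names here are hypothetical. Real systems overlap GPU compute with network collectives, not Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compute_backward(layer):
    time.sleep(0.05)  # stand-in for the backward pass of one layer
    return f"grad{layer}"


def all_reduce(grad):
    time.sleep(0.05)  # stand-in for the network collective
    return f"reduced({grad})"


def backward_sequential(num_layers):
    # Communicate only after each layer's compute, with nothing overlapped.
    return [all_reduce(compute_backward(layer))
            for layer in reversed(range(num_layers))]


def backward_overlapped(num_layers):
    pending = []
    # One background worker plays the role of the communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm:
        for layer in reversed(range(num_layers)):
            grad = compute_backward(layer)
            # Start communicating layer N now; the loop moves on to N-1.
            pending.append(comm.submit(all_reduce, grad))
        return [f.result() for f in pending]


if __name__ == "__main__":
    for fn in (backward_sequential, backward_overlapped):
        start = time.perf_counter()
        fn(4)
        print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

Both versions produce the same reduced gradients. The overlapped one hides most of the communication time behind the next layer's compute.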
Discover and leverage the underlying networking topology:
Communicating large amounts of state (gradients, optimizer state) across multiple nodes is complicated. With Sync SGD, you have to communicate this state in a burst, as quickly as you can.
We might have multiple layers of switches, RDMA (the ability to copy GPU memory directly to the NIC, bypassing CPU RAM entirely), and separate frontend and backend NICs (frontend connects to storage like NFS, backend connects GPUs to other GPUs in the cluster).
So, it's important to leverage all this info when running communication collectives like all-reduce or scatter/gather. All-reduce, for example, can be done in log(n) steps if you tree-reduce, and the constant factors, which depend on the type of fiber connecting one node to another in the network tree, matter a lot for overall time and latency.
Libraries like NCCL do sophisticated discovery of the underlying networking topology and leverage them when we run all-reduce and other collectives.
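To make the log(n) claim concrete, here is a small simulation. Note the swap: it uses recursive doubling (a butterfly pattern) instead of an explicit reduce-then-broadcast tree, and it ignores topology entirely. It is not how NCCL does it; the function name is hypothetical and the rank count is assumed to be a power of two.

```python
def all_reduce_recursive_doubling(values):
    """Simulate a sum all-reduce across len(values) ranks.

    Returns the per-rank results and the number of communication rounds.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two rank count"
    vals = list(values)
    rounds = 0
    step = 1
    while step < n:
        # Each rank i exchanges with partner i XOR step and adds what it gets.
        vals = [vals[i] + vals[i ^ step] for i in range(n)]
        step *= 2
        rounds += 1
    return vals, rounds


if __name__ == "__main__":
    print(all_reduce_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8]))
```

With 8 ranks every rank ends up with the full sum after 3 rounds, i.e. log2(8). Which partner is cheap to talk to in each round is exactly where the topology and constant factors come in.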
3. Recovering from failures (hardware, software, etc.) as quickly as possible
At 10k GPU scale, things fail all the time -- GPUs, NICs, cables, etc. Some of these failures are easy to detect quickly, some of them you can only detect because one node isn't replying back in time (say an NCCL all-reduce is stuck). We build various tools to monitor and detect fleet health, and remove failed nodes from the fleet as quickly as possible. This is quite hard.
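One simple form of "isn't replying back in time" detection is a heartbeat timeout. The sketch below is a toy, not the tooling described above: `FleetMonitor`, its methods, and the timeout value are all hypothetical, and timestamps are passed in explicitly to keep it deterministic.

```python
class FleetMonitor:
    """Track the last heartbeat per node and flag nodes that go quiet."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def stale_nodes(self, now):
        # A node is suspect if it has not reported within the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)

    def evict(self, node):
        # Remove a failed node from the fleet so the job can restart without it.
        self.last_seen.pop(node, None)


if __name__ == "__main__":
    mon = FleetMonitor(timeout_s=30)
    mon.heartbeat("node-a", now=0)
    mon.heartbeat("node-b", now=0)
    mon.heartbeat("node-a", now=40)   # node-b has gone silent
    for node in mon.stale_nodes(now=45):
        mon.evict(node)
    print(sorted(mon.last_seen))
```

A timeout only tells you that something is stuck, not why, which is part of what makes this hard in practice.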
Separately, at this scale you can get silent data corruption from memory bits flipping randomly (basic physics, with the probability amplified by the number of devices), and you suddenly see loss explosions for no reason other than this random phenomenon. These happen at small scale too, but so infrequently that you barely notice. This is very hard to detect beforehand in software. Some hardware has built-in circuitry that checksums results after computing them, so that if a bit flip occurs the hardware can throw an interrupt. H100s and previous NVIDIA GPUs don't have this feature.
To counter all these failures, you would want to save your model state as frequently and as quickly as you can; and when a failure occurs, you want to recover and continue as quickly as you can. Usually, we save model state really quickly to CPU memory in a separate thread and in the background we save from CPU memory to disk or remote storage.
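A minimal sketch of the "snapshot fast, persist in the background" pattern, using a plain dict as the model state and `pickle` as the on-disk format. Both are assumptions for illustration, as are the names `save_async` and `load`. Real training code would copy GPU tensors to CPU memory instead of calling `deepcopy`.

```python
import copy
import os
import pickle
import tempfile
import threading


def save_async(state, path):
    """Snapshot state in memory now, write it to disk in a background thread."""
    # The quick in-memory copy is the only part that blocks training.
    snapshot = copy.deepcopy(state)

    def _write():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        # Atomic rename so a crash never leaves a half-written checkpoint.
        os.replace(tmp, path)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        writer = save_async(state, path)
        state["step"] = 101   # training continues while the write is in flight
        writer.join()
        print(load(path)["step"])  # prints 100
```

The checkpoint reflects the state at snapshot time, not whatever training has mutated since, which is what you want when recovering.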
We also save model state in shards (this is torch.distributed's checkpointing feature), i.e. not every GPU needs to save all of the model weights; each GPU only needs to save a portion of weights -- and they can recover the other part of weights from other GPU shard checkpoints.
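The sharding idea, sketched with plain dicts standing in for model state. This is not the torch.distributed checkpointing API; `shard_keys`, `save_shards` and `load_from_shards` are hypothetical helpers that only show the bookkeeping.

```python
def shard_keys(keys, world_size):
    """Assign parameter names to ranks round-robin, in a stable order."""
    keys = sorted(keys)
    return [keys[rank::world_size] for rank in range(world_size)]


def save_shards(state, world_size):
    """Each rank 'saves' only its own slice of the parameters."""
    return [{k: state[k] for k in ks} for ks in shard_keys(state, world_size)]


def load_from_shards(shards):
    """Rebuild the full state by merging every rank's shard."""
    full = {}
    for shard in shards:
        full.update(shard)
    return full


if __name__ == "__main__":
    state = {f"layer{i}.weight": [float(i)] for i in range(5)}
    shards = save_shards(state, world_size=2)
    print([sorted(s) for s in shards])
    print(load_from_shards(shards) == state)
```

Each rank writes roughly 1/world_size of the state, so save time and per-node I/O shrink as the job grows.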
There's a big difference between solving a problem from first principles vs applying a solution template you previously memorized. It's like the difference between a senior software engineer and a script kiddie that can't code.
A script kiddie that has a gigantic bank of scripts might give you the illusion that they can program on their own -- until they encounter a problem for which they don't have the right script. And that's exactly what you see with LLMs. They're interpolative databases of millions of text-completion vector programs. They can do a lot, as long as they're in known territory. But give them something a bit unfamiliar, like an ARC task, and they fail.
Today, we’re launching Aya, a new open-source, massively multilingual LLM & dataset to help support under-represented languages. Aya outperforms existing open-source models and covers 101 different languages – more than double the number covered by previous models.
https://t.co/0WsC2i9C8a
Jeff Bezos on having a skeptical view of proxies and the problem with managing to metrics
“One of the things that happens in business is that you develop certain things that you’re managing to—a typical case would be a metric. And that metric isn’t the real underlying thing.”
To illustrate his point, he suggests a hypothetical example of a company that designates “customer returns per units sold” to be an important metric:
“The person who invented that metric and decided it was worth watching had a reason. And then when you fast forward five years, that metric is the proxy. In this case, it’s a proxy for customer happiness. But that metric is not actually customer happiness.”
He continues:
“Five years later, a kind of inertia can set in and you forget the truth behind why you were watching that metric in the first place. And the world shifts a little. And now that proxy isn’t as valuable as it used to be—or it’s missing something. You have to be on alert for that.”
You have to keep in mind that you don’t really care about the metric. What you care about is customer happiness, and the metric is only worth putting energy into and scrutinizing to the extent that it actually improves customer happiness.
“It’s very common and it’s a nuanced problem—especially in large companies—that [people are] managing to metrics that they don’t really understand. They don’t really know why [these metrics] exist. And the world may have shifted out from under them a little, and the metrics are no longer as relevant as they were when somebody ten years earlier invented them.”
You need metrics and can’t ignore them, but you have to make sure you really understand them and why they were invented in the first place.