@pmddomingos Indeed - in the latter case, captured by Hilberg’s Law — Occam’s pertains to induction. LLMs operate transductively: https://t.co/THRewdnHy5
@getjonwithit This is all you need to see it: the connection of speed with algorithmic information (KC of the trained model), and then the scaling of speed (Hilberg’s Law)
It’s an evocative analogy, but unfortunately it doesn’t work: the prompt is not the “program”. It merely states the task; it doesn’t provide instructions to solve it. The algorithmic complexity of a datum is the description length of the program that generates it, not of the datum itself.
When viewing an LLM as a (stochastic) universal computer, which it is, the role played by the “program length” in a deterministic computer (such as a TM) is played by “proper time”, i.e. the ratio of the length of the chain-of-thought to the probability assigned by the model to that chain-of-thought.
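A minimal sketch of the ratio just described, not the paper's implementation. It assumes the probability of a chain-of-thought is the product of its per-token probabilities (the usual autoregressive factorization), and the function names `proper_time` and `log_proper_time` are hypothetical. The exact definition in the linked paper may differ.

```python
import math


def proper_time(token_logprobs):
    """Chain-of-thought length divided by the probability the model assigns to it.

    token_logprobs: natural-log probabilities of each token in the chain
    (assumed input format; hypothetical helper, not an API of any library).
    """
    if not token_logprobs:
        raise ValueError("chain-of-thought must contain at least one token")
    length = len(token_logprobs)
    # Assumption: log P(chain) is the sum of the per-token log-probabilities.
    log_prob = sum(token_logprobs)
    return length / math.exp(log_prob)


def log_proper_time(token_logprobs):
    """Same quantity in log space, which avoids underflow for long chains."""
    if not token_logprobs:
        raise ValueError("chain-of-thought must contain at least one token")
    # log(length / P) = log(length) - log P
    return math.log(len(token_logprobs)) - sum(token_logprobs)


if __name__ == "__main__":
    # Three tokens, each assigned probability 0.5 by the model.
    chain = [math.log(0.5)] * 3
    print(proper_time(chain), log_proper_time(chain))
```

Under this reading, a long chain that the model finds likely and a short chain that it finds unlikely can have the same proper time, which is why length alone is not the right cost.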
Once properly framed (which has been done in https://t.co/h3xLPeXfEs), the conclusion is precisely the opposite: the algorithmic complexity of human-generated data is not only extremely large but growing without bound, in accordance with Hilberg’s Law.
Congratulations to Charles Bennett for his Turing Award! https://t.co/UB30V1zDsr His work on the logical depth of a datum was the inspiration for the notion of “conceptual depth” of a trained LLM described in the paper AI Agents as Universal Solvers, which in turn was key to identifying the inversion of scaling laws also described there: https://t.co/h3xLPeXNu0
Specifically, the logical depth of a datum is the time it takes for a Turing Machine to generate it from a program whose length is not much greater than the datum’s Kolmogorov complexity. This makes sense for bit-strings but not for a trained generative model. The trained model contains (algorithmic) information, so it could be thought of as “data”, but how long it takes a Turing Machine to generate it is irrelevant. What matters is how long a model takes to generate a token-stream that solves a task. That is proper time. So the “conceptual depth” of a trained model is defined as (loss + proper time), not just computation steps. (Curiously, proper time is vaguely related to relativistic proper time, but that’s a stretch.)
Once the complexity of the trained model is evaluated using conceptual depth, scaling laws witness an inversion, where more data, compute, energy come at the expense of intelligence, not with it. This is also discussed in the AI Agents as Universal Solvers paper.
A trained model conflates memory and computation in the weights, so it can’t be thought of as an ordinary computer with separate memory, processor, tape, etc. The time it takes for an LLM to solve a task with chain-of-thought is stochastic, and so is the outcome. Proper time captures that, and the Strands Coding Framework https://t.co/hc9Dvgh5Qb allows users to “program” AI Agents for what they are: stochastic dynamical systems that can perform universal computation.
The view of LLMs as universal computers (https://t.co/xbqFcgVlBm, or read about it here https://t.co/h3xLPeXNu0) pits AI Agents against the principle of Occam’s Razor. The compression view of learning only captures statistical information: regularize in order to generalize. But AI Agents don’t generalize, and that’s their power! They memorize and reason. By doing so they achieve generality, not generalization. The governing principle is Hilberg’s Law, and the key theorem connects algorithmic information in the trained model (the bigger the better) to time (the shorter the better). This is not a bound but an equality: without a cost of time, you don’t need to learn (in fact optimal inference, à la Levin/Solomonoff, involves no learning). But by imposing a cost of time you are forced to accrue algorithmic information in the trained model. And since it is an equality, that is also the *only* way to accrue algorithmic information. Learning to reason is all about time!