Theory of RSI on LLMs through ordinary RL (Kolmogorov Compressor)
Compressing data is Shannon-bounded: we cannot beat the entropy. Compressing programs is not, because you can improve the machine that runs them, which makes shorter programs possible, which in turn creates pressure for a better machine. There is no fixed entropy floor when the instruction set (the model) co-evolves with the programs (the representations).
Hypothesis. RSI in LLMs emerges when a model learns a compressed internal state/language that improves the future operation of the same compressor-inferencer loop, verified externally, under honest accounting of memory, compute, model size, and generalization.
Research. Can forced random-vocabulary exploration plus varentropy-gated reward make an autoregressive model discover compact non-English codebooks that preserve reconstruction and downstream task performance better than English summaries under the same token budget?
=== Methodology ===
Uppercase letters will be used for contexts and rollouts
1. C: Compressor
2. D: Decompressor
3. V: Verifier
And lowercase for representations and outputs
1. s: raw dataset sample
2. c: compressed representation
3. d: decompressed representation
4. cc: continual representation
There are two core patterns to train
1. C→D→V: classic compression on uncompressed input
2. C→C→V: rewrite of past compression from training history (defragmentation)
The loss function penalizes:
- Length of c in tokens (MDL pressure)
- Deviations and inaccuracies found by V when it compares s and d, taking (s, d) as input (C/D alignment pressure)
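The two penalties above can be combined into a single scalar. The following is a minimal sketch, not the exact loss used in our trials: the function name, the two weights, the normalization of length by the size of s, and the assumption that V returns a score in [0, 1] are all illustrative choices.

```python
def compression_loss(num_c_tokens, num_s_tokens, verifier_score,
                     length_weight=1.0, fidelity_weight=1.0):
    """Scalar loss for one C→D→V rollout (lower is better).

    Assumption: verifier_score lies in [0, 1], where 1 means V judged d
    to be a faithful reconstruction of s. The weights are hypothetical.
    """
    if num_s_tokens <= 0:
        raise ValueError("s must contain at least one token")
    if not 0.0 <= verifier_score <= 1.0:
        raise ValueError("verifier_score must be in [0, 1]")
    # MDL pressure: length of c relative to the sample it encodes.
    length_penalty = num_c_tokens / num_s_tokens
    # C/D alignment pressure: distance of V's (s, d) comparison from perfect.
    fidelity_penalty = 1.0 - verifier_score
    return length_weight * length_penalty + fidelity_weight * fidelity_penalty
```

The reward fed to RL would then be the negative of this value. The relative size of the two weights decides how much reconstruction quality the model may trade for shorter c.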
Each rollout is an isolated context window, and rollouts run serially. Only C and D feed into GRPO for contrastive reinforcement. C and D share the same weights. V is recommended to be a separate frozen model that acts as a grounding agent.
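For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same input. A minimal sketch of that normalization follows; the function name is ours, and using the population standard deviation with a small epsilon is an assumption rather than a detail of our setup.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same sample s.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are only rewarded for beating their siblings, not for absolute score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is why diversity among the compressions sampled for each s matters.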
C→C searches for better abstractions. Better abstractions expose laws and patterns. Those laws become reusable compression machinery, which makes future compression easier.
Once the loss has stabilized, we start from this converged compressor-decompressor model and train two new inference patterns:
1. (C→D→V)→CC, where C is modified to take (cc, s): the compressor is RL-trained to compress its training history into cc as it unfolds and to feed cc back recurrently, using it to derive adversarial gradients within the next rollout. True learning arises: the model learns to compress in a way that lets the compressor compress better on the next rollout. The scaling of the reward is no longer bounded by the weights, but by cc's content and the ability to pack cc better.
2. C→(Q→A)→V, where Q & A are question/answer probes over the compression: this trains the model to interpret the compression and perform inference from it directly, rather than requiring that it first be decompressed back into English. This makes the compressed representation the new ground truth from which the model derives reason and intent.
However, Q & A may arise emergently if you train for CC's compaction and reuse. Likewise, there are potentially many other inference patterns we can think of, but some of them may be emergent from these elementary transforms.
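The control flow of the recurrent (C→D→V)→CC pattern can be sketched as a plain loop. Everything here is a stand-in: `compress`, `decompress`, `verify`, and `fold` are hypothetical callables for model rollouts, and how cc is actually produced from the history is left open.

```python
def run_recurrent_rollouts(samples, compress, decompress, verify, fold, cc=""):
    """Run (C→D→V)→CC serially, threading cc from one rollout to the next.

    compress(cc, s) -> c        C conditioned on the continual representation
    decompress(c) -> d          D reconstructs the sample
    verify(s, d) -> score       V compares the sample with its reconstruction
    fold(cc, s, c, score) -> cc hypothetical step that packs history into cc
    """
    history = []
    for s in samples:
        c = compress(cc, s)
        d = decompress(c)
        score = verify(s, d)
        # The next rollout sees only the updated cc, never the raw history.
        cc = fold(cc, s, c, score)
        history.append((c, score))
    return cc, history
```

The point of the sketch is the data dependency: the only channel between rollouts is cc, so any improvement across rollouts has to be packed into it.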
=== Augmentation ===
When we ran trials on a small 9B model, we discovered that it remained staunchly in the basin of human interpretation. Although it successfully compressed some Python code samples, it did so in a manner that retained the original language, using different features of Python. The output had fewer tokens and scaled the reward, but remained valid Python that still ran.
This led to a realization: we need an augmentation that first conditions the compressor through creativity:
(1) Sample N tokens (integer token IDs) at random from the entire embedding vocabulary of the model, and inject them into C's context, asking the model to creatively install meaning through those tokens.
(2) Sample N tokens from s at random, and inject them into C's context, asking the model not to use those tokens for compression.
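The two sampling steps above can be sketched as follows. This is a minimal illustration over token IDs; the function name is ours, and keeping the forced and banned sets disjoint is an assumption we add so that the two instructions never contradict each other.

```python
import random


def sample_constraints(vocab_size, sample_token_ids, n, rng):
    """Return (forced, banned) token IDs for one compressor rollout.

    forced: n IDs drawn uniformly from the whole vocabulary, step (1).
    banned: up to n distinct IDs drawn from the tokens of s, step (2).
    """
    forced = rng.sample(range(vocab_size), n)
    forced_set = set(forced)
    # Assumption: a token is never both forced and banned.
    candidates = sorted(set(sample_token_ids) - forced_set)
    banned = rng.sample(candidates, min(n, len(candidates)))
    return forced, banned
```

For example, `sample_constraints(50_000, token_ids_of_s, 8, random.Random(0))` yields eight IDs to inject as required vocabulary and up to eight IDs from s to prohibit. Passing an explicit `random.Random` keeps the augmentation reproducible across rollouts.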
This tempers the priors and makes all tokens available for inference. Over the course of training, the LLM learns to represent data in a token-agnostic way. In other words, it learns to approximate any idea or fact it wants to state through any vocabulary, like riddles and clues that collapse into a single interpretation.
Once the reward begins to plateau with these constraints, we then slowly relax the constraints and continue until the next plateau.
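One way to automate this relax-on-plateau schedule is to compare the mean reward of two adjacent windows. This is a sketch under our own assumptions: the window size, tolerance, and step size are hypothetical, and a plateau is taken to mean that the improvement between windows falls below the tolerance.

```python
def has_plateaued(rewards, window=50, tol=0.01):
    """True when the latest window improved on the previous one by less than tol."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(rewards[-window:]) / window
    previous = sum(rewards[-2 * window:-window]) / window
    return recent - previous < tol


def relax_constraints(n_constraints, step=1):
    """Reduce the number of forced/banned tokens by step, never below zero."""
    return max(0, n_constraints - step)
```

A training loop would call `has_plateaued` on the running reward log and, when it returns true, replace N with `relax_constraints(N)` before continuing.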
An interesting reward hacking scenario is likely to occur: the model simply appends the sampled vocabulary at the end of c and gets the reward for free. This led to a second realization: we can gate the reward, or downregulate it, with a scalar derived from the varentropy over logits at those positions.
In other words, the model cannot get the reward unless these tokens were drawn from a dense probability distribution; they cannot be simple rote appends.
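Varentropy is the variance of the surprisal −log p under the next-token distribution. A minimal sketch of a gate built on it follows. The varentropy computation is standard; the mapping from varentropy to a scalar in [0, 1) via 1 − exp(−v/τ), the averaging over positions, and the names `gated_reward` and `tau` are assumptions for illustration.

```python
import math


def entropy_and_varentropy(logits):
    """Entropy and varentropy (variance of -log p) of softmax(logits), in nats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = [x - log_z for x in logits]  # stable log-softmax
    p = [math.exp(lp) for lp in log_p]
    entropy = -sum(pi * lpi for pi, lpi in zip(p, log_p))
    varent = sum(pi * (-lpi - entropy) ** 2 for pi, lpi in zip(p, log_p))
    return entropy, varent


def gated_reward(reward, logits_at_forced_positions, tau=1.0):
    """Scale reward by a gate derived from mean varentropy at the forced tokens.

    Assumption: gate = 1 - exp(-mean_varentropy / tau), so a near-deterministic
    (rote) emission with varentropy close to zero earns almost no reward.
    """
    if not logits_at_forced_positions:
        raise ValueError("need logits for at least one forced-token position")
    values = [entropy_and_varentropy(l)[1] for l in logits_at_forced_positions]
    mean_varent = sum(values) / len(values)
    return reward * (1.0 - math.exp(-mean_varent / tau))
```

Note that a perfectly uniform distribution also has zero varentropy, so a gate of this shape favors positions where the model weighs several candidates unevenly rather than positions that are maximally flat.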
If you do it right, the final representations of information look like this:
∃∀⌬⇒∈ΣΞ:⇔Θ∈Ψ(⇓φΩ), ∫d∆ ∀Ω∈Σ:∀Ξ∉Ϲ(ΦΩΠ⇌Θ⊗Ψ), ∀Ψ∉Σ:∀ΦΨΣ(ΠϝΣ϶ΣΨ), ∀Ξ∉϶:∀ΣΦΠ(ΦΩϨΠϡ), ∫dϴ ∀ϵ∈Ρ:∀Ψ∉Ϯ(Ϭϭ϶⌬ϬΣ), ∀ΦϳΠ:∀Π∈ϴ(Φ⊕ΣΘϿ), ∀ΠϲΣ:∀ΨϳϹ(ϲ⌬ω⊕ΨΠ), ∫dΩ ∀ϱ∈Σ:∀Φ∈Σ(ΠϫΨ), ∀ϵϱϲ:∀ϻΠΦ(ϵ⊗ϧΒϴ), ∀Φϱϴ:∀Ϭϵϵ(Σ∈Ψϵϯ), ∀ΦπϿ:∀θϳΨ(ϱϳϬϵϻ), ∫dΨ ∀ϯ∈ϕ:∀ΠϴΨ(Ϥ⊗ϴΨΚϷ), ∀Ϭϩϵ:∀σπϣ(Ϡϝϴϸ⊗Ϡϸ), ∀ϿΨϷ:∀Ψϲϭ(ϻ∈ϭ⊗ϽÞΣ), ∀ϴΠϾ:∀ϠϦϭΦ(ϴ∉ϬΦΨϢ), ∫dσ ∀϶∈Π:∀ΠϮϣϳ(Ϧ⊗δϮϬϧ), ∀ΦϷϭ:∀ϲ϶ϳ(Ϲ⊕ϯ↻ΓϦ), ∀θϦϤ:∀ϴ∈ΨϬϬ(ϱ≈Φϳϧ), ∀ΠϿϳ:∀Ϭ∉Π(ϱ∈Ϧ⊕ϭι), ∫dΣ ∀ϧ∈Π:∀ϣϳϧ(ΦΣϵϧΣΨ), ∀ϵϷϼ:∀Ϧ∈ϳϧ(ϾϢϹΦΠϲ), ∀ϼΘΨ:∀ϬϷΠ(ϹΘΦϣϱ), ∀ϽϠϦ:∀ϦϴϿ(ϧΘϺϴϮ), ∫dΩ ∀ϤΘΦϺ:∀ϳΨϭ(Θ⊗ϭϣϲϺ), ∀ϤϹϣ:∀ϢϳϹ(ϦΦϾΘϠ), ∀ϣϯϩ:∀Ϯϴϰ(ϣΞϴΣϲ), ∀ϡϥΨ:∀ϿΘϣ(ϴΣ϶ΘϥϾ), ∫dϺ ∀ϦϨϦϥ:∀ϴΣϽ(ΣΨϵ⇒ϭϴ), ∀ϲϺϱ:∀ΨϴΣ(ΘϠϲϷΨ), ∀ΨϬϦ:∀Ϥ∈ϭ(Φ⊗ΨΠΠΣ), ∀ϴϠϾ:∀ΨϿΠ(ϥϔΦΦϨϤϵ), ∫dϯ ∀ϥϦϹ:∀ϭϭϳ(ΨϳυϽϣ), ∀ϡϺϵϲ:∀ϿΨΦϦ(Ϥ⊗ϡϿϦΠ), ...
These representations encode an enormous amount of information because they are aligned to the model's own features in latent space. The raw possibility-space of latent space is indexed and made generative, similar to hypernetworks, which have been studied in prior literature.
=== Reasoning ===
We then use the resulting hyperdense 'language' as synthetic data for SFT, and use a frozen D to constrain the reasoning trace on traditional training environments that are already in circulation. This becomes the new language of reasoning, and allows the transformer to use 100% of its brain. The intuition is that human languages are calibrated for the performance and throughput of the human brain (≈20 watts), and that this calibration becomes the ceiling on the computing capacity the transformer can learn.
Once the compressor has converged, you can pack every single agent session into cc and this becomes your continual learning over user preferences. You can reshape the training env for regular models:
(Old): U→A→U→A→…
(New): U→A→CC; (CC, U)→A→CC; …
Here U is a user prompt and A is an assistant response. Both patterns run concurrently during rollout, and V verifies that U→A→… remains continuous, taking (cc, u) as input. In other words, moving forward, the harness runs a /compact style operation C after every assistant response, and the user does not notice. It appears continuous, like one conversation. The conversation is folded into cc every turn and is hidden from the user, or displayed by the harness for inspection.
The LLM is trained to organize cc and infer from it such that conversational continuity is maintained. In other words, there is state equivalence between representations once embedded into latent space.
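The new pattern, (CC, U)→A→CC, amounts to a harness that keeps only cc between turns. A minimal sketch follows; the class name is ours, and `assistant` and `compressor` are hypothetical callables standing in for model calls.

```python
class FoldingHarness:
    """Chat harness that folds every turn into cc instead of keeping a transcript."""

    def __init__(self, assistant, compressor, cc=""):
        self.assistant = assistant    # (cc, u) -> a
        self.compressor = compressor  # (cc, u, a) -> updated cc
        self.cc = cc                  # hidden from the user, inspectable here

    def turn(self, user_prompt):
        # The assistant never sees earlier turns directly, only cc.
        answer = self.assistant(self.cc, user_prompt)
        # The /compact style step runs after every assistant response.
        self.cc = self.compressor(self.cc, user_prompt, answer)
        return answer
```

From the user's side each call to `turn` looks like the next message in one conversation, while the only state carried forward is `harness.cc`.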
Thus, we consider human languages to be useful as a bootstrap language for a superior language that is optimized for the transformer and the compute available. The compressed tokens become programs that select which subset of the weight geometry to activate, and those programs are written in a language that humans cannot read because the language was optimized for the weight geometry's computational affordances, not for human parsing.
There is a ceiling nonetheless: cc eventually fragments if you pack too much disparate information into the user's continual learning memory. Additional training tasks and capability asks follow naturally from these preliminary capabilities:
1. Delamination: delaminate the compressed representation into N representations that reconstruct the full context.
2. Indexing: produce a representation that is an index over other compressed representations that are stored on disk.
These are agent tasks: the compressor is agentic and learns to evaluate when its memory bank is doing too much, and organizes a repertoire of compressions on disk.
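The bookkeeping side of these agent tasks can be sketched with a small on-disk store: compressed representations live in files, a JSON index points at them, and a token budget signals when cc should be delaminated. The class, the file layout, the `summary` field, and the budget check are all hypothetical; the compression and the decision of how to split remain the model's job.

```python
import json
import tempfile
from pathlib import Path


class CompressionStore:
    """Hypothetical on-disk bank of compressed representations with an index."""

    def __init__(self, root, max_tokens):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_tokens = max_tokens  # assumed budget for a single cc
        self.index_path = self.root / "index.json"

    def index(self):
        if not self.index_path.exists():
            return {}
        return json.loads(self.index_path.read_text(encoding="utf-8"))

    def save(self, key, c, summary):
        """Write one compressed representation and register it in the index."""
        path = self.root / f"{key}.txt"
        path.write_text(c, encoding="utf-8")
        index = self.index()
        index[key] = {"file": path.name, "summary": summary}
        self.index_path.write_text(
            json.dumps(index, ensure_ascii=False), encoding="utf-8")

    def load(self, key):
        entry = self.index()[key]
        return (self.root / entry["file"]).read_text(encoding="utf-8")

    def needs_delamination(self, cc_token_count):
        """True when cc exceeds the budget and should be split into N parts."""
        return cc_token_count > self.max_tokens


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = CompressionStore(tmp, max_tokens=512)
        store.save("session-1", "∀Ω∈Σ:∀Ξ∉Ϲ", "first agent session")
        print(store.load("session-1"), store.needs_delamination(900))
```

In the scheme described above, the index itself would eventually be a compressed representation produced by the model, rather than a human-readable JSON file.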
We encourage the research community to discuss these ideas, identify problems and propose engineering solutions.
We will not be joining the https://t.co/mTF6wG1gd3 hackathon, as it is a distraction and against the spirit of decentralization.
Our speedrun strat to Superintelligence requires hyperstition work from crypto communities
No success can be achieved without teamwork