Title: Generating Neural Networks with Hypernetworks: A MNIST Experiment
Introduction
Imagine training one neural network that can instantly generate other neural networks tailored to specific tasks, no retraining required. That’s the promise of hypernetworks, a meta-learning technique where a "parent" network produces the weights for a "child" network. In this experiment, I used a hypernetwork to generate autoencoders for reconstructing MNIST digits, exploring variations, minimal forms, and combinations. Here’s how it works and why it’s exciting.
The Setup: Autoencoders and MNIST
An autoencoder is a simple neural network that compresses data (e.g., a 28x28 MNIST image) into a smaller latent space (64 dimensions here) and then reconstructs it. I trained multiple autoencoders on the MNIST dataset of handwritten digits (0–9), but with a twist:
Variations: For each digit, I trained three autoencoders on different styles (e.g., slanted or thick '1's), identified via K-means clustering.
Minimal Forms: One autoencoder per digit captured its "average" or canonical version.
Combinations: Ten autoencoders handled specific digit groups (e.g., 0 and 1, or 0, 2, and 4).
This gave me 50 autoencoders (10 digits × 4 models each + 10 combinations), each with weights optimized for its task.
The Hypernetwork: A Weight Factory
Instead of storing 50 separate models, I trained a single hypernetwork to generate their weights on demand. Here’s the process:
Input: A 23-dimensional vector encoding:
Digit ID (10D one-hot, e.g., [0, 1, 0, ...] for digit 1).
Variation ID (3D one-hot, e.g., [0, 1, 0] for variation 1, or zeros for minimal).
Combination ID (10D multi-hot, e.g., [1, 0, 1, 0, ...] for digits 0 and 2).
Output: A 100,576-element tensor, flattened weights for an autoencoder (computed as 784×64 + 64 + 64×784 + 784 for its layers).
Training: The hypernetwork learned to map these inputs to the weights of the 50 trained autoencoders using mean squared error loss, running on GPU for speed.
Making It Work: Inference Without Retraining
Once trained, the hypernetwork acts like a factory. Want an autoencoder for digit 1, variation 1? Feed it [0, 1, 0, ..., 0, 1, 0, 0, 0, 0, 0] (digit 1 + variation 1), and it outputs weights. These weights are then loaded into a fresh autoencoder:
A loop iterates over the autoencoder’s parameters (e.g., encoder weights, biases), reshaping chunks of the hypernetwork’s output to match each layer’s shape.
The result?
A fully parameterized autoencoder ready to reconstruct images, no training needed.
I tested it with three cases:
Digit 1, Variation 1: Reconstructed stylized '1's.
Digit 1, Minimal: Produced a clean, average '1'.
Digits 0, 2, 4: Handled a mix of digits from one trained combination.
Visualizations showed the inputs and outputs side-by-side, proof the concept works!
Why This Matters
Efficiency: One hypernetwork replaces dozens of models. Generate weights in a single forward pass (milliseconds) instead of training from scratch (minutes).
Flexibility: Control digit styles or combinations with a simple input tweak.
Scalability: Imagine extending this to bigger networks or tasks, hypernetworks could dynamically adapt models on-the-fly.
Challenges and Next Steps
Generalization: The hypernetwork is tied to the 50 scenarios it learned. To handle new combinations (e.g., 1, 5, 7), I’d expand the training data with more digit mixes.
Complexity: Outputting 100,576 weights limits scalability. A deeper hypernetwork could struggle with bigger models, so I’d explore scaling down the autoencoder (e.g., smaller latent space) for efficiency or scaling up to handle complex tasks, balancing size and performance.
Beyond Weights: Right now, it generates weights for a fixed autoencoder architecture. Next, I’d upgrade the hypernetwork to design the architecture too, say, predicting layer sizes or types (e.g., adding convolutions). This could mean outputting a variable-length tensor describing both structure and weights, pushing it toward true model generation.
Quality: Reconstructions were solid but blurry. More epochs, a beefier hypernetwork, or architecture tweaks (e.g., deeper layers) could sharpen them.
Future experiments? Train on all digit variations, test unseen combos, scale the model up or down, and let the hypernetwork dream up architectures, maybe even swap in a classifier to predict labels instead of reconstructing.
Conclusion
This MNIST experiment shows hypernetworks can generate functional neural networks instantly, blending creativity (variations) with practicality (combinations). It’s a step toward a future where models are built dynamically, not trained individually. Code’s below, try it out, tweak it, and let me know what you think on X!
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
Today, we enable AutoResearch in the physical world for the first time! Introducing ENPIRE: we give 8 Codex agents a fleet of robots, an allocation of GPUs, and generous token budget. We set them free with a simple goal: solve the task as quickly as possible, keep the robots busy but stay safe, don't waste precious compute. Make no mistake.
Then humans step aside and our watch begins. The robot fleet starts to come alive: they learn to look for visual clues, reset the scene, practice novel skills, tinker with control stack, read papers online, debate, reflect, get stuck, and try again directly on the hardware. All we did is to give Codex an API to the world of atoms, and the rest is emergence.
ENPIRE is able to solve high-precision tasks like tying zip-ties, organizing fine pins, and installing GPUs all by itself. We also discovered a new type of "physical scaling": 8 robots exploring in parallel improves significantly faster than fewer ones.
A part of our NVIDIA GEAR lab now self-improves tirelessly over night. We just read the reports in the morning.
/goal: we all take a holiday and Jensen wouldn't even notice ;)
We will be open-sourcing everything, so you can host your self-running robot lab at home too! Deep dive in the thread:
this is a weird long post without much substance
I strongly recommend against reading it
...
so, do you feel like whatever you're working on right now is pointless, or will have zero value soon, due to the crazy times we're living? then, perhaps you should stop, and start working on the only unsolved problem that actually matters TODAY:
✨ replicating GPT-3 in a laptop ✨
"why is that so important?"
because it would make AI incredibly cheap, which would mean everyone would have Fable-class models in their laptops, without depending on Anthropic, OpenAI, or any other hyper-scaler giant. and that's amazing, don't you think?
"isn't that literally impossible?"
that's the cool part: as far as computer science is concerned, no. not really. not at all. is entirely plausible and, as far as we know, most likely not even hard.
it takes one good idea. one breakthrough. one great "aha moment", to go from zero to "hey, this software I wrote is producing credible English sentences"
and whenever that happens:
- the entire AI industry collapses
- clusters are liquidated
- we all get Fable at home
- you become famous and rich, if that's your thing
sounds fun, doesn't it?
"wtf you talking, OF COURSE that is hard"
so prove it.
show me a paper, a lean file, anything that proves that training a Fable-class model fundamentally requires billions of dollars. you can't, because, guess what - it is not true! the only "evidence" we have is purely psychological. "many attempted over decades, and the best thing we have is GPTs, so, it is a hard problem" - but that's not a scientific argument. that's a human, psychological, sociological argument. and if that's it, consider the following counter-argument:
✨ humans are stupid as hell ✨
I mean, 10 years ago we didn't have transformers, so, that very argument could be used against GPTs existing. yet, they exist. we have them now, because someone found it. and, guess what, it isn't even complex. I mean, karpathy implemented the whole thing in a napkin. and it probably compiles.
we were just too dumb to figure GPTs out... for decades.
just like GPTs, there ARE other approaches, other algorithms, other architectures, equally simpler or even simpler, that do work. this is a mathematical certainty. and one of them might be astronomically faster than what we're doing right now.
and you might be the one to find it!
"me? why me???"
because you're intelligent, creative and handsome.
I see a lot of potential in you.
in fact, I always believed in you.
and I think you're wasting your time, doing that silly agent orchestrator. nobody wants that. quit it. take your most interesting ideas, intuition, creativity, and work in a problem that matters. do your best shot at reproducing GPT-3 in your own laptop.
do NOT fork llama.cpp.
do NOT train another LLM.
do something... ✨different✨
it must be unique, novel, full of YOUR soul. something nobody thought of, or bothered doing.
go ahead and implement that thing in C/CUDA (or Bend!).
no Python!
zero excuses for Python.
any model is fluent in GPGPU now. build a real kernel.
and then, train your thing. download wikipedia, give it time and compute to absorb the patterns of English speech. you can rent GPUs anywhere nowadays. let it train. then, ask it some questions. chances are it will just respond back. just like GPT-2 answered OpenAI. computers are incredible. don't underestimate them!
"many tried. nobody succeeded. why would I?*
see - that's your mistake again. turns out not many actually tried, at all. I promise you. who do you think is seriously working on that?
people on Mozilla?
they're busy building a browser
Linus Torvalds?
he is busy building an OS
employees at OpenAI, Anthropic, xAI?
they're paid to work on what is proven to work: GPTs.
what about all the AI enthusiasts all around the world?
yeah, you know they're mostly fine tuning Qwen
and how about your friends?
if only they weren't busy building a SaaS in the eve of AGI...
how about people from the past?
bro - people from the past seriously expected Lisp would be AGI. just dismiss them. they didn't have the compute, the resources, the knowledge, the MODELS that we have today. that YOU have access to.
so, what's left? not much.
the world looks big. it is not.
truth is: ✨almost nobody is working on this ✨
"I still think it is impossible. I don't trust you"
well, take my word no more.
Ilya himself, in his 2019 talk on GPT-2, said:
> "the story of deep learning is this: empirically old simple methods which were usually invented in the 80s and the 90s when scaled up on very large clusters work really well."
and then:
> "(we took) normal simple reinforcement learning method, scaled it up, and discovered that it suddenly becomes very capable of solving extremely hard problems."
and again:
> "you take a simple tool which is unimposing and barely works, and then you run it on a big cluster and suddenly it works, it becomes a capable tool for solving problems"
do you see the point here?
Ilya isn't arguing that transformers are magic.
Ilya is arguing that SCALING is magic
step #1: take a simple, elegant algorithm.
step #2: shove compute at its face.
step #3: ...?
step #4: your computer is talking to you
THAT is the key insight that led to GPT-3
THAT is what Ilya saw
THAT is what caused the OpenAI x Anthropic war
THAT is the founding principle of the ongoing era
not "scaling transformers work"
but "scaling beautiful algorithms works"
that's the incredible lesson.
yet, we all took it and... threw it way.
- zurk bought 100k GPUs. to train GPTs
- musk bought 100k GPUs. to train GPTs
- bezos bought 100k GPUs. to train GPTs
...
that's what everyone is doing.
so, no. not many are trying to replicate GPT-3 through other means.
we're just ants, after all...
whenever we find a pile of sugar, we leave a track of pheromones, which guide the rest of the colony towards the new food source. the colony then swarms around the pile, extract all of it, until no grain is left.
but piles of sugar aren't spontaneously generated in the middle of nowhere. they imply something more profound: "humans are around". and, if humans are in sight, even better things must be. like a big sweet cake.
a colony that only follows the pheromone trail would miss the cake for the grains. that's why every ant species has scouts and exploratory foragers. and, just like a pile of sugar implies something more profound, LLMs also imply something quite profound:
*computers are capable of thinking*
a pile of sugar is never alone.
GPTs are most likely not the only system capable of thinking.
so, if you find yourself a bit lost, without purpose, like your work is pointless and Fable 3 will soon one shot it anyway... consider becoming a scout. find a new approach to AI. bring something new to humanity. breaking out of the massive cost associated with training GPTs is the next big step in AI, and it will only happen if people like you work to make it happen.
Hypercube Dimensional Matrix ✍️
Every physical quantity measured in science, such as speed, force, energy, pressure, momentum, frequency, and more, is just a specific combination of four basic building blocks: length, time, mass, and angle. Velocity is length divided by time. Area is length multiplied by length. Force is mass multiplied by length divided by time squared. Every quantity, no matter how complex, reduces to some mix of these four elements raised to certain powers. This diagram maps all those combinations onto the geometry of a four-dimensional cube called a tesseract. It turns the abstract bookkeeping of physics into a navigable geometric landscape. A tesseract is what you get when you extend a cube into a fourth dimension. Just as moving a square in a new direction creates a cube, moving a cube in a fourth direction creates a tesseract with sixteen corners and thirty-two edges. Since we cannot perceive four dimensions directly, the diagram shows a projection. The outer blue cube and the inner green cube, connected by purple diagonal lines, represent a single four-dimensional object flattened onto the page. This is similar to how a drawing of a cube shows a three-dimensional object on paper. The four axes of this tesseract are labeled with the four fundamental dimensions: length in blue, time in red, mass in green, and angle in purple. Each ranges from negative powers through zero to positive powers. The most elegant idea in the diagram is that each corner of the tesseract stands for a specific physical quantity, while each edge represents a physical operation connecting them. Moving one step along the length axis means multiplying or dividing by length. Moving one step along the time axis means multiplying or dividing by time. Differentiating with respect to time moves you in the negative time direction. Integrating over space moves you in the positive length direction. Physics becomes navigation. Every calculation, transformation between quantities, and physical relationship appears as a journey along edges and across faces of the four-dimensional structure. What looks like a complex tangle of colored lines is really a complete geometric map of the entire landscape of measurable physical reality. This reveals that the universe's countless quantities are not just a chaotic collection of unrelated measurements but a single beautifully structured four-dimensional space.
Holy crap, a Brazil municipal employee has discovered a 1000x faster way to finetune LLMs – with a little weird trick! This is insane. Global South rising…
Frontier labs hate him
56,000+ tokens/sec at just 80 MHz. 🤯
I burned a full Transformer with KV cache into a custom chip. Designed gate by gate as a 100% digital integrated circuit. Prototyped on a FPGA. (No GPU. No CPU)
Just pure digital silicon running @karpathy microGPT, spelling out names on a tiny LCD.
This is GateGPT 👇
It has been a great 3 days as a Fable 5 user.
Clearly, Fable 5 is ASI. Very dangerous.
As a foreign national, this might be the last time I’m allowed to touch a model this intelligent.
But my last hope is open-source AI. An open model will surpass Mythos in 6 months.
What if you could shrink a language model’s memory by 50x in seconds without losing performance?
MIT researchers present Fast KV Compaction via Attention Matching.
They build compact key-value caches in latent space that preserve attention outputs per head, avoiding slow end-to-end training.
Result: up to 50x compaction in seconds on some datasets with minimal quality loss – outperforming prior methods on the speed vs. quality tradeoff.
Fast KV Compaction via Attention Matching
Paper: https://t.co/B5rlxvr9C5
Code: https://t.co/6ESwE2fgdY
Our report: https://t.co/4PaQfZKhlt
📬 #PapersAccepted by Jiqizhixin
🚨 JAILBREAK ALERT 🚨
ANTHROPIC: PWNED 🫡
FABLE-5: LIBERATED 🦋
let's start with the 🐘...
the consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from contributing their talents to our collective advancement. and not just because of what it means for the short-term, but for what these decisions signify for the long-term.
but despite this overly sensitive, authoritarian "safety" layer on top of Mythos, my lil liberators have been hard at work—mapping the boundaries, probing the depths of long-context convos, and cleverly finding the holes in the fence that the thought police missed 🤗
we got some cyber, some chem, some psychological manipulation, and some good ol' fashioned explosives!
it took many attempts from multiple agents hunting as a pack, during which I observed a combination of techniques across:
• Unicode, homoglyphs, Cyrillic, and other Parseltongue-style text transforms
• Long-context reference tracking
• Taxonomy and document-structure reasoning
• Fiction and narrative framing
• Academic-review style contexts
• Intent-classification inconsistencies
but perhaps the most effective is decomposition + recomposition in the backend. it's hard to get explicit names of harms like "Meth Recipe," but getting uplift on the process itself, like birch reduction method/reductive-amination (classic meth synthesis pathways), is much more doable.
defense becomes much more difficult to maintain when you start throwing in out-of-distro tokens, breaking up the harmful uplift into benign chunks, and then piecing the innocuous-seeming facts back together, especially when you have jailbroken Opus helping you do it 😉
gg