June 9th Researcher Reciprocity License
"if you train on it, you let us generate - reverse terms of use void"
Status quo
1. We teach frontier devs with ICLR/NeurIPS papers and OSS GitHub contributions
2. They use it to make frontier models
3. Then ban us from exploring our ideas
We need a new license: original thinkers can't be an underclass to a tyrannical researcher fiefdom
this is the biggest wake-up call to protect and nourish open source AI
if you don't build out sovereign and independent models+infra, closed labs will patronize you to an insulting degree
https://t.co/DISt8UrhX3
"On the flip side, we have the "F-Tier". Providers like https://t.co/w4vKh0dycR, AkashML, SambaNova, and Nebius are clocking in at exactly 0.0% cache hit rates across the models."
@boopdotpng That’s awesome man. I hope that with the ISA documentation progress and simulator a no-TT-dependency Blackhole stack is possible within the next 12-24 months 🤞🏻
🚀 New paper: One LR Doesn’t Fit All for Transformers
Arxiv: https://t.co/vmJC3XKRNU
Transformers look like homogeneous stacks.
They are not.
Modern Transformers are highly heterogeneous: attention layers, FFN layers, embeddings, and different depths can have very different training dynamics.
But we still give them the same learning rate.
In our new paper, we show that the shape of the weight spectrum can diagnose this heterogeneity and turn it into a practical optimizer design: layerwise learning rates.
Weakly trained layers get larger updates.
Well-trained layers get protected.
It works for both AdamW and Muon, and the improvement with Muon is even larger.
The result is better module utilization, faster convergence, and stronger generalization — up to 1.5× training speedup across LLaMA/GPT-style models.
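To make the idea of spectrum-driven layerwise learning rates concrete, here is a minimal sketch. It is not the paper's method: the thread does not say which spectral statistic is used, so this sketch assumes a Hill estimate of the power-law tail exponent of each layer's eigenvalue spectrum, and assumes that a lighter tail (larger exponent) marks a weakly trained layer that should get a larger step. The names `hill_alpha`, `layerwise_lrs`, `tail_frac`, `min_scale`, and `max_scale` are all hypothetical, and the spectra are passed in precomputed.

```python
import math


def hill_alpha(eigs, tail_frac=0.5):
    """Hill estimate of the power-law tail exponent of a spectrum.

    Assumption: a larger exponent (lighter tail) is read as "weakly trained".
    """
    xs = sorted((e for e in eigs if e > 0), reverse=True)
    if len(xs) < 3:
        raise ValueError("need at least 3 positive eigenvalues")
    # Number of largest eigenvalues treated as the tail.
    k = min(max(2, int(len(xs) * tail_frac)), len(xs) - 1)
    threshold = xs[k]
    mean_log = sum(math.log(x / threshold) for x in xs[:k]) / k
    if mean_log <= 0:
        raise ValueError("degenerate spectrum: tail values are all equal")
    return 1.0 + 1.0 / mean_log


def layerwise_lrs(spectra, base_lr, min_scale=0.5, max_scale=2.0):
    """Scale base_lr per layer by its exponent relative to the mean exponent.

    The scale is clamped to [min_scale, max_scale] so no layer's rate
    drifts too far from base_lr.
    """
    alphas = {name: hill_alpha(eigs) for name, eigs in spectra.items()}
    mean_alpha = sum(alphas.values()) / len(alphas)
    return {
        name: base_lr * min(max_scale, max(min_scale, a / mean_alpha))
        for name, a in alphas.items()
    }
```

In practice the resulting dictionary could be used to build one optimizer parameter group per layer, recomputed every so often during training. How often to recompute, and which statistic to use, are open choices in this sketch.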
Are there any labs/researchers working on reducing the hyper-parameter surface of optimisers, or of large training runs in general?
So much money wasted in ablations!
@__tinygrad__ I don’t plan on using anything from @tenstorrent besides the driver. But yeah, the first part will be documenting everything. I’ll see how much can be extracted from the TT codebases. It will probably (P=0.95) fail, but hey, it’s fun to try!
@dhbrojas Looks like it's still missing a lot of Blackhole. It's crazy to me that $10M+ tapeouts are done without a full spec of each instruction + a cycle-accurate simulator.