Guys, the Blog is here!
Why does Gradient descent struggle, and how does momentum completely change the game?
I wrote a deep dive covering:
+ A quick intro to Gradient Descent
+ Where it breaks down
+ How momentum fixes those issues
+ The math behind it (intuitive + formal)
+ Code implementation
Tried to make this enjoyable for both beginners and those who love the math side of things.
Hope you all will like it. Feedback is always welcome!
QT/RT appreciated <3
I’ve been hacking around with nanoGPT during endsems and ran a few small pretraining experiments.
I moved from OpenWeb-style data to a mixed 100M-token dataset:
- 70% FineWeb-Edu
- 20% Cosmopedia
- 10% Python code
I picked this mix to make the data feel closer to modern pretraining- web/education text, explanation-style synthetic data, and some code.
For the baseline, i trained a small decoder-only model in the nanoGPT style with learned absolute positional embeddings.
Then I ran one clean architecture ablation: replacing learned positional embeddings with RoPE, while keeping the dataset, model size, optimizer, and token budget the same.
Results:
Baseline best val loss: 4.29455
RoPE best val loss: 4.04767
RoPE improved validation loss by 0.24688 under the same training budget.
Throughput tradeoff:
Baseline: ~480k tokens/sec
RoPE: ~407k tokens/sec
Main learning: RoPE gave better model quality, but was slightly slower because it adds extra computation inside attention.
also a big thank you to @HotAisle for the compute. :)
We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction".
A trained language model knows many things at once, but deployment usually asks for one behavior at a time. Enterprise scenarios often have few products, workflows, features, or use-cases matter disproportionately.
Prism asks and answers a simple question - "Is it possible to isolate and deploy only capabilities that are driven by Pareto principle and cut down costs by a huge margin while preserving most of the performance?"
This paper discusses a novel approach to efficiency, understanding model behavior and opens up capability extraction.
Starting with nanoGPT to understand the architecture end to end:
data prep, tokenization, batching, transformer blocks, training loop, evals, checkpoints, and sampling.
Once the pipeline becomes clear, I will start experimenting with different pretraining recipes ;)
@igorfomich i have just started, so still working through it myself, for longer context I would increase block size, then reduce batch size or adjust grad accumulation.