Gave a talk earlier today at BitDevs Kaduna ⚡.
Did a deep dive into our @Breez_Tech integration at @evento_so: architecture, trade-offs, and the real-world issues we ran into building it.
Also got to demo Zap-All live.
Amazing crowd, great questions, peak BitDevs energy all round.
Evento is now open-source too, so feel free to check it out, suggest improvements, or contribute.
Appreciate everyone who came through 🫡.
I was training this model on Modal; it took about two hours and cost around $8. When it finally finished, I meant to run the next cell... and accidentally reran the training one instead.
Guess I'm literally paying for my mistakes.
In today's episode of programming horror...
In the Python docs for random.seed(), we're told
"If a is an int, it is used directly." [1]
But if you seed with 3 or -3, you actually get the exact same rng object, producing the same streams. (TIL). In nanochat I was using the sign as a (what I thought was) clever way to get different rng sequences for train/test splits. Hence gnarly bug because now train=test.
I found the CPython code responsible in cpython/Modules/_randommodule.c [2], where on line 321 we see in a comment:
"This algorithm relies on the number being unsigned. So: if the arg is a PyLong, use its absolute value." followed by
n = PyNumber_Absolute(arg);
which explicitly calls abs() on your seed to make it positive, discarding the sign bit.
But this comment is actually wrong/misleading too. Under the hood, Python calls the Mersenne Twister MT19937 algorithm, which in the general case has 19937 bits of (non-zero) state. Python takes your int (or other objects) and "spreads out" that information across these bits. In principle, the sign bit could have been used to augment the state bits. There is nothing about the algorithm that "relies on the number being unsigned". A decision was made to not incorporate the sign bit (which imo was a mistake). One trivial example could have been to map n -> 2*abs(n) + int(n < 0).
Finally this leads us to the contract of Python's random, which is also not fully spelled out in the docs. The contract that is mentioned is that:
same seed => same sequence.
But no guarantee is made that different seeds produce different sequences. So in principle, Python makes no promises that e.g. seed(5) and seed(6) are different rng streams. (Though this is quite commonly implicitly assumed in many applications.) Indeed, we see that seed(5) and seed(-5) are identical streams. And you should probably not use them to separate your train/test behaviors in machine learning. One of the more amusing programming horror footguns I've encountered recently. We'll see you in the next episode.
[1] https://t.co/srv1ZBlDsi
[2] https://t.co/qpnKdvfVNS
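A quick way to check the behavior described above on your own interpreter. This is a minimal sketch, assuming CPython; the helper names `stream` and `sign_aware_seed` are made up for illustration and are not part of the `random` module. The remapping is the n -> 2*abs(n) + int(n < 0) idea from the thread, applied by the caller before seeding.

```python
import random


def stream(seed, n=5):
    """Return the first n floats from a fresh generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def sign_aware_seed(n):
    """Hypothetical helper: fold the sign into the magnitude so that
    n and -n map to different non-negative seeds (3 -> 6, -3 -> 7)."""
    return 2 * abs(n) + int(n < 0)


# On CPython, an int seed is reduced to its absolute value,
# so a seed and its negation give the same stream.
same_stream = stream(3) == stream(-3)

# After remapping, the two seeds are distinct non-negative ints.
# Note the docs still do not promise distinct seeds give distinct streams.
remapped_differ = stream(sign_aware_seed(3)) != stream(sign_aware_seed(-3))

print("seed(3) == seed(-3):", same_stream)
print("remapped seeds differ:", remapped_differ)
```

For train/test splits, passing two explicitly different non-negative seeds (or deriving both from one seed plus an offset) sidesteps the sign issue entirely.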
Of course there is no harm in experimenting if you have the time and a GPU, but here are a few tips on when to use regularization:
- Use it when the model is too complex.
- Use it when you can actually see the overfitting happening. (This is better because, instead of holding on to regularization as a default, you can check whether the model performs better without it and have some sort of baseline to work with.)
- Use it when a large dataset is not available (if the data is relatively small and there is a risk of the complex model memorizing it, then yes).
- If you have neither the time nor a GPU and you are unsure about the characteristics of the model and data, then yes as well.
Hope this helps.
I was reviewing a mentee's deep learning model today and saw something I had wanted to talk about for a while.
A lot of the hobby DL models we create do not get better performance when regularization is implemented; in fact, it decreases training accuracy and overall generalization.
Picked up fastai over the last weekend for a gig so I did a little something for y'all to check out.
A pneumonia detector from chest X-rays. You can open it on Colab, run the cells, and start testing it out right away, and of course, improve it.
Colab link:
https://t.co/kauYTmJN1Q
everyone:
- "just use the API"
PewDiePie:
- built a 10x GPU AI Server (8x modded 48GB 4090s, 2x RTX 4000 Ada)
- runs open-source models with vLLM for TP
- vibe-coded his own Chat UI, including RAG, DeepResearch, and TTS
- is fine-tuning his own model
be like PewDiePie
Buy a GPU
Did a parametric experiment to really understand why dropout regularization works so well.
Didn't really document per se as I just wanted to check it out.
I feel like I'd go with the researchers who say it works by reducing reliance on individual nodes, thereby stabilizing the network, although I wouldn't say the other findings are really orthogonal, so...
I'll do much more research and hopefully document this time so y'all can see.
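For anyone who wants to poke at the mechanism in the meantime, here is a minimal sketch of inverted dropout in plain Python. It is not the experiment mentioned above; the function name `dropout` and its parameters are made up for illustration. Each unit is zeroed with probability p, and the survivors are scaled by 1/(1-p) so the expected activation stays the same, which is why no single node can be relied on during training.

```python
import random


def dropout(activations, p, rng):
    """Inverted dropout (illustrative sketch).

    Zero each activation with probability p and scale the kept ones
    by 1 / (1 - p) so the expected value of each unit is unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer = [1.0] * 10_000          # stand-in for one layer's activations
dropped = dropout(layer, 0.5, rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"kept fraction: {kept_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Sweeping p over a few values and comparing validation curves is one simple way to set up the kind of parametric experiment described here.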
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.
The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are an ugly, separate, not end-to-end stage. The tokenizer "imports" all the ugliness of Unicode and byte encodings, and it inherits a lot of historical baggage and security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
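To make the "identical to the eye, different internally" point concrete, here is a small sketch using only the Python standard library. It shows the code points and UTF-8 bytes that sit underneath the text, which is what a byte-level tokenizer starts from; it does not reproduce the output of any particular tokenizer. The variable names are made up for illustration.

```python
import unicodedata

latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, usually rendered the same as Latin "a"
smiley = "\U0001F642"    # U+1F642, a slightly smiling face emoji

for ch in (latin_a, cyrillic_a, smiley):
    raw = ch.encode("utf-8")
    # Show the official name, the code point, and the raw bytes in hex.
    print(f"{unicodedata.name(ch):30} U+{ord(ch):04X}  {raw.hex(' ')}")

# Two glyphs that look alike are different strings with different bytes.
lookalikes_differ = latin_a != cyrillic_a

# The emoji is four bytes: one lead byte followed by three continuation bytes.
emoji_byte_count = len(smiley.encode("utf-8"))
```

A model that reads rendered pixels would see the two "a" glyphs as near-identical inputs, whereas anything built on these bytes sees unrelated sequences.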
OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.
So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.
Now I have to also fight the urge to side quest an image-input-only version of nanochat...
An exciting milestone for AI in science: Our C2S-Scale 27B foundation model, built with @Yale and based on Gemma, generated a novel hypothesis about cancer cellular behavior, which scientists experimentally validated in living cells.
With more preclinical and clinical tests, this discovery may reveal a promising new pathway for developing therapies to fight cancer.