This is one of the most crucial lessons in First Break AI.
It teaches you how to navigate @huggingface like a pro.
Not just:
download model → run notebook → move on
In this lesson, we go deeper.
We look at how open model repos are structured, how to read model files, how config.json connects to the actual model class, and how to trace from a Hugging Face model page into the Transformers code that runs the model.
We use Qwen3-0.6B as the learning model.
We also look at why Markdown matters so much in AI workflows: model cards, GitHub issues, README files, Discord, Cursor, Claude Code, planning docs, and AI-assisted work.
Then comes the biggest win: datasets.
Working with datasets is a core AI engineering skill.
I show 3 ways to analyze datasets on Hugging Face:
Croissant endpoint
Data Studio / browser viewer
load_dataset with Python, pandas, and plots
We inspect dataset structure, categories, response lengths, distribution, short examples, long examples, and how to think about dataset quality before using it for training or fine-tuning.
And this sets up the next part:
running Qwen3 directly in C, without treating Transformers as magic.
Lesson 01: Hugging Face Beyond Upload
Watch:
https://t.co/GF8ZCNk5WN
Free cohort:
https://t.co/0H4qIVOpGj
Is there a prompt guide for Fable?.
Fable uses most of the quota in just few prompts and still feels nerfed.
I tried to use fable for a serious task like product analysis. It gave sharp analysis however looks like model is shy about tool calls. It doesn't want to collect a lot of information and I had to push it hard to do real analysis.
Analysis overall is sharper than Opus , however this feels like a nerfed model .
Is there prompt guide or direction how to use this model effectively?
I built a small visualization layer on top of a local Qwen3 in Pure C to understand LLM output
Image shows why sampling is not greedy decoding: a lower-probability token can still get selected when temperature/top-p keep it inside the candidate pool.
I would also love feedback on what would make a visualization like this more useful for learning:
- KV cache view?
- attention heatmaps?
- speculative decoding comparison?
- greedy vs top-p side-by-side?
have been recently thinking about why pretrain research matters among the seemingly more crucial data/compute/rl bottlenecks and sharing my take here on what makes pretrain research (still!) vital:
1. better computational efficiency: scalinglaw shifts, 2x less FLOPS needed to achieve the same loss, etc. plus e.g. long context settings where switching to hybrid or sparse attn can save you >90% FLOPS.
many model arch / optimizer improvements can save you >20% flops needed for the same loss - those are research innovations on every axis from training iter dimension to inter-layer and intra-layer. the effect of compounded architecture advantage is very distinctive given that ur always improving against your sota baseline.
good pretrain research might very well have already delivered you a 10x more efficient (and likewise, better under the same compute) model arch compared to three years ago, and there's still obv many inefficiencies left to be optimized. over half of the compute is still spent on pretraining when you do new from-scratch model trainings rn, and having weeks & months saved there could really allow much more rapid iterations across the entire stack, compounded.
2. to train models one couldn't have been able to previously: residuals, optimizers, etc. this one's less common since most of the arch innovations don't offer more beyond the expressivity gain. but there are significant ones which can e.g. provide more stable learning dynamics (both theoretically and in practice) at all scales so one could scale up. new model configs or forms of training also come back to better efficiency
data/compute/FLOPS bottlenecks certainly exist but are relatively more orthogonal to pretrain research and imo it is unclear whether one will be a clear intelligence bottleneck a year from now than the other.
in hindsight ive been using "pretrain research" tho this itself is an inefficiency (with further inefficiencies under its scaling law) and "deep learning research" is a better phrasing.
one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to take risks and are fully open (even open development where you can follow everything on github!)
https://t.co/G12JfPlFJP
Claude Cowork with blender is so much fun, still work in progress will post the final scene soon.
Trying out if it can build basic geometry nodes scene like waves hitting a beach 🌊🏖️
52% of MCP servers are dead within 90 days.
But the median server has 6 commits — lifetime.
The protocol works. The logic layer doesn't exist.
Content goes stale. Tools stay isolated. Nobody monitors what fails.
Full research: https://t.co/xCk7HPZbce
Extremely Rare Red Sprites Spotted Flashing Over Tibet. They are caused by high levels of electrical activity and form in the upper atmosphere during powerful thunderstorms.
KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vLLM, you can now use it as a drop-in replacement for native CPU/Disk offloading, giving you cross-node prefix cache reuse to make your agents go brrr🚀
Indian scientists just made history.
Researchers from IIT Madras and IISc Bengaluru just pulled off something impossible.
They've created the world's "first carbon-free ferrocene".
This means we can finally build the next generation of incredibly durable tech.
Let me explain.
See, ferrocene is this wild organometallic molecule - where an iron atom is perfectly sandwiched between two carbon rings.
But it’s insanely stable.
Which is why it is already used in rocket fuels, car gasoline additives, long-life batteries, and even cancer medicines.
And for the last 75 years, everyone thought it was impossible to build the same stable structure without using carbon.
But this team of Indian scientists proved everyone wrong.
They created the same perfect sandwich structure - by swapping iron for osmium and carbon rings for boron rings.
And what they got was the world's first carbon-free ferrocene - which is so much stronger than the carbon bonds.
By doing so - they've opened up a whole new era of chemistry. And we have no idea how many amazing things we might discover.
But to think all of this started in India is truly amazing.
Kudos to everyone on this team: Sundargopal Ghosh, Stutee Mohapatra, Suvam Saha, Urvashi Gupta, Deepak Patel - from IIT Madras, Gaurav Joshi and Eluvathingal D. Jemmis - from IISc Bengaluru.
new minimax sparse attention compared to deepseek v3.2 (DSA) and v4 (CSA)
main changes:
- based on GQA not MLA
- block level selection like in CSA but attention is done on the real KV, not in the compressed dimension
New blog!
Covers a lot of papers and methods about recent advances in On policy distillation and On policy self distillation, their wins, their failure modes, and my opinion about the same!
Link below, please do check it out, and RT/QT if you like it:)
Added a DeepSeek Sparse Attention (DSA) from-scratch implementation to my LLMs-from-scratch repo thanks to an awesome new reader contrib.
With motivation, overview, and GPT-style model reference implementation as standalone example code: https://t.co/o2PMhjF0TN