This resonates a lot with my experience. My record was 60 books a year (not 80 in 6 months tho). Because I'm curious about a lot of things, many topics get my attention, so the "Parallelize" (books) tip is a really effective way to read more books. I read 3-4 at the same time, a bit every day, consistently. It turns out to be much easier to do, and in the long term, I accomplish more.
Reading a lot also made me rethink which books I choose to read (reading less → reading better books: https://t.co/15bp8ZjRIm). And because I usually read technical and non-fiction books, it's great to re-read them, take notes, and think of ways to apply the ideas in my life (https://t.co/4r4rNBruhE).
"How To Read More" by Borretti: https://t.co/DW22tUxm7j
Building a GPT Model
For the past few weeks, I've been reading about Foundation Models [0] and decided to work on the implementation of the GPT architecture [1] to understand its building blocks and how it works under the hood.
Here are the concepts I worked on in this implementation:
Tokenization → Embeddings → Self-Attention → Multi-Head Attention → Transformer Block → GPT Model → Pretraining.
→ Tokenization: building tokens from the input text and transforming them into token IDs, then using a BPE tokenizer algorithm [2]
→ Embeddings: representing tokens with a simple scalar value (ID) is too simplistic. Embeddings build richer representations. I built small embeddings for learning purposes and then increased their size to scale up
→ Multi-Head Self-Attention: this was one of the most interesting parts, creating attention scores and building relationships between tokens to produce context vectors
→ Transformer blocks: combine the attention heads, dropout, layer norm, and the feed-forward network
→ Pretraining: a standard training process for deep learning models, but in this case we update the weights end-to-end, from the embeddings to the attention layers to the feed-forward network
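To make the attention step concrete, here is a minimal sketch of single-head causal self-attention in plain Python. It is not the code from my repo: the helper names, the toy embeddings, and the identity projection matrices are all made up for illustration. Multi-head attention runs several of these heads in parallel and concatenates their context vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    largest = max(row)
    exps = [math.exp(value - largest) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention.

    x: (seq_len, d_in) token embeddings.
    w_q, w_k, w_v: (d_in, d_out) projection matrices.
    Returns (attention weights, context vectors).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    weights = []
    for i, q_i in enumerate(q):
        # Causal mask: token i only attends to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Future positions get a weight of exactly zero.
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    context = matmul(weights, v)
    return weights, context


# Toy input: 3 tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projections, for readability
weights, context = causal_self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and is zero above the diagonal, so the first token's context vector is just its own value vector.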
The implementation was highly inspired by the Language Modeling from Scratch course [3] and the Build a Large Language Model book [4]. It's still very rudimentary, but very useful if you plan to learn these concepts in depth.
Article Link: Self-Attention, Foundation Models, and the GPT Architecture from Scratch: https://t.co/R69Heqle3P
---
In the future, I plan to write about finetuning (using foundation models and finetuning for other tasks) and optimizations (attention blocks optimization, GPU and kernel optimization).
[0] Foundation Models at Nubank: https://t.co/xLRWLPOl3o
[1] LLM implementation repo: https://t.co/STPyVTKlM4
[2] Tokenizers lecture: https://t.co/YtBodQ2SEW
[3] Language Modeling from Scratch: https://t.co/TMGTAXJmqS
[4] Build a Large Language Model: https://t.co/TawYd8Zi8M
I've just read the "Let Me Convince You to Be Prolific" post about the benefits of being prolific, especially for creative people in the digital age.
The idea is that we should create and release more experiments, building a long tail of acceptable work:
→ Experiment > Failure > Refine > Loop
→ Publishing work helps people find you
→ Early drafts, faster feedback loop > faster improvement
→ Each experiment contributes to the following one
I noticed this about my blog, where I've been writing for 10+ years now. Every technical post I wrote helped improve the next one. None of them is perfect, but I can see how much progress I have made over time.
The things you learn, the feedback you get, and the will to refine your work lead to mastery. And the long tail of work starts to compound and helps people discover you.
There are these two quotes I liked:
> "Giving up on perfectionism doesn't mean that you will not produce anything perfect, but rather that perfection will happen from time to time because of the sheer mass of output." - Dean Keith Simonton
> "If you can write one short story a week - it doesn't matter what the quality is to start, but at least you're practicing, and at the end of the year you have 52 short stories, and I defy you to write 52 bad ones." - Ray Bradbury
I found this blog in @noghartt's bookmarks. There's an awesome curation there.
→ Blog: https://t.co/LBI4yNF1d8
✨ I worked on this article the whole day and made a lot of progress. I'm almost there.
A lot of work, with many experiments, but it's getting traction. "Make Something Wonderful" inspired me to keep building and sharing.
I've just found out about this course on Foundation Models and Generative AI. Quite interesting lectures. I plan to watch the lectures as soon as I finish the Language Modeling from Scratch course. So many interesting things to learn.
Many people have already pointed this out, but this course by Stanford is remarkable. It's been part of the first hour of my morning: watching the lecture, taking notes, opening new tabs for the papers mentioned, and coding to build the intuition behind each lecture.
Mixture of experts was a nice lecture, but the one I liked the most so far was about PyTorch and resource accounting and how to make sense of CPU/GPU, memory, runtime/compute (FLOPs), etc., from first principles.
Link: https://t.co/TMGTAXIOBk
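A small sketch of what this kind of first-principles accounting can look like. It uses two widely used rules of thumb: a matrix multiply costs 2·m·n·p FLOPs, and training a dense model costs roughly 6 FLOPs per parameter per token. The model size, token count, and hardware numbers below are hypothetical, and the estimate ignores activations, optimizer state, and attention-specific costs.

```python
def matmul_flops(m, n, p):
    """FLOPs for an (m x n) @ (n x p) matmul: one multiply and one add per term."""
    return 2 * m * n * p


def training_flops(num_params, num_tokens):
    """Rule of thumb: ~2 FLOPs/param/token forward plus ~4 backward."""
    return 6 * num_params * num_tokens


def weight_memory_bytes(num_params, bytes_per_param=4):
    """Memory for the weights alone (float32 uses 4 bytes per parameter)."""
    return num_params * bytes_per_param


def training_seconds(total_flops, peak_flops_per_second, utilization):
    """Wall-clock estimate given a hardware peak and the fraction actually achieved."""
    return total_flops / (peak_flops_per_second * utilization)


# Hypothetical setup: a 124M-parameter model trained on 1B tokens.
num_params = 124_000_000
num_tokens = 1_000_000_000
total = training_flops(num_params, num_tokens)

print(f"training compute: {total:.2e} FLOPs")
print(f"float32 weights:  {weight_memory_bytes(num_params) / 1e9:.2f} GB")
# Assumed hardware: 100 TFLOP/s peak at 50% utilization (made-up numbers).
print(f"estimated time:   {training_seconds(total, 100e12, 0.5) / 3600:.1f} h")
```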
[Paper Reading: Your Spending Needs Attention]
I've just finished reading the "Your Spending Needs Attention" paper by Nubank, and not only are the results impressive, but the ML and engineering approach is also very interesting. It shows the power of self-supervised representation learning to automatically understand user behavior from raw (transaction) data, which made me think about how many insightful representations we are missing by not using it, and why (engineering and money trade-offs come to mind).
Here's the research breakdown: causal self-attention + tabular feature embedding + fine-tuning for RecSys.
Transformer-based model:
> Text is All You Need: Individual transactions are tokenized, concatenated into a transaction string, and fed through a Transformer [0] to produce a transaction sequence embedding.
> No Positional Embeddings (NoPE) [1]: drop the temporal information
> FlashAttention [2] + NoPE = Efficient Long Contexts (transaction = ~14 tokens → the sequence gets large very fast): the model can train on much larger context lengths
Tabular Features:
> Feature embeddings for numerical and categorical variables
> LightGBM: gradient-boosted tabular modeling
> Deep Cross Network V2 (DCNv2) [3]: learn feature interactions
Fine-Tuning → classification task for RecSys:
> Low-Rank Adaptation (LoRA) [4]: injecting trainable low-rank matrices into attention layers to handle the "overfitting and catastrophic forgetting" issues.
> Late Fusion: freeze the transformer embeddings and use them as static features passed into LightGBM or DCNv2 independently.
> Joint Fusion (nuFormer): keep the transformer embeddings trainable end-to-end alongside the tabular features.
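For intuition on the LoRA item above, here is a minimal sketch of the standard LoRA forward pass: the frozen weight `W` is left untouched and a trainable low-rank update `(alpha / r) * A @ B` is added on top. This is the generic formulation, not the paper's exact setup. The dimensions, `alpha`, and all variable names are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, a_down, b_up, alpha):
    """y = x @ W + (alpha / r) * (x @ A) @ B, with W frozen and A, B trainable.

    a_down: (d_in, r) down-projection, b_up: (r, d_out) up-projection.
    """
    rank = len(b_up)
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, a_down), b_up)
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


random.seed(0)
d_in, d_out, rank = 8, 8, 2  # hypothetical sizes
w_frozen = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
# Standard LoRA init: A is small random, B is zero, so the update starts at zero.
a_down = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
b_up = [[0.0] * d_out for _ in range(rank)]

x = [[1.0] * d_in]
y = lora_forward(x, w_frozen, a_down, b_up, alpha=4)

full_params = d_in * d_out
lora_params = rank * (d_in + d_out)
print(f"trainable params: {lora_params} (LoRA) vs {full_params} (full fine-tuning)")
```

Because `b_up` starts at zero, the adapted layer initially reproduces the pretrained layer exactly. Only the small `a_down` and `b_up` matrices receive gradient updates, which is the usual argument for why LoRA helps with overfitting and catastrophic forgetting.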
It's very insightful how joint fusion trains the entire system end-to-end using a DNN, so gradients can flow back into the transformer embeddings, which is not possible with gradient-boosted trees (GBT).
Other insightful ideas from the paper:
> Context window problem: adding more data sources (e.g. financial products) can lead to worse results because the data sources "compete" for the available tokens in a fixed context window.
> Scaling laws: larger model size, context lengths, and data volume lead to improved performance.
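The context window trade-off is easy to see with back-of-the-envelope arithmetic. The sketch below only uses the "~14 tokens per transaction" figure mentioned earlier. The context length and the assumption that every source costs the same number of tokens per event and gets an equal share of the window are mine, not the paper's.

```python
TOKENS_PER_TRANSACTION = 14  # the "~14 tokens" per transaction mentioned above


def events_that_fit(context_length, tokens_per_event=TOKENS_PER_TRANSACTION):
    """How many whole events fit in a given token budget."""
    return context_length // tokens_per_event


def events_per_source(context_length, num_sources, tokens_per_event=TOKENS_PER_TRANSACTION):
    """Events available to each source if the window is shared equally (an assumption)."""
    return events_that_fit(context_length // num_sources, tokens_per_event)


context_length = 2048  # hypothetical context window
for num_sources in (1, 2, 4):
    per_source = events_per_source(context_length, num_sources)
    print(f"{num_sources} source(s): {per_source} events of history each")
```

Every extra source shortens the history the model sees for each of the others, which is one way to read the "competition" for tokens.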
There are still many interesting avenues they will explore, especially scaling laws and scaling the application to other products. It was also insightful how they are not just following the state of the art, but doing research to find new ideas [5].
---
Paper: https://t.co/QJYpVN6NBD
---
[0] https://t.co/VNdFcLByqi
[1] https://t.co/xZ4C4eBVhp
[2] https://t.co/gR1GWBelnO
[3] https://t.co/TCT2b0633O
[4] https://t.co/jeZHOn9EgR
[5] https://t.co/CAWJePsYXQ
[ML Grind]
Finished:
> Foundation Models: finished transformer-based model implementation from scratch + finetuning
> Finished reading the paper on an attention-based model in industry: interesting insights about context length, scaling laws, and joint fusion
Have been working on:
> ML monitoring + alerting system for ML models
> AI agent for business flow: interesting engineering learnings (agent/prompt refinements <> MCP <> backend + infra)
> Real estate liquidity model: interesting learnings about temporal splits, model calibration, model optimization, and dataset exploration
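On temporal splits: the idea is to train only on rows from before a cutoff date and evaluate on rows after it, so the model never sees the future during training. A minimal sketch, with made-up field names and listings (this is not the actual dataset or pipeline):

```python
from datetime import date


def temporal_split(rows, cutoff, date_key="listed_on"):
    """Train on rows strictly before the cutoff, evaluate on the rest."""
    train = [row for row in rows if row[date_key] < cutoff]
    test = [row for row in rows if row[date_key] >= cutoff]
    return train, test


# Hypothetical listings; field names and values are for illustration only.
listings = [
    {"listed_on": date(2024, 1, 10), "days_to_sell": 45},
    {"listed_on": date(2024, 3, 5), "days_to_sell": 30},
    {"listed_on": date(2024, 6, 20), "days_to_sell": 90},
    {"listed_on": date(2024, 9, 1), "days_to_sell": 12},
]

train, test = temporal_split(listings, cutoff=date(2024, 6, 1))
print(len(train), "train rows,", len(test), "test rows")
```

A random split would mix past and future rows and tends to give overly optimistic metrics when the target drifts over time.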
Plan for today:
> Continue writing the blog post about the foundation model implementation
> Continue the "Language Modeling from Scratch" course by Stanford
> Read a new ML paper
As long as I can remember, I have always had this desire to do great things. Not only making something wonderful, but striving to become great.
Yet another day, I wake up with these thoughts. Let's refine my skills, work on my projects, and go one step further in this infinite game of life.
[ML Grind]
Yesterday I took the day to work on the model training of the GPT-like model. I built the tokenization/embedding layers, the multi-head attention mechanism, added the transformer blocks to the GPTModel, and trained it on input text of 5k tokens (not big but useful for learning purposes).
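For a sense of how a small token stream becomes training data, here is a sketch of the usual next-token setup: slide a window over the token IDs and use the same window shifted by one position as the target. The context length and stride are hypothetical, and this is not the exact data loader from my implementation.

```python
def next_token_pairs(token_ids, context_length, stride):
    """Build (input, target) windows where the target is the input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy stream of 10 token IDs (made-up values).
token_ids = list(range(10))
pairs = next_token_pairs(token_ids, context_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)

# With ~5k tokens and a hypothetical context length of 256, there are only a few windows.
small_corpus = list(range(5000))
print(len(next_token_pairs(small_corpus, context_length=256, stride=256)), "training windows")
```

At every position the model is asked to predict the token that comes next, so a single window provides `context_length` training signals.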
Continuing my ML progress
> LLM from scratch: worked on this all day (built a self-attention and multi-head attention mechanism)
> Finished the monitoring system this week
> AI Engineering: continue the book. I'm currently working on an AI agent product and I need to learn more about this one
> Got a mentor at work: he shared many papers and resources I should read (tons of work to do!)
> ML Bootcamp: working on the first project with my pair. First part (EDA) is done. Now I need to move to the model training phase
It's been almost 2 months since I started working on ML, and it's been one of the best decisions of my career. The learning curve, the knowledge gap, the interesting projects, and the people I'm working with are all exciting. I'm having so much fun at and outside work.
The cherry on top is the ML/AI bootcamp provided by my company. They built a bootcamp based on ML theory and hands-on projects, and we need to study and deliver the exercises and projects. It's an intensive 3-month bootcamp on traditional ML and AI agents.
I keep following my curiosity and opportunities for growth. So much to learn.
[ML Grind] Today's study session.
> Building an LLM from scratch
> RL course + book
> Decoding AlphaFold + ML research
> Finishing the feature store and monitoring system implementation
So much to learn.
[ML Grind] Goals for Today
> Continue studying AlphaFold 2 and 3
> Finish the first coding assignment for the Language Modeling from Scratch course
> Continue designing the ML monitoring system for my model
> Continue RL course
> Read AlphaFold cases: starting with IsoLabs
---
Besides the ML grind, I still need to run my 5k, clean the house, and do meal prep for the week. Let's go!
The Infinity Machine will definitely be the next book I want to read. It's a book about Demis, DeepMind, and their work on AI.
If you're curious about it, you should give this Founders podcast a try: https://t.co/W8mngJrGp8
I like this podcast in general, but this one about Demis and how he works is fascinating.
The passage I liked the most was about his determination and being mission-driven 24/7: "There is no 50 percent mode in Demis. There is no 99 percent mode in Demis. There is only 100 percent."
As Kpaxs said, "Some people are playing a completely different game, 24/7. No off switch".
Last week, I read this very insightful blog post by @ZyWang25 titled "How I become a Research Engineer at Google DeepMind".
It's not only an inspiring and amazing accomplishment, but it also resonates with those of us who are following our curiosity, looking for this inner motivation (or passion, as other people say), improving our craft, and reaching our purpose.
Before I have the chance to write and share my own post about my experience, read this piece to feel inspired and motivated to keep pushing and grinding.
Here are the topics that resonated with me:
> Find your 'why'
> Upskill relentlessly. Do the work!
> Productivity = Progress: move closer to your goal
> Create your opportunities. Manufacture luck by working hard on your craft and being strategic about your goals
[ML Grind]
Focusing on foundation work:
> Deep Learning/LLM/ML foundation studies
> Bio x AI research: unwrapping AlphaFold
> Finished the Machine Learning System Design book
Documenting everything in my physical notebook and the ML research repo: https://t.co/yKNjawriYI
2 years ago, I started learning ML for fun, and then, after learning more about Hamming's ideas, I decided to take it seriously to accomplish my life's big goals.
I'm still in the process, but starting to get the rewards and making progress.
→ post about my ML learning experience: https://t.co/qL4R1olj41
→ post about my learning roadmap: https://t.co/7oLBZJN1RI
There is so much to learn, still.
So many books, so little time.
Besides The Art of Doing Science and Engineering, I'm excited to read Sutskever's List and The Infinity Machine, a book released a couple of days ago.
Time to remove all distractions and focus.
Started a new book today.
I'm on the first few pages, and the way it was written already caught my attention.
"Teachers should prepare the student for the student's future, not for the teacher's past. Most teachers rarely discuss the important topic of the future of their field, and when this is pointed out, they usually reply: 'No one can know the future'. It seems to me the difficulty of knowing the future does not absolve the teacher from seriously trying to help the student to be ready for it when it comes."
Excited to be educated on styles of learning and thinking, and then get back to training, applying those principles.