As an applied ML engineer who is learning more about research and theory, I read two interesting resources this week that are worth sharing.
The first one is the "On Research Taste"¹ article by Albert Ying. I liked how he defines what 'taste' really is: "the ability to find the node that would affect the largest number of other nodes [...] over a network", where the graph is a collection of "hypotheses and analyses you could pursue". I think the missing part of this short article is "how to develop 'taste'".
The second one is "An Unofficial Guide to Prepare for a Research Position Application"² by Sakana AI. It's the most insightful blog post I've read this year. It lays out the core principles of being a great researcher: how to approach ideas, the importance of clear communication, and a good balance between technical ability (engineering skills) and creativity.
The post is about more than preparing for their interview. It describes their way of doing great research.
¹ https://t.co/rwKGjvhYY7
² https://t.co/qUyg8IMCTF
His playlist is also really good. But the resources I used are not really documentation. One is a course by Stanford, and the other is a book by Sebastian Raschka. The course is a great complement because it goes beyond LLMs: it covers resource management, GPUs, tensor optimization, and parallel computation. Fun stuff.
𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗮 𝗚𝗣𝗧 𝗠𝗼𝗱𝗲𝗹
For the past few weeks, I've been reading about Foundation Models [0] and decided to work on the implementation of the GPT architecture [1] to understand its building blocks and how it works under the hood.
Here are the concepts I worked on in this implementation:
Tokenization → Embeddings → Self-Attention → Multi-Head Attention → Transformer Block → GPT Model → Pretraining.
— Tokenization: building tokens from the input text and transforming them into token IDs, then using a BPE tokenizer algorithm [2]
— Embeddings: representing a token with a single scalar value (its ID) is too simplistic, so embeddings are used to build richer representations. I built small embeddings for learning purposes and then increased their size to scale up
— Multi-Head Self-Attention: this was one of the most interesting parts, creating attention scores and building relationships between tokens to produce context vectors
— Transformer blocks have the attention heads, dropout, layer norm, and the feed-forward network
— Pretraining follows the standard training process used for deep learning models, but in this case the weights are updated end-to-end, from the embeddings to the attention layers to the feed-forward network
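To make the attention step above more concrete, here is a minimal sketch of single-head causal self-attention in plain Python. It is not the code from the repo [1]: the function names (`softmax`, `matmul`, `causal_self_attention`) and the toy inputs are assumptions made for illustration, and a real implementation would use tensors, multiple heads, and dropout.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def causal_self_attention(x, w_q, w_k, w_v):
    """Return (attention weights, context vectors) for one attention head.

    x has one row per token; w_q, w_k, w_v are the (hypothetical) projection
    matrices for queries, keys, and values.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    weights = []
    for i, q_i in enumerate(q):
        # Causal mask: token i only scores itself and earlier tokens.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Future positions get a weight of exactly zero.
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    # Each context vector is a weighted sum of the value vectors.
    context = matmul(weights, v)
    return weights, context


# Toy example: 3 tokens with 2-dimensional embeddings, identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn_weights, context_vectors = causal_self_attention(tokens, identity, identity, identity)
```

Each row of `attn_weights` sums to 1 and has zeros for future tokens, which is what lets a GPT-style model be trained to predict the next token without seeing it.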
The implementation was heavily inspired by the Language Modeling from Scratch course [3] and the Build a Large Language Model book [4]. It's still very rudimentary, but it's useful if you plan to learn these concepts in depth.
🔗 Article Link: Self-Attention, Foundation Models, and the GPT Architecture from Scratch: https://t.co/R69Heqle3P
---
In the future, I plan to write about finetuning (using foundation models and finetuning for other tasks) and optimizations (attention blocks optimization, GPU and kernel optimization).
[0] Foundation Models at Nubank: https://t.co/xLRWLPOl3o
[1] LLM implementation repo: https://t.co/STPyVTKlM4
[2] Tokenizers lecture: https://t.co/YtBodQ2SEW
[3] Language Modeling from Scratch: https://t.co/TMGTAXJmqS
[4] Build a Large Language Model: https://t.co/TawYd8Zi8M
📝 I hope you can steal some ideas and insights from this new post and put them into practice in your life. It's my reflection on reading 47 books in the first 6 months of 2023 and how I'm now focusing on reading less and applying more of what I read.
https://t.co/8f2hUvHOzx
✨ I worked on this article the whole day and made a lot of progress. I'm almost there.
A lot of work, with many experiments, but it's getting traction. "Make Something Wonderful" inspired me to keep building and sharing.
[ML Grind]
Finished:
> Foundation Models: finished transformer-based model implementation from scratch + finetuning
> Finished reading the paper on attention-based models in industry: interesting insights about context length, scaling laws, and joint fusion
Have been working on:
> ML monitoring + alerting system for ML models
> AI agent for business flow: interesting engineering learnings (agent/prompt refinements <> MCP <> backend + infra)
> Real estate liquidity model: interesting learnings about temporal splits, model calibration, model optimization, and dataset exploration
Plan for today:
> Continue writing the blog post about the foundation model implementation
> Continue the "Language Modeling from Scratch" course by Stanford
> Read a new ML paper
@yash1_ I finished the writing, but I'm still working on the illustrations. After that, I'll carve out some time to refine it before publishing. Hopefully tomorrow (or this week)!
I finally finished this book today. What a remarkable last chapter! I'm gathering all my notes to share them online.
Also, I'm looking for the next book! I accept recommendations.
📚 Started a new book today.
I'm on the first few pages, and the way it's written has already caught my attention.
"Teachers should prepare the student for the student's future, not for the teacher's past. Most teachers rarely discuss the important topic of the future of their field, and when this is pointed out, they usually reply: 'No one can know the future'. It seems to me the difficulty of knowing the future does not absolve the teacher from seriously trying to help the student to be ready for it when it comes."
Excited to be educated on styles of learning and thinking, and then get back to training, applying those principles.
Many people have already pointed this out, but this course by Stanford is remarkable. It's been part of the first hour of my morning: watching the lecture, taking notes, spawning new tabs with the different papers mentioned, and coding to build the intuition behind each lecture.
Mixture of experts was a nice lecture, but the one I liked the most so far was about PyTorch and resource accounting and how to make sense of CPU/GPU, memory, runtime/compute (FLOPs), etc., from first principles.
🔗 link: https://t.co/TMGTAXIOBk
My notes on the repo: https://t.co/ZzZ028d7jL
Most of my notes are still in my physical notebook, though. I haven't had time to move them all to the repo yet.
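The resource accounting mentioned above mostly comes down to back-of-the-envelope arithmetic. Here is a rough sketch of that kind of calculation. It is not taken from the course materials: the function names are hypothetical, and the 6 x parameters x tokens training estimate is a common rule of thumb for dense models, not an exact count.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for an (m x k) @ (k x n) matrix multiplication.

    Each of the m * n outputs needs about k multiplications and k additions.
    """
    return 2 * m * k * n


def tensor_bytes(num_elements, bytes_per_element=4):
    """Memory for a tensor: 4 bytes per element for float32, 2 for bfloat16."""
    return num_elements * bytes_per_element


def training_flops(num_params, num_tokens):
    """Rule-of-thumb training cost: roughly 2 FLOPs per parameter per token
    for the forward pass and 4 for the backward pass."""
    return 6 * num_params * num_tokens


# Toy numbers, chosen only for illustration.
linear_layer_flops = matmul_flops(m=8, k=512, n=2048)  # batch of 8 through one linear layer
weights_in_float32 = tensor_bytes(512 * 2048)          # memory for that layer's weight matrix
weights_in_bfloat16 = tensor_bytes(512 * 2048, bytes_per_element=2)
```

Dividing the estimated FLOPs by the throughput a GPU actually sustains gives a first guess at runtime, which is the kind of first-principles reasoning the lecture walks through.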
notes: https://t.co/2NGwUFU576
I've just read the "Let Me Convince You to Be Prolific" post about the benefits of being prolific, especially for creative people in the digital age.
The idea is that we should create and release more experiments, creating this long tail of acceptable work:
— Experiment > Failure > Refine > Loop
— Publishing work helps people find you
— Early drafts, faster feedback loop > faster improvement
— Each experiment contributes to the following one
I noticed this about my blog, where I've been writing for 10+ years now. Each technical post I wrote helped improve the next one. None of them is perfect, but I can see how much progress I have made over time.
The things you learn, the feedback you get, and the will to refine your work lead to mastery. And the long tail of work starts to compound and helps people discover you.
There are these two quotes I liked:
> "Giving up on perfectionism doesn’t mean that you will not produce anything perfect, but rather that perfection will happen from time to time because of the sheer mass of output." — Dean Keith Simonton
> "If you can write one short story a week — it doesn’t matter what the quality is to start, but at least you’re practicing, and at the end of the year you have 52 short stories, and I defy you to write 52 bad ones." — Ray Bradbury
I found this blog in @noghartt's bookmarks. There's an awesome curation there.
→ Blog: https://t.co/LBI4yNF1d8
I've just found out about this course on Foundation Models and Generative AI. Quite interesting lectures. I plan to watch the lectures as soon as I finish the Language Modeling from Scratch course. So many interesting things to learn.
@CausalFlops28 Manually, unfortunately. This is why most of my notebook notes are not transferred into the repo. But it's still a great tool to augment my thinking.