Futurality AI | AI-Powered Design Workflows for Automotive + Industrial Design | Sketch → 3D → Prototype in minutes | Helping car designers & makers ship 10x fa
Build a Large Language Model from scratch!
This repository contains the code examples for developing, pretraining, and finetuning a LLM from scratch.
It is the official codebase for the book Build a Large Language Model (From Scratch).
Notebook examples are included for each chapter:
Chapter 1: Understanding Large Language Models
Chapter 2: Working with Text Data
Chapter 3: Coding Attention Mechanisms
Chapter 4: Implementing a GPT Model from Scratch
Chapter 5: Pretraining on Unlabeled Data
Chapter 6: Finetuning for Text Classification
Chapter 7: Finetuning to Follow Instructions
Link to the repo in the comments!
Research papers every LLM engineer must read:
- Attention Is All You Need
- BERT
- GPT-3: Language Models are Few-Shot Learners
- Scaling Laws for Neural Language Models
- Chinchilla
- InstructGPT
- Chain-of-Thought Prompting
- Retrieval-Augmented Generation
- LoRA: Low-Rank Adaptation
- LLaMA
- FlashAttention
- DPO: Direct Preference Optimization
🤖 RAG gives LLMs access to your data. Agentic RAG gives them judgment about what to do with it.
The difference shows up in how each system handles uncertainty. Standard retrieval-augmented generation runs a fixed pipeline: encode the query, search the vector database, retrieve similar documents, generate a response. It works well when the query is well-formed and the right context already exists in one place.
Agentic RAG adds a layer of reasoning at almost every step — rewriting the query before retrieval, deciding whether the initial results are actually sufficient, choosing between multiple source types (vector DB, APIs, live web), and evaluating whether the final answer is relevant before surfacing it. If any of those checks fail, it loops back rather than proceeding.
That feedback loop is what makes it genuinely different, not just incrementally better. A standard RAG pipeline doesn't know when it's about to give a bad answer. An agentic one can at least ask the question.
#agenticai #agenticaibootcamp #aibootcamps #rag #agenticrag #aiengineering
SSO vs OAuth vs OIDC vs SAML
𝗦𝗦𝗢 is a user experience, not a protocol. It lets users log in once and access multiple apps without re-authenticating; providing seamless access across tools. It relies on protocols like SAML or OIDC.
𝗢𝗔𝘂𝘁𝗵 is for authorization. It lets apps access user data or services without sharing credentials. It controls what an app can access, not identity.
𝗢𝗜𝗗𝗖 is an authentication layer on top of OAuth 2.0. It verifies user identity and provides user info via ID tokens (usually JWTs). It’s the standard for login + identity in modern apps.
𝗦𝗔𝗠𝗟 is an older, XML-based authentication protocol used for enterprise SSO. It’s still widely used in legacy and enterprise systems, though newer applications increasingly adopt OIDC. It’s powerful but more complex than OIDC.
If you remember one thing: OAuth = access, OIDC = identity, SAML = enterprise SSO, SSO = the experience.
Learn more here: https://t.co/sNCN5cdt6V
What else would you add?
——
♻️ Repost to help others learn system design.
➕ Follow me ( Nikki Siapno ) + turn on notifications.
26 Claude rules I learned the hard way:
(so you don't have to)
1. The chat
☑ The longer your chat, the dumber Claude gets:
Every message reloads the entire conversation.
↳ Start fresh. Don't drag a 50-message thread.
2. The prompt
☑ A 500-word prompt isn't good. It's lazy:
If you can't explain in 30 words, Claude can't build it
"Make it better" is the worst prompt ever written.
↳ Short & specific beats long and vague.
3. The files
☑ Dumping 50 files makes Claude dumber:
Claude re-reads everything before every answer.
That's the expensive part - not the response.
↳ Give it 3 sharp files, not 50 messy ones.
4. The memory
☑ Claude doesn't remember last week:
No carryover. No "as we discussed."
↳ Context lives in your files, not Claude's head.
5. The setup
☑ The model selector matters more than prompts:
Adaptive thinking is off by default.
That's why your answers feel generic.
↳ Switch it on. Pick the right model. Then prompt.
6. The shift
☑ Most people quit Claude before they try Cowork:
The day you start uploading, everything changes.
Skills replace 80% of what you keep retyping.
↳ Stop typing into a chat. Start building a system.
I teach you how Claude works at https://t.co/Vn60ElPZ2i.
Copy my folder & download my 3 personal .md files:
Step 1: Subscribe for free at https://t.co/psB7XxB2Y4.
Step 2: You will have two choices: free or paid.
Step 3: Choose the free tier. Don't pay for anything.
Step 4: Open your welcoming email. Reply to it.
Step 5: Trace the Notion link. Open '.md files' folder.
Step 6: Access my entire folder + 3 files template.
Step 7: Send this image to your team's channel.
Step 8: Read 2x newsletter per week (for free).
Step 9: Become the "AI guy" at work, forever.
Andrej Karpathy: "90% of Claude's mistakes come from missing context, not a weak model."
41% mistake rate without a CLAUDE.md. 11% with the 4-rule baseline. 3% with the 12-rule version below
here are the 12 rules senior engineers settled on:
1. think before coding: state assumptions, don't guess. the model can't read your mind, stop hoping it will
2. simplicity first: minimum code, no speculative abstractions. the moment you let Claude add "for future flexibility," you've added 200 lines you'll delete next quarter
3. surgical changes: touch only what you must. don't let it improve adjacent code, that's how PRs blow up
4. goal-driven execution: define success criteria upfront, loop until verified. without them Claude either loops forever or stops too early
5. use the model only for judgment calls: classification, drafting, summarization, extraction. NOT routing, retries, status-code handling, deterministic transforms. if code can answer, code answers
6. token budgets are not advisory: per-task 4000, per-session 30000. by message 40 of a long debug, Claude is re-suggesting fixes you rejected at message 5
7. surface conflicts, don't average them: two patterns in the codebase? pick one. Claude blending them is how errors get swallowed twice
8. read before you write: read exports, callers, shared utilities. Claude will happily add a duplicate function next to an identical one it never read
9. tests verify intent, not just behavior: a test that can't fail when business logic changes is wrong. all 12 of Claude's tests can pass while the function returns a constant
10. checkpoint every significant step: Claude finished steps 5 and 6 on top of a broken state from step 4. nobody noticed for an hour
11. match the codebase conventions: class components? don't fork to hooks silently. testing patterns assumed componentDidMount, hooks broke them without surfacing
12. fail loud: "completed successfully" with 14% of records silently skipped is the worst class of bug. surface uncertainty, don't hide it
what actually compounds instead of the next framework:
- the CLAUDE.md file as institutional memory across sessions
- eval-driven changes, not vibe-driven
- checkpoints over speed
- explicit conflicts over silent blending
- discipline over framework, every time
- one repo, one rules file, no exceptions
you don't need a better AI
you need better context engineering
complete playbook below ��
LLM vs Agent vs Agentic Workflow vs Multi-Agent System ⚡
People throw these four terms around like they mean the same thing. They don't — and the difference decides your cost, your latency, and whether you can actually debug the thing when it breaks.
Here's the clean mental model 👇
🧠 LLM → GENERATE
A model that produces text from the context it's given. Single-step, no real autonomy.
→ Best for: chat, summarization, drafting, Q&A
→ Autonomy: Low
🤖 AGENT → ACT
Reasons, chooses actions, uses tools, iterates toward a goal. Keeps working memory.
→ Best for: task execution, research, troubleshooting
→ Autonomy: Medium
🔀 AGENTIC WORKFLOW → ORCHESTRATE
A structured flow where AI runs predefined steps. Deterministic, controllable, human approvals optional.
→ Best for: business processes, document pipelines, repeatable tasks
→ Autonomy: Medium–High
👥 MULTI-AGENT SYSTEM → COLLABORATE
Multiple specialized agents working together. Parallel, powerful, but more overhead.
→ Best for: complex projects, large multi-step problems
→ Autonomy: High
The biggest mistake? Reaching for the most autonomous option because it sounds impressive — a multi-agent system for a job a single prompt could handle. You pay all the coordination cost and get none of the benefit.
Production AI isn't one thing. It ranges from simple generation to coordinated autonomous systems. The skill is matching the architecture to the real problem.
Save this for the next time someone calls a chatbot an "agent." 🔖
Where do you draw the line between an agentic workflow and a true agent? Tell me below 👇
Karpathy said something you'll regret ignoring:
"Remove yourself as the bottleneck. Maximize your leverage. Put in very few tokens, and a huge amount of stuff happens on your behalf."
Loop engineering is the exact thing that does that.
In a hand-run session, the operator handles two things:
- deciding what the agent runs next
- and checking its output before the next step
Both are manual, and both decide how far the agent gets on its own without the operator.
Loop engineering moves both steps into the system.
A core operating structure surrounds the loop, and the diagram below depicts it.
- A schedule decides what to run
- Loop is the maker that produces the work
- A separate checker agent grades the output
- A file on disk holds the state they both read.
The loop runs until either done, max iterations, or an exhausted budget.
Here are some practical engineering considerations:
1) A model grading its own output justifies what it already did instead of catching where it failed.
That's why a separate checker's findings return to the maker as the next instruction. And the cycle repeats until the checker finds nothing left to fix.
2) A loop with no stop condition burns tokens, and the cost climbs fast once sub-agents and long runs add up.
That's why the exit must be set before the loop runs, not while it is running.
A simple exit could be:
↳ fix only the major issues, run one final pass, and stop after two loops, with "all tests pass and lint clean" as the rule that ends it.
3) State has to live on disk, not in context.
The model forgets everything between runs, so an MD file or a knowledge graph holds what is done and what is still open.
Each run reads it and writes back to it, which lets a loop pick up again after days.
4) The lower the verification bar, the safer the loop.
Boring, repetitive checks like a stale version string or a missing test are trivial to verify, so a loop runs them with little risk while the operator is away.
Judgment-heavy work is loopable too, but only as far as the checker can confirm the result.
Let's look at how an unattended loop fails in two ways.
1) It reports done when nothing is actually verified.
The separate checker exists to prevent it, but it merges code faster than anyone reads it, so over weeks, the team stops understanding its own codebase while every check stays green.
Green tests say the code passed the tests, not that anyone knows what shipped. Someone still has to read what the loop merges.
2) The checker keeps a running loop honest, but it only catches failures inside a run.
The harness around the loop, like the prompts, tools, and checks wrapped around the model, still drifts and breaks in production as models change.
That repair loop is usually run by hand based on observability traces.
My co-founder wrote a detailed walkthrough (with code) on making that harness repair itself, where a failing trace gets diagnosed, the fix is verified against the exact input that failed, and the failure is locked as a regression test so it cannot recur.
Read it below.
If you want to get dangerously good at system design, learn these concepts:
1 Scalability
2 Availability
3 Reliability
4 Latency
5 Throughput
6 Database
7 SQL vs NoSQL
8 Load Balancing
9 Caching
10 Cache Invalidation
11 API Design
12 REST
13 GraphQL
14 gRPC
15 Authentication
16 Fault Tolerance
17 High Availability
18 CAP Theorem
19 Consistency Models
20 Replication
21 Erasure Coding
22 Consensus
23 Leader Election
24 Secrets Management
25 RBAC
26 Sharding
27 Indexing
28 Denormalization
29 ACID
30 BASE
31 Event-Driven
32 Message Queue
33 Pub/Sub
34 Sync vs Async
35 Idempotency
36 Bulkhead
37 Retry Logic
38 Timeout
39 Service Discovery
40 API Gateway
41 Blue-Green Deployment
42 Canary Release
43 Feature Flags
44 Observability
45 Logging
46 Correlation ID
47 Monitoring
48 Alerting
49 Full-Text Search
50 Time Series
(...and many more!)
What else should make this list?
===
👋 PS - Want a simple breakdown of each concept?
Read right now in my newsletter:
→ Part 1: https://t.co/u7BsUK307i
→ Part 2: https://t.co/CJAwmrUXdI
→ Part 3: https://t.co/DOQpnNOnjc
===
💾 Save & RT to help others get good at system design.
👤 Follow @systemdesignone + turn on notifications.