This is great news for employees, both on non-compete clauses in employment agreements and on anticompetitive no-hire agreements among companies. In gaming, Activision was notorious for obstructive non-competes before the Microsoft acquisition.
🔥 Llama3 is out 🔥
8B and 70B models available today.
8k context length.
Trained with 15 trillion tokens on a custom-built 24k GPU cluster.
Great performance on various benchmarks, with Llama3-8B doing better than Llama2-70B in some cases.
More versions are coming over the next few months.
https://t.co/EkU9aIHdZE
Congrats to @AIatMeta on Llama 3 release!!
https://t.co/UBwFPTJM6V
Notes:
Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @ @lmsysorg :))
400B is still training, but already encroaching on GPT-4 territory (e.g. 84.8 MMLU vs. 86.5 4Turbo).
Tokenizer: number of tokens was 4X'd from 32K (Llama 2) -> 128K (Llama 3). With more tokens you can compress sequences more in length (the post cites 15% fewer tokens) and see better downstream performance.
Architecture: no major changes from the Llama 2. In Llama 2 only the bigger models used Grouped Query Attention (GQA), but now all models do, including the smallest 8B model. This is a parameter sharing scheme for the keys/values in the Attention, which reduces the size of the KV cache during inference. This is a good, welcome, complexity reducing fix and optimization.
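To make the KV cache point concrete, here is a rough back-of-the-envelope sketch. The layer count, head counts and head dimension below are assumptions chosen for illustration (they are not stated in the post), and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token position."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Assumed, illustrative configuration (not taken from the post).
N_LAYERS = 32
N_QUERY_HEADS = 32   # full multi-head attention would cache this many K/V heads
N_KV_HEADS = 8       # with GQA, several query heads share one K/V head
HEAD_DIM = 128
SEQ_LEN = 8192       # the context length mentioned in the post

mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"no sharing: {mha_bytes / 2**30:.1f} GiB")
print(f"with GQA:   {gqa_bytes / 2**30:.1f} GiB")
print(f"reduction:  {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is where the inference-time saving comes from.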
Sequence length: the maximum number of tokens in the context window was bumped up to 8192 from 4096 (Llama 2) and 2048 (Llama 1). This bump is welcome, but quite small w.r.t. modern standards (e.g. GPT-4 is 128K) and I think many people were hoping for more on this axis. May come as a finetune later (?).
Training data. Llama 2 was trained on 2 trillion tokens, Llama 3 was bumped to 15T training dataset, including a lot of attention that went to quality, 4X more code tokens, and 5% non-en tokens over 30 languages. (5% is fairly low w.r.t. non-en:en mix, so certainly this is a mostly English model, but it's quite nice that it is > 0).
Scaling laws. Very notably, 15T is a very, very large dataset to train with for a model as "small" as 8B parameters; this is not normally done, and it is new and very welcome. The Chinchilla "compute optimal" point for an 8B model would be to train it for ~200B tokens (if you were only interested in getting the most "bang-for-the-buck" w.r.t. model performance at that size). So this is training ~75X beyond that point, which is unusual but, personally, I think extremely welcome, because we all get a very capable model that is very small and easy to work with and inference. Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence. Actually, I really hope people carry forward the trend and start training and releasing even more long-trained, even smaller models.
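The ~75X figure follows directly from the numbers quoted above. A small sketch of the arithmetic; the "about 20 tokens per parameter" rule of thumb is an assumption added here for comparison and does not appear in the post.

```python
params = 8 * 10**9                      # 8B parameters
trained_tokens = 15 * 10**12            # 15T training tokens
compute_optimal_tokens = 200 * 10**9    # ~200B, the figure quoted in the post

# How far past the quoted compute-optimal point the training run goes.
overtraining_factor = trained_tokens / compute_optimal_tokens

# Tokens seen per parameter during training.
tokens_per_param = trained_tokens / params

# Assumed rule of thumb (not from the post): ~20 tokens per parameter.
rule_of_thumb_tokens = 20 * params

print(f"overtraining factor: {overtraining_factor:.0f}x")
print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"20 tokens/param rule of thumb: {rule_of_thumb_tokens / 1e9:.0f}B tokens")
```

The rule of thumb lands at 160B tokens, the same ballpark as the ~200B quoted above.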
Systems. Llama 3 is cited as trained with 16K GPUs at observed throughput of 400 TFLOPS. It's not mentioned but I'm assuming these are H100s at fp16, which clock in at 1,979 TFLOPS in NVIDIA marketing materials. But we all know their tiny asterisk (*with sparsity) is doing a lot of work, and really you want to divide this number by 2 to get the real TFLOPS of ~990. Why is sparsity counting as FLOPS? Anyway, focus Andrej. So 400/990 ~= 40% utilization, not too bad at all across that many GPUs! A lot of really solid engineering is required to get here at that scale.
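The utilization estimate above is a two-step calculation, sketched here with the numbers from the text. The GPU model and precision are the post's own guess, so treat the result as approximate.

```python
marketing_tflops = 1979.0              # quoted peak, with sparsity
dense_tflops = marketing_tflops / 2    # halve it to get the dense figure, ~990
observed_tflops = 400.0                # cited observed throughput per GPU

utilization = observed_tflops / dense_tflops

print(f"dense peak:  {dense_tflops:.1f} TFLOPS")
print(f"utilization: {utilization:.1%}")
```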
TLDR: Super welcome, Llama 3 is a very capable looking model release from Meta. Sticking to fundamentals, spending a lot of quality time on solid systems and data work, exploring the limits of long-training models. Also very excited for the 400B model, which could be the first GPT-4 grade open source release. I think many people will ask for more context length.
Personal ask: I think I'm not alone in saying that I'd also love much smaller models than 8B, for educational work, and for (unit) testing, and maybe for embedded applications etc. Ideally at ~100M and ~1B scale.
Talk to it at https://t.co/KmKRlZeTHQ
Integration with https://t.co/RD6MRWT2zz
# explaining llm.c in layman terms
Training Large Language Models (LLMs), like ChatGPT, involves a large amount of code and complexity.
For example, a typical LLM training project might use the PyTorch deep learning library. PyTorch is quite complex because it implements a very general Tensor abstraction (a way to arrange and manipulate arrays of numbers that hold the parameters and activations of the neural network), a very general Autograd engine for backpropagation (the algorithm that trains the neural network parameters), and a large collection of deep learning layers you may wish to use in your neural network. The PyTorch project is 3,327,184 lines of code in 11,449 files.
On top of that, PyTorch is written in Python, which is itself a very high-level language. You have to run the Python interpreter to translate your training code into low-level computer instructions. For example, the CPython project that does this translation is 2,437,955 lines of code across 4,306 files.
I am deleting all of this complexity and boiling LLM training down to its bare essentials, speaking directly to the computer in a very low-level language (C), with no other library dependencies. The only abstraction below this is the assembly code itself. I think people find it surprising that, by comparison to the above, training an LLM like GPT-2 is actually only ~1,000 lines of C in a single file. I am achieving this compression by implementing the neural network training algorithm for GPT-2 directly in C. This is difficult because you have to understand the training algorithm in detail, be able to derive the forward and backward passes of backpropagation for all the layers, and implement all the array indexing calculations very carefully because you don't have the PyTorch tensor abstraction available. So it's a very brittle thing to arrange, but once you do, and you verify the correctness by checking against PyTorch, you're left with something very simple, small and imo quite beautiful.
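To give a flavor of the "array indexing calculations" mentioned above, here is a minimal sketch of addressing a (B, T, C) activation stored as one flat array, which is the bookkeeping a tensor library normally hides. It is written in Python for brevity and is not taken from llm.c; `flat_index` and the toy sizes are hypothetical.

```python
def flat_index(b, t, c, T, C):
    """Offset of element (b, t, c) in a row-major flat buffer
    holding a (B, T, C) array."""
    return b * T * C + t * C + c

# Toy sizes: batch, sequence length, channels.
B, T, C = 2, 3, 4
buf = [0.0] * (B * T * C)

# Fill each element with a value that encodes its own coordinates,
# so that a wrong offset is easy to spot.
for b in range(B):
    for t in range(T):
        for c in range(C):
            buf[flat_index(b, t, c, T, C)] = 100.0 * b + 10.0 * t + c

last = flat_index(B - 1, T - 1, C - 1, T, C)
print(last, buf[last])
```

Getting one of these strides wrong does not raise an error; it silently reads the wrong numbers, which is part of why this style of code is brittle.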
Okay so why don't people do this all the time?
Number 1: you are giving up a large amount of flexibility. If you want to change your neural network around, in PyTorch you'd be changing maybe one line of code. In llm.c, the change would most likely touch a lot more code, may be a lot more difficult, and require more expertise. E.g. if it's a new operation, you may have to do some calculus, and write both its forward pass and backward pass for backpropagation, and make sure it is mathematically correct.
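As a rough illustration of what "write both its forward pass and backward pass" involves, here is a sketch for a single operation, the tanh approximation of GELU. It is not llm.c's code: the function names are hypothetical, it is in Python rather than C, and correctness is checked against a finite-difference estimate instead of against PyTorch.

```python
import math

SCALE = math.sqrt(2.0 / math.pi)
COEFF = 0.044715

def gelu_forward(inp):
    """out[i] = 0.5 * x * (1 + tanh(SCALE * (x + COEFF * x^3)))"""
    out = []
    for x in inp:
        u = SCALE * (x + COEFF * x ** 3)
        out.append(0.5 * x * (1.0 + math.tanh(u)))
    return out

def gelu_backward(inp, dout):
    """Chain rule by hand: dinp[i] = dout[i] * d(gelu)/dx at inp[i]."""
    dinp = []
    for x, g in zip(inp, dout):
        u = SCALE * (x + COEFF * x ** 3)
        t = math.tanh(u)
        sech2 = 1.0 - t * t                          # derivative of tanh
        du_dx = SCALE * (1.0 + 3.0 * COEFF * x * x)
        local_grad = 0.5 * (1.0 + t) + 0.5 * x * sech2 * du_dx
        dinp.append(local_grad * g)
    return dinp

def numerical_grad(x, eps=1e-5):
    """Central finite difference, used only to check the derivation."""
    hi = gelu_forward([x + eps])[0]
    lo = gelu_forward([x - eps])[0]
    return (hi - lo) / (2.0 * eps)

xs = [-2.0, -0.5, 0.0, 0.7, 3.0]
analytic = gelu_backward(xs, [1.0] * len(xs))
numeric = [numerical_grad(x) for x in xs]
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"max abs difference: {max_err:.2e}")
```

In a framework with autograd only the forward function would be needed; here the derivative has to be derived and verified by hand for every new operation.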
Number 2: you are giving up speed, at least initially. There is no fully free lunch - you shouldn't expect state of the art speed in just 1,000 lines. PyTorch does a lot of work in the background to make sure that the neural network is very efficient. Not only do all the Tensor operations very carefully call the most efficient CUDA kernels, but also there is for example torch.compile, which further analyzes and optimizes your neural network and how it could run on your computer most efficiently. Now, in principle, llm.c should be able to call all the same kernels and do it directly. But this requires some more work and attention, and just like in (1), if you change anything about your neural network or the computer you're running on, you may have to call different kernels, with different parameters, and you may have to make more changes manually.
So TLDR: llm.c is a direct implementation of training GPT-2. This implementation turns out to be surprisingly short. No other neural network is supported, only GPT-2, and if you want to change anything about the network, it requires expertise. Luckily, all state of the art LLMs are actually not a very large departure from GPT-2 at all, so this is not as strong of a constraint as you might think. And llm.c has to be additionally tuned and refined, but in principle I think it should be able to almost match (or even outperform, because we get rid of all the overhead?) PyTorch, with not too much more code than where it is today, for most modern LLMs.
And why am I working on it? Because it's fun. It's also educational, because those 1,000 lines of very simple C are all that is needed, nothing else. It's just a few arrays of numbers and some simple math operations over their elements like + and *. And it might even turn out to be practically useful with some more work that is ongoing.
I am delighted to announce publication of the 4th edition of Linear Algebra Done Right as an Open Access book. The electronic version is legally free to the world at https://t.co/ii1ovMFKvH.
That website also has links to pre-order the print version of the book.
#linearalgebra