neel

@nlbte

engineering

Joined April 2018

299 Following

474 Followers

177 Posts

nlbte retweeted

Andrej Karpathy

@karpathy

18 days ago

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

150K

11K

14K

27M

nlbte retweeted

Google Research

@GoogleResearch

2 months ago

Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://t.co/CDSQ8HpZoc
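
To put the quoted "at least 6x" figure in context, here is a rough sketch of how key-value cache size is usually estimated and what a 6x reduction would mean. The model shape below is a hypothetical chosen only for illustration, and the function name is not from the post or from any library; the sketch says nothing about how TurboQuant itself works.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values (the factor 2) for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model shape (assumption, not taken from the post), 16-bit cache values.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

# Apply the "at least 6x" reduction quoted in the post.
compressed = baseline / 6

GIB = 1024 ** 3
print(f"16-bit cache: {baseline / GIB:.2f} GiB")
print(f"after a 6x reduction: {compressed / GIB:.2f} GiB")
```

The cache grows linearly with sequence length and batch size, which is why compressing it matters most for long-context, high-throughput serving.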

39K

22K

19M

nlbte retweeted

Dwarkesh Patel

@dwarkesh_sp

8 months ago

The @karpathy interview

0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self driving took so long
1:57:08 – Future of education

Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!

536

19K

25K

11M

nlbte retweeted

Andrej Karpathy

@karpathy

8 months ago

Excited to release new repo: nanochat!
(it's among the most unhinged I've written).

Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script and in as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:

- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Efficient inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems, answer simple questions. About ~12 hours surpasses GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.

Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
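
The two price points quoted in the post (~$100 for ~4 hours and ~$1000 for ~41.6 hours, both on one 8XH100 node) imply a roughly constant hourly node price. A quick check using only those figures; the function and variable names are illustrative and do not come from the nanochat repo:

```python
# Figures quoted in the post: (total cost in USD, training hours) on one 8xH100 node.
quoted_runs = {
    "speedrun": (100.0, 4.0),
    "scaled-up": (1000.0, 41.6),
}
GPUS_PER_NODE = 8


def implied_rates(cost_usd, hours, gpus=GPUS_PER_NODE):
    """Return (USD per node-hour, USD per GPU-hour) implied by a cost/time pair."""
    node_rate = cost_usd / hours
    return node_rate, node_rate / gpus


for name, (cost, hours) in quoted_runs.items():
    node_rate, gpu_rate = implied_rates(cost, hours)
    print(f"{name}: ~${node_rate:.2f} per node-hour, ~${gpu_rate:.2f} per GPU-hour")
```

Both pairs work out to roughly $24 to $25 per node-hour, so the odd-looking "41.6 hours" appears to be simply $1000 divided by the hourly node price rather than a measured training time.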

684

24K

18K

nlbte retweeted

Thinking Machines

@thinkymachines

8 months ago

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. https://t.co/fYV4FPi71m

559

nlbte retweeted

Thinking Machines

@thinkymachines

9 months ago

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly.

The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains.

https://t.co/lrJioBmpbT

230

nlbte retweeted

alterego

@alterego_io

9 months ago

Introducing Alterego: the world’s first near-telepathic wearable that enables silent communication at the speed of thought. Alterego makes AI an extension of the human mind. We’ve made several breakthroughs since our work started at MIT. We’re announcing those today.

885

11K

nlbte retweeted

Mistral AI

@MistralAI

12 months ago

Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning. https://t.co/SwKEEtCIXh

103

438

564

733K

nlbte retweeted

SkalskiP @ CVPR2026

@skalskip92

about 1 year ago

I can finally map @NBA player's position from the camera perspective onto the court map

it's still a bit shaky... I'll smooth it out later

it's time to detect shooting motions and mark the shot location!

some of the code has already been migrated to: https://t.co/VK0RQFWud1

264

10K

815

nlbte retweeted

Jack Morris

@jxmnop

about 1 year ago

excited to finally share on arxiv what we've known for a while now: All Embedding Models Learn The Same Thing

embeddings from different models are SO similar that we can map between them based on structure alone. without *any* paired data

feels like magic, but it's real:🧵

124

592

909K

nlbte retweeted

alphaXiv

@askalphaxiv

about 1 year ago

Introducing Deep Research for arXiv

Ask questions like 'What are the latest breakthroughs in RL fine-tuning?' and get comprehensive lit reviews with trending papers automatically included

Turn hours of literature searches into seconds with AI-powered research context ⚡

541

373K

nlbte retweeted

Kevin Patrick Murphy

@sirbayes

over 1 year ago

I am happy to announce that the first draft of my RL tutorial is now available. https://t.co/SjMdabl0yW

724

321K

nlbte retweeted

Dwarkesh Patel

@dwarkesh_sp

over 1 year ago

The @gwern interview.

0:00:00 – Anonymity
0:01:09 – Automating Steve Jobs
0:04:38 – Isaac Newton's theory of progress
0:06:36 – Grand theory of intelligence
0:10:39 – Seeing scaling early
0:21:04 – AGI Timelines
0:22:54 – What to do in remaining 3 years until AGI
0:26:29 – Influencing the shoggoth with writing
0:30:50 – Human vs artificial intelligence
0:33:52 – Rabbit holes
0:38:48 – Hearing impairment
0:43:00 – Wikipedia editing
0:47:43 – Gwern dot net
0:50:20 – Counterfactual careers
0:54:30 – Borges & literature
1:01:32 – Gwern's process
1:19:17 – Gwern's finances
1:25:05 – Random

100

287

586K

nlbte retweeted

aarya

@gd3kr

over 1 year ago

introducing BLENDERGPT - the fastest way to generate 3D assets and import them seamlessly into Blender. text to 3D in ~20 seconds. blendergpt dot org.

183

511

835K

nlbte retweeted

shyamal

@shyamalanadkat

over 1 year ago

introducing swarm: an experimental framework for building, orchestrating, and deploying multi-agent systems. 🐝 https://t.co/97n4fehmtM

107

522

952K

nlbte retweeted

AI at Meta

@AIatMeta

over 1 year ago

🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date.

Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.

More details and examples of what Movie Gen can do ➡️ https://t.co/M19x2ndwnr

🛠️ Movie Gen models and capabilities

Movie Gen Video: 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.

Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.

Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements — or global changes like background or style changes.

Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.

We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.

527

nlbte retweeted

xjdr

@_xjdr

over 1 year ago

i made a repo, its very naive as i wasn't planning on releasing this when i started. This does not have the new sampler yet, but i will add it once its stable. It has both the jax and pytorch implementations. If y'all want to make it better, submit PRs. https://t.co/1sc6fFWgf1

611

377

79K

nlbte retweeted

Konstantin Mishchenko

@konstmish

over 1 year ago

A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it'd be worth sharing it here too:

1. Adaptive optimization. There has been a lot going on in the last year, below are some papers I personally found interesting.

First of all, this paper by Li and Lan on Nesterov's acceleration of adaptive gradient descent: https://t.co/D6hykeK2tw
Check Corollary 1 for a simple description of their method. There is one thing I don't like about it: the amount by which we can increase the stepsize at each iteration decreases as t grows. That being said, I don't know if this restriction can be lifted, and perhaps it's the best thing we can get.

Yura Malitsky and I also did some work on adaptive gradient descent, making the stepsizes a bit larger, roughly sqrt(2) improvement over our previous result: https://t.co/exhFgbjChk
We still don't know if that's the best we can do or if a tighter analysis can give us better methods.

I should also mention that there is more push in the literature on Polyak stepsize, see for instance these two papers:
https://t.co/8tKRReEpx2 (a stepsize very similar to Polyak)
https://t.co/kZhWGqI1sE (Polyak stepsize with momentum)

2. Adagrad-like methods still can be studied, I believe it's an underexplored direction. I wish there was more papers on studying the importance of coordinate-wise stepsizes. One paper on the topic I really liked is this study of when Adam is more useful than SGD: https://t.co/sF5Abi08h5
There is also some research on new practical methods, for instance, acceleration of DoG is interesting: https://t.co/VMOdfbL95Z
And I also enjoyed reading this paper by Rodomanov et al. on line-search-inspired stochastic methods: https://t.co/85uLHGZErQ

3. I also like the direction of getting better assumptions for optimization theory and studying the implications. A good example is the gradient clipping literature:
https://t.co/NMTXzFJScs ((L₀, L₁)-smoothness)
https://t.co/dG8xIoTFPN (same revisited)
https://t.co/goKclD80WG (on heavy-tailed noise)
We need to bridge optimization assumptions with what we know about neural networks, so read about properties of neural networks themselves like this: https://t.co/6M8l1avBOJ (on scales of layers and how their type affects Lipschitz constants)

4. These days, people are using deep networks of all scales for their tasks, and they have discovered a lot of tricks that haven't been studied thoroughly in optimization literature: quantization, Straight-Through Estimator, (https://t.co/7UK2gsojhm), low-rank techniques such as LoRA, learning-rate warm-up, etc. You should expose yourself to those tricks to get a better understanding of what the current theory is lacking.

If you're considering choosing optimization as the topic for your PhD, here are some extra thoughts. Right now there is less activity than about 5 years ago, most low-hanging fruits seem to have been taken, and the remaining questions seem quite challenging. So if you're looking for a field where it is easy to get publications, it might not be perfect. However, it's still a good field to produce meaningful theory. It's also important who you would work with, i.e. if you can find a good advisor, that often affects one's satisfaction to a larger degree than the topic itself, so make your decision carefully.

As my last word of advice, I definitely encourage testing new methods on neural networks (and preferably not on CIFAR10/CIFAR100, because they give misleading results), at least something like nanoGPT (https://t.co/NTk9KAAqd4). When I was a PhD student, I did a lot of theoretical research testing my methods on logistic regression and that was useful to understand the theory, but I also had the wrong impression about what works and what doesn't because of that. If you can, do both, understand the theory as much as you can, but also learn its limits and failure modes.
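
For readers unfamiliar with the Polyak stepsize mentioned in the thread, here is a minimal sketch of the classical rule, step = (f(x) - f*) / ||∇f(x)||², on a toy quadratic. It is not the method of any of the linked papers; the function names and the test problem are assumptions made for illustration, and the rule as written needs the optimal value f* to be known.

```python
def polyak_gd(f, grad, x0, f_star, steps=500, eps=1e-12):
    """Gradient descent with the classical Polyak stepsize.

    step_k = (f(x_k) - f_star) / ||grad f(x_k)||^2, so f_star must be known.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        gap = f(x) - f_star
        # Stop at a (numerically) stationary point or once the optimum is reached.
        if g_norm_sq < eps or gap <= 0.0:
            break
        step = gap / g_norm_sq
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical toy problem: an ill-conditioned quadratic minimized at the origin, f* = 0.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


x_final = polyak_gd(f, grad, x0=[5.0, 1.0], f_star=0.0)
print(x_final, f(x_final))
```

No stepsize or smoothness constant is tuned by hand here, which is the appeal of the rule; the dependence on f* is the practical obstacle.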

904

138

109K

nlbte retweeted

AI at Meta

@AIatMeta

over 1 year ago

📣 Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!

What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.

Details in the full announcement ➡️ https://t.co/1bnEeLY9qf
Download Llama 3.2 models ➡️ https://t.co/DZoTQvESbG

These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.

With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.