Bipin Krishnan

@bkrish_

ML @ Desia

London, England

Joined April 2020

211 Following

268 Followers

1.8K Posts

Pinned Tweet

Bipin Krishnan @bkrish_

almost 4 years ago

A book full of recipes to solve common NLP & computer vision tasks: https://t.co/5ltyC0vHqX Discusses the following libraries: - @huggingface transformers, datasets, accelerate, @Gradio - @PyTorchLightnin - @wandb - pytorch image models(timm) - albumentations

249

146

bkrish_ retweeted

elie

@eliebakouch

6 days ago

microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale. this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab. the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale let's look at all of this in this likely very long thread 🧵

eliebakouch's tweet photo. microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale.

this model uses zero synthetic data or distillation from previous models. this means reasoning, agentic behavior, tool use are all learned fully during post-training with no cold start. bold choice that makes it harder and requires more iterations to reach sota, but you get FULL control over your model series and it proves they are serious about being a frontier lab.

the tech report is insanely detailed and precise about numbers. to give an example, they give the exact MFU across all the iterations of the model, with the exact changes etc. they also share the full scaling ladder recipe, to my knowledge this is the first time i've seen this in a tech report at this scale

let's look at all of this in this likely very long thread 🧵

264

278K

bkrish_ retweeted

hardmaru

@hardmaru

12 days ago

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://t.co/PK5h0mqQSo), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

154

642

739K

bkrish_ retweeted

Jeff Dean

@JeffDean

over 1 year ago

Welcome, AlphaChip! Today, we are sharing some exciting updates on our work published in @Nature in 2021 on using reinforcement learning for ASIC chip floorplanning and layout. We’re also naming this work AlphaChip. Since we first published this work, our use of this approach internally has grown significantly. It has now been used for multiple generations of TPU chips (TPU v5e, TPU v5p, and Trillium), with AlphaChip placing an increasing number of blocks and with larger wirelength reductions vs. human experts from generation to generation: AlphaChip has also been used with excellent results for other chips across Alphabet, including Google’s Axion chip, an Arm-based general-purpose data center CPU. In 2022, as a companion to the Nature paper, we open-sourced the code for the AlphaChip algorithms described in the Nature paper (see link below). Since then, external researchers could use this repository to pre-train on a variety of chip blocks and then apply the pre-trained model to new blocks, as was done and described in our original paper. Today we’re also releasing a pre-trained AlphaChip checkpoint for the open source release that makes it easier for external users to get started using AlphaChip for their own chip designs. Original Nature paper w/ wonderful joint first authors @Azaliamirh + @annadgoldie, and @mnyazgan, @joesmemory, @ESonghori, @ShenWangURC, @xylophi, @efjohnson, @pathomkar, @Azade_na, @PakJiwoo, Andy Tong, @kavyasrinivas23, @willhang_, @emretuncer, @quocleix, @JamesLaudon, @rh00, Roger Carpenter, and myself): https://t.co/QmJA56ZKOE (PDF: https://t.co/HP7y1LhAh4) Today’s Addendum to the paper published in Nature: https://t.co/BuGacrq57J (same authors) AlphaChip blog post: https://t.co/oLBq1J8oXj Open source release: https://t.co/cW1YMSHI57 Pre-trained checkpoint: https://t.co/iXtLqEjsH3 Three things we have observed in the external community are described in the Nature Addendum: (1) not doing any pre-training (circumventing the learning aspects of our method by removing its ability to learn from prior experience) (2) not training to convergence (standard practice in ML methods), and (3) using fewer computational resources than described in our Nature paper (using fewer resources is likely to harm performance, or require running for considerably longer to achieve the same performance). Pre-training the model for it to learn the craft of chip layout and to be able to generalize to new designs is an important part of our method. The pre-training process requires some effort to perform, since one has to find representative blocks and then run a lengthy computational process to pre-train the model to be good at placing those blocks. To avoid external users having to perform this process and make it easier for the external community to use AlphaChip, today we are releasing an AlphaChip model checkpoint pre-trained on 20 TPU blocks. This will enable users to get good zero-shot performance and faster convergence for novel blocks right out of the box. (For best results, however, we continue to recommend that developers pre-train on their own in-distribution blocks, and we provide a tutorial on how to perform pre-training with our open-source repository: see the Addendum). Many organizations have used AlphaChip as a building block for their own chip design efforts. For example, MediaTek, one of the top chip design companies in the world, extended AlphaChip to accelerate development of their most advanced chips (e.g. the Dimensity Flagship 5G used in Samsung mobile phones), while improving power, performance and chip area. We’re very excited about the increasing impact of AlphaChip internally and externally, and we look forward to continued work in this space to make custom higher performance, more efficient, and more capable chips dramatically easier to design and build.

JeffDean's tweet photo. Welcome, AlphaChip!

Today, we are sharing some exciting updates on our work published in @Nature in 2021 on using reinforcement learning for ASIC chip floorplanning and layout. We’re also naming this work AlphaChip.

Since we first published this work, our use of this approach internally has grown significantly. It has now been used for multiple generations of TPU chips (TPU v5e, TPU v5p, and Trillium), with AlphaChip placing an increasing number of blocks and with larger wirelength reductions vs. human experts from generation to generation:

AlphaChip has also been used with excellent results for other chips across Alphabet, including Google’s Axion chip, an Arm-based general-purpose data center CPU.

In 2022, as a companion to the Nature paper, we open-sourced the code for the AlphaChip algorithms described in the Nature paper (see link below). Since then, external researchers could use this repository to pre-train on a variety of chip blocks and then apply the pre-trained model to new blocks, as was done and described in our original paper.

Today we’re also releasing a pre-trained AlphaChip checkpoint for the open source release that makes it easier for external users to get started using AlphaChip for their own chip designs.

Original Nature paper w/ wonderful joint first authors @Azaliamirh + @annadgoldie, and @mnyazgan, @joesmemory, @ESonghori, @ShenWangURC, @xylophi, @efjohnson, @pathomkar, @Azade_na, @PakJiwoo, Andy Tong, @kavyasrinivas23, @willhang_, @emretuncer, @quocleix, @JamesLaudon, @rh00, Roger Carpenter, and myself):
https://t.co/QmJA56ZKOE (PDF: https://t.co/HP7y1LhAh4)

Today’s Addendum to the paper published in Nature: https://t.co/BuGacrq57J (same authors)

AlphaChip blog post: https://t.co/oLBq1J8oXj

Open source release: https://t.co/cW1YMSHI57

Pre-trained checkpoint: https://t.co/iXtLqEjsH3

Three things we have observed in the external community are described in the Nature Addendum: (1) not doing any pre-training (circumventing the learning aspects of our method by removing its ability to learn from prior experience) (2) not training to convergence (standard practice in ML methods), and (3) using fewer computational resources than described in our Nature paper (using fewer resources is likely to harm performance, or require running for considerably longer to achieve the same performance).

Pre-training the model for it to learn the craft of chip layout and to be able to generalize to new designs is an important part of our method. The pre-training process requires some effort to perform, since one has to find representative blocks and then run a lengthy computational process to pre-train the model to be good at placing those blocks. To avoid external users having to perform this process and make it easier for the external community to use AlphaChip, today we are releasing an AlphaChip model checkpoint pre-trained on 20 TPU blocks. This will enable users to get good zero-shot performance and faster convergence for novel blocks right out of the box. (For best results, however, we continue to recommend that developers pre-train on their own in-distribution blocks, and we provide a tutorial on how to perform pre-training with our open-source repository: see the Addendum).

Many organizations have used AlphaChip as a building block for their own chip design efforts. For example, MediaTek, one of the top chip design companies in the world, extended AlphaChip to accelerate development of their most advanced chips (e.g. the Dimensity Flagship 5G used in Samsung mobile phones), while improving power, performance and chip area.

We’re very excited about the increasing impact of AlphaChip internally and externally, and we look forward to continued work in this space to make custom higher performance, more efficient, and more capable chips dramatically easier to design and build.

329

616

474K

Who to follow

Nat McAleese

@__nmca__

Research @AnthropicAI. Previously @OpenAI, @DeepMind. Views my own.

Rob

@Rob_Mulla

4x Kaggle GM | ML Fameworks @Google - Views are my own.

Gilberto Titericz Jr

@Giba1

💥@Kaggle 13x Grandmaster. Former #1 in Competitions. 😍 DataScience and MachineLearning. ❤️Father, Movies, shows and games. https://t.co/sGLUQwS93v

bkrish_ retweeted

luthira

@luthiraabeykoon

about 1 month ago

We implemented @karpathy 's MicroGPT fully on FPGA fabric. No GPU. No PyTorch. No CPU inference loop. Just a transformer burned into hardware, generating 50,000+ tokens/sec. The model is small, but the idea is not: inference does not have to live only in software 👇

266

697

850K

bkrish_ retweeted

Adam Mainz

@MainzOnX

about 2 months ago

https://t.co/gppuN3xKjn

324

494

73K

bkrish_ retweeted

Jia-Bin Huang

@jbhuang0604

about 2 months ago

Modern Transformer architecture explained I compiled a list of videos on the Transformer architecture into a short "YouTube course". Hopefully, this would be helpful for beginners in the community. 🧵

jbhuang0604's tweet photo. Modern Transformer architecture explained

I compiled a list of videos on the Transformer architecture into a short "YouTube course".

Hopefully, this would be helpful for beginners in the community. 🧵 https://t.co/V1p2esqbtu

138

55K

bkrish_ retweeted

POM

@peterom

3 months ago

Deepseek got called out for scraping 150k Claude messages. So I'm releasing 155k of my personal Claude Code messages with Opus 4.5. I'm also open sourcing tooling to help you fetch your data, redact sensitive info & make it discoverable on HF - link below to liberate your data!

peterom's tweet photo. Deepseek got called out for scraping 150k Claude messages. So I'm releasing 155k of my personal Claude Code messages with Opus 4.5.

I'm also open sourcing tooling to help you fetch your data, redact sensitive info & make it discoverable on HF - link below to liberate your data! https://t.co/9yLmcIWTfg

628

15K

24M

bkrish_ retweeted

Boris Cherny

@bcherny

4 months ago

I'm Boris and I created Claude Code. I wanted to quickly share a few tips for using Claude Code, sourced directly from the Claude Code team. The way the team uses Claude is different than how I use it. Remember: there is no one right way to use Claude Code -- everyones' setup is different. You should experiment to see what works for you!

923

51K

104K

bkrish_ retweeted

Mark Saroufim

@marksaroufim

4 months ago

GPU MODE 2026: we’re post-training Kernel LLMs in public and are building all the infra we need to make GPU programming more accessible to all. We're doing this in close collaboration with some of my favorite communities @PrimeIntellect @modal and @LambdaAPI 2025 recap: 26K YouTube subs, 92 lectures, 24K Discord, 3x $100K+ kernel comps, 400K KernelBot submissions,3 events (NVIDIA / Jane Street / Accel) and 10 active working groups! So for 2026 our concrete goal is to post-train a Kernel LLM and get kernels merged into real repos (PyTorch / vLLM). We plan to share our first results by GTC (San Jose, March 2026). second by ICML (Seoul, July 2026). work-stream 1: de-slopify LLM kernels (with PyTorch / vLLM / NVIDIA). most generated kernels are verbose, fragile, and non-deterministic. the bar is “maintainer can review + merge” work-stream 2: post-training Kernel LLM (with Prime Intellect / Modal / Lambda / MIT). We’re betting on two levers: profiler-guided optimization + memory work-stream 3: competitions as evals. We want more end-to-end system optimizations, they're trickier to design good problems for but the results will be more interesting work-stream 4: “from scratch” repos. Think: 80% of the performance with 10% of the code (teenygrad, penny ) starting with a minimal RL library optimized for B200 2026 is the year we turn Kernel LLMs from a meme into one of the most reliable ways of improving the performance of AI systems. So if you're interested please join our weekly meetings!

436

199

49K

bkrish_ retweeted

Jack Morris

@jxmnop

10 months ago

curious about the training data of OpenAI's new gpt-oss models? i was too. so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre time for a deep dive 🧵

jxmnop's tweet photo. curious about the training data of OpenAI's new gpt-oss models? i was too.

so i generated 10M examples from gpt-oss-20b, ran some analysis, and the results were... pretty bizarre

time for a deep dive 🧵 https://t.co/t5pNnsSh8V

121

495

bkrish_ retweeted

Philipp Schmid

@_philschmid

11 months ago

Here are 5 practical tips for Context Engineering, which apply to @GoogleDeepMind Gemini 2.5 as well from @manusai! 1. Context Ordering Matters: Try to use "append-only" context, adding new information to the end. This maximizes cache hits reducing cost (4x) and latency. 2. Manage Tools Statically: Avoid changing tool order or availability mid-task, if not explicitly needed. This might break context caching and can/will confuse models if used tools in the history are no longer defined. 3. Use External Memory: Write explicitly or implicitly context/goals to external storage to. Preventing information loss. A typical task in Manus requires around 50 tool calls on average. 4. Recite Goals to not get lost: Prevent the model from "getting lost" by having it periodically restate its objectives. This keeps the primary goal in its recent attention span. 5. Embrace Errors: Keep error messages in the context. This allows the model to learn from its mistakes and avoid repeating them.

_philschmid's tweet photo. Here are 5 practical tips for Context Engineering, which apply to @GoogleDeepMind Gemini 2.5 as well from @manusai!

1. Context Ordering Matters: Try to use "append-only" context, adding new information to the end. This maximizes cache hits reducing cost (4x) and latency.

2. Manage Tools Statically: Avoid changing tool order or availability mid-task, if not explicitly needed. This might break context caching and can/will confuse models if used tools in the history are no longer defined.

3. Use External Memory: Write explicitly or implicitly context/goals to external storage to. Preventing information loss. A typical task in Manus requires around 50 tool calls on average.

4. Recite Goals to not get lost: Prevent the model from "getting lost" by having it periodically restate its objectives. This keeps the primary goal in its recent attention span.

5. Embrace Errors: Keep error messages in the context. This allows the model to learn from its mistakes and avoid repeating them.

521

581

41K

bkrish_ retweeted

Lucas Beyer (bl16)

@giffmana

11 months ago

Oh wow, did you guys know that torch.compile can compile numpy code? And even run it on GPU? This is pretty neat for all kinds of "surrounding" code besides the model (like evals and fancy metrics) that I used to do with numba/numexpr (cuz CPU-XLA was pretty meh). Poll below

giffmana's tweet photo. Oh wow, did you guys know that torch.compile can compile numpy code? And even run it on GPU?

This is pretty neat for all kinds of "surrounding" code besides the model (like evals and fancy metrics) that I used to do with numba/numexpr (cuz CPU-XLA was pretty meh).

Poll below https://t.co/B3TaqRRsWN

936

457

100K

bkrish_ retweeted

Jeremy Howard

@jeremyphoward

about 1 year ago

When I was optimising ULMFiT, I came up with a trick where I ran lots of ablations and fed all the hyperparams and results to a random forest. That told me which were most important. I told @l2k about it, and @wandb added it to their product! :D https://t.co/iCt92Bfc7f

jeremyphoward's tweet photo. When I was optimising ULMFiT, I came up with a trick where I ran lots of ablations and fed all the hyperparams and results to a random forest. That told me which were most important.

I told @l2k about it, and @wandb added it to their product! :D
https://t.co/iCt92Bfc7f https://t.co/l4zOnlXplF

244

137

18K

bkrish_ retweeted

Percy Liang

@percyliang

about 1 year ago

What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision:

percyliang's tweet photo. What would truly open-source AI look like? Not just open weights, open code/data, but *open development*, where the entire research and development process is public *and* anyone can contribute. We built Marin, an open lab, to fulfill this vision: https://t.co/racsvmhyA3

225

486

207K

bkrish_ retweeted

Guido van Rossum

@gvanrossum

about 1 year ago

The official trailer for the official Python documentary is out!!! ❤️to Ida.

524

948

177K

bkrish_ retweeted

Cihang Xie

@cihangxie

about 1 year ago

Still relying on OpenAI’s CLIP — a model released 4 years ago with limited architecture configurations — for your Multimodal LLMs? 🚧 We’re excited to announce OpenVision: a fully open, cost-effective family of advanced vision encoders that match or surpass OpenAI’s CLIP and Google’s SigLIP on multiple multimodal benchmarks! 🏆 Plus, we’re releasing 25+ model configurations (sizes from 5.9M to 632M parameters, various input resolutions & patch sizes) to fit every need. 👇 Thread:

cihangxie's tweet photo. Still relying on OpenAI’s CLIP — a model released 4 years ago with limited architecture configurations — for your Multimodal LLMs? 🚧

We’re excited to announce OpenVision: a fully open, cost-effective family of advanced vision encoders that match or surpass OpenAI’s CLIP and Google’s SigLIP on multiple multimodal benchmarks! 🏆

Plus, we’re releasing 25+ model configurations (sizes from 5.9M to 632M parameters, various input resolutions & patch sizes) to fit every need.

👇 Thread:

198

207K

bkrish_ retweeted

Mistral AI

@MistralAI

about 1 year ago

Introducing Mistral AI Classifier Factory! 🚀 We've created a user-friendly and simple way to build your own classifiers. Utilize our small yet highly efficient models and training methods to develop custom classifiers for moderation, intent detection, sentiment analysis, data clustering, fraud detection, spam filtering, recommendation systems, and more.Check out our docs and cookbooks for details. Check out our docs and cookbooks for details 🧵:

395

411K

bkrish_ retweeted

Jeremy Howard

@jeremyphoward

about 1 year ago

Announcing fasttransform: a Python lib that makes data transformations reversible/extensible. No more writing inverse functions to see what your model sees. Debug pipelines by actually looking at your data. Built on multi-dispatch. Work w/ @R_Dimm https://t.co/OGDrBFhnfP

839

135

496

44K

bkrish_ retweeted

Daniel Han

@danielhanchen

over 1 year ago

We made 5 challenges and if you score 47 points we'll offer you $500K/year + equity to join us at 🦥@UnslothAI! No experience or PhD needed. $400K - $500K/yr: Founding Engineer (47 points) $250K - $300K/yr: ML Engineer (32 points) Challenges: 1. Convert nf4 / BnB 4bit to Triton 2. Make FSDP2 work with QLoRA 3. Remove graph breaks in torch.compile 4. Help solve Unsloth issues! 5. Memory Efficient Backprop If you have any questions about the challenges, please feel free to ask! We're looking for people to help push Unsloth forward - so come join us to democratize AI further! Our past work includes: 1. 1.58bit DeepSeek R1 GGUFs: https://t.co/gALGkUg5Cg 2. GRPO with Llama 3.1 8B in a Colab: https://t.co/LFdkNxwAYg 3. Gemma bug fixes: https://t.co/7kX94PyKQR 4. Gradient accumulation bug fixes: https://t.co/Tq4c5Qwqyw Details & submission guide: https://t.co/iXxRUTijWV

danielhanchen's tweet photo. We made 5 challenges and if you score 47 points we'll offer you $500K/year + equity to join us at 🦥@UnslothAI!

No experience or PhD needed.

$400K - $500K/yr: Founding Engineer (47 points)
$250K - $300K/yr: ML Engineer (32 points)

Challenges:
1. Convert nf4 / BnB 4bit to Triton
2. Make FSDP2 work with QLoRA
3. Remove graph breaks in torch.compile
4. Help solve Unsloth issues!
5. Memory Efficient Backprop

If you have any questions about the challenges, please feel free to ask! We're looking for people to help push Unsloth forward - so come join us to democratize AI further!

Our past work includes:
1. 1.58bit DeepSeek R1 GGUFs: https://t.co/gALGkUg5Cg
2. GRPO with Llama 3.1 8B in a Colab: https://t.co/LFdkNxwAYg
3. Gemma bug fixes: https://t.co/7kX94PyKQR
4. Gradient accumulation bug fixes: https://t.co/Tq4c5Qwqyw

Details & submission guide: https://t.co/iXxRUTijWV

183

780

bkrish_ retweeted

Thomas Wolf

@Thom_Wolf

over 1 year ago

After 6+ months in the making and burning over a year of GPU compute time, we're super excited to finally release the "Ultra-Scale Playbook" Check it out here: https://t.co/dekxY4BQZO A free, open-source, book to learn everything about 5D parallelism, ZeRO, fast CUDA kernels, how and why overlap compute & communication – all scaling bottlenecks and tools introduced with motivation, theory, interactive plots from our 4000+ scaling experiments and even NotebookLM podcasters to tag along with you. - How was DeepSeek trained for $5M only? - Why did Mistral trained an MoE? - Why is PyTorch native Data Parallelism implementation so complex under the hood? - What are all the parallelism techniques and why were they invented? - Should I use ZeRO-3 or Pipeline Parallelism when scaling and what's the story behind both techniques? - What is this Context Parallelism that Meta used to train Llama 3? Is it different from Sequence Parallelism? - What is FP8? how does it compares to BF16? In this book, our goal was to gather, in a single place, a coherent, easy to read yet detailed story of all the techniques that make today's LLM scaling possible. The largest factor for democratizing AI will always be teaching everyone how to build AI and in particular how to create, train and fine-tune high performance models. In other word making accessible to everybody the techniques that power all recent large language models and efficient training is possibly one of the most essential of them. What started as a simple blog-post ended up becoming an interactive writing piece containing 30k+ words. So we've decided to actually print it as a real 100-pages physical book as well: the physical ultrafast playbook –containing all the science of distributed and fast AI training. We plan to send free copies as gifts to the first readers of the online version so feel free to add your email in the form linked in the blog post.

Thom_Wolf's tweet photo. After 6+ months in the making and burning over a year of GPU compute time, we're super excited to finally release the "Ultra-Scale Playbook"

Check it out here: https://t.co/dekxY4BQZO

A free, open-source, book to learn everything about 5D parallelism, ZeRO, fast CUDA kernels, how and why overlap compute & communication – all scaling bottlenecks and tools introduced with motivation, theory, interactive plots from our 4000+ scaling experiments and even NotebookLM podcasters to tag along with you.

- How was DeepSeek trained for $5M only?
- Why did Mistral trained an MoE?
- Why is PyTorch native Data Parallelism implementation so complex under the hood?
- What are all the parallelism techniques and why were they invented?
- Should I use ZeRO-3 or Pipeline Parallelism when scaling and what's the story behind both techniques?
- What is this Context Parallelism that Meta used to train Llama 3? Is it different from Sequence Parallelism?
- What is FP8? how does it compares to BF16?

In this book, our goal was to gather, in a single place, a coherent, easy to read yet detailed story of all the techniques that make today's LLM scaling possible.

The largest factor for democratizing AI will always be teaching everyone how to build AI and in particular how to create, train and fine-tune high performance models. In other word making accessible to everybody the techniques that power all recent large language models and efficient training is possibly one of the most essential of them.

What started as a simple blog-post ended up becoming an interactive writing piece containing 30k+ words. So we've decided to actually print it as a real 100-pages physical book as well: the physical ultrafast playbook –containing all the science of distributed and fast AI training.

We plan to send free copies as gifts to the first readers of the online version so feel free to add your email in the form linked in the blog post.

109

688

369K

Bipin Krishnan

@bkrish_

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users