Madhu Bagroy

@MBagroy

🤖 Interests: AI , fashion | Care about health ❤️, climate change 🌎, education 📚 & children 🧒 | 📖 Forever curious: #StayHungryStayFoolish 💡✨

London

Joined May 2015

5.4K Following

37 Followers

76 Posts

MBagroy retweeted

Yuchen Jin

@Yuchenj_UW

1 day ago

Before AI, I’d spend a weekend building 1 useless app. Now I can build 67 useless apps over a weekend, each with a logo, a fancy webpage, and 0 users.

385

448

344

213K

MBagroy retweeted

Patrick OShaughnessy

@patrick_oshag

26 days ago

Krishna Rao is the CFO of Anthropic, and this is his first podcast appearance. He joined the company two years ago when run-rate revenue was about $250M. Today it is $30B. He has helped raise ~$75B and is responsible for the procurement and allocation of compute. I feel lucky we get to hear what it is like to sit inside a company this consequential at a moment this pivotal.

We discuss:
- The cone of uncertainty
- How he allocates compute across Trainium, TPUs, and GPUs
- What investors misunderstand about model companies
- Why the returns to frontier intelligence keep rising
- Platform vs application and where Anthropic builds its own products
- How Anthropic uses Claude internally

I have asked my closing question about the kindest thing more than 500 times. Krishna's answer is one I have never heard before. Enjoy!

Timestamps:
0:00 Intro
2:38 The Compute Canvas
6:51 The "Cone of Uncertainty"
11:58 Why the Returns to Frontier Intelligence Are So High
16:45 Recursive Self-Improvement
20:20 Scaling Laws
23:30 Sourcing $100 Billion in Compute
28:05 Platform vs. Application Strategy
32:52 Pricing Dynamics
38:48 How Anthropic’s Finance Team Uses Claude
43:24 Raising Capital & Overcoming Investor Skepticism
52:32 Public Perception, Risks, and Government Regulation
57:25 Mythos Release
1:12:33 What Could Derail the AI Revolution?
1:13:47 Biotech and Healthcare
1:15:31 The Kindest Thing

802

12K

MBagroy retweeted

Mira Murati

@miramurati

28 days ago

Today we're sharing our work on interaction models. A new class of model trained from scratch to handle real-time interaction natively, instead of gluing it onto a turn-based one. https://t.co/MoS5s4cm60

338

937

MBagroy retweeted

Sebastian Raschka

@rasbt

3 months ago

I (finally) put together a new LLM Architecture Gallery that collects the architecture figures all in one place! https://t.co/NO7z6XSRHS

202

733K

MBagroy retweeted

Robert Scoble

@Scobleizer

4 months ago

The @CVPR Report. I've been seeing lots of computer vision papers being passed around here on X, since many AI researchers just learned their papers have been accepted. So I asked @blevlabs to find them all for me. It's not complete because I'm still not pulling down many posts each day, but it is interesting enough to share. Congrats to all the people who have been accepted. These papers give you a little taste of the future. https://t.co/baFRJiI03M

14K

MBagroy retweeted

Google DeepMind

@GoogleDeepMind

6 months ago

To build safer AI, we need to understand how models "think". 🧠 Enter Gemma Scope 2, a new set of tools to interpret Gemma 3: our family of lightweight open models. It can help researchers trace internal reasoning, debug complex behaviors and identify risks → https://t.co/W3UmLx2DlN

819

114

352

219K

MBagroy retweeted

Gmail

@gmail

6 months ago

🥹🥹

325

701

634

451K

MBagroy retweeted

Two Minute Papers

@twominutepapers

6 months ago

My first ever camera appearance interviewing Nobel-Prize winner John Jumper. I was SO nervous! 😅 This was meant to be an exclusive one-off celebration for Episode #1000. Really hope you'll have as good a time as I had there! Full video: https://t.co/Pr5PWHWK4X

MBagroy retweeted

Ogilvy

@Ogilvy

8 months ago

With profound sadness, we say goodbye to Piyush Pandey. The world has lost an advertising giant, India its greatest storyteller, and Ogilvy a piece of its soul. His legacy and spirit will forever inspire us. https://t.co/wAzE0qRXHU #PiyushPandey

162

43K

MBagroy retweeted

alphaXiv

@askalphaxiv

8 months ago

Introducing NotebookLM for arXiv papers 🚀 Transform dense AI research into an engaging conversation With context across thousands of related papers, it captures motivations, draws connections to SOTA, and explains key insights like a professor who's read the entire field

503

216K

MBagroy retweeted

Sundar Pichai

@sundarpichai

8 months ago

An exciting milestone for AI in science: Our C2S-Scale 27B foundation model, built with @Yale and based on Gemma, generated a novel hypothesis about cancer cellular behavior, which scientists experimentally validated in living cells. With more preclinical and clinical tests, this discovery may reveal a promising new pathway for developing therapies to fight cancer.

539

22K

MBagroy retweeted

Thinking Machines

@thinkymachines

9 months ago

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. https://t.co/lrJioBmpbT

230

MBagroy retweeted

Paul Graham

@paulg

9 months ago

Wow, I didn't realize the difference between cars and motorcycles was this big. Motorcycles are 29x more dangerous.

17K

MBagroy retweeted

Wall Street Apes

@WallStreetApes

about 1 year ago

I watched this like 3 times because it is so mind blowing “This is a microprocessor, how did humans make this. How as human beings did we even come close to making something like this” He takes a microscope and keeps zooming:

16K

MBagroy retweeted

NIK HUNO 🦉

@NikHuno

about 1 year ago

I’m Russian. My girlfriend is Italian. We’ve been together 2+ years… And her culture still blows my mind. 11 bizarre things about Italian life I just can’t comprehend: 🧵🤌

226

703

MBagroy retweeted

Demis Hassabis

@demishassabis

about 1 year ago

Very excited to share the best coding model we’ve ever built! Today we’re launching Gemini 2.5 Pro Preview 'I/O edition' with massively improved coding capabilities. Ranks no.1 on LMArena in Coding and no.1 on the WebDev Arena Leaderboard. It’s especially good at building interactive web apps - this demo shows how it can be helpful for prototyping ideas. Try it in @GeminiApp, Vertex AI, and AI Studio https://t.co/7FbP3R1cmF Enjoy the pre-I/O goodies!

194

694

995K

MBagroy retweeted

Wayne Yap

@wayneyap

about 1 year ago

I'm Singaporean. Everyone credits Lee Kuan Yew for Singapore’s success. But in every great nation, there's also a real genius behind the scenes person - The "COO of the nation". Here's the story of the greatest right-hand man (and lessons you can learn): 🧵

304

431K

MBagroy retweeted

Cerebras

@cerebras

about 1 year ago

Cerebras and @Meta Collaborate to Drive Fast Inference for Developers in New Llama API 🦙The world’s most popular open-source models — now with the world’s fastest inference. 🔑 Native to @AIatMeta Llama API with 1-click API key generation. 🗣️ Unlock next-generation applications like real-time voice assistants, instant agents, sub-second reasoning

146

68K

Madhu Bagroy

@MBagroy

about 1 year ago

@karpathy There is something for everyone—a great leveller. A kid learning math ✏️, a grandpa writing poems ✍️, a homemaker planning meals 🥘, a CEO brainstorming 💼, a lawyer drafting ⚖️, a driver navigating 🚚, the blind using voice 🧑‍🦯, the deaf chatting 🧏, the lonely finding company ❤️

205

MBagroy retweeted

Andrej Karpathy

@karpathy

over 1 year ago

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check.

Thinking
✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question:

"Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please."

Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but all of DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude do not.
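
For reference, the tile count N in the prompt above follows directly from the ring count. A minimal sketch, assuming the center hex counts as ring 1 (so the Catan board, with "radius" 3, has 19 tiles); the function name `hex_tile_count` is hypothetical and not part of the original prompt:

```python
def hex_tile_count(rings: int) -> int:
    """Number of hexes in a hexagonal board with the given number of rings.

    Assumption: the center hex is ring 1, and ring k (k >= 2) adds 6 * (k - 1) hexes.
    """
    if rings < 1:
        raise ValueError("rings must be at least 1")
    # 1 center hex + 6 * (1 + 2 + ... + (rings - 1)) surrounding hexes
    return 1 + 3 * rings * (rings - 1)


# Tile counts for the first few slider positions
counts = [hex_tile_count(r) for r in range(1, 5)]
print(counts)  # [1, 7, 19, 37]
```

With this convention, a slider value of 3 gives the 19 hexes of a standard Catan board, numbered 1..19.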

❌ It did not solve my "Emoji mystery" question where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1 which once partially decoded the message.

❓ It solved a few tic tac toe boards I gave it with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic tac toe boards, which it failed on (generating nonsense boards / text), but then so did o1 pro.

✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, all worked great. Then I asked it to estimate the number of training flops it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example is 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/tok), at ~10 epochs ~= 100B token training run, at 1.5B params and with 2+4=6 flops/param/token, this is 100e9 X 1.5e9 X 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1 pro (GPT thinking model) fails.
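
The estimate above can be reproduced step by step. This is only a sketch of the arithmetic using the figures quoted in the tweet; the variable names are illustrative, and reading the 2+4 split as forward plus backward pass is an assumption, since the tweet does not spell it out:

```python
# Back-of-the-envelope GPT-2 training compute, using the tweet's rough assumptions.
dataset_bytes = 40e9           # ~40 GB of text, assuming ASCII (1 byte per character)
bytes_per_token = 4            # rough assumption from the tweet
epochs = 10                    # rough assumption from the tweet
params = 1.5e9                 # parameter count quoted in the tweet
flops_per_param_per_token = 6  # the tweet's 2 + 4 (assumed: forward + backward)

tokens_per_epoch = dataset_bytes / bytes_per_token  # ~10B tokens
training_tokens = tokens_per_epoch * epochs         # ~100B tokens
total_flops = training_tokens * params * flops_per_param_per_token

print(f"{total_flops:.1e}")  # 9.0e+20
```

The exact product is 9e20, which the tweet rounds to roughly 1e21 FLOPs.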

I like that the model *will* attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage and who knows, maybe one day...

The impression overall I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.

DeepSearch
Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). Can produce high quality responses to various researchy / lookupy questions you could imagine have answers in articles on the internet, e.g. a few I tried, which I stole from my recent search history on Perplexity, along with how it went:

- ✅ "What's up with the upcoming Apple Launch? Any rumors?"
- ✅ "Why is Palantir stock surging recently?"
- ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
- ✅ "What toothpaste does Bryan Johnson use?"
- ❌ "Singles Inferno Season 4 cast where are they now?"
- ❌ "What speech to text program has Simon Willison mentioned he's using?"

❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it said factual things that I think are incorrect and it didn't provide a citation for it (it probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs and their amount of total funding and estimate of employee count, it listed 12 major labs but not itself (xAI).

The impression I get of DeepSearch is that it's approximately around Perplexity's DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI as a "major LLM lab" when I tried with it...).

Random LLM "gotcha"s

I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.

✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
✅ Grok 3 told me 9.11 > 9.9. (common with other LLMs too), but again, turning on Thinking solves it.
✅ A few simple puzzles worked ok even without thinking, e.g. *"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"*. E.g. GPT4o says 2 (incorrectly).
❌ Sadly the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse, famously, e.g. 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. give me a standup), I'm not sure that it is state of the art humor. Example generated joke: "*Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!*". In quick testing, thinking did not help, possibly it made it a bit worse.
❌ Model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. generated a 1 page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
❌ Simon Willison's "*Generate an SVG of a pelican riding a bicycle*". It stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because the LLMs can't "see" like people do, so it's arranging things in the dark, in text. Marking as fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but imo I suspect they specifically targeted SVG capability during training.
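
The counting, comparison, and sibling gotchas in the list above have answers that are easy to verify mechanically. A small sketch of the ground truth; the helper name `count_letter` and the variable names are illustrative only:

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a word."""
    return word.lower().count(letter.lower())


strawberry_r = count_letter("strawberry", "r")  # letter-count gotcha
lolla_l = count_letter("LOLLAPALOOZA", "L")     # the model answered 3 here
nine_eleven_bigger = Decimal("9.11") > Decimal("9.9")

# Sally puzzle: each brother has 2 sisters in total, and Sally is one of them,
# so Sally herself has one sister.
sisters_per_brother = 2
sally_sisters = sisters_per_brother - 1

print(strawberry_r, lolla_l, nine_eleven_bigger, sally_sisters)  # 3 4 False 1
```

So "strawberry" has 3 "r", LOLLAPALOOZA has 4 "L" rather than 3, 9.11 is smaller than 9.9, and Sally has 1 sister rather than 2.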

Summary. As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.

664

17K
