MissIndia @999Topshot - Twitter Profile

[Image: karpathy's tweet photo]

As a fun Saturday vibe code project and following up on this tweet earlier, I hacked up an **llm-council** web app. It looks exactly like ChatGPT except each user query is 1) dispatched to multiple models on your council using OpenRouter, e.g. currently:

"openai/gpt-5.1",
"google/gemini-3-pro-preview",
"anthropic/claude-sonnet-4.5",
"x-ai/grok-4",

Then 2) all models get to see each other's (anonymized) responses and they review and rank them, and then 3) a "Chairman LLM" gets all of that as context and produces the final response.
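The three stages described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: `run_council`, the `ask(model, prompt)` callable, the prompt wording, and the "Response A/B/C" labelling scheme are all assumptions made for the example. In a real setup `ask` would wrap a call to OpenRouter; here it is left as a parameter so the sketch runs without network access. The tweet does not say which model acts as Chairman, so that is a required argument.

```python
import string
from typing import Callable, Dict, List

# Model IDs as listed in the tweet.
COUNCIL: List[str] = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]

# Hypothetical interface: ask(model_id, prompt) -> reply text.
Ask = Callable[[str, str], str]


def run_council(query: str, ask: Ask, chairman: str,
                council: List[str] = COUNCIL) -> Dict[str, object]:
    # Stage 1: dispatch the same query to every council member.
    answers = {model: ask(model, query) for model in council}

    # Anonymize: replace model names with neutral labels before review.
    labels = {model: f"Response {string.ascii_uppercase[i]}"
              for i, model in enumerate(council)}
    anonymized = "\n\n".join(f"{labels[m]}:\n{answers[m]}" for m in council)

    # Stage 2: every member reviews and ranks the anonymized responses.
    review_prompt = (
        f"Question:\n{query}\n\n{anonymized}\n\n"
        "Review these responses and rank them from best to worst."
    )
    reviews = {model: ask(model, review_prompt) for model in council}

    # Stage 3: the chairman sees the answers and all reviews, then writes the final reply.
    all_reviews = "\n\n".join(reviews[m] for m in council)
    chairman_prompt = (
        f"Question:\n{query}\n\n{anonymized}\n\n"
        f"Council reviews:\n{all_reviews}\n\n"
        "Write the final response."
    )
    final = ask(chairman, chairman_prompt)

    return {"answers": answers, "labels": labels,
            "reviews": reviews, "final": final}


# Usage with a stand-in for the real API call (assumption: no network here).
def fake_ask(model: str, prompt: str) -> str:
    return f"stub reply to a {len(prompt)}-character prompt"


result = run_council("What is attention?", fake_ask, chairman=COUNCIL[1])
print(result["final"])
```

With four council members this makes nine calls per query: four answers, four reviews, and one chairman pass.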

It's interesting to see the results from multiple models side by side on the same query, and even more amusingly, to read through their evaluation and ranking of each other's responses.

Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally. For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between. But I'm not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively I find GPT 5.1 a little too wordy and sprawled and Gemini 3 a bit more condensed and processed. Claude is too terse in this domain.
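One simple way to turn those per-model rankings into a single leaderboard is to average each response's rank position across reviewers. The tweet does not say how (or whether) the app aggregates rankings, so `mean_ranks` and the toy data below are purely illustrative assumptions; the labels are anonymized placeholders, not real results.

```python
from collections import defaultdict
from typing import Dict, List


def mean_ranks(rankings: Dict[str, List[str]]) -> Dict[str, float]:
    """Map each label to its average position (1 = best) across reviewers.

    `rankings` maps a reviewer name to labels ordered from best to worst.
    """
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for order in rankings.values():
        for position, label in enumerate(order, start=1):
            totals[label] += position
            counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Made-up rankings for illustration only.
toy = {
    "reviewer_1": ["A", "B", "C"],
    "reviewer_2": ["A", "C", "B"],
    "reviewer_3": ["B", "A", "C"],
}
scores = mean_ranks(toy)
leaderboard = sorted(scores, key=scores.get)  # lowest mean rank first
print(leaderboard)  # ['A', 'B', 'C']
```

Mean rank is only one choice; other aggregation rules (for example pairwise win counts) can produce a different order from the same ballots, which is part of the design space the tweet mentions.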

That said, there's probably a whole design space of the data flow of your LLM council. The construction of LLM ensembles seems under-explored.

I pushed the vibe coded app to https://t.co/EZyOqwXd2k if others would like to play. ty nano banana pro for fun header image for the repo

MissIndia

@999Topshot

7 months ago

@trikcode Data 😂

MissIndia

@999Topshot

7 months ago

@reachvaldo Art

MissIndia

@999Topshot

7 months ago

My referral code is https://t.co/qEUSRZ5duy

MissIndia

@999Topshot

7 months ago

Session Ended! — I scored 1,860 points in Caves - Dash & Game! Join me: https://t.co/HOBsl7n8eQ

999Topshot retweeted

Ahmad

@TheAhmadOsman

7 months ago

> be you
> want to actually learn how LLMs work
> sick of “just start with linear algebra and come back in 5 years”
> decide to build my own roadmap
> no fluff. no detours. no 200-hour generic ML playlists
> just the stuff that actually gets you from “what’s a token?” to “I trained a mini-GPT with LoRA adapters and FlashAttention”
> goal: build, fine-tune, and ship LLMs
> not vibe with them. not "learn the theory" forever
> build them
> you will:
> > build an autograd engine from scratch
> > write a mini-GPT from scratch
> > implement LoRA and fine-tune a model on real data
> > hate CUDA at least once
> > cry
> > keep going
> 5 phases
> if you already know something? skip
> if you're lost? rewatch
> if you’re stuck? use DeepResearch
> this is a roadmap, not a leash
> by the end: you either built the thing or you didn’t
> phase 0: foundations
> > if matrix multiplication is scary, you’re not ready yet
> > watch 3Blue1Brown’s linear algebra series
> > MIT 18.06 with Strang, yes, he’s still the GOAT
> > code Micrograd from scratch (Karpathy)
> > train a mini-MLP on MNIST
> > no frameworks, no shortcuts, no mercy
> phase 1: transformers
> > the name is scary
> > it’s just stacked matrix multiplies and attention blocks
> > Jay Alammar + 3Blue1Brown for the “aha”
> > Stanford CS224N for the theory
> > read "Attention Is All You Need" only AFTER building mental models
> > Karpathy's "Let's Build GPT" will break your brain in a good way
> > project: build a decoder-only GPT from scratch
> > bonus: swap tokenizers, try BPE/SentencePiece
> phase 2: scaling
> > LLMs got good by scaling, not magic
> > Kaplan paper -> Chinchilla paper
> > learn Data, Tensor, Pipeline parallelism
> > spin up multi-GPU jobs using HuggingFace Accelerate
> > run into VRAM issues
> > fix them
> > welcome to real training hell
> phase 3: alignment & fine-tuning
> > RLHF: OpenAI blog -> Ouyang paper
> > SFT -> reward model -> PPO (don’t get lost here)
> > Anthropic's Constitutional AI = smart constraints
> > LoRA/QLoRA: read, implement, inject into HuggingFace models
> > fine-tune on real data
> > project: fine-tune gpt2 or distilbert with your own adapters
> > not toy examples. real use cases or bust
> phase 4: production
> this is the part people skip to, but you earned it
> inference optimization: FlashAttention, quantization, sub-second latency
> read the paper, test with quantized models
> resources:
> math/coding:
> > 3Blue1Brown, MIT 18.06, Goodfellow’s book
> PyTorch:
> > Karpathy, Zero to Mastery
> > transformers:
> > Alammar, Karpathy, CS224N, Vaswani et al
> > scaling:
> > Kaplan, Chinchilla, HuggingFace Accelerate
> > alignment:
> > OpenAI, Anthropic, LoRA, QLoRA
> > inference:
> > FlashAttention
> the endgame:
> > understand how these models actually work
> > see through hype
> > ignore LinkedIn noise
> > build tooling
> > train real stuff
> > ship your own stack
> > look at a paper and think “yeah I get it”
> > build your own AI assistant, infra, whatever
> make it all the way through?
> ship something real?
> DM me.
> I wanna see what you built.
> happy hacking.

MissIndia

@999Topshot

8 months ago

@elonmusk I don’t trust OAI ONLY trust XAI
