Math • Physics • CS • AI • Robotics. Formerly Mathology Overflow. Bridging the gap between the blackboard and the motherboard. 🧠⚙️
Founder: @mythkernel
What's wrong with this post, @elonmusk @nikitabier @X @XCorpIndia, that you have suspended an account with a premium subscription?
Can anyone please tell me what I did wrong here?
Was it creating a full-blown thread of 25 posts with 24 painstakingly created AI images?
Wanting people to genuinely learn about LLMs from scratch, for FREE, in 30 weeks?
Or commenting about this upcoming work of mine on related "LLM education" posts to raise awareness?
Is it possible to overturn this suspension with a warning instead?
I'm excited to share that I'll be joining OpenAI and look forward to working with the exceptional team there.
It was a difficult decision to move on. I'm incredibly proud of the amazing team at Google and everything we've built together. It has been an honor and a pleasure to work with all of you.
@AnthropicAI I'm outside the US, in India. I bought the @claudeai Max 20x subscription on 11 June because of Fable 5. I paid $236 ($200 + 18%) just because of Fable 5. And now you are rug-pulling your foreign customers, @DarioAmodei.
What's the procedure to get a refund?
My two cents on how India should proceed to build sovereign AI. @narendramodi @PMOIndia
P.S. I have been working on and building custom language models since 2016, in the pre-transformer era. Feel free to correct me. Always happy to learn something new.
The US export controls blocking non-US access to @AnthropicAI's latest frontier models (Fable 5 / Mythos 5) mark a structural shift: advanced AI is now explicitly strategic infrastructure. This accelerates the need for sovereign capability.
Building on calls for an ambitious India AI Mission, here is a rigorous, from-scratch analysis of what it realistically costs to develop production-ready foundational models.
I treat @deepseek_ai's public figures (e.g., V3's ~2.788M H800 GPU-hours / ~$5.6M reference) with healthy skepticism. These almost certainly reflect only the final successful training run, not total R&D (experiments, ablations, failed runs, data pipelines, talent, or infrastructure CapEx). Independent scaling estimates and industry benchmarks support significantly higher full-project costs, even with genuine architectural efficiencies.
Two Target Classes:
DeepSeek-V4-Pro class (efficient MoE path): 1.6T total parameters / ~49B active per token, native 1M context, hybrid attention (CSA + HCA), mHC stability, Muon optimizer, >32T tokens. Strong reasoning/agentic performance at lower compute intensity.
GPT-5.5-Pro class (higher-end / denser or larger-scale path): Significantly higher effective compute (dense-like or very large MoE), targeting maximum capability through greater scale.
@deepseek_ai @OpenAI
What follows is the probable technical detail and capital allocation at every stage, with conservative-to-realistic ranges based on FLOPs scaling, hardware specs (H100/H800-class ~400–700 TFLOPS sustained effective), realistic MFU (35–55%), and MoE communication overhead.
Rough FLOPs estimate (6 × active params × tokens for core training compute):
V3 reference (~37B active, 14.8T tokens): ~3.29 × 10²⁴ FLOPs.
V4 scaling (~49B active, ~32–33T tokens): ~2.9× multiplier → ~9.7 × 10²⁴ FLOPs.
Theoretical GPU-hours (at ~500–600 TFLOPS effective sustained) for V4 final pre-training: ~4–9 million GPU-hours equivalent.
At $2–6/GPU-hour effective (rental/amortized + power): $10–60M for the final pre-training run only.
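The arithmetic above can be reproduced with a short sketch. It uses only the figures quoted in the post (active parameters, token counts, sustained TFLOPS, $/GPU-hour); the function names are illustrative, not from any library. At exactly 500–600 TFLOPS sustained, the result lands near the lower end of the quoted 4–9 million GPU-hour range; the upper end corresponds to lower sustained throughput.

```python
def training_flops(active_params, tokens):
    """Core training compute from the 6 * N * D rule of thumb used in the post."""
    return 6 * active_params * tokens


def gpu_hours(flops, sustained_tflops):
    """GPU-hours needed at a given sustained per-GPU throughput (in TFLOPS)."""
    return flops / (sustained_tflops * 1e12 * 3600)


# Figures quoted in the post.
v3_flops = training_flops(37e9, 14.8e12)      # V3 reference
v4_flops_low = training_flops(49e9, 32e12)    # V4-class, 32T tokens
v4_flops_high = training_flops(49e9, 33e12)   # V4-class, 33T tokens

# Best case: fewer tokens at 600 TFLOPS; worst case: more tokens at 500 TFLOPS.
hours_best = gpu_hours(v4_flops_low, 600)
hours_worst = gpu_hours(v4_flops_high, 500)

# Cost of the post's 4-9M GPU-hour range at $2-6 per GPU-hour.
cost_low_usd = 4e6 * 2
cost_high_usd = 9e6 * 6

print(f"V3 FLOPs: {v3_flops:.3e}")
print(f"V4 FLOPs: {v4_flops_low:.3e} to {v4_flops_high:.3e}")
print(f"V4/V3 multiplier: {v4_flops_low / v3_flops:.2f}x to {v4_flops_high / v3_flops:.2f}x")
print(f"GPU-hours at 500-600 TFLOPS: {hours_best / 1e6:.2f}M to {hours_worst / 1e6:.2f}M")
print(f"Final-run cost for 4-9M GPU-hours: ${cost_low_usd / 1e6:.0f}M to ${cost_high_usd / 1e6:.0f}M")
```

The last line gives roughly $8–54M, in line with the post's rounded $10–60M for the final run alone.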
Key adjustments:
MFU 35–55% typical (higher end achievable with custom kernels, FP8, good parallelism).
MoE adds routing/communication overhead vs pure dense.
Full project multiplies final-run compute by 2–5×+ for R&D/experiments.
Architecture wins (hybrid sparse attention cutting effective FLOPs/KV cache ~70%+ at 1M context, mHC for stability with low overhead) are real and reduce waste.
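A minimal sensitivity sketch for the MFU and R&D adjustments listed above. The peak throughput of 989 TFLOPS is an assumption introduced here (roughly the dense BF16 datasheet figure for an H100 SXM), not a number from the post; the 9.7 × 10²⁴ FLOPs, the 35–55% MFU band, the $10–60M final-run cost, and the 2–5× multiplier are the post's. Treating the multiplier as "total project compute = final run × multiplier" is one reading of the post.

```python
PEAK_TFLOPS = 989.0        # ASSUMPTION: approximate dense BF16 peak of an H100 SXM
FINAL_RUN_FLOPS = 9.7e24   # V4-class final pre-training estimate from the post


def gpu_hours_at_mfu(flops, mfu, peak_tflops=PEAK_TFLOPS):
    """GPU-hours when each GPU sustains peak_tflops * mfu."""
    effective_flops_per_second = peak_tflops * mfu * 1e12
    return flops / (effective_flops_per_second * 3600)


def project_compute_cost(final_run_cost_usd, rd_multiplier):
    """Total compute spend if R&D scales the final-run cost by rd_multiplier."""
    return final_run_cost_usd * rd_multiplier


hours_low_mfu = gpu_hours_at_mfu(FINAL_RUN_FLOPS, 0.35)
hours_high_mfu = gpu_hours_at_mfu(FINAL_RUN_FLOPS, 0.55)

project_low_usd = project_compute_cost(10e6, 2)
project_high_usd = project_compute_cost(60e6, 5)

print(f"GPU-hours at 35% MFU: {hours_low_mfu / 1e6:.2f}M")
print(f"GPU-hours at 55% MFU: {hours_high_mfu / 1e6:.2f}M")
print(f"Compute incl. R&D: ${project_low_usd / 1e6:.0f}M to ${project_high_usd / 1e6:.0f}M")
```

Under that assumed peak, the 35–55% MFU band gives roughly 5–8 million GPU-hours, which sits inside the post's 4–9 million range.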
Stage-by-Stage Breakdown
1. Data Curation, Acquisition & Synthetic Generation
Curate/filter 32–50T+ high-quality tokens (web, code, science, long documents, agentic traces). Heavy synthetic flywheel for reasoning chains, trajectories, and preference data.
Domain balancing + versioning. Petabyte-scale storage with lineage.
For GPT-class: even larger/more diverse corpus.
Costs: Acquisition/licensing + pipelines: $15–50M.
Synthetic generation (inference on intermediates): $20–80M (major driver).
Human/expert annotation (targeted): $5–20M.
Storage/versioning platform: $10–25M.
DeepSeek-class subtotal: $50–175M.
GPT-class subtotal: $80–300M (larger scale).
2. Infrastructure & Hardware Setup
Sovereign cluster targeting 50k–150k+ B200/H200-class GPUs (or mixed optimized silicon) with high-bandwidth fabrics. Sustained MFU >50%.
Liquid cooling, redundant power (50–200+ MW peak). Custom kernels for hybrid attention, expert parallelism, Muon, and mHC.
Costs: GPUs/accelerators (procurement or long-term lease): $150–800M+.
Servers, networking, high-speed storage: $50–200M.
Data center/power/cooling build-out: $80–300M (power infrastructure often 30–50% of infra).
Early electricity & setup: $5–20M.
DeepSeek-class subtotal: $285–1,320M.
GPT-class subtotal: $500–2,500M+ (larger/more dense clusters).
3. Pre-Training
DeepSeek-class:
1.6T MoE with 49B active per token. Hybrid attention (CSA + HCA interleaved with sparse attention) needs ~27% of the FLOPs and ~10% of the KV cache of the prior generation at 1M context. mHC (residual matrices projected onto the Birkhoff polytope via Sinkhorn-Knopp) provides stability at trillion scale for ~6–7% overhead. Muon optimizer, mixed FP4/FP8. High MFU target.
GPT-class: Denser or much larger effective scale; higher raw FLOPs; less reliance on sparsity tricks.
Costs (final run + R&D/experiments multiplier):
GPU-hours/compute: $20–150M (DeepSeek-class final run lower due to efficiency; GPT-class much higher).
Electricity during training: $10–50M.
Experiments/ablations (2–5× final run): $40–400M+.
DeepSeek-class subtotal: $80–400M.
GPT-class subtotal: $300–1,500M+.
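For the mHC step mentioned in the pre-training stage above, the projection onto the Birkhoff polytope (the set of doubly stochastic matrices) via Sinkhorn-Knopp can be illustrated with a generic, pure-Python sketch. This is the textbook alternating row/column normalization, not DeepSeek's implementation; the function name, the exponentiation of raw scores, and the iteration count are assumptions made for illustration.

```python
import math


def sinkhorn_knopp(scores, iterations=100):
    """Approximately project a square score matrix onto the Birkhoff polytope.

    Exponentiates the scores so every entry is positive, then alternates
    row and column normalization until both sums are close to 1.
    """
    n = len(scores)
    m = [[math.exp(x) for x in row] for row in scores]
    for _ in range(iterations):
        # Normalize each row to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Hypothetical 3x3 mixing scores, chosen only for the demo.
raw = [[0.5, -1.0, 0.2],
       [1.5, 0.3, -0.7],
       [0.0, 0.8, -0.4]]
mixing = sinkhorn_knopp(raw)

row_sums = [sum(row) for row in mixing]
col_sums = [sum(mixing[i][j] for i in range(3)) for j in range(3)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

A doubly stochastic matrix has every row and column summing to 1 and all entries non-negative, so applying it to residual streams mixes them without inflating their overall scale, which is the stability property the post attributes to mHC.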
4. Post-Training, Alignment & Reasoning
Two-stage (domain-expert SFT + GRPO cultivation → on-policy distillation).
Synthetic preference data dominant. GRPO/DPO-style + distillation for reasoning/agentic gains without monolithic RL blowup.
GPT-class may need heavier RL or more iterations.
Costs: Compute (inference + RL loops): $15–80M.
Synthetic data & modeling: $10–40M.
Iteration & human oversight: $5–25M.
DeepSeek-class subtotal: $30–145M.
GPT-class subtotal: $60–300M.
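The GRPO step in this stage relies on group-relative advantages: several responses are sampled per prompt and each reward is normalized against its own group, so no separate value model is needed. Below is a minimal sketch of that normalization only; the function name, the `eps` constant, and the example rewards are illustrative assumptions, and a full GRPO loop (policy ratio, clipping, KL term) is out of scope here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat_group = group_relative_advantages([1.0, 1.0, 1.0])

print(advantages)
print(flat_group)
```

When all samples in a group score the same, every advantage is zero and that prompt contributes no gradient, which is one reason the post's emphasis on well-chosen synthetic preference data matters.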
5. Evaluation, Safety, Red-Teaming & Iteration
Full benchmark suite (SWE-Bench, GPQA, agentic, long-context, safety). Adversarial testing + constitutional frameworks. Multiple feedback loops.
Costs: $20–100M (both classes; GPT-class potentially higher iteration volume).
6. Inference, Deployment & Serving
Optimized engines (vLLM/SGLang-style) with continuous batching, speculative decoding, quantization (FP8/INT4), and MoE routing. Efficient 1M-context KV management. Production clusters sized for target QPS/latency.
Costs (initial capex): Serving clusters + optimization: $40–250M.
Electricity/ops (recurring): Scales with usage ($10–50M+/year initial).
DeepSeek-class subtotal (initial): $40–200M.
GPT-class subtotal (initial): $80–400M.
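To see why 1M-context KV management drives serving cost, here is a back-of-the-envelope sizing sketch. The model dimensions (60 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical placeholders, not figures for any real model; only the ~10% KV-cache factor comes from the post.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of a standard per-sequence KV cache.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# HYPOTHETICAL dimensions, chosen only to show the order of magnitude.
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)

# The post quotes ~10% of the prior generation's KV cache at 1M context.
compressed = baseline * 0.10

print(f"baseline:   {baseline / 1e9:.2f} GB per sequence")
print(f"compressed: {compressed / 1e9:.2f} GB per sequence")
```

Under these placeholder dimensions a single 1M-token sequence would need about 246 GB of cache uncompressed, more than one accelerator's memory, versus about 25 GB at the quoted 10% factor.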
7. Talent, Operations & Ecosystem
150–500+ team (researchers focused on MoE scaling, hybrid attention, mHC extensions, distillation; infra engineers for high-MFU kernels; safety/evals specialists).
Costs: $100–500M+ over project duration (salaries $400k–$1.5M+ total comp for top talent + equity).
Overall Totals (End-to-End, First System + Initial Scaling)
DeepSeek-V4-Pro class (efficient MoE path): $500 million to $1.5–2 billion (midpoint ~$800M–$1.2B realistic with strong execution). Leverages sparsity, hybrid attention, and mHC for lower intensity. Final runs can be remarkably efficient; full project still substantial due to R&D and infra.
GPT-5.5-Pro class (higher-end path): $2 billion to $5 billion+. Driven by significantly higher raw compute, denser scaling, and potentially more iteration.
Annual ongoing opex (power, talent, maintenance, inference at scale):
$100–400M+ after launch (scales with usage).
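As a cross-check, the seven stage subtotals quoted in the breakdown can be added up directly. The sketch below only sums the low and high ends of each range as stated in the post (evaluation and talent use the same range for both classes); the dictionary layout and names are illustrative. A straight sum is not the same thing as the headline ranges above, which read as judgment-based estimates rather than arithmetic totals.

```python
# Stage subtotals from the post, in millions of USD: (low, high).
STAGE_SUBTOTALS_USD_M = {
    "data":           {"deepseek": (50, 175),   "gpt": (80, 300)},
    "infrastructure": {"deepseek": (285, 1320), "gpt": (500, 2500)},
    "pre_training":   {"deepseek": (80, 400),   "gpt": (300, 1500)},
    "post_training":  {"deepseek": (30, 145),   "gpt": (60, 300)},
    "evaluation":     {"deepseek": (20, 100),   "gpt": (20, 100)},
    "serving":        {"deepseek": (40, 200),   "gpt": (80, 400)},
    "talent":         {"deepseek": (100, 500),  "gpt": (100, 500)},
}


def total_range(model_class):
    """Sum the low ends and the high ends of every stage for one model class."""
    low = sum(stage[model_class][0] for stage in STAGE_SUBTOTALS_USD_M.values())
    high = sum(stage[model_class][1] for stage in STAGE_SUBTOTALS_USD_M.values())
    return low, high


deepseek_total = total_range("deepseek")
gpt_total = total_range("gpt")

print(f"DeepSeek-class straight sum: ${deepseek_total[0]}M to ${deepseek_total[1]}M")
print(f"GPT-class straight sum:      ${gpt_total[0]}M to ${gpt_total[1]}M")
```

The straight sums come to about $0.6–2.8B and $1.1–5.6B, so hitting every upper bound at once would exceed the headline upper figure for the DeepSeek-class path.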
Strategic Recommendations for India AI Mission
Prioritize efficient path first (DeepSeek-class architecture) for faster ROI and capability at manageable cost, then scale toward higher-end.
Sovereign cluster: 50k–150k+ GPU-class with mixed sourcing and high-MFU focus.
R&D focus: Hybrid attention, mHC-style residuals, Muon/MoE optimizations, synthetic data pipelines.
Phased funding: Tied to milestones (e.g., stable 1M-context pre-train, agentic benchmark leadership).
Ecosystem: National data trust + talent incentives.
This is executable with disciplined engineering. The architectural efficiencies are real and create a genuine cost advantage, even after conservative adjustments for full project scope. The US controls make sovereign leadership not just desirable but urgent. This is the complete, from-scratch picture across both model classes.
@narendramodi @AmitShah @TVMohandasPai @svembu @SarvamAI
P.S. My two cents after working with language models since 2016, pre-transformers era.
PM @narendramodi Sir, we need an India AI Mission under you, with @NandanNilekani as vice chair and others from the private sector and govt., to help India tackle the AI revolution. We are way behind and need a national mission to get going quickly. Existing govt programs are too slow and way too small to make any large impact. We need an annual 50,000 cr fund for deep tech and AI, and a 200,000 cr ELGS Guarantee Fund to build hyper cloud, hardware and chips. @AshwiniVaishnaw @nsitharaman @PiyushGoyal @FinMinIndia @RBI We need a very large national mission. @AmitShah @amitmalviya
What's wrong with this post, @elonmusk @nikitabier @X @XCorpIndia, that you have suspended an account with a premium subscription?
Can anyone please tell me what I did wrong here?
Was it creating a full-blown thread of 25 posts with 24 painstakingly created AI images?
Wanting people to genuinely learn about LLMs from scratch, for FREE, in 30 weeks?
Or commenting about this upcoming work of mine on related "LLM education" posts to raise awareness?
Is it possible to overturn this suspension with a warning instead?
I appealed as well, but this was the instant reply. What's going on here?
Any help would be highly appreciated. @elonmusk @nikitabier @X @XCorpIndia By the way, this one is a "Premium+" account. Please don't suspend this one as well.
“Two sequences walk into a proof…”
One goes down, one goes up, both meet at \(\sqrt{ab}\).
Then a twist: swap + reciprocal symmetry still preserves the story.
Animated end-to-end in Manim.
Math has choreography. ✨
Find the YT video here: https://t.co/417CnIxYFS
#math #manim #visualization #invariance #arthurengel
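The post does not state the recurrences, so the sketch below assumes one classic pair with exactly this behavior: the arithmetic-harmonic mean iteration, where the upper sequence takes the arithmetic mean, the lower takes the harmonic mean, and the product stays invariant. The function name and the starting values 1 and 4 are illustrative choices.

```python
import math


def arithmetic_harmonic(a, b, steps=6):
    """Iterate upper = arithmetic mean, lower = harmonic mean of the current pair.

    The product upper * lower is invariant, so both sequences are squeezed
    toward sqrt(a * b).
    """
    upper, lower = max(a, b), min(a, b)
    history = [(upper, lower)]
    for _ in range(steps):
        upper, lower = (upper + lower) / 2, 2 * upper * lower / (upper + lower)
        history.append((upper, lower))
    return history


history = arithmetic_harmonic(1.0, 4.0)
for upper, lower in history:
    print(f"upper={upper:.12f}  lower={lower:.12f}  product={upper * lower:.12f}")
print("target:", math.sqrt(1.0 * 4.0))
```

The invariant does the work: since the product never changes and the gap between the two sequences shrinks at every step, the only possible common limit is the square root of the product.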