🚨Just IN: MIT proved you can delete 90% of a neural network without losing accuracy.
Researchers found that inside every massive model, there is a "winning ticket": a tiny subnetwork that does all the heavy lifting.
They showed that if you find it, reset its weights to their original initialization, and retrain it on its own, it matches the accuracy of the giant version.
But there was a catch that killed adoption instantly...
You had to train the massive model first to find the ticket. Nobody wanted to train twice just to deploy once. It was a cool academic flex, but useless for production.
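The find, reset, retrain loop described above can be sketched on a toy problem. This is a minimal sketch, assuming one-shot magnitude pruning on a tiny linear model trained with plain gradient descent. The helper names (`train`, `magnitude_mask`, `loss`), the data, and the hyperparameters are all hypothetical illustrations, not the paper's setup. It also shows the catch: step 1 is a full dense training run.

```python
import random

def train(w, mask, data, lr=0.1, epochs=300):
    """Gradient descent on mean squared error; masked-out weights stay at zero."""
    w = [wi * mi for wi, mi in zip(w, mask)]
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * err * xj / len(data)
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

def magnitude_mask(w, keep):
    """0/1 mask that keeps the `keep` largest-magnitude weights."""
    order = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    kept = set(order[:keep])
    return [1.0 if j in kept else 0.0 for j in range(len(w))]

def loss(w, data):
    """Mean squared error of the linear model on the data."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in data) / len(data)

random.seed(0)
dim = 10
# Toy target that depends on only 2 of the 10 inputs (assumption for illustration).
true_w = [3.0, -2.0] + [0.0] * (dim - 2)
data = []
for _ in range(50):
    x = [random.gauss(0, 1) for _ in range(dim)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

w_init = [random.gauss(0, 0.1) for _ in range(dim)]  # saved "original state"
dense = train(w_init, [1.0] * dim, data)             # 1) train the full model
mask = magnitude_mask(dense, keep=2)                 # 2) prune small weights
ticket = train(w_init, mask, data)                   # 3) reset to init, retrain sparse

print("kept indices:", [j for j, m in enumerate(mask) if m])
print("dense loss:", loss(dense, data), "ticket loss:", loss(ticket, data))
```

The published procedure prunes iteratively over several rounds on real networks; the single round here only shows the shape of the loop.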
The original 2018 paper was mind-blowing:
But today, after 8 years…
We finally have the silicon-level breakthrough we were waiting for: structured sparsity.
Modern GPUs (NVIDIA Ampere+) don’t just “simulate” pruning anymore.
They have native support for fine-grained structured sparsity (the 2:4 pattern: two of every four consecutive weights are zero) built directly into the hardware.
It’s not theoretical, it’s silicon-level acceleration.
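The math is terrifyingly good, with one caveat: the 2:4 pattern the hardware accelerates is 50% sparse, not 90%. That still means roughly half the weight memory traffic + up to 2× tensor-core throughput. Real speed.. near-zero accuracy loss after fine-tuning.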
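To make the 2:4 pattern concrete: in every group of four consecutive weights, two are kept and two are zeroed, so the result is 50% sparse by construction. Below is a minimal sketch, assuming the simplest selection rule (keep the two largest magnitudes per group). The function name `two_four_mask` and the example weights are hypothetical, and real toolchains may choose the kept weights differently.

```python
def two_four_mask(row):
    """Return a 0/1 mask keeping the 2 largest-magnitude values in each group of 4."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    mask = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest magnitudes within this group of four.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        mask.extend(1 if j in keep else 0 for j in range(4))
    return mask

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
mask = two_four_mask(weights)
pruned = [w * m for w, m in zip(weights, mask)]
sparsity = 1 - sum(mask) / len(mask)

print("mask:    ", mask)
print("pruned:  ", pruned)
print("sparsity:", sparsity)
```

Because every group has exactly two non-zeros, the hardware can store the kept values plus a small index per group, which is where the bandwidth and throughput savings come from.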
Three things just made this production-ready in 2026:
- pruning-aware training (you train sparse from day one)
- native support in PyTorch 2.0 and the Apple Neural Engine
- the realization that AI models are 90% redundant by design
Evolution over-parameterizes everything. We’re finally learning how to prune.
The era of bloated, inefficient models is officially over. The tooling finally caught up to the theory, and the winners are going to be the ones who stop paying for 90% of weights they don’t even need.
The future of AI is smaller, faster, and smarter.
"Hyperloop Transformers"
This paper proposes a memory-efficient LLM via looped Transformers.
They basically reuse the middle block across depth, then add hyper-connections only between loops.
Key result is that this restores flexibility lost from weight sharing, letting the model beat depth-matched Transformers with ~50% fewer parameters. The result still holds after INT4 quantization too.
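A rough back-of-the-envelope for where the parameter savings come from: a looped middle block is stored once but applied several times, so depth grows while the weight count does not. This is a sketch under stated assumptions. The layer sizes, the begin/middle/end split, and the loop count are hypothetical numbers chosen for illustration, not the paper's configuration, and the count ignores embeddings, norms, biases, and whatever the hyper-connections add.

```python
def layer_params(d_model, d_ff):
    """Rough per-layer weight count: 4 attention projections + 2 MLP matrices."""
    return 4 * d_model * d_model + 2 * d_model * d_ff

def stored_layers(n_begin, n_middle, n_end):
    """Layers that actually hold weights; the middle block is stored once."""
    return n_begin + n_middle + n_end

def effective_depth(n_begin, n_middle, n_end, n_loops):
    """Layers applied in a forward pass when the middle block runs n_loops times."""
    return n_begin + n_middle * n_loops + n_end

# Hypothetical configuration (assumption, not from the paper).
d_model, d_ff = 1024, 4096
n_begin, n_middle, n_end, n_loops = 2, 4, 2, 3

depth = effective_depth(n_begin, n_middle, n_end, n_loops)
looped = stored_layers(n_begin, n_middle, n_end) * layer_params(d_model, d_ff)
baseline = depth * layer_params(d_model, d_ff)  # depth-matched, no sharing
ratio = looped / baseline

print("effective depth:", depth)
print("looped / depth-matched parameter ratio:", ratio)
```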
Latent space reasoning via looped transformers has gained attention lately. It is rooted in optimization unrolling, where each loop implicitly models a GD step on hidden states. Our ICLR paper asked: what if we explicitly run GD in latent space at test time?
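The unrolling view can be illustrated with a toy: apply the same update repeatedly to a hidden state, where each pass is one gradient descent step on an energy. This is a minimal sketch, assuming a simple per-coordinate quadratic energy. The energy, the names (`energy`, `gd_step`, `looped_refine`), and the numbers are hypothetical and are not the objective used in the paper.

```python
def energy(h, target, curv):
    """Toy energy: 0.5 * sum_i curv_i * (h_i - target_i)^2."""
    return 0.5 * sum(c * (hi - ti) ** 2 for hi, ti, c in zip(h, target, curv))

def gd_step(h, target, curv, lr):
    """One gradient step; the gradient w.r.t. h_i is curv_i * (h_i - target_i)."""
    return [hi - lr * c * (hi - ti) for hi, ti, c in zip(h, target, curv)]

def looped_refine(h0, target, curv, lr, n_loops):
    """Apply the same step n_loops times, like a weight-shared block run in a loop."""
    h = list(h0)
    trace = [energy(h, target, curv)]
    for _ in range(n_loops):
        h = gd_step(h, target, curv, lr)
        trace.append(energy(h, target, curv))
    return h, trace

h, trace = looped_refine(
    h0=[0.0, 0.0, 0.0],
    target=[1.0, -2.0, 0.5],
    curv=[1.0, 2.0, 0.5],
    lr=0.4,
    n_loops=8,
)
print("refined hidden state:", h)
print("energy per loop:", trace)
```

More loops mean more refinement of the hidden state with no extra weights, which is the sense in which depth can be traded for test-time compute.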
1/
Flat minima theory is breaking. At modern scale, gradient descent doesn't settle into a nice convex bowl. It bounces chaotically at the "Edge of Stability."
Turns out, this chaos is exactly why massively overparameterized networks generalize. 🧵
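For context on the name: on a quadratic loss with curvature ("sharpness") λ, gradient descent with step size η is stable only when λ < 2/η. Just past that threshold the iterates oscillate and grow instead of settling. The sketch below checks this on a one-dimensional quadratic. It is a toy, not a neural network, and the function name `run_gd` and the specific numbers are assumptions chosen for illustration.

```python
def run_gd(sharpness, lr, steps=20, x0=1.0):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f is sharpness * x
    return abs(x)

lr = 0.1
threshold = 2 / lr  # stability limit on the sharpness for this step size

below = run_gd(sharpness=19.0, lr=lr)  # per-step factor |1 - 1.9| < 1: shrinks
above = run_gd(sharpness=21.0, lr=lr)  # per-step factor |1 - 2.1| > 1: grows

print("threshold sharpness:", threshold)
print("final |x| just below threshold:", below)
print("final |x| just above threshold:", above)
```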
Hamming's talk is so important that I reproduced it on my site. It's one of the only things on my site written by someone else.
https://t.co/kWvKdwIiOm
🇫🇷 A French tax official was arrested for selling crypto investors' home addresses and financial records to criminal networks.
41 kidnappings followed. One every 2.5 days since January 2026.
The criminals didn't need to hack anything. They bought a list from someone inside the government.
France is the most dangerous country in the world right now if you hold crypto and someone knows about it 💀
Source: Le Monde
Our newest @OriginsProject podcast, What's New in Science with @skdh & Lawrence Krauss: From Ghost Murmers to AI Cures, will premiere at 4 PM ET today. Don't miss it! https://t.co/y2zc0KEJc2 via @YouTube
The Japanese railway privatization of 1987 stands as one of the most devastating defeats ever dealt to statist transportation mythology. The government split the bloated Japan National Railways into seven regional companies, sold them off, and watched private ownership transform a bankruptcy-bound disaster into the world's most efficient rail system.
JNR hemorrhaged money for decades before privatization. By 1987, the state railway carried debt equivalent to $200 billion in today's money while delivering mediocre service plagued by strikes and inefficiency. Politicians treated it as a jobs program rather than a transportation service. The predictable result: chronic losses, deteriorating infrastructure, and customer service that reflected government monopoly arrogance.
Private ownership changed everything overnight. The new JR companies slashed operating costs by 40% within five years while dramatically improving service quality. JR East alone now generates annual profits exceeding $3 billion. These companies invest billions in cutting-edge technology, maintain punctuality rates above 99%, and operate the world's most advanced high-speed rail networks. They achieved this without a single yen of operational subsidies.
The transformation reveals a core dynamic of transportation infrastructure: private companies must satisfy customers to survive, while government monopolies need only satisfy politicians. JR companies diversified into real estate, retail, and hospitality around their stations, creating integrated profit centers that cross-subsidize rail operations. Government railways never innovate this way because bureaucrats face no market pressure to generate returns.
Meanwhile, Amtrak burns through $2 billion in annual subsidies while delivering third-world service across most routes, and European state railways require massive taxpayer bailouts every few years to stay solvent.
Attention sinks and compression valleys? Same coin.
Presenting at #ICLR2026 this morning: "Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin." Two phenomena everyone's been studying separately turn out to have the same root cause - massive activations in the residual stream. We prove it, show it across models from 410M to 120B, and use it to propose Mix-Compress-Refine: a three-phase view of how transformers organize computation in depth.
w/ Enrique* @arroyo_alvr @fedzbar @epomqo @mmbronstein @ylecun
Pavilion 3, P3-#2002, 10:30 AM local time
We are honored to invite you to a debate led by the President of the Nation, @JMilei, together with national deputy @AdrianRavier and Juan Carlos de Pablo, at the Palacio Libertad, which will examine John Maynard Keynes's "The General Theory of Employment, Interest and Money" and its consequences for modern economies.
Registration link: https://t.co/hoA8MSv2sJ
IMPORTANT INFORMATION: Registration is free via an online form, with seating on a first-come, first-served basis. Registration does not guarantee a seat, and capacity is limited.
📅 April 28
🕡 6:00 PM
📍 Palacio Libertad
Marc Andreessen reveals the exact framework Elon Musk uses to run six companies at once and outpace entire industries. It comes down to a rare combination of old-school industrialism and extreme, hands-on engineering.
"The CEO has to not just be a great CEO, they also have to be like a great technologist," Andreessen explains. While most executives rely on distant memories of being a programmer, Elon has the encyclopedic knowledge to sit down with a chip designer at 2 AM in Austin and actually figure out what is wrong with the hardware. He is able to go hands-on with rocket designers, AI engineers, and everything in between.
Instead of traditional corporate management, he treats everything as a production line. Every week, he maps out the entire operation on monitors, identifies the one critical bottleneck slowing things down, and goes directly to the engineers to solve it.
This is the secret to his speed. While a normal company might take six months to clear a single issue, Elon is fixing the critical production bottleneck at his companies 52 times a year himself. He runs this loop over and over again.
This relentless approach creates what one former SpaceX employee described as a "zone of shocking competence." Because Elon talks directly to the people actually doing the work, he instantly sniffs out incompetence. Anyone who cannot cut it is let go.
But it is also the ultimate talent magnet. The absolute best engineers in the world want to work for him because he is the rare CEO who can actually be an engineering peer. It is a highly systematic way of optimizing a company to take on profound challenges and solve them at an unmatched speed.
New episode of The Information Bottleneck is out, this time with @liuzhuang1234 (Princeton).
We talked about ConvNeXt and whether architecture still matters; dataset bias and what "good data" actually looks like; ImageBind and why vision is the natural bridge across modalities; CLIP's blind spots; memory as the real bottleneck behind the agent hype; whether LLMs have world models; and Transformers Without Normalization.
For years, the vision community debated what actually matters: architecture, inductive bias, self-attention vs convolution. After a lot of back-and-forth, we ended up in a funny place: ViT and ConvNet give roughly the same performance once you tune the details.
What I find interesting is that once you reach a certain performance level, it becomes much easier to swap and tweak components without really changing the outcome.
Talking to Zhuang on this episode, I kept wondering whether the same is now true for LLMs. If we spent serious time on an alternative architecture today, would we actually get a meaningfully different model, or just land on the same Pareto curve with extra steps?
I'm starting to suspect it's the latter. Architecture matters less than we think. Data, compute, and a handful of pillars do most of the work.
It is no coincidence that yesterday the U.S. chargé d'affaires in Venezuela, John Barrett, met with Delcy and Diosdado, and OFAC immediately issued a license allowing Delcy to pay the multimillion-dollar fees of Maduro's and Cilia's lawyers.
What did Delcy give up in exchange?