If you're building a startup that makes 510(k) medical devices for elective procedures in the USA, I might be interested in investing. DM me.
I'm weak on biology - but have trained in operations research/industrial engineering + polymer-textile-fiber eng.
In a world of AI - as a software developer - why are you building a platform that requires hours of user configuration and meddling?
F* that.
Take the customer to the promised land they could only half fathom before your existence.
We’ve been struggling with GEMM efficiency on M=16,K=4096,N=16 shapes (common in cross‑attention for video).
‘DecomposeK’ + fusion of elementwise ops could be a game‑changer for our per‑step training time.
Congrats to the PyTorch team!
Super excited to share some work the torch.compile team has done on generating state-of-the-art GEMMs through Inductor! We present DecomposeK, a new way to do split-K GEMM, initially presented at PyTorch Conference Europe, that regularly beats cuBLAS for split-K shapes.
🧵👇
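A minimal sketch of the split-K idea as I understand it, not the Inductor implementation: cut the K dimension into chunks, multiply each chunk independently, then sum the partial products. On a GPU the chunks would presumably run as one batched matmul with the reduction fused in; that mapping is an assumption on my part. The names `matmul`, `decompose_k_matmul`, and `num_splits` are made up for illustration, and pure Python is used only so the arithmetic is easy to check.

```python
def matmul(a, b):
    """Plain (M, K) x (K, N) matrix product on nested lists."""
    k_dim, n_dim = len(b), len(b[0])
    return [[sum(row[k] * b[k][n] for k in range(k_dim)) for n in range(n_dim)]
            for row in a]


def decompose_k_matmul(a, b, num_splits):
    """Split K into num_splits chunks, multiply each chunk, sum the partials."""
    k_dim = len(b)
    if k_dim % num_splits != 0:
        raise ValueError("K must be divisible by num_splits")
    chunk = k_dim // num_splits
    m_dim, n_dim = len(a), len(b[0])
    out = [[0] * n_dim for _ in range(m_dim)]
    for s in range(num_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        a_chunk = [row[lo:hi] for row in a]   # shape (M, K / num_splits)
        b_chunk = b[lo:hi]                    # shape (K / num_splits, N)
        partial = matmul(a_chunk, b_chunk)    # each split is independent work
        for i in range(m_dim):                # reduction over the split axis
            for j in range(n_dim):
                out[i][j] += partial[i][j]
    return out


# Small integer matrices so the comparison below is exact.
M, K, N = 2, 8, 3
A = [[(i + k) % 5 for k in range(K)] for i in range(M)]
B = [[(k * j + 1) % 7 for j in range(N)] for k in range(K)]
assert decompose_k_matmul(A, B, num_splits=4) == matmul(A, B)
```

With M=16, K=4096, N=16 there is very little output to parallelize over, so splitting K is what exposes extra independent work; the cost is the extra reduction at the end.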
I don’t know why some people argue that the brain is a receiver of consciousness. The brain could at most be a sensor, a receiver, an ammeter, and a processor of consciousness. We are writing a paper on the SWP device, that is, sensing whole in part. So the universe is a fractal network of brains.
Beff is right: backprop isn't going away just because we found a "cleaner" algorithm. It's dying because the hardware is finally catching up to the laws of physics.
We're on our way to Thermodynamic Equilibrium Learning hybrids, which will train by physically annealing chips that are analog or stochastic (Extropic's TSUs, noisy oscillators, and photonic arrays). Learned representations are the same as equilibrium states. Built-in thermal noise means free exploration (no more hand-made schedulers or dropout). No activation storage, no backward pass, and no von Neumann wall.
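A software stand-in for the annealing idea, under loud assumptions: this is ordinary simulated annealing with the Metropolis rule on a toy Ising-style energy, not Extropic's hardware and not any published thermodynamic training method. It only illustrates "noise as exploration": at high temperature the system accepts uphill moves, and as it cools it settles into a low-energy state. `energy`, `anneal`, `COUPLINGS`, and all parameter values are hypothetical.

```python
import math
import random


def energy(state, weights):
    """Ising-style energy: lower when positively coupled spins agree."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j]
                for i in range(n) for j in range(i + 1, n))


def anneal(weights, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Flip one spin at a time while the temperature cools geometrically."""
    rng = random.Random(seed)
    n = len(weights)
    state = [rng.choice((-1, 1)) for _ in range(n)]
    current = energy(state, weights)
    best_state, best_energy = state[:], current
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        state[i] = -state[i]                      # propose a single spin flip
        proposed = energy(state, weights)
        delta = proposed - current
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = proposed
            if current < best_energy:
                best_state, best_energy = state[:], current
        else:
            state[i] = -state[i]                  # reject: undo the flip
    return best_state, best_energy


# Three spins that all prefer to agree (assumed toy couplings).
COUPLINGS = [[0.0, 1.0, 1.0],
             [1.0, 0.0, 1.0],
             [1.0, 1.0, 0.0]]
best_state, best_energy = anneal(COUPLINGS)
print(best_state, best_energy)
```

Nothing here is trained; the "equilibrium state" is simply the lowest-energy spin configuration the noisy search found.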
Early 2026 demos show that non-toy tasks (CIFAR/ImageNet subsets, medical imaging, and even seq modeling) can use 10 to 25 times less energy while still matching backprop accuracy. By 2027, frontier pilots should be able to handle real generative workloads at less than 20% of the energy draw... physics wins the race for efficiency.
For pure-digital superclusters, backprop hangs on for a little while longer, but it becomes obsolete like perceptrons are today.
The biggest change since transformers? Hardware-software co-design that lets the chip optimize itself in real time. 🤯
Physics goes beyond von Neumann bottlenecks. Who's betting against the heat? 🔥
Even with strategies like paged attention and flash offloading, which helped companies like SanDisk reach $150 billion or higher valuations by resolving inference latency through DMA-overlapped HBM loads, Transformers' quadratic scaling is hitting walls.
However, SSMs (like the Mamba variants) are linear-time monsters that are ideal for condensing lengthy sequences into small states. Hybrid Transformer-SSM distills, which cut memory use by 5–6x on 1B+ models, retain only 2% of attention heads for retrieval while offloading the remainder to SSMs.
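To make "linear-time" and "small state" concrete, here is a minimal sketch of a diagonal, time-invariant state-space recurrence: h_t = decay * h_{t-1} + in_gain * x_t, y_t = sum(out_gain * h_t). This is a simplification; Mamba-style models make the parameters depend on the input, which this sketch does not do. `ssm_scan` and its parameter names are hypothetical.

```python
def ssm_scan(inputs, decay, in_gain, out_gain):
    """One pass over the sequence; the state size never grows with length."""
    state = [0.0] * len(decay)
    outputs = []
    for x in inputs:
        # Elementwise update of each state channel.
        state = [a * h + b * x for a, h, b in zip(decay, state, in_gain)]
        # Read out a single scalar per step.
        outputs.append(sum(c * h for c, h in zip(out_gain, state)))
    return state, outputs


final_state, ys = ssm_scan([1.0, 1.0, 1.0], decay=[0.5], in_gain=[1.0], out_gain=[1.0])
print(final_state, ys)
```

The loop does a fixed amount of work per token, and the only thing carried forward is `state`, whose length is set by `decay` and not by how long the sequence is.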
This changes by 2027: Consider pre-encoding a video database or a corpus of 10M tokens into an SSM "trajectory" bundle, which is essentially a learned recurrent path that records dependencies, without keeping track of each KV pair. With the help of end-to-end fine-tuning that incorporates selection logic, the model only unrolls the pertinent sub-paths during queries.
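One hypothetical reading of the "trajectory" idea, sketched with a toy recurrence: run the corpus through the recurrence once offline, keep a state snapshot every `stride` tokens, and at query time resume from the nearest earlier snapshot and replay only the gap. Everything here is an assumption for illustration (`step`, `encode_trajectory`, `state_at`, the fixed `stride`); in particular the learned selection logic is replaced by a plain position lookup, and the raw tokens are still kept around for replay.

```python
def step(state, x, decay, in_gain):
    """One recurrence update; returns a new list so snapshots stay untouched."""
    return [a * h + b * x for a, h, b in zip(decay, state, in_gain)]


def encode_trajectory(tokens, decay, in_gain, stride):
    """Offline pass: store the state after every `stride` tokens."""
    state = [0.0] * len(decay)
    snapshots = {0: state}
    for pos, x in enumerate(tokens, start=1):
        state = step(state, x, decay, in_gain)
        if pos % stride == 0:
            snapshots[pos] = state
    return snapshots


def state_at(tokens, snapshots, position, decay, in_gain, stride):
    """Query time: unroll only from the nearest snapshot up to `position`."""
    start = (position // stride) * stride
    state = snapshots[start]
    for x in tokens[start:position]:
        state = step(state, x, decay, in_gain)
    return state


TOKENS = [float(i % 3) for i in range(20)]
DECAY, IN_GAIN, STRIDE = [0.5, 0.25], [1.0, 1.0], 4
SNAPSHOTS = encode_trajectory(TOKENS, DECAY, IN_GAIN, STRIDE)
print(state_at(TOKENS, SNAPSHOTS, 10, DECAY, IN_GAIN, STRIDE))
```

The replayed state matches what a full scan up to the same position would give, while touching at most `stride - 1` tokens per query.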
Bloated KV caches or even CAG hybrids won't be used in the future for multi-modal LLMs over large datasets. Instead, the field will switch to "State-Space Compression" (SSC), which involves offline distillation of databases and multi-modal inputs (text, images, and video) into recurrent SSM trajectories that are then unrolled on-demand during inference.
By learning the compression heuristics for you, gradient descent will transform retrieval into smooth state transitions, likely reducing memory by 70–90%, while allowing for "infinite" context, without causing attentional explosions.
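A back-of-the-envelope sketch of where the memory goes, assuming the usual accounting: a KV cache stores one key and one value vector per token, per layer, per KV head, while a recurrent state has a fixed size. The model dimensions below are made-up example values, not any real model's config, and `memory_reduction` just plugs numbers into the ratio; it does not validate the 70–90% figure.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values (factor 2) for every token, layer, and KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(layers, channels, state_dim, bytes_per_value=2):
    """Fixed-size recurrent state; independent of sequence length."""
    return layers * channels * state_dim * bytes_per_value


def memory_reduction(kv_bytes, kept_fraction, state_bytes):
    """Fraction of memory saved if only kept_fraction of the KV cache remains."""
    return 1 - (kept_fraction * kv_bytes + state_bytes) / kv_bytes


# Hypothetical config: 32 layers, 8 KV heads of width 128, fp16 values.
kv = kv_cache_bytes(seq_len=10_000_000, layers=32, kv_heads=8, head_dim=128)
ssm = ssm_state_bytes(layers=32, channels=4096, state_dim=16)
print(f"KV cache for 10M tokens: {kv / 1e12:.2f} TB")
print(f"SSM state: {ssm / 2**20:.0f} MiB")
print(f"Saved when keeping 20% of the KV cache: {memory_reduction(kv, 0.2, ssm):.1%}")
```

Under these assumed numbers the cache grows by 128 KiB per token while the recurrent state stays at a few MiB, so the saving is set almost entirely by how much of the attention cache you keep.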