Specialized medicine must reach every corner of the planet. The greatest challenge facing global health systems has an answer: technology. 🌍🏥💻
The best technology in the world must be put in the hands of the world's most important problems.
A DEVELOPER JUST DID WHAT GOOGLE HAS BEEN IGNORING FOR YEARS
They built a browser in Rust designed specifically for task automation, web scraping, and AI agents
> Uses 30MB of RAM
> Pages load in 85ms
> Blocks 3,500+ trackers automatically
> Skips ads, analytics, and tracking scripts
It's called Obscura
And it has something Chrome will never have
Every session generates a different fingerprint. GPU, canvas, audio, battery... all randomized
Detectors can't catch it because it behaves exactly like real Chrome
It's a drop-in replacement for Puppeteer and Playwright
No Node.js. No dependencies. A single binary
It has 16k+ stars on GitHub. 100% open source. Free
Save it so you don't lose it 👇
1/
We have been treating GPU memory all wrong.
What if the GPU didn't need to store your model at all?
MegaTrain enables full-precision training of 100B+ LLMs on a single GPU by turning VRAM into a transient, stateless cache.
The secret? Inverting the memory hierarchy. 🧵
I often go back to one of my favorite diffusion model theory papers coz it closely matches and formally explains my vibe-intuitions for why they either perform really well or epically fail in different settings
When debugging your DDPM or flow code/implementation, the rough advice I was given is to overfit on a single sample, then a minibatch, then a few minibatches, and eventually the whole dataset. That's great but the stuff in the middle with minibatches has always been so difficult to _actually_ do, but somehow magically starts working again when I moved to full dataset training
The paper says diffusion models factorise local, independent (sub)structures in the data, composing different local motifs to create the final image/molecule. From a percolation theory POV, there's a phase transition after which the denoiser "snaps" particles in place during sampling. If you pass that threshold for transition, the denoiser follows the underlying score a lot better and if you don't, your model can't factorise and compose properly and you get garbage out during sampling
> Overfitting on a single sample is easy. The sample _IS_ the mode and the model just burns that sample into the weights. There's no need for factorisation or real composition at all coz every sampling step obviously moves you closer to the same target without any conflicting gradient signals
> When overfitting on a minibatch or a few minibatches, the loss is super erratic and the denoiser never accurately reproduces all the samples in the minibatch perfectly like it did with the single sample. Coz of so much variation even in the minibatch, there's super weak factorisation coz we don't yet cross the percolation threshold needed to achieve decent sample coherence. I think this is also a major reason for poor performance aside from the simpler "noisy SGD gradient" explanation
> Training on the full dataset works again properly and I see some good novelty from generated samples coz that's what I want to optimise for (eg: new proteins/binders, etc). At this full dataset scale, we cross a percolation threshold that's associated with strong factorisation. The score also is a lot smoother
When training diffusion models from scratch now, I've mostly been skipping the minibatch stuff in the middle and just moving to full dataset after overfitting on a single thing, and it's been OK
IMHO, one of the greatest theorems of Bregman Manifolds that has been often used for decomposing Kullback-Leibler divergence in applications:
Dual Foliations from mixed primal/dual coordinates.
Professor Amari's 2016 textbook
A tricky LLM interview question:
You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.
So you add KV cache compression and evict 90% of the cached tokens.
VRAM usage stays as is and GPU still runs out of memory.
Why?
(answer below)
Evicting 90% of the KV cache can free almost none of the memory it was using.
This sounds counterintuitive, but it follows directly from how production servers store the cache today.
The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.
This is the dominant memory cost for reasoning models.
A 32K-token CoT needs ~32K tokens of cached KV vectors, but a Qwen3-32B with 4-bit weights runs out of memory at around 24K tokens on a 24GB GPU.
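A rough back-of-the-envelope check of that number, as a sketch. The layer count, KV head count, head dimension, and fp16 cache precision below are assumptions for illustration and are not stated in this thread; plug in the real config of whatever model you serve.

```python
# All model numbers here are ASSUMPTIONS for illustration, not from the thread.
NUM_LAYERS = 64        # assumed number of transformer layers
NUM_KV_HEADS = 8       # assumed number of KV heads (grouped-query attention)
HEAD_DIM = 128         # assumed per-head dimension
BYTES_PER_VALUE = 2    # assumed fp16/bf16 KV cache


def kv_bytes_per_token() -> int:
    # One key vector and one value vector, per layer, per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib(num_tokens: int) -> float:
    # Total KV cache size in GiB for a single sequence of num_tokens.
    return num_tokens * kv_bytes_per_token() / 2**30


# 32B parameters at 4 bits (0.5 bytes) each, ignoring quantization metadata.
weights_gib = 32e9 * 0.5 / 2**30

for tokens in (8_192, 24_576, 32_768):
    total = weights_gib + kv_gib(tokens)
    print(f"{tokens:>6} tokens: KV {kv_gib(tokens):.2f} GiB, weights + KV {total:.2f} GiB")
```

Under these assumptions each token costs 256 KiB of cache, so 24K tokens is 6 GiB on top of roughly 15 GiB of weights. That leaves little headroom for activations and runtime overhead on a 24GB card.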
One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.
But this does not solve the memory problem yet.
The reason is paged attention, which is the memory manager behind vLLM and most production servers.
Under the hood, it splits GPU memory into fixed physical blocks, each holding the KV for about 16 tokens.
A block returns to the allocator only when every slot inside it is empty.
The eviction logic selects tokens by importance, and the survivors are scattered across blocks, so despite eviction most blocks are left with at least one survivor token.
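For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most blocks will still hold at least one token. With survivors scattered uniformly, the chance that all 16 slots of a block are evicted is only about 0.875^16 ≈ 12%.
This means the allocator frees only a small fraction of the blocks, far less than the 87.5% of tokens evicted.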
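A small simulation of the 14k-of-16k example, as a sketch. It assumes evicted tokens are scattered uniformly at random, which real importance-based eviction is not, so treat the count as a rough illustration. `count_freed_blocks` is a hypothetical helper, not a vLLM API.

```python
import random

BLOCK_SIZE = 16        # tokens per physical block, as in the thread
NUM_TOKENS = 16_000    # 1,000 blocks
NUM_EVICTED = 14_000


def count_freed_blocks(evicted: set) -> int:
    # A block is returned to the allocator only if all of its slots are evicted.
    freed = 0
    for start in range(0, NUM_TOKENS, BLOCK_SIZE):
        if all(slot in evicted for slot in range(start, start + BLOCK_SIZE)):
            freed += 1
    return freed


rng = random.Random(0)

# ASSUMPTION: importance-based eviction scatters uniformly over the sequence.
scattered = set(rng.sample(range(NUM_TOKENS), NUM_EVICTED))
# For contrast: evicting the oldest 14k tokens as one contiguous run.
contiguous = set(range(NUM_EVICTED))

freed_scattered = count_freed_blocks(scattered)
freed_contiguous = count_freed_blocks(contiguous)
print("scattered eviction frees ", freed_scattered, "of 1000 blocks")
print("contiguous eviction frees", freed_contiguous, "of 1000 blocks")
```

Same number of evicted tokens in both cases. The contiguous run frees 875 blocks, while the scattered pattern frees only a small minority of them.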
Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout.
Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 39, then 16,001, then 41, so the cache is no longer in token order.
Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.
This introduces another bookkeeping cost that an in-order layout inherently avoids.
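The bookkeeping can be sketched in a few lines. This is a toy model with hypothetical helper names (`evict`, `reuse_free_slot`), not how any real server lays out its cache: each physical slot has to record which token position it holds once slots are reused.

```python
# Physical slots in storage order; each entry is the token position stored there.
# In an in-order cache this list is implicit: slot i simply holds token i + 1.
slot_positions = list(range(1, 49))   # tokens 1..48 laid out in order


def evict(position: int) -> None:
    # Mark the slot holding this token position as empty.
    slot_positions[slot_positions.index(position)] = None


def reuse_free_slot(new_position: int) -> int:
    # Put a newly generated token into the first empty slot.
    slot = slot_positions.index(None)
    slot_positions[slot] = new_position
    return slot


evict(40)
reused_slot = reuse_free_slot(16_001)

live = [p for p in slot_positions if p is not None]
in_token_order = live == sorted(live)
print(slot_positions[38:41], "in token order:", in_token_order)
```

Once a slot is reused, the explicit `slot_positions` list is no longer redundant. It is the per-slot note the attention kernel would need to apply the right positional information.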
So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.
There's another problem.
Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).
But fast attention kernels used in production, like FlashAttention, never save those scores.
They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast.
So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.
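To see why the scores are never available, here is a minimal pure-Python sketch of the tiled, online-softmax idea for a single query. It is an illustration of the general technique only, not FlashAttention's actual kernel; the tile size and the random inputs are arbitrary assumptions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def eager_attention(q, keys, values):
    # Builds and keeps the full row of scores, which eviction methods can inspect.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, scores


def tiled_attention(q, keys, values, tile=4):
    # Processes keys tile by tile, keeping only a running max, a running sum,
    # and a running weighted sum of values. Tile scores are discarded each step.
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        tile_scores = [dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(tile_scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first tile
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(tile_scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(8)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]

eager_out, full_scores = eager_attention(q, keys, values)
tiled_out = tiled_attention(q, keys, values)
max_diff = max(abs(a - b) for a, b in zip(eager_out, tiled_out))
print("max difference between eager and tiled output:", max_diff)
```

Both paths produce the same output up to floating-point error, but only the eager one ever holds `full_scores`. The tiled version has nothing to hand to an eviction method that ranks tokens by attention score.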
NVIDIA published a method called TriAttention to solve both these problems.
It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.
For the memory problem, it runs a compaction pass every 128 decoded tokens.
The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.
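The effect of a compaction pass can be sketched on a toy slot list. This is not NVIDIA's implementation: `compact` and `blocks_in_use` are hypothetical helpers, and a real server would move KV tensors and update block tables rather than rebuild a Python list.

```python
BLOCK_SIZE = 16   # tokens per physical block, as in the thread


def blocks_in_use(slots) -> int:
    # A block stays allocated while any of its slots still holds a token.
    return sum(
        any(p is not None for p in slots[i:i + BLOCK_SIZE])
        for i in range(0, len(slots), BLOCK_SIZE)
    )


def compact(slots):
    # Slide survivors forward, preserving their order, and drop the
    # now-empty trailing blocks (they go back to the allocator).
    survivors = [p for p in slots if p is not None]
    padding = (-len(survivors)) % BLOCK_SIZE
    return survivors + [None] * padding


# 16,000 token positions across 1,000 blocks; keep every 8th token,
# so 14,000 are evicted and every block retains exactly 2 survivors.
slots = [pos if pos % 8 == 0 else None for pos in range(16_000)]

before = blocks_in_use(slots)
compacted = compact(slots)
after = blocks_in_use(compacted)

live = [p for p in compacted if p is not None]
print("blocks before:", before, "after:", after, "still ordered:", live == sorted(live))
```

Before compaction every one of the 1,000 blocks is pinned by two survivors. After it, the same 2,000 tokens fit in 125 blocks, still in token order, so no per-slot position notes are needed.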
On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.
KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.
You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu
I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.
Read it below.
Stephen Boyd, Stanford EE professor:
"Wall Street pays quant researchers $400K–$750K a year to estimate a hidden state from noisy data. It's the same filter that landed Apollo on the moon, and it's the final lecture of a free Stanford course."
this free stanford lecture from 17 years ago holds the entire "kalman filter edge" the 2026 quant threads sell you. in his last ee263 lecture, boyd builds state estimation from scratch: you can't see the true state, only noisy measurements, so you blend what your model predicts with what you just observed, weighted by how much you trust each. that blend is the kalman gain. that's the whole thing.
it's the exact filter the thread codes up for a dynamic hedge ratio. swap "spacecraft position" for "the true relationship between two assets" and the equations are identical. rudolf kalman published it in 1960. boyd has taught it free since the 2000s.
so the math was never the moat. the hedge ratio that updates every tick, the uncertainty that calibrates your z-score, all of it is standard state estimation, public for over sixty years.
and here's the honest part the thread is actually good about. the filter is only optimal if your assumptions hold: linear dynamics, gaussian noise, a Q and R you set correctly. get the noise model wrong and it tracks confidently in the wrong direction. the lecture is free. the judgment in how fast you let the hedge ratio drift, that ratio of process to measurement noise tuned to the pair in front of you, is the part that actually takes skill.
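a minimal sketch of that blend for a scalar hedge ratio, under stated assumptions: the ratio follows a random walk with process noise Q, one asset is observed as ratio times the other plus measurement noise R, and the data is synthetic. the function name, the Q and R values, and the starting guess are illustrative choices, not taken from the lecture or the thread.

```python
import random


def kalman_hedge_ratio(xs, ys, q=1e-5, r=0.01, beta0=0.0, p0=1.0):
    """Scalar Kalman filter for y_t = beta_t * x_t + noise, beta_t a random walk.

    q: assumed process noise variance (how fast beta may drift)
    r: assumed measurement noise variance
    """
    beta, p = beta0, p0
    history = []
    for x, y in zip(xs, ys):
        # Predict: a random walk keeps the estimate, uncertainty grows by q.
        p_pred = p + q
        # Innovation: what we observed minus what the model predicted.
        innovation = y - beta * x
        innovation_var = x * x * p_pred + r
        # Kalman gain: how much to trust the new observation.
        gain = p_pred * x / innovation_var
        # Update: blend the prediction with the observation.
        beta = beta + gain * innovation
        p = (1.0 - gain * x) * p_pred
        history.append(beta)
    return history


# Synthetic pair with a constant true hedge ratio (assumption for the demo).
rng = random.Random(0)
TRUE_BETA = 1.5
xs = [rng.uniform(1.0, 2.0) for _ in range(500)]
ys = [TRUE_BETA * x + rng.gauss(0.0, 0.1) for x in xs]

estimates = kalman_hedge_ratio(xs, ys)
print("final hedge ratio estimate:", round(estimates[-1], 3))
```

raising q relative to r lets the estimate drift faster and chase noise. lowering it makes the filter slow to notice a real change. that ratio is the tuning knob.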
🧠 Today we introduce Un-0 from @unconvAI : the first large-scale generative model built on physics as a compute primitive. This represents a “hello world” moment for physics-based models. We use the inherent time-varying behavior of physical systems to do compute for us. The result is a new way to build a computer that can be VASTLY more power efficient. 🧵
https://t.co/zYU0ezXJUq
🧠 Can we recover the equations governing a system directly from noisy, high-dimensional observations?
#DYSCO, by Paolo Muratore, is a first step: it learns latent spaces that capture the underlying dynamics & enables recovery of governing equations
📄https://t.co/gFCj7JCxMX
Improved Large Language Diffusion Models
"We introduce iLLaDA (improved LLaDA), an 8B fully bidirectional masked diffusion language model trained from scratch. For pre-training, iLLaDA scales the corpus to 12T tokens, uses grouped-query attention to reduce cache-style inference memory and tied input/output embeddings to reduce parameter count, and modifies the learning-rate schedule for large-scale training. For post-training, iLLaDA modifies the SFT strategy for variable-length generation and trains on a 25B-token instruction corpus for 12 epochs. For inference and evaluation, iLLaDA uses variable-length generation for efficiency and confidence-based scoring for multiple-choice benchmarks"
"Against Qwen2.5 7B, iLLaDA-Base is slightly stronger on average, while iLLaDA-Instruct still lags behind Qwen2.5 7B Instruct."
Autodata: An agentic data scientist to create high quality synthetic data
"We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data."
Data creation stage + data analysis stage + meta-optimization
Neural networks for dynamical systems map high-dimensional fields to low-dimensional manifold latent spaces, where dynamics evolve before the decoder reconstructs the predicted field.
The encoder, consisting of convolutional layers and multilayer perceptrons, compresses the initial field into latent coordinates w and z on the manifold.
The upper part visualizes the encode-project-decode process with example point cloud manifolds.
It is used to accelerate long-term predictions of complex fields in scientific computing applications like fluid dynamics simulations.
The Platonic Representation Hypothesis is mostly a statistical illusion.
New research shows that the apparent "global convergence" of scaled AI models is actually a mathematical artifact of model width and depth selection bias.
Once calibrated, global convergence vanishes. 🧵
We just launched a new premium feature to help users prepare for ML interviews from start to finish.
It helps you:
- Build projects to put on your resume
Create strong ML projects that are tailored for the company and the role
- Prepare round by round
Understand what is usually asked in each stage of the ML interview process.
- Practice with mock assessments
Get realistic mock interview practice and feedback to improve before the real thing.
Currently we only support Google DeepMind, but we are working on adding more roles at other companies.