Robert Gomez AI

@robertgomezai

CTO in AIMEDIC| Researcher| Maths #AI4HEALTH

Joined March 2021

495 Following

237 Followers

4K Posts

Pinned Tweet

Robert Gomez AI @robertgomezai

over 1 year ago

Specialized medicine must reach every corner of the planet. The greatest challenge facing global health systems has an answer: technology. 🌍🏥💻 The world's best technology must be put to work on the world's most important problems.

700

robertgomezai retweeted

Pramod Goyal

@goyal__pramod

1 day ago

Software is evolving, so should you! These are the best blogs I read to understand GPUs and CUDA!

594

802

19K

robertgomezai retweeted

Jokker

@0xJokker

1 day ago

A DEVELOPER JUST DID WHAT GOOGLE HAS BEEN IGNORING FOR YEARS

They built a browser in Rust designed specifically for task automation, web scraping, and AI agents

> Uses 30MB of RAM
> Pages load in 85ms
> Blocks 3,500+ trackers automatically
> Avoids ads, analytics, and tracking scripts

It's called Obscura

And it has something Chrome will never have

Each session generates a different fingerprint. GPU, canvas, audio, battery... all randomized

Detectors can't catch it because it behaves exactly like real Chrome

It's a drop-in replacement for Puppeteer and Playwright

No Node.js. No dependencies. A single binary

It has 16k+ stars on GitHub. 100% open source. Free

Save it so you don't lose it 👇

254

92K

robertgomezai retweeted

Grigory Sapunov

@che_shr_cat

2 days ago

1/ We have been treating GPU memory all wrong.

What if the GPU didn't need to store your model at all?

MegaTrain enables full-precision training of 100B+ LLMs on a single GPU by turning VRAM into a transient, stateless cache.

The secret? Inverting the memory hierarchy. 🧵 https://t.co/CXJVbW2By3

136

78K

robertgomezai retweeted

rish (in sf)

@rishabh16_

2 days ago

I often go back to one of my favorite diffusion model theory papers coz it closely matches and formally explains my vibe-intuitions for why they either perform really well or epically fail in different settings

When debugging your DDPM or flow code/implementation, the rough advice I was given is to overfit on a single sample, then a minibatch, then a few minibatches, and eventually the whole dataset. That's great but the stuff in the middle with minibatches has always been so difficult to _actually_ do, but somehow magically starts working again when I move to full dataset training

The paper says diffusion models factorise local, independent (sub)structures in the data, composing different local motifs to create the final image/molecule. From a percolation theory POV, there's a phase transition after which the denoiser "snaps" particles in place during sampling. If you pass that threshold for transition, the denoiser follows the underlying score a lot better and if you don't, your model can't factorise and compose properly and you get garbage out during sampling

> Overfitting on a single sample is easy. The sample _IS_ the mode and the model just burns that sample into the weights. There's no need for factorisation or real composition at all coz every sampling step obviously moves you closer to the same target without any conflicting gradient signals

> When overfitting on a minibatch or a few minibatches, the loss is super erratic and the denoiser never reproduces all the samples in the minibatch as accurately as it did with the single sample. Coz of so much variation even in the minibatch, there's super weak factorisation coz we don't yet cross the percolation threshold needed to achieve decent sample coherence. I think this is also a major reason for poor performance aside from the simpler "noisy SGD gradient" explanation

> Training on the full dataset works again properly and I see some good novelty from generated samples coz that's what I want to optimise for (eg: new proteins/binders, etc). At this full dataset scale, we cross a percolation threshold that's associated with strong factorisation. The score also is a lot smoother

When training diffusion models from scratch now, I've mostly been skipping the minibatch stuff in the middle and just moving to full dataset after overfitting on a single thing, and it's been OK

185

213

15K

robertgomezai retweeted

Stat.ML Papers @StatMLPapers

2 days ago

Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models https://t.co/bwHQO0jxbV

267

12K

robertgomezai retweeted

Frank Nielsen @FrnkNlsn

2 days ago

IMHO, one of the greatest theorems of Bregman Manifolds that has been often used for decomposing Kullback-Leibler divergence in applications:

Dual Foliations from mixed primal/dual coordinates.

Professor Amari's 2016 textbook https://t.co/cjX6LAfCLK

288

237

12K

robertgomezai retweeted

Avi Chawla

@_avichawla

4 days ago

A tricky LLM interview question:

You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and the GPU still runs out of memory. Why?

(answer below)

Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out of memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet.

The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each of which holds the KV for about 16 tokens. A block returns to the allocator only when every slot inside it is empty.

The eviction logic selects tokens by importance, and the surviving tokens are scattered across blocks, so despite eviction, most blocks are left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most blocks will still hold at least one token. This means the allocator frees only a small fraction of the blocks.

Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids.

So the cache is logically 90% smaller and still physically almost the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters.

For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.

On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here: https://t.co/ZwXv7VezVu

I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.
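The memory arithmetic and the block-freeing argument in the thread above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model shape (64 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an assumed configuration for Qwen3-32B, eviction is simulated as uniformly random rather than importance-based, and the helper names are hypothetical, not part of vLLM or any NVIDIA code.

```python
import random

BLOCK_SIZE = 16  # tokens per physical KV block, as described in the thread


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token adds: keys plus values in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def freed_blocks(num_tokens, keep, seed=0):
    """Keep `keep` tokens chosen uniformly at random (an assumption; real
    policies pick by importance) and count blocks left completely empty."""
    rng = random.Random(seed)
    survivors = rng.sample(range(num_tokens), keep)
    occupied = {pos // BLOCK_SIZE for pos in survivors}
    total = -(-num_tokens // BLOCK_SIZE)  # ceiling division
    return total - len(occupied), total


def blocks_after_compaction(keep):
    """Blocks still needed once survivors are packed contiguously in order."""
    return -(-keep // BLOCK_SIZE)


if __name__ == "__main__":
    # Assumed Qwen3-32B-like shape; adjust to the real model config.
    per_token = kv_bytes_per_token(layers=64, kv_heads=8, head_dim=128)
    print(f"KV per token: {per_token / 1024:.0f} KiB")
    print(f"KV at 24K tokens: {24_000 * per_token / 2**30:.1f} GiB")

    freed, total = freed_blocks(num_tokens=16_000, keep=2_000)
    print(f"scattered eviction frees {freed}/{total} blocks")
    print(f"compaction leaves {blocks_after_compaction(2_000)} blocks in use")
```

Under these assumptions each token costs 256 KiB of cache, so 24K tokens take just under 6 GiB on top of roughly 16 GB of 4-bit weights. With uniform eviction, a block empties only when all 16 of its tokens are evicted, which happens with probability of about 0.875^16 ≈ 0.12, so evicting 87.5% of the tokens returns only about one block in eight. Packing the 2,000 survivors contiguously would need just 125 blocks, which is the gap a compaction pass closes.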

307

259K

robertgomezai retweeted

Samuel 🇨🇴 @ExIncognitoP

3 days ago

What Cristiano Ronaldo sees every time he has the ball vs. Colombia

26K

386

351K

robertgomezai retweeted

Stat.ML Papers @StatMLPapers

5 days ago

The Geometry of Updates: Fisher Alignment at Vocabulary Scale https://t.co/it8in06FfG

robertgomezai retweeted

Rossst.03

@Rossst_03

5 days ago

Stephen Boyd, Stanford EE professor: "Wall Street pays quant researchers $400K–$750K a year to estimate a hidden state from noisy data. It's the same filter that landed Apollo on the moon, and it's the final lecture of a free Stanford course."

this free stanford lecture from 17 years ago holds the entire "kalman filter edge" the 2026 quant threads sell you.

in his last ee263 lecture, boyd builds state estimation from scratch: you can't see the true state, only noisy measurements, so you blend what your model predicts with what you just observed, weighted by how much you trust each. that blend is the kalman gain. that's the whole thing.

it's the exact filter the thread codes up for a dynamic hedge ratio. swap "spacecraft position" for "the true relationship between two assets" and the equations are identical.

rudolf kalman published it in 1960. boyd has taught it free since the 2000s. so the math was never the moat. the hedge ratio that updates every tick, the uncertainty that calibrates your z-score, all of it is standard state estimation, public for over sixty years.

and here's the honest part the thread is actually good about. the filter is only optimal if your assumptions hold: linear dynamics, gaussian noise, a Q and R you set correctly. get the noise model wrong and it tracks confidently in the wrong direction.

the lecture is free. the judgment in how fast you let the hedge ratio drift, that ratio of process to measurement noise tuned to the pair in front of you, is the part that actually takes skill.
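The predict-and-blend step described above can be written as a scalar Kalman filter. The sketch below is an illustration rather than the code from the lecture or from any quant thread: it assumes a random-walk model for the hedge ratio, `beta_t = beta_{t-1} + w_t`, observed through `y_t = beta_t * x_t + v_t`, and the function name, the default noise variances `q` and `r`, and the synthetic data are all hypothetical choices.

```python
import random


def kalman_hedge_ratio(xs, ys, q=1e-4, r=1.0, beta0=0.0, p0=1.0):
    """Track a time-varying hedge ratio beta in y_t = beta_t * x_t + noise.

    q is the process-noise variance (how fast beta is allowed to drift) and
    r is the measurement-noise variance; both are tuning assumptions.
    """
    beta, p = beta0, p0
    estimates = []
    for x, y in zip(xs, ys):
        # Predict: under a random walk the estimate stays put and the
        # uncertainty grows by q.
        p = p + q
        # Update: blend the prediction with the new observation.
        innovation = y - beta * x   # what the model failed to predict
        s = x * p * x + r           # innovation variance
        k = p * x / s               # Kalman gain: trust in the observation
        beta = beta + k * innovation
        p = (1.0 - k * x) * p
        estimates.append(beta)
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    true_beta = 1.5  # synthetic ground truth, for illustration only
    xs = [100.0 + rng.gauss(0.0, 5.0) for _ in range(500)]
    ys = [true_beta * x + rng.gauss(0.0, 1.0) for x in xs]
    est = kalman_hedge_ratio(xs, ys)
    print(f"final hedge ratio estimate: {est[-1]:.3f}")
```

The ratio `q / r` is the knob the post calls judgment: a larger `q` lets the estimate chase every tick, while a smaller one makes it slow to follow a genuine change in the relationship.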

375

750

65K

robertgomezai retweeted

Tom Dörr

@tom_doerr

6 days ago

Ultra-simplified explanations of common design patterns https://t.co/ChxVIc0HB7

418

451

17K

robertgomezai retweeted

Naveen Rao

@NaveenGRao

5 days ago

🧠 Today we introduce Un-0 from @unconvAI: the first large-scale generative model built on physics as a compute primitive. This represents a “hello world” moment for physics-based models. We use the inherent time-varying behavior of physical systems to do compute for us. The result is a new way to build a computer that can be VASTLY more power efficient. 🧵 https://t.co/zYU0ezXJUq

135

305

745

11M

robertgomezai retweeted

Tom Dörr

@tom_doerr

5 days ago

Curated GNN papers, datasets, and implementation tools https://t.co/T26AHgbOO2

108

robertgomezai retweeted

Mackenzie Weygandt Mathis, PhD @TrackingActions

5 days ago

🧠 Can we recover the equations governing a system directly from noisy, high-dimensional observations? #DYSCO, by Paolo Muratore, is a first step: it learns latent spaces that capture the underlying dynamics & enables recovery of governing equations 📄https://t.co/gFCj7JCxMX

204

968

91K

robertgomezai retweeted

Tom Dörr

@tom_doerr

6 days ago

15TB of physics simulation datasets for machine learning https://t.co/9Gnbrr7LnR

237

217

12K

robertgomezai retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

6 days ago

Improved Large Language Diffusion Models

"We introduce iLLaDA (improved LLaDA), an 8B fully bidirectional masked diffusion language model trained from scratch. For pre-training, iLLaDA scales the corpus to 12T tokens, uses grouped-query attention to reduce cache-style inference memory and tied input/output embeddings to reduce parameter count, and modifies the learning-rate schedule for large-scale training. For post-training, iLLaDA modifies the SFT strategy for variable-length generation and trains on a 25B-token instruction corpus for 12 epochs. For inference and evaluation, iLLaDA uses variable-length generation for efficiency and confidence-based scoring for multiple-choice benchmarks"

"Against Qwen2.5 7B, iLLaDA-Base is slightly stronger on average, while iLLaDA-Instruct still lags behind Qwen2.5 7B Instruct."

157

robertgomezai retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

6 days ago

Autodata: An agentic data scientist to create high quality synthetic data

"We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data."

Data creation stage + data analysis stage + meta-optimization https://t.co/yO3noYHnV7

810

132

838

44K

robertgomezai retweeted

Mathematica

@mathemetica

7 days ago

Neural networks for dynamical systems map high-dimensional fields to low-dimensional manifold latent spaces, where dynamics evolve before the decoder reconstructs the predicted field.

The encoder, consisting of convolutional layers and multilayer perceptrons, compresses the initial field into latent coordinates w and z on the manifold. The upper part visualizes the encode-project-decode process with example point cloud manifolds.

It is used to accelerate long-term predictions of complex fields in scientific computing applications like fluid dynamics simulations.

625

118

395

23K

robertgomezai retweeted

Grigory Sapunov

@che_shr_cat

6 days ago

The Platonic Representation Hypothesis is mostly a statistical illusion.

New research shows that the apparent "global convergence" of scaled AI models is actually a mathematical artifact of model width and depth selection bias.

Once calibrated, global convergence vanishes. 🧵 https://t.co/vy9EtO7zQp

546

530

94K

robertgomezai retweeted

Deep-ML

@real_deep_ml

7 days ago

We just launched a new premium feature to help users prepare for ML interviews from start to finish.

It helps you:

- Build projects to put on your resume
Create strong ML projects that are tailored for the company and the role.

- Prepare round by round
Understand what is usually asked in each stage of the ML interview process.

- Practice with mock assessments
Get realistic mock interview practice and feedback to improve before the real thing.

Currently we only support Google DeepMind, but we are working on adding more roles for different companies.

535

882

114K
