like, is gemini flash or whatever not system-prompted with "you are in a search-bar. the user has searched for <this term>. you are google search bar."
Waiting for the day I type half the name of a paper in the search bar and my agent goes "boss, your PhD friend sent you this on WhatsApp a few weeks ago already."
If you have, say, a blackbox O(context^2) op that underneath autoregressively evicts the current cache and reloads it with the post-write state (like a log), wouldn’t that be approximately indistinguishable from the write you mentioned?
Assuming the KV cache is expressive, etc.
i.e., somewhere in the data/chip/algo trilemma, one gives up the “quick write” for an opportunity to scale on the data?
I used to think that the transformer isn't an NTM because it can't write to the KV cache, only read. But can't it technically generate its context-length worth of tokens to mutate the KV cache the way it wants? Isn't that just an expensive O(C) write?
good sir @recurseparadox i remember you posted something about NTMs, i'm sure you have the answers i need, I'm not sure if I should be convinced by GPT's answer.
@weeyev There's some good work on trainable log-linear attention for DiTs at CVPR this time, i reckon that could be useful here too!
https://t.co/dZS46mx5aW
Proud to share our lab’s @MMLabNTU work Log-linear Sparse Attention (LLSA) - a trainable sparse attention mechanism that reduces attention complexity from O(N²) to O(N log N), making diffusion transformers much more efficient.
Also, special shout-out to the first author @zhouyifan1107 for presenting the poster in full costume - truly above and beyond. The level of dedication is impressive! 👏
#CVPR2026 #DiffusionModels #EfficientAI #SparseAttention
RandOpt is a great, cheap, fast bootstrapping method for raising your pass@K before you start your GRPO.
Here's a 1.5B model gaining 7.3% on the hardest parts of the MATH datasets on an NVIDIA L4(!!) in just 40 minutes.
What if we use an L40 instead? By injecting structured low-rank noise so the noise-swapping process is amortized, RandOpt directly scales with inference speed.
Here's a 16% raise in Countdown in just 200 seconds!
@_TarunKathuria I guess, the point was more that: most NTM/universal transformer-adjacent ideas employ hard-to-scale ideas like recursion to allow a write operation; but yesssss, i think most statements don't take into account the iterative nature of how most of these systems are deployed.
hmmmmmmmmmm i'd be inclined to think if you have bounded context length C (say, 10) and you have "abcde" in memory, then an "update 3 : c" rule (which would turn it into "abcce") can be simulated by padding the context out to "abcde00000" and then in-context copying with the edit applied, so "abcde" gets pushed out and the window ends up as "00000abcce", which to a coarse blackbox observer should be indistinguishable from a write.
of course you'd have to keep the command itself in memory, etc, etc.
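A rough sketch of that "abcde" example, just to make the mechanics concrete. Everything here is a toy assumption: the context is a plain sliding window of characters, `simulate_write` is a made-up helper name, the index is zero-based, and the "model" is hard-coded to copy the token that is about to fall out of the window (swapping in the new symbol at the target index). It does not keep the command itself in context.

```python
from collections import deque


def simulate_write(memory: str, index: int, symbol: str,
                   context_len: int, pad: str = "0"):
    """Emulate `update index : symbol` using only appends to a sliding window.

    Toy assumption: the window evicts its oldest token on every append once
    it is full, like a bounded-context cache. Requires len(memory) <= context_len.
    """
    window = deque(memory, maxlen=context_len)
    snapshots = ["".join(window)]
    emitted = 0

    # Step 1: emit padding until the window is full, so every later
    # append evicts exactly one old token from the left.
    while len(window) < context_len:
        window.append(pad)
        emitted += 1
    snapshots.append("".join(window))

    # Step 2: in-context copy. The old token i is always at the left edge
    # right before it is evicted, so read it there and re-emit it,
    # substituting the new symbol at the target index.
    for i in range(len(memory)):
        old_token = window[0]
        window.append(symbol if i == index else old_token)
        emitted += 1
    snapshots.append("".join(window))

    return snapshots, emitted


snapshots, emitted = simulate_write("abcde", index=3, symbol="c", context_len=10)
print(snapshots)  # window before, after padding, after the copy
print(emitted)    # tokens generated to perform one "write"
```

With C = 10 this takes exactly C appended tokens per write, which is the expensive O(C) write in token count. If each of those tokens also attends over the full window, the work is on the order of C^2 attention reads, which lines up with the blackbox O(context^2) op mentioned earlier.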